Copyright 2021 Matthew Daniel Multach

THE ROBUSTIFICATION OF THE LASSO AND THE ELASTIC NET: UTILITY IN PRACTICAL RESEARCH SETTINGS

by Matthew Daniel Multach

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (PSYCHOLOGY), August 2021

Acknowledgements

I would first like to thank my doctoral advisor, Dr. Rand R. Wilcox, whose research and guidance inspired me before I had even started my doctoral studies, and whose philosophy underlies my orientation towards statistics. I'd like to thank Dr. T.J. McCarthy, who exposed me to modeling techniques I otherwise would never have known. Dr. Jonas Kaplan, fellow traveler and science-fiction fan, for his continued support and enthusiasm over the past three years. Dr. Christopher R. Beam, for pushing critical revisions throughout this process and for providing insightful questions for my qualifying exams. Each member of my committee provided invaluable support to a young statistician and budding data scientist. In their own way, each has drastically expanded the size of my statistical toolbox, and the tools within it.

I would also like to give thanks to the researchers who so generously supported my project by providing code from their own simulations and answering all my questions about their work: Dr. Yoonsuh Jung, Dr. Ezequiel Smucler, and Dr. Qi Zheng.

I must express gratitude to my parents for their unending love and support throughout my years of study. This would not have been possible without them, and their achievements and encouragement have served as incredible inspiration to me at the toughest of times.

And last, but not least, I'd like to give special thanks and appreciation to Dr. Morteza Dehghani. Morteza has instilled an attention to detail in me since my very first course at USC, where he refused to accept any errors or warnings in my code, even if the result was functional. Among the long list of reasons I have to thank Morteza, he stepped up when it was needed most, and provided the oversight I needed over these last, most intensive months of my dissertation process. This would not have been possible without your support and respect, and I will never be able to fully express my gratitude to you.
TABLE OF CONTENTS

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
   Robust Statistical Inference
   Variable Selection as a Robustness Problem
   Overview of the Current Course of Study
Chapter 2: The Lasso, the Elastic Net, and their Adaptations
   The Lasso and the Elastic Net
   The Lasso, the Elastic Net, Outliers, and Normality
   Clarifications on Some Mathematical Representations
   Adaptive Tuning Hyperparameter
   Least Absolute Deviation (LAD) Loss Function
   Huber Loss Function
   Loss Functions Based on S-Estimators, M-Estimators, and MM-Estimators
   Outlier Shifting
   General Summary of Findings
Chapter 3: Simulations: Outlier Robustness
   Review of Relevant Findings
   Methods and Design
   Results
   Discussion
Chapter 4: Simulations: Distributional Robustness
   Review of Relevant Findings
   Methods and Design
   Results
   Discussion
Chapter 5: Simulations: Boundaries of Dimensionality
   Review of Relevant Findings
   Methods and Design
   Results
   Discussion
Chapter 6: Real-World Data: Entry Status of Partial Hospital Patients
   Introduction
   Participants
   Measures
   Analyses
   Results
   Follow-Up Analyses
   Discussion
Chapter 7: Discussion and Future Directions
   General Discussion
   Recommendations
   Limitations and Future Directions
   Conclusions
References
Appendices
   Appendix A: Bonus Mathematical Formulations
   Appendix B: R Code
   Appendix C: Applied Data Measures

List of Tables

1. Properties of the g-and-h distribution
2. Null Models: Outliers, Low Dimensionality, p = 8, 1/2
3. Null Models: Outliers, Low Dimensionality, p = 8, 2/2
4. Null Models: Outliers, Low Dimensionality, p = 30, 1/4
5. Null Models: Outliers, Low Dimensionality, p = 30, 2/4
6. Null Models: Outliers, Low Dimensionality, p = 30, 3/4
7. Null Models: Outliers, Low Dimensionality, p = 30, 4/4
8. Null Models: Outliers, High Dimensionality
9. Null Models: Non-Normality, Low Dimensionality, p = 8
10. Null Models: Non-Normality, Low Dimensionality, p = 30, 1/2
11. Null Models: Non-Normality, Low Dimensionality, p = 30, 2/2
12. Null Models: Non-Normality, High Dimensionality
13. Null Models: Boundaries of Dimensionality
14. Demographic Characteristics of PHP SCID Sample, N = 619
15. Primary DSM-IV Diagnoses Upon Initial PHP SCID Evaluation
16. All Current DSM-IV Diagnoses Upon Initial PHP SCID Evaluation
17. 15 Most-Frequently Selected PHP Coefficients: Standard Elastic Net
18. 15 Most-Frequently Selected PHP Coefficients: Standard Lasso
19. 15 Most-Frequently Selected PHP Coefficients: Adaptive Elastic Net
20. 15 Most-Frequently Selected PHP Coefficients: Adaptive Lasso
21. 15 Most-Frequently Selected PHP Coefficients: Multi-Step Adaptive Elastic Net
22. 15 Most-Frequently Selected PHP Coefficients: Adaptive Huber Elastic Net
23. 15 Most-Frequently Selected PHP Coefficients: Adaptive Huber Lasso
24. 15 Most-Frequently Selected PHP Coefficients: Adaptive LAD Elastic Net
25. 15 Most-Frequently Selected PHP Coefficients: Adaptive LAD Lasso
26. 15 Most-Frequently Selected PHP Coefficients: Outlier-Shifted Lasso
27. 15 Most-Frequently Selected PHP Coefficients: Outlier-Shifted Huber Lasso
28. Variable Selection Percentages of Frequently-Selected and Perturbed Variables, All Models
29. Variable Selection Percentages of Frequently-Selected and Perturbed Variables, Standard Lasso
30. Variable Selection Percentages of Frequently-Selected and Perturbed Variables, Standard Elastic Net
31. Variable Selection Percentages of Frequently-Selected and Perturbed Variables, Adaptive Lasso
32. Variable Selection Percentages of Frequently-Selected and Perturbed Variables, Adaptive Elastic Net
33. Variable Selection Percentages of Frequently-Selected and Perturbed Variables, Multi-Step Adaptive Elastic Net
34. Variable Selection Percentages of Frequently-Selected and Perturbed Variables, Adaptive Huber Lasso
35. Variable Selection Percentages of Frequently-Selected and Perturbed Variables, Adaptive Huber Elastic Net
36. Variable Selection Percentages of Frequently-Selected and Perturbed Variables, Adaptive LAD Lasso
37. Variable Selection Percentages of Frequently-Selected and Perturbed Variables, Adaptive LAD Elastic Net
38. Variable Selection Percentages of Frequently-Selected and Perturbed Variables, Outlier-Shifted Lasso
39. Variable Selection Percentages of Frequently-Selected and Perturbed Variables, Outlier-Shifted Huber Lasso

List of Figures

1.1 Actual Type I Error Rate of Student's t-test When Constructing a 95% CI
1.2 Nominal vs. Actual Power of Student's t-test to Detect an Effect of 1 at α = 0.05
2.1 Constraint Surfaces of ℓ1, ℓ2, and Elastic Net Regularization (Zou and Hastie, 2005)
3.1 FPR: Outliers, Low Dimensionality, p = 8
3.2 FNR: Outliers, Low Dimensionality, p = 8
3.3 RMSE: Outliers, Low Dimensionality, p = 8
3.4 Precision: Outliers, Low Dimensionality, p = 8
3.5 FPR: Outliers, Low Dimensionality, p = 30
3.6 FNR: Outliers, Low Dimensionality, p = 30
3.7 RMSE: Outliers, Low Dimensionality, p = 30
3.8 Precision: Outliers, Low Dimensionality, p = 30
3.9 FPR: Outliers, High Dimensionality
3.10 FNR: Outliers, High Dimensionality
3.11 RMSE: Outliers, High Dimensionality
3.12 Precision: Outliers, Low Dimensionality, p = 30
3.13 Different Types of Outliers
4.1 FPR: Non-Normality, Low Dimensionality, p = 8
4.2 FNR: Non-Normality, Low Dimensionality, p = 8
4.3 RMSE: Non-Normality, Low Dimensionality, p = 8
4.4 Precision: Non-Normality, Low Dimensionality, p = 8
4.5 FPR: Non-Normality, Low Dimensionality, p = 30
4.6 FNR: Non-Normality, Low Dimensionality, p = 30
4.7 RMSE: Non-Normality, Low Dimensionality, p = 30
4.8 Precision: Non-Normality, Low Dimensionality, p = 30
4.9 FPR: Outliers, High Dimensionality
4.10 FNR: Outliers, High Dimensionality
4.11 RMSE: Outliers, High Dimensionality
4.12 Precision: Non-Normality, High Dimensionality
5.1 FPR, FNR, RMSE, and Precision: Boundaries of Dimensionality, n = 200
6.1 Test-Set RMSE, Run-Time (seconds), and Number of Non-Zero Coefficients for Day 0 PHP Patients
6.2 Correlations of Top-15 Predictors with Age and CUXOS Items
6.3 Correlations of Top-15 Predictors with CUDOS and FFMQ-SF Items
6.4 Correlation Changes Resulting from Perturbation 1
6.5 Correlation Changes Resulting from Perturbation 2a
6.6 Correlation Changes Resulting from Perturbation 2b
6.7 Correlation Changes Resulting from Perturbation 3
6.8 Correlation Changes Resulting from Perturbation 4
7.1 Decision Tree, Low Dimensionality; **See Recommendation in Section 7.2.1.1, 3rd Paragraph
7.2 Decision Tree, High Dimensionality; **See Recommendation in Section 7.2.1.2, 2nd Paragraph
7.3 Structure of Combined Model, Plus Research Questions for Further Study (Orange)

Abstract

Statistical tools for variable selection provide the most utility the closer they come to properly selecting the true variables underlying a data-generating mechanism. Equally important is the selection of only these true predictors. Applying the logic of robust hypothesis testing to variable selection, an ideal tool provides reliable selection properties across a variety of underlying data conditions. The proliferation of adaptations to two popular machine learning methods for dimension reduction, the lasso and the elastic net, provides a basis from which to inform tool choice for an applied researcher interested in variable selection methods. The current research, therefore, attempts to lay a foundation for studying many of these adaptations' robust selection properties while making recommendations for practical application with real data.
Through numerous simulation studies and an applied example, I demonstrate the distinct limitations of some lasso and elastic net adaptations. On the other hand, the adaptive Huber lasso, the adaptive Huber elastic net, the adaptive Least Absolute Deviation (LAD) lasso, and the newly-studied adaptive LAD elastic net show relatively greater reliability and consistency across simulation conditions, including in response to heavy outlier contamination and heavy-tailed error distributions, although some selection instability is noted under perturbations made to the intercorrelations among real-data predictors.

Chapter 1

Introduction

The importance of proper variable selection, and thereby effective model selection, is two-fold for the understanding of a data-generating process. Qualitatively, selecting the correct predictors promotes proper understanding of the process and contributes to its future study. Selection of characteristics that do not in fact underlie a process does worse than impede understanding of that process: it takes researchers further from the actual mechanism involved. Statistically speaking, selecting variables that do not underlie a given process contributes to inflated coefficient instability, reduces prediction accuracy, and generally impedes efficient model performance. Meanwhile, failure to include a true variable produces a model biased in both prediction capability and coefficient estimates. Given these concerns, and given that applied researchers can never know the actual mechanism underlying observed data, the necessity of modeling techniques that properly select the variables contributing to a data-generating mechanism cannot be overstated.

The presence of non-normality (Wilcox, 1990; Hill and Dixon, 1982; Micceri, 1989) and outliers (Rousseeuw and Leroy, 1987) in applied data contexts has been demonstrated for many years, while heavy-tailed errors made themselves known as early as the 19th century (Bessel, 1818; Newcomb, 1886; Rousseeuw and Leroy, 1987). Each of these characteristics can seriously impact inferences made by statistical models. Resulting issues with efficiency, power, bias, and analogous concerns outside of direct hypothesis testing therefore present further obstacles to proper variable selection when modeling a process. This dissertation will thus consider variable selection, and its relevance to the applied data setting, in the context of robust statistical methodology.

1.1 Robust Statistical Inference

Robust statistics concerns itself with statistical measures and tools that provide stability despite changes in observed data. Distributional form, accompanying issues such as tailed-ness (i.e., when values are more or less dispersed over the distributional space relative to the central tendency) and outlier contamination, and group variances are of particular concern, given that relevant assumptions underpin well-known and oft-taught methods (e.g., Student's t-test, Ordinary Least Squares regression). We consider here Student's t-test as a straightforward example of some of these issues:

T = \frac{\sqrt{n}\,(\bar{X} - \mu)}{s},    (1.1)

where n is the sample size, \bar{X} is the sample mean, s is the sample standard deviation, and \mu the true population mean. The typical application of the T-statistic for hypothesis testing on the population mean \mu relies on the fundamental assumption that the test statistic follows a Student's t distribution with degrees of freedom \nu = n - 1. Researchers can reasonably rely on this assumption under normality.
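As a quick illustration of Equation (1.1), the following R snippet (not part of the dissertation's Appendix B code; the seed and sample size are arbitrary choices) computes T by hand for one simulated sample and checks it against R's built-in t.test():

set.seed(1)
x <- rnorm(25)                                 # one sample of n = 25 from the standard normal
n <- length(x)
mu0 <- 0                                       # null-hypothesized population mean
T_stat <- sqrt(n) * (mean(x) - mu0) / sd(x)    # Equation (1.1)
T_stat
t.test(x, mu = mu0)$statistic                  # matches the hand-computed value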
Unfortunately, even a slight departure of the observation-generating distribution from normality can have meaningful consequences for the results of hypothesis tests. For example, when dealing with symmetric heavy-tailed distributions (where values distant from the majority of the data are more likely to be observed), confidence intervals will be larger than the nominal level (Benjamini, 1983). Thus, abstractly, the test will be less likely to reject the null hypothesis. Statistically speaking, this results in reduced power and Type I error rates relative to their respective nominal levels due to wider confidence intervals than anticipated. These problems are consistent across symmetric, heavy-tailed distributions, and low power can hold under such circumstances even with large n (Basu and DasGupta, 1995). Skewed distributions, even when light-tailed, present the opposite issue, and outlier-heavy distributions with heavier tails exacerbate this problem. As a result, the confidence intervals of a T-statistic taken from skewed data can be much smaller than the nominal level. Therefore, the test is more likely to reject the null hypothesis than expected, resulting in increased power and Type I error rates relative to the nominal level. Depending on how extreme a departure the skewed distribution is from normality, the sample size n necessary for proper inference at the nominal level can be quite large, with sample sizes of 200, 300, or larger being required (Westfall and Young, 1993).

1.1.1 Two Useful Distributional Tools

Two useful distributional tools merit description before considering a practical example. The g-and-h distribution is a generalization of the normal distribution that takes the following form, given that Z is normally distributed with mean \mu = 0 and standard deviation \sigma = 1 (i.e., Z follows the standard normal distribution):

W = \begin{cases} \dfrac{\exp(gZ) - 1}{g}\,\exp(hZ^2/2), & g > 0 \\ Z\,\exp(hZ^2/2), & g = 0, \end{cases}    (1.2)

with greater asymmetry for increasing g, heavier tails for increasing h, and the standard normal distribution as a special case when g = h = 0. Table 1, taken from Wilcox et al. (2013), gives some distributional characteristics for certain values of g and h.

g      h      skew    kurtosis
0.0    0.0    0.00    3.0
0.2    0.0    0.61    3.68
0.0    0.2    0.00    21.46
0.2    0.2    2.81    155.98

Table 1: Properties of the g-and-h distribution

One potential criticism of the g-and-h distribution for studying departures from normality is that these values do not represent a sufficient deviation, given conditions observed in real data of skew over 15 and kurtosis over 250 (Pedersen et al., 2002). Wilcox (2016) also notes real data with observed skew and kurtosis of up to 115.5 and 13,357, respectively; however, the author could not locate the citation for this data. Insufficient non-normality using these values is a valid consideration and one worth keeping in mind as a potential limitation of the current course of study.

The second distributional tool used in this and subsequent simulations herein is the contaminated normal or mixed normal distribution. Mixed normal distributions arise from two distinct normally distributed subpopulations. Consider a random variable X such that:

X \sim (1 - \eta_x)\,N(\mu_1, \sigma_1^2) + \eta_x\,N(\mu_2, \sigma_2^2),    (1.3)

where \sim denotes "has the following probability distribution," \mu_1 and \sigma_1 represent the parameters for the first underlying subpopulation, \mu_2 and \sigma_2 represent the parameters for the second underlying subpopulation, and \eta_x is a parameter expressing the proportion of each population that contributes to the generation of the overall population. A typical application of the mixed normal distribution sees the majority population distributed by a standard normal distribution with \mu_1 = 0 and \sigma_1 = 1. In contrast, the minority population is generated with a larger population mean, a larger population standard deviation, or both.
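To make the two tools concrete, the sketch below shows one way draws from Equations (1.2) and (1.3) could be generated in R; the helper-function names and default values are illustrative assumptions rather than the dissertation's Appendix B code.

g_and_h <- function(n, g = 0, h = 0) {
  Z <- rnorm(n)                                    # standard normal input
  if (g > 0) {
    (exp(g * Z) - 1) / g * exp(h * Z^2 / 2)        # Equation (1.2), g > 0 branch
  } else {
    Z * exp(h * Z^2 / 2)                           # g = 0 branch
  }
}

mixed_normal <- function(n, eta = 0.10, mu2 = 0, sd2 = 15) {
  # Equation (1.3): (1 - eta) of the data from N(0, 1), eta from N(mu2, sd2^2)
  contaminated <- rbinom(n, 1, eta) == 1
  ifelse(contaminated, rnorm(n, mu2, sd2), rnorm(n, 0, 1))
}

x1 <- g_and_h(10000, g = 0.2, h = 0.2)   # skewed and heavy-tailed
x2 <- mixed_normal(10000)                # heavy-tailed contaminated normal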
Subsequent simulations use these two distributions to generate data with the desired characteristics, in addition to the practical example of robust concerns below.

1.1.2 Practical Data Example

An example with concrete data can help to illustrate some of the concerns underlying the field of robust statistics (the code for this example can be found in Appendix B.2). Let's compare the nominal and actual power and Type I error rates for a few normal-adjacent population distributions with characteristics of interest. Suppose populations with N = 10,000 are generated from each of the following:

• Normal distribution: the standard normal distribution, with population mean \mu = 0 and population standard deviation \sigma = 1.
• Heavy-tailed normal distribution: 90% of the data generated from the standard normal distribution, and 10% of the data generated from a distribution with population mean \mu = 0 and population standard deviation \sigma = 15.
• Outlier-contaminated normal distribution: 90% of the data generated from the standard normal distribution, and 10% of the data generated from a distribution with population mean \mu = 10 and population standard deviation \sigma = 0.
• g-and-h distribution(0,0): a g-and-h distribution with g = h = 0, which is equivalent to the standard normal distribution.
• g-and-h distribution(0.2,0): a g-and-h distribution with g = 0.2 and h = 0. This is a skewed distribution.
• g-and-h distribution(0,0.2): a g-and-h distribution with g = 0 and h = 0.2. This is a heavy-tailed distribution.
• g-and-h distribution(0.2,0.2): a g-and-h distribution with g = h = 0.2. This distribution is both skewed and heavy-tailed.

Consider populations generated from the seven distributions listed above. After creating the simulated populations, 5000 random samples of size 10 ≤ n ≤ 500 are taken from each population, increasing n by increments of 10. Finally, the sample t-statistic is calculated for each of these random samples at a significance level of 0.05, testing against the true population mean of each distribution. The actual Type I error rate (Figure 1.1) is the proportion of rejected null hypotheses, as the null-hypothesized value is the true population mean in each case.

[Figure 1.1: Actual Type I Error Rate of Student's t-test When Constructing a 95% CI]

For the most part, the actual Type I error rates are relatively close to the nominal level of \alpha = 0.05, even with small samples. However, two distributions do not stabilize around the nominal level until much larger sample sizes. The actual Type I error rate for the outlier-contaminated population is much larger than the nominal level until n > 100, with actual error rates greater than 0.10 at n = 50. On the other hand, the actual error rate for the heavy-tailed normal distribution is far less than the nominal level with sample sizes less than 50.
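A condensed sketch of this simulation follows, reusing the helper functions from the previous sketch; the dissertation's own code is in Appendix B.2, and the reduced number of replications here (500 rather than 5000) and the subset of distributions shown are purely to keep the illustration fast.

set.seed(42)
pop <- list(
  normal     = rnorm(10000),
  heavy_tail = mixed_normal(10000, eta = 0.10, mu2 = 0,  sd2 = 15),
  outlier    = mixed_normal(10000, eta = 0.10, mu2 = 10, sd2 = 0),
  gh_skew_ht = g_and_h(10000, g = 0.2, h = 0.2)
)
sample_sizes <- seq(10, 500, by = 10)
type1 <- sapply(pop, function(population) {
  mu_true <- mean(population)                  # null value = true population mean
  sapply(sample_sizes, function(n) {
    rejections <- replicate(500, {
      s <- sample(population, n)
      t.test(s, mu = mu_true)$p.value < 0.05
    })
    mean(rejections)                           # actual Type I error rate at this n
  })
})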
To determine the relative performance of each method concerning power, the author first calculated the sample t-statistic at a significance level of 0.05 given a null-hypothesized value of 1 plus the true population mean of each population distribution. The nominal power at each sample size is the power of the t-statistic to detect a true difference of 1 at that sample size and \alpha = 0.05. The actual power is the proportion of the calculated t-statistics that reject the null hypothesis. Suppose the test statistic's actual probability coverage is close to the nominal level. In that case, the corresponding lines in the nominal and actual power plots should roughly match, while disparate line trends reflect a discrepancy between the nominal and actual power.

[Figure 1.2: Nominal vs. Actual Power of Student's t-test to Detect an Effect of 1 at \alpha = 0.05]

As seen in Figure 1.2, at the smallest of sample sizes, nominal power overestimates the actual power at least moderately for all methods except the two standard normal distributions as well as the g-and-h distribution with g = 0.2 and h = 0 (the skewed, normal-tailed distribution). The g-and-h(0,0.2) (symmetric, heavy-tailed) and g-and-h(0.2,0.2) (skewed and heavy-tailed) distributions do not stabilize around their nominal levels until sample sizes of n > 100, the outlier-contaminated distribution until approximately n = 200, and the heavy-tailed distribution until n > 400.

It is also worth considering these discrepancies more abstractly and thinking about why departures from nominal, pre-specified probability coverage are so problematic. The nature of statistical inference is uncertainty, most significantly surrounding the unobserved and unknown data-generating process. Statistical tools allow inference about the process given that it can never be truly observed. In the hypothesis-testing context, researchers typically design a study to have a particular power and \alpha. These values help to constrain uncertainty regarding the data-generating process within clearer boundaries and constrain the uncertainty around our inferences. There is always uncertainty even when proper inferences are drawn based on these pre-determined values, but that uncertainty has some known quality. Drawing conclusions based on these values, when in truth they are inaccurate, renders the resulting findings not just uncertain; the disparity between the specified uncertainty and the actual uncertainty renders them invalid. Such circumstances provide neither the true uncertainty nor the disparity between this uncertainty and actuality.

Given that no method dominates under every data-generating mechanism, the best choice of method performs both well and consistently, even if that means it does not always perform the best or the most efficiently. Wilcox (2016) provides an illustrative example of this point by comparing some performance and theoretical characteristics of 9 estimators of location, including the sample mean, the sample median, an M-estimator (described in Chapter 2), and the 10% and 20% trimmed means (a trimmed mean corresponds with the mean after removing a fixed percentage of the data with the largest observed values and the same percentage with the smallest observed values). Although the sample mean produced the smallest standard error under normality, this was true only marginally compared to other estimators, and it performed far worse under different conditions. The median and M-estimator, meanwhile, have ideal theoretical properties, including asymptotic resistance to outlier contamination. However, the 20% trimmed mean produced competitive standard errors across distributions and is more consistently accurate in its probability coverage. Consequently, 20% trimmed mean-based statistical methods can often be preferable, even though the median and M-estimators have ideal theoretical characteristics.

1.2 Variable Selection as a Robustness Problem

Consider now the issue of variable selection in the context of robust statistical methodology, framing the selection of each variable as a testable hypothesis of inclusion into (the alternative hypothesis) or exclusion from (the null hypothesis) the model. This hypothesis-testing interpretation of model coefficients is quite literal in specific frameworks. Consider the following model information for an Ordinary Least Squares model. The model below was produced using the Cars93 dataset from the MASS package (Venables and Ripley, 2002) in R. In this model, Engine Size, Horsepower, Car Length, and Car Width predict Highway MPG. Interested readers can find the code used to produce these model results in Appendix B.3. Each coefficient estimate includes an actual hypothesis test of the estimate's corresponding t-statistic against a null-hypothesized value of 0.

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  71.792329  13.298368   5.399 5.63e-07 ***
EngineSize   -0.240055   0.915001  -0.262   0.7937
Horsepower   -0.034452   0.011481  -3.001   0.0035 **
Length       -0.006931   0.050705  -0.137   0.8916
Width        -0.516614   0.245824  -2.102   0.0384 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The above example does not demonstrate variable selection itself, nor is the interpretation of this exact model critical; rather, the results above solidify the hypothesis-testing interpretation of selecting variables into a model. Abstracting from models with explicit hypothesis tests, power in variable selection, then, is the ability of a model to correctly select true predictors of an outcome. On the other hand, the Type I error rate would correspond with the rate at which a model includes true zero variables in a model. In this framework, a variable selection method should select a predictor (or assign a non-zero coefficient value) to true predictors of an outcome. An effective selection tool should also correctly eliminate (or assign a coefficient of 0 to) true zero variables from the model. A robust variable selection procedure should accomplish both tasks relatively well and consistently under various underlying data-generating mechanisms.

1.3 Overview of the Current Course of Study

The logic of robust statistical methods and an orientation of variable selection towards this logic provide a clearer structure to the goals of the current research. First, this research attempts to cohesively evaluate the variable selection properties of two prominent selection tools, the lasso and the elastic net. This research also studies the robustness of their variable selection properties to outlier contamination and non-normal error distributions. Finally, the author makes recommendations for sound use and application of these tools by the applied researcher.
As a consequence of and secondary to these primary goals, this research will also begin to create a more cohesive image from the motley band of studies comprising the literature on robustification of the lasso and elastic net. The following structure is used to accomplish these goals:

Chapter 2 outlines the standard lasso and elastic net procedures and their variable selection properties and reviews the extensive and heterogeneous literature on adaptations proposed to improve their performance.

Chapter 3 reports on a simulation study of the selection performance of several lasso and elastic net adaptations in the context of outlier contamination. Additional simulations are considered with data in higher dimensions.

Chapter 4 reports on a simulation study of the selection performance of these techniques in the context of non-normal error distributions, with additional considerations in higher dimensions.

Chapter 5 reports on the results of a smaller simulation study assessing the boundaries of dimensionality concerns and their impacts on variable selection performance.

Chapter 6 applies the lasso and elastic net models in the current research to an applied dataset: intake characteristics of psychiatric patients on the day of their initial evaluation for an intensive day-treatment program. This data has not yet seen application as conducted in the current program of research; instead, relevant findings in lower-intensity psychiatric settings will contextualize the data and analyses. Follow-up analyses are conducted to study the impacts of altered correlations among potential predictors on variable selection in modeling this data.

Chapter 7 provides a general discussion of the findings across the current research program, highlighting any limitations and potential directions for future research. The chapter ends with a summary of this research and recommendations for practical application of the studied techniques.

Given especially the intent of this research to provide value to the applied researcher with real data, the focus herein will be on model performance with real and finite data. In the words of a wise statistician (whose name I cannot remember, and whose quote I cannot locate), "We don't live in Asymptotia;" theoretical considerations and asymptotic characteristics, while important, are secondary to the current research, and will largely be left to the original developers of each method. A particular theoretical/asymptotic property called the oracle property (Fan and Li, 2001) will be briefly addressed in Chapter 2 given its relevance to the variable selection problem.

Chapter 2

The Lasso, the Elastic Net, and Their Adaptations

This chapter begins with an introduction and general overview of the lasso and the elastic net, two modern developments in sparse machine learning methods. After the general overview, the chapter reviews many of the adaptations proposed to both tools. This review also outlines their performance concerning robust variable selection with respect to outliers and non-normal error distributions.

2.1 The Lasso and the Elastic Net

2.1.1 The Lasso

The Least Absolute Shrinkage and Selection Operator (lasso) was first proposed by Tibshirani (1996) to address limitations in the standard Ordinary Least Squares regression framework. Extreme variance in the prediction of the Ordinary Least Squares (OLS) estimates and reduced interpretability given large numbers of predictors underlie this initial proposal.
The two contemporary solutions to these problems, ridge regression and subset selection, each suffered from drawbacks. Subset selection provided highly interpretable results. However, as a result of the discrete selection process, subset selection suffered from extreme variance. This extreme variance produced especially non-robust selections, as relatively minute changes in the modeled data could select vastly different variables. Consequently, prediction accuracy also suffered. Ridge regression, meanwhile, proved much more effective in producing stable and consistently strong predictions from its models. Unfortunately, ridge regression never estimates any coefficients to be exactly 0, and thus its models are not easily interpreted. Enter the lasso, intended to produce a model with the stability and prediction capability of ridge regression and the interpretability of subset selection.

Consider first the standard linear regression model with p predictors:

y_i = \beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j,    (2.1)

where y_i represents an outcome value for subject i, i = 1,...,n, \beta_0 is some intercept or group average for the outcome vector y, x_i is a p-length vector of predictor values for subject i, and \beta_j is the coefficient estimate corresponding with the jth predictor, j = 1,...,p. The standard solution to the linear regression problem is to estimate coefficients \beta_j using least squares, which attempts to minimize the following objective function:

\hat{\beta}_{OLS} = \underset{\beta_0,\,\beta \in \mathbb{R}^p}{\arg\min} \left\{ \frac{1}{2n}\,\| y - \beta_0\mathbf{1} - X\beta \|_2^2 \right\},    (2.2)

where y is the vector of responses, \mathbf{1} is an n-length vector of all 1's, X is an n \times p matrix containing the vectors x_i, and \|\cdot\|_2 is the squared-error loss function, otherwise known as the Euclidean or \ell_2 norm (further clarification on mathematical norms is provided in Section 2.3.1). The general purpose of this technique is to determine a line that passes as close as possible to all observations, doing so by minimizing the square of the difference between the model-predicted value and the observed outcome y_i (in other words, the residual sum of squares).

Regularization, which includes both ridge regression and the lasso, operates in this context by adding a term to the OLS objective function that constrains coefficient estimates based on a particular criterion. Equation (2.2) then becomes the following in the case of the lasso:

\hat{\beta}_{lasso} = \underset{\beta \in \mathbb{R}^p}{\arg\min} \left\{ \frac{1}{2n}\,\| y - \beta_0\mathbf{1} - X\beta \|_2^2 \right\} \text{ subject to } \|\beta\|_1 \le t,    (2.3)

where \|\beta\|_1 \le t represents the \ell_1-norm constraint, t is a constant chosen to induce variable shrinkage, and the equation to be minimized is the standard least-squares minimization problem as described in (2.2). The constant t bounds the parameter estimates of the resulting model by limiting the sum of the absolute values of these estimates, constraining the model by shrinking parameter estimates and reducing some coefficients completely to zero. This constraint, and its isomorphic equivalent \lambda, is typically chosen using an external, data-driven technique such as cross-validation. The \ell_1-norm, alternatively known as the Least Absolute Deviation, minimizes the absolute value of the difference between an observed value and the corresponding value predicted by the model, and compares to the \ell_2-norm described previously. Alternately expressed in Lagrangian form,

\hat{\beta}_{lasso} = \underset{\beta \in \mathbb{R}^p}{\arg\min} \left\{ \frac{1}{2n}\,\| y - X\beta \|_2^2 + \lambda \|\beta\|_1 \right\}    (2.4)

for some \lambda \ge 0, where \lambda represents a tuning hyperparameter with a direct relationship to t under the constraint \|\beta\|_1 \le t. (In machine learning, a hyperparameter is a parameter that is determined by the model-user or the model itself and which has some impact on the way the model "learns" from the data. This is in contrast to typical parameters, such as regression coefficients, which are estimated as a result of the model.) For convenience with later adaptations, another formulation of the lasso criterion is as follows (Tibshirani, 1996; H. Wang et al., 2007):

\hat{\beta}_{lasso} = \underset{\beta \in \mathbb{R}^p}{\arg\min} \sum_{i=1}^{n} (y_i - x_i'\beta)^2 + n\lambda \sum_{j=1}^{p} |\beta_j|.    (2.5)

The first term on the right-hand side of the equation represents the least-squares criterion (i.e., the \ell_2-norm from (2.2)), and the second represents the \ell_1-norm lasso constraint in relation to sample size, n. Ridge regression follows a similar form as the lasso; ridge uses the \ell_2-norm constraint, n\lambda \sum_{j=1}^{p} \beta_j^2, instead of the \ell_1-norm constraint in (2.5).

2.1.2 The Elastic Net

In their preliminary paper on the elastic net, Zou and Hastie (2005) described three major issues with the general lasso:

• When p > n, the lasso can select at most n variables.
• Given correlated variables or groups of variables, the lasso will arbitrarily select one into the model and eliminate the others.
• Even in n > p data, given sufficient predictor correlations, the lasso's prediction performance is poor relative to other methods like ridge regression.

To address these concerns, Zou and Hastie (2005) propose the naïve elastic net, which utilizes both \ell_1- and \ell_2-norm regularization:

\hat{\beta}_{NaiveENet} = \underset{\beta \in \mathbb{R}^p}{\arg\min} \sum_{i=1}^{n} (y_i - x_i'\beta)^2 + \alpha\,\lambda_1 \sum_{j=1}^{p} |\beta_j| + (1-\alpha)\,\lambda_2 \sum_{j=1}^{p} \beta_j^2,    (2.6)

where \lambda_1 is the same lasso tuning hyperparameter outlined previously, \lambda_2 is the corresponding tuning hyperparameter for \ell_2 shrinkage, and 0 \le \alpha \le 1 is a hyperparameter that controls each regularization term's contribution to model estimation and variable shrinkage. Note that when \alpha = 0, this reduces to ridge regression. Similarly, this reduces to the lasso when \alpha = 1. \alpha = 0.5 corresponds with equal contributions from both ridge and lasso regularization. Although this initial method addressed the first and second concerns about the lasso, Zou and Hastie (2005) noted unsatisfactory predictive performance in most cases in the third scenario. The authors, therefore, propose a corrected version, known now as the elastic net:

\hat{\beta}_{ENet} = \left(1 + \frac{\lambda_2}{n}\right) \left\{ \underset{\beta \in \mathbb{R}^p}{\arg\min} \sum_{i=1}^{n} (y_i - x_i'\beta)^2 + \alpha\,\lambda_1 \sum_{j=1}^{p} |\beta_j| + (1-\alpha)\,\lambda_2 \sum_{j=1}^{p} \beta_j^2 \right\}.    (2.7)

The authors demonstrate the superior performance of the elastic net over the lasso and ridge regression on the three lasso concerns, with both real and simulated data.
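As a concrete point of reference, all three penalties above can be fit in R with the glmnet package. The short sketch below is illustrative only: glmnet parameterizes the combined penalty with a single λ and mixing weight α rather than the separate λ1 and λ2 of Equations (2.6)-(2.7), and it is not necessarily the implementation used elsewhere in this research.

library(glmnet)
set.seed(1)
n <- 100; p <- 8
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(3, 1.5, 0, 0, 2, 0, 0, 0)        # a sparse set of true coefficients
y <- as.numeric(X %*% beta_true + rnorm(n))

cv_lasso <- cv.glmnet(X, y, alpha = 1)          # lasso: pure l1 penalty
cv_ridge <- cv.glmnet(X, y, alpha = 0)          # ridge: pure l2 penalty
cv_enet  <- cv.glmnet(X, y, alpha = 0.5)        # elastic net: equal l1/l2 mixing

coef(cv_lasso, s = "lambda.min")                # some coefficients shrunk exactly to 0
coef(cv_ridge, s = "lambda.min")                # coefficients shrunk, but none exactly 0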
A simplified visualization of the three regularization methods mentioned so far can help illustrate the mechanical process by which they operate. Figure 2.1, shown in the original proposal by Zou and Hastie (2005), shows the optimized solution region \theta_{opt} (where \theta represents the parameter space for predictors \beta) for two predictors determined by some arbitrary regression estimator, and the constraints placed on those regions by the \ell_1-norm constraint (the lasso), the \ell_2-norm constraint (ridge regression), and the combined \ell_1- and \ell_2-norm constraint (elastic net regression), respectively.

[Figure 2.1: Constraint Surfaces of \ell_1, \ell_2, and Elastic Net Regularization (Zou and Hastie, 2005)]

The lasso's penalty results in potentially optimal values at vertices of the constraint surface, that is, at a point corresponding with one of the coefficient estimates set to 0. Meanwhile, the \ell_2-norm constraint in ridge regression is perfectly circular, with all points on the constraint surface equally optimal. Although in practice there exists a minimal possibility that ridge might estimate a coefficient to 0 on this surface, the theoretical probability is 0 due to the continuous nature of the constraint surface. Finally, the elastic net constraint surface features vertices at 0-estimates and a curvilinear surface in between these vertices. As the balancing hyperparameter \alpha increases from 0 to 1, the constraint surface shifts from a perfectly circular region to a perfectly square region. Not quite so evident from these plots, however, is the added benefit of selecting most or all correlated predictors into or out of the model, compared to the lasso's tendency to arbitrarily select one of a correlated group of variables and eliminate the rest of the group (Zou and Hastie, 2005), although we will see that the elastic net does not always select all correlated variables into a model, at least under specific conditions studied in Chapter 6. All subsequent formulations will use \lambda_1 to represent the lasso tuning hyperparameter and \lambda_2 to represent the ridge hyperparameter. Other tuning hyperparameters will be formulated and defined as necessary.

2.1.3 On the Oracle Property

This research would be remiss not to acknowledge a popular property in the regularization literature, the oracle property, originally proposed by Fan and Li (2001). A variable selection procedure is said to have the oracle property if, as n \to \infty, its selection capacity converges to the selection of the ideal procedure applied with knowledge of the true underlying model. On its face, this seems like an ideal property in the context of determining a robust variable selection procedure, as the primary goal should be to accurately and consistently select the true variables of a data-generating process and thereby consistently select the true underlying model. However, this property is mathematically complex, and the conditions necessary to meet its weak or strong equivalents are often highly convoluted and vary significantly depending on the particular technique. Not every applied researcher has access to the formal mathematical expertise necessary to ascertain these conditions in a given context.

More importantly, consider the following question: At what point does sample size n become "sufficiently large" for the model to converge on the model as selected by the "ideal" technique applied with full knowledge (tangentially: what is the "ideal" technique?)? This question arises in the general robustness literature concerning one of the most famous and fundamental theorems in statistical probability theory: the Central Limit Theorem, which states that as n \to \infty, the sampling distribution of a statistic converges to the normal distribution. This theorem historically justified the use of methods that assume normality with relatively small sample sizes, supporting the belief that any deviations from normality in the underlying distribution would be addressed as a result of the theorem. Wilcox (2016), among many others, has pointed out that the sample size necessary for methods that assume normality to perform as expected varies widely given the nature of the actual underlying distribution, which in reality is neither known nor observed. These comments do not nullify the theoretical utility of either the oracle property or the Central Limit Theorem; however, both concepts mean little given the focus of the current research on providing recommendations to the applied researcher, and thus the oracle property is not discussed further.
Readers interested in theoretical and asymptotic statistical concepts such as the oracle property are strongly encouraged to read the original proposal in Fan and Li (2001) as well as a recent critique of the property found in X. Wu and Zhou (2019). For completeness, Appendix A.1 outlines the full mathematical description of the property.

2.2 The Lasso, the Elastic Net, Outliers, and Normality

The remainder of this chapter is devoted to outlining many adaptations made to the lasso and the elastic net. The author reviews adaptations and their corresponding models, including mathematical formulations. Relevant findings compared to other techniques are outlined where applicable, particularly in the context of handling outliers or non-normal error distributions. The review presents adaptations roughly in chronological order of initial published proposal. However, adaptations which relate to previously-outlined adaptations are subsumed under the corresponding adaptation. The chapter concludes with a general review of the relevant findings across the various methods formulated below.

2.2.1 Selecting Models Into the Current Research

Due to limitations on time and computational resources, it is simply not feasible for a handful of studies to adequately cover all of the many adaptations made to the lasso and elastic net. Consequently, only some methods were selected to be studied herein. "Convenience" in many forms served as the primary justification for inclusion of any given adaptation into the current study. Primarily, the author is concerned with the convenience with which a researcher can properly and confidently apply the given methods. This research specifically focuses on implementation in R, an environment especially friendly to statistical software development and advanced research statistics. The author included lasso and elastic net adaptations with existing R implementations that did not require more than a day to produce reliable results in singular models. The primary rationale for these selection criteria is to ensure practical usability by the applied researcher. If the author, a reasonably-skilled R programmer, encountered significant obstacles to implementation, the implementation is unlikely to be practical for the applied researcher. The ultimate goal is to provide access to robust lasso and elastic net tools that the typical researcher can apply in practice with little difficulty.

This research also attempts to include both well-tread adaptations and one-off adaptations that have seen little study beyond their initial proposal. Studies included both the lasso and elastic net variants of any given adaptation where available. As a result, two adaptations feature in the following studies that have not been previously studied in the context of robustness. The only previous study of the adaptive Huber elastic net, Yi and Huang (2017), focuses solely on computational metrics in its simulations. Meanwhile, the adaptive LAD elastic net has never been explicitly proposed.

2.2.2 Literature Search and General State of the Literature

This program of research began with a literature search and review. The initial search on Web of Science used search terms such as "robust*," "lasso," "elastic net," "outlier*," and "non-normality." This search produced the initial proposals for many of the methods outlined below and a larger pool of studies to review.
Using this pool and any subsequent studies citing a proposal, the search process resulted in 62 studies that discussed either the lasso or the elastic net and statistical robustness, or proposed an adaptation to either with potential robust qualities. This sample of studies also features only non-hierarchical, cross-sectional data and single-outcome models.

The literature reviewed below comprises fewer than 20 studies from the original sample of 62. The majority of these studies did not study robust characteristics or propose a potentially robust method. Additional studies (for example, Park, 2017) lacked sufficient information about their simulations to fully understand the robust characteristics studied or the precise adaptations implemented.

Adaptations to the lasso or elastic net typically take one of two forms: modification to the squared-error loss function (LAD loss, Huber loss, M-estimators and relatives, etc.) or to the penalization term(s), such as fused penalization or the adaptive tuning hyperparameter. A small number of adaptations incorporate additional terms into the objective function, including the Outlier-Shifting method described in Section 2.8. Modifications intended to improve robust characteristics typically take the form of loss-function modifications. In contrast, penalty-term modifications typically seek to increase model stability, most often in the context of covariate collinearity or cross-validation. These could be considered "robustness" concerns because they robustify the results against instability from particular characteristics. However, we primarily focus on these methods insofar as they relate to outliers or non-normal error distributions.

Many adaptations in the robustification literature involve amalgamations of multiple previously proposed modifications. The "X + Adaptive Lasso Tuning Hyperparameter" is the most common example of this in the literature. However, this is a unique case given the practical utility of the adaptive formulation discussed below. Other examples include the weighted fused lasso, weighted LAD lasso, or the Huberized Outlier-Shifted lasso. This review focuses primarily on methods included in the current simulations; derivatives or generalizations of those relevant to the discussion; and a class of methods that was initially included and is worth reviewing despite its ultimate removal. Many other adaptations exist to both the lasso and the elastic net whose variable selection properties deserve further study.

Although the last few years have seen a drastic increase in modification and adaptation of the elastic net, these adaptations in particular remain largely unstudied from a robustness perspective. Many of the proposal studies for elastic net modifications do not address robustness concerns such as outliers or non-normal error distributions, or do not consider metrics relevant to robust variable selection. Out of the fewer than 20 studies reviewed in the following sections, only three (Lambert-Lacroix and Zwald, 2016; Kurnaz et al., 2018; and Freue et al., 2019) included the elastic net or an adaptation in their simulations. Given the particular lack of research on the robustification of the elastic net, the current research program hopes to provide an initial starting point and comprehensive review around which future robust elastic net research can orient.

2.3 Clarifications on Some Mathematical Representations

2.3.1 A Note on Mathematical "Norms"

The descriptions below are simplified for practical clarity.
They do not represent the complete formal definitions of norms more generally, nor of the specific norms mentioned. Mathematical norms measure a "distance" of sorts between points along a vector, typically represented symbolically by \ell_q or \ell_p. As a representation of abstracted distance, this value is always nonnegative. Given the use of p in the current research to indicate the number of potential predictors from which a method selects model variables, \ell_q represents the generalized norm throughout this research. Mathematically, the \ell_q-norm is:

\| \cdot \|_q = \left( \sum_{i=1}^{n} |\cdot_i|^q \right)^{1/q}    (2.8)

for some real number q \ge 1, where \cdot represents a vector of length n. The \ell_2-norm represents the straight distance between points on a vector, while the \ell_1-norm represents the absolute magnitude of the path between points on a vector (\ell_q norms for larger q represent distance notions with less clear practical examples). Both of these norms occur frequently in the current context. The \ell_2-norm of the vector of residuals is used both as the primary minimizing criterion for OLS regression and the coefficient penalizer in ridge regression. Meanwhile, the \ell_1-norm penalizes coefficients in the lasso and also replaces the \ell_2-norm used in standard OLS regression for LAD-loss regression.

2.3.2 On Two Similar Symbols

A commonly-occurring Greek letter in statistical formulations, \rho, closely resembles the letter used herein to represent the number of potential predictors, p. p in the current research always refers to the number of potential predictors provided to a model for variable selection; this analogously refers to the number of columns in the design matrix consisting of predictor variables. \rho, on the other hand, refers to one of two contexts: this symbol often indicates the statistical value of a correlation or collinearity, and some adaptations discussed in this chapter also use functions delineated by \rho. The author attempts to replace other uses of p with alternate letters or symbols, and the version of \rho used will be clarified for the contexts in which it arises.

2.4 Adaptive Tuning Hyperparameter

(See Zou (2006) and Zou and Zhang (2009) for primary mathematical considerations in the lasso and elastic net, respectively.) One of the first proposed modifications of the lasso uses adaptive tuning hyperparameters \lambda_{1,j} (Zou, 2006). The adaptive lasso, or weighted lasso, replaces the single tuning hyperparameter \lambda_1 with a weighted combination of variable-specific tuning hyperparameters. In layman's terms, the adaptive tuning parameters differentially shrink each variable. This modification primarily attempts to address problems of inconsistent variable selection that arise from finding the optimal tuning hyperparameter \lambda_1 for best prediction accuracy (Fan and Li, 2001; Meinshausen and Bühlmann, 2004):

\hat{\beta}_{AdLasso} = \underset{\beta \in \mathbb{R}^p}{\arg\min} \sum_{i=1}^{n} (y_i - x_i'\beta)^2 + \lambda_{1,j} \sum_{j=1}^{p} \hat{w}_j |\beta_j|,    (2.9)

where \hat{w} = 1/|\hat{\beta}|^{\gamma} is a vector of coefficient weights for some weighting hyperparameter \gamma > 0 and for some root-n consistent estimator \hat{\beta} of \beta. The initial coefficient estimates \hat{\beta} are estimated in a preliminary regression model. The choice of estimator \hat{\beta} is relatively arbitrary, and no convention exists in the literature. For consistency with previous research, ridge regression provides the preliminary coefficient estimates \hat{\beta}. Typically, \gamma is chosen via cross-validation in a fashion similar to the selection of \lambda_1.
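A minimal sketch of the two-stage procedure just described, again using glmnet as an assumed implementation choice (ridge estimates supply the preliminary coefficients, and the fixed γ = 1 below is an illustrative value; in practice γ is tuned as noted above):

library(glmnet)
# continuing with the X and y simulated in the earlier sketch
cv_ridge  <- cv.glmnet(X, y, alpha = 0)                          # preliminary ridge fit
beta_init <- as.numeric(coef(cv_ridge, s = "lambda.min"))[-1]    # drop the intercept
gamma <- 1                                                       # weighting hyperparameter (illustrative)
w <- 1 / abs(beta_init)^gamma                                    # adaptive weights, w_j = 1/|beta_hat_j|^gamma
cv_adlasso <- cv.glmnet(X, y, alpha = 1, penalty.factor = w)     # weighted (adaptive) lasso
coef(cv_adlasso, s = "lambda.min")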
Zou and Zhang (2009) made a similar adaptation to the elastic net due to similar concerns with the lasso tuning hyperparameter, as well as concerns about the combined impacts of collinearity and dimensionality on model selection and estimation:

$$\hat{\beta}_{AdENet} = \left(1 + \tfrac{\lambda_2}{n}\right)\left\{ \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} (y_i - x_i'\beta)^2 + \alpha\,\lambda_{1,j} \sum_{j=1}^{p} \hat{w}_j |\beta_j| + (1-\alpha)\,\lambda_2 \sum_{j=1}^{p} \beta_j^2 \right\}. \qquad (2.10)$$

Note in the elastic net formulation that only the lasso penalty term incorporates the adaptive tuning hyperparameter.

2.4.1 Multi-Step Adaptive Elastic Net

(See Xiao and Xu (2017) for primary mathematical considerations.)

Xiao and Xu (2017) propose that three fundamental issues arise when using the standard lasso penalty: (1) coefficient estimates are biased; (2) multicollinearity results in the selection of only one predictor among a group of correlated predictors; and (3) excessive false positives occur when true zero coefficients are selected into the final model, which generally contains the true non-zero coefficients. Given that the adaptive tuning hyperparameter addresses the first concern and the ridge penalty in the elastic net deals with the second, the authors suggest a multi-step estimation procedure for handling the third. This procedure relies on the notion that a model estimated with adaptive lasso and elastic net penalties generally succeeds at selecting true non-zero coefficients but, with only a single estimation step, includes too many true zero coefficients. The multi-step adaptive elastic net is conducted as follows:

1. Initialize adaptive weights $\hat{w}$ using the estimator of choice, $\hat{\beta}$.
2. For k = 1, 2, ..., M, solve the elastic net problem

$$\left(1 + \tfrac{\lambda_2}{n}\right)\left\{ \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} (y_i - x_i'\beta)^2 + \alpha\,\lambda_{1,j}^{*(k)} \sum_{j=1}^{p} \hat{w}_j^{(k-1)} |\beta_j| + (1-\alpha)\,\lambda_2^{*(k)} \sum_{j=1}^{p} \beta_j^2 \right\}, \qquad (2.11)$$

where $\lambda_1^{*(k)}$ and $\lambda_2^{*(k)}$ represent the optimized tuning hyperparameters at the current stage, and $\hat{w}_j^{(k-1)}$ the weights for the lasso tuning hyperparameter determined at the previous step.

Note that the parenthetical superscripts (k) and (k-1) represent indices rather than exponents for the tuning hyperparameter and tuning hyperparameter weight, respectively. k in this process indexes the elastic net estimation stages. Running the procedure with k = 1 stage amounts to the standard elastic net (since there can be no 0th set of weights), and with k = 2 stages, the result is the previously described adaptive elastic net. k > 2 results in a multi-step adaptive process as proposed by Xiao and Xu (2017). The authors do not clearly state how many steps they used for their multi-step procedure, a concern discussed in multiple subsequent chapters.

2.4.2 Relevant Findings

Zou and Zhang (2009) found that the adaptive elastic net's predictions outperformed or performed competitively against methods including the standard lasso, adaptive lasso, standard elastic net, and SCAD (smoothly clipped absolute deviation). However, their study focuses primarily on predictor collinearity, with no inclusion of outlier or distributional robustness. Xiao and Xu (2017)'s proposed multi-step method had middling predictive performance compared to the lasso, elastic net, adaptive lasso, and adaptive elastic net with small and moderate predictor collinearity (ρ = .25 and ρ = .5, respectively). However, when collinearity was high (ρ = .75), the other methods significantly outperformed the multi-step procedure.
On the other hand, their method dominated all other methods in addressing their primary concern: false positives (i.e., true zero coefficients estimated to be non-zero) in the face of collinearity. In all scenarios, the multi-step elastic net had a lower false-positive count than the other methods, including a count of zero given moderate and high collinearity. The study did not address outlier or distributional robustness, did not discuss false negatives, and only varied collinearity. Given that the reduced false positives could potentially impact false negatives, false negatives should be of particular interest for this adaptation.

Subsequent sections describe the relevant robustness performance of the adaptive lasso and adaptive elastic net in the context of other included adaptations. The multi-step adaptive elastic net has thus far remained unstudied beyond its initial proposal by Xiao and Xu (2017).

2.4.3 The Current Study

Given the ubiquity of adaptive lasso tuning in any discussion of the lasso or elastic net, both the adaptive lasso and adaptive elastic net are included in all simulations. The simulations also include the multi-step adaptive elastic net procedure, as the msaenet package (Xiao and Xu, 2019) provides practical implementation in R.

2.5 Least Absolute Deviation (LAD) Loss Function

(See H. Wang et al. (2007) for primary mathematical considerations with respect to the lasso.)

H. Wang et al. (2007) proposed one of the earliest attempts at robustifying the lasso to response outliers. They propose replacing the least squares criterion with the least absolute deviation (LAD), i.e., the $\ell_1$-norm:

$$\hat{\beta}_{LADLasso} = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} |y_i - x_i'\beta| + \lambda_{1,j} \sum_{j=1}^{p} \hat{w}_j |\beta_j|, \qquad (2.12)$$

where $|y_i - x_i'\beta|$, the LAD loss function (aka the $\ell_1$-norm), replaces the typical least-squares loss function. This loss function therefore penalizes the absolute value of residuals rather than the square. Due to the timing of their study, H. Wang et al. (2007) do not explicitly make use of the findings and techniques from Zou (2006). However, their LAD-lasso technique also uses an adaptive tuning hyperparameter $\lambda_{1,j}$ and coefficient weights vector $\hat{w}_j$. $\lambda_{1,j}$ and $\hat{w}_j = 1/|\hat{\beta}|^{\gamma}$ are as defined previously in Section 2.4. This method shall hereafter be referred to as the "adaptive LAD lasso."

The LAD elastic net has not yet been studied in the literature and exists primarily due to related software implementation developed by Yi and Huang (2017). Via the hqreg package (Yi, 2017), it is possible to simultaneously utilize the LAD loss with lasso regularization. Consequently, the package provides implementation for the elastic net with the LAD loss function. As with the lasso formulation, the LAD elastic net incorporates the adaptive lasso tuning hyperparameter. The LAD elastic net formulation is as follows:

$$\hat{\beta}_{LADENet} = \left(1 + \tfrac{\lambda_2}{n}\right)\left\{ \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} |y_i - x_i'\beta| + \alpha\,\lambda_{1,j} \sum_{j=1}^{p} \hat{w}_j |\beta_j| + (1-\alpha)\,\lambda_2 \sum_{j=1}^{p} \beta_j^2 \right\}. \qquad (2.13)$$

We will hereafter refer to this method as the "adaptive LAD elastic net" for clarity.

2.5.1 Quantile Loss Function

$$\hat{\beta}_{QuantileLasso} = \arg\min_{\beta_\tau \in \mathbb{R}^p} \sum_{i=1}^{n} \rho_\tau (y_i - x_i'\beta_\tau) + \lambda_{1,j} \sum_{j=1}^{p} \hat{w}_j |\beta_j|, \qquad (2.14)$$

where $0 < \tau < 1$ is some quantile of interest and

$$\rho_\tau(r) = \begin{cases} \tau r & r > 0 \\ -(1-\tau) r & r \leq 0 \end{cases}, \qquad (2.15)$$

is a function of the residuals r that estimates the $\tau$-th quantile of the response y given x. When $\tau = 0.5$, this corresponds to the $\ell_1$ or LAD loss. Consequently, quantile regression and the quantile lasso are generalizations of LAD regression and the LAD lasso. Note the adaptive lasso tuning hyperparameter. A small numerical sketch of the check function follows below.
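As a quick illustration of (2.15), the following minimal R sketch (not part of the simulation code) implements the check function and confirms that τ = 0.5 reduces it to a scaled absolute (LAD) loss.

```r
# Quantile check function rho_tau from (2.15)
rho_tau <- function(r, tau) ifelse(r > 0, tau * r, -(1 - tau) * r)

r <- c(-2, -0.5, 0, 1, 3)
rho_tau(r, tau = 0.5)   # equals 0.5 * abs(r): a scaled LAD loss
rho_tau(r, tau = 0.9)   # asymmetric: positive residuals penalized more heavily
```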
(See Zou and Ying (2008) for primary consideration of a specific implementation of the quantile lasso and Y. Wu and Liu (2009) for mathematical considerations of the generalized quantile lasso. Its first use in the elastic net context is unclear, but see Yi and Huang (2017) for some mathematical considerations. See also Koenker and Bassett (1978) for initial considerations on quantile regression.)

2.5.2 Weighted Loss Functions and the Weighted LAD Lasso

Another adaptation, found primarily in the context of the LAD lasso, is to adaptively weight observations' residuals such that larger residuals have a reduced impact on coefficient estimation. We see this first formulated by Arslan (2012):

$$\hat{\beta}_{WLADLasso} = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} w_{1,i} |y_i - x_i'\beta| + \lambda_{1,j} \sum_{j=1}^{p} \hat{w}_{2,j} |\beta_j|, \qquad (2.16)$$

where $w_{1,i}$ is a robust measure of distance for observations $x_i$, used to downweight leverage points (aka x-outliers), and $\hat{w}_{2,j}$ is the coefficient weights vector for the adaptive lasso tuning hyperparameter as discussed previously. This method shall be referred to as the "adaptive weighted LAD lasso." (See Arslan (2012) for initial consideration of the combined weighted and LAD loss function in the lasso.)

Park (2017) proposes a weighted-loss elastic net:

$$\hat{\beta}_{WENet} = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} w_{1,i} (y_i - x_i'\beta)^2 + \lambda_{1,j} \sum_{j=1}^{p} \hat{w}_{2,j} |\beta_j| + (1-\alpha)\,\lambda_2 \sum_{j=1}^{p} \beta_j^2. \qquad (2.17)$$

Whereas application to the lasso has only been in the context of the LAD loss function, application to the elastic net has only been in the context of the squared-error loss. Note the adaptive lasso tuning hyperparameter in both cases. This method shall be referred to as the "adaptive weighted elastic net."

2.5.3 Robust Adaptive Lasso

(See Zheng et al. (2016) for primary mathematical considerations.)

Zheng et al. (2016) proposed the Robust Adaptive Lasso (hereafter RA lasso) to address both robustness and efficiency issues in the standard lasso. The RA lasso minimizes the objective function:

$$\hat{\beta}_{RALasso} = \arg\min_{\beta \in \mathbb{R}^p} \alpha \sum_{i=1}^{n} (y_i - x_i'\beta)^2 + (1-\alpha) \sum_{i=1}^{n} |y_i - x_i'\beta| + \lambda_{1,j} \sum_{j=1}^{p} \hat{w}_j |\beta_j|, \qquad (2.18)$$

where $0 \leq \alpha \leq 1$ balances the typical squared-error and LAD loss functions, comparable to the balance of ridge and lasso penalties in elastic net regularization. The goal is to balance the benefits of the adaptive LAD lasso and the standard adaptive lasso to improve the characteristics of the standard lasso formulation.

2.5.4 Relevant Findings

H. Wang et al. (2007), Lambert-Lacroix and Zwald (2011), and Zheng et al. (2016) all demonstrated superior model selection performance of the adaptive LAD lasso over the standard lasso (without adaptive tuning hyperparameter or LAD loss function) when handling non-normality in the error distribution via the Cauchy distribution. However, Lambert-Lacroix and Zwald (2011) noted the adaptive LAD lasso's reduced efficiency with normal, light-tailed error distributions compared to the non-LAD adaptive lasso. We will return to Zheng et al. (2016)'s study, alongside its proposed RA lasso and another example of the adaptive LAD lasso, in the subsequent discussion of the adaptive Huber lasso. We also return to Lambert-Lacroix and Zwald (2011) and Lambert-Lacroix and Zwald (2016) in the context of the adaptive Huber lasso.

Fan et al. (2014) compared the performance of the non-adaptive LAD lasso (labelled in-study as the "robust lasso," but the formulation is the quantile lasso with τ = 0.5, in other words the LAD lasso) to the standard lasso, the smoothly clipped absolute deviation (SCAD)-penalized regression proposed and discussed in Fan and Li (2001), and an adaptive lasso with estimates $\hat{\beta}$ for tuning hyperparameter weighting calculated using SCAD regression.
They study performance under high dimensionality (referring to data scenarios where the number of potential predictors p exceeds the sample size n), specifically n = 100, p = 400 with 7 true non-zero predictors, and with the following relevant error-generating distribution scenarios: $N(0, \sqrt{2})$; a normal mixture with 90% generated from N(0,1) and 10% generated from N(0,5); a normal mixture with 90% generated from $N(0, \sigma_i^2)$, $\sigma_i \sim \mathrm{Unif}(1,5)$; and the Cauchy distribution. Although their proposed SCAD-based adaptive lasso consistently outperformed all other methods, performance was competitive, with FPRs ("false positive rate," the proportion of true zero coefficients selected into a model) ranging between 5% and 10% for all methods across all scenarios. FNR ("false negative rate," the proportion of true non-zero coefficients selected out of the model) typically ranged between 5% and 10%. All methods, however, saw drastically reduced performance concerning correct selection of true non-zero variables when presented with the normal mixture with standard deviation generated by the uniform distribution. FNRs in this scenario were approximately 33% for all methods, including the non-adaptive LAD lasso and the SCAD-based adaptive lasso. Fan et al. (2014) is notably one of only two studies that specifically utilize the non-adaptive formulation of the LAD lasso.

Arslan (2012) compared their proposed adaptive weighted LAD lasso to the standard adaptive LAD lasso in variable selection and model selection performance with between 0% and 10% error non-normality via a mixture with the Cauchy distribution. The adaptive LAD lasso and the adaptive weighted LAD lasso selected precisely the correct model at comparable proportions under normality. However, the LAD lasso never selected precisely the correct model under the Cauchy mixture. The adaptive weighted LAD lasso appeared to accurately select the correct model in the majority of simulations, regardless of non-normality. The only exception arose at the smallest sample size combined with the largest percentage of error non-normality. The adaptive weighted LAD lasso appears to correctly select more true zero coefficients out of the model across data conditions. The adaptive LAD lasso, on the other hand, saw increased FPRs under worsening error non-normality. Arslan (2012) did not incorporate outliers into their simulations, and all data scenarios featured collinearity.

The LAD lasso is amongst the most studied adaptations to the lasso, particularly regarding robustification. Subsequent sections include discussion of the LAD lasso in the context of further adaptations. The quantile lasso will be discussed in the context of another lasso adaptation in subsequent sections. The adaptive LAD elastic net has thus far neither been proposed nor studied. It exists as a consequence of software implementation of other lasso and elastic net adaptations.

2.5.5 The Current Study

The current study incorporates the adaptive LAD lasso and the adaptive LAD elastic net into the simulations via the hqreg package (Yi, 2017) in R, as exemplars of quantile regularization; a brief usage sketch follows below. The adaptive weighted LAD lasso and the RA lasso were excluded for lack of accessible implementation despite promising results.
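A minimal sketch of how the LAD-loss models might be fit with hqreg is shown below. The data objects and the fixed, uniform weight vector are illustrative assumptions; the actual simulations derive the weights from a preliminary ridge fit and cross-validate the weighting hyperparameter, as described in Chapter 3.

```r
library(hqreg)

# Illustrative data (assumed): x is an n x p matrix, y the response,
# and w a vector of adaptive-lasso weights
set.seed(1)
n <- 100; p <- 8
x <- matrix(rnorm(n * p), n, p)
y <- x %*% c(0.5, 1.0, 1.5, 2.0, rep(0, p - 4)) + rnorm(n)
w <- rep(1, p)  # placeholder; the simulations use ridge-based weights

# Adaptive LAD lasso: quantile loss with tau = 0.5, alpha = 1 (pure lasso penalty)
lad_lasso <- cv.hqreg(x, y, method = "quantile", tau = 0.5,
                      alpha = 1, penalty.factor = w, nfolds = 5)

# Adaptive LAD elastic net: same loss, alpha = 0.5 mixes lasso and ridge penalties
lad_enet <- cv.hqreg(x, y, method = "quantile", tau = 0.5,
                     alpha = 0.5, penalty.factor = w, nfolds = 5)

lad_lasso$lambda.min  # lambda selected by cross-validation
```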
2.6 Huber Loss Function

(See Rosset and Zhu (2007) and Lambert-Lacroix and Zwald (2011) for primary mathematical considerations with the lasso, and Yi and Huang (2017) with respect to the elastic net. Huber (1964) and Huber (1981) outline mathematical considerations for Huberized estimators in general.)

Looking towards older work in robust estimation, Rosset and Zhu (2007) and Lambert-Lacroix and Zwald (2011) adapted the lasso by using the Huber loss function (Huber, 1964) rather than the squared-error loss:

$$\hat{\beta}_{HuberLasso} = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} L(y_i - x_i'\beta) + \lambda_{1,j} \sum_{j=1}^{p} \hat{w}_j |\beta_j|, \qquad (2.19)$$

where L(z) is the Huber loss function:

$$L(z) = \begin{cases} z^2 & |z| \leq M, \\ 2M|z| - M^2 & |z| > M, \end{cases} \qquad (2.20)$$

and M is a transition hyperparameter that adjusts the value represented by z. z can be any value; in the current setting, z corresponds to the regression residuals $(y_i - x_i'\beta)$. Huber (1964) and Huber (1981) originally developed the Huber estimator to enhance the robustness of regression methods and measures of central tendency. M is fixed to 1.345 based on empirical results by Huber (1981) demonstrating this transition point's balance of robustness with efficiency under normality. However, as Zheng et al. (2016) noted, further work on transition hyperparameter selection for the Huberized lasso should be conducted.

Independently of Rosset and Zhu (2007)'s and Lambert-Lacroix and Zwald (2011)'s work incorporating Huber loss into the lasso, Yi and Huang (2017) outlined a Huberized elastic net:

$$\hat{\beta}_{HuberENet} = \left(1 + \tfrac{\lambda_2}{n}\right)\left\{ \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} L(y_i - x_i'\beta) + \alpha\,\lambda_1 \sum_{j=1}^{p} \hat{w}_j |\beta_j| + (1-\alpha)\,\lambda_2 \sum_{j=1}^{p} \beta_j^2 \right\}. \qquad (2.21)$$

Their work coincided with the development of the hqreg package (Yi, 2017) for handling high-dimensional data efficiently using an innovative semismooth Newton coordinate descent (SNCD) algorithm. Theoretical considerations of the algorithm fall outside the scope of the current applied statistical research; interested readers should consult the original paper for further details. Yi and Huang (2017)'s simulations focus solely on the run-time of their SNCD-based methods compared to similar methods and not on performance metrics.

The Huberized lasso and elastic net both incorporate the adaptive tuning hyperparameter in the current study. They will therefore be referred to as the "adaptive Huberized lasso" or "adaptive Huber lasso" and the "adaptive Huberized elastic net" or "adaptive Huber elastic net." Some studies implement a non-adaptive version of the Huber lasso or elastic net, and those will be referred to simply as the "Huber lasso" or "Huber elastic net."

2.6.1 Relevant Findings

Lambert-Lacroix and Zwald (2011) found that the adaptive Huber lasso outperformed the adaptive LAD lasso when sampling from a double exponential distribution. However, the adaptive Huber lasso underperformed relative to the adaptive lasso when sampling from a mixed normal distribution with 10% contamination in the error distribution from N(0,15). A follow-up study by Lambert-Lacroix and Zwald (2016) utilized the same mixture process to study the performance of the adaptive lasso, adaptive elastic net, adaptive Huber lasso, and a one-off procedure proposed therein called the "BerHu" (i.e., reverse Huber) penalty. In all data scenarios, the adaptive Huber lasso typically showed average performance concerning correct elimination of true zero coefficients. When sampling errors from the double exponential distribution, on the other hand, it vastly outperformed the adaptive lasso and adaptive elastic net in eliminating true zero coefficients from the model.
The proposed BerHu method performed better than all other methods in all scenarios besides the uncontaminated, normally distributed error scenarios, in which it performed poorly.

A few elements of Lambert-Lacroix and Zwald (2011)'s and Lambert-Lacroix and Zwald (2016)'s studies are worth noting that limit their applicability to the task at hand. First, all simulated data scenarios include heavy multicollinearity; none assess model performance in the context of contamination or non-normality alone. Lambert-Lacroix and Zwald (2016) also do not consider erroneous elimination of true non-zero predictors from the model, despite implications that the models might differ in performance in this regard. This discrepancy is implied, for instance, by comparing the reported number of zero coefficients estimated by each model with the number of correct zero coefficients estimated by each model. The former being higher indicates that, among the zero estimates produced, some correspond to true non-zero coefficients underlying the model. Given the importance of selecting the correct predictors into a model, in addition to correctly eliminating true zero coefficients from the model, this omission is significant. However, to their credit, Lambert-Lacroix and Zwald (2016) was also the only study of robustness in the lasso and elastic net that included the adaptive formulation of the elastic net.

Zheng et al. (2016) found that the RA lasso, the adaptive LAD lasso, and the adaptive Huber lasso all outperformed the adaptive lasso in correctly selecting valid predictors into the model and eliminating true zero coefficients. Their relative performances were consistent for both normally distributed errors and a 10% Cauchy mixture in the error distribution. This appears to be the only study of the robust performance of the RA lasso thus far. Unfortunately, Zheng et al. (2016) only incorporated non-normality into their simulations. We will return to the adaptive Huber lasso in the review of subsequent adaptations.

Yi and Huang (2017) only recently proposed and studied the adaptive Huber elastic net, and no further study of the method exists. Their simulation findings pertain exclusively to computational metrics such as run-time, and they do not consider variable selection metrics, model selection metrics, or robust characteristics.

2.6.2 The Current Study

The adaptive Huber formulations of both the lasso and the elastic net are included in the current simulations through implementation with the hqreg package (Yi, 2017) in R.

2.7 Loss Functions Based on S-Estimators, M-Estimators, and MM-Estimators

(See Susanti et al. (2014) for a broad discussion of all three estimators in the context of robust regression. Further considerations are noted in each subsection as appropriate.)

The next methods use related robust estimators of scale called S-estimators, M-estimators, and MM-estimators. The primary characteristic relating these estimators as used in regression loss functions is the ρ-function. As defined by Maronna (2011), a ρ-function, ρ(·), is a function with the following characteristics (an illustrative sketch follows this list):

1. the function is continuous, even, and bounded;
2. ρ(x) is an increasing function of |x|;
3. ρ(0) = 0;
4. $\lim_{x \to \infty} \rho(x) = 1$;
5. ρ(u) ≤ ρ(v) for any ρ(v) ≤ 1 and 0 ≤ u ≤ v.

Note that the ρ-function here is distinct from that defined in Section 2.5.1 for the quantile loss.
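As a concrete illustration of these properties, the following R sketch implements Tukey's bisquare, defined formally in equation (2.24) of the next subsection, and numerically checks boundedness and the limiting value. Rescaling by its maximum M²/6 yields a version whose limit is exactly 1, matching property 4. The sketch is illustrative only and is not part of the simulation code.

```r
# Tukey's bisquare rho-function (see equation (2.24) below)
rho_bisquare <- function(z, M = 4.685) {
  ifelse(abs(z) <= M,
         z^2 / 2 - z^4 / (2 * M^2) + z^6 / (6 * M^4),
         M^2 / 6)
}

z <- seq(-10, 10, by = 0.5)
r <- rho_bisquare(z)

r[z == 0]                           # property 3: rho(0) = 0
max(r)                              # bounded above by M^2 / 6 (property 1)
rho_bisquare(1e6) / (4.685^2 / 6)   # rescaled version tends to 1 (property 4)
```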
2.7.1 M-Estimators

(See Huber (1964) and Huber (1973) for considerations of M-estimators generally.)

M-estimator-based objective functions take the following form:

$$\hat{\beta}_M = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} \rho\!\left( \frac{r_i}{s} \right), \qquad (2.22)$$

where $r_i$ is the residual of the ith observation and s is some robust measure of scale analogous to the standard deviation. One recommendation for the measure of scale s in the divisor of the ρ-function is the median absolute deviation, defined as:

$$\hat{s}_{MAD} = \frac{\mathrm{median}\,|r_i - \mathrm{median}(r_i)|}{0.6745}, \qquad (2.23)$$

where $r_i$ is once again the regression residual for the ith observation, and 0.6745 corresponds to the .75 quantile of the z-distribution. This value rescales the MAD to better approximate the standard deviation under normality (Wilcox, 2016). Huber's estimator, used in the Huberized lasso and elastic net discussed in Section 2.6, is an example of an M-estimator but is considered separately due to the timing of its development and the standalone nature of its consideration in the literature. Another example of an M-estimator is Tukey's bisquare:

$$\rho(z) = \begin{cases} \dfrac{z^2}{2} - \dfrac{z^4}{2M^2} + \dfrac{z^6}{6M^4} & |z| \leq M, \\[6pt] \dfrac{M^2}{6} & |z| > M, \end{cases} \qquad (2.24)$$

where M is a transition hyperparameter similar to the one described for the Huber loss function in Section 2.6. Although this parameter is relatively well defined in the Huber case, proposals for efficient and robust values in Tukey's bisquare vary in the literature. The author therefore recommends selection of this hyperparameter via cross-validation. In the general robust regression context, Susanti et al. (2014) use M = 4.685; this choice is not justified in their text.

M-estimators arise in the lasso and elastic net as a secondary step after either an initial S-estimator or a preliminary M-estimator. Both variants are variously labeled "MM-estimators."

2.7.2 S-Estimators

(See Rousseeuw and Yohai (1984) for consideration of S-estimators generally and Freue et al. (2019) for consideration with the elastic net.)

S-estimators (Rousseeuw and Yohai, 1984) are a specialized implementation of M-estimators corresponding with regression estimates and satisfying particular conditions on the ρ-function:

$$\hat{\beta}_S = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} \rho\!\left( \frac{r_i}{\hat{s}} \right), \qquad (2.25)$$

where

$$\hat{s} = \sqrt{ \frac{1}{nK} \sum_{i=1}^{n} w_i r_i^2 }, \qquad (2.26)$$

and

$$w_i = \frac{\rho(r_i)}{r_i}, \qquad (2.27)$$

for some tuning hyperparameter K. $\hat{s}$ in this case is an estimator of the measure of scale s used in (2.28), ρ(·) is a ρ-function as defined in Section 2.7, and $r_i$ is the residual corresponding to the ith observation as defined in Section 2.7.1. Successfully computing the values defined above is somewhat more mathematically involved than is appropriate for the current research. However, interested readers are encouraged to refer to Rousseeuw and Yohai (1984)'s seminal work on this class of estimators, Susanti et al. (2014)'s discussion in the general robust regression framework, and Freue et al. (2019)'s implementation in elastic net-penalized regression.

Freue et al. (2019) proposed the incorporation of an S-estimator-based loss function into elastic net-penalized regression, named the Penalized Elastic Net S-Estimator (PENSE):

$$\hat{\beta}_{PENSE} = \left(1 + \tfrac{\lambda_2}{n}\right) \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} \rho\!\left( \frac{r_i}{\hat{s}} \right) + \lambda_1 \sum_{j=1}^{p} |\beta_j| + (1-\alpha)\,\lambda_2 \sum_{j=1}^{p} \beta_j^2. \qquad (2.28)$$

Note the lack of an adaptive lasso tuning hyperparameter.

2.7.3 MM-Estimators

(See Smucler and Yohai (2017) for lasso considerations and Freue et al. (2019) for elastic net considerations.)

Smucler and Yohai (2017) adapted the concept of MM-estimators for use in the lasso setting. Initially proposed by Yohai (1987), MM-estimators incorporate residuals of estimates to perform efficiently under idealized conditions of normality in addition to handling a large percentage of arbitrary contamination of the data by outliers:

$$\hat{\beta}_{MMLasso} = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} \rho_1\!\left( \frac{r_i}{s_n(r(\tilde{\beta}))} \right) + \lambda_1 \sum_{j=1}^{p} |\beta_j|, \qquad (2.29)$$
where $\rho_1$ is a ρ-function as defined in Section 2.7 and $s_n(r(\tilde{\beta}))$ is the M-estimate of scale of the residuals based on an initial estimate $\tilde{\beta}$. A formulation incorporating the adaptive lasso tuning hyperparameter is outlined below:

$$\hat{\beta}_{AdaMMLasso} = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} \rho_1\!\left( \frac{r_i}{s_n(r(\tilde{\beta}))} \right) + \lambda_{1,j} \sum_{j=1}^{p} \hat{w}_j |\beta_j|. \qquad (2.30)$$

These two methods will hereafter be referred to as the "MM lasso" and "adaptive MM lasso," respectively.

2.7.4 Relevant Findings

Smucler and Yohai (2017)'s review of robust sparse regression methods evaluated several robustness-related data conditions, including a predictor contamination condition that replaced a random 10% of values with 5; a heavy-tailed error distribution generated from the Cauchy distribution; the combination of predictor contamination and heavy-tailed error; and conditions of collinearity and high dimensionality. Smucler and Yohai (2017) found that the MM lasso and its adaptive version both performed well in false positive and false negative rates under simulated data conditions with various underlying distributions. High-leverage outliers provided the strongest of the MM lasso's competitive performance. Although the ESL lasso outperformed the other included methods in terms of FPR when there was only one true non-zero predictor, the other methods generally surpassed it in all other data scenarios. It also always performed poorly with regard to incorrectly eliminating true non-zero predictors from the model, with FNRs going as high as 71% under low-dimensionality conditions, regardless of error distribution tailed-ness. The adaptive LTS lasso demonstrated the opposite problem: although its FNR approached or equaled 0% in all settings, its FPR was always among the worst. The non-adaptive LAD lasso, non-adaptive MM lasso, and adaptive MM lasso performed competitively in FPR and FNR regardless of data scenario, although the non-adaptive LAD lasso showed diminished performance in the context of leverage point contamination. Although the standard and adaptive lasso performed well under normality and without contamination by leverage points, they otherwise performed poorly compared to the remaining methods.

A few concerns are present with Smucler and Yohai (2017). All scenarios included collinearity. Consequently, parsing out the distinct impacts of non-normality or predictor contamination on the findings proves difficult. Furthermore, the only data they present regarding predictor contamination scenarios are maximum FPR and FNR values for each method, and only for 100 simulated iterations per scenario.

Machkour et al. (2020) studied the performance of the adaptive LTS lasso; the non-adaptive and adaptive Huber lasso (described in-study as the M-lasso and adaptive M-lasso using Huber's estimator for the ρ-function); the MM lasso and adaptive MM lasso with Tukey's bisquare; and an additional method described as the outlier-corrected-data-adaptive lasso attributed to Machkour et al. (2017). (Unfortunately, the full formulation of this last method could not be located through USC's publication access; it is therefore not included in the current study.) Machkour et al. (2020) also included either the standard lasso or the adaptive lasso, although their literature review and wording make it unclear which they included in their simulations.
Their data scenarios all incorporated high dimensionality (specifically, n = 30, p = 50); predictor contamination alone by a normal mixture with 0%, 10%, 20%, and 30% contamination by N(0,100); and the same predictor contamination alongside response contamination by a normal mixture in the error distribution with 5% contamination by N(0,100). All scenarios included collinearity. The adaptive Huber lasso consistently outperformed or performed competitively in terms of FPR across contamination conditions, generally in the range of 3%-7%. Its FNR did reach as high as 56% under the most severe predictor-plus-response contamination, although this performance cannot be disentangled from the lasso's difficulty in selecting collinear predictors. The adaptive MM lasso performed comparably. The adaptive LTS lasso's performance was worse but more stable across conditions, with generally inferior FPR (9%-12%). The adaptive LTS lasso's FNR behaved similarly to that of the adaptive Huber lasso and adaptive MM lasso, worsening with worsening contamination, although this too was more stable (16%-41%). Similar to other studies, their simulations did not separate collinearity and outlier contamination. Their study did include predictor contamination in the absence of response contamination. However, they only simulated response contamination in combination with predictor contamination, and response contamination did not exceed 5%.

Li et al. (2020) implemented Tukey's bisquare in their M-estimator adaptation, although they did not utilize the adaptive lasso tuning hyperparameter for it. Their lasso implementation, as well as the LAD lasso and Huber lasso included in the study, incorporated the adaptive tuning hyperparameter. They evaluated predictor and response contamination in isolation and combined. Outlier-generating distributions included normal distributions with 10% contamination from N(10,1) and N(0,10) for the predictor and error distributions, respectively. Under both normality and response contamination alone, all four loss functions studied (squared-error, LAD, Huber, and Tukey-M) approached 0% FNR, with FPRs ranging between 24% and 28%. Under predictor contamination alone, all four lasso implementations had 0% FNR; the adaptive lasso showed an FPR of more than 44%. However, the adaptive LAD lasso and adaptive Huber lasso produced FPRs in the mid-30%'s, and the Tukey-M showed an FPR of approximately 26%. Finally, under both predictor and response contamination, all FNRs were 0% except for the adaptive lasso (7%), and all FPRs were in the mid-30%'s besides the Tukey-M (~28%).

Whereas the studies by Smucler and Yohai (2017) and Li et al. (2020) feature among the more method-inclusive simulations, Freue et al. (2019) is one of the only studies found in the course of this research that includes robustified elastic net formulations. Freue et al. (2019) provides the initial proposal for S- and MM-penalized elastic net formulations, comparing these methods (PENSE and PENSEM, respectively) to the adaptive LTS lasso and the non-adaptive MM lasso. Unfortunately, although the authors include the standard lasso and elastic net in their non-contaminated data scenarios, neither is included when handling outlier contamination. Therefore, comparisons of robustness regarding outliers cannot be made to the other methods, and they will not be further discussed. Data conditions include simultaneous predictor and response contamination and varying examples of low and high dimensionality. All conditions also feature high collinearity among true non-zero predictors.
The proposed M-estimator-related methods tended towards 0% FNR, with drastically reduced performance in a data condition with n = 100, p = 995, and 15 true non-zero predictors. In this scenario, PENSE's FNR approached 75% and PENSEM's FNR was approximately 60%. However, this also corresponded with 0% FPRs for both methods, regardless of contamination. Their performance in this condition was comparable to that of the adaptive LTS lasso and the MM lasso. Aside from this condition, PENSE and PENSEM generally performed competitively in terms of variable selection relative to the adaptive LTS lasso and the MM lasso. The MM lasso typically underperformed in variable selection relative to PENSE and PENSEM, with the exception of one data scenario where n = 100, p = 81, and 27 of the 81 potential predictors were true non-zero predictors of the outcome. In this scenario, the MM lasso performed competitively with PENSE and PENSEM, with FNR approaching 0% and FPR between 30% and 40%. The adaptive LTS lasso underperformed in all data conditions, although in contrast to Kurnaz et al. (2018) and Smucler and Yohai (2017), its extremely high FNRs corresponded with extremely low FPRs when it performed worst. One potential explanation for this contradiction with other findings on the adaptive LTS lasso is that these simulations included a greater proportion of true non-zero predictors. Smucler and Yohai (2017), for instance, generated data with only 6-8 true non-zero predictors out of a potential parameter space of 8-250. Meanwhile, Freue et al. (2019) generated data with approximately one-third of potential parameters being true predictors of the outcome. This discrepancy merits further consideration in future research.

2.7.5 The Current Study

Although the initial stages of this study incorporated the MM lasso and adaptive MM lasso via the mmlasso package (citation unavailable), either the underlying C++ architecture or the package's C++-interfacing functionality through RcppArmadillo (Eddelbuettel and Sanderson, 2014) was deprecated by the time the main simulations were conducted. Additionally, although the pense package (Kepplinger et al., 2020) provides an updated version of this package that also includes PENSE, the author encountered multiple obstacles in attempting implementation through this package. Obstacles included discrepancies between vignettes and application on a personal computer; extremely high FPR (over 70%) inconsistent with the current understanding of this or any other "outlier-robust" method studied thus far; and long run times. Practical implications for applied researchers arise from the package methods' performance and the excessive run times. Consequently, although MM-based methods are amongst the most promising and most studied adaptations, they are not included in the current study. Future research would benefit greatly from resolving these software issues and incorporating the MM lasso, adaptive MM lasso, PENSE, and PENSEM into this cohesive research framework of lasso and elastic net adaptations.

2.8 Outlier Shifting

(See Jung et al. (2016) for primary mathematical considerations.)

Jung et al. (2016) proposed a general robust regression adaptation that penalizes potentially outlying observations in the data with an additional penalty term. The authors compare the technique to best subset selection used for "selecting observations instead of variables" (p. 3).
Applied to the lasso, the resulting outlier-shifted model procedure is as follows:

$$\hat{\beta}_{OSLasso} = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} (y_i - x_i'\beta - \gamma_{os,i})^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \sum_{i=1}^{n} \gamma_{os,i}^2\, I\big(|y_i - x_i'\beta| < k_{os}\big), \qquad (2.31)$$

where the $\gamma_{os,i}$'s are case-specific parameters that shrink towards zero for non-outlying observations in the data, based on the outlier threshold parameter $k_{os}$. These parameters are conceptually analogous to the $\lambda_1$ tuning hyperparameter in the general lasso, except tailored towards the outlier-ness of the observed values in particular cases. I(·) represents the indicator function

$$I(C) = \begin{cases} 1 & C \text{ true} \\ 0 & C \text{ not true} \end{cases}, \qquad (2.32)$$

where C is some condition of interest. Put into words, the indicator function here penalizes the observation-specific outlier-shifting parameter towards 0 if the residual is less than the constraint placed by $k_{os}$: the indicator evaluates to 1, resulting in the inclusion of the penalty term. Suppose instead that the residual exceeds the constraint placed by $k_{os}$. In that case, the penalty term is multiplied by 0 and not included, resulting in the full impact of the outlier-shifting procedure for observation i. Currently, no discussion or formulation of an outlier-shifted elastic net exists, nor is there implementation that conveniently incorporates the two together as with the LAD loss. Outlier shifting does not utilize the adaptive tuning hyperparameter.

2.8.1 Outlier Shifting with Huber Loss Function

Jung et al. (2016) also proposed a Huberized version of their outlier-shifted lasso:

$$\hat{\beta}_{OSHuberLasso} = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} L(y_i - x_i'\beta - \gamma_{os,i}) + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \sum_{i=1}^{n} \gamma_{os,i}^2\, I\big(|y_i - x_i'\beta| < k_{os}\big). \qquad (2.33)$$

No discussion or formulation of a Huberized outlier-shifted elastic net exists, nor is there adaptable implementation. The Huberized outlier-shifted lasso does not utilize the adaptive tuning hyperparameter.

2.8.2 Relevant Findings

Jung et al. (2016) assessed the false negative and false positive rates of the non-adaptive Huber lasso, the lasso, the outlier-shifted (OS) lasso, and the OS Huber lasso. Their data scenarios include response outlier contamination with 10%, 20%, and 30% contamination by N(0, σ), σ = 3, 6, 10, in the error distribution. Jung et al. (2016) found that all four methods produced comparably low false negative rates. The Huber lasso tended towards mediocre performance among the four methods. However, all four methods showed average false positive rates greater than 44% across conditions. Jung et al. (2016) also found that their outlier-shifted lasso typically produced competitive predictions in the presence of small outlier magnitudes (σ = 3) under all contamination levels. Increases in outlier magnitude (contamination from N(0,6) and N(0,10)) showed corresponding reductions in relative performance. The outlier-shifted lasso always produced at least mediocre prediction performance. Overall, the OS lasso performed comparably to the lasso, the non-adaptive Huber lasso, and the combined OS Huber lasso procedure (Jung et al., 2016), including high rates of false positive coefficients. However, these outlier-shifting methods have not been studied elsewhere. Notably, this study only evaluated scenarios with response outliers and did not include leverage points. Furthermore, all data scenarios incorporated collinearity.
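Before turning to how these methods enter the current simulations, the following toy R sketch may help fix intuition for the indicator-based penalty in (2.31)-(2.32). It is not the authors' code (which was provided privately for the simulations) and the function and variable names are hypothetical; it simply evaluates the penalty term under my reconstruction of the formula above.

```r
# Toy illustration of the indicator logic in (2.31)-(2.32): the ridge-type
# penalty on the case-specific shift parameters applies only to observations
# whose residuals fall below the threshold, so shifts for apparent outliers
# go unpenalized and can absorb the outlying values.
os_penalty <- function(residuals, gamma_os, k_os) {
  sum(gamma_os^2 * (abs(residuals) < k_os))
}

set.seed(1)
res      <- c(rnorm(9), 8)           # one grossly outlying residual
gamma_os <- rep(0.5, 10)             # hypothetical shift parameters
os_penalty(res, gamma_os, k_os = 3)  # the 10th observation is exempt from the penalty
```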
2.8.3 The Current Study

Both the OS lasso and the OS Huber lasso are included in the current simulations thanks to R code generously provided by one of the authors of Jung et al. (2016).

2.9 General Summary of Findings Regarding Outliers and Non-Normal Error Distributions

Painting a cohesive and comprehensive image of the results of attempted robustifications of the lasso is incredibly difficult. Heterogeneous simulation methods abound in the fewer than 20 studies of robustness identified. Contributing further to the difficulty of parsing the effects of outliers and non-normality, few studies separate outlier contamination from collinearity, non-normality from collinearity, or predictor contamination from response contamination (if both are even present).

Studies have consistently found that the standard lasso and adaptive lasso underperform in selecting true non-zero variables in the presence of either non-normality or outlier contamination. They both also underperform in all scenarios concerning the elimination of true zero coefficients. These methods will therefore likely perform well under normality and non-contamination but will otherwise underperform.

M-estimator-based lasso methods perform consistently in the literature under various data scenarios, often producing results competitive with the adaptive Huber and adaptive LAD lasso. However, the MM lasso suffered compared to two elastic net formulations featuring M-estimators (PENSE and PENSEM) when the potential predictor space included greater proportions of true predictors.

The adaptive LAD lasso and adaptive Huber lasso both appeared to perform competitively in general, although the adaptive LAD lasso suffered in the presence of many true non-zero predictors combined with heavy outlier contamination. Notably, in the two studies that utilized the non-adaptive LAD lasso, the non-adaptive formulation performed incredibly poorly regarding proper variable selection and elimination. However (there are many "however's" and exceptions, given the heterogeneity of the lasso and elastic net robustification research and the adaptations it studies), the elastic net formulation gave much more promising results in Kurnaz et al. (2018)'s proposal, providing a better balance of FPR and FNR than the lasso formulation. It was only compared to the adaptive LTS lasso and the standard elastic net, however, and thus little else can be said about its capabilities. This finding, along with Freue et al. (2019)'s results with PENSE and PENSEM relative to the MM lasso, suggests that elastic net formulations of these robust modifications might serve even better than their lasso counterparts in handling outliers and non-normally distributed errors.

This possibility, however, illustrates one of the most significant steps this program of research contributes to robust variable selection techniques. To say nothing of little-studied existing elastic net formulations such as the adaptive elastic net or M-estimator-based elastic net formulations, Yi and Huang (2017) presents two entirely new elastic net adaptations. Although their proposal does not address concerns of robust variable selection, they explicitly propose one new elastic net formulation in the adaptive Huber elastic net and provide easy implementation. Furthermore, although not explicitly formulated, the adaptive LAD elastic net arises naturally from the R software they developed to implement their Huberized elastic net.
These two elastic net possibilities, the promising but only once-studied multi-step adaptive elastic net and outlier-shifted lasso techniques, and the well-covered adaptive LAD lasso and adaptive Huber lasso all provide a strong basis from which to develop an understanding of practically relevant and robust variable selection methods.

Chapter 3

Simulations: Outlier Robustness

This chapter outlines simulations conducted to evaluate the robustness of various lasso and elastic net adaptations to outlier contamination. The chapter begins with a brief review of relevant findings in the extant literature, followed by a description of the methods used to simulate and study outlier contamination. Results are then presented and discussed. Additionally, the chapter includes a description of and results from a smaller simulation conducted to evaluate outlier robustness in higher dimensions.

3.1 Review of Relevant Findings

The following review is constrained to findings involving at least one of the included adaptations designed for robust characteristics, in the context of either predictor or response outlier contamination. Consequently, the review below does not include several studies reviewed while describing lasso and elastic net adaptations in Chapter 2. Studies that assessed the standard or adaptive formulations of the lasso or elastic net, but did not consider any other adaptations included in the current simulations, are not reviewed. However, the performance of these methods is discussed below in the context of other included adaptations.

The standard and adaptive lasso consistently perform competitively under normality and without outlier contamination. However, their performance deteriorates significantly with any form of outlier contamination. The only exceptions to this finding come from Lambert-Lacroix and Zwald (2011) and Lambert-Lacroix and Zwald (2016). The former found that the adaptive lasso performed competitively in FPR with the adaptive LAD lasso and adaptive Huber lasso, even under heavy response contamination. Unfortunately, underperformance in FNR in these scenarios offset this competitive FPR, with FNRs approaching 15%-20%. Meanwhile, Lambert-Lacroix and Zwald (2016) found that the adaptive elastic net performed competitively under response contamination, and the adaptive lasso outperformed all other methods, including their proposed "reverse Huber" regularization penalty. The adaptive lasso, in its strong performance, approached 0% FPR. Consistent findings of FPRs above 10%, and often surpassing 30% or 40%, for many methods even under normality render these results noteworthy. The adaptive Huber lasso also produced extreme FPRs as high as 70% under the combination of response contamination, non-sparsity, and collinearity at the smallest sample size of n = 100, although
Deducing the adaptive lasso and elastic net’s perfor- mance under outlier contamination proves di cult from these studies alone due to the decisions made by the authors in devising their simulation scenarios. Two further studies of the adaptive LAD lasso merit consideration concerning inconsistent findings. Both separate predictor and response contamination but include collinearity in all scenarios. Whereas Alfons et al. (2013) found strong performance of the adaptive LAD lasso under combined predictor contamination andcollinearity(FPR = 18%),X.Wangetal.(2013)foundinferiorperformance(FPR = 56%). These findings are especially confounding given that Alfons et al. (2013)’s predictor contami- nation appears more severe; although they only contaminate 10% of observations, the mixture came from N(50,1). Meanwhile, X. Wang et al. (2013) studied a 20% normal mixture with N(3,1). Both studies also utilize similar covariate correlation structures. Consequently, it is di cult to reconcile these highly disparate findings. The adaptive LAD lasso and adaptive Huber lasso otherwise appear to perform relatively competitively under various outlier-related data scenarios, including against M-estimator-based methods. Li et al. (2020) provide a useful evaluation of these methods’ performance. Their simulations include predictor and response contamination in isolation and combined, both with and without collinearity. Under normality and without outlier contamination, the adaptive lasso, adaptive LAD lasso, adaptive Huber lasso, and adaptive Tukey-M lasso all had 0% FNR andcomparableFPR’saround25%. Thesefourmethodsperformedsimilarlywith10%response contamination by N(0,100). Li et al. (2020)’s study also found that leverage point contamination had negative impacts on FPR performance of both the adaptive LAD and adaptive Huber lassos, consistent with findings from other studies of predictor contamination such as Alfons et al. (2013) and X. Wang et al. (2013). However, the two methods still performed competitively, with FPR’s of 36% and 34% for the adaptive LAD and adaptive Huber lasso, respectively. These compare to 45% and 26%for thestandard lassoand adaptiveTukey-M lasso’s, respectively. Although theTukey-M’s performancedeterioratedslightlyto28%undercombinedpredictorandresponsecontamination, theadaptiveLAD andadaptiveHuber lassoimprovedslightly toFPRof 32%, and thestandard lasso to 33% 1 . No other study of the adaptive LAD lasso or adaptive Huber lasso provides a separate look of predictor and response contamination, combined contamination, and in the absence of collinearity, so there is little else against which to compare these findings. Jung et al. (2016)’s proposal study for the outlier-shifting lasso procedures examined the performanceofthestandardlasso, thenon-adaptiveHuberlasso, OSlasso, andOSHuberlasso. 1 Although the standard lasso’s competitive performance in this case also corresponded with a 7% FNR, compared to 0% for the other three methods. 34 Their simulation conditions included collinearity and 10%, 20%, and 30% response contamina- tion by both N(0,6) and N(0,10), although they did not include FPR and FNR performance for uncontaminated data. The FPR of all methods never fell below 44% and went as high as 63% under the worst conditions for the OS Huber lasso. FNR was at or approached 0% for all methods and scenarios. Taken together, we can likely expect the adaptive LAD lasso and adaptive Huber lasso to perform competitively in outlier robustness. 
However, both are likely to be negatively impacted by outliers in the predictor space. The standard and adaptive lasso formulations will likely underperform in this regard. The OS and OS Huber lasso are likely to have high FPR at least under response contamination, although how this will compare to the other methods is unclear. No study yet exists of the outlier robustness of the MS adaptive elastic net, adaptive LAD elastic net, or adaptive Huber elastic net. 3.2 Methods and Design 3.2.1 Models and Software Implementation The current simulations evaluate the variable selection abilities of the following methods. Sec- tion 3.2.2 includes further details about cross-validation procedures that are non-specific to method or implementation. Appendix B.4 provides the code used to generate simulated pre- dictor and response data. Section 3.2.4 and3.2.5 outline the procedure used to generate this data. Examplesofcodeformethodswithandwithouttheadaptivelassotuninghyperparameter can be found in Appendices B.5 and B.6,respectively. • Standard Lasso and Elastic Net The standard lasso and elastic net were included as a benchmark against which to compare the other methods. Both are implemented using the cv.glmnet function from the glmnet package (Friedman et al., 2019). • Adaptive Lasso and Elastic Net The weights for the adaptive lasso tuning hyperparam- eter were applied using the “penalty.factor” argument from the cv.glmnet function from the glmnet package (Friedman et al., 2019). This argument only applies the weights for a given weighting hyperparameter , and does not conduct cross-validation to select . This process is addressed in Section 3.2.2. • Multi-Step Adaptive Elastic Net The multi-step adaptive elastic net was conducted us- ing the msaenet function from the msaenet package (Xiao and Xu, 2019). This function includes a built-in procedure for calculating initial coe cient estimates ˆ for subsequent weighting during the adaptive lasso process. The initial ridge estimates and subsequent weight application were thus all implemented within this function rather than through running a preliminary model, although cross-validation still needed to be incorporated for 35 selecting optimal values for the weighting hyperparameter 2 . The original study does not outline the number of stages used in their simulations, nor a recommended number of stages. Therefore, the number of stages k was arbitrarily set to 10 3 . •Adaptive LAD Lasso and Elastic Net The adaptive LAD lasso and elastic net were im- plemented using the cv.hqreg function from the hqreg package (Yi, 2017)bysettingthe “method” argument to “quantile” and the corresponding hyperparameter “tau” argument to 0.5 for the LAD lasso criterion. This function includes a similar “penalty.factor” ar- gument to cv.glmnet, which was used to apply weights during the adaptive lasso process. An additional cross-validation procedure needed to be incorporated to determine optimal weightinghyperparameter 4 . Avariablescreeningprocedureforcomputationaloptimiza- tion developed by Tibshirani et al. (2012) was incorporated into the function by setting the “screen” argument to “SR.” The authors also developed a new, less computationally- intensive screening rule, which is less conservative as a result and can be found detailed in their study. The original, more conservative screening rule developed by Tibshirani et al. (2012)waschosenforthisstudy. 
Although the choice to use a screening rule (as opposed to none) was made at the recommendation of the developers of the hqreg package, the choice of the particular rule was arbitrary on my part given the lack of specific recommendations.

• Adaptive Huberized Lasso and Elastic Net. The adaptive Huberized lasso and elastic net were implemented using the cv.hqreg function from the hqreg package (Yi, 2017) by setting the "method" argument to "huber" and the corresponding transition hyperparameter "gamma" argument to 1.345, per the recommendation of Huber (1981), to balance the robustness of the Huber loss with efficiency under ideal data conditions. The "penalty.factor" argument was again used to apply weights during the adaptive lasso process. An additional cross-validation procedure needed to be incorporated to determine the optimal weighting hyperparameter γ (see Section 3.2.2). The same screening rule was used as described for the adaptive LAD lasso and elastic net.

• Outlier-Shifted Lasso. The outlier-shifted lasso was implemented using R code adapted from code generously provided by Dr. Yoonsuh Jung, one of the authors of Jung et al. (2016). Changes to the code needed to be made to correct issues with internal cross-validation procedures and internal object references. Furthermore, although an internal procedure calculates the case-specific outlier-shifting parameters, the associated tuning hyperparameter λ_os required selection. The optimal value for λ_os was chosen via an additional cross-validation procedure, conducted in a fashion similar to that described in Section 3.2.2. No other choices or alterations were made in the application of the custom code. The R code used to simulate the OS lasso model, including the adapted custom OS lasso code, can be found in Appendix B.7.

• Outlier-Shifted Huberized Lasso. The outlier-shifted Huber lasso was implemented using a custom R function adapted from code generously provided by Dr. Yoonsuh Jung, one of the authors of Jung et al. (2016). Changes to the custom function needed to be made to correct issues with internal cross-validation procedures and internal object references. As opposed to the standalone OS implementation, potential values for λ_os were chosen from the same grid of potential values as the regularization tuning hyperparameters λ, and no other cross-validation procedures needed to be incorporated to run the simulated models. No other choices or alterations were made in the application of the custom code. The R code used to simulate the OS Huber lasso model, including the adaptation of the custom R function, can be found in Appendix B.8.

3.2.2 Hyperparameter Selection

For all instances of the elastic net, the balancing hyperparameter α was set to 0.5. Values of .75 and .9 were also considered, but discarded due to a lack of meaningful differences on the primary performance metrics in initial simulations (the potential limitation of this choice is given further consideration in the general discussion in Chapter 7).

It is standard to select tuning hyperparameters via a cross-validation procedure rather than running models such as the lasso or ridge regression with a pre-determined value. No standard exists for the cross-validation procedure itself, so these simulations use the following procedure. Regularization tuning hyperparameters λ were chosen from 100 logarithmically equidistant values between 0.01 and 1400 by 5-fold cross-validation (multiple studies cite Lambert-Lacroix and Zwald (2011) as the original source for this recommendation, although Lambert-Lacroix and Zwald (2011) do not provide a rationale for the choice). A sketch of this grid construction is given below.
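For concreteness, the following sketch shows one way such a grid might be constructed and supplied to a cross-validated fit. The exact simulation code lives in the appendices, so the data objects here should be read as illustrative assumptions rather than the verbatim implementation.

```r
library(glmnet)

# 100 logarithmically equidistant candidate values between 0.01 and 1400,
# supplied in decreasing order as glmnet prefers
lambda_grid <- exp(seq(log(1400), log(0.01), length.out = 100))

# Example: supplying the grid to a 5-fold cross-validated lasso fit
# (x and y stand in for a predictor matrix and response vector)
set.seed(1)
x <- matrix(rnorm(100 * 8), 100, 8)
y <- x %*% c(0.5, 1.0, 1.5, 2.0, rep(0, 4)) + rnorm(100)
cv_fit <- cv.glmnet(x, y, alpha = 1, lambda = lambda_grid, nfolds = 5)
cv_fit$lambda.min  # lambda minimizing mean cross-validated prediction error
```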
When utilizing the adaptive lasso tuning hyperparameter, a preliminary 5-fold cross-validated ridge regression was conducted using the same 100 lambda values, and the resulting coefficients were used to determine the weights vector. Unless otherwise specified, the initial ridge step was performed using the cv.glmnet function in the glmnet package (Friedman et al., 2019); this initial step was conducted internally for the multi-step adaptive elastic net, although arguments were specified to implement the same procedure using the same sequence of potential λ₂ values. A subsequent 5-fold cross-validation procedure was conducted over 100 potential values of the scaling hyperparameter γ chosen from the logarithmic sequence described previously. The literature does not provide a standard selection procedure for this hyperparameter, nor does it make any recommendations for the sequence of potential values to use in cross-validation.

The criterion for selecting optimal hyperparameters was mean prediction error in the cross-validation test sets (note that these are different from the test sets used to generate one of the performance metrics, mean squared error, discussed in the next section). This metric was the only one available in all software implementations of the included methods and adaptations; the potential limitation of this choice is given further consideration in the general discussion in Chapter 7.

3.2.3 Performance Metrics

The simulations used two primary metrics to evaluate the variable selection characteristics of each method and adaptation:

• False-Positive Rate (FPR): the proportion of true zero coefficients incorrectly estimated to be non-zero
• False-Negative Rate (FNR): the proportion of true non-zero coefficients incorrectly estimated to be zero

These metrics were chosen for their correspondence with the robust hypothesis-testing statistical framework. The rate of eliminating true non-zero predictors is analogous to the complement of power (i.e., the probability of failing to detect a relationship that does exist) in the absence of explicit hypothesis tests. The rate of selection of true zero coefficients similarly equates to the Type I error rate (i.e., the probability of rejecting a true null hypothesis of no relationship).

Secondary performance metrics included accuracy in both coefficient estimation and prediction. Coefficient estimate precision, henceforth precision, was calculated to explore how accurately each method estimated the true non-zero coefficients:

$$\mathrm{Precision}(\hat{\beta}) = \frac{\sum_{j=1}^{p_{nonzero}} (\hat{\beta}_j - \beta_j)^2}{p_{nonzero}}. \qquad (3.1)$$

Note that $p_{nonzero} = 4$ in all cases in this study, as all simulations featured non-zero values only in the first four elements of the coefficient vector β. Although developed independently for the purposes of this paper, the similarity of this metric to another measure of precision used by Kurnaz et al. (2018) should be noted. However, their version does not account for the number of true predictors, and they take the square root of the deviation in coefficients.

The test-set root mean squared error, henceforth RMSE, is calculated as follows:

$$\mathrm{RMSE} = \sqrt{ \frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n} }, \qquad (3.2)$$

where $\hat{y}_i$ is the model-predicted outcome value for the ith observation. A small sketch of these calculations follows below.
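The following minimal R sketch (an illustration, not the Appendix B.9 code) shows how these four metrics might be computed from a true coefficient vector, an estimated coefficient vector, and held-out test-set predictions; the numeric values are placeholders.

```r
# Hypothetical true and estimated coefficient vectors (p = 8, 4 true non-zeros)
beta_true <- c(0.5, 1.0, 1.5, 2.0, rep(0, 4))
beta_hat  <- c(0.4, 0.0, 1.6, 2.1, 0.3, 0.0, 0.0, 0.0)

nonzero <- beta_true != 0

# False-positive rate: true zeros estimated as non-zero
fpr <- mean(beta_hat[!nonzero] != 0)
# False-negative rate: true non-zeros estimated as zero
fnr <- mean(beta_hat[nonzero] == 0)

# Precision (3.1): mean squared deviation over the true non-zero coefficients
precision <- sum((beta_hat[nonzero] - beta_true[nonzero])^2) / sum(nonzero)

# RMSE (3.2) on a held-out test set, given predictions y_hat and outcomes y_test
y_test <- c(1.2, -0.5, 2.3)   # placeholder test outcomes
y_hat  <- c(1.0, -0.2, 2.5)   # placeholder model predictions
rmse <- sqrt(sum((y_hat - y_test)^2) / length(y_test))

c(FPR = fpr, FNR = fnr, precision = precision, RMSE = rmse)
```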
To calculate the RMSE, an additional 50% of the original sample size, rounded up, was generated for each dataset as a test set, using the same seed as the original dataset. FPR and FNR were calculated internally within each model-application function. Appendix B.9 provides example code for generating RMSE and precision.
3.2.4 Simulation Conditions: Low Dimensionality
As previously established when reviewing the robust lasso and elastic net literature, the method for generating outliers varies greatly across studies. The majority of studies, however, utilized mixture distributions for generating leverage points and response outliers, and all studies incorporated 10% mixtures. Therefore, the current simulations include 10% and 20% contamination to study further contamination in both the predictors and the response [14]. Following the methodology used by Turkmen and Ozturk (2016) in their outlier robustness simulation, data for the current simulations were generated from the following mixture distributions:

\epsilon \sim (1 - \eta_y)\,N(0, 1) + \eta_y\,N(2, 5)    (3.3)

and

x \sim (1 - \eta_x)\,N(0, \Sigma) + \eta_x\,N(10, \Sigma)    (3.4)

where \Sigma is the p \times p identity matrix, with 1's on the diagonal and 0's otherwise, and \eta_x and \eta_y are the amounts of contamination in the predictor and response space, discussed subsequently. The full linear model to be replicated by the analyses was then generated using the standard linear model,

y_i = x_i'\beta + \epsilon_i    (3.5)

with \beta_1 = 0.5, \beta_2 = 1.0, \beta_3 = 1.5, \beta_4 = 2.0, and all other coefficients equal to 0. The following features of the data were varied across conditions:
• p = number of potential predictors: two levels, 8 and 30
• n = sample size: four levels, 25, 50, 100, and 200
• η_x = amount of predictor contamination: x-space contamination varied across three levels, 0.0, 0.1, and 0.2, corresponding to 0%, 10%, and 20% contamination, respectively
• η_y = amount of response contamination: y-space contamination varied across three levels, 0.0, 0.1, and 0.2, corresponding to 0%, 10%, and 20% contamination, respectively
with each possible combination of feature levels represented, for a total of 72 data conditions. 500 iterations of each condition were simulated, and analyses were conducted on each iteration.
3.2.5 Simulation Conditions: High Dimensionality
To explore the impacts of dimensionality and outlier contamination on variable selection capabilities, additional data were generated from the same mixture and error distributions. The simulations fixed the sample size n at 200 and the number of potential predictors p at 1000. The same true coefficient vector was used as in the low-dimensionality setting: \beta_1 = 0.5, \beta_2 = 1.0, \beta_3 = 1.5, \beta_4 = 2.0, and all other coefficients equal to 0. A total of 9 high-dimensionality data scenarios were therefore simulated for studying outlier contamination, representing the 9 combinations of predictor contamination [15] and response contamination [16]. 500 iterations of each of these 9 data conditions were simulated, and analyses were conducted on each iteration.
Footnotes: [14] Three studies used 30% contamination, one of which also included 40% contamination. Although higher contamination levels are worth considering, we limit contamination to these two levels, and 0%, to ensure timely completion of simulations. This potential limitation is considered further in the general discussion. [15] Three levels: 0%, 10%, and 20%. [16] Three levels: 0%, 10%, and 20%.
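The following is a minimal sketch of one simulated dataset following Equations (3.3)-(3.5). It is not the dissertation's generating code: the function name, the row-wise mixture draws, and the reading of the second argument of N() as a variance are all assumptions.

gen_data <- function(n, p, eta_x, eta_y,
                     beta = c(0.5, 1.0, 1.5, 2.0, rep(0, p - 4)), seed = 1) {
  set.seed(seed)
  # Predictors: (1 - eta_x) N(0, I) + eta_x N(10, I), with Sigma the identity matrix
  bad_x <- rbinom(n, 1, eta_x)
  X <- matrix(rnorm(n * p, mean = 0, sd = 1), n, p)
  X[bad_x == 1, ] <- matrix(rnorm(sum(bad_x) * p, mean = 10, sd = 1), sum(bad_x), p)
  # Errors: (1 - eta_y) N(0, 1) + eta_y N(2, 5)
  bad_y <- rbinom(n, 1, eta_y)
  eps <- ifelse(bad_y == 1, rnorm(n, mean = 2, sd = sqrt(5)), rnorm(n, mean = 0, sd = 1))
  y <- drop(X %*% beta) + eps
  list(X = X, y = y)
}

dat <- gen_data(n = 100, p = 8, eta_x = 0.1, eta_y = 0.2)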
3.3 Results
Except for some high-dimensionality data conditions for the adaptive LAD elastic net, all simulations were conducted in RStudio on a 3.80 GHz AMD Ryzen Threadripper processor with 64 GB of RAM [17].
3.3.1 Results: Low Dimensionality
A note on the plots presented: performance metric data for the low-dimensionality settings are displayed as line trends over sample size, while metric data for high dimensionality are displayed as boxplots, since sample size did not vary. One further difference is noted: the trend lines in the low-dimensionality plots display 20% trimmed means, while the boxplots in the high-dimensionality setting display the median as the central tendency of each box. The median was impractical in the low-dimensionality setting, as the resulting trends overlapped so precisely that some methods' trajectories were not visible in multiple plot windows. This issue arose particularly for FPR at p = 8 and FNR at any level of p.
The general organization of each plot window is as follows. Each window contains nine plots corresponding to the combined predictor/response contamination conditions, since predictor and response contamination are best considered in tandem. Predictor contamination η_x increases along the vertical axis, from top to bottom, while response contamination η_y increases along the horizontal axis. The top-left-most plot therefore contains the given metric for η_x = η_y = 0%, and the bottom-right-most plot contains the given metric for the most extreme contamination levels, η_x = η_y = 20%.
3.3.1.1 p = 8
[Figure 3.1: FPR: Outliers, Low Dimensionality, p = 8. Nine-panel grid of 20%-trimmed-mean FPR (y-axis) versus sample size n = 38, 75, 150, 300 (x-axis); rows correspond to η_x = 0%, 10%, 20% and columns to η_y = 0%, 10%, 20%. Methods plotted: adaelnet5, adalasso, elnet5, huberelnet5, huberlasso, ladelnet5, ladlasso, lasso, msadaelnet5, oshuberlasso, oslassoplus.]
Figure 3.1 shows the performance of each method across the various contamination scenarios with p = 8 potential predictors, with sample size n on the x-axes and average [18] FPR on the y-axis. Two observations are immediately evident. Three methods always produced the highest rates of false positives: the standard elastic net (in royal blue), the OS Huber lasso (in lighter blue), and the standard lasso (in bright green). The OS lasso (purple) also typically performed poorly in this regard; except under response contamination alone (second and third plots in the top row), its trend typically falls within the elastic net/OS Huber lasso/standard lasso grouping. Even under response contamination alone, it is at best the worst of the "pack" in the middle. In numbers, the standard elastic net always approached or exceeded FPRs of 60%, sometimes surpassing 70% [19]. The OS Huber lasso performed similarly, hovering around 60% FPR, although it significantly outperformed the elastic net under 20% predictor contamination with either no or 10% response contamination. FPR for the OS Huber lasso typically fell between 50% and 60%, with occasional dips towards 40% under extreme predictor contamination alone (bottom-left-most plot).
Footnotes: [18] As indicated by the 20%-trimmed mean. [19] See the non-contaminated panel in the top left.
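The 20% trimmed mean used for these low-dimensionality trend lines (footnote [18]) is available directly in base R; the vector name below is hypothetical.

# One plotted point: the 20% trimmed mean of a condition's FPRs across iterations,
# dropping the lowest and highest 20% of values before averaging
trend_point <- mean(fpr_values, trim = 0.2)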
The standard lasso fared slightly better than the OS Huber lasso under response contamination alone, although its performance otherwise mirrored the OS lasso. The OS lasso performed best with 20% response contamination and no predictor contamination, with FPRs all below 40%, and between 39% and 46% with only 10% response contamination. The OS lasso otherwise tracked closely with the standard lasso, with FPRs between 40% and 60%.
The second immediate observation is the clear superiority of the MS adaptive elastic net (yellow line) in all cases concerning FPR. Its FPR only exceeds 20% under 20% response contamination alone, and only at the smallest sample size. Although it consistently approached 20% in false positives without predictor contamination, its FPR approached or fell below 10% with increasing predictor contamination η_x. Increasing response contamination for a given level of predictor contamination did seem to have a minor negative impact on FPR for the MS adaptive elastic net, though, as seen by the vertical shifts in the yellow line when moving from the middle-left and bottom-left plots rightward.
Under no predictor contamination, the adaptive Huber lasso (turquoise line), adaptive Huber elastic net (brown line), adaptive LAD lasso (black line), and adaptive LAD elastic net (pink line) behaved similarly, with FPRs between approximately 25% and 35% with no contamination and closer to 30% and 40% with 20% response contamination. Here we see that, under response contamination alone, the LAD methods behave more similarly to one another and underperform the Huber methods until larger sample sizes. All four adaptations follow similar trends when handling predictor contamination alone, with meaningful improvements over sample size n. Notably, although distinguishable at the smallest sample sizes under either contamination alone, the two lasso methods start trending together once n = 150, while the two elastic net methods trend together beginning at the smallest sample size.
Given combined predictor and response contamination, however, the elastic net versus lasso distinction becomes even clearer, although differences still arise between the LAD and Huber lassos at the smallest sample size. Consider the bottom-right-most plot, pertaining to the most extreme contamination scenario, η_x = η_y = 20%. The adaptive Huber elastic net and the adaptive LAD elastic net (brown and pink, respectively) start around FPRs of 45%, improve to approximately 35% around n = 150, and then begin to approach 40% at n = 300. The adaptive Huber lasso (turquoise) and adaptive LAD lasso (black), meanwhile, start around 30% FPR, dip slightly to approximately 25%, and then approach 30% again as n increases beyond 150.
Finally, the adaptive lasso and elastic net (red and green lines, respectively) performed competitively across scenarios and generally had FPRs in the middle of the models, even outperforming the adaptive Huber and adaptive LAD formulations under response contamination alone and outperforming the respective elastic net formulations given combined predictor-response contamination.
[Figure 3.2: FNR: Outliers, Low Dimensionality, p = 8. Same nine-panel layout and method legend as Figure 3.1, with FNR on the y-axis.]
Figure 3.2 displays the average FNR of each method across contamination conditions and sample sizes for p = 8, arranged similarly to Figure 3.1. Beginning with immediate observations, the OS Huber lasso was generally the least likely to eliminate true predictors, as seen by the light-blue line. The standard elastic net (royal blue), OS lasso (purple line), and standard lasso (light green line) also tended to perform well, although they often severely underperformed the OS Huber lasso at smaller sample sizes under no predictor contamination (top row of plots). The OS Huber lasso, at worst, showed an FNR of approximately 12% to 13% when presented with 20% response contamination at the smallest sample size, and its performance quickly approached 0% FNR with increasing sample size even in this scenario. The elastic net, lasso, and OS lasso all began with FNRs between 20% and 25% in this scenario and converged more slowly towards 0% to 2% FNR as sample size increased.
Whereas the MS adaptive elastic net vastly outperformed the other methods concerning FPR, it vastly underperformed in FNR. Except in the absence of response contamination, the MS adaptive elastic net always produced the highest FNR across sample sizes, and by a large margin. At worst [20], the method averaged around 30% FNR. Although the technique typically converged the slowest of all the adaptations, its performance did compare to the other methods' at the largest sample size. Notably, however, it still appeared to underperform in FNR by a few percentage points on average, even at the largest sample size.
The adaptive Hubers, adaptive LADs, and unmodified adaptive methods all showed similar and competitive trends in FNR, typically starting poorly at smaller sample sizes but improving to approximately 0% with increased sample size. This group converged more slowly than the OS lasso, standard lasso, and standard elastic net, but more quickly than the MS adaptive elastic net. Although the general trends all followed a similar shape [21], the different scales highlight that performance distinctly worsens with response contamination. This observation accords with the FPR results seen previously.
Footnotes: [20] With 20% response contamination and no predictor contamination. [21] Starting at a high peak and converging in sample size to almost 0%.
[Figure 3.3: RMSE: Outliers, Low Dimensionality, p = 8. Same nine-panel layout and method legend as Figure 3.1, with test-set RMSE on the y-axis.]
Except for the OS lasso, all of the techniques proved similarly effective in terms of test-set prediction accuracy for p = 8, as seen in Figure 3.3.
Under no predictor contamination, all methods produced nearly the same trend, showing a minor decrease over sample size. Once again, the deleterious impact of response contamination is observed: the average error lines increase from approximately 0.75 to 2.25 when going from η_y = 0% to η_y = 10%, with smaller but noteworthy increases to approximately 3 as η_y increases to 20%. Predictor contamination did not appear to negatively impact performance for a given response contamination percentage. The OS lasso produced the only exception, starting at a higher test-set RMSE [22] and converging more slowly towards the other trends.
Footnotes: [22] Between 4 and 5, compared to the average between 2.5 and 3.5 at the same sample sizes.
[Figure 3.4: Precision: Outliers, Low Dimensionality, p = 8. Same nine-panel layout and method legend as Figure 3.1, with coefficient estimate precision on the y-axis.]
Figure 3.4 displays average coefficient precision estimates for each method at each sample size and across the nine simulated contamination levels. The trends follow a pattern similar to that observed for FNR. Under no response contamination, regardless of predictor contamination percentage, the scale is so small [23] and the trends so tightly clustered that the methods all appear roughly competitive; the overlapping lines make distinguishing some methods difficult as a result. The methods become more distinct as contamination by response outliers increases. The standard lasso (light green), MS adaptive elastic net (yellow), OS lasso (purple), adaptive lasso and elastic net (red and green, respectively), and adaptive Huber elastic net (brown) all underperformed the other methods. On the other hand, the adaptive Huber elastic net improved when the predictors also contained outliers. On average, the OS Huber lasso most closely estimated the true coefficient values in all conditions with any response contamination. Both adaptive LAD methods and the adaptive Huber lasso performed competitively in coefficient estimation regardless of condition, and often approached the performance of the OS Huber lasso.
Footnotes: [23] Keeping in mind that the true coefficients are 0.5, 1, 1.5, and 2.
3.3.1.2 Null Models Produced

Method        n    p   η_x   η_y   Null Models
adaelnet5     38   8   0.0   0.2     4
adalasso      38   8   0.0   0.1     1
adalasso      38   8   0.0   0.2     4
elnet5        38   8   0.0   0.1     8
elnet5        38   8   0.0   0.2    22
elnet5        75   8   0.0   0.2     2
lasso         38   8   0.0   0.1     8
lasso         38   8   0.0   0.2    27
lasso         75   8   0.0   0.2     3
msadaelnet5   38   8   0.0   0.1    20
msadaelnet5   38   8   0.0   0.2    44
msadaelnet5   75   8   0.0   0.1     1
msadaelnet5   75   8   0.0   0.2     6
huberelnet5   38   8   0.0   0.2     6
huberlasso    38   8   0.0   0.2     6
ladlasso      38   8   0.0   0.2     3
oshuberlasso  38   8   0.0   0.1     1
oshuberlasso  38   8   0.0   0.2    13
oslassoplus   38   8   0.0   0.1     8
oslassoplus   38   8   0.0   0.2    41
oslassoplus   38   8   0.1   0.1     4
oslassoplus   38   8   0.1   0.2    10
oslassoplus   38   8   0.2   0.1     5
oslassoplus   38   8   0.2   0.2     7
oslassoplus   75   8   0.0   0.2     8
oslassoplus   75   8   0.1   0.1     6
oslassoplus   75   8   0.1   0.2     8
Tables 2-3: Null Models: Outliers, Low Dimensionality, p = 8
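A null model is a fit in which no variables were selected (see footnote [24] below). As a minimal sketch with hypothetical object names, where beta_hats is a list of one condition's estimated coefficient vectors (intercepts excluded) across its 500 iterations, the counts and rates quoted in the discussion could be obtained as follows.

# Flag fits in which no variables were selected and convert the count to a rate
null_flags <- vapply(beta_hats, function(b) all(b == 0), logical(1))
null_count <- sum(null_flags)
null_rate  <- null_count / length(beta_hats)   # e.g., 22/500 = 4.4% for the standard elastic net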
Briefly consider the number of null models produced by the applied methods for the various outlier conditions under potential parameter space p = 8. Tables 2 and 3 provide the number of null models [24] for each method and data scenario; scenarios that produced no null models for a given method are not included. First, with the exception of one method, null models were only produced under no predictor contamination; conversely, null models never arose in the absence of response contamination. Only the adaptive LAD elastic net always included at least one predictor, while the adaptive Hubers, the adaptive LAD lasso, and the adaptive elastic net only produced null models under η_x = 0%, η_y = 20%. The standard elastic net and standard lasso generated null models in approximately 4% and 5% of cases [25] under these conditions, while the MS adaptive elastic net produced null models in almost 9% of cases and the OS lasso in just above 8% of cases in the same conditions. The MS adaptive elastic net also had a null model rate of 4% with only 10% response contamination. The OS lasso was the only method to produce null models in the presence of predictor contamination.
Footnotes: [24] Aka, the number of models in which no variables were selected. [25] Percentage determined by dividing the corresponding number of null models by the number of iterations, 500.
3.3.1.3 p = 30
[Figure 3.5: FPR: Outliers, Low Dimensionality, p = 30. Same nine-panel layout and method legend as Figure 3.1, with FPR on the y-axis.]
Note the scale difference between Figure 3.5 and the corresponding plots for p = 8 seen in Figure 3.1.
In those scenarios, the FPR plot bounds [26] typically exceed 60%, with the OS Huber lasso and the standard elastic net both reaching FPRs of 70% under η_x = 0%, η_y = 20%. With p = 30 potential predictors, FPR exceeds 45% on only a handful of occasions, primarily for the standard elastic net (royal blue) or the OS Huber lasso (light blue). Most methods' performance improved with the greater number of potential predictors. Except at the smallest sample sizes, the OS Huber lasso and the standard elastic net once again had the highest FPRs in all scenarios, with best performance around 30% FPR for the OS Huber lasso [27] and the standard elastic net (η_x = 0%, η_y = 20%). Notably, both performed competitively with the adaptive lasso, adaptive elastic net, and the adaptive LADs at smaller sample sizes under response contamination alone; however, their FPRs diverged with increasing sample size.
Footnotes: [27] η_x = 20%, η_y = 0%.
The standard lasso showed competitive FPRs at smaller sample sizes regardless of data scenario, although as sample size increased its relative performance diminished compared to the improvements made by other methods over n. Additionally, although the addition of predictor contamination did not severely impact the lasso's performance, it improved FPR for many of the methods in the middle grouping of trends, resulting in worsening relative performance of the lasso at smaller sample sizes. The lasso performed best at eliminating true zero coefficients in the presence of response contamination alone, with FPRs between 20% and 28%, although it tended towards 25% to 30% FPR in the remaining scenarios.
The OS lasso outperformed all other methods in the presence of response contamination alone, with average FPRs between 11% and 17% under η_y = 10% and between 10% and 14% under η_y = 20%. As predictor contamination increased, however, its performance began to mirror the standard lasso's, with relatively competitive performance at small sample sizes that did not hold up against other methods as sample size increased. Whereas the standard lasso at worst showed a minor upward trend in FPR as sample size increased, the OS lasso's average FPR was more likely to trend upwards with sample size. The OS lasso's upward FPR trend is particularly clear under the worst contamination scenario, η_x = η_y = 20%, where it deteriorates to approximately 33% FPR at the largest sample size (compared to the lasso's 30%).
Meanwhile, the MS adaptive elastic net rarely underperformed other methods. Its average FPR never exceeded 10% given 20% predictor contamination and only exceeded 10% at smaller sample sizes under 10% predictor contamination. The technique performed at its worst with n = 38 and no predictor contamination, reaching approximately 20% FPR before decreasing to approximately 15% as n increased.
The adaptive lasso and adaptive elastic net (red and green, respectively) showed poor to competitive performance at smaller sample sizes that trended towards the middle of all trend lines as sample size increased. The average FPR for both methods at n = 38 ranged between approximately 30% and 33% [28] at best, and both approached 40% FPR at n = 38 when no outlier contamination was present.
The adaptive Huber elastic net and adaptive LAD elastic net (brown and pink, respectively) performed moderately poorly at smaller sample sizes in the absence of predictor contamination. Under 20% response contamination alone and at n = 38, the adaptive LAD elastic net was the worst performer, with an average FPR exceeding 35%.
Under combined predictor and response contamination, both methods' small-sample FPRs were at least 35%, if not approaching 40%. Notably, under predictor contamination alone, both methods rapidly improved with sample size, showing competitive FPRs between 7% and 15% at n = 300. Both methods trended towards 20% FPR [29] as sample size increased under the most extreme response contamination percentage combined with any predictor contamination.
Under no outlier contamination, the adaptive Huber lasso (turquoise) and adaptive LAD lasso (black) performed relatively competitively at smaller sample sizes, with FPRs of 27% and 32%, respectively [30]. Similar to their elastic net cousins, both rapidly improved with increased sample size and slightly outperformed the MS adaptive elastic net at n = 300, with average FPRs of approximately 13% to 14%. Under response contamination alone, both methods generally had middling to competitive performance and trended towards 20% and 23% FPR under η_y = 10% and η_y = 20%, respectively. In the presence of any amount of predictor contamination, however, both methods underperformed only the MS adaptive elastic net; the single exception is the adaptive LAD elastic net, which slightly outperformed both at n = 150 and n = 300 under 10% predictor contamination alone. With predictor contamination alone, both trended towards FPRs between 5% and 10%; increased response contamination resulted in FPRs approaching 15% to 20%. Although the adaptive LAD lasso tended to underperform the adaptive Huber lasso at smaller sample sizes by approximately 5%, the two methods converged in FPR as sample size increased. The adaptive LAD lasso slightly outperformed the adaptive Huber lasso in average FPR by a few percentage points given the most extreme contamination scenarios [31].
Footnotes: [28] See η_x = 20%, η_y = 10% and 20%. [29] Comparable to the adaptive lasso and adaptive elastic net. [30] Though this is distinctly worse than the MS adaptive elastic net's average FPR of approximately 20%. [31] See the three bottom-right-most plots, corresponding to η_x = 10%, η_y = 20%; η_x = 20%, η_y = 10%; and η_x = η_y = 20%.
[Figure 3.6: FNR: Outliers, Low Dimensionality, p = 30. Same nine-panel layout and method legend as Figure 3.1, with FNR on the y-axis.]
Whereas the FPR plots reduced significantly in scale from p = 8 to p = 30, the opposite is true for FNR, as seen in Figure 3.6. FNR overall was much higher across methods for all data scenarios, particularly under response contamination, nearing 50% for the OS lasso and just over 40% for the standard lasso. The standard lasso (bright green) and OS lasso (purple) both performed the worst in FNR in the absence of predictor outliers, and their performance worsened as response contamination increased. Increases in predictor contamination did produce a positive impact on the lasso's performance: the standard lasso typically performed competitively in FNR, if not among the strongest performers, under predictor contamination alone.
At 20% predictor contamination and no response contamination, the standard lasso's average FNR never exceeded 3% and rapidly decreased to 0% alongside the OS Huber lasso (light blue) and standard elastic net (royal blue). Meanwhile, the OS lasso did not benefit from increased predictor contamination to the same degree; at smaller sample sizes in particular, it still performed among the poorest in FNR. Its FNR did improve with sample size in all scenarios featuring predictor contamination, and it even competed with the elastic net, the lasso, and the OS Huber lasso once sample size increased past n = 38.
The adaptive lasso and adaptive elastic net (red and green, respectively) showed average FNRs in the middle of the trends in the absence of response contamination. In the presence of any response contamination, however, they typically performed near last alongside the MS adaptive elastic net once sample size increased past n = 38. The four scenarios involving combined contamination demonstrate this most clearly: their red and green trends fall between the MS adaptive elastic net and the remaining methods in these conditions. Although both methods' FNRs trended towards 0% as sample size increased, their FNRs also approached 30% under response contamination alone at n = 38.
Regardless of sample size, the MS adaptive elastic net performed poorly. Although it produced mediocre FNRs at the smallest sample size in the presence of response contamination alone, other methods quickly outperformed it as sample size increased; notably, though, its average FNR trend closely followed the average group of trends. On the other hand, though the MS adaptive elastic net generally underperformed the other methods under predictor contamination alone at n = 38, it improved the most rapidly with increased sample size. Even so, it still underperformed most other methods. Finally, combined outlier contamination resulted in the MS adaptive elastic net's worst relative performance, where its FNR trend always stood out from the other FNR trends. The MS adaptive elastic net's worst FNR values came with small sample sizes and response contamination alone. Like the other methods, the MS adaptive elastic net trended towards 0% FNR as sample size increased, although it remained slightly higher than the other methods under η_y = 20%.
A familiar pattern emerged regarding the adaptive LADs and Hubers. The elastic nets [32] trended similarly to one another and distinctly from the lassos [33], which also trended similarly. For the most part, the difference in performance between the elastic net and lasso formulations occurs at smaller sample sizes. Average FNRs of the lasso versions surpassed 25% under 20% response contamination alone at n = 38, although this decreased as predictor contamination increased. The adaptive Huber and adaptive LAD elastic nets, meanwhile, performed very strongly at small sample sizes given any amount of response contamination. However, they improved more slowly than other methods for n = 75 and n = 150. Their worst performance came with η_x = 0%, η_y = 20%, with both techniques' average FNRs between approximately 23% and 24%. Otherwise, their FNR typically fell below 20% and approached 10%, even at small sample sizes.
Footnotes: [32] Pink and brown for LAD and Huber, respectively. [33] Black and turquoise for LAD and Huber, respectively.
[Figure 3.7: RMSE: Outliers, Low Dimensionality, p = 30. Same nine-panel layout and method legend as Figure 3.1, with test-set RMSE on the y-axis.]
Similarly to p = 8, RMSE (Figure 3.7) was highly comparable across methods with p = 30. The scale is slightly different, although the larger prediction errors produced by the OS lasso (purple) might explain this change. The OS lasso underperforms in prediction accuracy at smaller sample sizes in the presence of any predictor contamination; under 20% predictor contamination at n = 38, the OS lasso's test-set prediction error was more than five times greater than the other methods'. As with p = 8, increasing response contamination generally increased test-set prediction error.
[Figure 3.8: Precision: Outliers, Low Dimensionality, p = 30. Same nine-panel layout and method legend as Figure 3.1, with coefficient estimate precision on the y-axis.]
The scale changed relative to p = 8, as coefficient estimate precision in Figure 3.8 generally deteriorated. Most methods performed similarly, although smaller sample sizes produced the most notable differences between methods, and the presence of any response contamination also contributed to more distinct precision trends. In the absence of response contamination [34], the OS lasso (purple) performed the worst at smaller samples, followed by the adaptive Huber elastic net (brown) and adaptive LAD elastic net (pink). The standard lasso (bright green) and the MS adaptive elastic net (yellow) provided the most accurate estimates at all sample sizes and outperformed the OS lasso by a factor of two or more at n = 38. The OS Huber lasso (light blue) performed relatively poorly at n = 38 under no contamination or with only 10% predictor contamination, although it was more competitive with 20% predictor contamination. The adaptive lasso and elastic net (red and green, respectively), the adaptive Huber lasso (turquoise), and the adaptive LAD lasso (black) all produced relatively average estimation performance amongst the various methods.
Increased response contamination resulted in changes in the relative performance of some adaptations. The OS lasso still performs poorly, especially at smaller sample sizes; however, its performance past n = 38 becomes more average as predictor contamination increases. The OS Huber lasso, adaptive LAD lasso, and adaptive Huber lasso all performed the strongest across sample sizes in the presence of any response contamination. Conversely, response contamination resulted in worsening performance by the standard lasso. The MS adaptive elastic net's performance deteriorated relative to the other methods, particularly under combined predictor and response contamination.
Footnotes: [34] The left column of plots.
Under response contamination alone, the adaptive LAD elastic net's and adaptive Huber elastic net's precision tracked with the strongest performers across sample sizes. Their relative performance is worst at smaller sample sizes and under combined contamination. The adaptive lasso and adaptive elastic net gave relatively average performance given response contamination alone, which deteriorated with the addition of predictor contamination.
3.3.1.4 Null Models Produced

Method        n     p    η_x   η_y   Null Models
adaelnet5     38    30   0.0   0.1     2
adaelnet5     38    30   0.0   0.2     4
adalasso      38    30   0.0   0.1     2
adalasso      38    30   0.0   0.2     4
elnet5        38    30   0.0   0.1    23
elnet5        38    30   0.0   0.2    55
elnet5        75    30   0.0   0.1     1
elnet5        75    30   0.0   0.2     5
lasso         38    30   0.0   0.1    26
lasso         38    30   0.0   0.2    74
lasso         75    30   0.0   0.1     1
lasso         75    30   0.0   0.2     5
msadaelnet5   38    30   0.0   0.0     1
msadaelnet5   38    30   0.0   0.1    39
msadaelnet5   38    30   0.0   0.2    71
msadaelnet5   75    30   0.0   0.2     9
huberelnet5   38    30   0.0   0.2     1
huberlasso    38    30   0.0   0.2     1
ladlasso      38    30   0.0   0.2     1
oshuberlasso  38    30   0.0   0.0     1
oshuberlasso  38    30   0.0   0.1    14
oshuberlasso  38    30   0.0   0.2    42
oslassoplus   38    30   0.0   0.1    42
oslassoplus   38    30   0.0   0.2   101
oslassoplus   38    30   0.1   0.0    11
oslassoplus   38    30   0.1   0.1    37
oslassoplus   38    30   0.1   0.2    24
oslassoplus   38    30   0.2   0.0     7
oslassoplus   38    30   0.2   0.1    17
oslassoplus   38    30   0.2   0.2    33
oslassoplus   75    30   0.0   0.0     1
oslassoplus   75    30   0.0   0.1     1
oslassoplus   75    30   0.0   0.2     8
oslassoplus   75    30   0.1   0.0     1
oslassoplus   75    30   0.1   0.1    12
oslassoplus   75    30   0.1   0.2    12
oslassoplus   75    30   0.2   0.1     1
oslassoplus   75    30   0.2   0.2     1
oslassoplus   150   30   0.1   0.1     1
Tables 4-7: Null Models: Outliers, Low Dimensionality, p = 30
Tables 4-7 display the null model results for p = 30 potential predictors. On average, most methods produced more null models with p = 30 than with p = 8. No null models arose under η_x = η_y = 0% with p = 8 potential predictors; with p = 30, however, both the OS Huber lasso and the MS adaptive elastic net produced a single null model at n = 38, while the OS lasso produced a single null model at n = 75. The largest response contamination combined with no predictor contamination once again produced the most null models for most adaptations, especially at n = 38. The OS lasso performed the worst in terms of null model production, producing a sizable minority of null models at the smallest sample size. Similar to the other methods, its worst performance came with η_x = 0%, η_y = 20%, where it produced null models at a rate of more than 20%. The OS lasso continued to produce null models up to sample sizes of n = 150. The adaptive LADs and adaptive Hubers produced the fewest null models: both adaptive Hubers and the adaptive LAD lasso produced only a single null model under η_x = 0%, η_y = 20%, and the adaptive LAD elastic net did not produce a single null model under any condition.
3.3.2 Results: High Dimensionality
Although the plots below include the MS adaptive elastic net, its performance in the absence of predictor contamination is not reviewed or discussed. This method produced null models at a rate of nearly 100% under no predictor contamination, rendering any results from those data meaningless.
The middle line in each box represents the median value of the given metric; the upper bound of the box represents the upper quartile [35] of the metric values, while the lower bound represents the lower quartile [36]; and the whiskers denote the extent of non-outlying data points, as determined by a distance of 1.5 times the interquartile range (IQR) [37]. η_x increases moving down the plots, and η_y increases moving right along the plots.
Footnotes: [35] Otherwise known as the .75 quantile or the 75th percentile. [36] Otherwise known as the .25 quantile or the 25th percentile. [37] Calculated as the upper quartile minus the lower quartile, visually represented by the box itself.
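As a minimal sketch of the boxplot conventions just described, with a hypothetical vector metric_values (e.g., one method's FPRs across a condition's iterations):

q1  <- quantile(metric_values, 0.25)   # lower quartile: bottom of the box
q3  <- quantile(metric_values, 0.75)   # upper quartile: top of the box
iqr <- q3 - q1                         # interquartile range: the box height
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
# Points beyond the fences are drawn as outliers; whiskers stop at the most
# extreme non-outlying values.
outliers <- metric_values[metric_values < lower_fence | metric_values > upper_fence]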
[Figure 3.9: FPR: Outliers, High Dimensionality. Nine panels of FPR boxplots by method; rows correspond to η_x = 0%, 10%, 20% and columns to η_y = 0%, 10%, 20%; same method legend as Figure 3.1.]
The adaptive lasso and elastic net (red and green, respectively) produced the most distinct FPR results in Figure 3.9. Their FPR dispersion is at least as large as, if not larger than, that of the other methods under all data conditions. Both methods also produced more outlying FPR values, suggesting a skewed sampling distribution. Although the median FPR of both methods was competitive in the absence of response contamination, their relative FPR performance worsened with increased η_y. The standard elastic net (royal blue) also consistently produced a wide spread of FPRs across data scenarios despite competitive median FPRs. The exception to this competitive median FPR is observed under increasing η_x in the absence of response contamination: although the difference is small, the standard elastic net did perform the worst in median FPR under the most extreme predictor contamination. The standard lasso (bright green) shows a relatively long spread of outlying FPRs across data scenarios despite competitive median FPRs. The OS Huber lasso (light blue) and OS lasso (purple) were also widely dispersed, especially in the absence of predictor contamination. Similar to those two methods, they also typically performed competitively in median FPR and improved as response contamination increased.
The adaptive Huber lasso (turquoise) and adaptive Huber elastic net (brown) both performed competitively in the presence of predictor contamination. However, both became more dispersed with increasing response contamination, and their relative median FPR was poor under response contamination alone. Under response contamination alone or with only 10% predictor contamination, the adaptive Huber elastic net produced no outlying FPRs by the boxplot rule; the adaptive Huber lasso likewise produced no outlying FPRs by the boxplot rule under three conditions [38]. Except for the unmodified adaptive variants and the adaptive LAD variants, these two methods tended to have the longest whiskers of all methods.
The adaptive LAD elastic net (pink) performed relatively poorly under all conditions lacking predictor contamination, with median FPRs higher than all other methods. Although the adaptive LAD lasso (black) did not perform poorly under these conditions, its median FPR was still among the poorest under response contamination alone; it produced average performance under no contamination in either the response or predictor space. Both adaptive LAD variants showed very wide sampling distributions in the absence of predictor contamination, and the absence of any contamination produced a large number of outlying FPRs for the adaptive LAD lasso. Both adaptive LAD variants performed more competitively in the presence of predictor contamination, although the adaptive LAD lasso again produced a number of outlying observations above the upper whiskers. The skew of the adaptive LAD lasso's FPR improved under combined predictor and response contamination, and the elastic net variant was much less dispersed under predictor contamination alone.
Footnotes: [38] η_x = η_y = 10% and 20%, and η_x = 20%, η_y = 10%.
The MS adaptive elastic net (yellow) performed the best in all of these scenarios, with median FPR near 0% and very low dispersion. Overall, all methods produced low median FPRs that typically fell below 10%. The FPRs of the adaptive lasso and adaptive elastic net exceeded 10% in the presence of response contamination but never exceeded 15%, and the adaptive LAD variants both exceeded FPRs of 10% only under response contamination alone.
[Figure 3.10: FNR: Outliers, High Dimensionality. Nine panels of FNR boxplots by method; same layout and method legend as Figure 3.9.]
Figure 3.10 displays FNR performance for the outlier simulations under high dimensionality. Except under the highest response contamination, and then for only a few methods, median FNR was 0% across scenarios. Furthermore, FNR never exceeded 25% except under 20% outlier contamination in the responses. Most methods occasionally eliminated a single true predictor from the model; however, the standard elastic net (royal blue), the standard lasso (bright green), the MS adaptive elastic net (yellow), the OS lasso (purple), and the OS Huber lasso (light blue) always selected all true predictors into the model in certain scenarios. Increasing response contamination resulted in more non-zero FNRs for some adaptations, with FNRs reaching 50%. Under 20% response contamination alone, the upper quartiles of the adaptive elastic net, adaptive lasso, standard elastic net, standard lasso, and OS lasso all extend to an FNR of 25%. The adaptive lasso, standard lasso, and OS lasso also produced FNRs of 50% under certain conditions. The adaptive Huber lasso (turquoise), adaptive Huber elastic net (brown), adaptive LAD lasso (black), and adaptive LAD elastic net (pink) all performed comparably across conditions, with median FNRs of 0% and outlying FNRs of 25% observed in all cases. The MS adaptive elastic net performed the worst under response contamination: under the most severe contamination scenario, η_x = η_y = 20%, its whisker extends to 0.5 and its median FNR is 25%.
[Figure 3.11: RMSE: Outliers, High Dimensionality. Nine panels of test-set RMSE boxplots by method; same layout and method legend as Figure 3.9.]
The median predictive performance (Figure 3.11) of the various methods under high dimensionality is competitive in most data scenarios, and test-set RMSE is similarly dispersed in most cases. However, the adaptive lasso (red) and adaptive elastic net (green) showed wider dispersion and larger outlying RMSEs in some scenarios, particularly in the absence of response contamination.
The adaptive Huber elastic net (brown) and the adaptive LAD elastic net (pink) produced larger median prediction errors than all other methods under predictor contamination alone. The adaptive LAD elastic net also showed slightly larger median RMSE and wider dispersion across conditions. The OS lasso (purple) produced a large prediction error under 20% predictor contamination and 10% response contamination. Overall, prediction error increased with increasing response contamination. Predictor contamination did not significantly impact most methods' prediction accuracy, although increasing η_x resulted in slight reductions depending on the method.
[Figure 3.12: Precision: Outliers, High Dimensionality. Nine panels of coefficient estimate precision boxplots by method; same layout and method legend as Figure 3.9.]
Figure 3.12 displays coefficient precision for the high-dimensionality outlier simulations. Performance generally deteriorated with increased response contamination and improved with increased predictor contamination. The adaptive Huber elastic net (brown) and adaptive LAD elastic net (pink), however, deteriorated in precision with increases in η_x, and their dispersion increased with predictor contamination. Both methods performed comparably to the other adaptations in the absence of predictor contamination.
The adaptive Huber lasso (turquoise) and adaptive LAD lasso (black) most closely estimated the true coefficients. In the presence of any response contamination, both adaptations outperformed the other methods and produced less dispersion. Both methods also performed competitively in the absence of response contamination.
The adaptive lasso (red) and adaptive elastic net (green) produced wider distributions of observed precision results, except in comparison to the adaptive Huber elastic net in some scenarios. Although more dispersed than other methods, the two adaptations performed competitively in median precision given no response contamination, and they were also less dispersed than the OS lasso (purple) under the most extreme predictor contamination combined with the absence of response contamination. The presence of any response contamination, however, produced poor precision results.
The standard elastic net (royal blue) had middling performance under predictor contamination alone, but performed worst in median precision under no contamination and competed for worst performance given any amount of response contamination. The elastic net did, however, produce a narrower distribution than the other poorly performing methods [39] in those scenarios.
The standard lasso (bright green) displayed relatively average coefficient estimation performance across data scenarios, with its strongest relative performance in the absence of response contamination. The OS Huber lasso (light blue) performed competitively in all response-contamination scenarios, although it underperformed in the absence of response contamination. The OS lasso performed similarly when no response contamination was present and produced competitive median precision under combined contamination.
Footnotes: [39] Adaptive Huber elastic net, adaptive LAD elastic net, adaptive lasso, and adaptive elastic net.
However, only the adaptive Huber elastic net and adaptive LAD elastic net produced wider distributions under predictor contamination alone. The MS adaptive elastic net (yellow) performed competitively with the lasso and the two OS lasso methods across the data scenarios with sufficient non-null data, although it produced larger dispersion with 20% predictor contamination. The general effect of increased predictor contamination is unclear given the lack of sufficient data without any predictor contamination.
3.3.2.1 Null Models Produced
Consider the production of null models in high dimensionality for the 9 outlier contamination scenarios, displayed in Table 8. In all three scenarios lacking any predictor contamination, the null model rate approached 100% for the MS adaptive elastic net. This observation likely relates to concerns expressed in reviewing this method in Chapter 2: the additional adaptive estimation steps might be overcompensating for false positives and consequently eliminating too many potential predictors from the models. 10% predictor contamination still produced a moderate range of null models [40], while the null model rate for the MS adaptive elastic net fell to single digits in the presence of 20% predictor contamination. No other methods produced null models in the high-dimensionality outlier simulations.

Method        n     p      η_x   η_y   Null Models
msadaelnet5   300   1000   0.0   0.0   490
msadaelnet5   300   1000   0.0   0.1   487
msadaelnet5   300   1000   0.0   0.2   486
msadaelnet5   300   1000   0.1   0.0   123
msadaelnet5   300   1000   0.1   0.1   132
msadaelnet5   300   1000   0.1   0.2   154
msadaelnet5   300   1000   0.2   0.0     4
msadaelnet5   300   1000   0.2   0.1    15
msadaelnet5   300   1000   0.2   0.2    29
Table 8: Null Models: Outliers, High Dimensionality
Footnotes: [40] Between 25% and 31%.
3.4 Discussion
Particularly for the methods that performed the strongest or weakest, lower rates of false positives typically corresponded with higher rates of false negatives, and vice versa. This general observation is consistent with the hypothesis-testing framing of variable selection: all other characteristics held constant, the Type I error rate [41] and the Type II error rate [42] for a procedure are inversely proportional. Low-dimensionality scenarios provide the clearest examples of this effect. The MS adaptive elastic net consistently produced the lowest FPRs and the highest FNRs, while the standard elastic net and the OS Huber lasso produced the opposite findings, typically underperforming concerning FPR while outperforming other adaptations in FNR. The general discussion in Chapter 7 returns to this observation, as it arises throughout the subsequent simulations conducted for this course of study.
Response outlier contamination consistently had negative impacts on performance. The increases in scale across plots for any given metric besides FPR demonstrate this impact: regardless of p or dimensionality, the scale of each metric generally increases for a given 10% increase in response contamination. Although the scale does not increase for FNR when increasing η_y from 0% to 10% under high dimensionality, most methods produced wider FNR distributions over this increased contamination. Although performance generally deteriorates with increases in response contamination, the shape of all precision and FNR trends over sample size is universal across methods under low dimensionality.
All adaptations performed the worst at the smallest sample size and decreased towards a comparable value with increased sample size [43]. Increasing response contamination reduces the benefits of increased sample size. No consistent pattern emerges across methods concerning FPR performance trends. The heterogeneous impact of outlier contamination on FPR can be seen most clearly in Figure 3.1 and Figure 3.5. Consider the OS lasso: under response contamination alone (i.e., no predictor contamination) in the low-dimensionality conditions, the OS lasso properly eliminated more true zero predictors than in comparable scenarios that included neither predictor nor response contamination.
Most methods see reduced performance under increased response contamination, regardless of p or dimensionality. The trends of the adaptive Huber and adaptive LAD methods, and especially the former under low dimensionality, display another, less common negative effect of response contamination. For instance, Figure 3.5 shows the average FPR trends for p = 30 potential predictors. Under no outlier contamination in either the predictors or the response, the adaptive Huber lasso incorrectly selects true zero coefficients into the model approximately 28% to 29% of the time at n = 38. This improves dramatically with sample size, with an average FPR of less than 14% by n = 300. Although the average FPR is still approximately the same at n = 38, the overall improvement over n is much less drastic in the presence of response contamination: for both η_y = 10% and η_y = 20%, the adaptive Huber lasso's average FPR is reduced by less than 5% from n = 38 to n = 300.
Footnotes: [41] AKA, the rate at which non-predictors are included in the model, or FPR. [42] AKA, the rate at which true predictors are eliminated from the model, or FNR. [43] Between approximately 0% and 5% FNR, or coefficient precision of magnitudes between 0 and 0.05.
Unfortunately, RMSE did not appear to separate the performance of any of the methods. Even after checking the generating code, it is unclear why the RMSE was so similar across methods. In higher dimensions, the boxplots show that the dispersion of observed prediction error varies somewhat by method, but the adaptations still do not meaningfully distinguish themselves. This similarity might reflect one of a few underlying processes. First, the current dataset might be conducive to a minimal range of prediction error, regardless of method. Another potential explanation is that these methods are roughly equally capable in terms of prediction; they might generally show similar prediction capacity within a broader range, which combines with characteristics of this particular simulated dataset to further reduce variability in prediction. One final potential explanation is an issue in the code that generates the test-set RMSE itself. The author cannot rule out that this contributed to the observed RMSE results; however, no obvious code issues presented themselves upon multiple inspections. Furthermore, the somewhat small variability under high dimensionality, and the OS lasso's distinct performance at smaller sample sizes, suggest that the code does produce variability. The precision estimates, which follow a similar mathematical formula to RMSE, showed noticeable variability across methods. This confusing finding merits exploration in future research.
Findings in the current simulations contradict previous observations regarding the impact of leverage points [44] on model performance. Where studied previously, leverage points consistently reduced model performance.
On the other hand, leverage points proved protective of performance more often than they reduced performance in these simulations. The most logical explanation relates to the generation of predictor outliers for the current simulations. Regardless of contamination scenario, final predictor values always preceded the generation of corresponding response values. This generation sequence results in response values generated from corrupted predictor values in the presence of predictor contamination, rather than responses being generated before the contamination of predictors. The former case produces Y values consistent with the association between uncontaminated X and Y values. However, Y arises from uncontaminated predictors in the latter case; therefore, the association between predictors and response should be diminished for the subsequently contaminated predictors. Another way to think about this is to consider "good" versus "bad" outliers. Let us briefly consider a plot and its generating code to visualize the issue.

# A simple 1-to-1 relationship between x and y over the range 0.1 to 1.0
x <- seq(from = 0.1, to = 1.0, by = 0.1)
y <- seq(from = 0.1, to = 1.0, by = 0.1)
# Append three outlying points: a leverage point (2, 0.5), a response outlier (0.5, 2),
# and a "good" outlier (2, 2) that preserves the 1-to-1 relationship
x.all <- c(x, 2, 0.5, 2)
y.all <- c(y, 0.5, 2, 2)
plot(y.all ~ x.all)

Consider a sequence of x and y values from 0.1 to 1.0 corresponding with a true 1-to-1 relationship. Consider two additional values: one that is larger in x and smaller in y, seen in the lower right-hand corner of Figure 3.13, and a similar value where y is larger than x, seen in the upper left-hand corner. Consider a final point generated with a consistent 1-to-1 relationship between x and y except with much larger values of 2 for both X and Y. All three additional values can be considered outliers, given that they fall outside the typical range of observed values. However, only two of these outliers are likely to reduce model performance for establishing the clear association observed in the original set of X-Y observations. The observation in the lower right-hand corner provides an example of a leverage point; its response falls within the typical range of Y while corresponding with an atypical predictor value. The response outlier in the upper left provides the opposite type of outlier, with X typical but Y atypical. The final outlying point in the upper right, though an outlier, likely increases the association of interest for an arbitrary regression model, given that it maintains the association between X and Y. The current simulations intended to study the impacts of the bottom-right and top-left outliers as examples of predictor and response outliers, respectively. Although response outliers in these simulations appropriately corresponded with the top-left outlier, predictor outliers instead corresponded with the model-supporting outlier seen in the top-right of the example plot. Given the desire of the current research to study robustness of model performance in the context of deleterious outliers, this data characteristic means that these simulations do not entirely study the outlier processes of interest. This limitation merits addressing in subsequent research under this simulation framework.

Chapter 1 mentions that an ideal robust tool performs well and consistently across possible data-generating mechanisms, given the improbability of a method which dominates other adaptations in all scenarios. The current simulations provide a glut of data characteristics, levels over which those characteristics vary, and effect heterogeneity of varying characteristic levels.
Concerning the desire for a method showing consistency, consistency across unobserved data characteristics is paramount. Accounting for these unobserved characteristics, an understanding of performance variability across observable or foreseeable characteristics 45 further refines the choice of adaptation. The most likely candidate for an ideal robust tool will not necessarily be a method that performs consistently across metrics and unobservable characteristics. A more likely candidate might display inconsistency in the simulations conducted, provided this inconsistency corresponds with a competitive worst performance across scenarios. If an inconsistent method performs competitively with other adaptations at its worst, it still provides an effective tool in that case and provides an even more effective tool under other scenarios.
45 E.g., sample size, number of potential parameters, etc.

[Figure 3.13: Different Types of Outliers (y.all plotted against x.all).]

3.4.1 Recommendations Under Low Dimensionality
The two primary characteristics of interest in these simulations are outlier contamination in predictor variables and response variables, denoted by η_x and η_y, respectively. Contamination varied across three levels for both spaces of contamination: 0%, 10%, and 20%. No method shows consistent performance across all metrics and conditions. Even within metrics and controlling for the observable characteristics, all methods show FPR variability of at least 5%-10% across contamination levels when holding n and p constant. Given this, significant underperformance provides some guidance. With p = 8 predictors, the OS-based methods showed typical FPR's greater than 40% in the absence of predictor outliers. This performance did not significantly improve with increased predictor outlier contamination. Only under response contamination alone with p = 30 predictors did the OS lasso perform strongly, with FPR's below 20%, and it otherwise underperformed or compared with the average-performing methods. Although this poor performance corresponds with relatively strong FNR and precision performance, particularly for the OS Huber lasso, the general degree of underperformance in FPR nullifies this advantage unless FNR and/or coefficient precision are of much greater concern than FPR.
The MS adaptive elastic net presents the opposite problem. It showed the strongest or near-strongest performance in FPR trends under low dimensionality across observable characteristics, and it often outperformed most methods' FPR's by at least 5%-10%. Unfortunately, the MS adaptive elastic net also majorly underperformed other methods' FNR trends under combined predictor and response contamination. Furthermore, it generally underperformed across contamination scenarios concerning FNR. Its precision trends were strong under no response contamination, although increases in response contamination reduced its comparative performance here as well.
The elastic net also performed extremely poorly regarding FPR trends, with either the worst or second-worst average FPR across all unobservables and observables except at the smallest sample sizes with p = 30 potential predictors. The degree of underperformance renders the standard elastic net highly impractical, regardless of its minimal advantage over some methods in average FNR in some data conditions. The standard lasso, as would be expected given its pure ℓ1 penalty 46, generally outperformed the standard elastic net in FPR and underperformed in FNR.
However, overall it still underperformed in FPR. Meanwhile, the adaptive lasso and adaptive elastic net perform well in terms of FPR, and their low-dimensionality trends are competitive across contamination scenarios. The two methods also performed comparably to each other. However, this corresponds with middling-to-poor FNR's and coefficient precision. Given the relatively narrow spread of FNR trends more broadly, this sacrifice in performance is worth considering, although other methods merit consideration given a moderate number of potential predictors and smaller sample sizes.
Similarly, the adaptive Huber elastic net and adaptive LAD elastic net generally showed average-to-strong performance in low dimensionality across metrics. Therefore, these two methods deserve consideration if the data features a moderate number of potential predictors. However, they also produced middling FPR performance under combined predictor and response contamination with p = 8 potential predictors. Still, middling performance does not equal poor performance, and these methods provide a practical tool even in the worst case observed in these simulations.
The most consistent performers across scenarios, observable data characteristics, and metrics were the adaptive Huber lasso and adaptive LAD lasso. Only under specific conditions did the adaptive LAD lasso's FPR exceed the majority of other methods. Meanwhile, the adaptive Huber lasso always performed competitively concerning FPR and only exceeded average FPR's of 30% in a small group of scenarios for p = 8 potential predictors. Both methods underperform in FNR, especially at the smallest sample size of n = 38. This should be taken into consideration when utilizing smaller sample sizes and if elimination of true predictors is of particular concern. Given moderate sample sizes, however, their FNR performance was close to that of their elastic net cousins.
These findings suggest a few recommendations depending on the expected conditions underlying an applied dataset. The adaptive Huber lasso and adaptive LAD lasso both performed consistently and competitively across scenarios and metrics, though the adaptive Huber lasso produced the most consistency. Therefore, the adaptive Huber lasso is recommended given a dataset of at least n_train = 50 47. The adaptive LAD lasso provides a strong tool given a small number of potential predictors p. These two lasso formulations are also recommended more generally at training sample sizes as small as n = 25, except when false negatives are of particular concern. Given a small model-training n and concern for false negatives, the adaptive Huber elastic net and adaptive LAD elastic net are recommended instead. The MS adaptive elastic net is also recommended, although only if concerns about false positives far outweigh the potential for false negatives.
46 Since an ℓ1 penalty alone should correspond with more variable elimination.
47 Depending on whether test-set prediction is of interest, this may or may not correspond with the full sample size.

3.4.2 Recommendations Under High Dimensionality
In higher dimensions, the relative performance of the various methods changes somewhat. FPR's were generally much smaller across methods, and average FPR's for most adaptations never exceed 10%. The adaptive Huber elastic net and adaptive LAD lasso surpassed 10% median FPR under response contamination alone, while the adaptive Huber lasso approximated 10% FPR under η_x = 10%, η_y = 20%.
Under response contamination alone, the adaptive LAD elastic net approached or equaled 15%, the worst among all methods in those scenarios. Meanwhile, the adaptive lasso and adaptive elastic net always surpassed 10% given any response contamination. The adaptive lasso and adaptive elastic net also showed a very wide spread of outlying FPR's in all high-dimensionality scenarios and thus are not recommended. The two unmodified adaptive methods further underperformed in FNR distribution under the highest response contamination except when combined with the highest predictor contamination.
The adaptive LAD elastic net produced wide distributions and high enough median FPR's under no contamination or predictor contamination alone that it is not recommended in this case. It also performed poorly in prediction in the presence of predictor contamination alone, and generally performed poorly in estimating coefficients under high dimensionality.
The MS adaptive elastic net cannot be recommended under high dimensionality primarily based on the rate of null model production when η_x = 0%. Due to this, there is simply not enough information to assess performance across all levels of the unobservable characteristics. Even if the method's performance is otherwise stellar, these results might not generalize to those scenarios lacking sufficient data. This lack of data corresponds with unbounded uncertainty for the MS adaptive elastic net in high dimensions, rendering inferences unreliable.
The standard lasso and standard elastic net both produce competitive median FPR's under high dimensionality when presented with the most extreme predictor contamination, regardless of response contamination. They both also produced low median FPR's regardless of condition. However, they, too, showed wide distributions and a wide spread of outlying data points in various scenarios across metrics and thus are not recommended.
The two OS methods perform much more strongly in properly eliminating true zero coefficients under high dimensionality than in low dimensions. The OS lasso shows a competitive median FPR across contamination scenarios and a relatively small spread of outlying data points, with few exceptions. However, the OS lasso produced a wider distribution of FNR's and wider precision distributions under some contamination scenarios. If, however, properly eliminating variables from a model is the biggest concern, the OS lasso is a strong contender in higher dimensions. Conversely, the OS Huber lasso provides a useful tool where FNR and/or precision are of greater concern, as its distribution and range of outlying data points in FPR are notably less competitive under response contamination alone. However, problematic software implementation tempers both OS recommendations.
The adaptive Huber lasso and adaptive LAD lasso come with similar conditional recommendations concerning outliers and high dimensionality. They both produced wider FPR distributions in the presence of response outliers alone and larger median FPR's than many other methods. However, this relative underperformance is less concerning given the lower base rate of false positives across methods in higher dimensions. Furthermore, methods that otherwise performed more effectively in false positives in some scenarios 48 ultimately performed worse in other scenarios. The recommendation of the adaptive LAD lasso is more cautious, given its wider distribution of FPR's in all scenarios not featuring combined outlier contamination.
Given any predictor contamination, the adaptive Huber lasso performed competitively, and its distribution and outlying spread narrowed meaningfully. If coefficient precision is of particular concern, the adaptive Huber lasso and adaptive LAD lasso are the best recommendation, as they outperformed other methods across scenarios in median precision and they typically produced a narrower spread of outlying precision values.
Although the OS lasso proved a strong performer across high-dimensionality outlier scenarios, implementation renders it a difficult recommendation for the average researcher. Consequently, the author generally recommends the adaptive Huber methods in high dimensions given their stable and competitive performance across high-dimensionality conditions and metrics. Either variant is recommended in the absence of collinearity. Meanwhile, the elastic net variant likely provides the best solution if researchers are concerned about collinearity.

3.4.3 A Note on Conservative Selection in the MS Adaptive Elastic Net
This initial simulation provides the first observations of the conservative nature of variable selection in the MS adaptive elastic net. The method appears to overcompensate for the false positive concern of the lasso expressed by Xiao and Xu (2017) by more frequently eliminating potential predictors in general. This finding recurs in subsequent simulations, and the general discussion addresses it in further detail.
48 For instance, the standard lasso.

Chapter 4: Simulations: Distributional Robustness
This chapter outlines simulations conducted to evaluate the robustness of various lasso and elastic net adaptations to non-normality in the error distribution. The chapter begins with a brief review of relevant findings in the extant literature, followed by a description of the methods used to simulate and study non-normal error distributions. Results are presented and discussed. Additionally, the chapter includes a description of and results from a smaller simulation conducted to evaluate distributional robustness in higher dimensions.

4.1 Relevant Findings
The following review is constrained to findings involving at least one of the included adaptations designed for robust characteristics and in the context of non-normality in the error distribution. Consequently, the review below does not include several studies reviewed while describing lasso and elastic net adaptations in Chapter 2. Studies that assessed the standard or adaptive formulations of the lasso or elastic net, but did not consider any other adaptations included in the current simulations, are not reviewed. However, the performance of these methods is discussed below in the context of other included adaptations.
Li et al. (2020) once again provide the clearest results regarding the adaptive LAD and adaptive Huber lasso, as they separate non-normality from outlier contamination and collinearity. Double exponential-distributed errors produced 0% FNR's for the standard lasso, adaptive LAD lasso, adaptive Huber lasso, and adaptive Tukey-M lasso. These four methods produced comparable FPR's between 24%-26% in this scenario. Meanwhile, Lambert-Lacroix and Zwald (2011), Lambert-Lacroix and Zwald (2016), and Zheng et al. (2016) each separated non-normality and outlier contamination, although each study included collinearity in all data conditions. Both Lambert-Lacroix and Zwald (2011) and Zheng et al.
(2016) found that non-normally-distributed errors contributed to drastically reduced ability to select true non-zero predictors into the model for all methods studied for n = 50, including the adaptive LAD lasso and adaptive Huber lasso. Lambert-Lacroix and Zwald (2011) found FNR's of 51%, 35%, and 40% for the adaptive lasso, adaptive LAD lasso, and adaptive Huber lasso, respectively, for double-exponential error distributions. Under Cauchy-distributed errors and n = 50, Zheng et al. (2016) observed FNR's of 50%, 23%, and 44% for the adaptive lasso, adaptive LAD lasso, and adaptive Huber lasso, respectively. The adaptive LAD lasso and adaptive Huber lasso improved to 12% and 38%, respectively, when sample size increased to n = 100, although the adaptive lasso's FNR deteriorated to 57%. These findings further highlight the lack of results indicating FNR's in Lambert-Lacroix and Zwald (2016). All three studies found no significant changes in FPR related to the non-normally-distributed error conditions.
X. Wang et al. (2013) evaluated the adaptive LAD lasso compared to the ESL lasso and quantile lasso given combined predictor contamination, response contamination, and Cauchy-distributed errors. FNR was 0% for all methods and sample sizes. FPR's of the adaptive LAD lasso ranged between 44% and 49%, even with sample sizes as large as n = 800, while the ESL lasso and quantile lasso produced FPR's of 0% and 30%, respectively. Arslan (2012) evaluated the performance of the adaptive LAD lasso and weighted LAD lasso under Cauchy-distributed errors and collinearity. For all sample sizes (50, 100, and 200), the adaptive LAD lasso FPR's and FNR's approached or equaled 0%.
The limited field of relevant results suggests that non-normality will produce negative impacts on the FNR's for the standard and adaptive lasso formulations, the adaptive LAD lasso, and the adaptive Huber lasso, particularly at smaller sample sizes. Collinearity might also impact FPR, although this finding is less consistent. Collinearity might underlie the increased FPR's observed, as those results only arose in studies which did not separately simulate collinearity and non-normality. As in studies of outlier robustness, a final concern with applying all of these findings to the current simulations rests in the methodological heterogeneity used to study the impacts of non-normality on variable selection performance.

4.2 Methods and Design
4.2.1 Models and Software Implementation
The current simulations evaluate the variable selection abilities of the following methods. Section 4.2.2 includes further details about cross-validation procedures that are non-specific to method or implementation. Appendix B.4 provides the code used to generate simulated predictor and response data. Sections 4.2.4 and 3.2.5 outline the procedure used to generate these data. Examples of code for methods with and without the adaptive lasso tuning hyperparameter can be found in Appendices B.5 and B.6, respectively.

• Standard Lasso and Elastic Net: The standard lasso and elastic net were included as a benchmark against which to compare the other methods. Both are implemented using the cv.glmnet function from the glmnet package (Friedman et al., 2019).
• Adaptive Lasso and Elastic Net: The weights for the adaptive lasso tuning hyperparameter were applied using the "penalty.factor" argument of the cv.glmnet function from the glmnet package (Friedman et al., 2019). This argument only applies the weights for a given weighting hyperparameter and does not conduct cross-validation to select it.
This process is addressed in Section 4.2.2.
• Multi-Step Adaptive Elastic Net: The multi-step adaptive elastic net was conducted using the msaenet function from the msaenet package (Xiao and Xu, 2019). This function includes a built-in procedure for calculating the initial coefficient estimates for subsequent weighting during the adaptive lasso process. The initial ridge estimates and subsequent weight application were thus all implemented within this function rather than through running a preliminary model, although cross-validation still needed to be incorporated for selecting optimal values for the weighting hyperparameter 1. The original study does not outline the number of stages used in their simulations, nor a recommended number of stages. Therefore, the number of stages k was arbitrarily set to 10 2.
• Adaptive LAD Lasso and Elastic Net: The adaptive LAD lasso and elastic net were implemented using the cv.hqreg function from the hqreg package (Yi, 2017) by setting the "method" argument to "quantile" and the corresponding hyperparameter "tau" argument to 0.5 for the LAD lasso criterion. This function includes a "penalty.factor" argument similar to that of cv.glmnet, which was used to apply weights during the adaptive lasso process. An additional cross-validation procedure needed to be incorporated to determine the optimal weighting hyperparameter 3. A variable screening procedure for computational optimization developed by Tibshirani et al. (2012) was incorporated into the function by setting the "screen" argument to "SR." The authors also developed a newer, less computationally-intensive screening rule, which is less conservative as a result and is detailed in their study. The original, more conservative screening rule developed by Tibshirani et al. (2012) was chosen for this study. Although the choice to use a screening rule (as opposed to none) was made at the recommendation of the developers of the hqreg package, the choice of particular rule was arbitrary on my part given the lack of specific recommendations.
• Adaptive Huberized Lasso and Elastic Net: The adaptive Huberized lasso and elastic net were implemented using the cv.hqreg function from the hqreg package (Yi, 2017) by setting the "method" argument to "huber" and the corresponding transition hyperparameter "gamma" argument to 1.345 to balance both robustness of the Huber loss and efficiency under ideal data conditions 4. The "penalty.factor" argument was again used to apply weights during the adaptive lasso process. An additional cross-validation procedure needed to be incorporated to determine the optimal weighting hyperparameter 5. The same screening rule was used as described for the adaptive LAD lasso and elastic net.
• Outlier-Shifted Lasso: The outlier-shifted lasso was implemented using R code adapted from code generously provided by Dr. Yoonsuh Jung, one of the authors of Jung et al. (2016). Changes to the code needed to be made to correct issues with internal cross-validation procedures and internal object references. Furthermore, although an internal procedure calculates the outlier-shifting hyperparameters, the tuning hyperparameter used for estimating these values required selection.
1 This was selected using the same procedure described in Section 4.2.2.
2 This choice is discussed further in the general discussion in Chapter 7.
3 See Section 4.2.2.
4 Per recommendation by Huber (1981).
5 See Section 4.2.2.
The optimal value of this tuning hyperparameter was chosen via an additional cross-validation procedure, conducted in a similar fashion to that described in Section 4.2.2. No other choices or alterations were made in the application of the custom code. The R code used to simulate the OS lasso model, including the adapted custom OS lasso code, can be found in Appendix B.7.
• Outlier-Shifted Huberized Lasso: The outlier-shifted Huber lasso was implemented using a custom R function adapted from code generously provided by Dr. Yoonsuh Jung, one of the authors of Jung et al. (2016). Changes to the custom function needed to be made to correct issues with internal cross-validation procedures and internal object references. As opposed to the standalone OS implementation, potential values for the outlier-shifting tuning hyperparameter were chosen from the same potential values as the regularization tuning hyperparameters, and no other cross-validation procedures needed to be incorporated to run the simulated models. No other choices or alterations were made in the application of the custom code. The R code used to simulate the OS Huber lasso model, including the adaptation of the custom R function, can be found in Appendix B.8.

4.2.2 Hyperparameter Selection
For all instances of the elastic net, the balancing hyperparameter α was set to 0.5. Values of .75 and .9 were also considered, but discarded due to lack of meaningful differences on primary performance metrics in initial simulations 6. It is standard to select tuning hyperparameters via a cross-validation procedure rather than running models such as the lasso or ridge regression using a pre-determined value. No standard exists for the cross-validation procedure itself, so these simulations use the following procedure. Regularization tuning hyperparameters were chosen from 100 logarithmically-equidistant values between 0.01 and 1400 by 5-fold cross-validation 7.
When utilizing the adaptive lasso tuning hyperparameter, a preliminary 5-fold cross-validated ridge regression was conducted using the same 100 lambda values, and the resulting coefficients were used for determining the weights vector. Unless otherwise specified, the initial ridge step was performed using the cv.glmnet function in the glmnet package (Friedman et al., 2019) 8. A subsequent 5-fold cross-validation procedure was conducted over 100 potential values of the scaling hyperparameter, chosen from the logarithmic sequence described previously. The literature does not provide a standard selection procedure for this hyperparameter. Furthermore, the literature does not make any recommendations for the sequence of potential values to use in cross-validation.
6 The potential limitation of this choice is given further consideration in the general discussion section in Chapter 7.
7 Multiple studies cite Lambert-Lacroix and Zwald (2011) as the original source for this recommendation. Lambert-Lacroix and Zwald (2011) does not, however, provide a rationale for this choice.
8 This initial step was conducted internally for the multi-step adaptive elastic net, although arguments were specified to implement the same procedure using the same sequence of potential values.

The criterion for selection of optimal hyperparameters was mean prediction error in cross-validation test sets 9. This metric was the only metric available in all software implementations of the included methods and adaptations 10.
9 Note: This is different from the test sets used to generate one of the performance metrics, the test-set RMSE, which is discussed in the next section.
10 The potential limitation of this choice is given further consideration in the general discussion section in Chapter 7.
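To make the weighting and cross-validation steps above concrete, the following is a minimal sketch of the two-step adaptive fit using cv.glmnet and cv.hqreg. It is an illustration under stated assumptions rather than the simulation code itself (which appears in Appendices B.5 and B.6): the toy data, the fixed weighting exponent, and the small stabilizing constant in the weights are choices made here for the example, and the weighting hyperparameter is fixed rather than selected by the additional cross-validation loop described above. The cv.hqreg arguments follow the description in Section 4.2.1.

library(glmnet)
library(hqreg)

# Illustrative data only; the dissertation's generating code is in Appendix B.4.
set.seed(1)
n <- 100; p <- 30
x <- matrix(rnorm(n * p), n, p)
beta.true <- c(0.5, 1.0, 1.5, 2.0, rep(0, p - 4))
y <- drop(x %*% beta.true + rnorm(n))

# 100 logarithmically equidistant regularization values between 0.01 and 1400
lambda.grid <- exp(seq(log(0.01), log(1400), length.out = 100))

# Step 1: preliminary 5-fold cross-validated ridge fit for initial coefficients
ridge.cv <- cv.glmnet(x, y, alpha = 0, lambda = lambda.grid, nfolds = 5)
beta.init <- as.matrix(coef(ridge.cv, s = "lambda.min"))[-1, 1]  # drop intercept

# Step 2: adaptive weights; the exponent (1) and the 1e-6 guard are assumptions
w <- 1 / (abs(beta.init)^1 + 1e-6)

# Step 3a: adaptive lasso (alpha = 1) or adaptive elastic net (alpha = 0.5)
adalasso.cv <- cv.glmnet(x, y, alpha = 1, lambda = lambda.grid,
                         nfolds = 5, penalty.factor = w)

# Step 3b: the same weights passed to cv.hqreg for the adaptive Huber lasso
huber.cv <- cv.hqreg(x, y, method = "huber", gamma = 1.345,
                     penalty.factor = w, screen = "SR")

# Nonzero coefficients at the cross-validation-selected lambda
which(as.matrix(coef(adalasso.cv, s = "lambda.min"))[-1, 1] != 0)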
4.2.3 Performance Metrics
The simulations used two primary metrics to evaluate the variable selection characteristics of each method and adaptation:
• False-Positive Rate (FPR): the proportion of true zero coefficients incorrectly estimated to be non-zero
• False-Negative Rate (FNR): the proportion of non-zero coefficients incorrectly estimated to be zero
These metrics were chosen for their correspondence with the robust hypothesis-testing statistical framework. The rate of eliminating true non-zero predictors analogizes with the complement to power 11 in the absence of explicit hypothesis tests. The rate of selection of true zero coefficients similarly equates to the Type I error rate 12.
Secondary performance metrics included accuracy in both coefficient estimation and prediction. Coefficient Estimate Precision, henceforth precision, was calculated to explore how accurately each method estimated the true non-zero coefficients:

Precision(\hat{\beta}) = \frac{\sum_{j=1}^{p_{nonzero}} (\hat{\beta}_j - \beta_j)^2}{p_{nonzero}}    (4.1)

Note that p_nonzero = 4 in all cases in this study, as all simulations featured nonzero values only in the first four elements of the coefficient vector β. Although developed independently for the purposes of this paper, the similarity of this metric with another measure of precision used by Kurnaz et al. (2018) should be noted. However, their version does not account for the number of true predictors. They also take the square root of the deviation in coefficients.
The Test-Set Root Mean Squared Error, henceforth RMSE, meanwhile, is calculated as follows:

RMSE = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}}    (4.2)

where \hat{y}_i is the model-predicted outcome value for the ith observation. To calculate the RMSE, an additional 50% of data, rounded up, was generated for each dataset using the same seed as the original dataset.
FPR and FNR were calculated internally within each model-application function. Appendix B.9 provides example code for generating RMSE and precision.
11 AKA, the probability of rejecting the null hypothesis of no relationship given that the alternative hypothesis is in fact true.
12 AKA, the probability of rejecting a true null hypothesis of no relationship.
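As a concrete reference, the following is a minimal sketch of how the four metrics defined above might be computed for a single simulated fit. The helper names and the assumption that beta.hat and beta.true exclude the intercept are mine; the scoring code actually used is in Appendix B.9 and inside the model-application functions.

# FPR: true zeros incorrectly kept; FNR: true predictors incorrectly dropped
fpr <- function(beta.hat, beta.true) {
  zero <- which(beta.true == 0)
  mean(beta.hat[zero] != 0)
}
fnr <- function(beta.hat, beta.true) {
  nonzero <- which(beta.true != 0)
  mean(beta.hat[nonzero] == 0)
}

# Precision over the p_nonzero true predictors (Equation 4.1)
precision <- function(beta.hat, beta.true) {
  nonzero <- which(beta.true != 0)
  mean((beta.hat[nonzero] - beta.true[nonzero])^2)
}

# Test-set RMSE (Equation 4.2)
rmse <- function(y.hat, y) sqrt(mean((y.hat - y)^2))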
4.2.4 Simulation Conditions: Low Dimensionality
Data for the distributional robustness simulations were generated from the following linear model:

y_i = x_i^{\top}\beta + \epsilon_i    (4.3)

with β_1 = 0.5, β_2 = 1.0, β_3 = 1.5, β_4 = 2.0, and all other coefficients equal to 0. Predictor values were generated from the following distribution:

x \sim N(0, \Sigma)    (4.4)

where Σ is the p × p identity matrix, with 1's on the diagonal and 0's otherwise.
To simulate non-normality in the error distributions, residuals were generated using the g-and-h distribution. The g-and-h distribution, as previously described in Section 1.1.1, results in greater asymmetry as g increases and heavier tails as h increases, and contains the standard normal distribution as a special case when g = h = 0 13. The error distribution took one of the following forms:
• g-and-h(0): The same data utilized for η_x = η_y = 0% in the outlier robustness simulation was used to represent the condition where g = h = 0, as both represent an underlying standard normal population distribution.
• g-and-h(1): ε ~ g-and-h(0.0, 0.2)
• g-and-h(2): ε ~ g-and-h(0.2, 0.0)
• g-and-h(3): ε ~ g-and-h(0.2, 0.2)
These values were chosen based on their use in general studies in the field of robust hypothesis testing for generating skewed and heavy-tailed distributions 14.
In addition to the four error distributions specified above, the following features of the data varied across error distribution conditions:
• p = number of potential predictors; p varied across two levels: 8 and 30
• n = sample size; sample size varied across four levels: 25, 50, 100, and 200
with each possible combination of feature levels (including error distribution) represented, for a total of 32 conditions. 500 iterations of each condition were simulated, and analyses were conducted on each iteration.
13 See Gillivray, 1992; Wilcox, 2016; and Wilcox et al., 2013 for further details.
14 See, for example, Wilcox et al. (2013), although Wilcox (2016) and others note that the skew and increased kurtosis generated by these values may not be sufficiently extreme to reflect conditions in real data. This potential limitation is discussed further in the general discussion section in Chapter 7.

4.2.5 Simulation Conditions: High Dimensionality
To explore the impacts of dimensionality and non-normal error distributions on variable selection capabilities, additional data were generated from the same g-and-h error distributions. The simulations fixed sample size n at 200 and number of potential predictors p at 1000. The same true coefficient vector was used as in the low-dimensionality setting: β_1 = 0.5, β_2 = 1.0, β_3 = 1.5, β_4 = 2.0, and all else equal to 0. Therefore, a total of 4 high-dimensionality data scenarios were simulated for studying non-normal error distributions, representing the 4 combinations of g 15 and h 16. 500 iterations of each of these 4 data conditions were simulated, and analyses were conducted on each iteration.
15 2 levels: 0.0 and 0.2.
16 2 levels: 0.0 and 0.2.
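For reference, a minimal sketch of drawing g-and-h errors as described in Section 4.2.4 is shown below. The function name rgh and the illustrative parameter values are assumptions made for this example; the generating code actually used for the simulations is in Appendix B.4.

# Tukey's g-and-h transformation of a standard normal draw:
# larger g -> more asymmetry, larger h -> heavier tails; g = h = 0 gives N(0, 1)
rgh <- function(n, g = 0, h = 0) {
  z <- rnorm(n)
  if (g == 0) {
    z * exp(h * z^2 / 2)
  } else {
    ((exp(g * z) - 1) / g) * exp(h * z^2 / 2)
  }
}

set.seed(1)
eps <- rgh(200, g = 0.2, h = 0.2)  # errors for the g-and-h(3) condition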
4.3 Results
Except for some high-dimensionality data conditions for the adaptive LAD elastic net, all simulations were conducted in RStudio on a 3.80 GHz AMD Ryzen Threadripper processor with 64 GB RAM.
A note on the plots presented: Performance metric data for the low dimensionality settings are displayed as line trends over sample size. Meanwhile, metric data for high dimensionality are displayed as boxplots, since sample size did not vary. One further difference is noted: the trend lines displayed in the low dimensionality plots display the 20% trimmed means, while the boxplots in the high dimensionality setting display the median as the central tendency of each box. The median was impractical in the low dimensionality setting, as the resulting trends overlapped so precisely that some methods' trajectories were not visible in multiple plot windows. This issue arose particularly for FPR at p = 8 and FNR at any level of p.
The general organization of each plot window is as follows: Each plot window contains four plots corresponding with each of the combined g-and-h conditions, since these characteristics are best considered in tandem. Skew g increases along the vertical axis, from top to bottom, while tail heaviness h increases along the horizontal axis, from left to right. Therefore, the top-left-most plot contains the given metric for g = h = 0.0. The plot in the bottom-right contains the given metric for g = h = 0.2.

4.3.1 Results: Low Dimensionality
4.3.1.1 p = 8
[Figure 4.1: FPR: Non-Normality, Low Dimensionality, p = 8. FPR plotted against n (38, 75, 150, 300) for the four g-and-h panels, by method.]

Figure 4.1 presents the FPR data for the non-normality simulations with p = 8 potential predictors. The MS adaptive elastic net (yellow) consistently performs the best in FPR, hovering just below 20% in all four scenarios. Similar to the outlier simulations, the standard elastic net (royal blue), the standard lasso (bright green), the OS Huber lasso (light blue), and the OS lasso (purple) underperformed regardless of scenario. The standard lasso's average FPR ranged between 55%-62%, while the standard elastic net ranged between 64%-75%. The OS Huber lasso ranged between 59%-67%. The OS lasso outperformed the other three poor FPR performers. However, it underperformed all other adaptations by a wide margin. Interestingly, its average FPR improved as h increased, although its FPR went as high as 35%-40% at its best for some scenarios.
The adaptive lasso (red) and adaptive elastic net (green) performed competitively in FPR across conditions, generally ranging between 25%-30%. The adaptive Huber lasso (turquoise) performed similarly, although its relative FPR performance deteriorated with increased sample size and kurtosis. Under increased kurtosis 18, the adaptive Huber elastic net (brown) behaved similarly to the adaptive Huber lasso and the two unmodified adaptive methods. However, the adaptive Huber elastic net also produced its worst relative performance as both sample size and kurtosis increased. The adaptive LAD lasso (black) generally underperformed this group by small margins and outperformed the adaptive LAD elastic net (pink) by small margins, but converged on the adaptive Hubers' FPR's with increased sample size. The adaptive LAD elastic net typically produced the worst FPR's of the competitive grouping. However, the adaptive LAD elastic net also improved the most as sample size increased. Its underperformance was clearest in the presence of increased kurtosis. At the largest sample size and with h = 0, the adaptive LAD elastic net averaged an FPR of approximately 27%-28%. It produced its worst performance for g = h = 0.2 and n = 38, with an average FPR of nearly 40%.
18 AKA, when h = 0.2.

[Figure 4.2: FNR: Non-Normality, Low Dimensionality, p = 8. FNR plotted against n for the four g-and-h panels, by method.]

FNR's were low for most methods across scenarios of non-normality, as seen in Figure 4.2. There was high variability between methods, especially at smaller sample sizes, and this variability increased under h = 0.2. FNR generally increased under h = 0.2, but most adaptations were not seriously impacted by g = 0.2. While the OS Huber lasso (light blue), standard elastic net (royal blue), and standard lasso (bright green) all drastically underperformed in FPR, this corresponded with stronger performance in FNR relative to the other methods. Even at small sample sizes under combined skew and increased kurtosis, the OS Huber lasso and standard elastic net barely exceeded average FNR's of 5%.
In comparison, the standard lasso's FNR did not exceed 7.5%, compared to all other methods' average FNR of at least 10% at the same sample size n = 38 and for h = 0.2. The OS lasso (purple) also performed competitively under h = 0.2, though its FNR at n = 38 was relatively average at about 7%-10%.
Under h = 0, the adaptive LAD lasso (black), adaptive LAD elastic net (pink), adaptive Huber lasso (turquoise), and adaptive Huber elastic net (brown) all showed among the worst average FNR's, especially at smaller sample sizes. The adaptive Huber lasso's performance improved much more rapidly under skew alone (g = 0.2). The adaptive LADs and adaptive Hubers all performed similarly under h = 0.2, following a similar trend in FNR to the adaptive lasso (red) and adaptive elastic net (green). Under combined skew and increased kurtosis, all six methods' average FNR exceeded 10%, and ranged between 8%-11% under g = 0, h = 0.2. The standard lasso and elastic net had average performance at the smallest sample size when h = 0.2, and improved more rapidly with sample size than any other method (although never surpassing the strongest performers).
The MS adaptive elastic net (black), while not the worst performer under h = 0, performed among the worst and converged slowly on an average FNR of 0%, and was the worst performer given h = 0.2. The method's average FNR approached 20% under combined skew and increased kurtosis.

[Figure 4.3: RMSE: Non-Normality, Low Dimensionality, p = 8. Test-set RMSE plotted against n for the four g-and-h panels, by method.]

Test-set prediction error was competitive across methods and scenarios; methods did not produce meaningful performance differences in test-set prediction regardless of scenario (see Figure 4.3). The OS lasso's performance did not suffer under smaller sample sizes as it did in the outlier simulations. Prediction error did not increase or decrease with g, although it did deteriorate slightly under h = 0.2.

[Figure 4.4: precision: Non-Normality, Low Dimensionality, p = 8. Coefficient precision plotted against n for the four g-and-h panels, by method.]

Figure 4.4 displays coefficient estimate performance in the non-normality simulations for p = 8. Under h = 0, relative performance was similar, although increased skew did appear to have a slight negative impact on precision for all methods. The majority of the methods clustered together in these two conditions. However, the adaptive LAD lasso (black), adaptive LAD elastic net (pink), and the OS lasso (purple) all typically estimated the true non-zero coefficients less effectively. Under h = 0.2, the OS lasso performed the worst by a small margin, particularly at n = 38, while the OS Huber lasso (light blue), adaptive Huber lasso (turquoise), and adaptive Huber elastic net (brown) performed the best by a slightly larger margin. The remaining methods clustered together when g = 0, h = 0.2, while the standard lasso (bright green) and standard elastic net (royal blue) compared more closely to the OS lasso as sample size increased and with g = h = 0.2.
4.3.1.2 Null Models Produced

Method        n   p  g    h    Null Models
elnet5        38  8  0.0  0.2  2
elnet5        38  8  0.2  0.2  1
huberelnet5   38  8  0.2  0.2  1
huberlasso    38  8  0.2  0.2  1
ladlasso      38  8  0.2  0.2  1
lasso         38  8  0.0  0.2  2
lasso         38  8  0.2  0.2  1
msadaelnet5   38  8  0.0  0.2  4
msadaelnet5   38  8  0.2  0.2  6
oslassoplus   38  8  0.0  0.2  5
oslassoplus   38  8  0.2  0.0  1
oslassoplus   38  8  0.2  0.2  2
Table 9: Null Models: Non-Normality, Low Dimensionality, p = 8

Non-normality resulted in fewer null models with p = 8 potential predictors compared to the outlier contamination simulations, as seen in Table 9. Only the MS adaptive elastic net and the OS lasso exceeded a 1% null model rate, and only under one condition each 19. Only the OS lasso produced a null model under h = 0, and only one null model in that scenario. The adaptive lasso, adaptive elastic net, adaptive LAD elastic net, and OS Huber lasso all produced no null models.
19 g = h = 0.2 for the MS adaptive elastic net; g = 0, h = 0.2 for the OS lasso.

4.3.1.3 p = 30
[Figure 4.5: FPR: Non-Normality, Low Dimensionality, p = 30. FPR plotted against n for the four g-and-h panels, by method.]

FPR's for the non-normality simulations with p = 30 potential predictors are displayed in Figure 4.5. As in the outlier simulations, increasing the potential predictor space resulted in overall improved performance in correctly eliminating true zero variables from the model.
The standard elastic net (royal blue) produced the worst performance in FPR, averaging between 37%-40% across data scenarios. Although they performed relatively better than with only p = 8 potential predictors, the OS Huber lasso (light blue) and standard lasso (bright green) also underperformed relative to the other methods, especially once sample size increased past n = 38. However, both methods showed average performance at the smallest sample size. The standard lasso outperformed the OS Huber lasso, ranging between 27%-31% across data scenarios, and neither kurtosis nor skew impacted its performance significantly. On the other hand, the OS Huber lasso saw increased FPR under h = 0.2. Under h = 0, the OS Huber lasso's performance ranged between approximately 30% and approximately 33%. With g = 0, h = 0.2, the OS Huber lasso had an FPR of roughly 35% except at the smallest sample size, where its average FPR was approximately 31%. The method performed similarly at all sample sizes for g = h = 0.2, although it improved slightly to an average FPR of 33% at n = 300.
The OS lasso (purple) produced relatively stronger performance, especially at small sample sizes or with heavier tails. Although consistently the fourth-worst performer in average FPR once sample size increased past n = 75 for h = 0, its small-sample performance was relatively average at n = 75 and competitive with the adaptive Huber lasso (turquoise) at n = 38. However, the OS lasso's performance did not increase more than a few percentage points with increased sample size under h = 0, while the adaptive Huber lasso improved by 12% or more on average as sample size increased. The adaptive Huber lasso and the OS lasso performed well under h = 0.2, approaching the average FPR of the MS adaptive elastic net as sample size increased to n = 300. The OS lasso outperformed the adaptive Huber lasso at the smallest sample size when h = 0.2 by a small margin. The OS lasso also outperformed the adaptive Huber lasso for all sample sizes when g = h = 0.2. At the smallest sample size, the OS lasso averaged 23% FPR for h = 0.2, while the adaptive Huber lasso produced an average FPR of 25%-27%. Both methods improved to average FPR's of approximately 17%-18% at the largest sample size for both h = 0.2 conditions.
Regardless of data scenario, the adaptive lasso (red), adaptive elastic net (green), and adaptive LAD elastic net (pink) all performed poorly at the smallest sample size n = 38, approaching or surpassing the standard elastic net and ranging between average FPR's of 35%-41%. For h = 0, the adaptive LAD elastic net far outpaced the two unmodified adaptive methods; under normality g = h = 0, it averaged an FPR of 15% at n = 300, compared to roughly 20% and 21% for the adaptive lasso and elastic net, respectively. The adaptive LAD elastic net deteriorated slightly to 16% or 17%, while the adaptive lasso and elastic net improved to 20%, for g = 0.2, h = 0 at the largest sample size. The two adaptive methods outperformed the adaptive LAD elastic net under h = 0.2 by a small margin as sample size increased, converging on average FPR's of approximately 20%. FPR was slightly greater than 20% for the adaptive LAD elastic net under g = 0 and approximately 22% under g = 0.2. Whereas the adaptive LAD elastic net's average FPR increased with h, the performance trends of the adaptive lasso and adaptive elastic net remained relatively constant across data scenarios.
The adaptive LAD lasso (black) and the adaptive Huber elastic net (brown) had comparable average FPR's of 31%-33% at the smallest sample size, regardless of data scenario. Trend divergences across sample sizes were sporadic, although the adaptive LAD lasso outperformed the adaptive Huber elastic net by a few percentage points where deviations in average FPR occurred. Increased skew did increase FPR's for both methods by a few percentage points at the largest sample size, regardless of h. The adaptive LAD lasso (alongside the adaptive Huber lasso) slightly outperformed the MS adaptive elastic net at the largest sample size for h = 0, with average FPR's of 12%-14% at n = 300, compared to approximately 15% for the MS adaptive elastic net. In comparison, the adaptive Huber elastic net converged on 15%. The adaptive LAD lasso and adaptive Huber converged on an average FPR of approximately 20% at the largest sample size for both h = 0.2 data scenarios.
The MS adaptive elastic net performed strongly and consistently across data scenarios, with average FPR of 22%-23% at the smallest sample size and just above or below 15% at the largest.
[Figure 4.6: FNR: Non-Normality, Low Dimensionality, p = 30. FNR plotted against n for the four g-and-h panels, by method.]

Figure 4.6 displays the FNR trends for the low dimensionality non-normality simulations for p = 30. Most methods produced higher average FNR's relative to their performance with p = 8, although the MS adaptive elastic net (yellow) produced some exceptions.
Similar to the p = 8 simulations, the standard lasso (bright green), standard elastic net (royal blue), and OS Huber lasso (bright blue) performed the strongest given h = 0. However, their relative superiority over other methods was much smaller than in the first set of simulations. Whereas the standard elastic net had average FNR's between 1%-2% under the two h = 0 conditions previously, average FNR increased many times over to roughly 7% under g = 0, and quadrupled to approximately 8% under skew with g = 0.2. The OS Huber lasso increased from 2% to more than 8% and from 3% to approximately 9% under g = 0 and g = 0.2, respectively. In comparison, the standard lasso's average FNR increased from the same values in p = 8 to roughly 8% under both h = 0 scenarios. While the OS Huber lasso's performance was consistent at h = 0.2 for sample sizes beyond n = 38, it underperformed most other adaptations for n = 38. The standard lasso and standard elastic net also underperformed at the smallest sample size for h = 0.2. However, both outperformed all but the OS Huber lasso when n = 75. They also converged with or performed worse than the trends of the other methods as sample size increased beyond n = 75.
The OS lasso (purple) deteriorated more significantly than the lasso, elastic net, or OS Huber lasso concerning FNR. The OS lasso, regardless of scenario, always produced the highest average FNR at n = 38; under h = 0, average FNR for the OS lasso ranged between 11%-13% and increased further to 21%-22% under h = 0.2. Performance improved with increasing sample size, and the OS lasso had middling performance in average FNR beyond n = 38 when h = 0. It also took longer to converge on the average FNR trend under h = 0.2.
The adaptive lasso (red) and adaptive elastic net (green) produced average performance under h = 0 and middling small-sample performance under h = 0.2. However, once sample size increased beyond n = 38, both methods underperformed until converging on 0% average FNR at n = 300. The MS adaptive elastic net behaved similarly to the adaptive lasso and elastic net under h = 0. Notably, its performance did not change meaningfully under h = 0.2 relative to the p = 8 simulations. That is not to say that it performed well; on the contrary, it was the second-worst performer in both h = 0.2 conditions at n = 38 and was the worst performer as sample size increased.
At the two smaller sample sizes under h = 0, the adaptive LAD lasso (black) and the adaptive Huber lasso (turquoise) performed poorly, if not the worst, in average FNR. However, both converged on the general trend for n = 150 and n = 300. Under h = 0.2, both showed relatively middling FNR's of roughly 16% and still trended in the middle of the methods at n = 75, although both performed competitively at the two larger sample sizes in both h = 0.2 conditions.
The adaptive LAD elastic net (pink) and adaptive Huber elastic net (brown) behaved similarly to the adaptive Huber lasso and adaptive LAD lasso under h = 0. They produced stronger performance under h = 0.2 at the smallest sample size, comparable to the strongest methods at an average FNR of approximately 14% and 15% for the LAD and Huber, respectively, under g = 0 and 15% under g = 0.2. Although their relative FNR performance deteriorated at larger sample sizes, they still performed competitively under h = 0.2 regardless of sample size or g.

[Figure 4.7: RMSE: Non-Normality, Low Dimensionality, p = 30. Test-set RMSE plotted against n for the four g-and-h panels, by method.]

All methods performed similarly in terms of test-set prediction error (Figure 4.7). However, the trends were more distinct, particularly at the smallest sample sizes. The adaptive lasso (red) and the adaptive elastic net (green) both performed the worst across scenarios until sample size increased to n = 150, and the adaptive LAD elastic net (pink) underperformed slightly at the smallest sample size under h = 0. g did not appear to impact prediction error, while h negatively impacted prediction error.

[Figure 4.8: precision: Non-Normality, Low Dimensionality, p = 30. Coefficient precision plotted against n for the four g-and-h panels, by method.]

Figure 4.8 presents precision performance for non-normality with p = 30 potential predictors. The four worst performers regarding FPR 20 also performed the worst in coefficient estimation accuracy across data scenarios. The four methods trend together once sample size hits n = 75, and all four produced their worst performance at small sample sizes. The OS lasso underperforms all other methods at n = 38 and converges with the other three methods in average precision at n = 75. At this sample size, all four methods underperform relative to the other trends. Increasing h to 0.2, their underperformance stands out even more, although the OS Huber lasso performs comparably to the worst of the general cluster by n = 75.
h = 0 produced two distinct clusters of precision trends besides the poorest performers mentioned previously. The strongest performers, across sample sizes, include the MS adaptive elastic net (yellow), adaptive Huber lasso (turquoise), and adaptive Huber elastic net (brown). However, the adaptive Huber elastic net showed the poorest precision performance of this group at n = 38 21. Interestingly, the adaptive lasso (red) and adaptive elastic net (green) converge on this overperforming trend by n = 75, although both start with worse precision at n = 38 of approximately 0.18 and 0.20 under g = 0 and g = 0.2, respectively. The adaptive LAD elastic net (pink) has initial precision slightly higher than this trend at 0.19. The adaptive LAD lasso (black) has initial precision slightly lower at 0.17 under g = 0, and both start at approximately 0.20 for g = 0.2.
20 The OS lasso (purple), standard elastic net (royal blue), standard lasso (bright green), and OS Huber lasso (light blue).
21 Though still competitive with the remaining methods.
Both adaptive LADs converge at n = 75 in both h = 0 scenarios and have average precision estimates that track in between the strong-performing cluster and the poor-performing cluster. Under h = 0.2, however, strong-to-middling n = 38 performance by the adaptive LADs and adaptive Hubers converges to the strongest precision performance trend by n = 75. The adaptive lasso, adaptive elastic net, and MS adaptive elastic net underperform this group and outperform the OS Huber lasso by small margins.

4.3.1.4 Null Models Produced

Method        n    p   g    h    Null Models
adaelnet5     38   30  0.2  0.2  1
adalasso      38   30  0.2  0.2  2
adalasso      75   30  0.0  0.2  1
elnet5        38   30  0.0  0.2  8
elnet5        38   30  0.2  0.2  10
elnet5        75   30  0.0  0.2  1
elnet5        75   30  0.2  0.2  1
huberelnet5   38   30  0.0  0.2  1
huberlasso    38   30  0.0  0.2  2
ladelnet5     38   30  0.0  0.2  1
ladelnet5     38   30  0.2  0.2  1
ladlasso      38   30  0.0  0.2  1
ladlasso      38   30  0.2  0.2  1
lasso         38   30  0.0  0.2  7
lasso         38   30  0.2  0.2  12
lasso         75   30  0.0  0.2  1
lasso         75   30  0.2  0.2  1
Table 10: Null Models: Non-Normality, Low Dimensionality, p = 30, 1/2

Method        n    p   g    h    Null Models
msadaelnet5   38   30  0.0  0.0  1
msadaelnet5   38   30  0.0  0.2  15
msadaelnet5   38   30  0.2  0.2  13
msadaelnet5   75   30  0.0  0.2  1
msadaelnet5   75   30  0.2  0.2  1
msadaelnet5   150  30  0.0  0.2  1
oshuberlasso  38   30  0.0  0.0  1
oshuberlasso  38   30  0.0  0.2  11
oshuberlasso  38   30  0.2  0.0  2
oshuberlasso  38   30  0.2  0.2  7
oslassoplus   38   30  0.0  0.2  11
oslassoplus   38   30  0.2  0.0  1
oslassoplus   38   30  0.2  0.2  19
oslassoplus   75   30  0.0  0.0  1
oslassoplus   75   30  0.0  0.2  1
oslassoplus   75   30  0.2  0.2  1
Table 11: Null Models: Non-Normality, Low Dimensionality, p = 30, 2/2

Tables 10 and 11 report the number of null models produced with p = 30 potential predictors. All methods were more likely to produce null models under p = 30 relative to p = 8. Null models were also produced under more conditions, particularly with increases in sample size. At n = 38, both the standard lasso and standard elastic net showed null model rates of approximately 1%-2% under the h = 0.2 conditions. The other methods reported in Table 10 typically produced only one null model, while two methods produced two null models. The first null models produced by any simulations for the adaptive LAD elastic net occurred in these simulations, and the adaptive LAD lasso produced null models under two conditions. The adaptive Huber elastic net produced a single null model, while the adaptive Huber lasso produced two null models under the same conditions (g = 0, h = 0.2). The MS adaptive elastic net and OS lasso produced the most null models.
However, the MS adaptive elastic net's worst null model rate was only 3%, for g = 0, h = 0.2, and n = 38, while the OS lasso did not quite reach 4% at its worst (g = h = 0.2, n = 38). The OS Huber lasso, finally, produced upwards of 2% null models under g = 0, h = 0.2, and n = 38.

4.3.2 Results: High Dimensionality

The results for the MS adaptive elastic net (black) will not be interpreted, as null model rate was nearly 100% in all four high-dimensionality scenarios.

Figure 4.9: FPR: Outliers, High Dimensionality

The adaptive lasso (red), adaptive elastic net (green), adaptive LAD lasso (black), and adaptive LAD elastic net (pink) produced the widest dispersion in FPR across data scenarios, with outlying data points extending far beyond observations from any other methods. Their whiskers alone typically extended past any other methods' largest outlying observations. These methods' median FPR's typically exceeded other methods' FPR's. However, only a few percentage points separated the median FPR's of these four adaptations from the rest of the methods under h = 0. Although the adaptive Hubers showed less outlying FPR's across conditions, they also produced longer whiskers than all methods besides the adaptives and adaptive LADs. The adaptive Huber elastic net performed comparably to the adaptive lasso and adaptive elastic net in median FPR, while the adaptive Huber lasso performed better than all three across conditions. Under h = 0, the adaptive Huber lasso showed average FPR performance across methods, although it only outperformed the adaptive lasso, adaptive elastic net, and adaptive Huber elastic net under h = 0.2.

The standard lasso (bright green) and standard elastic net (royal blue) both performed competitively across scenarios in terms of median FPR and general dispersion. However, the two standard methods also produced many outlying data points, including extreme outlying FPR's under h = 0.2. Although the central tendency for FPR did not decrease noticeably with increased h, the extremity of outlying data points grew. Neither the OS Huber lasso (light blue) nor the OS lasso (purple) saw diminished FPR performance across data scenarios. The OS lasso always showed the best median FPR and narrower dispersion. Meanwhile, the OS Huber lasso performed competitively in median FPR. However, the OS Huber lasso also produced wide dispersion in FPR and a wider spread of outlying observations.

Median FPR for all methods except the adaptive LAD elastic net never exceeded 10%, and under h = 0 upper quartiles did not exceed 10% either (again, with the exception of the adaptive LAD elastic net). Under h = 0.2, however, the upper quartiles surpassed 10% for the adaptive lasso, adaptive elastic net, adaptive Huber elastic net, and adaptive LAD variants. All other methods' median and 75th-percentile FPR's remained below 10%.

Figure 4.10: FNR: Outliers, High Dimensionality

Figure 4.10 displays FNR performance for the high dimensionality simulations.
The median FNR for all methods across all data scenarios was 0%, and under h = 0.2, all methods produced at least one FNR of 25%. No method ever removed more than one true predictor from the model. Under h = 0, the standard elastic net (royal blue), standard lasso (bright green), OS Huber lasso (light blue), and OS lasso (purple) selected true predictors into the model in 100% of simulations.

Figure 4.11: RMSE: Outliers, High Dimensionality

Figure 4.11 displays similar test-set RMSE between methods. However, the adaptive lasso (red) and adaptive elastic net (green) produced slightly higher median RMSE under h = 0.2, and the adaptive LAD elastic net (pink) produced slightly higher RMSE under h = 0. RMSE deteriorated with h = 0.2, while g = 0.2 did not seem to negatively impact prediction. Under a standard normal error distribution, all methods' RMSE had fairly narrow dispersion, although the adaptive lasso, adaptive elastic net, adaptive Huber elastic net, and adaptive LAD elastic net all showed slightly wider dispersion and more outlying data points. Under h = 0.2, however, dispersion, and particularly outlying observation magnitude, increased dramatically relative to h = 0. The non-outlying dispersion of the adaptive lasso and adaptive elastic net was still wider than that of the other methods under h = 0.2, and their median RMSE's were also larger.

Figure 4.12: precision: Non-Normality, High Dimensionality

Figure 4.12 shows the coefficient precision results in the high dimensionality simulations. The adaptive lasso (red) and adaptive elastic net (green) produced wider dispersions than the remaining adaptations and larger outlying observations. These two methods nevertheless performed competitively under h = 0. However, their relative median precision deteriorates when h = 0.2. Dispersion and outlying observation magnitude also increased with h = 0.2, although this was true for all methods. The standard elastic net performed poorly in median precision regardless of data scenario. The adaptive LAD elastic net (pink) performed similarly under h = 0, with slightly smaller median precision, slightly wider IQR, and a smaller number of larger outlying values. The adaptive LAD elastic net's performance improved under h = 0.2.

The adaptive Huber elastic net, standard lasso (bright green), OS Huber lasso (light blue), and OS lasso (purple) all had comparable median precision results under h = 0, although the adaptive Huber elastic net produced a wider dispersion. The magnitude of outlying observations for the standard lasso was smaller under h = 0. However, under h = 0.2, the standard lasso produced more extreme outlying observations than the adaptive Huber elastic net and OS Huber lasso. The OS Huber lasso slightly outperformed the other three methods in median precision, dispersion, and extent of outlying observations under h = 0.2.

The adaptive Huber lasso (turquoise) and adaptive LAD lasso (black) showed the strongest performance across scenarios. They always showed the best median precision and narrowest IQR's.
Although they both produced some moderate outlying precision values, these were never particularly large relative to any other methods' outlying values.

4.3.2.1 Null Models Produced

Method        n    p     g     h     Null Models
msadaelnet5   300  1000  0.0   0.0   490
msadaelnet5   300  1000  0.0   0.2   490
msadaelnet5   300  1000  0.2   0.0   498
msadaelnet5   300  1000  0.2   0.2   480

Table 12: Null Models: Non-Normality, High Dimensionality

Table 12 displays null model information. The MS adaptive elastic net approached a 100% null model rate, while no other model produced null models in high dimensions.

4.4 Discussion

RMSE followed a similar pattern of comparable performance across methods, resulting in similar concerns as in the previous chapter. The discussion points remain the same, and readers are directed to the previous discussion of the matter in Section 3.4.

Non-normality, at least to the extent studied in the current simulations, did not impact performance to the same degree as outlier contamination. However, similar relative patterns of behavior emerged for the different methods, and increased h resulted in relative performance changes between the different methods. Similar clusterings of methods arose as in the outlier simulations. The standard elastic net and OS Huber lasso consistently underperformed in average FPR across distributional scenarios. However, the OS Huber lasso showed average performance at the smallest sample size with p = 30 potential predictors. The OS lasso and the standard lasso generally outperformed the elastic net and OS Huber lasso across conditions concerning FPR trends, although they still underperformed all other methods when p = 8. For p = 30, both methods performed competitively at the smallest sample size, although the remaining methods improved past the standard lasso at n = 75. The OS lasso still performed competitively at n = 75 and p = 30, regardless of scenario. Generally, these underperforming methods improved in average FPR with increased p.

The adaptive lasso, adaptive elastic net, adaptive Huber methods, and adaptive LAD methods all outperformed the OS methods and unmodified methods at all sample sizes for p = 8 in terms of properly selecting out true zero coefficients. In all such cases, they were slightly differentiated, particularly at smaller sample sizes, and this differentiation increased with heavy tails. The adaptive Hubers, adaptive lasso, and adaptive elastic net tended towards better average FPR except at n = 300, where the five methods converged within a few percentage points of each other. Under heavy tails, the adaptive LAD methods underperformed more clearly at smaller sample sizes, while the adaptive LAD lasso converged with the unmodified adaptives and the adaptive Hubers at n = 75. The adaptive LAD lasso deteriorated from n = 75 to n = 150 under heavy tails alone, although it converged with the adaptive Hubers and the adaptive LAD elastic net by n = 300. The unmodified adaptives slightly outperformed the adaptive Hubers and adaptive LADs starting at n = 150 for both heavy-tailed scenarios. The adaptive Huber lasso and adaptive LAD elastic net slightly outperformed their elastic net cousins under combined skew and heavy tails. For these methods, increasing p to 30 resulted in greater improvements over sample size. Heavier tails reduced sample-size improvements in FPR for the adaptive Hubers and adaptive LADs. The adaptive lasso and adaptive elastic net showed similar FPR trends over sample size across all four scenarios for p = 30.
The MS adaptive elastic net outperformed other methods concerning FPR at p = 8 and performed competitively for p = 30 potential predictors. Unlike most other methods, the MS adaptive elastic net did not deteriorate meaningfully in FPR performance with heavier tails. Consequently, it produced the strongest FPR performance for h = 0.2 at p = 8. The MS adaptive elastic net only underperformed relative to the adaptive Hubers and adaptive LADs at the largest sample size, n = 300.

All methods approached 0% FNR with increased sample size. Generally, the four worst-performing methods besides the OS lasso performed competitively in FNR, although the standard lasso starts underperforming under both heavy tails and p = 30. This advantage was maintained in the absence of heavy tails until n = 150 for both p = 8 and p = 30. The small-sample advantage of the OS Huber lasso, standard lasso, and standard elastic net is smallest when p = 30, although the OS Huber lasso performs the strongest under heavy tails regardless of p or n. The OS lasso only performs competitively under p = 8 scenarios, although its small-sample performance at this predictor space compares to most other methods. Under p = 30, the OS lasso is among the worst performers in correctly selecting true predictors into the model, regardless of scenario or n.

The MS adaptive elastic net's strong FPR performance corresponded with poor FNR in low-dimension scenarios. On the other hand, the adaptive Hubers, adaptive LADs, adaptive lasso, and adaptive elastic net performed comparably in low dimensions. The biggest differences were observed at small samples with skew alone. Under skew and n = 38, the adaptive lasso and adaptive elastic net outperformed the adaptive LADs and adaptive Hubers by 2% to 5%.

In the absence of heavy tails and with p = 8 potential predictors, most methods' coefficient precision was comparable. However, the adaptive LADs and the OS lasso underperformed most other methods, although at the largest sample size the margin of underperformance was much smaller. Under heavy tails, the adaptive LADs improved relative to the remaining methods, while the OS lasso maintained its poor performance. Meanwhile, the adaptive Hubers consistently performed competitively for p = 8, and the OS Huber lasso outperformed most methods under heavy tails. The standard elastic net generally performed competitively in estimating true non-zero coefficients. The standard lasso performed competitively in the absence of heavy tails, although its relative performance suffered when presented with heavy tails.

The OS lasso, standard lasso, and standard elastic net underperformed in coefficient estimation when p = 30. The OS Huber typically underperformed as well, although it produced competitive precision with heavier tails. All other methods performed competitively for p = 30 except at the smallest sample size. When n = 38, coefficient precision varied noticeably except under combined skew and heavy tails.

Under low dimensionality, FNR and coefficient precision deteriorated primarily with h but not with g. Performance on these two metrics also deteriorated generally with p = 30 relative to p = 8. The impacts of p and h on FPR were more heterogeneous across methods, as described above.

Median FPR and FPR dispersion were largely similar for any given method across scenarios under high dimensionality. FNR was also similar across scenarios. Median precision and precision dispersion also both increased over h generally across methods.
Some methods, particularly the OS lasso, the adaptive lasso and elastic net, and the standard lasso and elastic net, saw increased dispersion under the combination of skew and heavy tails relative to heavy tails alone. In the absence of heavy tails, increased skew had small impacts on precision dispersion, particularly for the adaptive lasso, the adaptive elastic net, the standard lasso, and the standard elastic net.

4.4.1 Recommendations Under Low Dimensionality

The primary unobservable characteristics of interest in the current simulations are skew, as indicated by g, and kurtosis/heavy-tailedness, as indicated by h. Generally, the standard lasso is not recommended. The only scenarios in which it performed competitively were under the smallest sample size, n = 38, for p = 30 potential predictors, and it only performed competitively in average FPR. It performed inconsistently under these scenarios regarding both precision and FNR, showing marked reductions in the presence of heavier tails. Furthermore, given its generally poor FPR performance and inconsistent performance on other metrics, the OS lasso is not recommended. If false positives are of no concern and false negatives are particularly problematic for a given application, the OS Huber lasso or standard elastic net might prove useful. However, both showed poor FPR performance across scenarios.

The MS adaptive elastic net is recommended here if false positives are of greater concern than false negatives. It is still generally practical for real-data applications involving mild-to-moderate skew or kurtosis, as its relative underperformance in FNR is fairly small across scenarios. The adaptive LAD lasso and adaptive Huber lasso provided the most consistent balance across scenarios and metrics. The LAD lasso performs especially well concerning coefficient estimation and properly eliminating true zero coefficients from the model. Despite the above points, however, the adaptive Huber lasso is recommended more broadly given its consistent performance in both the distributional simulations and the previous outlier simulations.

4.4.2 Recommendations Under High Dimensionality

Under high dimensionality, on the other hand, the OS lasso is a strong candidate overall and performs the most precisely across conditions regarding FPR. It, alongside the OS Huber lasso, never eliminated true predictors from the model in the absence of heavy tails. However, the OS lasso did not perform as strongly concerning coefficient estimation, especially under combined skew and heavier tails. If coefficient estimation is of particular concern and the data potentially involve skew or heavier tails, the adaptive Huber lasso or OS Huber lasso are practical choices. Both adaptations perform competitively in median FPR and relatively well in FPR dispersion across data scenarios.

Neither the adaptive lasso nor the adaptive elastic net is advised. The standard elastic net and standard lasso produced average FPR and FNR performance. However, they generally underperformed other methods, especially regarding dispersion of FPR. The adaptive Huber elastic net and adaptive LAD elastic net produced mediocre performance across metrics and scenarios and thus are not recommended in higher dimensions if non-normality is a concern. The adaptive LAD lasso and adaptive Huber lasso are practical choices in general.
Although they both showed a relatively wide distribution of FPR's, they always produced at least an average FPR compared to other methods and performed competitively in the absence of heavier tails. Both methods performed competitively in all other metrics and are generally recommended in higher dimensions where heavier tails present a potential concern. However, the possibility of collinearity in a dataset suggests the use of the elastic net variants instead.

Chapter 5 Simulations: Boundaries of Dimensionality

This chapter outlines a smaller pilot simulation designed to evaluate performative boundaries that might occur as the number of potential predictors p increases to and beyond sample size n, separately from concerns regarding outliers or non-normal error distributions. The methods used in these simulations are described, and results are presented and discussed.

5.1 Review of Relevant Findings

The current study outlines a question regarding dimensionality which was not examined in the adaptation literature the author reviewed. Additional searches regarding the lasso or elastic net outside of robustness or adaptation concerns produced a similar lack of studies. Consideration is regularly given to p < n, p > n, and the more abstract notion of p ≫ n (where "≫" means "is much greater than"). However, the region surrounding p = n does not appear to have been examined concerning the lasso, the elastic net, or the adaptations in the current study. The present simulation focuses solely on the boundary around p = n and how that boundary, and distance from it, might impact performance in the absence of outlier contamination or non-normality.

5.2 Methods and Design

5.2.1 Models and Software Implementation

The current simulations evaluate the variable selection abilities of the following methods. Section 5.2.2 includes further details about cross-validation procedures that are not specific to method or implementation. Appendix B.4 provides the code used to generate simulated predictor and response data, and Sections 5.2.4 and 3.2.5 outline the procedure used to generate these data. Examples of code for methods with and without the adaptive lasso tuning hyperparameter can be found in Appendices B.5 and B.6, respectively.

• Standard Lasso and Elastic Net The standard lasso and elastic net were included as a benchmark against which to compare the other methods. Both are implemented using the cv.glmnet function from the glmnet package (Friedman et al., 2019).

• Adaptive Lasso and Elastic Net The weights for the adaptive lasso tuning hyperparameter were applied using the "penalty.factor" argument of the cv.glmnet function from the glmnet package (Friedman et al., 2019). This argument only applies the weights for a given weighting hyperparameter and does not conduct cross-validation to select it. This process is addressed in Section 5.2.2.

• Multi-Step Adaptive Elastic Net The multi-step adaptive elastic net was conducted using the msaenet function from the msaenet package (Xiao and Xu, 2019). This function includes a built-in procedure for calculating the initial coefficient estimates used for subsequent weighting during the adaptive lasso process. The initial ridge estimates and subsequent weight application were thus all implemented within this function rather than through running a preliminary model, although cross-validation still needed to be incorporated for selecting optimal values of the weighting hyperparameter (selected using the same procedure described in Section 5.2.2).
The original study does not outline the number of stages used in their simulations, nor a recommended number of stages. Therefore, the number of stages k was arbitrarily set to 10 (a choice discussed further in the general discussion in Chapter 7).

• Adaptive LAD Lasso and Elastic Net The adaptive LAD lasso and elastic net were implemented using the cv.hqreg function from the hqreg package (Yi, 2017) by setting the "method" argument to "quantile" and the corresponding hyperparameter "tau" argument to 0.5 for the LAD lasso criterion. This function includes a "penalty.factor" argument similar to that of cv.glmnet, which was used to apply weights during the adaptive lasso process. An additional cross-validation procedure needed to be incorporated to determine the optimal weighting hyperparameter (see Section 5.2.2). A variable screening procedure for computational optimization developed by Tibshirani et al. (2012) was incorporated into the function by setting the "screen" argument to "SR." The authors of hqreg also developed a new, less computationally intensive screening rule, which is less conservative as a result and is detailed in their study. The original, more conservative screening rule developed by Tibshirani et al. (2012) was chosen for this study. Although the choice to use a screening rule (as opposed to none) was made at the recommendation of the developers of the hqreg package, the choice of particular rule was arbitrary on my part given the lack of specific recommendations.

• Adaptive Huberized Lasso and Elastic Net The adaptive Huberized lasso and elastic net were implemented using the cv.hqreg function from the hqreg package (Yi, 2017) by setting the "method" argument to "huber" and the corresponding transition hyperparameter "gamma" argument to 1.345 to balance robustness of the Huber loss with efficiency under ideal data conditions, per the recommendation of Huber (1981). The "penalty.factor" argument was again used to apply weights during the adaptive lasso process. An additional cross-validation procedure needed to be incorporated to determine the optimal weighting hyperparameter (see Section 5.2.2). The same screening rule was used as described for the adaptive LAD lasso and elastic net.

• Outlier-Shifted Lasso The outlier-shifted lasso was implemented using R code adapted from code generously provided by Dr. Yoonsuh Jung, one of the authors of Jung et al. (2016). Changes to the code needed to be made to correct issues with internal cross-validation procedures and internal object references. Furthermore, although an internal procedure calculates the outlier-shifting hyperparameter, the tuning hyperparameter for estimating these values required selection. The optimal value was chosen via an additional cross-validation procedure, conducted in a similar fashion to that described in Section 5.2.2. No other choices or alterations were made in the application of the custom code. The R code used to simulate the OS lasso model, including the adapted custom OS lasso code, can be found in Appendix B.7.

• Outlier-Shifted Huberized Lasso The outlier-shifted Huber lasso was implemented using a custom R function adapted from code generously provided by Dr. Yoonsuh Jung, one of the authors of Jung et al. (2016). Changes to the custom function needed to be made to correct issues with internal cross-validation procedures and internal object references.
As opposed to the standalone OS implementation, potential values for the outlier-shifting hyperparameter were chosen from the same potential values as the regularization tuning hyperparameters, and no other cross-validation procedures needed to be incorporated to run the simulated models. No other choices or alterations were made in the application of the custom code. The R code used to simulate the OS Huber lasso model, including the adaptation of the custom R function, can be found in Appendix B.8.

5.2.2 Hyperparameter Selection

For all instances of the elastic net, the balancing hyperparameter α was set to 0.5. Values of 0.75 and 0.9 were also considered, but discarded due to the lack of meaningful differences on primary performance metrics in initial simulations (a potential limitation considered further in the general discussion in Chapter 7).

It is standard to select tuning hyperparameters via a cross-validation procedure rather than running models such as the lasso or ridge regression with a pre-determined value. No standard exists for the cross-validation procedure itself, so these simulations use the following procedure. Regularization tuning hyperparameters were chosen from 100 logarithmically equidistant values between 0.01 and 1400 by 5-fold cross-validation. (Multiple studies cite Lambert-Lacroix and Zwald (2011) as the original source for this recommendation; Lambert-Lacroix and Zwald (2011) does not, however, provide a rationale for the choice.)

When utilizing the adaptive lasso tuning hyperparameter, a preliminary 5-fold cross-validated ridge regression was conducted using the same 100 lambda values, and the resulting coefficients were used to determine the weights vector. Unless otherwise specified, the initial ridge step was performed using the cv.glmnet function in the glmnet package (Friedman et al., 2019); for the multi-step adaptive elastic net, this initial step was conducted internally, although arguments were specified to implement the same procedure using the same sequence of potential values. A subsequent 5-fold cross-validation procedure was conducted over 100 potential values of the scaling hyperparameter, chosen from the logarithmic sequence described previously. The literature provides neither a standard selection procedure for this hyperparameter nor any recommendation for the sequence of potential values to use in cross-validation.

The criterion for selection of optimal hyperparameters was mean prediction error in the cross-validation test sets (note that these are different from the test sets used to generate the test-set RMSE performance metric discussed in the next section). This metric was the only criterion available in all software implementations of the included methods and adaptations (a potential limitation considered further in the general discussion in Chapter 7).
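To make the two-stage procedure concrete, the following R sketch illustrates how the ridge-based weights and the weighted fit can be combined with cv.glmnet. It is a simplified illustration rather than the dissertation's own code (which appears in Appendices B.5 and B.6); it assumes the usual adaptive-lasso weighting w_j = 1/|beta_j|^gamma and a predictor matrix x and response y already in memory.

# Minimal sketch of the two-stage adaptive-lasso tuning described above; names are illustrative.
library(glmnet)

lambda.grid <- exp(seq(log(0.01), log(1400), length.out = 100))  # 100 log-equidistant values

# Stage 1: 5-fold cross-validated ridge regression supplies the initial coefficients
ridge.cv   <- cv.glmnet(x, y, alpha = 0, lambda = lambda.grid, nfolds = 5)
beta.ridge <- as.numeric(coef(ridge.cv, s = "lambda.min"))[-1]   # drop the intercept

# Stage 2: for each candidate scaling value, form adaptive weights and
# cross-validate the weighted lasso; keep the fit with the lowest CV error.
# (The dissertation cross-validated over its full 100-value logarithmic grid;
# a short grid is shown here purely for illustration.)
gamma.grid <- c(0.5, 1, 2)
best.fit   <- NULL
for (g in gamma.grid) {
  w   <- 1 / (abs(beta.ridge)^g + 1e-8)        # small offset avoids infinite weights
  fit <- cv.glmnet(x, y, alpha = 1, lambda = lambda.grid,
                   nfolds = 5, penalty.factor = w)
  if (is.null(best.fit) || min(fit$cvm) < min(best.fit$cvm)) best.fit <- fit
}
coef(best.fit, s = "lambda.min")   # coefficients under the selected weights and lambda

Setting alpha = 0.5 in the second-stage call gives the corresponding adaptive elastic net, and for the LAD and Huber variants the same weights vector is passed to the "penalty.factor" argument of cv.hqreg with "method" set to "quantile" or "huber."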
5.2.3 Performance Metrics

The simulations used two primary metrics to evaluate the variable selection characteristics of each method and adaptation:

• False-Positive Rate (FPR): the proportion of true zero coefficients incorrectly estimated to be non-zero

• False-Negative Rate (FNR): the proportion of non-zero coefficients incorrectly estimated to be zero

These metrics were chosen for their correspondence with the robust hypothesis-testing statistical framework. The rate of eliminating true non-zero predictors is analogous to the complement of power (power being the probability of rejecting the null hypothesis of no relationship when the alternative hypothesis is in fact true), in the absence of explicit hypothesis tests. The rate of selection of true zero coefficients similarly equates to the Type I error rate (the probability of rejecting a true null hypothesis of no relationship).

Secondary performance metrics included accuracy in both coefficient estimation and prediction. Coefficient Estimate Precision, henceforth precision, was calculated to explore how accurately each method estimated the true non-zero coefficients:

\mathrm{Precision}(\hat{\beta}) = \frac{\sum_{j=1}^{p_{nonzero}} (\hat{\beta}_j - \beta_j)^2}{p_{nonzero}} \quad (5.1)

Note that p_nonzero = 4 in all cases in this study, as all simulations featured non-zero values only in the first four elements of the coefficient vector β. Although developed independently for the purposes of this paper, the similarity of this metric to another measure of precision used by Kurnaz et al. (2018) should be noted. However, their version does not account for the number of true predictors, and they take the square root of the deviation in coefficients.

The Test-Set Root Mean Squared Error, henceforth RMSE, is calculated as follows:

\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}} \quad (5.2)

where \hat{y}_i is the model-predicted outcome value for the ith observation. To calculate the RMSE, an additional 50% of data, rounded up, was generated for each dataset using the same seed as the original dataset.

FPR and FNR were calculated internally within each model-application function. Appendix B.9 provides example code for generating RMSE and precision.
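For orientation, a compact sketch of how all four metrics can be computed for a single fitted model is shown below. It is illustrative only (the project's own code is in Appendix B.9), and the object names are placeholders: beta.true and beta.hat exclude the intercept, while x.test and y.test form the held-out test set.

# Illustrative computation of FPR, FNR, precision (Eq. 5.1), and RMSE (Eq. 5.2).
selection.metrics <- function(beta.true, beta.hat, x.test, y.test, intercept = 0) {
  true.zero    <- beta.true == 0
  true.nonzero <- !true.zero

  fpr <- mean(beta.hat[true.zero] != 0)      # true zeros estimated as non-zero
  fnr <- mean(beta.hat[true.nonzero] == 0)   # true predictors estimated as zero

  # Squared estimation error averaged over the true non-zero coefficients (Eq. 5.1)
  precision <- sum((beta.hat[true.nonzero] - beta.true[true.nonzero])^2) /
    sum(true.nonzero)

  # Test-set root mean squared prediction error (Eq. 5.2)
  y.pred <- intercept + as.numeric(x.test %*% beta.hat)
  rmse   <- sqrt(mean((y.pred - y.test)^2))

  c(FPR = fpr, FNR = fnr, Precision = precision, RMSE = rmse)
}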
5.2.4 Simulation Conditions

Although many studies exist assessing the performance of various lasso and elastic net adaptations under low and high dimensionality, there is no explicit study of how approaching the boundary between low and high dimensionality impacts model performance. The data in the current simulation were simulated to represent different ratios of potential predictors p to sample size n for a fixed sample size n = 200. The current simulation is interested only in this boundary and not in robustness to outliers or non-normality; therefore, only data from an underlying standard normal population were generated. Furthermore, due to the pilot nature of these simulations, and because these characteristics do not present a primary concern for the overall course of research, additional sample sizes were not considered, nor were different levels of true predictor sparsity in the potential predictor space (these potential limitations are discussed further in the general discussion in Chapter 7).

Data for the dimensionality simulations were generated from the following linear model:

y_i = x_i^{\prime}\beta + \epsilon_i \quad (5.3)

with \beta_1 = 0.5, \beta_2 = 1.0, \beta_3 = 1.5, \beta_4 = 2.0, and all other coefficients equal to 0. Predictor values were generated from the following distribution:

x \sim N(0, \Sigma) \quad (5.4)

where \Sigma is the p \times p identity matrix, with 1's on the diagonal and 0's otherwise. As this simulation was primarily focused on the effects of different ratios of the number of potential parameters p to sample size n, data were generated from the following mixture distributions:

\epsilon \sim (1 - \eta_y) N(0, 1) + \eta_y N(2, 5) \quad (5.5)

and

x \sim (1 - \eta_x) N(0, \Sigma) + \eta_x N(10, \Sigma), \quad (5.6)

except with all contamination levels set to \eta_x = \eta_y = 0%. In other words, all predictor and response data were generated from the standard normal distribution.

The following features were varied across data conditions:

• p = number of potential predictors. p varied across 7 levels: 8, 30, 190, 200, 210, 500, and 1000.

• n_train = training sample size. n_train comprised a single level: 200. This corresponds with a full sample size of n = 300, since the test set consists of an additional 50% of data generated under the same conditions and random seed.

This yields a total of 7 data conditions. The data conditions for p = 8 and p = 30 were taken from the outlier robustness simulations for n = 200, p = 8, η_x = η_y = 0 and n = 200, p = 30, η_x = η_y = 0, respectively. 500 iterations of each condition were generated, and the models were applied to each iteration.

5.3 Results: Boundaries of Dimensionality

All simulations in the current chapter were conducted in RStudio on a 3.80 GHz AMD Ryzen Threadripper processor with 64 GB RAM. Numerical results for the MS adaptive elastic net (yellow) will not be interpreted for potential parameter spaces of size p = 500 and p = 1000, as the null model rate in both scenarios approached 100%.

Figure 5.1: FPR, FNR, RMSE, and Precision: Boundaries of Dimensionality, n=200

Figure 5.1 displays the FPR, FNR, test-set RMSE, and coefficient precision for the current simulations. The number of potential parameters p is listed on the x-axis, while the corresponding metric is along the y-axis.

5.3.1 FPR

The clearest differences between methods and across p arise when considering average FPR trends. All methods performed their worst with smaller p, although the elastic net (royal blue), the OS Huber lasso (light blue), the OS lasso (purple), and the standard lasso (bright green) all perform notably worse, consistent with findings from previous simulations. Furthermore, all methods improve very clearly over p, although some methods do seem to deteriorate from p = 30 to p = 190. The adaptive lasso (red) and adaptive elastic net (green) both increase in average FPR over this interval, as do the adaptive Huber elastic net (brown) and adaptive Huber lasso (turquoise), although the Hubers deteriorate less severely.

Once p surpasses n_train, average FPR for all methods improves with increased p. The standard lasso (bright green) and OS lasso appear to perform the best, both falling below 3% by p = 1000. The standard elastic net, adaptive Huber lasso, and OS Huber lasso all clustered together in performance just above the standard lasso and OS lasso. The adaptive lasso and adaptive elastic net both performed worst despite rapid improvements after surpassing p = n_train. By p = 1000 potential predictors, however, their average FPR's are approximately 7%. The MS adaptive elastic net appears to improve much more slowly in average FPR after passing p = n_train, up to its final data point at p = 500.

5.3.2 FNR

Average FNR does not deviate from 0% under normality, regardless of dimensionality. The adaptive lasso (red), adaptive elastic net (green), adaptive Huber lasso (turquoise), and adaptive Huber elastic net (brown) start to diverge from 0% as p increases past the training-set sample size (the remaining methods overlap one another in the figure, as they are all at 0% FNR). However, the magnitude of these differences is minuscule.

5.3.3 RMSE

The trends of test-set RMSE differentiated as p increased.
The trends start to differentiate from p = 30 to p = 190, particularly for the adaptive lasso and adaptive elastic net. The OS Huber lasso, OS lasso, standard lasso, and adaptive Huber lasso maintain the lowest test-set RMSE's across dimensionality conditions. The adaptive lasso and adaptive elastic net showed increased prediction error closer to the p = n_train boundary. The MS adaptive elastic net's performance proves difficult to evaluate given its null model rate at p = 500 and p = 1000.

5.3.4 Precision

Precision varied across methods, although the small scale is also notable here: the magnitude of differences is on the order of two or three decimal places. The adaptive Huber lasso appears to dominate in these conditions except at the smallest value of p, and its trend deteriorates far more slowly than the other methods over p. The adaptive Huber lasso, the adaptive Huber elastic net, and the MS adaptive elastic net perform competitively as p increases up to n_train, as do the adaptive lasso and adaptive elastic net. The standard lasso, standard elastic net, and OS methods produced the worst coefficient estimates for true predictors.

The standard elastic net maintains the worst coefficient precision as p continues to increase, and deteriorates much more rapidly than the other methods besides the MS adaptive elastic net and adaptive Huber elastic net. The adaptive Huber lasso maintains the strongest performance over p > n, while both the MS adaptive elastic net and adaptive Huber elastic net deteriorate rapidly. Although still the second-strongest performer at p = 500, by p = 1000 the adaptive Huber elastic net shows average precision comparable to the majority of other methods. The MS adaptive elastic net is comparable to the majority of methods by p = 500. The OS methods, the standard lasso, and the adaptive lasso and elastic net begin to converge on comparable average precision by p = 500.

5.3.5 Null Models

Table 13 presents the null model information for the dimensionality simulations. The only method that produced null models in these simulations was the MS adaptive elastic net, which produced null models at a rate of nearly 100% at p = 500 and p = 1000.

Method        n    p     Null Models
msadaelnet5   300  500   496
msadaelnet5   300  1000  490

Table 13: Null Models: Boundaries of Dimensionality

5.4 Discussion

Despite the increased number of true zero coefficients available to be incorrectly selected into the model, FPR showed the only consistent decrease as p increased. Some methods, particularly the adaptive lasso and adaptive elastic net, seem to deteriorate over p until p = n, which appears to serve as an inflection point after which average FPR improves with p. FNR, meanwhile, averaged approximately 0% across methods and p (keeping in mind that these data are all standard-normal generated). The adaptive lasso, adaptive elastic net, adaptive Huber elastic net, and adaptive Huber lasso start to increase in average FNR at extreme values of p. However, the scale of this increase is only at the third decimal place. It would be interesting to see whether FNR continued to increase with increased p for these methods; further research might maintain n and the true predictors but increase p to 2000, 3000, or more. All methods saw deterioration in accurate non-zero coefficient estimation over p, although the rate of deterioration past p = n varies.
Concerns about the p = n boundary are likely to apply only to a narrow set of real datasets; most applied datasets are unlikely to fall near enough to this boundary for it to present a concern to applied researchers. Consequently, few practical recommendations are provided. However, the discussion attempts to highlight the most intriguing findings for future methodological researchers interested in this question.

The immediate boundary around p = n merits further study. Performance inconsistency and instability are noticeable in the 190 ≤ p ≤ 210 range for all four metrics. In the FPR plot, for instance, the adaptive lasso and adaptive elastic net clearly decrease from p = 190 to p = n = 200, and increase again from p = n = 200 to p = 210. This instability around p = n makes sense given that this boundary has mathematical significance and often denotes the point at which certain methods produce non-unique solutions or become saturated. The current simulations only evaluated performance at p = n = 200, p = 190 < n, and p = 210 > n. More granular ranges of p around sample size n would provide additional nuance to the understanding of the performance inconsistency presented by the p = n boundary. Taking n = 200, a researcher might study all p from p = 150 to p = 250.

Another factor worth considering is the value of n itself. Fluctuations around p = n might themselves vary if n, and thus p = n, becomes larger or smaller. Further quantitative research should establish an empirical distinction between p > n and p ≫ n and any impacts on performance around any empirical inflection point between the two. These simulations give more information about when the MS adaptive elastic net starts to break down, as it performs well with respect to null models until p = 500. If n changed, would the value of p corresponding to the MS adaptive elastic net's null-model consistency change as well?

A final question pertains instead to the sparsity of a process, that is, the number or proportion of true non-zero predictors relative to the total potential parameter space p. This question could also be considered within the context of sample size n, or used to contextualize n/p bounds or p > n vs. p ≫ n bounds. These additional questions, though interesting and relevant, are not examined herein and are left for consideration in future research.

The current simulations did not incorporate non-normality or outlier contamination. Typical robustness concerns such as non-normality or outlier contamination will likely produce different patterns of behavior regarding the boundaries of dimensionality. Future study of the current questions, and of the questions proposed for future research, should also consider typical questions of robustness.

Chapter 6 Real-World Data: Entry Status of Partial Hospital Patients

Given the focus of the current research on providing practical modeling tools for applied researchers, an application with a real dataset can help illustrate their utility. The current chapter utilizes data taken from the Rhode Island Methods to Improve Diagnostic Assessment and Services (MIDAS) project, a research program built on data collected from psychiatric treatment recipients in both outpatient care and partial hospitalization. The chapter begins with a brief introduction to the partial hospitalization psychiatric context and its clinical significance. A brief review of clinical research in outpatient settings relevant to the variables of interest in the current study follows.
The data-analytic methodology is outlined, followed by the results of the analyses and follow-up analyses. The chapter concludes with a discussion of the findings.

6.1 Introduction

6.1.1 Partial Hospital Programs (PHP)

Partial Hospital Programs (PHPs) provide a level of psychiatric care between outpatient and inpatient treatment. PHPs do not involve onsite residence; instead, patients attend multi-faceted day treatment for three to six hours each day. A treatment day also often involves multiple group therapy sessions. Patients typically receive individual therapy and psychiatry sessions, though not necessarily daily. PHPs typically manage intensive clinical presentations for individuals with sufficient social functioning such that inpatient care is unnecessary. Consequently, PHPs often see patients stepping down from inpatient care or referred from recent emergency psychiatric treatment. PHPs also see outpatient referrals for patients requiring more acute, intensive psychiatric care but who do not yet need full hospitalization. (Those interested in learning more about PHPs and the relatively sparse research in the PHP context are encouraged to consider the works of researchers such as Sarah J. Kertz, Kathryn McHugh, Thröstur Björgvinsson, Courtney D. Beard, and Andrew Peckham, who work or previously worked at McLean Hospital's Behavioral Health Partial Program in Massachusetts.)

A notable characteristic of PHP patients in the limited available data is the high level of psychiatric comorbidity. Kertz et al. (2012), for instance, note that their average sample patient presented with more than two diagnoses. Meanwhile, patients in the current sample received an average of nearly four mental health diagnoses. Corresponding with high rates of comorbidity, PHP patients are typically diagnostically heterogeneous, with a variety of principal diagnoses and comorbidities as defined within current diagnostic systems such as the DSM or ICD.

6.1.2 Overview of Relevant Constructs

A comprehensive review of all literature pertaining to mindfulness, depression, anxiety, and psychological flexibility is beyond the scope of this study; previous findings relevant to the results of the current analyses will be addressed as applicable. Notably, these constructs have not seen extensive study in PHP contexts, although this is likely due to the general paucity of research at this level of care. I will provide brief descriptions of the four constructs for readers without previous background in clinical psychology or psychiatry.

Anxiety and depression correspond with two broad and common realms of mental health disorder. Anxiety disorders usually focus on anticipatory fear and its consequent impacts on an individual's behavior, emotions, and thoughts. Depressive disorders also feature distorted cognitions, with particular emphasis on negative self-referential beliefs. While physiological arousal is a feature of some anxiety disorders, depression often manifests as a reduction in physiological activity. Suicidal ideation and thoughts of death are also relatively more common in the context of depressive disorders as well as bipolar disorders. Historically, the broader category of anxiety disorders contained both Obsessive-Compulsive Disorder and Post-Traumatic Stress Disorder, while depressive disorders and bipolar disorders were subsumed under a single category of Mood Disorders.

Mindfulness is a collection of techniques originally derived from Buddhist meditation practices.
Mindfulness in the research and treatment context generally emphasizes present-centered and non-judgmental experiential awareness. Mindfulness plays a prominent role in a number of the so-called "third-wave" cognitive therapies, including Acceptance and Commitment Therapy (ACT, Hayes et al., 2012), Mindfulness-Based Cognitive Therapy (MBCT, Segal et al., 2013), and Dialectical Behavior Therapy (DBT, Linehan, 2015).

Even relative to the constructs above, psychological flexibility reflects a very broad process. It has alternately been referred to in research and practice as experiential avoidance or acceptance, among other labels (Bond et al., 2011). These labels convey a construct focused on the relationship between an individual and their thoughts, particularly with respect to how that relationship might present obstacles for functioning. Psychological flexibility also plays a significant role in modern cognitive therapies and is the namesake construct underlying both ACT and Acceptance-Based Behavior Therapy (Roemer and Orsillo, 2005).

Depression and anxiety are not typically included as predictors of protective constructs such as mindfulness or psychological flexibility. Both psychological flexibility and mindfulness, on the other hand, have been frequently studied in predictive models of anxiety and depression, and depression and anxiety are occasionally included in research on the two constructs for the purposes of psychometric evaluation. Many predictive models of depression and anxiety that incorporate mindfulness and psychological flexibility utilize a mediation approach. Such models consistently find that greater mindfulness negatively predicts depression and anxiety but that psychological flexibility mediates this relationship. This finding has arisen in models of mindfulness and clinical worry in an undergraduate sample (Ruiz, 2014); in models of mindfulness, depression, and anxiety in individuals in the wake of a psychotic episode (White et al., 2013); and in models of mindfulness change in community members throughout an 8-week non-clinical mindfulness intervention (Mutch et al., 2021), among many others. Other studies, such as Calvo et al. (2020)'s study of attachment orientation and general well-being, have found that both mindfulness and psychological flexibility mediate the relationships between various mental health characteristics.

6.2 Participants

The current data comprise a subsample of daily and intake data collected from patients at Rhode Island Hospital's Adult Partial Hospital Program, which specializes in providing non-substance mental health treatment. The program runs from 9am to 2pm, with three large group sessions and one smaller interpersonal group therapy session. Patients in the interpersonal group session are divided into either high- or low-functioning groups, a PTSD-specific group, or a young adult group. The program is structured broadly around ACT, although the clinicians also incorporate skills and tasks from other paradigms depending on the needs of the current treatment group. Breaks occur between group sessions, and there is a 15-minute optional meditation session held during one of the breaks.

The first day of treatment for most patients involves an intake evaluation by one of the program psychiatrists, an initial individual therapy session with a licensed psychologist, and attendance at the regular group sessions as time allows. However, some patients instead receive a "Day 0" prior to starting treatment.
This Day 0 primarily entails a comprehensive psychiatric evaluation via the Axis I module of the Structured Clinical Interview for the DSM-IV (SCID, First and Gibbon, 2004). This evaluation also includes the Borderline Personality Disorder module of the Structured Interview for DSM-IV Personality (SIDP, Pfohl et al., 1997) and a psychosocial history. The SCID, SIDP, and psychosocial history are all administered by trained RA's.

All patients complete the short form of the Five Facet Mindfulness Questionnaire (FFMQ-SF, Bohlmeijer et al., 2011) and the Acceptance and Action Questionnaire (AAQ-II, Bond et al., 2011), which are completed only on the intake (or Day 0) and final day of treatment, in addition to daily measures of depression and anxiety symptoms.

The current sample includes 619 individuals who were administered the SCID on their initial day in the program. The data include all patients who completed the SCID and subsequently entered the program, regardless of whether or not they ultimately attended or completed the program after their Day 0. In choosing to analyze data only for patients who were administered the SCID, and only for their Day 0, I am eliminating the majority of the data collected from this program, which includes thousands of patients across more than two decades of providing treatment.

The purposes of this choice are four-fold. Given the focus on cross-sectional applications of the lasso, longitudinal data are outside the scope of the current research. Only SCIDded patients were included in the data, as their initial day in the program is more homogeneous than that of patients whose first day corresponds with a day in the program itself. Another benefit of utilizing these data is their completeness, as interviewers review self-report measures with the patient prior to administering the diagnostic assessments. Finally, RA's only administer the SCID to first-time patients. Many patients attend the program for multiple treatment occasions, with each attendance corresponding with a new ID in the dataset. Utilizing Day 0 data streamlines data processing by ensuring all observations correspond with an individual's first attendance occasion.

6.3 Measures

6.3.1 SCID-IV

The SCID-IV (hereafter "SCID") is a comprehensive symptomatological review of all Axis I and Axis II disorders from the fourth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV, Association, 1994). Axis I corresponds with mental health conditions other than personality disorders, while Axis II corresponds specifically with personality disorders. The SCID is a semi-structured interview to be administered by trained researchers; in the current setting, this interview was conducted by trained Research Assistants. Depending on diagnostic criteria and symptoms endorsed by the patient during the interview, the interviewer recommended diagnoses for the patient in a comprehensive report reviewed and confirmed by the patient's psychiatrist during their initial session in the program.

6.3.2 SIDP

The SIDP is a semi-structured clinical interview corresponding with the DSM-IV Axis II (personality) disorders and contains a module for each of the DSM-IV personality disorders. All patients receiving the SCID also received the SIDP module corresponding specifically with Borderline Personality Disorder. RA's incorporate additional SIDP modules based on clinical impression during the main interview.
6.3.3 CUDOS and CUXOS

The Clinically Useful Depression Outcome Scale (CUDOS, Zimmerman et al., 2008) and Clinically Useful Anxiety Outcome Scale (CUXOS, Zimmerman et al., 2010) were developed by the MIDAS project to provide brief and clinically practical self-report measures of mood and anxiety symptoms (see Appendices C.1 and C.2). They are intended to be completed in 5 minutes or less. Both measures demonstrated strong convergent and discriminant validity and appropriate correlations with clinical ratings of depressive and anxiety symptomatology. Patients in the PHP receive these measures daily, and items on the measures correspond with ratings for the past day. The data used in these analyses only included patients' Day 0 responses.

Items on both measures (18 and 20 items for the CUDOS and CUXOS, respectively) are intended to correspond roughly with particular DSM-IV mood or anxiety symptoms while also being understandable by the patient. The CUDOS additionally includes items relating to mood-associated impairment and quality of life. Items on both measures are rated on 5-point Likert scales, from 0 ("not at all true") to 4 ("almost always true"), with higher ratings indicating stronger depressive or anxious symptoms.

6.3.4 FFMQ-SF

The short form of the Five Facet Mindfulness Questionnaire (FFMQ-SF, Bohlmeijer et al., 2011) was developed to evaluate dispositional mindfulness. The questionnaire includes items corresponding with each of five subscales: Non-Reactivity to Inner Experience, Non-Judgment of Inner Experience, Observing, Describing, and Acting with Awareness. Both the regular and short forms have been validated in depressed and anxious clinical samples. The 24 items on the FFMQ-SF are rated on 5-point Likert scales, from 1 ("never or rarely true") to 5 ("very often or always true"). Each value will be reduced by one, resulting in a minimum rating of 0 and a maximum rating of 4, for consistency with the Likert items on the CUDOS and CUXOS. Some items are reverse-valenced and must be reverse-scored when scoring the overall measure or one of the five facets.

6.3.5 AAQ-II

The Acceptance and Action Questionnaire (AAQ-II, Bond et al., 2011) was designed to measure psychological inflexibility. The measure has shown strong psychometric properties in a variety of samples, including patients seeking outpatient mental health care. The 7 items on the AAQ-II are rated on 7-point Likert scales with values ranging from 1 ("never true") to 7 ("always true"), with greater ratings indicating psychological rigidity and lower ratings indicating greater psychological flexibility. Like the FFMQ-SF, each value on the AAQ-II will be reduced by one so that the minimum possible rating for a given item is 0 instead of 1. The mean item score was used as the main outcome variable in the current analyses.

The original psychometric study of the AAQ-II included an additional 3 positively-valenced, reverse-scored items. However, only the 7-item version included in the appendix was collected from patients prior to their SCID evaluation on Day 0.

6.4 Analyses

6.4.1 Analytic Procedure

Lasso and elastic net models were applied, with the average item score on the AAQ-II used as the outcome variable. Individual items from each of the CUDOS, CUXOS, and FFMQ-SF were included alongside the number of current mental health diagnoses and age, for a total of 64 potential predictors. Collinearity is a relevant characteristic both in applied contexts and when comparing the lasso and the elastic net.
Consequently, the analyses included individual items as potential predictors to increase the potential parameter space and, more importantly, to induce collinearity in that space. Given that items within a measure should correlate strongly, collinearity should be present in the current data. Since the analyses use individual items as predictors, the valence of reverse-scored FFMQ-SF items was maintained; these items were not reverse-scored prior to conducting the analyses. Section 6.6 describes a follow-up study investigating predictor correlations.

One hundred random training-testing splits were generated, with the training set containing 67% of the data and the testing set containing the remaining 33% (directly analogous to the 50% testing data generated for each simulated dataset in the simulation studies). Out-of-sample performance was evaluated using the remaining third of the data. Each model was applied to each of the 100 training sets and metrics obtained as described in Section 6.4.4.

6.4.2 Models and Software Implementation

The current analyses evaluate the variable selection abilities of the following methods. Section 6.2.2 includes further details about cross-validation procedures that are not specific to method or implementation. Appendix B.4 provides the code used to generate simulated predictor and response data, and Sections 6.2.4 and 3.2.5 outline the procedure used to generate those data. Examples of code for methods with and without the adaptive lasso tuning hyperparameter can be found in Appendices B.5 and B.6, respectively.

• Standard Lasso and Elastic Net The standard lasso and elastic net were included as a benchmark against which to compare the other methods. Both are implemented using the cv.glmnet function from the glmnet package (Friedman et al., 2019).

• Adaptive Lasso and Elastic Net The weights for the adaptive lasso tuning hyperparameter were applied using the "penalty.factor" argument of the cv.glmnet function from the glmnet package (Friedman et al., 2019). This argument only applies the weights for a given weighting hyperparameter and does not conduct cross-validation to select it. This process is addressed in Section 6.2.2.

• Multi-Step Adaptive Elastic Net The multi-step adaptive elastic net was conducted using the msaenet function from the msaenet package (Xiao and Xu, 2019). This function includes a built-in procedure for calculating the initial coefficient estimates used for subsequent weighting during the adaptive lasso process. The initial ridge estimates and subsequent weight application were thus all implemented within this function rather than through running a preliminary model, although cross-validation still needed to be incorporated for selecting optimal values of the weighting hyperparameter (selected using the same procedure described in Section 6.2.2). The original study does not outline the number of stages used in their simulations, nor a recommended number of stages; therefore, the number of stages k was arbitrarily set to 10 (a choice discussed further in the general discussion in Chapter 7).

• Adaptive LAD Lasso and Elastic Net The adaptive LAD lasso and elastic net were implemented using the cv.hqreg function from the hqreg package (Yi, 2017) by setting the "method" argument to "quantile" and the corresponding hyperparameter "tau" argument to 0.5 for the LAD lasso criterion. This function includes a "penalty.factor" argument similar to that of cv.glmnet, which was used to apply weights during the adaptive lasso process. An additional cross-validation procedure needed to be incorporated to determine the optimal weighting hyperparameter (see Section 6.2.2).
A variable screening procedure for computational optimization developed by Tibshirani et al. (2012) was incorporated into the function by setting the "screen" argument to "SR." The authors also developed a newer, less computationally intensive screening rule, which is less conservative as a result and is detailed in their study. The original, more conservative screening rule developed by Tibshirani et al. (2012) was chosen for this study. Although the choice to use a screening rule (as opposed to none) was made at the recommendation of the developers of the hqreg package, the choice of the particular rule was arbitrary on my part given the lack of specific recommendations.

• Adaptive Huberized Lasso and Elastic Net The adaptive Huberized lasso and elastic net were implemented using the cv.hqreg function from the hqreg package (Yi, 2017) by setting the "method" argument to "huber" and the corresponding transition hyperparameter "gamma" argument to 1.345, per the recommendation of Huber (1981), to balance robustness of the Huber loss with efficiency under ideal data conditions. The "penalty.factor" argument was again used to apply weights during the adaptive lasso process. An additional cross-validation procedure needed to be incorporated to determine the optimal weighting hyperparameter (see Section 6.2.2). The same screening rule was used as described for the adaptive LAD lasso and elastic net.

• Outlier-Shifted Lasso The outlier-shifted lasso was implemented using R code adapted from code generously provided by Dr. Yoonsuh Jung, one of the authors of Jung et al. (2016). Changes to the code needed to be made to correct issues with internal cross-validation procedures and internal object references. Furthermore, although an internal procedure calculates the outlier-shifting values themselves, the tuning hyperparameter governing their estimation required selection. Its optimal value was chosen via an additional cross-validation procedure, conducted in a similar fashion to that described in Section 6.2.2. No other choices or alterations were made in the application of the custom code. The R code used to run the OS lasso model, including the adapted custom OS lasso code, can be found in Appendix B.7.

• Outlier-Shifted Huberized Lasso The outlier-shifted Huber lasso was implemented using a custom R function adapted from code generously provided by Dr. Yoonsuh Jung, one of the authors of Jung et al. (2016). Changes to the custom function needed to be made to correct issues with internal cross-validation procedures and internal object references. As opposed to the standalone OS implementation, potential values for the outlier-shifting tuning hyperparameter were chosen from the same candidate values as the regularization tuning hyperparameter λ, and no other cross-validation procedures needed to be incorporated to run the models. No other choices or alterations were made in the application of the custom code. The R code used to run the OS Huber lasso model, including the adaptation of the custom R function, can be found in Appendix B.8.

6.4.3 Hyperparameter Selection

For all instances of the elastic net, the balancing hyperparameter α was set to 0.5. Values of 0.75 and 0.9 were also considered, but discarded due to a lack of meaningful differences on primary performance metrics in initial simulations; the potential limitation of this choice is given further consideration in the general discussion section in Chapter 7.
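To make the shared software workflow concrete, the following minimal R sketch shows how the standard and adaptive elastic net described above can be fit with cv.glmnet. It is illustrative only and is not the code from Appendices B.5 and B.6: the objects x (the 64-column predictor matrix) and y (the mean AAQ-II item score) are assumed, the inverse-magnitude weights use a fixed exponent of 1 with a small constant guarding against division by zero, and the cross-validated scaling of the adaptive weights described in Section 6.2.2 is omitted.

    library(glmnet)

    set.seed(1)
    n     <- nrow(x)
    train <- sample(n, size = floor(2 * n / 3))   # 67% training / 33% testing split

    ## 100 logarithmically equidistant candidate values between 0.01 and 1400
    lambda_seq <- sort(exp(seq(log(0.01), log(1400), length.out = 100)),
                       decreasing = TRUE)

    ## Standard elastic net: alpha = 0.5, 5-fold CV over lambda
    cv_enet <- cv.glmnet(x[train, ], y[train], alpha = 0.5,
                         nfolds = 5, lambda = lambda_seq)

    ## Adaptive variant: a preliminary cross-validated ridge fit supplies the weights
    cv_ridge <- cv.glmnet(x[train, ], y[train], alpha = 0,
                          nfolds = 5, lambda = lambda_seq)
    b_ridge  <- as.numeric(coef(cv_ridge, s = "lambda.min"))[-1]   # drop intercept
    w        <- 1 / (abs(b_ridge) + 1e-8)                          # adaptive weights

    cv_adapt <- cv.glmnet(x[train, ], y[train], alpha = 0.5,
                          nfolds = 5, lambda = lambda_seq,
                          penalty.factor = w)
    coef(cv_adapt, s = "lambda.min")   # coefficients retained at the CV-chosen lambda

Setting alpha = 1 in the final call would give the corresponding adaptive lasso; the hqreg- and msaenet-based methods follow the same general pattern through their own interfaces.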
It is standard to select tuning hyperparameters via a cross-validation procedure rather than running models such as the lasso or ridge regression with a pre-determined value. No standard exists for the cross-validation procedure itself, so these analyses use the following procedure. Regularization tuning hyperparameters were chosen from 100 logarithmically equidistant values between 0.01 and 1400 by 5-fold cross-validation (multiple studies cite Lambert-Lacroix and Zwald (2011) as the original source for this recommendation; Lambert-Lacroix and Zwald (2011) does not, however, provide a rationale for this choice).

When utilizing the adaptive lasso tuning hyperparameter, a preliminary 5-fold cross-validated ridge regression was conducted using the same 100 lambda values, and the resulting coefficients were used to determine the weights vector. Unless otherwise specified, the initial ridge step was performed using the cv.glmnet function in the glmnet package (Friedman et al., 2019); this step was conducted internally for the multi-step adaptive elastic net, although arguments were specified to implement the same procedure using the same sequence of potential λ2 values. A subsequent 5-fold cross-validation procedure was conducted over 100 potential values of the adaptive scaling hyperparameter, chosen from the logarithmic sequence described previously. The literature does not provide a standard selection procedure for this hyperparameter. Furthermore, the literature does not make any recommendations for the sequence of potential values to use in cross-validation.

The criterion for selection of optimal hyperparameters was mean prediction error in the cross-validation test sets (note that these are distinct from the held-out test sets used to generate one of the performance metrics, test-set RMSE, discussed in the next section). This metric was the only metric available in all software implementations of the included methods and adaptations; the potential limitation of this choice is given further consideration in the general discussion section in Chapter 7.

6.4.4 Performance Metrics

Given that the true data-generating mechanism is unknown, FPR, FNR, and coefficient precision do not provide useful performance information. Therefore, the primary metric of interest in the current analyses is test-set root mean squared error, henceforth RMSE. RMSE is calculated as follows:

RMSE = \sqrt{ \frac{ \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 }{ n } }     (6.1)

where \hat{y}_i is the model-predicted outcome value for the ith observation and y_i is the corresponding observed value. RMSE was calculated by applying the models generated from the two-thirds training data to the remaining third of the data. Additionally, the number of non-zero coefficients and the model run-time for each training-testing split were collected.

The 15 most frequently selected predictors across all 100 training-testing splits for each model were collected to assess variable selection consistency and compare each model's selection tendencies.

To assess multicollinearity, the Variance Inflation Factor (VIF) was calculated for each lasso and elastic net adaptation. VIF is calculated as follows:

VIF_{j,adapt} = \frac{1}{1 - R_j^2},     (6.2)

for all predictors j selected as a top-15 predictor in any adaptation, where R_j^2 is the unadjusted coefficient of determination for the corresponding adaptation adapt with predictor j as the outcome and all other variables as potential predictors. All hyperparameters in these models were selected via cross-validation as described previously. VIF was calculated in this way for all training-testing splits, in models run on both the training and testing data.
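Continuing the illustrative sketch above, the following lines show how the two metrics can be computed; cv_fit stands for any cross-validated fit trained on x[train, ] (such as cv_adapt), and, for clarity only, the auxiliary regression for VIF uses ordinary least squares, whereas the dissertation analyses used the corresponding lasso or elastic net adaptation with cross-validated hyperparameters for that step.

    ## Test-set RMSE (Equation 6.1) on the held-out third of the data
    pred_test <- as.numeric(predict(cv_fit, newx = x[-train, ], s = "lambda.min"))
    rmse_test <- sqrt(mean((pred_test - y[-train])^2))

    ## VIF (Equation 6.2), here computed for the predictors retained in this fit;
    ## the analyses computed it for the top-15 predictors across adaptations.
    beta_hat <- as.numeric(coef(cv_fit, s = "lambda.min"))[-1]
    selected <- which(beta_hat != 0)
    vif_one  <- function(X, j) {
      r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared   # predictor j on all others
      1 / (1 - r2)
    }
    vif_train <- sapply(selected, function(j) vif_one(x[train, ], j))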
Meaningful differences were not observed in VIF between corresponding training and testing sets for any given model, and thus only the training-set VIF is reported. VIF could not be reliably calculated for either of the outlier-shifting methods, and thus VIF is not reported for those methods. Substantively, the square root of VIF is interpreted as the multiplicative increase in the corresponding coefficient's standard error relative to its standard error if the predictor had no correlation with other predictors. A VIF of 10 is commonly used as the benchmark indicating strong multicollinearity in a model.

6.5 Results

All primary real-data analyses were conducted in RStudio on a 3.80 GHz AMD Ryzen Threadripper processor with 64 GB RAM.

Table 14 presents the demographic characteristics of the analytic sample of the current study. The mean age of the current sample was 35.3 years (SD = 13.90). The sample was predominantly White (77.54%) and had not obtained a college degree (56.54%). For the majority of the period comprising the current data, the demographic form patients filled out on their initial day in the program had only a single question assessing Gender, and the only response options were "Female" and "Male." In 2015, the program added additional response options to increase inclusivity, resulting in the two additional options listed ("Other" and "Non-Binary," neither of which was endorsed by participants in the current sample for whom these options were available). A fifth option, "Unknown," was added for research purposes for patients who did not endorse an option under the previous structure of the question. Due to this recent change in the question, "Male" and "Female" are likely overreported in this table and "Other"/"Non-Binary" are likely underreported. This corresponds with patients who might have endorsed these options previously but did not, given the response options available to them at the time.

                                   Mean or N    SD or %
  Age
    Age                               35.3       13.90
  Education
    Grades 7-12                         30        4.85
    High School                         75       12.12
    GED                                 31        5.01
    Some College                       214       34.57
    2-Year College                      55        8.89
    BA/BS                              102       16.48
    Some Graduate School                48        7.75
    Graduate School                     55        8.89
  Gender Identity
    Female                             420       67.85
    Male                               179       28.92
    Other                               13        2.10
    Unknown                              7        1.13
  Race/Ethnicity
    White                              480       77.54
    Black                               29        4.68
    Hispanic/Latinx                     55        8.89
    Asian                               19        3.07
    Portuguese                          13        2.10
    Other                               23        3.72

Table 14: Demographic Characteristics of PHP SCID Sample, N = 619

Table 15 presents primary diagnoses for all 619 SCID-administered patients included in the current analyses. For patients receiving initial psychiatric evaluation via SCID interview, the most common primary DSM-IV diagnosis was major depressive disorder (MDD) without psychotic symptoms, accounting for nearly half of all primary diagnoses in the sample. Post-traumatic stress disorder (PTSD, 8.40%), bipolar disorder without psychotic symptoms (8.08%; including both bipolar I and bipolar II regardless of whether the current episode was depressive or manic/hypomanic), and generalized anxiety disorder (GAD, 7.11%) were the next most common primary diagnoses. All other primary diagnoses each represented less than 5% of the overall sample.
Both the MDD and bipolar disorder categories comprise multiple distinctions, such as recurrent vs. single-episode MDD. Substance use disorder (SUD), adjustment disorder, and eating disorder all correspond with entire DSM-IV disorder categories and thus comprise multiple diagnoses. "Other Anxiety Disorders" and "Other Mood Disorders" are heterogeneous groupings that include Not Otherwise Specified (NOS) disorders. Uncommon disorders in the current sample, including (but not limited to) Dysthymia, Cyclothymia, Specific Phobias, or Agoraphobia without Panic Disorder, are also contained in these categories.

Table 16 presents the percentage of the sample with any of the listed diagnoses upon evaluation, including non-primary diagnoses (note that substance use disorder, impulse control disorder, somatoform disorder, adjustment disorder, and eating disorder all correspond with entire DSM-IV disorder categories and thus comprise multiple diagnoses). Consequently, the percentage column adds up to more than 100, and N likewise adds up to more than 619. Note also that MDD and bipolar disorder percentages include patients both with and without psychotic symptoms. The "Psychosis" category comprises all DSM-IV diagnoses for which psychotic symptoms were present. This includes MDD and bipolar disorder with psychotic symptoms, disorders classified under DSM-IV "Schizophrenia and other psychotic disorders," and DSM-IV Cluster A personality disorders.

More than 62% of patients received a diagnosis of MDD, and more than half of patients received a diagnosis of GAD. DSM-IV social phobia was diagnosed in one-third of patients, and PTSD and panic disorder were diagnosed in approximately one-quarter of patients each. ADHD, SUDs, and borderline personality disorder were diagnosed in roughly one-fifth of SCID-evaluated patients.

  Primary DSM-IV Diagnosis                         N    Percent
  Major Depressive Disorder w/o Psychosis        292     47.17
  Post-Traumatic Stress Disorder                  52      8.40
  Bipolar Disorder w/o Psychosis                  50      8.08
  Generalized Anxiety Disorder                    44      7.11
  Adjustment Disorder                             27      4.36
  Borderline Personality Disorder                 27      4.36
  Panic Disorder, w/ or w/o Agoraphobia           23      3.72
  Other Mood Disorder                             20      3.23
  Other Anxiety Disorder                          15      2.42
  Non-Mood Psychotic Disorder                     14      2.26
  Major Depressive Disorder w/ Psychosis          13      2.10
  Other                                           11      1.78
  Obsessive-Compulsive Disorder                    8      1.29
  Substance Use Disorder                           7      1.13
  Bipolar Disorder w/ Psychosis                    6      0.97
  Multiple Co-Primary Diagnoses                    6      0.97
  Eating Disorder                                  4      0.65

Table 15: Primary DSM-IV Diagnoses Upon Initial PHP SCID Evaluation

  Current DSM-IV Diagnoses                         N    Percent
  Major Depressive Disorder                      386     62.36
  Generalized Anxiety Disorder                   318     51.37
  Social Phobia                                  205     33.12
  Post-Traumatic Stress Disorder                 157     25.36
  Panic Disorder, w/ or w/o Agoraphobia          154     24.88
  Attention-Deficit Hyperactivity Disorder       122     19.71
  Substance Use Disorder                         116     18.74
  Borderline Personality Disorder                113     18.26
  Dysthymia                                       70     11.31
  Bipolar Disorder                                64     10.34
  Specific Phobia                                 61      9.85
  Impulse Control Disorder                        51      8.24
  Somatoform Disorder                             50      8.08
  Psychosis                                       40      6.46
  Obsessive-Compulsive Disorder                   38      6.14
  Adjustment Disorder                             33      5.33
  Eating Disorder                                 31      5.01

Table 16: All Current DSM-IV Diagnoses Upon Initial PHP SCID Evaluation

Figure 6.1 displays boxplots of test-set RMSE, run-time, and number of non-zero coefficients for all 11 models across all 100 training-testing splits. The applied analyses did not produce any null models. Median RMSE was remarkably stable within any given method's 100 training-testing splits and across methods. The multi-step adaptive elastic net (yellow), outlier-shifted Huber lasso (light blue), and outlier-shifted lasso (purple), however, each underperformed relative to the other models and showed much greater variability in predictive performance. The IQR boxes for each of these three methods fell far outside the highest whisker of the other methods. Of the three, the MS adaptive elastic net had the worst median RMSE and the narrowest distribution of RMSE values. The OS lasso showed a similar distribution of RMSE values, although its distribution was slightly lower in value overall, as seen by all significant points in the box and whiskers falling lower on the plot. The OS Huber lasso, meanwhile, showed a median RMSE between those of the MS adaptive elastic net and the OS lasso. However, its distribution was far wider, particularly regarding larger RMSE values.
The OS Huber lasso also produced wider IQRs compared to the other two underperforming methods. As discussed in previous chapters, the process by which RMSE tends to be so stable and comparable across methods is unclear and merits further study, as the author was unable to locate any code issues that might have produced these observations. However, the three underperforming models were also the only three models which did not include an intercept term and could only generate coefficients for predictors.

The largest variability across methods in terms of performance metrics was observed in run-time. The OS lasso, OS Huber lasso, adaptive Huber lasso (turquoise), standard lasso (bright green), and standard elastic net (royal blue) all had median run-times in the single digits, while the adaptive lasso (red), adaptive elastic net (green), and adaptive Huber elastic net (brown) all produced run-times of approximately 10 seconds. Notably, the run-time of each of these methods was extremely stable, as demonstrated by the very narrow box-and-whiskers for each method, although the OS Huber lasso did see a handful of larger outlying run-times, and the OS lasso produced a single large outlying run-time along with a slightly longer upper whisker.

Figure 6.1: Test-Set RMSE, Run-Time (seconds), and Number of Non-Zero Coefficients for Day 0 PHP Patients

The longest-running methods were the adaptive LAD lasso (black), adaptive LAD elastic net (pink), and the MS adaptive elastic net. The adaptive LAD methods were also the most widely distributed in terms of run-time, although the adaptive LAD lasso never exceeded 40 seconds and still showed a fairly tight distribution.
The adaptive LAD elastic net, on the other hand, was by far the most variable in run-time, with its lower 1.5*IQR whisker reaching just below 40 seconds and its largest outlying run-times as high as 80 seconds. Its median run-time was approximately 56 seconds. Finally, the MS adaptive elastic net had the highest median run-time of roughly 60 seconds, though it showed a very narrow distribution of run-times. These run-times likely underestimate the run-times that the average researcher would experience: the computer used for these analyses was built specifically for running intensive simulations and thus is not comparable to the average personal computer.

The MS adaptive elastic net generally produced the sparsest models, selecting a median of 17-18 coefficients, while the adaptive lasso and elastic net, the next-sparsest models on average, selected a median of approximately 22 coefficients. The adaptive LAD and adaptive Huber methods had the next-sparsest models on average after the adaptive lasso and adaptive elastic net, with a median of 25 non-zero coefficients for all four methods. The adaptive LAD formulations behaved more similarly to each other in distribution, and likewise for the adaptive Huber formulations. The adaptive LADs produced narrower distributions compared to the adaptive Hubers. The adaptive LAD lasso showed a slightly wider distribution towards larger values, as observed by the wider IQR. On the other hand, the elastic net formulation produced two large outlying observations of 41 and 47 coefficients.

The OS Huber lasso tended to select the most coefficients, with a median of 31 coefficients, an upper whisker extending almost to 50 coefficients, and a single outlying observation of approximately 51-52 coefficients. The standard lasso, standard elastic net, and OS lasso selected an average of 29, 30, and 27-28 coefficients, respectively. The distributions were also fairly similar, although the OS lasso produced a shorter upper whisker and longer lower whisker than the lasso or elastic net. The standard lasso did not show any outlying values of non-zero coefficients selected, while the elastic net and OS lasso had 2-4 outlying values near 40 coefficients.

Twenty-one unique predictors comprised the top 15 most-selected coefficients across all methods. Those predictors are listed below, along with the scale or subscale corresponding to each question. Each predictor also lists the number of methods that frequently selected it and its average selection percentage into the model across all training/testing splits and all methods (a total of 100 splits * 11 methods = 1100 total models).

• FFMQ-SF Observe "Generally, I pay attention to sounds, such as clocks ticking, birds chirping, or cars passing." 1 Method. Total Selection Percentage: 66.0.
• FFMQ-SF Act with Awareness "I rush through activities without being really attentive to them." 1 Method. Total Selection Percentage: 61.5.
• CUDOS "I felt sad or depressed." 2 Methods. Total Selection Percentage: 35.2.
• CUXOS "I felt scared." 2 Methods. Total Selection Percentage: 71.8.
• FFMQ-SF Act with Awareness "I find it difficult to stay focused on what's happening in the present moment." 3 Methods. Total Selection Percentage: 77.1.
• CUDOS "I had problems concentrating." 5 Methods. Total Selection Percentage: 74.6.
• CUDOS "Overall, how much have symptoms of depression interfered with or caused difficulties in your life during the past week?" (response range "0 - Not at all" to "4 - Extremely"). 6 Methods. Total Selection Percentage: 75.9.
• CUDOS "My appetite was poor and I didn't feel like eating." 6 Methods. Total Selection Percentage: 74.8.
• CUXOS "I had muscle tension or muscle aches." 9 Methods. Total Selection Percentage: 83.4.
• FFMQ-SF Nonreactivity "When I have distressing thoughts or images, I feel calm soon after." 10 Methods. Total Selection Percentage: 90.6.
• FFMQ-SF Nonreactivity "When I have distressing thoughts or images, I just notice them and let them go." 10 Methods. Total Selection Percentage: 87.2.
• CUXOS "I felt nervous or anxious." 11 Methods. Total Selection Percentage: 93.4.
• CUDOS "I thought I was a failure." 11 Methods. Total Selection Percentage: 99.0.
• CUDOS "I thought that the future looked hopeless." 11 Methods. Total Selection Percentage: 99.5.
• CUDOS "How would you rate your overall quality of life during the past week?" (response range "0 - Very good, my life could hardly be better" to "4 - Very bad, my life could hardly be worse"). 10 Methods. Total Selection Percentage: 96.0.
• FFMQ-SF Act with Awareness "I find myself doing things without paying attention." 11 Methods. Total Selection Percentage: 97.7.
• FFMQ-SF Nonjudgement "I think some of my emotions are bad or inappropriate and I shouldn't feel them." 11 Methods. Total Selection Percentage: 99.9.
• FFMQ-SF Nonjudgement "I disapprove of myself when I have illogical ideas." 11 Methods. Total Selection Percentage: 99.5.
• FFMQ-SF Nonreactivity "I watch my feelings without getting carried away by them." 11 Methods. Total Selection Percentage: 91.5.
• FFMQ-SF Nonreactivity "When I have distressing thoughts or images, I don't let myself be carried away by them." 11 Methods. Total Selection Percentage: 97.2.
• Number of current mental health diagnoses. 11 Methods. Total Selection Percentage: 98.1.

Out of the 21 items comprising all methods' top-15 predictors, two ("FFMQ-AA: Rush Through Activities" and "FFMQ-OB: Notice Environment Sounds") were selected frequently by only one method; two ("Anx: Scared" and "Dep: Sad/Depressed") were selected frequently by only two methods; one ("FFMQ-AA: Difficulty Focusing on Present") by only three methods; and one ("Dep: Difficulty Concentrating") by only five methods. Two further items ("Dep: Poor Appetite" and "Depression-Related Impairment") were top-15 predictors in six methods, just over half of the 11 total methods studied. One item ("Anx: Muscle Tension/Aches") was a top-15 predictor for 9 out of 11 methods, two ("FFMQ-NR: Calm Soon After Distress" and "FFMQ-NR: Let Distress Drift Away") were selected frequently by 10 out of 11 methods, and the remaining 10 items were top-15 predictors in all 11 methods.

Three of the five items selected frequently by three or fewer methods come from the FFMQ-SF and correspond broadly with external awareness. At the same time, the other two were highly general depression and anxiety items. One of the most-selected predictors also related to external awareness. The results show a gap between 6-or-fewer and 9-or-more predictors: no item was a top-15 predictor in 7 or 8 of the 11 methods. A selection percentage gap of 7.5% between the 9-method item and the more-selected of the two 6-method items is also present (83.4% vs. 75.9%). The general "sad/depressed" item on the CUDOS showed a very low overall selection percentage of 35.2%; the next-lowest item in terms of selection percentage was the FFMQ-SF item "I rush through activities without being really attentive to them," which was selected into 61.5% of all 1100 models. Half of the ten items selected consistently by all 11 methods were FFMQ-SF items: two each from the Nonjudgment and Nonreactivity subscales and one from the Act with Awareness subscale.
Four out of five of these FFMQ-SF items were selected in at least 97% of all 1100 models. Only one Nonreactivity item, "Usually when I have distressing thoughts or images I can just notice them without reacting," was not a top-15 predictor. There might be a similar explanation for the high selection probability of the one Act with Awareness item compared to the similar but less-selected Act with Awareness/Observe items on this list. This singular item might simply capture most of the variance associated with the concept of external and active awareness in predicting psychological flexibility, likely owing to the intentional relatedness of these items within the overall FFMQ scale. Quantitatively speaking, the selection of one subscale item over the others is likely a relic of some lasso methods' tendency to arbitrarily select a single predictor from among a group of correlated predictors.

One of the all-11 items was a general anxiety question on the CUXOS; two were CUDOS items relating to symptoms of worthlessness and hopelessness, respectively; one was a general quality-of-life item from the CUDOS; and the last all-11 predictor was the number of current mental health diagnoses. The CUXOS item was selected by 93.4% of models, while the three CUDOS items and the mental health diagnoses item were all selected in at least 96% of models.

The five most-frequently-selected items across all 1100 models were:

• Number of current mental health diagnoses. 11 Methods. Total Selection Percentage: 98.1.
• CUDOS "I thought I was a failure." 11 Methods. Total Selection Percentage: 99.0.
• CUDOS "I thought that the future looked hopeless." 11 Methods. Total Selection Percentage: 99.5.
• FFMQ-SF Nonjudgement "I disapprove of myself when I have illogical ideas." 11 Methods. Total Selection Percentage: 99.5.
• FFMQ-SF Nonjudgement "I think some of my emotions are bad or inappropriate and I shouldn't feel them." 11 Methods. Total Selection Percentage: 99.9.
29 “Usually when I have distressing thoughts or images I can just notice them without reacting.” 137 Standard Elastic Net LabelShort Non-Zero Frequency Mean in Non-Zero Models SE in Non-Zero Models Training-Set VIF, All Models Training-Set VIF, All Models Anx: Nervous/Anxious 100 0.090 0.032 2.972 0.177 Dep: I’m a Failure 100 0.110 0.021 2.930 0.142 Dep: Hopelessness 100 0.100 0.018 2.576 0.145 FFMQ-NJ: Some of My Emotions Are Bad 100 0.119 0.017 1.901 0.075 FFMQ-AA: Absent-Minded Activity 100 0.093 0.019 2.519 0.138 FFMQ-NJ: Judging My Illogical Ideas 100 0.110 0.014 1.691 0.071 # of Current Diagnoses 100 0.060 0.009 1.343 0.030 Quality of Life 99 0.086 0.020 2.125 0.073 FFMQ-NR: Not Carried Away by Distress 99 -0.099 0.023 1.779 0.066 FFMQ-NR: Not Carried Away by Emotions 98 -0.082 0.019 1.831 0.067 FFMQ-NR: Calm Soon After Distress 98 -0.070 0.022 1.407 0.037 FFMQ-NR: Let Distress Drift Away 98 -0.067 0.020 1.811 0.080 Anx: Muscle Tension/Aches 95 0.040 0.014 1.578 0.043 Dep: Poor Appetite 95 0.037 0.012 1.789 0.059 Depression-Related Impairment 95 0.055 0.016 2.753 0.096 b l o o p b l o o p b l o o p b l o o p b l o o p b l o o p b l o o p b l o o p b l o o op b l o o p b l o o p b l o o p b l o o p b l o o p Table 17: 15 Most-Frequently Selected PHP Coe cients: Standard Elastic Net b l oop Standard Lasso LabelShort Non-Zero Frequency Mean in Non-Zero Models SE in Non-Zero Models Training-Set VIF, All Models Training-Set VIF, All Models Dep: I’m a Failure 100 0.113 0.023 2.904 0.134 Dep: Hopelessness 100 0.103 0.019 2.576 0.155 FFMQ-NJ: Some of My Emotions Are Bad 100 0.121 0.018 1.895 0.074 FFMQ-AA: Absent-Minded Activity 100 0.096 0.022 2.516 0.147 FFMQ-NJ: Judging My Illogical Ideas 100 0.112 0.015 1.688 0.070 # of Current Diagnoses 100 0.062 0.010 1.344 0.028 Anx: Nervous/Anxious 99 0.095 0.033 2.962 0.162 FFMQ-NR: Not Carried Away by Distress 99 -0.102 0.024 1.777 0.064 Quality of Life 98 0.086 0.020 2.119 0.072 FFMQ-NR: Not Carried Away by Emotions 98 -0.082 0.021 1.821 0.069 FFMQ-NR: Calm Soon After Distress 98 -0.071 0.023 1.407 0.038 FFMQ-NR: Let Distress Drift Away 98 -0.068 0.022 1.793 0.077 Dep: Poor Appetite 95 0.038 0.012 1.782 0.057 Anx: Muscle Tension/Aches 94 0.041 0.013 1.575 0.041 Depression-Related Impairment 94 0.055 0.018 2.737 0.096 b l oop b l oop b l oop b l oop b l oop b l oop b l oop b l oop b l ooop b l oop b l oop b l oop b l oop Table 18: 15 Most-Frequently Selected PHP Coe cients: Standard Lasso 138 Adaptive Elastic Net LabelShort Non-Zero Frequency Mean in Non-Zero Models SE in Non-Zero Models Training-Set VIF, All Models Training-Set VIF, All Models Dep: I’m a Failure 100 0.119 0.024 3.014 0.144 Dep: Hopelessness 100 0.112 0.022 2.707 0.142 FFMQ-NJ: Some of My Emotions Are Bad 100 0.133 0.021 1.913 0.076 FFMQ-AA: Absent-Minded Activity 100 0.114 0.027 2.538 0.147 FFMQ-NJ: Judging My Illogical Ideas 100 0.126 0.015 1.718 0.079 # of Current Diagnoses 100 0.063 0.013 1.341 0.035 FFMQ-NR: Not Carried Away by Distress 98 -0.118 0.027 1.808 0.072 Anx: Nervous/Anxious 96 0.124 0.035 3.030 0.164 Quality of Life 96 0.107 0.025 2.160 0.078 FFMQ-NR: Calm Soon After Distress 96 -0.085 0.026 1.422 0.038 FFMQ-NR: Not Carried Away by Emotions 94 -0.092 0.023 1.844 0.067 FFMQ-NR: Let Distress Drift Away 91 -0.083 0.025 1.842 0.080 Depression-Related Impairment 87 0.059 0.020 2.759 0.094 Anx: Muscle Tension/Aches 78 0.054 0.018 1.579 0.046 Dep: Difficulty Concentrating 76 0.042 0.018 2.722 0.103 b l o o p b l o o p b l o o p b l o o p b l o o p b l o o p b l o o p b l o 
o p b l o o op b l o o p b l o o p b l o o p b l o o p Table 19: 15 Most-Frequently Selected PHP Coe cients: Adaptive Elastic Net Adaptive Lasso LabelShort Non-Zero Frequency Mean in Non-Zero Models SE in Non-Zero Models Training-Set VIF, All Models Training-Set VIF, All Models Dep: I’m a Failure 100 0.119 0.024 2.977 0.144 Dep: Hopelessness 100 0.113 0.022 2.666 0.140 FFMQ-NJ: Some of My Emotions Are Bad 100 0.133 0.020 1.911 0.077 FFMQ-NJ: Judging My Illogical Ideas 100 0.127 0.015 1.717 0.077 # of Current Diagnoses 100 0.064 0.013 1.337 0.033 FFMQ-NR: Not Carried Away by Distress 98 -0.119 0.028 1.799 0.073 FFMQ-AA: Absent-Minded Activity 98 0.116 0.026 2.529 0.141 Anx: Nervous/Anxious 96 0.124 0.037 3.024 0.203 Quality of Life 96 0.107 0.022 2.139 0.082 FFMQ-NR: Not Carried Away by Emotions 95 -0.091 0.023 1.837 0.065 FFMQ-NR: Calm Soon After Distress 95 -0.087 0.026 1.427 0.041 FFMQ-NR: Let Distress Drift Away 93 -0.083 0.025 1.831 0.077 Depression-Related Impairment 86 0.061 0.022 2.757 0.093 Anx: Muscle Tension/Aches 82 0.054 0.021 1.581 0.042 Dep: Difficulty Concentrating 76 0.044 0.018 2.723 0.112 b l o o p b l o o p b l o o p b l o o p b l o o p b l o o p b l o o p b l o o p b l o o op b l o o p b l o o p b l o o p b l o o p Table 20: 15 Most-Frequently Selected PHP Coe cients: Adaptive Lasso 139 b l oo p Multi-Step Adaptive Elastic Net LabelShort Non-Zero Frequency Mean in Non-Zero Models SE in Non-Zero Models Training-Set VIF, All Models Training-Set VIF, All Models FFMQ-NJ: Some of My Emotions Are Bad 99 0.134 0.017 1.953 0.079 Dep: Hopelessness 98 0.117 0.021 2.729 0.117 FFMQ-NJ: Judging My Illogical Ideas 98 0.129 0.018 1.772 0.075 FFMQ-AA: Absent-Minded Activity 97 0.131 0.026 2.607 0.154 # of Current Diagnoses 97 0.073 0.012 1.384 0.036 Dep: I’m a Failure 92 0.128 0.023 3.091 0.106 FFMQ-NR: Not Carried Away by Distress 88 -0.131 0.026 1.861 0.081 Quality of Life 85 0.116 0.023 2.229 0.070 Anx: Nervous/Anxious 84 0.145 0.030 3.102 0.163 FFMQ-NR: Not Carried Away by Emotions 80 -0.101 0.019 1.892 0.074 FFMQ-NR: Calm Soon After Distress 77 -0.102 0.022 1.463 0.053 FFMQ-NR: Let Distress Drift Away 77 -0.102 0.022 1.883 0.082 Anx: Muscle Tension/Aches 64 0.072 0.016 1.591 0.055 Dep: Poor Appetite 61 0.068 0.012 1.852 0.069 Depression-Related Impairment 61 0.082 0.011 2.782 0.099 b l oo p b l oo p b l oo p b l oo p b l oo p b l oo p b l oo p b l oo p b l oo op b l oo p b l oo p b l oo p b l oo p Table 21: 15 Most-Frequently Selected PHP Coe cients: Multi-Step Adaptive Elastic Net Adaptive Huber Elastic Net LabelShort Non-Zero Frequency Mean in Non-Zero Models SE in Non-Zero Models Training-Set VIF, All Models Training-Set VIF, All Models Anx: Nervous/Anxious 100 0.103 0.038 3.024 0.175 Dep: I’m a Failure 100 0.113 0.024 2.990 0.131 Dep: Hopelessness 100 0.112 0.020 2.635 0.152 FFMQ-NJ: Some of My Emotions Are Bad 100 0.132 0.020 1.906 0.073 FFMQ-AA: Absent-Minded Activity 100 0.093 0.025 2.529 0.154 FFMQ-NJ: Judging My Illogical Ideas 100 0.104 0.015 1.696 0.078 # of Current Diagnoses 100 0.060 0.011 1.320 0.034 Quality of Life 98 0.092 0.022 2.151 0.082 FFMQ-NR: Not Carried Away by Distress 98 -0.111 0.028 1.805 0.075 FFMQ-NR: Not Carried Away by Emotions 94 -0.077 0.020 1.839 0.069 FFMQ-NR: Calm Soon After Distress 93 -0.068 0.023 1.414 0.038 FFMQ-NR: Let Distress Drift Away 93 -0.069 0.023 1.820 0.070 Anx: Muscle Tension/Aches 88 0.043 0.015 1.569 0.044 Dep: Difficulty Concentrating 88 0.043 0.016 2.709 0.113 Depression-Related Impairment 87 0.048 0.016 2.749 0.101 b l oop b 
l oop b l oop b l oop b l oop b l oop b l oop b l oop b l ooo p b l oop b l oop b l oop b l oop Table 22: 15 Most-Frequently Selected PHP Coe cients: Adaptive Huber Elastic Net 140 bl oo p Adaptive Huber Lasso LabelShort Non-Zero Frequency Mean in Non-Zero Models SE in Non-Zero Models Training-Set VIF, All Models Training-Set VIF, All Models Dep: I’m a Failure 100 0.115 0.025 2.954 0.117 Dep: Hopelessness 100 0.113 0.021 2.572 0.132 FFMQ-NJ: Some of My Emotions Are Bad 100 0.132 0.022 1.884 0.072 FFMQ-NJ: Judging My Illogical Ideas 100 0.102 0.017 1.688 0.075 # of Current Diagnoses 100 0.062 0.010 1.325 0.033 FFMQ-AA: Absent-Minded Activity 99 0.094 0.025 2.503 0.145 Anx: Nervous/Anxious 98 0.106 0.037 2.976 0.168 Quality of Life 98 0.089 0.021 2.131 0.079 FFMQ-NR: Not Carried Away by Distress 98 -0.113 0.027 1.787 0.070 FFMQ-NR: Not Carried Away by Emotions 95 -0.074 0.020 1.834 0.070 FFMQ-NR: Calm Soon After Distress 92 -0.067 0.026 1.412 0.039 Anx: Muscle Tension/Aches 91 0.041 0.018 1.563 0.042 FFMQ-NR: Let Distress Drift Away 91 -0.070 0.023 1.806 0.069 Dep: Poor Appetite 88 0.033 0.014 1.781 0.063 Dep: Difficulty Concentrating 85 0.043 0.018 2.671 0.080 bl oo p bl oo p bl oo p bl oo p bl oo p bl oo p bl oo p bl oo p bl oo op bl oo p bl oo p bl oo p bl oo p Table 23: 15 Most-Frequently Selected PHP Coe cients: Adaptive Huber Lasso b l o o p Adaptive LAD Elastic Net LabelShort Non-Zero Frequency Mean in Non-Zero Models SE in Non-Zero Models Training-Set VIF, All Models Training-Set VIF, All Models FFMQ-NJ: Some of My Emotions Are Bad 100 0.117 0.026 1.807 0.073 FFMQ-NJ: Judging My Illogical Ideas 100 0.100 0.026 1.619 0.060 # of Current Diagnoses 100 0.059 0.012 1.272 0.030 Dep: I’m a Failure 99 0.112 0.027 2.854 0.107 Dep: Hopelessness 99 0.097 0.029 2.466 0.091 Quality of Life 99 0.120 0.036 2.122 0.062 FFMQ-NR: Not Carried Away by Distress 99 -0.118 0.033 1.784 0.074 FFMQ-AA: Absent-Minded Activity 97 0.099 0.024 2.370 0.122 Anx: Nervous/Anxious 94 0.106 0.034 2.777 0.134 FFMQ-NR: Calm Soon After Distress 92 -0.073 0.023 1.354 0.034 Dep: Poor Appetite 87 0.039 0.014 1.760 0.067 FFMQ-NR: Not Carried Away by Emotions 85 -0.053 0.019 1.785 0.055 FFMQ-NR: Let Distress Drift Away 82 -0.064 0.036 1.778 0.072 FFMQ-AA: Difficulty Focusing on Present 79 0.070 0.029 1.746 0.063 Dep: Difficulty Concentrating 75 0.047 0.021 2.642 0.085 b l o o p b l o o p b l o o p b l o o p b l o o p b l o o p b l o o p b l o o p b l o o o p b l o o p b l o o p b l o o p b l o o p Table 24: 15 Most-Frequently Selected PHP Coe cients: Adaptive LAD Elastic Net 141 bl oo p Adaptive LAD Lasso LabelShort Non-Zero Frequency Mean in Non-Zero Models SE in Non-Zero Models Training-Set VIF, All Models Training-Set VIF, All Models FFMQ-NJ: Some of My Emotions Are Bad 100 0.117 0.026 1.796 0.076 FFMQ-NJ: Judging My Illogical Ideas 100 0.099 0.026 1.603 0.063 Dep: Hopelessness 99 0.097 0.032 2.435 0.104 FFMQ-NR: Not Carried Away by Distress 99 -0.118 0.036 1.767 0.069 # of Current Diagnoses 99 0.061 0.014 1.272 0.031 Dep: I’m a Failure 98 0.110 0.030 2.821 0.104 Quality of Life 98 0.125 0.038 2.102 0.060 FFMQ-AA: Absent-Minded Activity 97 0.098 0.026 2.003 0.275 Anx: Nervous/Anxious 91 0.106 0.036 2.733 0.129 FFMQ-NR: Calm Soon After Distress 91 -0.073 0.025 1.338 0.048 Dep: Poor Appetite 87 0.040 0.016 1.739 0.068 FFMQ-NR: Not Carried Away by Emotions 82 -0.049 0.021 1.776 0.056 FFMQ-NR: Let Distress Drift Away 81 -0.064 0.040 1.767 0.072 FFMQ-AA: Difficulty Focusing on Present 79 0.073 0.029 1.723 0.068 FFMQ-OB: Notice 
Environment Sounds 74 0.045 0.020 1.654 0.058 bl oo p bl oo p bl oo p bl oo p bl oo p bl oo p bl oo p bl oo p bl oo o p bl oo p bl oo p bl oo p bl oo p Table 25: 15 Most-Frequently Selected PHP Coe cients: Adaptive LAD Lasso b l o op Outlier-Shifted Lasso LabelShort Non-Zero Frequency Mean in Non-Zero Models SE in Non-Zero Models Training-Set VIF, All Models Training-Set VIF, All Models Dep: I’m a Failure 100 0.121 0.029 NA NA Dep: Hopelessness 100 0.098 0.026 NA NA FFMQ-NJ: Some of My Emotions Are Bad 100 0.148 0.027 NA NA FFMQ-NJ: Judging My Illogical Ideas 100 0.098 0.024 NA NA Quality of Life 99 0.128 0.032 NA NA FFMQ-NR: Not Carried Away by Distress 98 -0.114 0.033 NA NA Anx: Muscle Tension/Aches 95 0.051 0.018 NA NA # of Current Diagnoses 95 0.048 0.013 NA NA FFMQ-AA: Rush Through Activities 94 0.062 0.021 NA NA Anx: Scared 92 0.055 0.020 NA NA Dep: Sad/Depressed 92 0.079 0.025 NA NA FFMQ-NR: Calm Soon After Distress 90 -0.048 0.023 NA NA FFMQ-AA: Absent-Minded Activity 90 0.071 0.028 NA NA FFMQ-NR: Not Carried Away by Emotions 89 -0.065 0.026 NA NA Anx: Nervous/Anxious 87 0.101 0.036 NA NA b l o op b l o op b l o op b l o op b l o op b l o op b l o op b l o op b l o oo p b l o op b l o op b l o op b l o op Table 26: 15 Most-Frequently Selected PHP Coe cients: Outlier-Shifted Lasso 142 b l oop Outlier-Shifted Huber Lasso LabelShort Non-Zero Frequency Mean in Non-Zero Models SE in Non-Zero Models Training-Set VIF, All Models Training-Set VIF, All Models Dep: I’m a Failure 100 0.143 0.026 NA NA FFMQ-NJ: Some of My Emotions Are Bad 100 0.141 0.027 NA NA Dep: Hopelessness 98 0.087 0.032 NA NA FFMQ-AA: Absent-Minded Activity 97 0.095 0.028 NA NA FFMQ-NR: Not Carried Away by Emotions 96 -0.093 0.029 NA NA FFMQ-NJ: Judging My Illogical Ideas 96 0.072 0.022 NA NA FFMQ-NR: Not Carried Away by Distress 95 -0.084 0.032 NA NA FFMQ-AA: Difficulty Focusing on Present 93 0.076 0.036 NA NA Quality of Life 90 0.102 0.037 NA NA # of Current Diagnoses 88 0.038 0.013 NA NA Anx: Muscle Tension/Aches 87 0.051 0.022 NA NA Anx: Scared 83 0.044 0.022 NA NA Anx: Nervous/Anxious 82 0.074 0.030 NA NA FFMQ-NR: Let Distress Drift Away 81 -0.053 0.027 NA NA Dep: Sad/Depressed 80 0.062 0.028 NA NA Table 27: 15 Most-Frequently Selected PHP Coe cients: Outlier-Shifted Huber Lasso Tables 17-27 present information on the 15 most-frequently selected variables for each of the 11 methods and adaptations. Each table includes a shorthand variable label, the number of times that variable was selected into the model across the 100 training-testing splits, the 20% trimmed mean of the coe cient estimate across its non-zero models, and the 20% Winsorized standard error of the trimmed mean. A Winsorized statistic is similar to a trimmed statistic, except that the upper and lower % of sorted observations are shifted to the nearest untrimmed value in the dataset (Wilcox, 2016). A Winsorized statistic, then, shrinks or increases trimmed observationsratherthaneliminatingthemfromthedataset. TheWinsorizedstandarddeviation is to the trimmed mean as the standard deviation is to the mean. Therefore, the Winsorized standard deviation of the sampling distribution of the trimmed mean corresponds with the standard error of the trimmed mean. All methods except the MS adaptive elastic net selected at least one variable in 100% of training-testing splits, with most selecting between 4 and 6 predictors 100% of the time. The standard elastic net and adaptive Huber elastic net each selected 7 predictors 100% of the time. 
At the lower end of 100% selection, the MS adaptive elastic net never selected a variable 100% of the time, with a single item selected in 99 models and two items selected in 98 models. The adaptive LAD lasso and adaptive LAD elastic net selected two and three predictors 100% of the time, respectively. Where there were both elastic net and lasso variants of a method, the elastic net tended to select more variables 100% of the time, consistent with its dilution of the lasso selection penalty with the non-selecting ridge penalty. Additionally, there tended to be overlap between the selections of elastic net and lasso variants of a method. The adaptive LAD lasso's two 100% items, for instance, corresponded with two of the three 100% items for the elastic net formulation, while the third 100% item in the elastic net version was one of the items selected by the lasso formulation 99% of the time.

Coefficients were generally small; the largest average coefficients were slightly more than 0.14, while the smallest coefficients were less than 0.03 on average. Two characteristics contribute to this scale. First, the outcome variable ranges from 0 to 6. Second, most methods typically selected 20 or more predictors into the model. The only negative coefficients correspond with positively-valenced FFMQ-SF items (for example, "I watch my feelings without getting carried away by them").

6.6 Follow-Up Analyses

Based on findings in the first set of analyses, and to further investigate the impacts of collinearity on the current models, further analyses were designed and conducted. To begin this investigation, consider the pairwise correlation plots of the 21 variables selected most frequently across adaptations against all potential predictors in the previous analyses. All correlation plots were generated using the corrplot package (Wei and Simko, 2021), and Pearson's product-moment correlation was used to measure pairwise correlation. Although there are valid concerns with this statistic, especially with large numbers of predictors, it is the metric most familiar to applied researchers. These correlations are split into two plots for legibility, and the plots are limited to correlations with the 21 original top-15 predictors. Figure 6.2 presents the correlations of the 21 top-15 predictors, age, and the remaining CUXOS items. Figure 6.3 presents correlations of the top-15 predictors, the remaining CUDOS items, and the remaining FFMQ-SF items. Blue indicates positive pairwise correlations, red indicates negative pairwise correlations, and the depth of the color indicates the magnitude of each correlation. The diagonal lines in negatively-correlated cells do not signify any aspect of the correlations other than being negative and could not be removed or altered; attempts to include these lines in all cells were unsuccessful.

Among the top-15 predictors, the depression-related impairment item and the past-week quality of life item of the CUDOS (dep0_17_1 and dep0_18_1, respectively) produced the largest correlation (0.67). The FFMQ-SF Nonreact item "I watch my feelings without getting carried away by them" and the general "I felt nervous or anxious [in the past week]" item on the CUXOS (FFMQpre_3_1 and anx0_1_1, respectively) produced the strongest negative correlation (-0.33) among top-15 predictors. The strongest positive correlation overall was 0.72, between the general anxiety/nervous item and the item "I worried too much about things [in the past week]" on the CUXOS (anx0_1_1 and anx0_3_1, respectively). The strongest negative correlation overall was -0.34, while several pairwise correlations fell between -0.30 and -0.33. Most of the largest negative correlations were between the positively- and negatively-valenced FFMQ-SF items or between the positively-valenced FFMQ-SF items and CUXOS items. One pairwise correlation between CUDOS items was also -0.34, between items indicating increased appetite and reduced appetite (dep0_3_1 and dep0_4_1, respectively).
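As a pointer to how these displays can be produced, the following minimal sketch computes the Pearson correlations and draws the CUXOS block with corrplot. It is illustrative only: the data frame dat (the 64 predictors under the item names used above) and the character vector top21 (the 21 frequently selected predictor names) are assumptions of the sketch, and it approximates rather than reproduces the exact formatting of Figures 6.2 and 6.3.

    library(corrplot)

    ## Pearson correlations among all predictors
    r_all <- cor(dat, method = "pearson", use = "pairwise.complete.obs")

    ## Square block containing the 21 top-15 predictors, age, and the CUXOS items
    ## (cf. Figure 6.2); an analogous block of CUDOS/FFMQ-SF items covers Figure 6.3.
    cux_block <- unique(c(top21, "Age_1", grep("^anx0", names(dat), value = TRUE)))
    corrplot(r_all[cux_block, cux_block], method = "color", tl.cex = 0.6)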
The pairwise correlations presented in these plots and follow-up analyses provide insight into results observed in previous analyses and will be given more detailed consideration in the chapter discussion. This section begins by presenting the particular correlations of interest. A set of semi-simulations conducted to study these correlations is then described, followed by presentation of the results for each set of analyses. The chapter concludes with the discussion of the applied data results.

Figure 6.2: Correlations of Top-15 Predictors with Age and CUXOS Items

Figure 6.3: Correlations of Top-15 Predictors with CUDOS and FFMQ-SF Items

6.6.1 Correlations of Interest

6.6.1.1 Perturbation 1: Top 5 Predictors

Most of the five variables selected most frequently across methods show strong pairwise correlations with each other. The lowest correlation among them is 0.14, between the number of current mental health diagnoses and FFMQpre_24_1, "I disapprove of myself when I have illogical ideas," from the Nonjudge subscale. The correlations between the non-diagnosis items tended to be stronger, ranging between 0.26 and 0.62. The different discrete scale of the diagnosis item relative to the other 5-point items presents potential statistical concerns addressed in the discussion. Therefore, follow-up analyses focus on other variables. Given the small-to-moderate pairwise correlations among these variables, analyses will examine how alterations among them affect variable selection of the various adaptations studied in the current research program.

6.6.1.2 Perturbations 2a and 2b: FFMQ Item 10 and Other Observe Items

FFMQpre_10_1, "Generally, I pay attention to sounds...," presents one of the more interesting findings from the pairwise correlations. This item, notably the only Observe subscale item among the top-15 predictors, showed very low correlations with all items outside its FFMQ-SF subscale. All correlations fell below 0.2 in absolute value, only one correlation passed 0.15, and most of the remaining correlations fell below 0.10. This includes correlations between this item and the other items in the top-15 predictors. The only noteworthy pairwise correlations with this item were with the other items from the Observe subscale: FFMQpre_6_1, "I pay attention to physical experiences...," at 0.45; FFMQpre_15_1, "I notice smells and aromas of things," at 0.49; and FFMQpre_20_1, "I notice visual elements in art or nature...," at 0.5.

As noted by Zou and Hastie (2005) and many subsequent authors, one of the primary concerns with the lasso is its tendency to arbitrarily select a single predictor from a group of correlated variables and eliminate the rest. The frequent selection of FFMQpre_10_1 in the primary analyses, but not the other items from its subscale with which it correlates, potentially represents a manifestation of this problem. Therefore, these relationships are investigated further by changing the correlation between the selected item and the three unselected items.
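The arbitrary-selection behavior referenced above is easy to reproduce on synthetic data. The following sketch is not part of the dissertation analyses; all names and settings are illustrative. It generates three nearly interchangeable predictors with symmetric roles in the outcome and fits a cross-validated lasso; the fitted coefficients typically concentrate on a subset of the three rather than spreading evenly across them.

    library(MASS)
    library(glmnet)

    set.seed(2)
    Sigma <- matrix(0.95, 3, 3); diag(Sigma) <- 1
    X <- cbind(mvrnorm(200, mu = rep(0, 3), Sigma = Sigma),   # 3 highly correlated predictors
               matrix(rnorm(200 * 5), 200, 5))                # 5 unrelated noise predictors
    y <- rowSums(X[, 1:3]) + rnorm(200)                       # x1-x3 contribute equally

    cv_demo <- cv.glmnet(X, y, alpha = 1)                     # alpha = 1: the lasso
    coef(cv_demo, s = "lambda.1se")    # weight typically lands unevenly on x1-x3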
6.6.1.3 Perturbation 3: FFMQ Item 10 and Other Top-15 Predictors

As noted above, FFMQpre_10_1 correlated weakly with all other predictors, including all 21 other original top-15 predictors. Two correlations were exactly 0: with dep0_12_1, "I had problems concentrating"; and FFMQpre_23_1, "I find myself doing things without paying attention." One further correlation was -0.01, with dep0_3_1, "My appetite was poor and I didn't feel like eating." These non-correlations, and FFMQpre_10_1's generally low correlation with top-15 predictors, correspond with the selection issue presented by the other Observe items' exclusion, except from the angle of selection rather than elimination. These analyses further probe FFMQpre_10_1's unique place among top-15 predictors by manipulating its correlation with these other top-15 predictors.

6.6.1.4 Perturbation 4: FFMQ Item 18 vs. Selected Nonreact Items

Finally, FFMQpre_18_1, "Usually when I have distressing thoughts or images I can just notice them without reacting," presents the opposite observation from FFMQpre_10_1. FFMQpre_18_1 was the only Nonreact subscale item not frequently included as a predictor, despite moderate correlations with the frequently-selected Nonreact items ranging between 0.41 and 0.55. This selection discrepancy contrasts with the selection concerns of the lasso when dealing with collinear predictors. Follow-up analyses perturb the correlation between this item and the remaining Nonreact items to further investigate this finding and its relation to collinearity concerns.

6.6.2 Follow-Up Analyses: Methods

Participants, models and software implementation, and hyperparameter selection were all conducted as previously described in the primary applied analyses. R Markdown workbooks used to conduct the necessary data manipulations and subsequent follow-up analyses will be presented in an organized fashion in a special GitHub repository within my general repository linked in Appendix B.1.

6.6.2.1 Data Perturbation

The perturbations outlined below provide preliminary results on collinearity and variable selection in the current lasso and elastic net adaptations.

The first set of correlations was evaluated by increasing the pairwise correlations FFMQpre_24_1 shared with FFMQpre_19_1, dep0_11_1, and dep0_16_1 to approximately 0.7, 0.5, and 0.7, respectively. The number of current mental health diagnoses was excluded from these perturbations owing to its distinct scale. These values corresponded with increasing each correlation's effect size to the next-strongest "level" of correlation by the traditional effect sizes of small (0.3), medium (0.5), and large (0.7). Perturbation was achieved by shifting FFMQpre_19_1, dep0_11_1, and dep0_16_1 values with particular discrepancies relative to FFMQpre_24_1 to be closer to the latter. For example, individuals whose responses on any of those three items differed from FFMQpre_24_1 by two or more were shifted by two in the corresponding direction of difference. To make this example more concrete: suppose Matt selected "4" on FFMQpre_24_1, but only "1" on FFMQpre_19_1. Matt's score on FFMQpre_19_1 would then be increased to "3." Note that these analyses did not alter FFMQpre_24_1.
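To make this shifting rule concrete, the following minimal R sketch implements the rule described above. It is not the code used for the actual perturbations; the function name, the clamping of shifted values to the item's 0-4 response range, and the example inputs are illustrative assumptions.

## Minimal sketch of the Perturbation 1 shifting rule (illustrative, not the
## dissertation's code). `anchor` plays the role of FFMQpre_24_1 and `item`
## one of the three perturbed items, both on a 0-4 Likert scale.
shift_toward_anchor <- function(item, anchor, gap = 2, step = 2) {
  diff <- anchor - item
  # shift responses that differ from the anchor by `gap` or more,
  # moving them `step` points toward the anchor
  shifted <- ifelse(abs(diff) >= gap, item + sign(diff) * step, item)
  # keep shifted values on the original 0-4 scale (an assumption of this sketch)
  pmin(pmax(shifted, 0), 4)
}

## Worked example from the text: anchor response "4", item response "1" -> "3"
shift_toward_anchor(item = 1, anchor = 4)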
Figure 6.4 presents all pairwise correlations that changed by an absolute value of 0.19 or more.[35] The upper diagonal displays the original correlations, while the lower diagonal displays the corresponding correlations resulting from the perturbation described above. Blank cells denote pairwise correlations that changed by an absolute value of less than 0.19.

[35] 0.19, rather than 0.2, was used as this was the value by which CUDOS item 11's correlation with FFMQ item 24 shifted.

Figure 6.4: Correlation Changes Resulting from Perturbation 1 (original and perturbed correlations among dep0_11_1, dep0_16_1, FFMQpre_19_1, FFMQpre_24_1, and FFMQpre_14_1)

The second set of correlations was manipulated in two ways to eliminate the correlation between FFMQpre_10_1 and the remaining Observe items. These manipulations resulted in Perturbations 2a and 2b. One manipulation (Perturbation 2a) altered FFMQpre_10_1, while the other (Perturbation 2b) altered the other three Observe items. Both manipulations were achieved by generating a synthetic alternative variable, randomly sampled from the integer values 0 to 4, for each altered variable. Figure 6.5 presents any impacted correlations resulting from Perturbation 2a. Figure 6.6 presents the correlations altered under Perturbation 2b.
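A minimal R sketch of this synthetic-replacement manipulation, which also underlies Perturbation 4, is given below. It is illustrative only; the object names and sample size are assumptions, not the dissertation's code.

## Minimal sketch of the synthetic-replacement manipulation used for
## Perturbations 2a, 2b, and 4 (illustrative, not the dissertation's code):
## each altered item is replaced by an independent draw from the integers 0-4,
## which removes its correlation with every other variable in expectation.
set.seed(2021)
n <- 100   # illustrative sample size
synthetic_item <- sample(0:4, size = n, replace = TRUE)

## e.g., Perturbation 2a would replace FFMQpre_10_1 with one such draw, while
## Perturbation 2b would instead replace FFMQpre_6_1, FFMQpre_15_1, and
## FFMQpre_20_1 with three independent draws.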
Figure 6.5: Correlation Changes Resulting from Perturbation 2a (correlations among FFMQpre_10_1, dep0_7_1, FFMQpre_6_1, FFMQpre_15_1, and FFMQpre_20_1)

Figure 6.6: Correlation Changes Resulting from Perturbation 2b (correlations among dep0_18_1, FFMQpre_3_1, FFMQpre_9_1, FFMQpre_10_1, FFMQpre_13_1, FFMQpre_23_1, FFMQpre_2_1, FFMQpre_5_1, FFMQpre_6_1, FFMQpre_15_1, FFMQpre_16_1, FFMQpre_18_1, and FFMQpre_20_1)

The third set of correlations was manipulated in a similar fashion to Perturbation 1, with FFMQpre_10_1 remaining unaltered and dep0_3_1, dep0_12_1, and FFMQpre_23_1 being perturbed. Since these correlations were all approximately 0 to begin with, one correlation each was perturbed to have small (0.3), medium (0.5), and large (0.7) correlations. Specifically, dep0_3_1 was perturbed such that its correlation with FFMQpre_10_1 increased to approximately 0.3; dep0_12_1 was perturbed to achieve a correlation of 0.5 with FFMQpre_10_1; and FFMQpre_23_1 was perturbed to achieve a correlation of 0.7 with FFMQpre_10_1. Figure 6.7 presents altered correlations resulting from Perturbation 3.

Figure 6.7: Correlation Changes Resulting from Perturbation 3 (correlations among dep0_1_1, dep0_3_1, dep0_11_1, dep0_12_1, dep0_17_1, FFMQpre_10_1, FFMQpre_23_1, dep0_2_1, FFMQpre_6_1, FFMQpre_12_1, FFMQpre_15_1, FFMQpre_16_1, FFMQpre_20_1, and FFMQpre_22_1)

The fourth and final set of correlations was manipulated similarly to Perturbation 2a, with a synthetic FFMQpre_18_1 generated from a random sample of the integers 0 to 4 and the remaining Nonreact items left unperturbed. Altered correlations resulting from Perturbation 4 are presented in Figure 6.8.

Figure 6.8: Correlation Changes Resulting from Perturbation 4 (correlations among dep0_11_1, FFMQpre_3_1, FFMQpre_9_1, FFMQpre_13_1, FFMQpre_21_1, FFMQpre_6_1, FFMQpre_16_1, FFMQpre_18_1, and FFMQpre_20_1)

6.6.2.2 Evaluating Performance

The follow-up analyses intend to evaluate the impact that perturbations of different predictor correlations produce in variable selection. Pursuant to this goal and the goals outlined for each perturbation, these analyses do not distinguish top-15 predictors from other variables. The results focus instead on continuous changes in selection frequency of all previous top-15 predictors and any other variables whose correlations changed as a result of the data perturbations in these analyses. The following tables display the selection frequency of all included variables in the original analyses and in each perturbation condition.

Given the current mini-study's interest in collinearity, VIF was also calculated. Unfortunately, initial results with the first 10 training-testing splits across perturbations and adaptations proved comparable to the results of the previous analyses, with few VIFs surpassing 3.00. Full calculations were therefore not conducted and are not included in the results below.

6.6.3 Follow-Up Analyses: Overall Results

Table 28 presents variable selection percentages across adaptations for frequently-selected variables and variables whose correlations changed as a result of perturbations. Tables 29-39 present the same information separately for each of the 11 methods and adaptations. Each table presents a shortened item question, along with selection percentages in each of the five perturbed datasets as well as the original dataset. Percentages reported in Table 28 are the 20% trimmed mean of the percentages across methods. Percentages reported in Tables 29-39 are simple percentages: they correspond with the number of the 100 training-testing splits in which the given adaptation selected the indicated variable as a predictor.
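The following minimal R sketch illustrates how these two kinds of percentages relate. It is not the dissertation's code; the simulated selection indicators and the dimension names are illustrative assumptions, with only the counts of 100 splits and 11 methods taken from the text.

## Minimal sketch (illustrative, not the dissertation's code) of the quantities
## reported below: per method, the share of the 100 training-testing splits in
## which a variable receives a non-zero coefficient (Tables 29-39); Table 28
## reports the 20% trimmed mean of those shares across the 11 methods.
set.seed(3)
selected <- array(runif(100 * 3 * 11) > 0.5, dim = c(100, 3, 11),
                  dimnames = list(NULL, c("v1", "v2", "v3"), paste0("m", 1:11)))

## Per-method selection percentage for each variable (Tables 29-39)
pct_by_method <- apply(selected, c(2, 3), mean) * 100

## 20% trimmed mean across the 11 methods (Table 28)
overall_pct <- apply(pct_by_method, 1, mean, trim = 0.2)
round(overall_pct, 2)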
Non-Zero Percentages: All Adaptations
Variable LabelShort Original Perturbation 1 Perturbation 2a Perturbation 2b Perturbation 3 Perturbation 4
anx0_1_1 Anx: Nervous/Anxious 93.36 93.00 94.73 93.64 91.18 94.45
anx0_6_1 Anx: Scared 71.82 74.55 73.64 73.36 73.18 70.91
anx0_7_1 Anx: Muscle Tension/Aches 83.36 78.64 83.73 82.91 77.27 84.45
dep0_1_1 Dep: Sad/Depressed 35.18 45.73 36.00 34.36 38.64 31.55
dep0_11_1 Dep: I'm a Failure 99.00 99.36 99.09 99.27 99.45 99.09
dep0_12_1 Dep: Difficulty Concentrating 74.64 77.18 74.00 74.64 80.45 74.91
dep0_16_1 Dep: Hopelessness 99.45 96.73 99.09 99.36 99.64 99.36
dep0_17_1 Depression-Related Impairment 75.91 78.55 73.64 75.82 79.91 76.82
dep0_18_1 Quality of Life 96.00 98.55 96.09 95.45 95.73 95.91
dep0_2_1 Dep: Anhedonia 12.18 14.45 10.55 10.91 16.55 12.09
dep0_3_1 Dep: Poor Appetite 74.82 74.82 74.27 73.55 42.09 73.36
dep0_7_1 Dep: Fidgety, Can't Sit Still 7.91 4.82 6.91 7.00 6.45 8.00
FFMQpre_10_1 FFMQ: Notice Sounds 66.00 62.36 13.18 67.27 2.45 66.00
FFMQpre_12_1 FFMQ: Running on Automatic 38.82 39.27 35.82 38.27 52.00 40.73
FFMQpre_13_1 FFMQ: Calm Soon After Distress 90.64 88.64 89.55 90.82 90.36 92.00
FFMQpre_14_1 FFMQ: Should Think Differently 21.27 20.36 19.64 20.27 20.73 22.09
FFMQpre_15_1 FFMQ: Notice Smells 23.55 16.27 41.82 16.00 11.00 26.91
FFMQpre_16_1 FFMQ: Effective at Communicating Difficulty 24.18 20.91 20.73 22.27 28.55 27.45
FFMQpre_17_1 FFMQ: Rush Through Activities 61.55 63.27 62.64 62.27 68.55 60.73
FFMQpre_18_1 FFMQ: Notice, Not React, to Distress 48.36 44.82 47.91 48.45 48.73 62.09
FFMQpre_19_1 FFMQ: Some of My Emotions Are Bad 99.91 99.73 99.73 99.91 99.91 99.82
FFMQpre_2_1 FFMQ: Effective at Communicating Internal Experience 7.09 7.91 7.64 7.64 7.09 7.73
FFMQpre_20_1 FFMQ: Notice Sights 6.73 6.45 6.82 8.64 11.00 6.91
FFMQpre_21_1 FFMQ: Let Distress Drift Away 87.18 87.18 85.55 86.09 84.18 90.09
FFMQpre_22_1 FFMQ: Notice Inattentivenes to What I'm Doing 13.27 21.55 15.73 12.18 44.00 18.18
FFMQpre_23_1 FFMQ: Absent-Minded Activity 97.73 95.45 97.73 97.73 89.73 98.00
FFMQpre_24_1 FFMQ: Judging My Illogical Ideas 99.45 30.09 99.36 99.55 99.82 98.82
FFMQpre_3_1 FFMQ: Not Carried Away by Emotions 91.45 93.55 91.73 91.55 95.64 93.36
FFMQpre_5_1 FFMQ: Ineffective at Communicating Thoughts 49.45 48.64 48.45 49.82 45.18 47.45
FFMQpre_6_1 FFMQ: Pay Attention to Physical Experiences 20.36 18.91 11.91 8.82 36.91 24.73
FFMQpre_8_1 FFMQ: Difficulty Focusing on Present 77.09 83.73 80.45 78.45 82.00 75.73
FFMQpre_9_1 FFMQ: Not Carried Away by Distress 97.18 97.82 97.27 97.91 96.09 97.64
numdx # of Current Diagnoses 98.09 97.91 97.82 97.64 97.73 98.27

Table 28: Variable Selection Percentages of Frequently-Selected and Perturbed Variables, All Models

Across all methods, the selection of CUXOS items remained relatively stable. No anxiety items were altered, either directly or indirectly, except for the three previous top-15 predictors. Nervousness/anxiety (item 1) was particularly stable, staying within approximately 93% to 95%. The muscle tension item decreased in selection by roughly 5% and 6% for Perturbations 1 and 3, respectively, likely relating to corresponding correlation changes that were less than 0.19 (and thus not seen in the previous correlation plots). Two CUDOS items arose through these perturbations: anhedonia (lack of enjoyment of typically-enjoyed activities) and restlessness.
Anhedonia saw a slight overall increase, from 12.18% in the original dataset to 16.55% under Perturbation 3. This item's decreased correlation with FFMQ item 23,[36] from 0.29 to 0.08, provides a potential explanation for this change. Restlessness's largest change, of roughly 3%, arose from Perturbation 1, even though it only showed a meaningfully altered correlation in Perturbation 2a. CUDOS item 1, the general sad/depressed item, similarly showed its most significant change in selection probability, of greater than 10%, in Perturbation 1, even though we only observe a sizable correlation change in Perturbation 3.

CUDOS items for worthlessness, hopelessness, and past-week quality-of-life (items 11, 16, and 18, respectively) were consistent across perturbations when considering all adaptations. Worthlessness[37] never fell below 99% selection, even under Perturbation 1. Meanwhile, the hopelessness item only fell below 99% in Perturbation 1, which directly altered this item to increase its correlation with FFMQ item 24[38] from 0.26 to 0.72. Although selection only decreased to 96.73%, this change is still notable given the otherwise small range of selection probabilities in all other scenarios. CUDOS item 3, poor appetite, produced stable selection probabilities between 73% and 74% in all but Perturbation 3, where it was only selected in 42.09% of models. This item was specifically altered in Perturbation 3 such that its correlation with FFMQ item 10 increased from -0.01 to 0.29. The same perturbation scenario resulted in the greatest selection increase for the concentration item, CUDOS item 8: roughly 6%, from 74.64% to 80.45%. In addition to the direct alteration of CUDOS item 8's correlation with FFMQ item 10 from 0 to 0.51, a number of its correlations with other items changed: reducing from 0.55 to 0.25 with CUDOS item 12; increasing from -0.12 to 0.15 with FFMQ item 6, noticing physical sensations; increasing from -0.04 to 0.22 with FFMQ item 15, noticing smells; and increasing from -0.09 to 0.18 with FFMQ item 20, noticing visual stimuli. All of CUDOS item 8's altered correlations in this scenario correspond with Observe subscale items.

Three FFMQ items were highly stable across scenarios. Items 2 and 19[39] varied by less than 1% overall. However, item 2 was only selected approximately 7% of the time, compared to item 19's selection probability of greater than 99%. FFMQ item 14 varied to a greater degree, but stayed within a few percentage points, roughly 19.64% to 22.09%. Item 8[40] produced a wider range of selection probabilities, between 75.73% and 83.73%, although selection probabilities were evenly distributed over this range.

[36] "I find myself doing things without paying attention."
[37] Aka, "I'm a failure."
[38] "I disapprove of myself when I have illogical ideas."
[39] "I can easily put my beliefs, opinions, and expectations into words," and "I think some of my emotions are bad or inappropriate and I shouldn't feel them," respectively.
[40] Difficulty focusing on the present.

Most of the FFMQ items included in the current analyses produced tight selection probabilities across scenarios, with one notable exception. Item 3[41] ranged between 91.55% and 93.55% for all but Perturbation 3, in which it was selected 95.64% of the time. Items 5, 13, and 21[42] showed similar patterns. Items 17[43] and 23 showed larger selection differences of 5+% and 6+% in one scenario, compared to their general ranges of approximately 2.5%. The last item with a single notable difference in selection probability was item 24.
Approximately 30.09% of all models selected this item under Perturbation 1, compared to probabilities of 98.82% or greater in all other scenarios. This is the same scenario which directly increased its correlations with three other frequently-selected items.

Item 18[44] showed a relatively tight range of less than 2% for four out of six conditions, a selection probability more than 3% lower in one scenario (Perturbation 1), and a selection probability more than 13% greater in the last scenario (Perturbation 4). This increased selection probability corresponded with the scenario in which its correlation with other items was directly increased. Item 12[45] similarly had a generally tight range of less than 2% in four scenarios, one selection probability a few percentage points lower, and one much larger selection probability approximately 12% greater than the other scenarios. Item 20 produced selection probabilities within 0.5% in four scenarios, one probability 1.73% greater than the next largest, and one probability more than 4% greater than the largest of the tight range.[46]

Item 16[47] showed a wider but less diffuse spread of selection probabilities. Selection probabilities ranged between 21% and 29%, although values were unevenly distributed across this range. Item 22[48] produced a wide spread across five of the six scenarios, ranging between 12.18% and 21.55% and spread unevenly across that range. Under Perturbation 3, however, 44% of models selected item 22, a large selection change for a variable whose correlation did not directly change in any scenario.

Items 6, 10, and 15 showed the greatest and most disjointed variability in selection probabilities across perturbation scenarios. Items 6 and 15[49] are two of only three variables outside the previous top-15 predictors with directly altered correlations. Meanwhile, item 10's correlation with item 6 and item 15 was directly altered. At the lower end, 11.91% and 8.82% of models selected item 6 in Perturbations 2a and 2b, respectively. 37% of models, however, selected item 6 under Perturbation 3. Item 10 was selected by 66% to 67.27% of models in three scenarios, was selected slightly less under Perturbation 1 (62.36%), and was selected drastically less under Perturbations 2a (13.18%) and 3 (2.45%). Perturbation 3 also produced item 15's lowest selection probability (11%). Two further low selection probabilities of roughly 16% arose under Perturbations 1 and 2b, while Perturbation 2a produced a high selection probability of 41.82% for item 15.

[41] "I watch my feelings without getting carried away by them."
[42] "It's hard for me to find the words to describe what I'm thinking," "When I have distressing thoughts or images, I feel calm soon after," and "When I have distressing thoughts or images, I just notice them and let them go," respectively.
[43] "I rush through activities without being really attentive to them."
[44] "Usually when I have distressing thoughts or images I can just notice them without reacting."
[45] "It seems I am 'running on automatic' without much awareness of what I am doing."
[46] To reemphasize a point made earlier: although the magnitude of these differences may not be large, they are notable relative to the common range of selection probabilities seen in most of the scenarios.
[47] "Even when I'm feeling terribly upset, I can find a way to put it into words."
[48] "I do jobs or tasks automatically without being aware of what I'm doing."
[49] Along with item 20.
6.6.4 Follow-Up Analyses: Scenario-Specific Results

Variables whose correlations were directly altered in a given perturbation scenario will henceforth be referred to as "variables which were directly altered." Not all variables of interest were directly altered to achieve any perturbation of interest. However, referring repeatedly to "variables whose correlations were directly altered" would be cumbersome, and thus this convenience is used instead.

Tables 29-39 display the individual selection probabilities of the lasso and elastic net adaptations for each variable and across the follow-up data scenarios. Common patterns emerged for the impact each scenario had on relative selection probabilities, though the specific impacts by method varied.

Perturbation 1 was less likely to result in notable selection changes for the directly altered variables and more likely to produce selection differences in variables affected indirectly by the perturbation. All of the adaptations saw drastically reduced selection probability for FFMQ item 24, which was selected the least by the MS adaptive elastic net (5% of models) and the adaptive Huber lasso (13%). The standard elastic net and adaptive LAD elastic net consistently selected this item, selecting it 56% and 48% of the time, respectively. The adaptive LAD elastic net, OS Huber lasso, and MS adaptive elastic net were all less likely to select CUDOS item 16 under Perturbation 1, which was otherwise near-universally selected across models. Selection percentage of item 16 decreased by 10%, to roughly 88%, for the MS adaptive elastic net. The selection difference observed in the adaptive LAD elastic net for CUDOS item 16 was much smaller, as it still selected this item in 96% of models; however, this is still notable given its selection rate of 99% to 100% in all other scenarios.

As noted, Perturbation 1 was more likely to produce selection changes in downstream variables, with the OS methods, the MS adaptive elastic net, and the adaptive LAD lasso producing the most changes. Selection of downstream variables changed to a lesser degree for the adaptive elastic net, the adaptive Huber lasso, and the adaptive Huber elastic net. No consistent pattern emerged among the different adaptations in terms of which downstream variables were differentially selected, by what degree, or in which direction.

Perturbation 2a resulted in meaningful selection changes in three out of four directly altered variables for all methods except the OS Huber lasso, and the changes occurred in the same three variables except in the OS lasso. Perturbation 2a produced reduced selection of FFMQ items 6 and 10 and increased selection of item 15. Though the magnitude of change varied by method, the proportional change was often comparable for the lasso and elastic net variants of an adaptation. The adaptive LAD methods, for instance, selected item 10 at one-seventh and one-eighth the rate of the original dataset for the lasso and elastic net, respectively, and selected item 6 at roughly one-third the rate. While the OS lasso also saw selection changes in three directly altered variables, item 6 did not see a selection change, and item 20 instead saw an increase in selection. The OS Huber lasso only saw a reduction in selection of item 10, the variable that was itself altered in this scenario. Perturbation 2a resulted in fewer downstream selection changes. The standard lasso saw reduced selection of FFMQ item 10, and the adaptive Huber elastic net increased its selection of FFMQ item 22.
The MS adaptive elastic net, adaptive LAD Huber lasso, and adaptive Huber elastic net all saw downstream impacts on four variables, and the OS Huber lasso on five.

Perturbation 2b resulted in selection changes for two to four directly-altered variables, with five methods[50] seeing a change in two variables, five methods[51] seeing a change in three, and the adaptive LAD lasso seeing a change in all four. FFMQ item 10 remained relatively constant in most methods, seeing increased selection in the standard elastic net and adaptive lasso and decreased selection in the adaptive LAD lasso and OS lasso. Items 6 and 15 were more consistent, with each seeing reduced selection in 10 and 9 of the 11 methods, respectively. Item 6 saw no change in the OS Huber lasso, while item 15 saw positive changes in both OS methods. Item 20 was the most inconsistent, with reduced selection by the adaptive Huber lasso and OS Huber lasso, increased selection by the adaptive elastic net and the adaptive LAD variants, and no change otherwise. The standard lasso saw no downstream selection changes, while the standard elastic net, adaptive Huber lasso, and adaptive LAD elastic net saw only one downstream change each. The MS adaptive elastic net and OS lasso saw selection changes in five downstream variables, while the OS Huber lasso saw six downstream changes.

[50] Standard lasso, MS adaptive elastic net, adaptive Huber lasso, adaptive Huber elastic net, and OS Huber lasso.
[51] Standard elastic net, adaptive lasso, adaptive elastic net, adaptive LAD elastic net, and OS lasso.

Perturbation 3 resulted in reduced selection of FFMQ item 10 and CUDOS item 3 for all methods and adaptations. Most methods reduced selection of FFMQ item 10 to nearly 0%, from original rates between 50%[52] and 80%.[53] Most methods cut CUDOS item 3's selection rate in half, although the standard formulations cut it down by only a quarter, from 95% to 71% and 73%. The MS adaptive elastic net cut it down by roughly two-thirds, from 61% down to 20%. Most adaptations reduced their selection of FFMQ item 23 as well, although typically by margins of 10% or less. Both standard formulations selected the item 96% of the time; as observed in other contexts, this corresponded with 100% selection rates across other scenarios. The adaptive lasso also only reduced its selection by 4% compared to the original dataset, though it too saw consistent selection in four out of the five other data scenarios. The OS lasso was the only method that did not change item 23's selection rate under Perturbation 3. This scenario produced inconsistent impacts on CUDOS item 12's selection. All methods saw meaningful selection changes in at least three out of four directly altered variables under Perturbation 3. Perturbation 3 also resulted in the most numerous and most consistent downstream changes, with changes in seven to eleven other variables of interest.

[52] MS adaptive elastic net.
[53] Standard lasso and elastic net, and the OS lasso.

Perturbation 4 produced consistent impacts on the selection of directly-altered variables, with some notable exceptions, while the downstream effects varied to a greater degree. Selection of FFMQ item 18 increased for all methods except the OS Huber lasso. The adaptive elastic net and adaptive Huber lasso both increased selection of FFMQ item 21 by a small margin, to nearly 100% from 91% each. The MS adaptive elastic net increased its selection of FFMQ items 3 and 13 by 5%.
The OS lasso saw selection changes in three of the four directly altered variables, increasing selection of FFMQ items 3 and 21 and reducing selection of item 13. The OS Huber lasso increased its selection of item 21 by 5%, with the remaining variables relatively constant. Perturbation 4 induced selection changes in seven downstream variables for the adaptive LAD lasso and only two downstream changes for the MS adaptive elastic net. The majority of adaptations 54 saw impacts on four downstream variables, with five and six variables impacted in the elastic net and adaptive lasso, respectively. 54 Adaptive elastic net, adaptive Huber lasso, adaptive Huber elastic net, adaptive LAD elastic net, and the OS variants. 161 Standard Lasso: Non-Zero Percentages Variable LabelShort Original Perturbation 1 Perturbation 2a Perturbation 2b Perturbation 3 Perturbation 4 anx0_1_1 Anx: Nervous/Anxious 99 100 100 99 98 99 anx0_6_1 Anx: Scared 88 91 89 89 89 88 anx0_7_1 Anx: Muscle Tension/Aches 94 93 95 94 94 95 dep0_1_1 Dep: Sad/Depressed 22 37 21 21 29 19 dep0_11_1 Dep: I’m a Failure 100 100 100 100 100 100 dep0_12_1 Dep: Difficulty Concentrating 92 91 92 93 88 92 dep0_16_1 Dep: Hopelessness 100 100 100 100 100 100 dep0_17_1 Depression-Related Impairment 94 94 91 93 96 94 dep0_18_1 Quality of Life 98 100 99 99 98 98 dep0_2_1 Dep: Anhedonia 9 14 10 9 19 11 dep0_3_1 Dep: Poor Appetite 95 95 95 95 71 95 dep0_7_1 Dep: Fidgety, Can’t Sit Still 13 10 10 9 10 13 FFMQpre_10_1 FFMQ: Notice Sounds 80 76 20 81 2 78 FFMQpre_12_1 FFMQ: Running on Automatic 65 66 61 62 82 66 FFMQpre_13_1 FFMQ: Calm Soon After Distress 98 98 97 98 98 99 FFMQpre_14_1 FFMQ: Should Think Differently 24 27 26 24 23 29 FFMQpre_15_1 FFMQ: Notice Smells 39 24 58 23 17 44 FFMQpre_16_1 FFMQ: Effective at Communicating Difficulty 36 32 30 34 48 42 FFMQpre_17_1 FFMQ: Rush Through Activities 72 73 73 72 78 71 FFMQpre_18_1 FFMQ: Notice, Not React, to Distress 62 58 64 63 65 79 FFMQpre_19_1 FFMQ: Some of My Emotions Are Bad 100 100 100 100 100 100 FFMQpre_2_1 FFMQ: Effective at Communicating Internal Experience 10 10 11 12 11 11 FFMQpre_20_1 FFMQ: Notice Sights 4 9 3 7 18 7 FFMQpre_21_1 FFMQ: Let Distress Drift Away 98 97 97 98 98 98 FFMQpre_22_1 FFMQ: Notice Inattentivenes to What I’m Doing 5 22 7 5 54 10 FFMQpre_23_1 FFMQ: Absent-Minded Activity 100 100 100 100 96 100 FFMQpre_24_1 FFMQ: Judging My Illogical Ideas 100 39 100 100 100 100 FFMQpre_3_1 FFMQ: Not Carried Away by Emotions 98 98 98 98 99 98 FFMQpre_5_1 FFMQ: Ineffective at Communicating Thoughts 67 67 68 69 68 68 FFMQpre_6_1 FFMQ: Pay Attention to Physical Experiences 23 23 11 13 47 32 FFMQpre_8_1 FFMQ: Difficulty Focusing on Present 84 92 87 86 90 83 FFMQpre_9_1 FFMQ: Not Carried Away by Distress 99 99 99 99 99 99 numdx # of Current Diagnoses 100 100 100 100 100 100 BL OOP BL OP Table 29: Variable Selection Percentages of Frequently-Selected and Perturbed Variables, Standard Lasso 162 Standard Elastic Net: Non-Zero Percentages Variable LabelShort Original Perturbation 1 Perturbation 2a Perturbation 2b Perturbation 3 Perturbation 4 anx0_1_1 Anx: Nervous/Anxious 100 99 100 100 99 100 anx0_6_1 Anx: Scared 89 93 90 89 93 89 anx0_7_1 Anx: Muscle Tension/Aches 95 94 95 95 94 95 dep0_1_1 Dep: Sad/Depressed 27 46 27 27 38 24 dep0_11_1 Dep: I’m a Failure 100 100 100 100 100 100 dep0_12_1 Dep: Difficulty Concentrating 94 94 95 95 93 94 dep0_16_1 Dep: Hopelessness 100 100 100 100 100 100 dep0_17_1 Depression-Related Impairment 95 97 94 95 98 95 dep0_18_1 Quality of Life 99 100 100 99 100 99 
dep0_2_1 Dep: Anhedonia 12 17 11 11 21 11 dep0_3_1 Dep: Poor Appetite 95 97 94 94 73 96 dep0_7_1 Dep: Fidgety, Can’t Sit Still 11 8 8 8 12 12 FFMQpre_10_1 FFMQ: Notice Sounds 77 75 18 84 3 79 FFMQpre_12_1 FFMQ: Running on Automatic 65 71 65 63 81 67 FFMQpre_13_1 FFMQ: Calm Soon After Distress 98 99 98 99 99 99 FFMQpre_14_1 FFMQ: Should Think Differently 30 31 30 28 33 29 FFMQpre_15_1 FFMQ: Notice Smells 36 24 54 23 15 43 FFMQpre_16_1 FFMQ: Effective at Communicating Difficulty 36 35 35 39 48 41 FFMQpre_17_1 FFMQ: Rush Through Activities 74 74 73 73 83 73 FFMQpre_18_1 FFMQ: Notice, Not React, to Distress 65 64 63 63 67 80 FFMQpre_19_1 FFMQ: Some of My Emotions Are Bad 100 100 100 100 100 100 FFMQpre_2_1 FFMQ: Effective at Communicating Internal Experience 11 9 13 13 14 12 FFMQpre_20_1 FFMQ: Notice Sights 6 9 4 8 22 7 FFMQpre_21_1 FFMQ: Let Distress Drift Away 98 97 98 98 98 98 FFMQpre_22_1 FFMQ: Notice Inattentivenes to What I’m Doing 14 27 14 9 60 17 FFMQpre_23_1 FFMQ: Absent-Minded Activity 100 100 100 100 96 100 FFMQpre_24_1 FFMQ: Judging My Illogical Ideas 100 56 100 100 100 100 FFMQpre_3_1 FFMQ: Not Carried Away by Emotions 98 98 98 98 99 98 FFMQpre_5_1 FFMQ: Ineffective at Communicating Thoughts 70 71 69 71 71 65 FFMQpre_6_1 FFMQ: Pay Attention to Physical Experiences 27 24 11 14 51 32 FFMQpre_8_1 FFMQ: Difficulty Focusing on Present 86 94 88 87 91 86 FFMQpre_9_1 FFMQ: Not Carried Away by Distress 99 100 99 99 99 99 numdx # of Current Diagnoses 100 100 100 100 100 100 BLOOP BLOP Table 30: Variable Selection Percentages of Frequently-Selected and Perturbed Variables, Standard Elastic Net 163 Adaptive Lasso: Non-Zero Percentages Variable LabelShort Original Perturbation 1 Perturbation 2a Perturbation 2b Perturbation 3 Perturbation 4 anx0_1_1 Anx: Nervous/Anxious 96 96 96 94 91 98 anx0_6_1 Anx: Scared 53 57 57 57 54 54 anx0_7_1 Anx: Muscle Tension/Aches 82 71 81 82 73 83 dep0_1_1 Dep: Sad/Depressed 13 24 13 12 15 9 dep0_11_1 Dep: I’m a Failure 100 100 100 100 100 100 dep0_12_1 Dep: Difficulty Concentrating 76 76 75 76 78 77 dep0_16_1 Dep: Hopelessness 100 100 100 100 100 100 dep0_17_1 Depression-Related Impairment 86 85 79 86 90 86 dep0_18_1 Quality of Life 96 100 97 96 96 97 dep0_2_1 Dep: Anhedonia 5 5 3 4 6 5 dep0_3_1 Dep: Poor Appetite 73 74 74 75 34 72 dep0_7_1 Dep: Fidgety, Can’t Sit Still 3 1 2 3 0 5 FFMQpre_10_1 FFMQ: Notice Sounds 66 58 9 73 1 72 FFMQpre_12_1 FFMQ: Running on Automatic 30 33 30 34 49 37 FFMQpre_13_1 FFMQ: Calm Soon After Distress 95 95 95 93 96 98 FFMQpre_14_1 FFMQ: Should Think Differently 24 18 21 23 21 28 FFMQpre_15_1 FFMQ: Notice Smells 19 12 44 8 9 29 FFMQpre_16_1 FFMQ: Effective at Communicating Difficulty 23 15 18 24 28 29 FFMQpre_17_1 FFMQ: Rush Through Activities 48 52 52 52 57 47 FFMQpre_18_1 FFMQ: Notice, Not React, to Distress 45 40 41 43 46 54 FFMQpre_19_1 FFMQ: Some of My Emotions Are Bad 100 100 100 100 100 100 FFMQpre_2_1 FFMQ: Effective at Communicating Internal Experience 3 3 5 3 3 3 FFMQpre_20_1 FFMQ: Notice Sights 1 1 1 2 6 1 FFMQpre_21_1 FFMQ: Let Distress Drift Away 93 93 90 95 93 96 FFMQpre_22_1 FFMQ: Notice Inattentivenes to What I’m Doing 4 6 5 4 27 7 FFMQpre_23_1 FFMQ: Absent-Minded Activity 98 95 99 99 94 100 FFMQpre_24_1 FFMQ: Judging My Illogical Ideas 100 37 100 100 100 100 FFMQpre_3_1 FFMQ: Not Carried Away by Emotions 95 97 93 92 97 96 FFMQpre_5_1 FFMQ: Ineffective at Communicating Thoughts 33 33 32 38 27 36 FFMQpre_6_1 FFMQ: Pay Attention to Physical Experiences 16 15 6 3 35 22 FFMQpre_8_1 FFMQ: Difficulty Focusing on Present 70 
77 77 72 73 71 FFMQpre_9_1 FFMQ: Not Carried Away by Distress 98 98 97 98 96 98 numdx # of Current Diagnoses 100 99 99 100 98 100 BL O OP BL O P Table 31: Variable Selection Percentages of Frequently-Selected and Perturbed Variables, Adaptive Lasso 164 Adaptive Elastic Net: Non-Zero Percentages Variable LabelShort Original Perturbation 1 Perturbation 2a Perturbation 2b Perturbation 3 Perturbation 4 anx0_1_1 Anx: Nervous/Anxious 96 96 96 94 91 98 anx0_6_1 Anx: Scared 55 59 63 57 56 59 anx0_7_1 Anx: Muscle Tension/Aches 78 77 82 80 73 85 dep0_1_1 Dep: Sad/Depressed 13 26 13 11 14 9 dep0_11_1 Dep: I’m a Failure 100 100 100 100 100 100 dep0_12_1 Dep: Difficulty Concentrating 76 78 79 78 79 76 dep0_16_1 Dep: Hopelessness 100 100 100 100 100 100 dep0_17_1 Depression-Related Impairment 87 91 81 88 91 87 dep0_18_1 Quality of Life 96 100 97 96 97 96 dep0_2_1 Dep: Anhedonia 5 7 3 4 9 3 dep0_3_1 Dep: Poor Appetite 73 78 74 72 36 73 dep0_7_1 Dep: Fidgety, Can’t Sit Still 2 0 2 2 2 4 FFMQpre_10_1 FFMQ: Notice Sounds 67 64 9 69 1 66 FFMQpre_12_1 FFMQ: Running on Automatic 31 34 28 35 47 37 FFMQpre_13_1 FFMQ: Calm Soon After Distress 96 96 97 96 96 96 FFMQpre_14_1 FFMQ: Should Think Differently 20 18 23 22 22 25 FFMQpre_15_1 FFMQ: Notice Smells 22 12 45 5 8 25 FFMQpre_16_1 FFMQ: Effective at Communicating Difficulty 21 20 21 23 32 24 FFMQpre_17_1 FFMQ: Rush Through Activities 49 55 50 53 62 47 FFMQpre_18_1 FFMQ: Notice, Not React, to Distress 46 46 44 41 41 57 FFMQpre_19_1 FFMQ: Some of My Emotions Are Bad 100 100 100 100 100 100 FFMQpre_2_1 FFMQ: Effective at Communicating Internal Experience 3 3 5 4 2 3 FFMQpre_20_1 FFMQ: Notice Sights 0 0 0 3 5 0 FFMQpre_21_1 FFMQ: Let Distress Drift Away 91 92 91 92 93 96 FFMQpre_22_1 FFMQ: Notice Inattentivenes to What I’m Doing 5 12 6 4 35 6 FFMQpre_23_1 FFMQ: Absent-Minded Activity 100 97 99 99 94 100 FFMQpre_24_1 FFMQ: Judging My Illogical Ideas 100 34 100 100 100 100 FFMQpre_3_1 FFMQ: Not Carried Away by Emotions 94 96 94 94 97 96 FFMQpre_5_1 FFMQ: Ineffective at Communicating Thoughts 33 32 37 41 31 32 FFMQpre_6_1 FFMQ: Pay Attention to Physical Experiences 14 12 6 1 35 19 FFMQpre_8_1 FFMQ: Difficulty Focusing on Present 71 80 76 73 76 66 FFMQpre_9_1 FFMQ: Not Carried Away by Distress 98 98 98 98 96 98 numdx # of Current Diagnoses 100 100 100 100 99 100 BLOOP BLOP Table 32: Variable Selection Percentages of Frequently-Selected and Perturbed Variables, Adaptive Elastic Net 165 Multi-Step Adaptive Elastic Net: Non-Zero Percentages Variable LabelShort Original Perturbation 1 Perturbation 2a Perturbation 2b Perturbation 3 Perturbation 4 anx0_1_1 Anx: Nervous/Anxious 84 88 89 86 81 86 anx0_6_1 Anx: Scared 41 44 40 36 39 35 anx0_7_1 Anx: Muscle Tension/Aches 64 57 64 64 54 64 dep0_1_1 Dep: Sad/Depressed 1 6 1 1 1 1 dep0_11_1 Dep: I’m a Failure 92 96 93 94 96 94 dep0_12_1 Dep: Difficulty Concentrating 38 39 39 38 44 40 dep0_16_1 Dep: Hopelessness 98 88 97 97 97 97 dep0_17_1 Depression-Related Impairment 61 58 50 55 68 57 dep0_18_1 Quality of Life 85 91 86 84 81 82 dep0_2_1 Dep: Anhedonia 1 2 1 1 2 1 dep0_3_1 Dep: Poor Appetite 61 57 57 59 20 56 dep0_7_1 Dep: Fidgety, Can’t Sit Still 1 1 1 1 0 0 FFMQpre_10_1 FFMQ: Notice Sounds 51 50 3 53 0 54 FFMQpre_12_1 FFMQ: Running on Automatic 16 21 13 16 33 16 FFMQpre_13_1 FFMQ: Calm Soon After Distress 77 78 74 77 77 82 FFMQpre_14_1 FFMQ: Should Think Differently 8 7 9 9 6 6 FFMQpre_15_1 FFMQ: Notice Smells 16 8 31 2 4 16 FFMQpre_16_1 FFMQ: Effective at Communicating Difficulty 14 8 9 8 17 10 FFMQpre_17_1 FFMQ: Rush Through 
Activities 18 16 19 13 22 13 FFMQpre_18_1 FFMQ: Notice, Not React, to Distress 19 15 18 22 20 32 FFMQpre_19_1 FFMQ: Some of My Emotions Are Bad 99 99 99 99 99 99 FFMQpre_2_1 FFMQ: Effective at Communicating Internal Experience 3 1 2 2 2 2 FFMQpre_20_1 FFMQ: Notice Sights 0 3 0 0 3 0 FFMQpre_21_1 FFMQ: Let Distress Drift Away 77 73 71 75 70 77 FFMQpre_22_1 FFMQ: Notice Inattentivenes to What I’m Doing 1 2 2 1 10 1 FFMQpre_23_1 FFMQ: Absent-Minded Activity 97 90 96 96 86 94 FFMQpre_24_1 FFMQ: Judging My Illogical Ideas 98 5 100 100 100 98 FFMQpre_3_1 FFMQ: Not Carried Away by Emotions 80 86 82 80 90 85 FFMQpre_5_1 FFMQ: Ineffective at Communicating Thoughts 25 29 26 27 25 22 FFMQpre_6_1 FFMQ: Pay Attention to Physical Experiences 9 9 1 1 16 9 FFMQpre_8_1 FFMQ: Difficulty Focusing on Present 42 50 49 43 46 39 FFMQpre_9_1 FFMQ: Not Carried Away by Distress 88 92 89 92 82 91 numdx # of Current Diagnoses 97 92 94 92 92 91 BL OOP BL OP Table 33: Variable Selection Percentages of Frequently-Selected and Perturbed Variables, Multi-Step Adaptive Elastic Net 166 Adaptive Huber Lasso: Non-Zero Percentages Variable LabelShort Original Perturbation 1 Perturbation 2a Perturbation 2b Perturbation 3 Perturbation 4 anx0_1_1 Anx: Nervous/Anxious 98 99 100 100 94 98 anx0_6_1 Anx: Scared 79 89 87 85 82 81 anx0_7_1 Anx: Muscle Tension/Aches 91 88 92 87 84 89 dep0_1_1 Dep: Sad/Depressed 18 31 22 19 25 16 dep0_11_1 Dep: I’m a Failure 100 100 100 100 100 100 dep0_12_1 Dep: Difficulty Concentrating 85 89 86 87 89 87 dep0_16_1 Dep: Hopelessness 100 100 100 100 100 100 dep0_17_1 Depression-Related Impairment 84 89 86 85 88 85 dep0_18_1 Quality of Life 98 100 98 97 98 98 dep0_2_1 Dep: Anhedonia 7 10 6 6 11 7 dep0_3_1 Dep: Poor Appetite 88 89 89 86 47 88 dep0_7_1 Dep: Fidgety, Can’t Sit Still 4 3 5 4 5 5 FFMQpre_10_1 FFMQ: Notice Sounds 57 60 12 60 0 62 FFMQpre_12_1 FFMQ: Running on Automatic 46 51 46 44 62 44 FFMQpre_13_1 FFMQ: Calm Soon After Distress 92 93 93 93 94 95 FFMQpre_14_1 FFMQ: Should Think Differently 19 20 18 17 16 22 FFMQpre_15_1 FFMQ: Notice Smells 26 20 51 7 9 34 FFMQpre_16_1 FFMQ: Effective at Communicating Difficulty 21 19 15 19 24 22 FFMQpre_17_1 FFMQ: Rush Through Activities 70 74 74 72 74 69 FFMQpre_18_1 FFMQ: Notice, Not React, to Distress 42 46 50 44 44 64 FFMQpre_19_1 FFMQ: Some of My Emotions Are Bad 100 100 100 100 100 100 FFMQpre_2_1 FFMQ: Effective at Communicating Internal Experience 6 6 7 6 4 4 FFMQpre_20_1 FFMQ: Notice Sights 7 9 3 6 11 1 FFMQpre_21_1 FFMQ: Let Distress Drift Away 91 93 92 92 91 97 FFMQpre_22_1 FFMQ: Notice Inattentivenes to What I’m Doing 14 27 24 14 61 26 FFMQpre_23_1 FFMQ: Absent-Minded Activity 99 99 100 99 89 100 FFMQpre_24_1 FFMQ: Judging My Illogical Ideas 100 13 100 100 100 100 FFMQpre_3_1 FFMQ: Not Carried Away by Emotions 95 96 94 94 97 96 FFMQpre_5_1 FFMQ: Ineffective at Communicating Thoughts 47 51 47 43 40 39 FFMQpre_6_1 FFMQ: Pay Attention to Physical Experiences 16 18 7 7 38 21 FFMQpre_8_1 FFMQ: Difficulty Focusing on Present 78 88 83 79 84 77 FFMQpre_9_1 FFMQ: Not Carried Away by Distress 98 98 98 99 98 98 numdx # of Current Diagnoses 100 100 100 100 100 100 B L O OP B L O P Table 34: Variable Selection Percentages of Frequently-Selected and Perturbed Variables, Adaptive Huber Lasso 167 Adaptive Huber Elastic Net: Non-Zero Percentages Variable LabelShort Original Perturbation 1 Perturbation 2a Perturbation 2b Perturbation 3 Perturbation 4 anx0_1_1 Anx: Nervous/Anxious 100 99 99 100 94 100 anx0_6_1 Anx: Scared 83 85 85 82 81 78 anx0_7_1 Anx: Muscle 
Tension/Aches 88 88 89 86 83 92 dep0_1_1 Dep: Sad/Depressed 22 35 23 22 33 17 dep0_11_1 Dep: I’m a Failure 100 100 100 100 100 100 dep0_12_1 Dep: Difficulty Concentrating 88 90 87 87 91 88 dep0_16_1 Dep: Hopelessness 100 100 100 100 100 100 dep0_17_1 Depression-Related Impairment 87 91 88 85 90 90 dep0_18_1 Quality of Life 98 100 99 98 98 99 dep0_2_1 Dep: Anhedonia 6 9 6 7 12 7 dep0_3_1 Dep: Poor Appetite 85 91 89 87 45 83 dep0_7_1 Dep: Fidgety, Can’t Sit Still 4 3 3 2 4 2 FFMQpre_10_1 FFMQ: Notice Sounds 60 60 7 63 0 59 FFMQpre_12_1 FFMQ: Running on Automatic 45 49 41 40 59 48 FFMQpre_13_1 FFMQ: Calm Soon After Distress 93 93 95 95 94 97 FFMQpre_14_1 FFMQ: Should Think Differently 19 21 16 20 19 22 FFMQpre_15_1 FFMQ: Notice Smells 26 15 48 4 8 32 FFMQpre_16_1 FFMQ: Effective at Communicating Difficulty 21 22 18 16 25 24 FFMQpre_17_1 FFMQ: Rush Through Activities 70 71 74 71 75 67 FFMQpre_18_1 FFMQ: Notice, Not React, to Distress 48 49 46 45 45 61 FFMQpre_19_1 FFMQ: Some of My Emotions Are Bad 100 100 100 100 100 100 FFMQpre_2_1 FFMQ: Effective at Communicating Internal Experience 3 6 5 3 4 7 FFMQpre_20_1 FFMQ: Notice Sights 5 4 2 2 7 4 FFMQpre_21_1 FFMQ: Let Distress Drift Away 93 94 91 93 92 97 FFMQpre_22_1 FFMQ: Notice Inattentivenes to What I’m Doing 15 35 24 14 54 25 FFMQpre_23_1 FFMQ: Absent-Minded Activity 100 99 100 100 93 100 FFMQpre_24_1 FFMQ: Judging My Illogical Ideas 100 20 100 100 100 100 FFMQpre_3_1 FFMQ: Not Carried Away by Emotions 94 97 96 95 97 97 FFMQpre_5_1 FFMQ: Ineffective at Communicating Thoughts 44 45 46 41 37 39 FFMQpre_6_1 FFMQ: Pay Attention to Physical Experiences 17 15 10 4 39 20 FFMQpre_8_1 FFMQ: Difficulty Focusing on Present 80 89 83 78 86 76 FFMQpre_9_1 FFMQ: Not Carried Away by Distress 98 99 98 99 98 98 numdx # of Current Diagnoses 100 100 100 99 100 99 BL OOP BL OP Table 35: Variable Selection Percentages of Frequently-Selected and Perturbed Variables, Adaptive Huber Elastic Net 168 Adaptive LAD Lasso: Non-Zero Percentages Variable LabelShort Original Perturbation 1 Perturbation 2a Perturbation 2b Perturbation 3 Perturbation 4 anx0_1_1 Anx: Nervous/Anxious 91 93 96 91 91 94 anx0_6_1 Anx: Scared 62 61 65 69 64 62 anx0_7_1 Anx: Muscle Tension/Aches 72 57 67 70 57 71 dep0_1_1 Dep: Sad/Depressed 50 57 52 51 51 43 dep0_11_1 Dep: I’m a Failure 98 100 99 99 99 98 dep0_12_1 Dep: Difficulty Concentrating 69 77 72 72 89 75 dep0_16_1 Dep: Hopelessness 99 95 100 100 100 100 dep0_17_1 Depression-Related Impairment 67 70 71 70 74 70 dep0_18_1 Quality of Life 98 100 97 96 99 98 dep0_2_1 Dep: Anhedonia 19 20 12 13 27 18 dep0_3_1 Dep: Poor Appetite 87 89 85 84 41 82 dep0_7_1 Dep: Fidgety, Can’t Sit Still 15 5 13 13 11 10 FFMQpre_10_1 FFMQ: Notice Sounds 74 68 11 66 3 68 FFMQpre_12_1 FFMQ: Running on Automatic 41 32 36 43 57 42 FFMQpre_13_1 FFMQ: Calm Soon After Distress 91 85 92 93 87 93 FFMQpre_14_1 FFMQ: Should Think Differently 20 12 16 14 19 18 FFMQpre_15_1 FFMQ: Notice Smells 15 8 38 7 3 14 FFMQpre_16_1 FFMQ: Effective at Communicating Difficulty 20 15 18 20 18 26 FFMQpre_17_1 FFMQ: Rush Through Activities 55 50 50 51 66 52 FFMQpre_18_1 FFMQ: Notice, Not React, to Distress 50 38 49 53 52 74 FFMQpre_19_1 FFMQ: Some of My Emotions Are Bad 100 99 100 100 100 100 FFMQpre_2_1 FFMQ: Effective at Communicating Internal Experience 8 2 5 9 3 6 FFMQpre_20_1 FFMQ: Notice Sights 4 2 3 18 6 3 FFMQpre_21_1 FFMQ: Let Distress Drift Away 81 79 78 76 72 80 FFMQpre_22_1 FFMQ: Notice Inattentivenes to What I’m Doing 25 29 28 22 60 35 FFMQpre_23_1 FFMQ: Absent-Minded Activity 97 98 
99 97 79 98 FFMQpre_24_1 FFMQ: Judging My Illogical Ideas 100 32 99 99 100 96 FFMQpre_3_1 FFMQ: Not Carried Away by Emotions 82 82 83 82 90 83 FFMQpre_5_1 FFMQ: Ineffective at Communicating Thoughts 43 43 42 43 40 41 FFMQpre_6_1 FFMQ: Pay Attention to Physical Experiences 16 7 6 7 26 22 FFMQpre_8_1 FFMQ: Difficulty Focusing on Present 79 83 84 82 84 78 FFMQpre_9_1 FFMQ: Not Carried Away by Distress 99 99 99 99 98 99 numdx # of Current Diagnoses 99 100 100 100 100 100 B L OOP B L OP Table 36: Variable Selection Percentages of Frequently-Selected and Perturbed Variables, Adaptive LAD Lasso 169 Adaptive LAD Elastic Net: Non-Zero Percentages Variable LabelShort Original Perturbation 1 Perturbation 2a Perturbation 2b Perturbation 3 Perturbation 4 anx0_1_1 Anx: Nervous/Anxious 94 94 97 94 93 98 anx0_6_1 Anx: Scared 65 64 64 66 67 61 anx0_7_1 Anx: Muscle Tension/Aches 71 65 73 73 64 76 dep0_1_1 Dep: Sad/Depressed 49 69 52 48 48 42 dep0_11_1 Dep: I’m a Failure 99 99 99 99 99 99 dep0_12_1 Dep: Difficulty Concentrating 75 77 74 74 89 76 dep0_16_1 Dep: Hopelessness 99 96 100 100 100 100 dep0_17_1 Depression-Related Impairment 75 72 70 75 76 76 dep0_18_1 Quality of Life 99 100 98 97 97 98 dep0_2_1 Dep: Anhedonia 16 19 11 12 22 20 dep0_3_1 Dep: Poor Appetite 87 85 85 86 43 86 dep0_7_1 Dep: Fidgety, Can’t Sit Still 12 6 12 12 7 11 FFMQpre_10_1 FFMQ: Notice Sounds 68 68 8 71 3 77 FFMQpre_12_1 FFMQ: Running on Automatic 42 36 34 43 56 50 FFMQpre_13_1 FFMQ: Calm Soon After Distress 92 90 94 93 90 96 FFMQpre_14_1 FFMQ: Should Think Differently 20 17 16 20 16 16 FFMQpre_15_1 FFMQ: Notice Smells 13 10 30 6 3 14 FFMQpre_16_1 FFMQ: Effective at Communicating Difficulty 19 18 15 19 22 28 FFMQpre_17_1 FFMQ: Rush Through Activities 54 64 56 62 60 59 FFMQpre_18_1 FFMQ: Notice, Not React, to Distress 55 40 53 52 46 70 FFMQpre_19_1 FFMQ: Some of My Emotions Are Bad 100 100 100 100 100 100 FFMQpre_2_1 FFMQ: Effective at Communicating Internal Experience 7 8 7 9 4 8 FFMQpre_20_1 FFMQ: Notice Sights 3 2 3 12 7 1 FFMQpre_21_1 FFMQ: Let Distress Drift Away 82 83 82 81 74 86 FFMQpre_22_1 FFMQ: Notice Inattentivenes to What I’m Doing 30 32 32 26 56 33 FFMQpre_23_1 FFMQ: Absent-Minded Activity 97 96 98 97 87 98 FFMQpre_24_1 FFMQ: Judging My Illogical Ideas 100 48 99 99 99 98 FFMQpre_3_1 FFMQ: Not Carried Away by Emotions 85 90 84 85 91 87 FFMQpre_5_1 FFMQ: Ineffective at Communicating Thoughts 44 41 40 44 40 40 FFMQpre_6_1 FFMQ: Pay Attention to Physical Experiences 13 6 4 6 26 17 FFMQpre_8_1 FFMQ: Difficulty Focusing on Present 79 85 82 80 83 77 FFMQpre_9_1 FFMQ: Not Carried Away by Distress 99 100 99 99 98 99 numdx # of Current Diagnoses 100 100 100 100 100 100 BL OOP BL OP Table 37: VariableSelectionPercentagesofFrequently-SelectedandPerturbedVariables, AdaptiveLADElastic Net 170 Outlier-Shifted Lasso: Non-Zero Percentages Variable LabelShort Original Perturbation 1 Perturbation 2a Perturbation 2b Perturbation 3 Perturbation 4 anx0_1_1 Anx: Nervous/Anxious 87 87 87 92 91 89 anx0_6_1 Anx: Scared 92 91 88 92 91 90 anx0_7_1 Anx: Muscle Tension/Aches 95 90 94 95 92 93 dep0_1_1 Dep: Sad/Depressed 92 93 89 88 92 89 dep0_11_1 Dep: I’m a Failure 100 98 99 100 100 99 dep0_12_1 Dep: Difficulty Concentrating 71 81 64 69 96 67 dep0_16_1 Dep: Hopelessness 100 97 97 98 100 99 dep0_17_1 Depression-Related Impairment 42 48 42 42 33 46 dep0_18_1 Quality of Life 99 98 98 99 99 98 dep0_2_1 Dep: Anhedonia 20 29 19 21 21 19 dep0_3_1 Dep: Poor Appetite 44 39 44 40 34 42 dep0_7_1 Dep: Fidgety, Can’t Sit Still 5 4 4 7 3 7 FFMQpre_10_1 FFMQ: 
Notice Sounds 78 72 24 73 6 73 FFMQpre_12_1 FFMQ: Running on Automatic 20 19 17 19 26 15 FFMQpre_13_1 FFMQ: Calm Soon After Distress 90 80 81 86 86 83 FFMQpre_14_1 FFMQ: Should Think Differently 22 26 21 21 27 22 FFMQpre_15_1 FFMQ: Notice Smells 17 19 30 30 16 22 FFMQpre_16_1 FFMQ: Effective at Communicating Difficulty 17 19 17 19 18 19 FFMQpre_17_1 FFMQ: Rush Through Activities 94 91 91 92 94 93 FFMQpre_18_1 FFMQ: Notice, Not React, to Distress 59 58 66 67 68 67 FFMQpre_19_1 FFMQ: Some of My Emotions Are Bad 100 99 98 100 100 99 FFMQpre_2_1 FFMQ: Effective at Communicating Internal Experience 4 4 4 8 8 9 FFMQpre_20_1 FFMQ: Notice Sights 9 8 21 10 9 15 FFMQpre_21_1 FFMQ: Let Distress Drift Away 74 75 74 71 66 80 FFMQpre_22_1 FFMQ: Notice Inattentivenes to What I’m Doing 12 22 14 19 28 15 FFMQpre_23_1 FFMQ: Absent-Minded Activity 90 85 88 90 90 92 FFMQpre_24_1 FFMQ: Judging My Illogical Ideas 100 25 99 100 100 99 FFMQpre_3_1 FFMQ: Not Carried Away by Emotions 89 91 89 92 96 95 FFMQpre_5_1 FFMQ: Ineffective at Communicating Thoughts 76 65 72 75 72 77 FFMQpre_6_1 FFMQ: Pay Attention to Physical Experiences 49 49 46 14 63 55 FFMQpre_8_1 FFMQ: Difficulty Focusing on Present 86 92 86 91 96 89 FFMQpre_9_1 FFMQ: Not Carried Away by Distress 98 97 98 99 99 99 numdx # of Current Diagnoses 95 98 96 97 98 97 Table 38: Variable Selection Percentages of Frequently-Selected and Perturbed Variables, Outlier-Shifted Lasso 171 Outlier-Shifted Huber Lasso: Non-Zero Percentages Variable LabelShort Original Perturbation 1 Perturbation 2a Perturbation 2b Perturbation 3 Perturbation 4 anx0_1_1 Anx: Nervous/Anxious 82 72 82 80 80 79 anx0_6_1 Anx: Scared 83 86 82 85 89 83 anx0_7_1 Anx: Muscle Tension/Aches 87 85 89 86 82 86 dep0_1_1 Dep: Sad/Depressed 80 79 83 78 79 78 dep0_11_1 Dep: I’m a Failure 100 100 100 100 100 100 dep0_12_1 Dep: Difficulty Concentrating 57 57 51 52 49 52 dep0_16_1 Dep: Hopelessness 98 88 96 98 99 97 dep0_17_1 Depression-Related Impairment 57 69 58 60 75 59 dep0_18_1 Quality of Life 90 95 88 89 90 92 dep0_2_1 Dep: Anhedonia 34 27 34 32 32 31 dep0_3_1 Dep: Poor Appetite 35 29 31 31 19 34 dep0_7_1 Dep: Fidgety, Can’t Sit Still 17 12 16 16 17 19 FFMQpre_10_1 FFMQ: Notice Sounds 48 35 24 47 8 38 FFMQpre_12_1 FFMQ: Running on Automatic 26 20 23 22 20 26 FFMQpre_13_1 FFMQ: Calm Soon After Distress 75 68 69 76 77 74 FFMQpre_14_1 FFMQ: Should Think Differently 28 27 20 25 26 26 FFMQpre_15_1 FFMQ: Notice Smells 30 27 31 61 29 23 FFMQpre_16_1 FFMQ: Effective at Communicating Difficulty 38 27 32 24 34 37 FFMQpre_17_1 FFMQ: Rush Through Activities 73 76 77 74 83 77 FFMQpre_18_1 FFMQ: Notice, Not React, to Distress 41 39 33 40 42 45 FFMQpre_19_1 FFMQ: Some of My Emotions Are Bad 100 100 100 100 100 100 FFMQpre_2_1 FFMQ: Effective at Communicating Internal Experience 20 35 20 15 23 20 FFMQpre_20_1 FFMQ: Notice Sights 35 24 35 27 27 37 FFMQpre_21_1 FFMQ: Let Distress Drift Away 81 83 77 76 79 86 FFMQpre_22_1 FFMQ: Notice Inattentivenes to What I’m Doing 21 23 17 16 39 25 FFMQpre_23_1 FFMQ: Absent-Minded Activity 97 91 96 98 83 96 FFMQpre_24_1 FFMQ: Judging My Illogical Ideas 96 22 96 97 99 96 FFMQpre_3_1 FFMQ: Not Carried Away by Emotions 96 98 98 97 99 96 FFMQpre_5_1 FFMQ: Ineffective at Communicating Thoughts 62 58 54 56 46 63 FFMQpre_6_1 FFMQ: Pay Attention to Physical Experiences 24 30 23 27 30 23 FFMQpre_8_1 FFMQ: Difficulty Focusing on Present 93 91 90 92 93 91 FFMQpre_9_1 FFMQ: Not Carried Away by Distress 95 96 96 96 94 96 numdx # of Current Diagnoses 88 88 87 86 88 94 BL OOP BL OP Table 39: 
Variable Selection Percentages of Frequently-Selected and Perturbed Variables, Outlier-Shifted Huber Lasso

6.7 Discussion

Previous research has established associations between the four main constructs utilized in the applied models[55] in outpatient or randomized-controlled trial contexts. Many psychometric studies of the FFMQ and AAQ utilize the other measure for validation purposes. Baer et al.'s (2006) initial psychometric study of the FFMQ discussed the conceptual similarity between the Nonjudgment and Nonreactivity facets of the FFMQ and the psychological conceptualization of acceptance. Their results found a significant correlation for all FFMQ facets besides Observe, with Nonjudgment showing the strongest and Nonreactivity the next-strongest correlation. Meanwhile, Fledderus et al. (2012) found that all but the Observe facet correlated with the AAQ-II. Nonreactivity and Nonjudgment produced the strongest associations with the AAQ-II in this study.

These findings are consistent with the data in the current study. Focusing in particular on the list of all methods' top 15 most-selected predictors, both Nonreactivity and Nonjudgment were strong performers in different ways. Two out of five Nonreactivity items were most-selected by all 11 methods studied, and two further were most-selected by 10 out of 11 methods. One set of follow-up analyses focused on the fifth item on the Nonreactivity facet,[56] given its sole exclusion from the top-15 predictors.

The Nonjudgment facet produced more interesting results. Only two of the five Nonjudgment items show up in the list of top-15 model-selected predictors. However, those two items[57] were two of the three most-selected items across models and training-testing splits in the unperturbed data, being eliminated in only 0.5% or 0.1% of all 1100 applied models.

Despite previous research showing the lack of a relationship between the Observe facet and psychological flexibility, a single item from this facet appears in the list of top-15 predictors across methods and adaptations. Two pieces of information suggest the potential aberrance of this observation. Only the adaptive LAD lasso selected this item as a top-15 predictor. Furthermore, the adaptive LAD lasso only selected this item in 74 out of 100 data splits, the lowest frequency among its top-15 predictors. The correlation plots presented in Figures 6.2 and 6.3 provide additional information explaining FFMQ item 10's behavior. FFMQ item 10 shares minimal-to-nonexistent correlations with all other variables besides the other Observe items. That its only top-15 selection occurred in a lasso variant, along with the exclusion of the other Observe items from top-15 predictors, suggests the lasso's arbitrary selection problem plays some role. Further supporting this explanation, increasing its correlation with other top-15 predictors reduced its selection to nearly 0% across methods.

Consider the performance metrics used to judge the models applied to the current dataset. As observed in the simulation studies, average prediction performance did not distinguish most methods.

[55] Namely, depressive symptoms, anxiety symptoms, mindfulness, and psychological flexibility/acceptance.
[56] "Usually when I have distressing thoughts or images I can just notice them without reacting."
[57] "I disapprove of myself when I have illogical ideas," and "I think some of my emotions are bad or inappropriate and I shouldn't feel them."
The OS Huber lasso, OS lasso, and MS adaptive elastic net all showed median test-set RMSEs and IQRs not contained within the IQR of the remaining eight methods. Other than these three, however, all methods showed incredibly similar predictive performance. The greatest variability in performance in the main analyses arose in computational time, which is not a useful metric at the scale of models run by an average researcher. However, if running time is of concern and a researcher plans on applying methods in the manner applied here,[58] neither the adaptive LAD methods nor the MS adaptive elastic net are as efficient. The standard lasso, standard elastic net, OS lasso, and OS Huber lasso all outpaced the other methods. Number of coefficients selected also distinguished methods quite meaningfully. The MS adaptive elastic net, the adaptive lasso, and the adaptive elastic net were all typically the most conservative concerning inclusion of predictors. The adaptive Huber and adaptive LAD methods showed the next most-parsimonious median selection numbers, although the range of selected coefficients varied much more widely for the adaptive LAD models.

[58] In other words, running models over multiple training-testing splits.

6.7.1 Discussion: Perturbation Scenarios

The first perturbation was conducted to evaluate whether the inter-correlations among four of the top 5 predictors played a role in their frequent selection. The alterations made resulted in drastic selection reductions of the unaltered item. These alterations produced the smallest selection impact on FFMQ item 24 in the standard elastic net and adaptive LAD elastic net. Meanwhile, the MS adaptive elastic net, the adaptive Huber lasso, and the adaptive Huber elastic net were the most impacted. The MS adaptive elastic net, OS methods, and adaptive LAD lasso all saw the most downstream variables' selections change between the original data and Perturbation 1.

FFMQ item 24's consistently reduced selection, compared to the three items receiving direct manipulation, presents several potential implications. Intercorrelations among potential predictors impacted selection in the elastic net as well as the lasso. This perturbation reduced FFMQ item 24's selection probability by a smaller degree in some of the elastic net versions compared to their lasso analogs, particularly for the standard and adaptive LAD variants. On the other hand, the MS adaptive elastic net saw the greatest selection reduction of this item. The MS adaptive elastic net's selection conservatism provides a potential explanation for this discrepancy. Furthermore, the data manipulations enacted in the current chapter, combined with the specific dataset utilized, likely produced highly specific effects that themselves presented in the reduced selection of this particular item. Nonetheless, the reduced selection of the unmanipulated item rather than the manipulated ones carries significance for future study of lasso and elastic net adaptations.

FFMQ item 10 presented a unique finding among frequently-selected predictors in the main applied analyses: that of a predictor which generally did not correlate with the other frequent selections. This observation contrasts with the finding underlying Perturbation 1. Three alterations were made to explore FFMQ item 10's position among top-15 predictors.
The first and second explored its consistent selection over the other items comprising the Observe subscale, with which it shared its only moderate correlations. Perturbations 2a and 2b pursued the same goal but in different manners. Perturbation 2a eliminated the correlation between item 10 and the other Observe items by implementing a synthetic random integer variable in item 10's place. Perturbation 2b, in contrast, created three separate synthetic alternatives in the place of the other Observe items, leaving item 10 itself unaltered.

Perturbation 2a resulted in reduced selection of item 10 and item 6 for all but the OS methods. Items 6 and 15 saw consistent reduction across methods under Perturbation 2b. The OS Huber lasso saw the most downstream selection impacts in both Perturbations 2: alongside the MS adaptive elastic net, adaptive Huber lasso, and adaptive LAD elastic net under 2a, and alongside the MS adaptive elastic net and OS lasso under 2b. Perturbation 2b did not produce any downstream selection changes for the adaptive Huber lasso.

Perturbation 3 elaborated upon Perturbations 2a and 2b by increasing item 10's correlation with completely uncorrelated items among the top-15 predictors. This manipulation was also intended as an indirect contrast to Perturbation 1. Both scenarios increased the correlation of one frequent predictor with three others through manipulation of the three. Scenario 1 focused on three associations already of small-to-moderate size; Scenario 3, on the other hand, increased correlations among previously uncorrelated variables. Similarly to Perturbation 1, Perturbation 3 resulted in a consistent reduction in the singular item's selection probability, regardless of adaptation, and by a larger magnitude and proportion relative to the original dataset. Where Perturbation 1 less consistently altered the selection of the altered variables, however, Perturbation 3 consistently reduced the selection of two of the three directly-altered items. The degree of change was relatively small for some adaptations, but it still contrasted with selection that was otherwise consistent across other perturbations. Further complicating these results, the item seeing the largest and most stable reduction of the three had only been induced to a small correlation. On the other hand, the item with only slight-but-consistent reductions had been induced to a large correlation, and the largely unaffected item had been altered to a moderate correlation with the main item of interest. Perturbation 3 produced the greatest frequency of downstream selection changes in addition to the most frequent selection changes in directly altered variables. This scenario presents further evidence of an idiosyncratic process underlying the current data-generating mechanism, whether through the perturbations made, the psychometric characteristics explaining the current sample's responses, or some complex combination of the two with other intricate behavioral and statistical processes.

Perturbation 4 evaluated the reverse scenario from 2a and 2b: a singular subscale item remaining unselected despite frequent selection of all other items on the subscale. Four out of the five Nonreact items specifically reference "distressing thoughts or images," although only FFMQ item 18 did not occur as a top-15 predictor. Perturbation 4 intended to examine whether a similar-but-opposing mechanism to the lasso's arbitrary selection explained this unselected Nonreact item.
Findings of reduced selection of one or more of those predictors might support such a mechanism; unfortunately, the opposite occurred. Perturbation 4 consistently resulted in the increased selection of item 18 into the models, with only the OS Huber lasso not producing a notable change in selection rate. The other Nonreact items also did not see consistent changes in selection rate, further confounding the results of this scenario. Perturbation 4 consistently produced the fewest selection changes among the directly altered variables, with most adaptations only increasing selection of FFMQ item 18. The standard and adaptive LAD formulations, in particular, all increased item 18's selection to between 70% and 80%.

6.7.2 Applied Analyses: Limitations and Recommendations

The present study has a few limitations. The study of PHP-level data is rare. Furthermore, these analyses focus on a particular group of high-intensity patients, particularly those with health insurance that covered the expense of a SCID. Further research should expand these methods into outpatient data and would further benefit from connecting treatment entry status with treatment outcomes and longer-term trends.

These results also present a potential limitation of lasso and elastic net models. Unlike standard regression models, which can incorporate binary predictors 59, it is unclear whether binary and multinomial predictors are viable in these techniques. The reason for this comes from the necessity of standardizing inputs into lasso and elastic net models. All R implementations of lasso and elastic net models used in the current study handle predictor data in the following manner: the implementation standardizes the initial predictors before estimating the model and unstandardizes the corresponding coefficients after model estimation for improved interpretation in the original context. The author was unable to find any research investigating whether issues arise in either step of this process when potential predictors are not continuous or follow different distributions. The standardization process might differentially alter two-level variables as opposed to continuous variables. The unstandardization process might present further difficulties of the same nature. This potential concern resulted in the decision to exclude diagnostic predictors from the original data in the current models.

59 And, consequently, multinomial categorical predictors via dummy-coding.

The non-continuous, ordinal nature of the Likert items used as predictors presents another potential limitation. If the binary problem suggested above is correct, related issues should apply to the predictors used in these models, particularly concerning interpretation of model coefficients for discrete predictors. The current focus of this research on proper variable selection rather than coefficient precision might temper this concern for these analyses. Considered another way: if interest lies primarily in determining the true contributors to a process and the input variables fall on the same order-meaningful scale, these selection methods will still provide meaningful information 60. However, a potential concern remains given that age and number of current diagnoses fall on different discrete scales than the measure items included.

60 Although direct numerical interpretation is ill-advised.

The potential issue presented by non-continuous predictors in lasso and elastic net models merits further consideration. Categorical and binary variables abound in the behavioral research context, and the inability to incorporate such variables into these variable selection procedures presents a significant hurdle to their practicality.
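To make the standardization concern concrete, the following minimal R sketch (not part of the original analyses) contrasts glmnet fits with internal standardization turned on and off for a small simulated dataset containing one continuous and one binary predictor. The simulated data, variable names, and the fixed penalty value are illustrative assumptions; the sketch only demonstrates the step whose behavior with non-continuous predictors is in question, namely that glmnet standardizes predictors internally by default and reports coefficients back on the original scale.

#load package library
library(glmnet)

#simulate a small illustrative dataset (assumption):
# #one continuous predictor, one binary predictor, and a response
set.seed(1)
n <- 200
x.cont <- rnorm(n)
x.bin <- rbinom(n , 1 , 0.5)
X <- cbind(x.cont , x.bin)
y <- 1 + 0.5 * x.cont + 0.75 * x.bin + rnorm(n)

#fit the lasso with internal standardization (the default)
fit.std <- glmnet(X , y , alpha = 1 , standardize = TRUE)

#fit the lasso without internal standardization
fit.raw <- glmnet(X , y , alpha = 1 , standardize = FALSE)

#compare coefficients at an arbitrary, illustrative penalty value;
# #both are reported on the original scale, but the penalty is applied
# #to standardized predictors only in the first fit, which is where the
# #two-level predictor may be treated differently
coef(fit.std , s = 0.05)
coef(fit.raw , s = 0.05)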
The perturbations conducted in this chapter, while interesting, do not comprehensively address the issue of predictor collinearity. Even more so than the conditions simulated in Chapters 3-5, the manipulations herein likely produced idiosyncratic results and should be considered preliminary. The absence of any collinearity issues as measured by VIF further suggests the specificity of the process studied by the follow-up perturbations. More comprehensive simulations addressing collinearity in robust regularization are necessary to better understand and complement the existing body of research. Alterations in surface-level, pairwise correlations do not capture the true interconnectedness of behavioral phenomena, nor do these alterations measure or address the underlying psychometric mechanisms that produce responses. The general discussion chapter addresses the simulation-idiosyncrasy concern in more detail with respect to the full simulations conducted in previous chapters.

The models applied herein do not capture the nuances and interdirectional nature of behavioral constructs, nor do they stand in for proper psychometric evaluation of behavioral measures. Application of variable selection methods in the manner described herein requires further development and empirical exploration. However, this application presents potential utility, demonstrated most clearly by the similarities between the current results and outpatient studies of psychological flexibility, mindfulness, depression, and anxiety. Expanding upon these connections with longitudinal data might provide prognostic utility in best tailoring treatment to the differential needs of individual patients. Furthermore, elaborating on the potential problems presented by non-continuous predictors is paramount to proper implementation of these methods.

Despite the current limitations, recommendations are still necessary. Combining the results of the main applied analyses and the perturbation studies can supplement the simulation results to eliminate adaptations with clear limitations and narrow the field of practical methods for the average researcher. Given the holistic nature of such a recommendation, this is left to the general discussion in the following chapter.

Chapter 7
Discussion and Future Directions

The current research program presents preliminary work on the unified study of lasso and elastic net procedures and their ability to properly select variables in the presence of outliers and non-normal error distributions. This research included evaluation of outlier and distributional robustness under high dimensionality 1 and also briefly considered the performative boundary surrounding p = n in the absence of robustness concerns. The current lasso and elastic net models were applied to intake data for a group of patients entering high-intensity psychiatric day-treatment, with follow-up analyses providing limited perspectives on the impacts of predictor correlation. This discussion begins with a general review of the findings from each study. The author makes initial recommendations on the practical application of these methods, outlines limitations of the current study, and sets a course for future research in this area.

1 Aka, p > n.
7.1 General Discussion

Outliers, particularly in the response, produced negative impacts on most methods' ability to properly select true predictors into a model and true non-predictors out of the model. Although this was not universally true, and the nature of the negative impacts varied 2, outliers produced negative impacts on the five most reliable methods studied 3. Leverage points, however, did not produce consistent negative impacts on model performance. Some methods improved with leverage point contamination, whether by generally improving in FPR/FNR or by showing greater improvements over increases in sample size. The observed protective effect of leverage points in the current data merits further exploration in future studies. However, the process used in these simulations to generate predictor outliers likely produced model-supporting rather than detrimental outliers, and this likely produced the protective effect. Dimensionality produced a significant positive impact on FPR, although this likely relates to the extreme increase in potential predictors without a corresponding increase in true predictors. The other notable observation in higher-dimension considerations of outlier robustness is the extreme likelihood of the MS adaptive elastic net to produce null models under high dimensionality. In lower dimensions, the standard lasso, standard elastic net, MS adaptive elastic net, and OS Huber lasso produced more null models than most other methods, depending on the data condition. However, the OS lasso produced null models more often and more consistently than all other methods, with a null-model rate exceeding 20% in some scenarios.

2 For example, generally increasing FPR vs. reducing the improvements made over sample size.
3 Notably, the adaptive Hubers, the adaptive LADs, and the unmodified adaptive methods.

Error distribution, at least as generated in the current simulations, had a much less meaningful impact on variable selection performance. However, heavy tails produced a consistent and observable negative impact. This impact increased with additional increases in skew. Fewer methods produced null models compared to the outlier simulations, although the MS adaptive elastic net and the OS lasso again produced null models in more conditions.

Boundaries of dimensionality induced impacts on the performance of some methods. Clear performance changes surround p = n as well. For the most part, FPR and RMSE converge with larger p, and further distinctions in coefficient precision might occur at larger values of p. Incorporating characteristics such as outliers and non-normality would enhance the current findings, as would increasing the levels of p surrounding p = n, varying n as well as p, and varying the ratio of true predictors to the total number of potential predictors.

Finally, in the applied data setting, the association of psychological flexibility with items related to anxiety and depression was investigated. Holistic consideration of the frequently selected predictors for each method and adaptation provided interesting findings. Although the overlap is not exact, there were clear connections between the consistent and near-unanimous selection of particular Nonreactivity and Nonjudgment items from the FFMQ and previously observed associations between non-Observe facets of the FFMQ and the AAQ-II. FFMQ-SF items with seeming similarity were differentially selected as top-15 predictors.
Potential explanations for this observation include: a difference in the way respondents in the PHP context answered those questions; differential predictive capacity that might vary with the broadness or specificity of an item's content; or the lasso's arbitrary inclusion and exclusion of correlated predictors in a model. Follow-up analyses that perturbed correlations among variables produced meaningful selection changes, although these changes often varied by method and scenario. However, some variables were still selected frequently and consistently, including CUXOS' general nervousness/anxiety item; CUDOS' worthlessness, hopelessness, and general quality-of-life items; and FFMQ items corresponding with judgment of some emotional experiences (item 19) and not being carried away by distressful thoughts or images (item 9). The number of current diagnoses also remained a consistent selection across methods. One of the perturbation scenarios produced a drastic reduction in selection of an FFMQ item indicating judgment of illogical ideas (item 24). This reduction might have arisen from the particular perturbation made.

7.1.1 On Extreme Null Models, False Positives, and False Negatives

The MS adaptive elastic net demonstrated a general relationship observed in hypothesis testing: holding all else constant, decreasing the Type I error rate 4 inherently increases the Type II error rate 5. This observation indicates the main limitation of the MS adaptive elastic net: in attempting to adjust for excessive false positives in the lasso, the method appears to overcompensate by generally eliminating any potential predictor from the model. Notably, Xiao and Xu (2017) do not make recommendations for the number of adaptive steps to use. The choice in the current study to use k = 10 adaptive steps potentially contributed to the extreme null-model rate observed. However, a small subset of simulations using fewer adaptive steps still produced high rates of null models. Further study on the optimal number of steps would greatly improve understanding and the potential utility of the MS adaptive elastic net.

4 i.e., the FPR in this context.
5 i.e., the FNR in this context.

7.2 Recommendations

Unfortunately, statistics does not provide black-and-white answers, despite perceptions of the concreteness that its solutions provide. On the other hand, statistics still needs to be able to communicate information about observed data and the underlying process generating it. The traditional values used when conducting hypothesis tests - that is, power of 0.80 and a Type I error rate of 0.05 - should serve as practical heuristics around which non-statisticians can orient discrete choices, given a non-discrete problem. Proper application of these norms requires nuanced interpretation and incorporation of relevant substantive and empirical considerations.

First, consider methods that do not suffice based on the current research. The outlier-shifted methods did not perform competitively in the presence of outliers, and they often performed drastically worse than other methods. Although the OS lasso significantly outperformed all other methods in two specific outlier conditions, it otherwise performed extremely poorly 6. The two OS methods occasionally produced competitive or strong performances; however, extreme false selection rates in simulations render both impractical in the applied context.
These limitations combine with obtuse software implementation to provide a clear indication that these two methods cannot be recommended in their current form and implementation.

6 See Figure 3.5.

The MS adaptive elastic net is not recommended, particularly in the higher-dimensionality context, given the near-100% null-model rate under certain dimensionality conditions. The method showed strong FPRs when sufficient non-null models were produced. However, given that few non-null models arose in three out of nine outlier data scenarios in higher dimensions, significant uncertainty exists when applying the method to real data. Given the goal of constraining uncertainty within some known boundaries, this is untenable.

However, the MS adaptive elastic net can be recommended in lower dimensions given particular concerns on the researcher's part. The clearest performative advantage of the MS adaptive elastic net concerns exclusion of true zero variables. Its "Type I error rate" of false positives approaches the arbitrary norm of α < 0.05 that has been accepted as tradition in applied research. This compares extremely favorably to any other method's FPR. However, this extreme advantage in FPR comes at the cost of extreme underperformance in selecting the true variables into the model, especially under heavier tails or any amount of response contamination. The MS adaptive elastic net still meets the standard of traditional hypothesis-testing norms once sample size increases to n = 75, with "power" 7 greater than or equal to 0.80 across conditions. Given its adherence to these arbitrary benchmarks and its incredibly consistent and strong FPR performance in simulations, the MS adaptive elastic net is a strong choice at moderate and larger sample sizes. Its utility is especially strong for researchers concerned with eliminating non-predictors.

7 Aka, 1 - FNR.

Although the standard elastic net and lasso both performed much better than expected in multiple scenarios, their performance across data scenarios was volatile. Therefore these methods are not recommended regardless of dimensionality. Although the exact behavior of the adaptive lasso and adaptive elastic net was not the same as the standard formulations, they also are not recommended due to performance inconsistency.

The adaptive Hubers and adaptive LADs provide the most reliable tools available to researchers 8. These four methods generally underperformed at least one method for any given metric. However, each showed competitive performance and relative consistency across data scenarios. The lasso variants typically outperformed their elastic net analogs, although this performance advantage might disappear in the presence of collinearity. Supporting this notion, perturbed correlations in the follow-up applied analyses generally resulted in more selection instability in lasso versus elastic net variants. Observations from these perturbation analyses suggest some limitations of the adaptive Huber methods concerning correlational instability. Perturbation 1 resulted in significantly reduced selection of FFMQ item 24, regardless of method. The adaptive Huber methods saw the second and third largest reductions in selection of item 24, emphasizing the potential limits in handling correlational changes as manifested in the current analyses. However, the adaptive Huber methods otherwise produced consistent selection rates. The adaptive LAD elastic net generally showed stable selection rates for directly altered variables.

8 Unless false positives are of such concern that the gain of the MS adaptive elastic net in low dimensions is worth the loss in FNR.
However, the adaptive LAD elastic net also produced downstream selection instability in multiple scenarios. The two adaptive Hubers and the two adaptive LADs, however, and particularly the elastic net variants, remain the best recommendations for the average applied researcher interested in variable selection.

7.2.1 Streamlined Outline of Recommendations

The adaptive Huber lasso, followed by the adaptive LAD lasso, is recommended for various applied data scenarios. Both methods demonstrate robustness in the ideal sense: in the absence of universally-dominant methods, they provide competitive and consistent performance under both non-ideal data conditions and data following ideal assumptions. The adaptive Huber lasso is recommended over the adaptive LAD lasso unless considering data with a single-digit number of potential predictors p. If an elastic net variant is being implemented based on the recommendations made in this section, the author recommends the balanced α = 0.5, given its use in the current research. However, researchers with a solid understanding of the lasso, the ridge, and the elastic net should feel confident using different specifications to change the balance between variable removal and prediction capacity.

7.2.1.1 Low Dimensionality

Figure 7.1: Decision Tree, Low Dimensionality; **see recommendation in Section 7.2.1.1, 3rd paragraph

Given collinearity concerns, the elastic net variants of the adaptive Huber or adaptive LAD formulations are recommended. The choice between the two formulations is largely arbitrary under low dimensionality, although the Huber formulations take significantly less time to fit if timing is a significant concern. The elastic net variants are similarly recommended over their lasso counterparts given a particular combination of concerns with false negatives and smaller sample sizes (n < 50).

The multi-step adaptive elastic net is recommended given specific data conditions and concerns of the researcher. A researcher planning to use this method should be far more concerned with false positives than false negatives. The strong performance in eliminating non-predictors corresponds with poor capacity to select true predictors into the model in the presence of outlier contamination. k = 10 stages is recommended as an initial model; if a null model is produced, repeat the process with k = 5 stages; and if a null model is produced with k = 5 stages, use the adaptive Huber lasso (a sketch of this stepwise fallback appears at the end of this subsection). As described previously, concerns with collinearity call for the elastic net variant of the adaptive Huber formulation instead of the lasso variant.

The Outlier-Shifted Huber lasso, finally, comes with even more cautious and conditional recommendations. If identifying true predictors of a process is of the utmost importance, outliers are expected in the response variable, and false positives are not a concern, the OS Huber lasso is recommended. Particular emphasis is placed on a complete lack of concern for false positives, as FPRs approaching 80% were observed under response contamination alone. The author also strongly recommends competency with statistical implementation in R using custom functions. The OS Huber lasso is implemented via a complex custom R function that likely requires further debugging.
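The stepwise fallback recommended above for the multi-step adaptive elastic net (k = 10, then k = 5, then the adaptive Huber lasso) can be organized into a short script. The sketch below is illustrative only: it assumes the msaenet package interface of Xiao and Xu (2019) and the hqreg package for the Huber-loss fallback, the simulated data and the simple "null model" check are the author's assumptions rather than part of the original analyses, exact argument names should be verified against the package documentation, and the adaptive weighting step of the adaptive Huber lasso is omitted for brevity.

#load package libraries
library(msaenet)   #multi-step adaptive elastic net
library(hqreg)     #Huber-loss lasso/elastic net (fallback)

#illustrative simulated data (assumption; replace with real predictors/response)
set.seed(1)
n <- 60 ; p <- 25
X <- matrix(rnorm(n * p) , nrow = n)
y <- X[ , 1] - X[ , 2] + rnorm(n)

#helper: fit the MS adaptive elastic net with k adaptive steps
# #(argument names assumed from the msaenet documentation)
fit.ms <- function(k) {
        msaenet(X , y , family = "gaussian" , init = "enet" ,
                tune = "cv" , nfolds = 5 , nsteps = k , seed = 1)
}

#helper: flag a null model, i.e., no nonzero coefficients
# #(assumes coef() returns the fitted coefficient vector)
is.null.model <- function(fit) {
        all(as.numeric(coef(fit)) == 0)
}

#recommended sequence: k = 10, then k = 5, then the Huber-loss lasso
fit <- fit.ms(10)
if(is.null.model(fit)) fit <- fit.ms(5)
if(is.null.model(fit)) {
        #fallback: cross-validated Huber-loss lasso
        # #(adaptive weighting omitted in this sketch)
        fit <- cv.hqreg(X , y , method = "huber" , alpha = 1 , nfolds = 5)
}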
7.2.1.2 High Dimensionality

Figure 7.2: Decision Tree, High Dimensionality; **see recommendation in Section 7.2.1.2, 2nd paragraph

The adaptive Huber and adaptive LAD lassos are both generally recommended in high dimensions. If timing is a concern, the adaptive Huber lasso is preferred. The adaptive LAD methods typically feature longer runtimes, especially in higher dimensions. Researchers should use the elastic net variants instead if collinearity is a potential concern.

The standard OS lasso proved a strong performer across metrics and scenarios under high dimensionality. Recommended use, however, is limited only to skilled functional R users. The custom function used to implement the non-Huber OS lasso proved even more complex and obtuse than the Huberized version. Therefore, the author only recommends its use given that the researcher has both R expertise and the capability to understand the formal 9 model being implemented.

9 Aka, mathematical.

7.2.2 Proposed Method and Area for Research

In working with the applied dataset, a potential recommendation arose for the practical application of some methods. Note that this recommendation is particularly made for methodologically- and computationally-oriented researchers. The process involves multiple steps, each presenting potential research questions for empirically studying and improving the combined process.

First, data should be split into training-testing splits for cross-validation and evaluating prediction performance.
• Research Question 1: How many cross-validation splits should be used?
Next, apply each adaptation of interest to each cross-validation training set.
• Research Question 2a: What adaptations should be used as models?
• Research Question 2b: What is the ideal/optimal number of adaptations?
After application of each method to each training dataset, evaluate the most frequent predictors for each adaptation across all cross-validation splits.
• Research Question 3: What is the optimal number of "Top-X Most Frequent Predictors"?
Compile the unique predictors appearing as Top-X predictors across adaptations. Finally, apply some metric or criterion to this set of predictors from multiple adaptations to arrive at an improved set of variables for selection.
• Research Question 4: What metrics or criteria should be used for filtering the combined set of Top-X predictors?

The author hopes that, by empirically studying the various steps of the combined model outlined above, a variable selection tool can be validated and formally implemented in statistical software to make it available to all researchers. A preliminary approach is proposed before the development of empirical results. Conduct the steps outlined in Figure 7.3.

Figure 7.3: Structure of Combined Model, Plus Research Questions for Further Study (Orange)

This combined model should include the following adaptations:
• Adaptive Huber Lasso
• Adaptive Huber Elastic Net
• Adaptive LAD Lasso
• Adaptive LAD Elastic Net
• Multi-Step Adaptive Elastic Net, k = 3
• Multi-Step Adaptive Elastic Net, k = 5
• Multi-Step Adaptive Elastic Net, k = 10

For consistency with the application in Chapter 6, 100 cross-validation splits are recommended. This discussion does not recommend how many predictors to select per adaptation. In the absence of empirically-derived criteria to guide model results, both a modeling expert and a substantive expert should qualitatively interpret the collected list of Top-X predictors. Organized code will be provided on GitHub to support use of such a method. Implementation will likely require more technical skill than the broader applied audience this research is directed at.
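A minimal, runnable illustration of the skeleton of this combined procedure is sketched below. To keep the example self-contained it uses cv.glmnet as a stand-in for a single adaptation; in practice each of the adaptations listed above would supply its own fitting function. The simulated data, the two-thirds training proportion, the number of splits, and the choice of X = 15 are illustrative assumptions consistent with the application in Chapter 6 rather than validated settings.

#load package library
library(glmnet)

#illustrative simulated data (assumption; replace with the applied dataset)
set.seed(1)
n <- 150 ; p <- 40
X <- matrix(rnorm(n * p) , nrow = n , dimnames = list(NULL , paste0("V" , 1:p)))
y <- X[ , 1] + 0.5 * X[ , 2] - X[ , 3] + rnorm(n)

n.splits <- 100      #Research Question 1: number of training-testing splits
top.x <- 15          #Research Question 3: size of the "Top-X" list

#stand-in fitting function for one adaptation;
# #returns the names of predictors with nonzero coefficients
fit.one <- function(train.idx) {
        cv.fit <- cv.glmnet(X[train.idx , ] , y[train.idx] , alpha = 0.5)
        b <- as.matrix(coef(cv.fit , s = "lambda.min"))[-1 , 1]
        names(b)[b != 0]
}

#apply the adaptation across all splits and tally selection frequencies
selected <- lapply(1:n.splits , function(i) {
        train.idx <- sample(1:n , size = round(2 * n / 3))
        fit.one(train.idx)
})
freqs <- sort(table(unlist(selected)) , decreasing = TRUE)

#Top-X most frequently selected predictors for this adaptation;
# #repeating this for every adaptation and taking the union of the
# #Top-X lists gives the combined candidate set for expert review
head(freqs , top.x)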
7.2.3 The Adaptive LAD Elastic Net

One of the innovative features of the current research program is the preliminary study of the elastic net formulation using the LAD loss instead of squared-error loss. This particular elastic net adaptation arose from software intended to implement Huber-loss and quantile-loss functions in both the lasso and the elastic net using a novel computational algorithm (Yi and Huang, 2017). Although their study makes no mention of this formulation, their package, hqreg (Yi, 2017), serves as a practical and straightforward lasso and elastic net package and provides the basis for studying this new elastic net adaptation. Although the findings herein are very preliminary, they suggest the potential utility of the adaptive LAD elastic net for robust variable selection. The authors deserve credit not just for the practicality of their software but also for opening this door for future researchers to enter.
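As a concrete illustration of the point above, the sketch below fits the (non-adaptive) LAD elastic net by calling hqreg's quantile loss at tau = 0.5, which reduces to least absolute deviations, with the elastic net mixing parameter set to the balanced α = 0.5 used throughout this research. The simulated data are an assumption, the argument names follow the author's reading of the hqreg documentation (Yi, 2017) and should be verified, and the adaptive weighting step studied in earlier chapters is deliberately omitted.

#load package library
library(hqreg)

#illustrative simulated data (assumption)
set.seed(1)
n <- 100 ; p <- 20
X <- matrix(rnorm(n * p) , nrow = n)
y <- X[ , 1] - X[ , 2] + rnorm(n)

#LAD elastic net: quantile loss with tau = 0.5 is the LAD (median) loss;
# #alpha = 0.5 gives an even lasso/ridge balance
cv.fit <- cv.hqreg(X , y , method = "quantile" , tau = 0.5 , alpha = 0.5 ,
                   nfolds = 10)

#coefficients at the cross-validation-selected lambda
# #(assumed default behavior of the coef method)
coef(cv.fit)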
7.3 Limitations and Future Directions

The current research opens up multiple avenues for future research and improvement. One interesting finding provides a good starting point: at higher dimensions, performance generally improved for all methods, especially regarding FPR. One potential explanation for this phenomenon is as follows: given a truly sparse underlying process, the growth of the p:n ratio outpaces any increases in false positives resulting from additional non-predictors. Simulating model performance across more fine-grained p:n ratios would provide one possible test of this hypothesis. Further interest could lie in whether sample size influences this boundary; in other words, does dimensionality impact performance more quickly or more slowly when sample size is smaller or larger? This line of research would also benefit from exploring the impacts of outliers, non-normality, and other features on the effects of this dimensionality boundary.

The current study also purposefully limited its scope to outliers and non-normality at the expense of collinearity. This feature is especially relevant given that the elastic net was developed specifically to address selection and exclusion of grouped variables by the lasso. In the current simulations, the lasso form of an adaptation often outperformed its elastic net counterpart on most or all metrics and tended to be slightly more stable. It is likely that, when considering data with collinearity, this relationship would reverse. The combined impacts of outliers, non-normality, and collinearity on model performance merit further investigation. The real-data analyses attempted to provide a preliminary look at the impacts of collinearity and predictor intercorrelation on selection capacity. However, the limitations and potentially idiosyncratic nature of those perturbations and their effects have been noted. Furthermore, small VIFs arose in both the original dataset and the perturbed sets, suggesting a lack of collinearity as the particular issue present in the follow-up analyses.

The implementation of leverage point contamination 10 did not achieve the desired goals originally planned for this course of study. Due to contamination of predictors before the generation of responses, these leverage points were likely conducive, rather than detrimental, to variable selection capabilities. This limitation merits central focus in future research continuing this work to ensure proper understanding of how methods handle detrimental outlier presence.

10 Aka, contamination by outliers in the predictors.

The particular processes used to generate outliers, non-normality, and predictor correlations also deserve consideration in future research. Previous chapters have discussed that the values used for the g-and-h distribution might not be sufficiently extreme to address concerns of non-normality. Similar logic is applicable to the limit of 20% outlier contamination produced. Further levels of contamination should be studied, both in terms of increased granularity of contamination levels and larger percentages of contamination.

A normality-adjacent distribution was used to produce outliers and non-normality. Studying both outliers and non-normality using alternative distributional forms would greatly improve empirical understanding of the robust capabilities of lasso and elastic net adaptations. Even keeping focus within contamination by an "extreme" normal distribution, future research can expand on the outlier-generating process. The current study generated response outliers from a normal distribution with a mean of 2 and standard deviation of 5; predictor outlier contamination arose by increasing the population mean to 10. Contamination with normal distributions with far different population parameters would further increase understanding of robust variable selection methods.

One limitation in software implementation likely meant that all methods underperformed relative to their robust selection potential. Implementations of most of the 11 methods studied limited hyperparameter selection to cross-validation guided by mean squared error. Lambert-Lacroix and Zwald (2011) provide preliminary evidence that other procedures, such as using the AIC or BIC to determine optimal tuning hyperparameters, improve robust variable selection performance in lasso and elastic net adaptations. Their study found that using BIC instead of MSE-driven cross-validation reduced FPR by 10%-20% or more in simulations using the adaptive lasso, adaptive LAD lasso, and adaptive Huber lasso. The performance observed herein therefore likely underestimates the true variable selection capabilities of these methods. Future efforts to incorporate alternative hyperparameter selection methods into lasso and elastic net software implementations are necessary.
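A simple version of BIC-guided tuning can be sketched directly from a glmnet solution path, using the common approximation that the degrees of freedom of a lasso fit equal the number of nonzero coefficients. The sketch below is illustrative only: the simulated data are an assumption, the squared-error BIC form is used with a standard (non-robust) lasso rather than a robust loss, and it ignores the refinements used by Lambert-Lacroix and Zwald (2011).

#load package library
library(glmnet)

#illustrative simulated data (assumption)
set.seed(1)
n <- 100 ; p <- 30
X <- matrix(rnorm(n * p) , nrow = n)
y <- X[ , 1] - X[ , 2] + rnorm(n)

#fit the full lasso path
fit <- glmnet(X , y , alpha = 1)

#BIC for each lambda on the path:
# #n * log(RSS / n) + log(n) * df, with df approximated by the
# #number of nonzero coefficients at that lambda
preds <- predict(fit , newx = X)
rss <- colSums((y - preds)^2)
bic <- n * log(rss / n) + log(n) * fit$df

#lambda minimizing BIC and its corresponding coefficients
lambda.bic <- fit$lambda[which.min(bic)]
coef(fit , s = lambda.bic)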
Many lines of continuing research lie in decisions that needed to be made throughout the current research process. Most choices required throughout these analyses relied upon little-studied recommendations or norms lacking any empirical justification. Future study should consider the following decisions:
• The range of tuning hyperparameter values for cross-validation in regularization.
• The estimator used for the initial coefficient estimates in adaptive weighting.
• The range of weighting tuning hyperparameter values for cross-validation.
• For outlier-shifting: the range of possible values for the shifting tuning hyperparameter (distinct from the adaptive tuning hyperparameter).
Covering the wide range of values possible for all of these decisions and more is far outside the scope of a single study. These characteristics, along with the directions suggested previously, should all be pursued further to provide the most holistic view possible on the robust qualifications of the lasso and elastic net adaptations.

Among the hyperparameter choices intentionally restricted herein but necessitating further study, one of the most prominent lies within the MS adaptive elastic net. The original study by Xiao and Xu (2017) does not provide detailed information on the number of steps k to use for the multi-step process, and thus an arbitrary choice had to be made. The results of these simulations suggest the method, at 10 steps, is overly conservative. When presented with high-dimensional data conditions, it is very likely, if not certain, to produce a null model. A reasonable suggestion is that 10 steps is too many given high dimensionality, and that fewer might produce better results. Preliminary simulations on a small number of high-dimensional iterations with k = 3 11 and k = 5 suggest that null models are still produced at an impractical rate. Given the lack of recommendations in the proposing study, the null findings herein, and the practical behavior of the MS adaptive elastic net when producing non-null models, exploring this hyperparameter further is a necessary step for moving the current program forward.

11 Aka, the first k for which the procedure is truly multi-step.

The current simulations and the extant literature limit analyses to non-hierarchical data structures and modeling of linear relationships. Given the prevalence of hierarchical data and nonlinear relationships in behavioral and other applied contexts, further evaluation of lasso and elastic net adaptations and their selection capabilities under these conditions is necessary.

7.3.1 Other Adaptations

The adaptations included herein are also far from the only adaptations proposed for the lasso or elastic net. Most were excluded due to a lack of accessible software implementation. Other adaptations worth studying include, but are not limited to:
• Adaptive formulations of outlier shifting
• Outlier shifting in the elastic net
• The MM Lasso (Smucler and Yohai, 2017) and the Penalized Elastic Net S-Estimator (PENSE; Freue et al., 2019)
• The multi-tuning hyperparameter elastic net (Liu et al., 2018)
• The penalized weighted LAD lasso (Jiang et al., 2020)
• The adaptive BerHu (reversed Huber)-penalized elastic net (Lambert-Lacroix and Zwald, 2011)
The field of lasso and elastic net adaptations is likely to continue to expand in the coming years, and continued study of these methods for their robust characteristics will prove useful.

7.4 Conclusions

Over a variety of simulated data conditions and a real-data scenario, this dissertation evaluated the performance of many developments in the lasso and elastic net. This program provides an initial framework for further studying the statistical properties and behavior of these variable selection tools in practical contexts. Furthermore, the current study makes initial recommendations of lasso and elastic net tools for practical researchers, including a combined method for further statistical and potential applied study. Although this research serves only as a first step, it provides a wide network of related ideas to help expand practical and robust variable selection considerations. Although all methods showed performance stability in some respect, the adaptive Hubers and adaptive LADs appeared the most consistent throughout the research and are therefore recommended to a broader applied audience. The more adventurous or technically-skilled researcher might consider a combined approach of these four with the MS adaptive elastic net, given the various findings of this research.
189 References Alfons,A.,Croux,C.,&Gelper,S.(2013).Sparseleasttrimmedsquaresregressionforanalyzing high-dimensional large data sets. The Annals of Applied Statistics, 7(1), 226–248. Arslan, O. (2012). Weighted lad-lasso method for robust parameter estimation and variable selection in regression. Computational Statistics and Data Analysis, 56, 1952–1965. Association,A.P.(1994).Diagnostic and statistical manual of mental disorders (4th).American Psychiatric Association. Baer, R., Smith, G., Hopkins, J., Krietemeyer, J., & Toney, L. (2006). Using self-report assess- ment methods to explore faets of mindfulness. Assessment, 13, 27–45. Basu, S., & DasGupta, A. (1995). Robustness of standard confidence intervals for location parameters under departures from normality. Annals of Statistics, 23, 1433–1442. Benjamini,Y.(1983).Isthettestreallyconservativewhentheparentdistributionislong-tailed? Journal of the American Statistical Association, 78, 645–654. Bessel, F. (1818). Fundamenta astronomiae pro anno mdcclv deducta ex observationibus viri incomparabilis james bradley in specula astronomica grenovicensi per annos 1750-1762 institutis. K¨ onigsburg: Friedrich Nicolovius. Bohlmeijer, E., ten Klooser, P., Fledderus, M., Veehof, M., & Baer, R. (2011). Psychometric properties of the five facet mindfulness questionnaire in depressed adults and develop- ment of a short form. Assessment, 18(3), 308–320. Bond, F., Hayes, S., Baer, R., Carpenter, K., Guenole, N., Orcutt, H., Waltz, T., & Zettle, R. (2011). Preliminary psychometric properties of the acceptance and action questionnaire - ii: A revised measure of psychological inflexibility and experiential avoidance. Behavior Therapy, 42, 676–688. Calvo,V.,D’Aquila,C.,Rocco,D.,&Carraro,E.(2020).Attachmentandwell-being:Mediatory roles of mindfulness, psychological inflexibility, and resilience. Current Psychology. Eddelbuettel,D.,&Sanderson,C.(2014).Rcpparmadillo:Acceleratingrwithhigh-performance c++ linear algebra. Computational Statistics and Data Analysis, 71, 1054–1063. Fan,J.,Fan,Y.,&Barut,E.(2014).Adaptierobustvariableselection. The Annals of Statistics, 42(1), 324–351. Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihod and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360. First, M., & Gibbon, M. (2004). The structured clinical interview for dsm-iv axis i disorders (scid-i) and the structured clinical interview for dsm-iv axis ii disorders (scid-ii). In M. Hilsenroth & D. Segal (Eds.), Comprehensive handbook of psychological assessment, vol. 2 personality assessment (pp. 134–143). Wiley. 190 Fledderus, M., Voshaar, M. O., ten Klooster, P., & Bohlmeijer, E. (2012). Further evaluation of the psychometric properties of the acceptance and action questionnaire-ii. Psychological Assessment, 24(4), 925–936. Freue, G., Kepplinger, D., Salibi´ an-Barrera, M., & Smucler, E. (2019). Robust elastic new estimators for variable selection and identification of proteomic biomarkers. Annals of Applied Statistics, 13(4), 2065–2090. Friedman, J., Hastie, T., Tibshirani, R., Simon, N., Narasimhan, B., & Qian, J. (2019). Glmnet: Lasso and elastic-net regularized generalized linear models. Gillivray, H. (1992). Shape properties of the g-and-h and johnson families. Communications in Statistics - Theory and Methods, 21(5), 1233–1250. Hayes, S., Strosahl, K., & Wilson, K. (2012). Acceptance and commitment therapy: The process and practice of mindful change (2nd). The Guilford Press. Hill, R., & Dixon, W. (1982). 
Robustness in real life: A study of clinical laboratory data. Biometrics, 72, 377–396. Huber, P. (1981). Robust statistics. New York, New York: Wiley. Huber, P. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1), 73–101. Huber, P. (1973). Robust regression: Asymptotics. conjectures, and monte carlo. Annals of Statistics, 1, 799–821. jiang, Y., Wang, Y., Zhang, J., Xie, B., Liao, J., & Liao, W. (2020). Outlier detection and robustvariableselectionviathepenalizedweightedlad-lassomethod.Journal of Applied Statistics. Jung, Y., Lee, S., & Hu, J. (2016). Robust regression for highly corrupted response by shifting outliers. Statistical Modelling, 16(1), 1–23. Kepplinger, D., Salibian-Barrera, M., Freue, G., & Cho, D. (2020). Pense: Penalized elastic net s/mm-estimator of regression. Kertz, S., Bigda-Peyton, J., Rosmarin, D., & Bjorgvinsson, T. (2012). The importance of worry across diagnostic presentations: Prevalence, severity and associated symptoms in a par- tial hospital setting. Journal of Anxiety Disorders, 26, 126–133. Koenker, R., & Bassett, G. (1978). Regression quantiles. Econometrica, 46, 33–50. Kurnaz, F., Ho↵ mann, I., & Filzmoser, P. (2018). Robust and sparse estimation methods for high-dimensionallinearandlogisticregression.Chemometrics and Intelligent Laboratory Systems, 172, 211–222. Lambert-Lacroix, S., & Zwald, L. (2011). Robust regression through the huber’s criterion and adaptive lasso penalty. Electronic Journal of Statistics, 5, 1015–1053. Lambert-Lacroix, S., & Zwald, L. (2016). The adaptive berhu penalty in robust regression. Journal of Nonparametric Statistics, 28(3), 487–514. Li, B., Zhang, Y., & Tang, N. (2020). Robust variable selection and estimation in threshold regression model. Acta Mathematicae Applicatae Sinica, English Series, 36(2), 332–346. Linehan, M. (2015). Dbt skills training manual (2nd). The Guilford Press. 191 Liu, J., Liang, G., Siegmund, K., & Lewinger, J. (2018). Data integration by multi-tuning parameter elastic net regression. Bioinformatics. Machkour, J.,Bastian,A.,Muma,M., &Zoubir,A.(2017). Theoutlier-corrected-data-adaptive lasso: A new robust estimator for the independent contamination model. 2017 25th European Signal Processing Conference, 1649–1653. Machkour, J., Muma, M., Alt, B., & Zoubir, A. (2020). A robust adaptive lasso estimator for the independent contamination model. Signal Processing, 174. Maronna, R. (2011). Robust ridge regression for high-dimensional data. Technometrics, 53(1), 44–53. Meinshausen, N., & Buhlman, P. (2004). Technical report. University of Wisconsin-Madison, Dept. of Statistics. Micceri,T.(1989).Theunicorn,thenormalcurve,andotherimprobablecreatures.Psychological Bulletin, 105, 156–166. Mutch, V. A., Evans, S., & Wyka, K. (2021). The role of acceptance in mood improvement during mindfulness-based stress reduction. Journal of Clinical Psychology, 77, 7–19. Newcomb, S. (1886). A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics, 8, 343–366. Park, H. (2017). Outlier-resistant high-dimensional regression modelling based on distribution- freeoutlierdetectionandtuningparameterselection.Journal of Statistical Computation and Simulation, 87(9), 1799–1812. Pedersen, W., Miller, L., Putcha-Bhagavatula, A., & Yang, Y. (2002). Evolved sex di↵ erences in sexual strategies: The long and short of it. Psychological Science, 13, 157–161. Pfohl, B., Blum, N., & Zimmerman, M. (1997). 
Structured interview for dsm-iv personality: Sidp-iv. American Psychiatric Press. Roemer, L., & Orsillo, S. (2005). An acceptance-based behavior therapy for generalized anxiety disorder. In S. Orsillo & L. Roemer (Eds.), Series in anxiety and related disorders. ac- ceptance and mindfulness-based approaches to anxiety: Conceptualization and treatment (pp. 213–240). Springer. Rosset,S.,&Zhu,J.(2007).Piecewiselinearregularizedsolutionpaths.TheAnnalsofStatistics, 35(3), 1012–1030. Rousseeuw, P., & Leroy, A. (1987). Robust regression & outlier detection. New York, New York: Wiley. Rousseeuw, P., & Yohai, V. (1984). Robust regression by means of s-estimators. Lecture notes in statistics: Vol. 26. nonlinear time series analysis (pp. 256–272). Springer. Ruiz,F.(2014).Therelationshipbetweenlowlevelsofmindfulnessskillsandpathologicalworry: The mediating role of psychological inflexibility. Anales de Psicolog´ ıa, 30(3), 887–897. Segal,Z.,Williams,J.,&Teasdale,J.(2013).Mindfulness-based cognitive therapy for depression (1st). The Guilford Press. Smucler, E., & Yohai, V. (2017). Robust and sparse estimators for linear regression models. Computational Statistics and Data Analysis, 111, 116–130. 192 Susanti, Y., Pratiwi, H., Sulistijowati, S., & Liana, T. (2014). M estimation, s estimation, and mm estimation in robust regression. International Journal of Pure and Applied Mathematics, 91(3), 349–360. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society - Series B, 58, 347–355. Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., & Tibshirani, R. (2012). Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Sta- tistical Society - Series B, 74, 245–266. Turkmen, A., & Ozturk, O. (2016). Generalised rank regression estimator with standard error adjusted lasso. Australian and New Zealand Journal of Statistics, 58(1), 121–135. Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with s (Fourth) [ISBN 0- 387-95457-0]. Springer. http://www.stats.ox.ac.uk/pub/MASS4 Wang, H., Li, G., & Jiang, G. (2007). Robust regression shrinkage and consistent variable selection through the lad-lasso. Journal of Business & Economic Statistics, 25(3), 347– 355. Wang,X.,Jiang,Y.,Huang,M.,&Zhang,H.(2013).Robustvariableselectionwithexponential squared loss. Journal of the American Statistical Association, 108, 632–643. Wei, T., & Simko, V. (2021). Corrplot: Visualization of a correlation matrix (0.88). Westfall, P., & Young, S. (1993). Resampling based multiple testing. New York: Wiley. White, R., Gumley, A., McTaggart, J., Rattrie, L., McConville, D., & Cleare, S. (2013). Depres- sion and anxiety following psychosis: Associations with mindfulness and psychological flexibility. Behavioural and Cognitive Psychotherapy, 41, 34–51. Wilcox, R. (1990). Comparing the means of two independent groups. Biometrical Journal, 32, 771–780. Wilcox, R. (2016). Introduction to robust estimation and hypothesis testing (4th). Academic Press. Wilcox,R.,Erceg-Hurn,D.,Clark,F.,&Carlson,M.(2013).Comparingtwoindependentgroups via the lower and upper quantiles. Journal of Statistical Computation and Simulation, 84(7), 1543–1551. Wu, X., & Zhou, X. (2019). On hodges’ supere ciency and merits of oracle property in model selection. Annals of the Institute of Statistical Mathematics, 71, 1093–1119. Wu, Y., & Liu, Y. (2009). Variable selection in quantile regression. Statistica Sinica, 19, 801– 817. Xiao, N., & Xu, Q. (2017). 
Multi-step adaptive elastic-net: Reducing false positive in high- dimensionalvariableselection.JournalofStatisticalComputationandSimulation,26(3), 3755–3765. Xiao, N., & Xu, Q. (2019). Msaenet: Mutli-step adaptive estimation for sparse regressions. Yi, C. (2017). Hqreg: Regularization paths for lasso or elastic-net penalized huber loss regression and quantile regression. 193 Bibliography Yi, C., & Huang, J. (2017). Semismooth newton coordinate descent algorithm for elastic-net penalized huber loss regression and quantile regression. Journal of Computational and Graphical Statistics, 26(3), 547–557. Yohai, V. (1987). High breakdown-point and high e ciency robust estimates for regression. Annals of Statistics, 15(2), 642–656. Zheng, Q., Gallagher, C., & Kulasekera, K. (2016). Robust adaptive lasso for variable selection. Communications in Statistics - Theory and Methods, 46(9), 4642–4659. Zimmerman, M., Chelminski, I., McGlinchey, J., & Posternak, M. (2008). A clinically useful depression outcome scale. Comprehensive Psychiatry, 49, 131–140. Zimmerman, M., Chelminski, I., Young, D., & Dalrymple, K. (2010). A clinically useful anxious outcome scale. Journal of Clinical Psychiatry, 71(5), 534–542. Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 91, 258–266. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 467, 301–320. Zou,H.,&Ying,M.(2008).Compositequantileregressionandtheoraclemodelselectiontheory. The Annals of Statistics, 36(3), 1108–1126. Zou,H.,&Zhang,H.(2009).Ontheadaptiveelastic-netwithadivergingnumberofparameters. The Annals of Statistics, 37(4), 1733–1751. 194 Appendix A Bonus Mathematical Formulations This Appendix is devoted to the mathematical formulations relevant to the robustification of the lasso and elastic net, particularly for each of the adaptations reviewed in Chapter 2 of this dissertation. Theoretical considerations are left to the original proposal study, which is reference in footnotes for each technique. Notation is explained where appropriate and a brief verbal description given. Where elastic net formulations are applicable, their formulations will be included. A.1 The Oracle Property 1 Suppose n-independent and identically-distributed (hereafter ”i.i.d.”) random variables V = {(x 1 ,Y 1 ),...,(x i ,Y i )fori=1,...,n,eachwithdensityf(V, )satisfyingthefollowingconditions: 1. VariablesV i are i.i.d. with density f(V, ) with respect to some measure mu. 2. f(V, ) is identifiable. 3. f(V, ) has common support. 4. dlogf(V, ) df( j ) and d 2 logf(V, ) df( j ) satisfy the following: E [·]=0, (A.1) for j=1,...,p and I jk ( )= d 2 logf(V, ) df( j )df( k ) , (A.2) for all j6=k. 5. I( )=E ⇢ d 2 logf(V, ) df( ) d 2 logf(V, ) df( ) 0 (A.3) is finite and positive definite at = 0 . 6. Let⌦ represent the parameter space for . The true parameters 0 are contained within subset ! such that f(V, ) is third-moment di↵ erentiable for nearly allV. 1 Per Fan and Li (2001) 195 Appendix: Bonus Mathematical Formulations 7. there exist some functions G ijk for all 2 ! for which d 3 logf(V, ) df( j )df( k )df( l ) G jkl (V), (A.4) for all j6=k6=l, and g jkl =E 0 [G jkl (V)]<1 . (A.5) For max{|p 00 q (| j0 |)| : j0 6=0}! 0, for some penalty function p q (|·|), there exists an estimator ˆ of the penalized likelihood function of for which k ˆ k =O q (n 1/2 +a n ) (A.6) where a n =max{|p 0 q (| j0 |)| : j0 6=0} (A.7) Furthermore, if q ! 
0, and lim n!1 p n q !1 , the local maximum estimator ˆ must meet the following conditions: 1. For all j = 0, ˆ j =0 2. For all k 6= 0, p n(I k ( k0 ,0)+⌃ ){ ˆ k k0 +(I k ( k0 ,0)+⌃ ) 1 b} Distr ! N{0,I k ( k0 ,0)}, (A.8) where I k ( k0 ,0) is the Fisher information with knowledge of j = 0. A.2 Exponential Squared Loss Function Parameter Selection Procedure The procedure outlined in X. Wang et al. (2013) for selecting model parameter n is as follows. 1. Initialize preliminary estimate for ˆ n , ˜ n . The authors suggest using the MM-estimator, although any number of robust regression estimators can be used. 2. Calculate r i ( ˜ )=Y i x i 0 ˜ , (A.9) for i=1,...,n and S n =1.4826⇥ median i r i ( ˜ )median j (r j ( ˜ )) (A.10) 3. Determine the outlier observations, D m = {(x i ,Y i ): r i ( ˜ ) 2.5S n }, (A.11) where m=#{1 i n : r i ( ˜ ) 2.5S n } (A.12) 196 Appendix: Bonus Mathematical Formulations 4. Update the tuning parameter n , the minimizer of det( ˜ V( ), in G = { : ⇣ ( )2 (0,1]}, (A.13) where ⇣ ( )= 2m n + 2 n n X i=m+1 (1exp{r i ( ˜ ))}, (A.14) det(·) is the determinant, ˜ V( )= { ˜ I 1 ( ˜ )} 1 ˜ ⌃ 2 { ˜ I 1 ( ˜ )} 1 , (A.15) ˜ I 1 ( ˜ n )= 2 ⇢ 1 n n X i=1 exp{r 2 i ( ˜ )/ } ✓ 2r 2 i ( ˜ ) 1 ◆✓ 1 n n X i=1 x i x 0 i ◆ , (A.16) and ˜ ⌃ =cov ⇢ exp(r 2 1 ( ˜ )/ 2r 1 ( ˜ ) x 1 ,...,r 2 n ( ˜ )/ 2r n ( ˜ ) x n . (A.17) 5. Set ˆ = ˜ Still with me? Let’s put these steps into words. 1. First, establish initial coe cient estimates, similar to how would be done for the adaptive lasso tuning parameter and using the estimator ˆ of choice (although X. Wang et al. (2013) recommend an MM-estimator. 2. Calculatetheresidualsforthisinitialestimator, thendetermineamedianmeasureofscale analogous to the standard error. 3. Determine the set of outlying observations based on this median measure of scale. 4. Linear algebra 2 5. Set coe cient weights for the adaptive tuning parameter using the initial estimator and our newly-determined n . 2 Aka magic. 197 Appendix B R Code B.1 GitHub Respository All code used in the process of this research can be found at https://github.com/multach87/ Dissertation. 198 Appendix: R Code B.2 Type I error and Power Data Simulations #Load package libraries library(psych) library(magrittr) library(stats) library(purrr) library(ggplot2) library(dplyr) library(gridExtra) #Load function for generating data from the g-and-h distribution ghdist <- function(n,g=0,h=0){ # # generate n observations from a g-and-h dist. 
# x<-rnorm(n) if (g>0){ ghdist<-(exp(g*x)-1)*exp(h*x^2/2)/g } if(g==0)ghdist<-x*exp(h*x^2/2) ghdist } #Generate "population" data from each distribution normal.pop <- rnorm(10000) heavy.pop <- c(rnorm(9000) , rnorm(1000 , sd = 15)) outlier.pop <- c(rnorm(9000) , rnorm(1000 , mean = 10)) g0h0.pop <- ghdist(n = 10000 , g = 0 , h = 0) g2h0.pop <- ghdist(n = 10000 , g = 0.2 , h = 0) g0h2.pop <- ghdist(n = 10000 , g = 0 , h = 0.2) g2h2.pop <- ghdist(n = 10000 , g = 0.2 , h = 0.2) #combine separate population vectors into a single list # #for convenience with subsequent simulation functions full.pop <- list(normal = list(distribution = "normal" , population = normal.pop) , heavy = list(distribution = "heavy" , population = heavy.pop) , outlier = list(distribution = "outlier" , population = outlier.pop) , g0h0 = list(distribution = "g0h0" , population = g0h0.pop) , g2h0 = list(distribution = "g2h0" , population = g2h0.pop) , g0h2 = list(distribution = "g0h2" , population = g0h2.pop) , 199 Appendix: R Code g2h2 = list(distribution = "g2h2" , population = g2h2.pop)) #generate dataframe with conditions for simulations sim.conds <- as.numeric(rep(seq(from = 10 , to = 500 , by = 10) , 7)) %>% #sample sizes #number of samples - fixed at 5000 cbind(as.numeric(rep(5000 , 7*50))) %>% #delta for power - fixed at 1 cbind(as.numeric(rep(1 , 7*50))) %>% #iteration tracker cbind(1:(7*50)) %>% #make into dataframe data.frame %>% #distribution label cbind(c(rep("normal" , 50) , rep("heavy" , 50) , rep("outlier" , 50) , rep("g0h0" , 50) , rep("g2h0" , 50) , rep("g0h2" , 50) , rep("g2h2" , 50))) %>% #label columns setNames(c("sam_size" , "num_sam" , "delta" , "tracker.i" , "data")) #function for generating t-statistics t_stat <- function(sam_size, num_sam , delta , tracker.i , data) { #extract name of current distribution for storage and later comparison distribution <- names(full.pop)[which(names(full.pop) %in% data)] #print information to console for tracking progress of t-statistic simulation cat("distribution = " , distribution , " , sample size = " , sam_size , " , i = " , tracker.i , "\n") #establish current population from full list using distribution label data <- full.pop[[distribution]] #initialize blank matrix for simulated samples mat <- matrix(nrow = sam_size , ncol = num_sam) 200 Appendix: R Code #fill each column with a sample of specified size for(i in 1:ncol(mat)) { mat[ , i] <- sample(data[["population"]] , sam_size , replace = F) } #generate population mean for given distribution mu0 <- mean(data[["population"]]) #generate alternative-hypothesis mean # #for given delta and given population mean muA <- mu0 + delta #generate sample means xbar <- colMeans(mat) #generate sample standard deviations s <- apply(mat , 2 , sd) #generate sample t-scores t <- (xbar - mu0) / (s / sqrt(sam_size)) #generate corresponding p-values for testing against # #true population mean p.vals0 <- mat %>% data.frame() %>% map(t.test , mu = mu0) %>% map_dbl("p.value") %>% as.numeric() #generate corresponding p-values for testing against # #alternative hypothesis with delta = 1 p.valsA <- mat %>% data.frame() %>% map(t.test , mu = muA) %>% map_dbl("p.value") %>% as.numeric() #store all data to dataframe data <- data.frame(distribution = distribution , sam_size = sam_size , num_sam = num_sam , 201 Appendix: R Code mu0 = mu0 , muA = muA , delta = delta , xbar = xbar , s = s , t = t , p.vals0 = p.vals0 , p.valsA = p.valsA) #save dataframe to permanent object return(data) } #map t-statistic function over all simulation 
conditions simulated.data <- sim.conds %>% pmap(t_stat) #function for generating true probability coverage, type I, power tcov <- function(data , conf.level = .95) { #print information to console for tracking progress of simulation cat("distribution = " , levels(data[ , "distribution"])[1] , " , sample size = " , data[1 , "sam_size"] , "\n") #set degrees of freedom for theoretical t-distribution: n-1 df <- (data[1 , "sam_size"] - 1) #set upper quantile for theoretical CI conf.hi <- 1 - (1 - conf.level) / 2 #set lower quantile for theoretical CI conf.lo <- 1 - conf.hi #sort simulated t-statistics t_statistics <- sort(data[ , "t"]) #store current null-hypothesis p-vals # #to object for convenience p.vals0 <- data[ , "p.vals0"] #store current alt-hypothesis p-vals # #to object for convenience p.valsA <- data[ , "p.valsA"] 202 Appendix: R Code #store delta # #to object for convenience delta <- data[1 , "delta"] #store arguments for "power.t.test" # #to object for convenience power.args <- list(delta = delta , sd = data[ , "s"] , sig.level = rep(conf.level , times = data[1 , "num_sam"]) , n = rep(data[1 , "sam_size"] , times = data[1 , "num_sam"])) #calculate nominal power level for each sample to detect # #delta nominal.power <- power.args %>% pmap(power.t.test) %>% map_dbl("power") #calculate actual power level for given data condition overall # #aka the rate of rejections to the total number of samples power <- sum(p.valsA < 0.05) / length(p.valsA) #calculate the actual probability coverage of the 95% CI # #in given data condition actual_prob <- length(which(t_statistics >= qt(conf.lo , df = df) & t_statistics <= qt(conf.hi,df = df))) / length(t_statistics) #calculate actual Type I error rate for given data condition + model # #aka the rate of rejections to the total number of samples typeI <- sum(p.vals0 < 0.05) / length(p.vals0) #store simulation data to dataframe sim.data <- data.frame(distribution = data[1 , "distribution"] , samplesize = data[1 , "sam_size"] , TypeI = typeI , nominalpower = mean(nominal.power) , actualpower = power) #save simulation to permanent object return(sim.data) 203 Appendix: R Code } #generate type I error rates and power for all simulated samples perf.data <- simulated.data %>% map_dfr(tcov) #plot sim results: Type I error typeI.plot <- ggplot(data = perf.data) + geom_smooth(mapping = aes(x = samplesize, y = TypeI, color = distribution) , se = FALSE) typeI.plot + coord_cartesian(ylim = c(0 , 0.20)) + xlab("Sample Size") + ylab("Type I") + ggtitle("Type I Error Rate") + theme(plot.title = element_text(hjust = 0.5)) + scale_color_discrete(name = "Distribution", breaks = c("normal" , "heavy" , "outlier" , "g0h0" , "g2h0" , "g0h2" , "g2h2"), labels = c("Standard\nNormal", "Heavy-Tailed\nNormal", "Outlier-Contaminated\nNormal" , "g-and-h(0,0)\naka Standard Normal" , "g-and-h(0.2,0)" , "g-and-h(0,0.2)" , "g-and-h(0.2,0.2)")) #plot sim results: Power # #generate and store plot objects nominalpower.plot <- ggplot(data = perf.data) + geom_smooth(mapping = aes(x = samplesize, y = nominalpower , color = distribution) , se = FALSE) nominalpower.plot + coord_cartesian(ylim = c(0 , 1.0)) actualpower.plot <- ggplot(data = perf.data) + geom_smooth(mapping = aes(x = samplesize, y = actualpower , color = distribution) , se = FALSE) actualpower.plot + coord_cartesian(ylim = c(0 , 1.0)) # #generate legend for combined newlegend.plot <- nominalpower.plot + scale_color_discrete(name = "Distribution", breaks = c("normal" , "heavy" , "outlier" , "g0h0" , "g2h0" , "g0h2" , "g2h2"), 
labels = c("Standard\nNormal", "Heavy-Tailed\nNormal", 204 Appendix: R Code "Outlier-Contaminated\nNormal" , "g-and-h(0,0)\naka Standard Normal" , "g-and-h(0.2,0)" , "g-and-h(0,0.2)" , "g-and-h(0.2,0.2)")) #function which stores legend from plot into object for subsequent use g_legend<-function(a.gplot){ tmp <- ggplot_gtable(ggplot_build(a.gplot)) leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box") legend <- tmp$grobs[[leg]] return(legend)} mylegend <- g_legend(newlegend.plot) #generate simultaneous plot of nominal + actual power + stored legend grid.arrange(arrangeGrob((nominalpower.plot + coord_cartesian(ylim = c(0 , 1.0)) + xlab("Sample Size") + ylab("Power") + theme(legend.position = "none") + ggtitle("Nominal Power") + theme(plot.title = element_text(hjust = 0.5))) , (actualpower.plot + coord_cartesian(ylim = c(0 , 1.0)) + xlab("Sample Size") + ylab("Power") + theme(legend.position = "none") + ggtitle("Actual Power") + theme(plot.title = element_text(hjust = 0.5))) , nrow = 1) , mylegend , ncol = 2 , widths = c(200 , 40)) 205 Appendix: R Code B.3 Code to produce Cars93 regression coe cients #load MASS package library(MASS) #load Cars93 dataset from MASS package data("Cars93") #generate OLS model, with Engine Size, Horsepower, # #Car Length, and Car Width predicting Highway MPG # #and store model to object titled "fit" fit <- lm(data = Cars93 , MPG.highway ~ EngineSize + Horsepower + Length + Width) #Print summary of OLS model summary(fit) 206 Appendix: R Code B.4 Code for Generating Simulation Data The following code was used to generate all training set data for all data scenarios. The testing set was similarly generated using the same seeds, except n was increased by 50%, the data- generating function only stored the last third of the data (rounded up), and the resulting data was stored in a separate file. #load package libraries library(mvtnorm) library(magrittr) library(purrr) library(dplyr) #g-and-h distribution function ghdist<-function(n,g=0,h=0){ # # generate n observations from a g-and-h dist. 
# x<-rnorm(n) if (g>0){ ghdist<-(exp(g*x)-1)*exp(h*x^2/2)/g } if(g==0)ghdist<-x*exp(h*x^2/2) ghdist } #generate singular 108 data conditions for low dimensionality data # #initialize empty dataframe sim.structure1 <- as.data.frame(matrix(ncol = 6 , nrow = 96)) # #fill columns with data characteristic values { colnames(sim.structure1) <- c("n" , "p" , "eta.x" , "eta.y" , "g" , "h") sim.structure1[ , "eta.x"] <- c(rep(c(0.0 , 0.1 , 0.2) , 24) , rep(0 , 24)) sim.structure1[ , "eta.y"] <- c(rep(c(rep(0.0 , 3) , rep(0.1 , 3) , rep(0.2 , 3)) , 8) , rep(0 , 24)) sim.structure1[ , "p"] <- c(rep(c(rep(8 , 9) , rep(30 , 9)) , 4) , rep(c(rep(8 , 3) , rep(30 , 3)) , 4)) sim.structure1[ , "n"] <- c(rep(25 , 18) , rep(50 , 18) , rep(100 , 18) , 207 Appendix: R Code rep(200 , 18) , rep(25 , 6) , rep(50 , 6) , rep(100 , 6) , rep(200 , 6)) sim.structure1[ , "g"] <- c(rep(0 , 72) , rep(c(0.2 , 0.0 , 0.2) , 8)) sim.structure1[ , "h"] <- c(rep(0 , 72) , rep(c(0.0 , 0.2 , 0.2) , 8)) } #generate repeated conditions dataframe for low dimensionality data # #initialize empty dataframe sim.structure.repped1 <- as.data.frame(matrix(ncol = 11 , nrow = (96*500))) colnames(sim.structure.repped1) <- c("n" , "p" , "eta.x" , "eta.y" , "g" , "h" , "seed.1" , "seed.2" , "seed.3" , "seed.4" , "seed.5") # #repeat each data condition in singular dataframe 500 times in new dataframe for(i in 1:nrow(sim.structure1)) { sim.structure.repped1[ ((500*(i - 1)) + 1): (500*i), (1:6)] <- purrr::map_dfr(seq_len(500) , ~sim.structure1[i , ]) } #generate singular HD data conditions # #initialize empty dataframe sim.structure2 <- as.data.frame(matrix(ncol = 6 , nrow = 16)) # #fill columns with data characteristic values { colnames(sim.structure2) <- c("n" , "p" , "eta.x" , "eta.y" , "g" , "h") sim.structure2[ , "n"] <- c(rep(200 , 16)) sim.structure2[ , "p"] <- c(190 , 200 , 210 , 500 , rep(1000 , 12)) sim.structure2[ , "eta.x"] <- c(rep(0 , 4) , rep(c(0.0 , 0.1 , 0.2) , 3) , rep(0 , 3)) sim.structure2[ , "eta.y"] <- c(rep(0 , 4) , rep(0 , 3) , rep(0.1 , 3) , rep(0.2 , 3) , rep(0 , 3)) sim.structure2[ , "g"] <- c(rep(0 , 13) , 0.2 , 0.0 , 0.2) sim.structure2[ , "h"] <- c(rep(0 , 13) , 0.0 , 0.2 , 0.2) } #generate repeated conditions dataframe # #initialize empty dataframe sim.structure.repped2 <- as.data.frame(matrix(ncol = 11 , nrow = (16*500))) colnames(sim.structure.repped2) <- c("n" , "p" , "eta.x" , "eta.y" , "g" , "h" , "seed.1" , "seed.2" , "seed.3" , "seed.4" , 208 Appendix: R Code "seed.5") # #repeat each data condition in singular dataframe 500 times in new dataframe for(i in 1:nrow(sim.structure2)) { sim.structure.repped2[ ((500*(i - 1)) + 1): (500*i), (1:6)] <- purrr::map_dfr(seq_len(500) , ~sim.structure2[i , ]) } #combine two repeated dataframes of data conditions sim.structure.repped <- bind_rows(sim.structure.repped1 , sim.structure.repped2) #generate seeds for random processes seeds <- rnorm(112*500*5) # #fill seed columns in dataframe of repeated conditions sim.structure.repped[ , 7:11] <- seeds #data-generating function for single iteration of single condition data.gen <- function(n , p , eta.x , eta.y , g , h , seed.1) { #Create dataframe of current data conditions conditions <- data.frame(n = n , p = p , eta.x = eta.x , eta.y = eta.y , g = g , h = h , seed = seed.1) #Create dataframe of seeds seeds <- data.frame(seed.1 = seed.1 , seed.2 = seed.2 , seed.3 = seed.3 , seed.4 = seed.4 , seed.5 = seed.5) #create vector of true coefficient values betas <- matrix(0 , nrow = p , ncol = 1) betas[1,1] <- 0.5 betas[2,1] <- 1.0 
betas[3,1] <- 1.5 betas[4,1] <- 2.0 #set seed for data-generating process seed <- seed.1 #generate Identity covariance matrix # #initialize matrix of all 0’s covar.X <- matrix(rep(0 , p^2) , ncol = p) # #Generate 1’s along diagonal diag(covar.X) <- 1 209 Appendix: R Code #generate predictor values # #generate uncontaminated predictor values X.UC <- rmvnorm(floor((1 - eta.x)*n) , mean = rep(0 , p) , sigma = covar.X) # #Generate full set of predictor values # # #and residuals for outlier sets if(((g == 0) & (h == 0))){ # # #Generate contaminated values for eta.x > 0 # # # #and combine, otherwise make X from X.UC if(eta.x > 0) { X.C <- rmvnorm(ceiling(eta.x*n) , mean <- rep(10 , p) , sigma = covar.X) X <- rbind(X.UC , X.C) } else { X.C <- 0 X <- X.UC } # # #Generate uncontaminated residuals err.UC <- rnorm(floor((1-eta.y)*n) , mean = 0 , sd = 1) # # #Generate contaminated residuals for eta.y > 0 # # # #and combine, otherwise make err from err.UC if(eta.y > 0) { err.C <- rnorm(ceiling(eta.y*n) , mean = 2 , sd = 5) err <- c(err.UC , err.C) } else { err.C <- 0 err <- err.UC } # #Generate full set of predictor values # # #and residuals for distributional set } else if(((g != 0) | (h != 0))) { X <- X.UC err <- ghdist(n = n , g = g , h = h) } #Generate response values from X and residuals Y <- X %*% betas[ , 1] + err #Combine all data into a list for storage and analyses 210 Appendix: R Code combine <- list(conditions = conditions , seeds = seeds , betas = betas , Y=Y, X=X, err = err) #save the complete list of data return(combine) } #map data-generating function # #over all iterations of all data conditions data.full <- sim.structure.repped %>% pmap(data.gen) #save data to computer saveRDS(data.full , "FILE/PATH.RData") 211 Appendix: R Code B.5 Example Model-Application Code with Adaptive Lasso Tuning Parameter The following is an example of code used to apply adaptations which incorporate the adaptive lasso tuning parameter. This example is taken specifically from the code that generates the models for the LAD lasso adaptation. 
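Before the full listing, it may help to see the core adaptive-weighting step in isolation. The following is only a minimal sketch, not the simulation code itself: the toy data, the single fixed value of nu, and the object names are illustrative assumptions, whereas the full function below cross-validates nu over a grid and adds the bookkeeping needed for the simulations.

library(glmnet)
library(hqreg)

set.seed(1)
#toy data with the same four nonzero coefficients used in the simulations
X <- matrix(rnorm(100 * 8) , ncol = 8)
Y <- X %*% c(0.5 , 1.0 , 1.5 , 2.0 , rep(0 , 4)) + rnorm(100)

lambda.try <- exp(seq(log(1400) , log(0.01) , length.out = 100))

#initial ridge fit supplies the coefficients used for the adaptive weights
ridge.cv <- cv.glmnet(x = X , y = Y , lambda = lambda.try , alpha = 0)
ridge.coefs <- predict(ridge.cv , type = "coefficients" ,
                       s = ridge.cv$lambda.min)[-1]

#adaptive penalty factors for a single illustrative nu
#(the full code cross-validates nu as well)
nu <- 1
w <- 1 / abs(ridge.coefs)^nu

#weighted LAD (median-regression) lasso via hqreg
lad.cv <- cv.hqreg(X = X , y = Y , method = "quantile" , tau = 0.5 ,
                   lambda = lambda.try , alpha = 1 ,
                   preprocess = "standardize" , screen = "SR" ,
                   penalty.factor = w , FUN = "hqreg" ,
                   type.measure = "mse")
#coefficients at the CV-minimizing lambda, intercept removed
lad.coefs <- lad.cv$fit$beta[-1 , which.min(lad.cv$cve)]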
#load package packages library(hqreg) library(glmnet) library(magrittr) library(purrr) #load data simulation.data <- readRDS("FILE/PATH.RData") SNCDLAD.lasso.sim.fnct <- function(data) { #create vector to track progress of simulation tracker <- as.vector(unlist(data$conditions)) #print progress tracker cat("n = " , tracker[1] , " , p = " , tracker[2] , " , eta.x = " , tracker[3] , " , eta.y = " , tracker[4] , " , g = " , tracker[5] , " , h = " , tracker[6] , ";\n") #load X, Y, n, and p X <- data$X Y <- data$Y n <- length(Y) p <- data$conditions$p #Sequence of lambda values for cross-validated selection lambda.try <- seq(log(1400) , log(0.01) , length.out = 100) lambda.try <- exp(lambda.try) #Initial Ridge estimates for adaptive tuning parameter weighting ridge.model <- cv.glmnet(x = X , y = Y , lambda = lambda.try , alpha = 0) # #Optimal lambda for ridge estimates lambda.ridge.opt <- ridge.model$lambda.min 212 Appendix: R Code # #Initial ridge estimates for weighting, intercept removed best.ridge.coefs <- predict(ridge.model , type = "coefficients" , s = lambda.ridge.opt)[-1] #Sequence of nu/gamma values to try nu.try <- exp(seq(log(0.01) , log(10) , length.out = 100)) #Initialize full list of LAD elnet results from each nu/gamma LADlasso.nu.cv.full <- list() #Initialize objects of metrics and minimizing results # #for each nu/gamma LADlasso.nu.cv.lambda <- numeric() LADlasso.nu.cv.mse <- numeric() LADlasso.nu.cv.msesd <- numeric() LADlasso.nu.cv.coefs <- list() #Loop over nu/gamma values for CV, # #storing minimizing lambda within each nu/gamma for(i in 1:length(nu.try)) { #Generate LAD lasso model, hiding console output invisible(capture.output(LADlasso.nu.cv.full[[i]] <- cv.hqreg(X = X , y = Y , method = "quantile" , tau = 0.5 , lambda = lambda.try , alpha = 1 , preprocess = "standardize" , screen = "SR" , penalty.factor = 1 / abs(best.ridge.coefs)^nu.try[i] , FUN = "hqreg" , type.measure = "mse"))) #Store results to nu/gamma cv objects LADlasso.nu.cv.mse[i] <- min(LADlasso.nu.cv.full[[i]]$cve) LADlasso.nu.cv.msesd[i] <- LADlasso.nu.cv.full[[i]]$cvse[which.min(LADlasso.nu.cv.full[[i]]$cve)] LADlasso.nu.cv.lambda[i] <- LADlasso.nu.cv.full[[i]]$lambda.min LADlasso.nu.cv.coefs[[i]] <- LADlasso.nu.cv.full[[i]]$fit$beta[-1 , which.min(LADlasso.nu.cv.full[[i]]$cve)] } #specify minimizing nu value and resulting model info 213 Appendix: R Code nu.opt <- nu.try[which.min(LADlasso.nu.cv.mse)] #select optimizing lambda for optimizing nu/gamma lambda.opt <- LADlasso.nu.cv.lambda[which.min(LADlasso.nu.cv.mse)] #store optimizing lasso weights weights.opt <- 1 / abs(best.ridge.coefs)^nu.opt #store resulting coefficients from optimized weights coefs.opt <- LADlasso.nu.cv.coefs[[which.min(LADlasso.nu.cv.mse)]] #Store minimized training mse and SE LADlasso.mse.min <- min(LADlasso.nu.cv.mse) LADlasso.mse.min.se <- LADlasso.nu.cv.msesd[which.min(LADlasso.nu.cv.mse)] #save results return(list(important = data.frame( cbind(n = tracker[1] , p = tracker[2] , eta.x = tracker[3] , eta.y = tracker[4] , g = tracker[5] , h = tracker[6] , data.seed = tracker[7] , alpha = 1 , lambda = lambda.opt , nu = nu.opt , mpe = LADlasso.mse.min , mpe.sd = LADlasso.mse.min.se , fpr = length(which(coefs.opt[c(5:p)] != 0)) / length(coefs.opt[c(5:p)]) , fnr = length(which(coefs.opt[c(1:4)] == 0)) / length(coefs.opt[1:4]) ) ) ) ) } #map across all simulated datasets LADlasso.models <- simulation.data %>% map(safely(SNCDLAD.lasso.sim.fnct)) 214 Appendix: R Code saveRDS(LADlasso.models , "FILE/PATH.RData") 215 Appendix: R Code B.6 
ExampleModel-ApplicationCodewithoutAdaptiveLassoTuningParam- eter The following is an example of code used to apply adaptations which do not incorporate the adaptivelassotuningparameter. Thisexampleistakenspecificallyfromthecodethatgenerates the models for the standard elastic net. #load package libraries library(glmnet) library(magrittr) library(purrr) #load data simulation.data <- readRDS("FILE/PATH.RData") elnet5.sim.fnct <- function(data) { #create vector to track progress of simulation tracker <- as.vector(unlist(data$conditions)) #print progress tracker cat("n = " , tracker[1] , " , p = " , tracker[2] , " , eta.x = " , tracker[3] , " , eta.y = " , tracker[4] , " , g = " , tracker[5] , " , h = " , tracker[6] , ";\n") #load X, Y, n, and p X <- data$X Y <- data$Y n <- length(Y) p <- data$conditions$p #Sequence of lambda values for cross-validated selection lambda.try <- seq(log(1400) , log(0.01) , length.out = 100) lambda.try <- exp(lambda.try) #run elastic net model elnet5.model <- cv.glmnet(X , Y , family = "gaussian" , lambda = lambda.try , alpha = 0.5) 216 Appendix: R Code #etermine minimized lambda.elnet5.opt <- elnet5.model$lambda.min #save lambda-optimized coefficients without intercept elnet5.coefs <- predict(elnet5.model , type = "coefficients" , s = lambda.elnet5.opt)[-1] return(list(model = list(full.model = elnet5.model , lambda = lambda.elnet5.opt , coefs = elnet5.coefs) , metrics = list(mpe = elnet5.model$cvm[which(elnet5.model$lambda == lambda.elnet5.opt)] , mpe.sd = elnet5.model$cvsd[which(elnet5.model$lambda == lambda.elnet5.opt)] , fpr = length(which(elnet5.coefs[c(5:p)] != 0)) / length(elnet5.coefs[c(5:p)]) , fnr = length(which(elnet5.coefs[c(1:4)] == 0)) / length(elnet5.coefs[1:4])) , important = list(coefs = elnet5.coefs , info = data.frame(cbind(n = tracker[1] , p = tracker[2] , eta.x = tracker[3] , eta.y = tracker[4] , g = tracker[5] , h = tracker[6] , data.seed = tracker[7] , alpha = 0.5 , lambda = lambda.elnet5.opt , mpe = elnet5.model$cvm[which(elnet5.model$lambda == lambda.elnet5.opt)] , mpe.sd = elnet5.model$cvsd[which(elnet5.model$lambda == lambda.elnet5.opt)] , fpr = length(which(elnet5.coefs[c(5:p)] != 0)) / length(elnet5.coefs[c(5:p)]) , fnr = length(which(elnet5.coefs[c(1:4)] == 0)) / length(elnet5.coefs[1:4]) ) ) 217 Appendix: R Code ) ) ) } #map simulation function across all simulated datasets elnet5.models <- simulation.data %>% map(safely(elnet5.sim.fnct)) saveRDS(elnet5.models , "FILE/PATH.RData") 218 Appendix: R Code B.7 Model-Application Code: Outlier-Shifted Lasso The following was used to simulate models for the outlier-shifted lasso proposed by Jung et al. (2016). The custom code for the model itself is within the model-application function, and is adapted from code generously provided by one of the original study authors. 
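Before the full function, a minimal sketch of the outlier-shifting iteration it implements may be useful. This is not the simulation code itself: the toy data and the single fixed value of lambda.gamma are illustrative assumptions, whereas the function below selects lambda.gamma by cross-validation and records the additional diagnostics used in the simulations.

library(glmnet)

set.seed(1)
#toy data with a few gross response outliers
X <- matrix(rnorm(100 * 8) , ncol = 8)
Y <- X %*% c(0.5 , 1.0 , 1.5 , 2.0 , rep(0 , 4)) + rnorm(100)
Y[1:5] <- Y[1:5] + 10

#initial lasso fit and a robust (MAD) scale estimate of its residuals
cv.init <- cv.glmnet(X , Y , family = "gaussian")
lambda.opt <- cv.init$lambda.min
model.init <- glmnet(X , Y , family = "gaussian" , lambda = lambda.opt)
sigma.est <- mad(Y - predict(model.init , X , s = lambda.opt))

#iterate: flag large residuals, shift them out of Y, refit at fixed lambda
lambda.gamma <- 3 #illustrative threshold; the full code chooses it by CV
beta.pre <- as.numeric(model.init$beta)
gamma.est <- rep(0 , length(Y))
tol <- 100 ; n.iter <- 0
while(tol > 1e-6 & n.iter < 100) {
  resid.cur <- Y - X %*% beta.pre
  shift <- which(abs(resid.cur) >= sigma.est * lambda.gamma)
  gamma.est[shift] <- resid.cur[shift]
  model.new <- glmnet(X , Y - gamma.est , family = "gaussian" ,
                      lambda = lambda.opt)
  beta.post <- as.numeric(model.new$beta)
  tol <- sum((beta.pre - beta.post)^2)
  n.iter <- n.iter + 1
  beta.pre <- beta.post
}
beta.post #coefficients after outlier shifting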
#libraries library(glmnet) library(purrr) library(magrittr) #load data simulation.data <- readRDS("FILE/PATH.RData") #KFold subsetter function kfold_subsetter <- function(data , k , seed = 7 , list = FALSE , random = TRUE) { if(length(dim(data)) == 2) { ###For 2D data #determine number of larger subsets (when unequal subsets) nsams.large <- nrow(data) %% k #determine number of smaller subsets (total number when equal subsets) nsams.small <- k - nsams.large #determine sample size of larger subsets (when unequal subsets) samsize.large <- ceiling(nrow(data) / k) * (nsams.large != 0) #determine sample size of smaller subsets # #(all subset size when equal subsets) samsize.small <- floor(nrow(data) / k) #indicator for which subset subset.indicator <- c(rep((1 : k) , floor(nrow(data) / k)) , rep((1 : (nsams.large) ) , (1 * (nsams.large != 0)) )) #fix random assignment process if(seed) { set.seed(seed) 219 Appendix: R Code } #combine subset indicator with original data if(random) { newdata <- cbind(data , subset = sample(subset.indicator)) } else { newdata <- cbind(data , subset = sort(subset.indicator)) } if(list) { newdata <- return(split(newdata[ , -ncol(newdata)] , f = newdata[ , ncol(newdata)])) } else { newdata <- return(newdata) } } else if (length(dim(data)) == 0){ #for 1D data #determine number of larger subsets (when unequal subsets) nsams.large <- length(data) %% k #determine number of smaller subsets (total number when equal subsets) nsams.small <- k - nsams.large #determine sample size of larger subsets (when unequal subsets) samsize.large <- ceiling(length(data) / k) * (nsams.large != 0) #determine sample size of smaller subsets # #(all subset size when equal subsets) samsize.small <- floor(length(data) / k) #indicator for which subset subset.indicator <- c(rep((1 : k) , floor(length(data) / k)) , rep((1 : (nsams.large) ) , (1 * (nsams.large != 0)) )) #fix random assignment process if(seed) { set.seed(seed) } #combine subset indicator with original data #create split list if desired 220 Appendix: R Code newdata <- matrix(cbind(data , subset = sample(subset.indicator)) , ncol = 2) if(list) { newdata <- return(split(newdata[ , -ncol(newdata)] , f = newdata[ , ncol(newdata)])) } else { newdata <- return(newdata) } } } OSlassoPLUS.sim.fnct<- function(data){ #create vector to track progress of simulation tracker <- as.vector(unlist(data$conditions)) #print progress tracker cat("n = " , tracker[1] , " , p = " , tracker[2] , " , eta.x = " , tracker[3] , " , eta.y = " , tracker[4] , " , g = " , tracker[5] , " , h = " , tracker[6] , ";\n") #load X, Y, n, and p X <- data$X Y <- data$Y n <- length(Y) p <- data$conditions$p Y.orgn<- Y p <- data$conditions$p #set sequence of parameter values for tuning parameter lambda # #and outlier-shifting gamma values lambda.lasso.try <- seq(log(0.01) , log(1400) , length.out = 100) lambda.lasso.try <- exp(lambda.lasso.try) lambda.gamma.try <- exp(seq(log(1) , log(1400) , length.out = 100)) #determine initial lasso coefs model.for.cv<- cv.glmnet(X, Y, family="gaussian", lambda=lambda.lasso.try) 221 Appendix: R Code lambda.lasso.opt<- model.for.cv$lambda.min model.est<- glmnet(X,Y,family="gaussian", lambda=lambda.lasso.opt) fit.lasso<- predict(model.est,X,s=lambda.lasso.opt) res.lasso<- Y - fit.lasso sigma.est<- mad(Y-fit.lasso) beta.est<- as.numeric(model.est$beta) gamma.est<-rep(0,n) #initialize subset indices K <- 5 X.new <- kfold_subsetter(X , k = K , random = FALSE) Y.new <- cbind(Y , X.new[ , "subset"]) n.cv <- n/ K 
CV.error2<-CV.error<-rep(NA,length(lambda.gamma.try)) Y.pred.cv<-matrix(NA,nrow=length(Y), ncol=length(lambda.gamma.try)) for (tt in 1:length(lambda.gamma.try)) { gamma.est.cv<-rep(0,n-n.cv) for (jj in 1:K) { subset <- unique(X.new[ , "subset"])[jj] sample.out.index <- which(X.new[ , "subset"] == jj) if(FALSE %in% (which(X.new[ , "subset"] == jj) == which(Y.new[ , 2] == jj))) { stop("X and Y subsets do not match") } ##return error if the x and y subset indices do not match X.train<- X.new[X.new[ , "subset"] != subset , -ncol(X.new)] Y.train<- Y.new[Y.new[ , 2] != subset , 1] X.test<- X.new[X.new[ , "subset"] == subset , -ncol(X.new)] model.train.temp<- glmnet(X.train, Y.train,family="gaussian", lambda=lambda.lasso.opt) beta.pre<-beta.post<- as.numeric(model.train.temp$beta) tol<-100; n.iter <- 0 while(tol>1e-6 & n.iter<100) { 222 Appendix: R Code resid.temp<- Y.train-X.train%*%beta.pre nonzero<-which(abs(resid.temp)>= sigma.est*lambda.gamma.try[tt]) gamma.est.cv[nonzero]<- resid.temp[nonzero] Y.train.new <- Y.train - gamma.est.cv model.train.temp<- glmnet(X.train, Y.train.new, family="gaussian", lambda=lambda.lasso.opt) beta.post <- as.numeric(model.train.temp$beta) tol<- sum((beta.pre-beta.post)^2) n.iter<- n.iter+1 beta.pre<-beta.post } Y.pred.cv[sample.out.index,tt] <-X.test%*%beta.post } CV.error2[tt]<- mean((Y.pred.cv[,tt]-Y.orgn)^2) CV.error[tt]<- mean(abs(Y.pred.cv[,tt]-Y.orgn)) } #store optimized outlier-shifted parameter gamma lambda.gamma.opt <- lambda.gamma.try[which.min(CV.error2)] #now generate final model with optimized lambda and gamma model.opt<- glmnet(X,Y.orgn,family="gaussian", lambda=lambda.lasso.opt) beta.pre<- beta.post<- as.numeric(model.opt$beta) tol<-100; n.iter <- 0 while(tol>1e-6 & n.iter<100) { resid.opt<- Y.orgn-X%*%beta.pre nonzero<-which(abs(resid.opt)>= sigma.est*lambda.gamma.opt) gamma.est[nonzero]<- resid.opt[nonzero] Y.new2 <- Y.orgn - gamma.est model.opt<- glmnet(X,Y.new2,family="gaussian", lambda=lambda.lasso.opt) 223 Appendix: R Code beta.post <- as.numeric(model.opt$beta) tol<- mean((beta.pre-beta.post)^2) n.iter<- n.iter+1 beta.pre<-beta.post } Y.fit<- X%*%beta.post #store number of nonzero coefs st.lad <- sum(beta.post != 0) #save results to list return(list(model = list(coefficient = beta.post , fit = Y.fit , iter = n.iter , sigma.est = sigma.est , mpe = mse.OS , mpe.sd = sd.mse.OS , n.outlier = length(which(gamma.est != 0)) , gamma.est = gamma.est , lambda.lasso.opt = lambda.lasso.opt , lambda.gamma.opt = lambda.gamma.opt) , important = list(diagnostics = data.frame(cbind(data.seed = tracker[7] , model.seed.lasso = seed.lasso) ), coefs = beta.post , info = data.frame(cbind(n = tracker[1] , p = tracker[2] , eta.x = tracker[3] , eta.y = tracker[4] , g = tracker[5] , h = tracker[6] , data.seed = tracker[7] , lambda.lasso = lambda.lasso.opt , lambda.gamma = lambda.gamma.opt , n.outlier = length(which(gamma.est != 0)) , 224 Appendix: R Code mpe = mse.OS , mpe.sd = sd.mse.OS , fpr = length(which(beta.post[c(5:p)] != 0)) / length(beta.post[c(5:p)]) , fnr = length(which(beta.post[c(1:4)] == 0)) / length(beta.post[1:4])) ) ) ) ) } #map model generation across all simulated data OSlassoPLUS <- simulation.data %>% map(safely(OSlassoPLUS.sim.fnct)) #save file to computer saveRDS(OSlassoPLUS , "FILE/PATH.RData") 225 Appendix: R Code B.8 Model-Application Code: Outlier-Shifted Huberized Lasso The following was used to simulate models for the outlier-shifted Huber-loss lasso proposed by Jung et al. (2016). 
The custom ”winzorized” custom function was taken directly from code provided by one of the original study authors, while the ”HforOSH2” is adapted from original they provided. Additionally, the outlier-shifting portion of the model is within the model- application function, and is also adapted from code generously provided by one of the original study authors. #load package libraries library(glmnet) library(purrr) library(magrittr) #load data simulation.data <- readRDS("FILE/PATH.RData") #winsorized function winsorized<- function(x,a=1.5,sigma=1) { s<-sigma newx<-x indp<-x>(a*s) newx[indp]<-(a*s) indn<- x<(a*-s) newx[indn]<- (-a*s) return(newx)} #Huber lasso function HforOSH2 <- function(X,Y,lambda.lasso.try){ #load n n <- length(Y) #set seed for generating initial lasso coefficients for weighting #seed.lasso <- data$seeds[ , "seed.10"] #set.seed(seed.lasso) #set possible lambda and gamma values #lambda.lasso.try <- seq(log(0.01) , log(1400) , length.out = 100) #lambda.lasso.try <- exp(lambda.lasso.try) 226 Appendix: R Code #initial lasso model model.for.cv<- cv.glmnet(X, Y, family="gaussian",lambda=lambda.lasso.try) lambda.lasso.opt<- model.for.cv$lambda.min model.est<- glmnet(X,Y,family="gaussian",lambda=lambda.lasso.opt) fit.lasso<- predict(model.est,X,s=lambda.lasso.opt) res.lasso<- Y-fit.lasso sigma.init<- mad(Y-fit.lasso) beta.pre<- as.numeric(model.est$beta) Y.old<- Y tol = 10 n.iter <- 0 while(tol>1e-4 & n.iter<100) { Y.new<- fit.lasso + winsorized(res.lasso, a=1.5, sigma=sigma.init) model.for.cv<- cv.glmnet(X, Y.new, family="gaussian", lambda=lambda.lasso.try) model.est<- glmnet(X, Y.new, family="gaussian", lambda=model.for.cv$lambda.min ) fit.lasso<- predict(model.est,X,s=model.for.cv$lambda.min) res.lasso<- Y.new-fit.lasso beta.post <- as.numeric(model.est$beta) tol<- sum((beta.pre-beta.post)^2) n.iter<- n.iter+1 beta.pre<- beta.post } sigma.est<- mean(Y.new-(X%*%beta.post)^2) Y.fit<- X%*%beta.post Y.res<- Y.new - Y.fit object<- list(coefficient = beta.post, fit=Y.fit, iter = n.iter, sigma.est = sigma.est, 227 Appendix: R Code lambda.lasso.opt = model.est$lambda, residual = Y.res) } OSHlasso.sim.fnct <- function(data){ #create vector to track progress of simulation tracker <- as.vector(unlist(data$conditions)) #print progress tracker cat("n = " , tracker[1] , " , p = " , tracker[2] , " , eta.x = " , tracker[3] , " , eta.y = " , tracker[4] , " , g = " , tracker[5] , " , h = " , tracker[6] , ";\n") #load X, Y, n, and p X <- data$X Y <- data$Y n <- length(Y) p <- data$conditions$p #set sequence of parameter values for tuning parameter lambda # #and outlier-shifting tuning parameter lambda_gamma lambda.lasso.try <- seq(log(0.01) , log(1400) , length.out = 100) lambda.lasso.try <- exp(lambda.lasso.try) #initial lasso model model.for.cv<- cv.glmnet(X, Y, family="gaussian",lambda=lambda.lasso.try) lambda.lasso.opt<- model.for.cv$lambda.min model.est <- HforOSH2(data) fit.lasso <- model.est$fit res.lasso<- Y - fit.lasso sigma.est<- mad(Y-fit.lasso) beta.pre<- as.numeric(model.est$coefficient) Y.old<- Y tol = 10 n.iter <- 0 gamma.est<-rep(0,n) Y.old<- Y 228 Appendix: R Code tol = 10 n.iter <- 0 outliers.init<- abs(scale(Y-model.est$residual)) n.outlier<- length(which(as.vector(outliers.init)>2.5)) lambda.gamma<- sigma.est*qnorm((2*n-n.outlier)/(2*n)) while(tol>1e-4 & n.iter<100) { nonzero<-which(abs(res.lasso)>=lambda.gamma) gamma.est[nonzero]<- res.lasso[nonzero] Y.new<- Y.old - gamma.est model.est<- HforOSH2(X,Y.new,lambda.lasso.try) beta.post<- model.est$coefficient res.lasso<- 
model.est$residual tol<- sum((beta.pre-beta.post)^2) n.iter<- n.iter+1 beta.pre<- beta.post } sigma.est<- mean(Y.new- (X%*%beta.post)^2) Y.fit<- X%*%beta.post Y.res<- Y.new - Y.fit #store number of nonzero coefs st.lad <- sum(beta.post != 0) #store optimized lambda lambda.lasso.opt <- model.est$lambda return(list(model = list(coefficient = beta.post , fit = Y.fit , iter = n.iter , sigma.est = sigma.est , gamma.est = gamma.est , lambda.opt = lambda.lasso.opt) , important = list(diagnostics = data.frame(cbind(data.seed = tracker[7] , model.seed.lasso = seed.lasso) ), coefs = beta.post , 229 Appendix: R Code info = data.frame(cbind(n = tracker[1] , p = tracker[2] , eta.x = tracker[3] , eta.y = tracker[4] , g = tracker[5] , h = tracker[6] , data.seed = tracker[7] , lambda.lasso = lambda.lasso.opt , mpe = mse.OS , mpe.sd = sd.mse.OS , fpr = length(which(beta.post[c(5:p)] != 0)) / length(beta.post[c(5:p)]) , fnr = length(which(beta.post[c(1:4)] == 0)) / length(beta.post[1:4])) ) ) ) ) } OSHuberLasso <- simulation.data %>% map(safely(OSHlasso.sim.fnct)) saveRDS(OSHuberLasso , "FILE/PATH.RData") 230 Appendix: R Code B.9 ExampleSimulationCode: GeneratingCoe cientPrecisionandTestMSE The following is an example of code used to generate coe cient estimate precision for each estimated model, as well as that model’s test-set MSE. This example is taken specifically from the LAD lasso adaptation. #package libraries library(mvtnorm) library(magrittr) library(purrr) library(hqreg) #load test set test.data <- readRDS("FILE/PATH.RData") #load model data ladlasso.data <- readRDS("FILE/PATH.RData") #initialize list with combined training and testing data combined.data <- list() #combine data for(i in 1:length(ladlasso500.data)) { combined.data[[i]] <- c(ladlasso.data[[i]] , test.data[[i]]) } #clear separated data so memory isn’t taken up rm(list = c("ladlasso500.data" , "test500.data")) msebias.ladlasso <- function(data) { #create vector to track progress of simulation tracker <- as.vector(unlist(data$conditions)) #print progress tracker cat("n = " , tracker[1] , " , p = " , tracker[2] , " , eta.x = " , tracker[3] , " , eta.y = " , tracker[4] , " , g = " , tracker[5] , " , h = " , tracker[6] , ";\n") #generate predicted Y values for test set pred.y <- data$X %*% data$result$full$coefs.opt 231 Appendix: R Code #generate residuals resid <- data$Y - pred.y #generate squared residuals resid.sq <- resid^2 #sum the squared residuals sum.resid.sq <- sum(resid.sq) #calculate test mse by dividing by sample size mse <- sum.resid.sq / data$result$important$n #store coefficient values for true non-zero coefs true.coefs <- c(0.5 , 1.0 , 1.5 , 2.0) #generate coef estimate differences coefs.dif <- data$result$full$coefs.opt[1:4] - true.coefs #square coef differences coefs.dif.sq <- coefs.dif^2 #sum sum.coefs.dif.sq <- sum(coefs.dif.sq) #divide by true number of non-zero coefs coefs.bias <- sum.coefs.dif.sq / 4 #print result to make sure sensible #’s are generated cat("coefs.bias = " , coefs.bias , "\n") #store important values to save later alpha <- 1.0 lambda.lasso <- data$result$important$lambda fpr <- data$result$important$fpr fnr <- data$result$important$fnr nu <- data$result$important$nu #save info to dataset, including NA’s for values that are # #relevant in other models for ease of subsequent combination # #of simulated models from all adaptations 232 Appendix: R Code return(data.frame(cbind(n = conditions[1] , p = conditions[2] , eta.x = conditions[3] , eta.y = conditions[4] , g = conditions[5] , h = conditions[6] , 
data.seed = conditions[7] , alpha = alpha , lambda.lasso = lambda.lasso , lambda.gamma = NA , n.outlier = NA , fpr = fpr , fnr = fnr , mse = mse , coefs.bias = coefs.bias , method = "ladlasso" , nu = nu ) ) ) } #map across full dataset ladlasso.mse.bias <- combined.data %>% map(safely(msebias.ladlasso)) #save resulting file to computer saveRDS(ladlasso.mse.bias , "FILE/PATH.RData") 233 Appendix C Applied Data Measures 234 Name____________________ Date________ DEPRESSION SCALE INSTRUCTIONS This questionnaire includes questions about symptoms of depression. For each item please indicate how well it describes you during the PAST WEEK, INCLUDING TODAY. Circle the number in the columns next to the item that best describes you. RATING GUIDELINES 0=not at all true (0 days) 1=rarely true (1-2 days) 2=sometimes true (3-4 days) 3=often true (5-6 days) 4=almost always true (every day) During the PAST WEEK, INCLUDING TODAY.... 1. I felt sad or depressed ............................................................................................................. 0 1 2 3 4 2. I was not as interested in my usual activities ....................................................................... 0 1 2 3 4 3. My appetite was poor and I didn't feel like eating ............................................................... 0 1 2 3 4 4. My appetite was much greater than usual ............................................................................ 0 1 2 3 4 5. I had difficulty sleeping ........................................................................................................... 0 1 2 3 4 6. I was sleeping too much ......................................................................................................... 0 1 2 3 4 7. I felt very fidgety, making it difficult to sit still...................................................................... 0 1 2 3 4 8. I felt physically slowed down, like my body was stuck in mud .......................................... 0 1 2 3 4 9. My energy level was low ........................................................................................................ 0 1 2 3 4 10. I felt guilty ................................................................................................................................ 0 1 2 3 4 11. I thought I was a failure .......................................................................................................... 0 1 2 3 4 12. I had problems concentrating ................................................................................................ 0 1 2 3 4 13. I had more difficulties making decisions than usual ............................................................ 0 1 2 3 4 14. I wished I was dead ................................................................................................................. 0 1 2 3 4 15. I thought about killing myself ................................................................................................ 0 1 2 3 4 16. I thought that the future looked hopeless ............................................................................. 0 1 2 3 4 17. Overall, how much have symptoms of depression interfered with or caused difficulties in your life during the past week? 0) not at all 1) a little bit 2) a moderate amount 3) quite a bit 4) extremely 18. How would you rate your overall quality of life during the past week? 
0) very good, my life could hardly be better 1) pretty good, most things are going well 2) the good and bad parts are about equal 3) pretty bad, most things are going poorly 4) very bad, my life could hardly be worse Copyright © 2008, Mark Zimmerman, M.D. All rights reserved. Appendix: Applied Measures C.1 Clinically Useful Depression Outcome Scale 1 1 Zimmerman et al. (2008) 235 Name:___________________________ Date:_____________ ANXIETY SCALE INSTRUCTIONS: This scale includes questions about the symptoms of anxiety. For each item please indicate how well it describes you during the PAST WEEK, INCLUDING TODAY. Circle the number in the columns next to the item that best describes you. During the PAST WEEK, INCLUDING TODAY.... 1. I felt nervous or anxious ............................................................................................................0 1 2 3 4 2. I worried a lot that something bad might happen ......................................................................0 1 2 3 4 3. I worried too much about things ................................................................................................0 1 2 3 4 4. I was jumpy and easily startled by noises .................................................................................0 1 2 3 4 5. I felt "keyed up" or "on edge" ....................................................................................................0 1 2 3 4 6. I felt scared................................................................................................................................0 1 2 3 4 7. I had muscle tension or muscle aches ......................................................................................0 1 2 3 4 8. I felt jittery..................................................................................................................................0 1 2 3 4 9. I was short of breath..................................................................................................................0 1 2 3 4 10. My heart was pounding or racing ..............................................................................................0 1 2 3 4 11. I had cold, clammy hands .........................................................................................................0 1 2 3 4 12. I had a dry mouth ......................................................................................................................0 1 2 3 4 13. I was dizzy or lightheaded.........................................................................................................0 1 2 3 4 14. I felt sick to my stomach (nauseated)........................................................................................0 1 2 3 4 15. I had diarrhea ............................................................................................................................0 1 2 3 4 16. I had hot flashes or chills...........................................................................................................0 1 2 3 4 17. I urinated frequently ..................................................................................................................0 1 2 3 4 18. I felt a lump in my throat............................................................................................................0 1 2 3 4 19. I was sweating...........................................................................................................................0 1 2 3 4 20. 
I had tingling feelings in my fingers or feet................................................................................0 1 2 3 4 0=not at all true 1=rarely true 2=sometimes true 3=often true 4=almost always true Appendix: Applied Measures C.2 Clinically Useful Anxiety Outcome Scale 2 2 Zimmerman et al. (2010) 236 5 facet questionnaire: short form (ffmq-sf) Below is a collection of statements about your everyday experience. Using the 1–5 scale below, please indicate, in the box to the right of each statement, how frequently or infrequently you have had each experience in the last month (or other agreed time period). Please answer according to what really reflects your experience rather than what you think your experience should be. never or not often sometimes true often very often very rarely true true sometimes not true true or always true 1 2 3 4 5 1 I’m good at finding the words to describe my feelings DS 2 I can easily put my beliefs, opinions, and expectations into words DS 3 I watch my feelings without getting carried away by them NR 4 I tell myself that I shouldn’t be feeling the way I’m feeling /NJ 5 it’s hard for me to find the words to describe what I’m thinking /DS 6 I pay attention to physical experiences, such as the wind in my hair or sun on my face OB 7 I make judgments about whether my thoughts are good or bad. /NJ 8 I find it difficult to stay focused on what’s happening in the present moment /AA 9 when I have distressing thoughts or images, I don’t let myself be carried away by them NR 10 generally, I pay attention to sounds, such as clocks ticking, birds chirping, or cars passing OB 11 when I feel something in my body, it’s hard for me to find the right words to describe it /DS 12 it seems I am “running on automatic” without much awareness of what I’m doing /AA 13 when I have distressing thoughts or images, I feel calm soon after NR 14 I tell myself I shouldn’t be thinking the way I’m thinking /NJ 15 I notice the smells and aromas of things OB 16 even when I’m feeling terribly upset, I can find a way to put it into words DS 17 I rush through activities without being really attentive to them /AA 18 usually when I have distressing thoughts or images I can just notice them without reacting NR PTO. Appendix: Applied Measures C.3 Five Facet Mindfulness Questionnaire - Short Form, Page 1 3 3 Bohlmeijer et al. (2011) 237 never or not often sometimes true often very often very rarely true true sometimes not true true or always true 1 2 3 4 5 19 I think some of my emotions are bad or inappropriate and I shouldn’t feel them /NJ 20 I notice visual elements in art or nature, such as colors, shapes, textures, or patterns of light and shadow OB 21 when I have distressing thoughts or images, I just notice them and let them go NR 22 I do jobs or tasks automatically without being aware of what I’m doing /AA 23 I find myself doing things without paying attention /AA 24 I disapprove of myself when I have illogical ideas /NJ correct scores for items preceded by a slash (/NJ, /AA, etc) by subtracting from 6 non react = ; observe = ; act aware = ; describe = ; non judge = In the research study where the short form of the FFMQ was developed (see Bohlmeijer et al. below), most of the 376 participants were educated women with “clinically relevant symptoms of depression and anxiety”. 
They were randomized to a nine week clinical intervention involving an Acceptance & Commitment Therapy (ACT) self-help book “Living life to the full”, plus 10 to 15 minutes per day of Mindfulness-Based Stress Reduction meditation exercises, plus some email support. Mean (and Standard Deviation) scores pre- and post- intervention were: non react observe act aware describe non judge pre- mean (sd) 13.47 (3.07) 13.86 (3.21) 13.19 (3.32) 16.28 (3.91) 14.09 (3.63) ~70% 10.4–16.5 10.6–17.0 9.9–16.6 12.4–20.2 10.5–17.7 ~95% 7.3–19.6 7.4–20.3 6.5–19.8 8.5–24.1 6.8–21.3 post- intervention 16.90 15.22 15.98 18.46 18.14 Bohlmeijer, E., P. M. ten Klooster, et al. (2011). "Psychometric properties of the five facet mindfulness questionnaire in depressed adults and development of a short form." Assessment 18(3): 308-320. In recent years, there has been a growing interest in therapies that include the learning of mindfulness skills. The 39-item Five Facet Mindfulness Questionnaire (FFMQ) has been developed as a reliable and valid comprehensive instrument for assessing different aspects of mindfulness in community and student samples. In this study, the psychometric properties of the Dutch FFMQ were assessed in a sample of 376 adults with clinically relevant symptoms of depression and anxiety. Construct validity was examined with confirmatory factor analyses and by relating the FFMQ to measures of psychological symptoms, well-being, experiential avoidance, and the personality factors neuroticism and openness to experience. In addition, a 24-item short form of the FFMQ (FFMQ-SF) was developed and assessed in the same sample and cross-validated in an independent sample of patients with fibromyalgia. Confirmatory factor analyses showed acceptable model fit for a correlated five-factor structure of the FFMQ and good model fit for the structure of the FFMQ-SF. The replicability of the five-factor structure of the FFMQ-SF was confirmed in the fibromyalgia sample. Both instruments proved highly sensitive to change. It is concluded that both the FFMQ and the FFMQ-SF are reliable and valid instruments for use in adults with clinically relevant symptoms of depression and anxiety. Appendix: Applied Measures C.4 Five Facet Mindfulness Questionnaire - Short Form, Page 2 238 My Flexibility Scores The Acceptance and Action Questionnaire (AAQ-2) This is perhaps the most commonly used measure of psychological flexibility that you can find. It has been cited over 2000 times in scientific publications, and we generally know what it’s scores mean. My colleagues and I developed the original AAQ (Hayes et al., 2004) as well as the updated version presented below (Bond et al., 2011). You can use it weekly or biweekly to track how you are doing with applying psychological flexibility skills in your daily life. Don’t worry about trying to get to a “perfect” score. Use this number as a way to keep track of changes in your life over time. As you apply what you learned in A Liberated Mind, your psychological flexibility will improve. ! Steven C. Hayes Resource for A Liberated Mind Appendix: Applied Measures C.5 Acceptance and Action Questionnaire II Bond et al. (2011) 239
Abstract
Statistical tools for variable selection provide the most utility when they reliably identify the true variables underlying a data-generating mechanism. Selecting only these true predictors is likewise important. Applying the logic of robust hypothesis testing to variable selection, an ideal tool provides reliable selection properties across a variety of underlying data conditions. The proliferation of adaptations to two popular machine learning methods for dimension reduction, the lasso and the elastic net, provides a basis from which to inform tool choice for an applied researcher interested in variable selection methods. The current research, therefore, attempts to lay a foundation for studying the robust selection properties of many of these adaptations while making recommendations for practical application with real data. Through numerous simulation studies and an applied example, I demonstrate the distinct limitations of some lasso and elastic net adaptations. On the other hand, the adaptive Huber lasso, the adaptive Huber elastic net, the adaptive Least Absolute Deviation (LAD) lasso, and the newly-studied adaptive LAD elastic net show relatively greater reliability and consistency across simulation conditions, including under heavy outlier contamination and heavy-tailed error distributions, although some selection instability is noted under perturbations made to the intercorrelations among real-data predictors.