University of Southern California

Outlier-Robustness in Adaptations to the Lasso

by Matthew D. Multach

A thesis submitted in partial fulfillment for the degree of Master of Arts in Psychology in the University of Southern California Department of Psychology

Degree Conferral Date: December 2018

Declaration of Authorship

I, Matthew Daniel Multach, declare that this thesis titled, 'Outlier-Robustness in Adaptations to the Lasso', and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University.
• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.
• Where I have consulted the published work of others, this is always clearly attributed.
• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:

Date:

"Write a funny quote here."

University of Southern California

Abstract

Rand Wilcox
Department of Psychology
Master of Science
by Matthew D. Multach

Model selection in regression techniques has risen to the forefront in recent statistical applications. Regularization methods, and the lasso especially, have seen increasing attention in the past decade, with a variety of adaptations developed to enhance the robustness of the model selection capabilities of these techniques. Outliers in particular present difficulty for the standard lasso to perform most effectively, and a number of these adaptations significantly improve performance over the standard lasso. However, many of the recently developed adaptations have not been compared in the context of a single analysis. The current paper provides the first single-paper overview of the most significant robustified lasso methods gaining traction in the literature, specifically with respect to handling contamination by outlying observations. Among the seven adaptations studied, those that used an adaptive weighting scheme for the regularization tuning parameter performed best by a significant margin; however, inconsistent patterns of performance across conditions in other methods merit further consideration. Future work will benefit from increasing the number of iterations per simulation condition to ensure stability of the results.

Acknowledgements

I would first like to thank my doctoral advisor, Dr. Rand R. Wilcox, for his help in my search for a research area worth investigating, and for his continued help with my thesis as it has developed. I would also like to thank the other two members of my thesis committee, Dr. Morteza Dehghani and Dr. Christopher R. Beam, for their support on my project. I would also like to thank the researchers who so generously supported my project by providing code from their own simulations and answering all my questions about their work: Dr. Yoonsuh Jung, Dr. Ezequiel Smucler, and Dr. Qi Zheng. Finally, I must express gratitude to my parents for their unending support throughout my years of study and especially over the last few months as I have been working on this thesis. This would not have been possible without them.
Contents

Declaration of Authorship
Abstract
Acknowledgements
1 Introduction
2 Outlier Robustness and Lasso-Based Adaptations
  2.1 Adaptive Lasso and LAD-Lasso
  2.2 Huberized Lasso
  2.3 Outlier Shifting
    2.3.1 Outlier-Shifted Huberized Lasso
  2.4 MM-Lasso
    2.4.1 Adaptive MM-Lasso
3 Simulations
  3.1 Software Implementation
  3.2 Simulation Conditions
  3.3 Results
A R Code: R Package Libraries
B Custom R Statistical Functions: lsa.linear.txt Source Code
C Custom R Statistical Functions: Huberizing and Outlier-Shifting
D Custom R Statistical Functions: MM-Lasso and Adaptive MM-Lasso
E Custom R Statistical Functions: Adaptive Lasso and LAD-Lasso
F R Simulation Code
G R Code: Generating Metrics
Bibliography

Chapter 1

Introduction

The issue of model selection in regression modelling has seen extensive review since the late 20th century. Exclusion of valid predictors can negatively impact the predictive capabilities of, and induce bias into, the resulting model; on the other hand, inclusion of unnecessary variables reduces model performance via increased variance in coefficient estimates. The importance of techniques that avoid overfitting the provided data, and the consequent loss of utility and true understanding of the data-driving process, also cannot be understated. Conceptually, ensuring the selection of the true parameters ensures fidelity to the process itself; selecting variables that truly define the process of interest allows more faithful extrapolation to external situations, and more faithful recreation of those situations already studied. Selecting incorrect variables, on the other hand, only lengthens the road that must be traveled to obtain full understanding of the process of interest. Given the multi-pronged nature of the selection problem, the ability of a statistical technique to consistently select variables from the true underlying model is of paramount importance.

Although a number of techniques have been developed or adapted to address the variable selection problem, most have demonstrated inadequate performance, including stepwise regression (Derksen and Keselman, 1992; Miller, 1990) and related techniques (Chatterjee and Hadi, 1988; Huberty, 1989; Kuo and Mallick, 1998), techniques relying on $R^2$, and the standard omnibus F test. Other non-robust techniques that have been proposed and deemed unsatisfactory include Mallows' $C_p$ and the ridge regression estimator. More recently, a wide array of potential variable selection solutions have risen to prominence. Some promising examples include the use of cross-validation methods, the 0.632 bootstrap (Efron and Tibshirani, 1993), and the nonnegative garrote estimator developed by Breiman (1995) and robustified by Gijbels and Vrinssen (2015). Still other methods include the lasso (Tibshirani, 1996) and least angle regression (Zhang and Zamar).

The lasso in particular has received considerable attention in the general variable selection literature since its development, and more focused attention in robust variable selection research since the mid-2000s.
Developed in the 1990s, the least absolute shrinkage and selection operator (lasso) (Tibshirani, 1996) has seen reasonable success in model selection applications, in particular with accurate and reliable reduction of large parameter spaces; this makes it suitable for model selection in exploratory situations where the true model of the process is not yet fully understood, or when uncertainty exists regarding the importance of all possible available variables. The lasso makes use of regularization to restrict the sum of the absolute values of the coefficient estimates in a regression model (Hastie et al., 2015), with the consequence of reducing some of these estimates completely to zero. Let $y = (y_1, \ldots, y_N)$ represent the vector of $N$ responses and $X$ the matrix of $p$ predictors across $N$ observations (an $N \times p$ matrix). Expressed in matrix-vector notation,

$$\hat{\beta}^{\text{lasso}} = \underset{\beta \in \mathbb{R}^p}{\arg\min}\; \frac{1}{2N}\,\lVert y - \beta_0 \mathbf{1} - X\beta \rVert_2^2, \quad \text{subject to } \lVert \beta \rVert_1 \le t \tag{1.1}$$

where $\lVert \beta \rVert_1 \le t$ represents the $\ell_1$-norm constraint, $\lVert \cdot \rVert_2$ the Euclidean norm, $t$ a constant chosen to induce variable shrinkage, and the expression to be minimized the standard least-squares minimization problem. The constant $t$ bounds the parameter estimates of the resulting model by limiting the sum of the absolute values of these estimates, thereby constraining the model by shrinking parameter estimates, with some being shrunk completely to zero. This constraint, and its isomorphic equivalent discussed below, is typically chosen using an external, data-driven technique such as cross-validation. The $\ell_1$-norm, alternatively known as the least absolute deviation, minimizes the absolute value of the difference between an observed value and the corresponding value predicted by the model; the Euclidean norm, or $\ell_2$-norm, represents the same difference, except minimizing the square of this difference instead of the absolute value. Alternately expressed in Lagrangian form,

$$\hat{\beta}^{\text{lasso}} = \underset{\beta \in \mathbb{R}^p}{\arg\min}\; \frac{1}{2N}\,\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \tag{1.2}$$

for some $\lambda \ge 0$, where $\lambda$ represents a tuning parameter with a direct relationship to $t$ under the constraint $\lVert \beta \rVert_1 \le t$. For convenience with later adaptations, another formulation of the lasso criterion is as follows (Tibshirani, 1996; Wang et al., 2007):

$$\hat{\beta}^{\text{lasso}} = \underset{\beta \in \mathbb{R}^p}{\arg\min}\; \sum_{i=1}^{n} (y_i - x_i'\beta)^2 + n\lambda \sum_{j=1}^{p} |\beta_j| \tag{1.3}$$

where the first term on the right-hand side of the equation represents the least-squares criterion (i.e., the Euclidean norm from equation 1.1), and the second the $\ell_1$-norm lasso constraint in relation to sample size, $n$.
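For concreteness, the following is a minimal R sketch of fitting the standard lasso of equation (1.2) with the tuning parameter chosen by cross-validation, using the glmnet package that also appears in the appendices; the simulated data and parameter values here are purely illustrative and are not taken from the thesis.

# Minimal sketch: standard lasso of eq. (1.2), lambda chosen by 10-fold CV.
# Illustrative data only; the coefficient pattern loosely mimics Section 3.2.
library(MASS)     # mvrnorm()
library(glmnet)   # glmnet(), cv.glmnet()

set.seed(1)
n <- 100; p <- 8
X    <- mvrnorm(n, mu = rep(0, p), Sigma = diag(p))   # predictors
beta <- c(0.5, 1.0, 1.5, 2.0, rep(0, p - 4))          # true coefficients
y    <- as.vector(X %*% beta + rnorm(n))              # response

cv.fit   <- cv.glmnet(X, y, family = "gaussian", alpha = 1)  # alpha = 1: lasso penalty
beta.hat <- coef(cv.fit, s = "lambda.min")                   # estimates at the CV-optimal lambda
print(beta.hat)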
This study attempts to summarize and compare a number of recent and successful adaptations made to robustify the lasso, specifically within the context of their outlier-robustness. Tibshirani (1996) and others have shown that the lasso performs reliably when handling normally-distributed errors. The general performance of the lasso with respect to outliers, however, is unclear, and many techniques have been proposed to ensure robustness in their presence; consequently, it is important to study how the lasso and its adaptations perform and under what outlier conditions their respective performances vary. Since the mid-2000s, a variety of studies have attempted to improve the lasso's ability to handle outliers both in predictors and outcomes. Most of these methods fall within a particular family of adaptations, and previous studies have compared some of these adaptations largely within the context of these families; to date, however, no study has compared the panoply of adaptations under a unified simulation framework that allows direct comparison of the methods.

Section 2 of this article outlines all of the methods used herein and the underlying rationale behind these methods¹. Section 3 outlines the broad simulation framework used to study these methods and describes the subsequent findings of these simulations. The article concludes in Section 4 with a brief discussion of the simulation findings.

¹ Although key statistical formulations are presented, full theoretical considerations for each method are left to their respective publications. These publications are cited under the heading of each technique's corresponding subsection.

Chapter 2

Outlier Robustness and Lasso-Based Adaptations

Outlier robustness represents a particularly noteworthy issue in the robustness literature. A statistical technique's robustness to outliers helps increase certainty that the model or results derive from the majority of the data, and not from a handful of especially deviant values. Though such points might be relevant for understanding niche processes, a more generalizable model should cater to the most common portions of a data-producing process. The lasso's tuning parameter and model constraints already help to reduce the influence of a few select points on model development, and its subsequent adaptations can potentially further reduce the effects of these outliers.

2.1 Adaptive Lasso¹ and LAD-Lasso²

One of the first proposed modifications of the lasso uses adaptive tuning parameters $\lambda_j$ (Zou, 2006):

$$\hat{\beta}^{\text{Adlasso}} = \underset{\beta \in \mathbb{R}^p}{\arg\min}\; \sum_{i=1}^{n} (y_i - x_i'\beta)^2 + \lambda \sum_{j=1}^{p} \hat{w}_j |\beta_j| \tag{2.1}$$

where $\hat{w} = 1/|\hat{\beta}|^{\gamma}$ is a pre-determined weight vector, for some $\gamma > 0$ and for some root-$n$-consistent estimator $\hat{\beta}$ of $\beta$. The adaptive lasso, or weighted lasso, replaces the single tuning parameter with a weighted combination of feature-specific tuning parameters. This modification is largely made to address problems of inconsistent variable selection that arise from finding the optimal tuning parameter for best prediction accuracy (Fan and Li, 2001; Meinshausen and Bühlmann, 2004).

¹ See Zou (2006) for primary theoretical considerations.
² See Wang et al. (2007) for primary mathematical considerations.

Wang et al. (2007) shortly thereafter proposed one of the earliest attempts at robustifying the lasso for handling outliers in the response variable, using the least absolute deviation (LAD) criterion instead of the least-squares criterion:

$$\hat{\beta}^{\text{LADlasso}} = \underset{\beta \in \mathbb{R}^p}{\arg\min}\; \sum_{i=1}^{n} |y_i - x_i'\beta| + n \sum_{j=1}^{p} \lambda_j |\beta_j| \tag{2.2}$$

Due to the timing of their study, Wang et al. (2007) do not explicitly make use of the findings and techniques from Zou (2006). However, their LAD-lasso technique also uses an adaptive tuning parameter, and their procedure can consequently be reformulated as follows:

$$\hat{\beta}^{\text{LADlasso}} = \underset{\beta \in \mathbb{R}^p}{\arg\min}\; \sum_{i=1}^{n} |y_i - x_i'\beta| + \lambda \sum_{j=1}^{p} \hat{w}_j |\beta_j| \tag{2.3}$$

Although this could technically be considered an "adaptive LAD-lasso," it will henceforth be referred to as the "LAD-lasso," given its use in the literature.
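As a rough illustration of equations (2.1) and (2.3), the R sketch below fits an adaptive lasso by passing the weights $\hat{w}_j$ to glmnet's penalty.factor argument, and a LAD-lasso via rq.fit.lasso() from the quantreg package, echoing the weighted-penalty call in Appendix E. It assumes the X and y objects from the previous sketch. This is not the custom lsa.linear() implementation used for the simulations, and the fixed LAD-lasso penalty level is illustrative (the thesis selects it by BIC over a grid).

library(glmnet)
library(quantreg)

# Adaptive lasso (eq. 2.1): weights w_j = 1/|beta_init_j|^gamma with gamma = 1,
# supplied to glmnet through penalty.factor; initial estimate from ordinary least squares.
beta.init <- coef(lm(y ~ X))[-1]
w.ada     <- 1 / abs(beta.init)
cv.ada    <- cv.glmnet(X, y, family = "gaussian", alpha = 1, penalty.factor = w.ada)
beta.ada  <- coef(cv.ada, s = "lambda.min")

# LAD-lasso (eq. 2.3): L1 (median-regression) loss with a weighted penalty,
# mirroring the rq.fit.lasso() call in Appendix E; weights from an unpenalized
# LAD fit and a single illustrative penalty level of 0.5.
w.lad    <- 1 / abs(rq(y ~ 0 + X)$coefficients)
beta.lad <- rq.fit.lasso(X, y, lambda = 0.5 * w.lad)$coefficients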
Wang et al. (2007), Lambert-Lacroix and Zwald (2011), and Zheng et al. (2016) all demonstrated superior model selection performance of the LAD-lasso over the standard lasso (without adaptive tuning parameter or LAD loss function) when handling outliers in the response. Both Lambert-Lacroix and Zwald (2011) and Zheng et al. (2016) note, however, the LAD-lasso's reduced efficiency with normal, light-tailed error distributions compared to the non-LAD adaptive lasso (Lambert-Lacroix and Zwald, 2011) and another adaptation (the RA-lasso; Zheng et al., 2016) to be considered in future work. Zheng et al. (2016) further showed the LAD-lasso's inferior overall performance compared to the RA-lasso when dealing with outliers. In a previous broad comparison of lasso adaptations and non-OLS regression techniques, Smucler and Yohai (2017) showed that although the LAD-lasso generally performed well in selecting models, two other techniques (the MM-lasso and adaptive MM-lasso) performed better in the presence of high-leverage outliers.

2.2 Huberized Lasso³

Looking towards older work in robust estimation, Rosset and Zhu (2007) and Lambert-Lacroix and Zwald (2011) adapted the lasso by incorporating the Huber loss function (Huber, 1964) rather than the squared-error loss:

$$\hat{\beta}^{\text{Hlasso}} = \underset{\beta \in \mathbb{R}^p}{\arg\min}\; \sum_{i=1}^{n} L(y_i - x_i'\beta) + \lambda \sum_{j=1}^{p} \hat{w}_j |\beta_j| \tag{2.4}$$

where $L(z)$ is the Huber loss function

$$L(z) = \begin{cases} z^2 & |z| \le M, \\ 2M|z| - M^2 & |z| > M. \end{cases} \tag{2.5}$$

³ See Rosset and Zhu (2007) and Lambert-Lacroix and Zwald (2011) for primary mathematical considerations with the lasso, as well as Huber (1964) and Huber (1981) for mathematical considerations with Huberized estimators in general.

The Huber estimator was originally developed to enhance robustness qualities in regression-based methods and measures of central tendency by applying a piecewise loss function instead of the standard squared-error loss (Huber, 1981). Per Huber (1981), the transition parameter $M$ will be fixed to 1.345, which is recommended to balance both robustness and efficiency with normally-distributed data. However, as noted by Zheng et al. (2016), further work on transition parameter selection for the Huberized lasso should be conducted. It is also worth emphasizing that the Huberized lasso developed by Lambert-Lacroix and Zwald (2011) makes use of the adaptive weighted tuning parameter scheme proposed by Zou (2006).

Zheng et al. (2016) found that although the Huberized lasso performed well both with and without response outliers, and comparably to the LAD-lasso, it was generally outperformed by the RA-lasso proposed therein.

2.3 Outlier Shifting⁴

Jung et al. (2016) proposed a general robust regression adaptation that penalizes potentially outlying observations in the data with an additional penalty term. The authors describe the technique as comparable to best subset selection used for "selecting observations instead of variables" (p. 3). Applied to the lasso, the resulting outlier-shifted model procedure is as follows:

$$\hat{\beta}^{\text{OSlasso}} = \underset{\beta \in \mathbb{R}^p}{\arg\min}\; \sum_{i=1}^{n} (y_i - x_i'\beta - \gamma_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j| + \sum_{i=1}^{n} \gamma_i^2\, I\!\left(\lambda_{\gamma} \le |y_i - x_i'\beta|\right) \tag{2.6}$$

where the $\gamma_i$'s are case-specific parameters that shrink towards zero for non-outlying observations in the data, based on the outlier threshold parameter $\lambda_{\gamma}$. These parameters are conceptually analogous to the tuning parameter $\lambda$ in the general lasso, except tailored towards the outlier-ness of observed values in particular cases. $I(\cdot)$ represents the indicator function, valued at 1 when the contained condition is true, and 0 otherwise.

⁴ See Jung et al. (2016) for primary mathematical considerations.

2.3.1 Outlier-Shifted Huberized Lasso

Jung et al. (2016) also proposed a Huberized version of their outlier-shifted lasso⁵:

$$\hat{\beta}^{\text{OSHuberLasso}} = \underset{\beta \in \mathbb{R}^p}{\arg\min}\; \sum_{i=1}^{n} L(y_i - x_i'\beta - \gamma_i) + \lambda \sum_{j=1}^{p} |\beta_j| + \sum_{i=1}^{n} \gamma_i^2\, I\!\left(\lambda_{\gamma} \le |y_i - x_i'\beta|\right) \tag{2.7}$$

⁵ At the time of writing, it is unclear whether or not this procedure uses the adaptive weighting found in Lambert-Lacroix and Zwald (2011); however, the second term on the right-hand side of the formulation suggests it does not.
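The Huber loss of equation (2.5), used in both (2.4) and (2.7), can be made concrete with the short R sketch below, which also shows the residual-winsorizing step that the Huberized-lasso routine in Appendix C iterates. It reuses X, y, and glmnet from the earlier sketches, uses a single illustrative penalty value, and shows only one Huberization step rather than the full iterative H.lasso() procedure.

# Huber loss of eq. (2.5) with transition parameter M = 1.345 (Huber, 1981).
huber.loss <- function(z, M = 1.345) {
  ifelse(abs(z) <= M, z^2, 2 * M * abs(z) - M^2)
}

# Winsorize residuals at +/- M * sigma, as in winsorized() in Appendix C.
winsorize <- function(r, M = 1.345, sigma = 1) {
  pmin(pmax(r, -M * sigma), M * sigma)
}

# A single Huberization step: fit a lasso, clip large residuals, and form a
# pseudo-response; H.lasso() in Appendix C repeats this until the coefficients stabilize.
fit   <- glmnet(X, y, family = "gaussian", lambda = 0.1)   # illustrative lambda
y.hat <- as.vector(predict(fit, X, s = 0.1))
res   <- y - y.hat
mean(huber.loss(res))                                      # average Huber loss of the current fit
y.new <- y.hat + winsorize(res, sigma = mad(res))          # pseudo-response for the next refit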
Jung et al. (2016) assessed the false negative (i.e., non-zero coefficients set to zero) and false positive (i.e., zero coefficients set to a non-zero value) rates of the Huberized lasso, as well as the adaptive lasso, the outlier-shifted (OS) lasso, and the OS-Huberized lasso. All four methods had comparably low false-negative rates; the Huberized lasso, however, tended to have middling performance among the four methods in false positive rates, although all four methods had average false positive rates greater than 44% across conditions. Overall, the outlier-shifted lasso performed comparably to the adaptive lasso, the Huberized lasso, and an outlier-shifted Huberized lasso procedure (Jung et al., 2016), including high rates of false-positive coefficients, although it has not been studied elsewhere.

2.4 MM-Lasso⁶

Smucler and Yohai (2017) adapted the concept of MM-estimators for use in the lasso setting. Initially proposed by Yohai (1987), MM-estimators apply a bounded ρ-function to the residuals of an initial estimate in order to perform efficiently under idealized conditions of normality as well as handle a large percentage of arbitrary data contamination. Here and subsequently, contaminated data is defined as outlying data that has been made arbitrarily large or small. As defined by Maronna (2011), a ρ-function is a function with the following characteristics: i) the function is continuous, even, and bounded; ii) ρ(x) is a nondecreasing function of |x|; iii) ρ(0) = 0; iv) lim_{x→∞} ρ(x) = 1; and v) ρ(u) < ρ(v) for any ρ(v) < 1 and 0 ≤ u < v.

$$\hat{\beta}^{\text{MMlasso}} = \underset{\beta \in \mathbb{R}^p}{\arg\min}\; \sum_{i=1}^{n} \rho_1\!\left(\frac{r_i(\beta)}{s_n(r(\hat{\beta}_1))}\right) + \lambda \sum_{j=1}^{p} |\beta_j| \tag{2.8}$$

where $r(\beta)$ is the vector of residuals for coefficients $\beta$, and $s_n(r(\hat{\beta}_1))$ is the M-estimate of scale of the residuals of the initial estimate $\hat{\beta}_1$. Smucler and Yohai (2017) found that the MM-lasso, and an adaptive version (formulated below), both performed well in false positive and false negative rates under simulated data conditions with a variety of underlying distributions. The MM-lasso's strong performance was most clear in the presence of high-leverage outliers.

⁶ See Smucler and Yohai (2017) for mathematical considerations of the MM-lasso. Mathematical considerations of MM-estimators in general can be found in Yohai (1987), while mathematical considerations of applications to other regularization techniques can be found in Maronna (2011), Alfons et al. (2013), and Alfons et al. (2016).

2.4.1 Adaptive MM-Lasso

As researchers have done with the LAD-lasso, Smucler and Yohai (2017) further modified their MM-lasso using weighted tuning parameters:

$$\hat{\beta}^{\text{AdaMMlasso}} = \underset{\beta \in \mathbb{R}^p}{\arg\min}\; \sum_{i=1}^{n} \rho_1\!\left(\frac{r_i(\beta)}{s_n(r(\hat{\beta}_1))}\right) + \lambda_n \sum_{j=1}^{p} \hat{w}_j |\beta_j| \tag{2.9}$$

Chapter 3

Simulations

This chapter includes a description of the computational implementation of the various methods discussed, as well as an overview of the simulation methods and results for model selection. The simulation and all statistical implementations were conducted using the R statistical software.

3.1 Software Implementation

The model selection properties of the following methods are included:

• Adaptive Lasso: The adaptive lasso was implemented using a custom R function adapted from that generously provided by Dr. Qi Zheng, one of the authors of Zheng et al. (2016).
• LAD-Lasso: The LAD-lasso was implemented using a custom R function adapted from that generously provided by Dr. Qi Zheng, one of the authors of Zheng et al. (2016).
• Huberized Lasso: The Huberized lasso was implemented using a custom R function adapted from that generously provided by Dr. Yoonsuh Jung, one of the authors of Jung et al. (2016)¹.
• Outlier-Shifted Lasso: The outlier-shifted lasso was implemented using a custom R function adapted from that generously provided by Dr.
Yoonsuh Jung, one of the authors of Jung et al. (2016).
• Outlier-Shifted Huberized Lasso: The outlier-shifted Huberized lasso was implemented using a custom R function adapted from that generously provided by Dr. Yoonsuh Jung, one of the authors of Jung et al. (2016)².
• MM-Lasso: The MM-lasso was implemented from R code adapted from that generously provided by Dr. Ezequiel Smucler, one of the authors of Smucler and Yohai (2017).
• Adaptive MM-Lasso: The adaptive MM-lasso was implemented from R code adapted from that generously provided by Dr. Ezequiel Smucler, one of the authors of Smucler and Yohai (2017).

Additionally, the standard lasso was included as a benchmark against which to compare the other methods.

¹ At the time of writing, it is unclear whether the function uses the adaptive tuning parameter weighting scheme of Zou (2006) used in Lambert-Lacroix and Zwald (2011)'s original formulation. However, the formulation of the outlier-shifted Huberized lasso in the study suggests it does not.
² See the previous note on unclear weighting usage for the Huberized lasso implementation.

3.2 Simulation Conditions

Following the methodology used by Turkmen and Ozturk (2016) in their second simulation, data for the simulations were generated from the following mixture distributions:

$$\epsilon \sim (1 - \eta_y)\,N(0, \Sigma) + \eta_y\,N(10, \Sigma) \tag{3.1}$$

and

$$x \sim (1 - \eta_x)\,N(0, \Sigma) + \eta_x\,N(10, \Sigma) \tag{3.2}$$

where $\Sigma$ is the $p \times p$ identity matrix with 1's on the diagonal and 0's otherwise, and $\eta_x$ and $\eta_y$ are the amounts of contamination in the predictor and response space, to be discussed subsequently. The full linear model to be replicated by the analyses was then generated using the standard linear model,

$$y_i = x_i'\beta + \epsilon_i \tag{3.3}$$

with $\beta_1 = 0.5$, $\beta_2 = 1.0$, $\beta_3 = 1.5$, $\beta_4 = 2.0$, and all other coefficients equal to 0. (A condensed R sketch of this data-generation scheme is given at the end of this chapter, following the results summary.) The following features of the data were varied across conditions:

• p = number of potential predictors: p varied across three levels: 8, 30, 50.
• n = sample size: sample size varied across four levels: 25, 50, 100, 200.
• $\eta_x$ = amount of predictor contamination: x-space contamination varied across three levels: 0.0, 0.1, and 0.2, corresponding to 0%, 10%, and 20% contamination, respectively.
• $\eta_y$ = amount of response contamination: y-space contamination varied across three levels: 0.0, 0.1, and 0.2, corresponding to 0%, 10%, and 20% contamination, respectively.

with each possible combination of feature levels represented, for a total of 108 data conditions. 100 iterations of each condition were simulated, and analyses were conducted on each iteration.

The main metric for assessing the model selection abilities of each technique is as follows:

• False-Positive Rate (FPR): the rate of true zero coefficients incorrectly estimated to be non-zero.

The author also assessed the False-Negative Rate (FNR), defined as the rate of non-zero coefficients incorrectly estimated to be zero.

3.3 Results

The false-positive rates for each method under the various conditions are presented in Tables 1 through 3. Tables 1, 2, and 3 give the FPRs for p = 8, 30, and 50 predictors, respectively. Although the false-negative rate was assessed, the methods for the most part all performed similarly, with many FNRs of 0 and few above 1%. Consequently, tables were omitted for this metric, and noteworthy cases are highlighted in the summary below.

The results for each method, in summary:

• The adaptive lasso demonstrated very strong behavior in all of the conditions, with most FPRs below 5% and FPRs generally lower than all methods except the LAD-lasso.
However, this came with consistently non-zero FNRs, occasionally as high as 7.5%. With larger X-contamination, Y-contamination reduced performance, although performance was roughly stable across contamination levels otherwise.
• Across data conditions, the LAD-lasso consistently performed the best in terms of FPR. Although FNRs typically fell below 2% and were often lower than those of the adaptive lasso, they were consistently non-zero. Performance of the LAD-lasso tended to increase as contamination in either space increased.
• The Huberized lasso had among the lowest FNRs overall, with no false negatives in most of the data conditions. However, the Huberized lasso also typically had the highest rates of false positives, with FPRs ranging from 20% to 43%. Performance was most drastically impaired by simultaneous outliers in the predictors and the response.
• The outlier-shifted lasso performed better than either the Huberized lasso or the combined outlier-shifted, Huberized lasso in terms of FPR, with FNRs approaching those of the two Huberized methods. The method performed best with increased sample size, increased number of potential predictors, and moderate-to-large Y-contamination (but no X-contamination).
• The outlier-shifted Huberized lasso performed comparably with the Huberized lasso. FPRs were more unstable across conditions, ranging from the low 20% range to above 40%, with no consistent pattern of greater or reduced performance.
• The MM-lasso performed moderately well relative to the other methods, with FPRs typically ranging from 20% to 26%. The MM-lasso performed more effectively with greater X-contamination, and performance decreased with a greater number of potential predictors.
• The adaptive MM-lasso demonstrated some of the best FPRs overall, behind the adaptive lasso and the LAD-lasso, with rates ranging from 7.4% to 20.3%. As with the adaptive lasso and the LAD-lasso (the other strongest performers in terms of FPR), the adaptive MM-lasso more consistently had non-zero FNRs, although this rate almost never exceeded 0.5%. The adaptive MM-lasso performed best with a larger number of potential predictors and larger sample size, but did not demonstrate any clear patterns with respect to the amount of outlier contamination in either predictors or response.

Overall, the LAD-lasso, adaptive lasso, and adaptive MM-lasso (in that order) consistently had the best performance in terms of the rate of true-zero coefficients incorrectly estimated to be non-zero. However, these methods were also much more likely to estimate true non-zero coefficients to be zero, at rates corresponding to their effectiveness at correctly eliminating true-zero coefficients. The Huberized methods had the most consistent 0% FNRs, but also performed worse than most other methods and had the most inconsistent patterns of performance across conditions. Overall, the LAD-lasso provided the best balance of FPR and FNR across data conditions, consistently removing true-zero coefficients less than 5% of the time and correctly including non-zero coefficients more than 98% of the time.

Future work on this project will explore the inconsistent behavior of some of these techniques. The main goal is to increase the number of iterations per data condition from 100 to 1,000. Additionally, the author will add simulations using different parameter structures as well as analyses of applied data sets.
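For reference, the following condensed R sketch reproduces the contaminated data-generation scheme of Section 3.2 (equations 3.1 through 3.3), mirroring the rmvnorm-based generation in the full simulation code of Appendix F. The function name and argument defaults are illustrative, and the contaminated errors follow the N(10, ·) mixture component as written in equation (3.1).

library(mvtnorm)   # rmvnorm()

# Condensed sketch of the Section 3.2 data generation (eqs. 3.1-3.3).
# Illustrative helper only; the full per-condition simulation loop is in Appendix F.
gen.data <- function(n = 100, p = 8, eta.x = 0.1, eta.y = 0.1) {
  Sigma <- diag(p)                                   # p x p identity covariance
  beta  <- c(0.5, 1.0, 1.5, 2.0, rep(0, p - 4))      # true coefficient structure
  n.cx  <- round(eta.x * n)                          # contaminated predictor rows
  n.cy  <- round(eta.y * n)                          # contaminated error terms
  X <- rbind(rmvnorm(n - n.cx, mean = rep(0, p), sigma = Sigma),
             if (n.cx > 0) rmvnorm(n.cx, mean = rep(10, p), sigma = Sigma))
  err <- c(rnorm(n - n.cy, mean = 0, sd = 1),
           if (n.cy > 0) rnorm(n.cy, mean = 10, sd = 1))
  y <- as.vector(X %*% beta + err)
  list(X = X, y = y, beta = beta)
}

dat <- gen.data(n = 100, p = 8, eta.x = 0.1, eta.y = 0.2)   # one simulated data set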
False-Positive Rate P=8 ⌘ y = 0.0 ⌘ x = 0.1 ⌘ x = 0.2 ⌘ y = 0.0 ⌘ y = 0.1 ⌘ y = 0.2 ⌘ y = 0.0 ⌘ y = 0.1 ⌘ y = 0.2 ⌘ y = 0.0 ⌘ y = 0.1 ⌘ y = 0.2 N = 25 Lasso 0.2888 0.2113 0.2688 0.2850 0.2825 0.2813 0.2425 0.2775 0.2850 AdaLasso 0.0700 0.0088 0.0200 0.0113 0.0225 0.0225 0.0663 0.0300 0.0113 LAD-Lasso 0.0463 0.0050 0.0013 0.0025 0.0013 0.0025 0.0563 0.0088 0.0013 Huber Lasso 0.3125 0.2025 0.2375 0.2900 0.3200 0.3200 0.2488 0.3000 0.3200 OS Lasso 0.2613 0.2125 0.2425 0.2713 0.1875 0.1438 0.2575 0.2888 0.2825 OS Huber Lasso 0.3113 0.2050 0.2425 0.3123 0.3050 0.3413 0.2450 0.3200 0.3150 MM Lasso 0.3738 0.2650 0.2763 0.3050 0.2888 0.2838 0.2875 0.3113 0.2488 Ada MM-Lasso 0.1688 0.1475 0.1400 0.1238 0.1325 0.1525 0.1600 0.1188 0.1200 N = 50 Lasso 0.3075 0.3050 0.2688 0.2500 0.2825 0.2813 0.2713 0.2663 0.2850 AdaLasso 0.0425 0.0563 0.0200 0.0388 0.0225 0.0225 0.0363 0.0350 0.0113 LAD-Lasso 0.0275 0.0188 0.0013 0.0325 0.0013 0.0025 0.0350 0.0225 0.0013 Huber Lasso 0.3125 0.3125 0.2375 0.2500 0.3200 0.3200 0.25875 0.2625 0.3200 OS Lasso 0.2825 0.2225 0.2425 0.2575 0.1875 0.1438 0.2663 0.2588 0.2825 OS Huber Lasso 0.3113 0.3275 0.2425 0.2675 0.3050 0.3413 0.2500 0.2638 0.3150 MM Lasso 0.3025 0.3238 0.2763 0.2725 0.2888 0.2838 0.2888 0.2913 0.2488 Ada MM-Lasso 0.1138 0.1238 0.1400 0.1425 0.1325 0.1525 0.1313 0.1300 0.1200 N = 100 Lasso 0.2863 0.3038 0.2650 0.2538 0.2727 0.2513 0.2263 0.2538 0.2788 AdaLasso 0.0288 0.0300 0.0400 0.0175 0.0488 0.0225 0.0238 0.0175 0.0563 LAD-Lasso 0.0113 0.0025 0.0063 0.0113 0.0113 0.0038 0.0100 0.0038 0.0113 Huber Lasso 0.2938 0.3150 0.3025 0.2488 0.2775 0.2838 0.2150 0.2488 0.2925 OS Lasso 0.2750 0.2088 0.1588 0.2688 0.2838 0.2800 0.2213 0.2350 0.2913 OS Huber Lasso 0.2988 0.3138 0.3413 0.2538 0.3038 0.3300 0.2138 0.2438 0.2938 MM Lasso 0.3013 0.2900 0.2888 0.2525 0.2613 0.2663 0.2675 0.2300 0.2525 Ada MM-Lasso 0.1288 0.1075 0.1275 0.1025 0.1063 0.1325 0.1113 0.0925 0.0888 N = 200 Lasso 0.2850 0.2825 0.2813 0.2325 0.2775 0.2850 0.2113 0.2688 0.2813 AdaLasso 0.0113 0.0225 0.0225 0.0088 0.0300 0.0113 0.0088 0.0200 0.0338 LAD-Lasso 0.0025 0.0013 0.0025 0.0038 0.0088 0.0013 0.0050 0.0013 0.0063 Huber Lasso 0.2900 0.3200 0.3200 0.2625 0.0300 0.3200 0.2025 0.2375 0.2838 OS Lasso 0.2713 0.1875 0.1438 0.2400 0.2888 0.2825 0.2125 0.2425 0.2875 OS Huber Lasso 0.3013 0.3050 0.3413 0.2538 0.3200 0.3150 0.2050 0.2425 0.2850 MM Lasso 0.3050 0.2888 0.2838 0.2700 0.3113 0.2488 0.2650 0.2763 0.2913 Ada MM-Lasso 0.1238 0.1325 0.1525 0.1250 0.1188 0.1200 0.1475 0.1400 0.1300 Table 1: False Positive Rates: p = 8 predictors Contents 14 False-Positive Rate P = 30 ⌘ y = 0.0 ⌘ x = 0.1 ⌘ x = 0.2 ⌘ y = 0.0 ⌘ y = 0.1 ⌘ y = 0.2 ⌘ y = 0.0 ⌘ y = 0.1 ⌘ y = 0.2 ⌘ y = 0.0 ⌘ y = 0.1 ⌘ y = 0.2 N = 25 Lasso 0.2813 0.2260 0.2527 0.2617 0.2347 0.2333 0.2253 0.3157 0.2663 AdaLasso 0.0338 0.0120 0.0157 0.0143 0.0190 0.0157 0.0143 0.0287 0.0187 LAD-Lasso 0.0063 0.0120 0.0053 0.0240 0.0137 0.0067 0.0177 0.0103 0.0050 Huber Lasso 0.2838 0.2527 0.2760 0.2703 0.2547 0.2990 0.2733 0.3777 0.3060 OS Lasso 0.2875 0.2370 0.2607 0.2343 0.0607 0.0350 0.2513 0.2900 0.2763 OS Huber Lasso 0.2850 0.2537 0.2823 0.2850 0.3007 0.3263 0.3513 0.4243 0.3743 MM Lasso 0.2913 0.2233 0.2357 0.2697 0.2787 0.2660 0.2517 0.2613 0.2543 Ada MM-Lasso 0.1300 0.0823 0.0737 0.2007 0.2027 0.1507 0.0937 0.1180 0.1120 N = 50 Lasso 0.3067 0.2263 0.2030 0.2617 0.2347 0.2420 0.2263 0.3157 0.2663 AdaLasso 0.0763 0.0677 0.0597 0.0697 0.0190 0.0543 0.0143 0.0287 0.0187 LAD-Lasso 0.0953 0.0403 0.0323 0.1103 0.0137 0.0403 0.0177 0.0103 0.0050 Huber 
Lasso 0.3060 0.2870 0.2643 0.2963 0.2547 0.2777 0.2733 0.3777 0.3060 OS Lasso 0.2620 0.0990 0.0697 0.2700 0.0607 0.2300 0.2513 0.2900 0.2763 OS Huber Lasso 0.3217 0.3353 0.3217 0.3333 0.3007 0.3453 0.3513 0.4243 0.3743 MM Lasso 0.4623 0.4257 0.3107 0.4120 0.2787 0.3020 0.2517 0.2613 0.2543 Ada MM-Lasso 0.2437 0.2520 0.1660 0.1817 0.2027 0.1337 0.0937 0.1180 0.1120 N = 100 Lasso 0.2640 0.2117 0.2273 0.2380 0.3060 0.2643 0.2467 0.2480 0.3247 AdaLasso 0.0237 0.0327 0.0357 0.0220 0.0320 0.0297 0.0200 0.0203 0.0473 LAD-Lasso 0.0380 0.0163 0.0110 0.0283 0.0233 0.0140 0.0357 0.0127 0.0193 Huber Lasso 0.2747 0.2750 0.2800 0.2710 0.3530 0.3097 0.2750 0.2920 0.3950 OS Lasso 0.2357 0.0633 0.0483 0.2617 0.2910 0.2547 0.2740 0.2650 0.2780 OS Huber Lasso 0.2760 0.3043 0.3123 0.3267 0.4057 0.3823 0.2830 0.2753 0.4000 MM Lasso 0.3200 0.2767 0.2450 0.2853 0.2637 0.2553 0.2350 0.2663 0.2653 Ada MM-Lasso 0.1817 0.1557 0.1267 0.1060 0.1177 0.1133 0.0927 0.0603 0.1117 N = 200 Lasso 0.2617 0.2347 0.2333 0.2263 0.3157 0.2663 0.2260 0.2527 0.3450 AdaLasso 0.0143 0.0190 0.0157 0.0143 0.0287 0.0187 0.0120 0.0157 0.0350 LAD-Lasso 0.0240 0.0137 0.0057 0.0177 0.0103 0.0050 0.0120 0.0053 0.0107 Huber Lasso 0.2703 0.2547 0.2990 0.2733 0.3777 0.3060 0.2527 0.2760 0.4260 OS Lasso 0.2343 0.0607 0.0350 0.2513 0.2900 0.2763 0.2370 0.2607 0.3077 OS Huber Lasso 0.2850 0.3007 0.3263 0.3513 0.4243 0.3743 0.2537 0.2823 0.4153 MM Lasso 0.2697 0.2787 0.2660 0.2517 0.2613 0.2543 0.2233 0.2357 0.2583 Ada MM-Lasso 0.2007 0.2027 0.1507 0.0937 0.1180 0.1120 0.0823 0.0737 0.0883 Table 2: False Positive Rates: p = 30 predictors Contents 15 False-Positive Rate P = 50 ⌘ y = 0.0 ⌘ x = 0.1 ⌘ x = 0.2 ⌘ y = 0.0 ⌘ y = 0.1 ⌘ y = 0.2 ⌘ y = 0.0 ⌘ y = 0.1 ⌘ y = 0.2 ⌘ y = 0.0 ⌘ y = 0.1 ⌘ y = 0.2 N = 25 Lasso 0.3450 0.1880 0.1884 0.1900 0.1908 0.1832 0.1998 0.2738 0.2148 AdaLasso 0.0350 0.0066 0.0094 0.0112 0.0124 0.0168 0.0060 0.0200 0.0140 LAD-Lasso 0.0107 0.0136 0.0034 0.0188 0.0072 0.0058 0.0146 0.0094 0.0028 Huber Lasso 0.4260 0.2214 0.2458 0.2084 0.2098 0.2340 0.2294 0.3548 0.2628 OS Lasso 0.3077 0.1952 0.2110 0.1712 0.0322 0.0252 0.2018 0.2370 0.2192 OS Huber Lasso 0.4153 0.2198 0.2276 0.2194 0.2436 0.2546 0.3012 0.4248 0.3436 MM Lasso 0.2583 0.1916 0.2166 0.2198 0.2014 0.2008 0.2058 0.2116 0.2276 Ada MM-Lasso 0.0883 0.0514 0.0866 0.1808 0.1734 0.1462 0.0764 0.0960 0.0916 N = 50 Lasso 0.3450 0.1880 0.1880 0.1900 0.1908 0.1832 0.1998 0.2738 0.2146 AdaLasso 0.0350 0.0066 0.0066 0.0112 0.0124 0.0168 0.0060 0.0200 0.0140 LAD-Lasso 0.0107 0.0136 0.0136 0.0188 0.0072 0.0058 0.0146 0.0094 0.0028 Huber Lasso 0.0426 0.2214 0.2214 0.2084 0.2098 0.2340 0.2294 0.3548 0.2628 OS Lasso 0.3077 0.1952 0.1952 0.1712 0.0322 0.0252 0.2018 0.2370 0.2192 OS Huber Lasso 0.4153 0.2198 0.2198 0.2194 0.2436 0.2546 0.3012 0.4248 0.3436 MM Lasso 0.2583 0.1916 0.1916 0.2198 0.2014 0.2008 0.2058 0.2116 0.2276 Ada MM-Lasso 0.0883 0.0514 0.0514 0.1808 0.1734 0.1462 0.0764 0.0960 0.0916 N = 100 Lasso 0.2198 0.1898 0.1732 0.2134 0.2770 0.1996 0.1994 0.1944 0.2662 AdaLasso 0.0242 0.0366 0.0230 0.0210 0.0234 0.0240 0.0126 0.0112 0.0308 LAD-Lasso 0.0410 0.0150 0.0122 0.0336 0.0226 0.0102 0.0298 0.0108 0.0176 Huber Lasso 0.2330 0.2366 0.2268 0.2546 0.3458 0.2518 0.2266 0.2504 0.3964 OS Lasso 0.1920 0.0608 0.0396 0.2160 0.2354 0.2158 0.2044 0.2108 0.2314 OS Huber Lasso 0.2506 0.2750 0.2970 0.3418 0.4170 0.3240 0.2266 0.2566 0.4014 MM Lasso 0.4074 0.2630 0.2048 0.2974 0.2768 0.2278 0.2750 0.2176 0.2150 Ada MM-Lasso 0.2408 0.1598 0.1354 0.1178 0.1196 0.0812 0.1114 0.0562 0.0798 N = 
200 Lasso 0.1900 0.1908 0.1832 0.1998 0.2738 0.2146 0.1880 0.1884 0.2908 AdaLasso 0.0112 0.0124 0.0168 0.0060 0.0200 0.0140 0.0066 0.0094 0.0290 LAD-Lasso 0.0188 0.0072 0.0058 0.0146 0.0094 0.0028 0.0136 0.0034 0.0098 Huber Lasso 0.2084 0.2098 0.2340 0.2294 0.3548 0.2628 0.2214 0.2458 0.4338 OS Lasso 0.1712 0.0322 0.0252 0.2018 0.2370 0.2192 0.1952 0.2110 0.2302 OS Huber Lasso 0.2194 0.2436 0.2546 0.3012 0.4248 0.3436 0.2198 0.2276 0.4356 MM Lasso 0.2198 0.2014 0.2008 0.2058 0.2116 0.2276 0.1916 0.2166 0.2232 Ada MM-Lasso 0.1808 0.1734 0.1462 0.0764 0.0960 0.0916 0.0514 0.0866 0.1122 Table 3: False Positive Rates: p = 50 predictors Appendix A R Code: R Package Libraries The following code includes all of the packages used in the simulation. The mmlasso package was installed from github. library(care) library(doParallel) library(glmnet) library(lars) library(MASS) library(mvtnorm) library(parallel) library(githubinstall) library(quantreg) library(Rcpp) library(readr) library(robustbase) library(robustHD) library(spc) source("lsa.linear.txt") library(mmlasso) 16 Appendix B Custom R Statistical Functions: lsa.linear.txt Source Code 1 The following code was sourced in the libraries in Appendix A and used throughout the simulation for the adaptive lasso and LAD-lasso. lsa.linear<-function(x,y){ require(lars) ## Least square approximation. This version Oct 19, 2006 ## Reference Wang, H. and Leng, C. (2006) and Efron et al. (2004). ## ## Written by Chenlei Leng ## ## Input ## obj: lm/glm/coxph or other object ## ## Output ## beta.ols: the MLE estimate ## beta.bic: the LSA-BIC estimate ## beta.aic: the LSA-AIC estimate lsa <- function(obj) { intercept <- attr(obj$terms,’intercept’) 1 generously provided by Zheng et al. (2016) 17 Contents 18 if(class(obj)[1]==’coxph’) intercept <- 0 n <- length(obj$residuals) Sigma <- vcov(obj) SI <- solve(Sigma) beta.ols <- coef(obj) l.fit <- lars.lsa(SI, beta.ols, intercept, n) t1 <- sort(l.fit$BIC, ind=T) t2 <- sort(l.fit$AIC, ind=T) beta <- l.fit$beta if(intercept) { beta0 <- l.fit$beta0+beta.ols[1] beta.bic <- c(beta0[t1$ix[1]],beta[t1$ix[1],]) beta.aic <- c(beta0[t2$ix[1]],beta[t2$ix[1],]) } else { beta0 <- l.fit$beta0 beta.bic <- beta[t1$ix[1],] beta.aic <- beta[t2$ix[1],] } obj <- list(beta.ols=beta.ols, beta.bic=beta.bic, beta.aic = beta.aic) obj } ################################### ## lars variant for LSA lars.lsa <- function (Sigma0, b0, intercept, n, type = c("lasso", "lar"), eps = .Machine$double.eps,max.steps) Contents 19 { type <- match.arg(type) TYPE <- switch(type, lasso = "LASSO", lar = "LAR") n1 <- dim(Sigma0)[1] ## handle intercept if (intercept) { a11 <- Sigma0[1,1] a12 <- Sigma0[2:n1,1] a22 <- Sigma0[2:n1,2:n1] Sigma <- a22-outer(a12,a12)/a11 b <- b0[2:n1] beta0 <- crossprod(a12,b)/a11 } else { Sigma <- Sigma0 b <- b0 } Sigma <- diag(abs(b))%*%Sigma%*%diag(abs(b)) b <- sign(b) nm <- dim(Sigma) m <- nm[2] im <- inactive <- seq(m) Cvec <- drop(t(b)%*%Sigma) ssy <- sum(Cvec*b) if (missing(max.steps)) max.steps <- 8 * m beta <- matrix(0, max.steps + 1, m) Gamrat <- NULL arc.length <- NULL R2 <- 1 RSS <- ssy first.in <- integer(m) active <- NULL Contents 20 actions <- as.list(seq(max.steps)) drops <- FALSE Sign <- NULL R <- NULL k <- 0 ignores <- NULL while ((k < max.steps) & (length(active) < m)) { action <- NULL k <- k + 1 C <- Cvec[inactive] Cmax <- max(abs(C)) if (!any(drops)) { new <- abs(C) >= Cmax - eps C <- C[!new] new <- inactive[new] for (inew in new) { R <- updateR(Sigma[inew, inew], R, drop(Sigma[inew, active]), Gram = TRUE,eps=eps) 
if(attr(R, "rank") == length(active)) { ##singularity; back out nR <- seq(length(active)) R <- R[nR, nR, drop = FALSE] attr(R, "rank") <- length(active) ignores <- c(ignores, inew) action <- c(action, - inew) } else { if(first.in[inew] == 0) first.in[inew] <- k active <- c(active, inew) Sign <- c(Sign, sign(Cvec[inew])) action <- c(action, inew) } } } else action <- -dropid Gi1 <- backsolve(R, backsolvet(R, Sign)) dropouts <- NULL Contents 21 A <- 1/sqrt(sum(Gi1 * Sign)) w <- A * Gi1 if (length(active) >= m) { gamhat <- Cmax/A } else { a <- drop(w %*% Sigma[active, -c(active,ignores), drop = FALSE]) gam <- c((Cmax - C)/(A - a), (Cmax + C)/(A + a)) gamhat <- min(gam[gam > eps], Cmax/A) } if (type == "lasso") { dropid <- NULL b1 <- beta[k, active] z1 <- -b1/w zmin <- min(z1[z1 > eps], gamhat) # cat(’zmin ’,zmin, ’ gamhat ’,gamhat,’\n’) if (zmin < gamhat) { gamhat <- zmin drops <- z1 == zmin } else drops <- FALSE } beta[k + 1, ] <- beta[k, ] beta[k + 1, active] <- beta[k + 1, active] + gamhat * w Cvec <- Cvec - gamhat * Sigma[, active, drop = FALSE] %*% w Gamrat <- c(Gamrat, gamhat/(Cmax/A)) arc.length <- c(arc.length, gamhat) if (type == "lasso" && any(drops)) { dropid <- seq(drops)[drops] for (id in rev(dropid)) { R <- downdateR(R,id) } dropid <- active[drops] beta[k + 1, dropid] <- 0 active <- active[!drops] Sign <- Sign[!drops] } Contents 22 actions[[k]] <- action inactive <- im[-c(active)] } beta <- beta[seq(k + 1), ] dff <- b-t(beta) RSS <- diag(t(dff)%*%Sigma%*%dff) if(intercept) beta <- t(abs(b0[2:n1])*t(beta)) else beta <- t(abs(b0)*t(beta)) if (intercept) { beta0 <- as.vector(beta0)-drop(t(a12)%*%t(beta))/a11 } else { beta0 <- rep(0,k+1) } dof <- apply(abs(beta)>eps,1,sum) BIC <- RSS+log(n)*dof AIC <- RSS+2*dof object <- list(AIC = AIC, BIC = BIC, beta = beta, beta0 = beta0) object } ##This part is written by Hansheng Wang. vcov.rq <- function(object,...) { q=object$tau x=as.matrix(object$x) resid=object$residuals f0=density(resid,n=1,from=0,to=0)$y COV=q*(1-q)*solve(t(x)%*%x)/f0^2 COV } Contents 23 # adaptive lasso for linear reg, tuning parameter by bic # calls software from Wang and Leng (2007, JASA). ok<-complete.cases(x,y) x<-x[ok,] # get rid of na’s y<-y[ok] # since regsubsets can’t handle na’s m<-ncol(x) n<-nrow(x) as.matrix(x)->x lm(y~x)->out lsa(out)->out.lsa coeff<-out.lsa$beta.bic coeff2<-coeff[2:(m+1)] # get rid of intercept pred<-x%*%coeff2+coeff[1] st<-sum(coeff2 !=0) # number nonzero mse<-sum((y-pred)^2)/(n-st-1) if(st>0) x.ind<-as.vector(which(coeff2 !=0)) else x.ind<-0 return(list(fit=pred,st=st,mse=mse,x.ind=x.ind,coeff=coeff2, intercept=coeff[1])) } Appendix C Custom R Statistical Functions: Huberizing and Outlier-Shifting 1 Thefollowingcodewasimplementedpriortothesimulation-generatingfunctioninorder tousetheHuberizedlasso,theOutlier-shiftedlasso,andtheOutlier-shiftedHuberlasso. ###Huber lasso H.lasso <- function(X,Y,lambda.lasso.try,k=1.5){ n<-length(Y) Y.orgn<- Y model.for.cv<- cv.glmnet(X, Y, family="gaussian",lambda=lambda.lasso.try) lambda.lasso.opt<- model.for.cv$lambda.min model.est<- glmnet(X,Y,family="gaussian", lambda=lambda.lasso.opt) fit.lasso<- predict(model.est,X,s=lambda.lasso.opt) res.lasso<- Y-fit.lasso sigma.init<- mad(Y-fit.lasso) beta.pre<- c(model.est$a0,as.numeric(model.est$beta)) Y.old<- Y tol = 10 n.iter <- 0 while(tol>1e-4 & n.iter<100) { Y.new<- fit.lasso + winsorized(res.lasso,a=k, sigma=sigma.init) 1 generously provided by Jung et al. 
(2016) 24 Contents 25 model.for.cv<- cv.glmnet(X,Y.new, family="gaussian", lambda=lambda.lasso.try) model.est<- glmnet(X,Y.new,family="gaussian", lambda=model.for.cv$lambda.min ) fit.lasso<- predict(model.est,X,s=model.for.cv$lambda.min) res.lasso<- Y.new-fit.lasso beta.post <- c(model.est$a0,as.numeric(model.est$beta)) tol<- sum((beta.pre-beta.post)^2) n.iter<- n.iter+1 beta.pre<- beta.post } sigma.est<- mean((Y.new-cbind(rep(1,n),X)%*%beta.post)^2) Y.fit<- cbind(rep(1,n),X)%*%beta.post Y.res<- Y.new - Y.fit object<- list(coefficient=beta.post,fit=Y.fit, iter = n.iter, sigma.est = sigma.est, lambda.lasso.opt = model.est$lambda, residual = Y.res) H.Lasso.beta.mtx <<- object } winsorized<- function(x,a=1.5,sigma=1) { s<-sigma newx<-x indp<-x>(a*s) newx[indp]<-(a*s) indn<- x<(a*-s) newx[indn]<- (-a*s) return(newx)} ###OS Huber Lasso OSH.lasso <- function(X,Y,lambda.lasso.try,k=1.5){ n<-length(Y) Y.orgn<- Y model.for.cv<- cv.glmnet(X, Y, family="gaussian", lambda=lambda.lasso.try) #cat("OSH.lasso/cv.glmnet" , "\n") model.est <- H.lasso(X,Y,lambda.lasso.try,k) Contents 26 #cat("OSH.lasso/H.lasso, j =" , j , "\n") fit.lasso <- model.est$fit #cat("OSH.lasso/fit.lasso, j =" , j , "\n") res.lasso <- Y - fit.lasso #cat("OSH.lasso/res.lasso, j =" , j , "\n") sigma.est<- mad(Y-fit.lasso) #cat("OSH.lasso/sigma.est, j =" , j , "\n") beta.pre<- model.est$coefficient #cat("OSH.lasso/beta.pre, j =" , j , "\n") gamma.est<-rep(0,n) Y.old<- Y tol = 10 n.iter <- 0 outliers.init<- abs(scale(Y-model.est$residual)) n.outlier<- length(which(as.vector(outliers.init)>2.5)) lambda.gamma<- sigma.est*qnorm((2*n-n.outlier)/(2*n)) cat("OSH.lasso/lambda.gamma" , "\n") while(tol>1e-4 & n.iter<100) { nonzero<-which(abs(res.lasso)>=lambda.gamma) gamma.est[nonzero]<- res.lasso[nonzero] Y.new<- Y.old - gamma.est model.est<- H.lasso(X,Y.new,lambda.lasso.try,k) beta.post<- model.est$coefficient res.lasso<- model.est$residual tol<- sum((beta.pre-beta.post)^2) n.iter<- n.iter+1 beta.pre<- beta.post } cat("OSH.lasso/whileloop" , "\n") Y.fit<- cbind(rep(1,n),X)%*%beta.post #cat("OSH.lasso/Y.fit" , "\n") object<- list(coefficient=beta.post,fit=fit.lasso,iter = n.iter, n.outlier=n.outlier,sigma.est=sigma.est,gamma.est = gamma.est, lambda = model.est$lambda) #cat("OSH.lasso/object" , "\n") } Contents 27 ###OS Lasso OS.lasso<- function(X,Y,lambda.lasso.try,lambda.gamma.try){ n<-length(Y) Y.orgn<- Y model.for.cv<- cv.glmnet(X, Y, family="gaussian", lambda=lambda.lasso.try) lambda.lasso.opt<- model.for.cv$lambda.min model.est<- glmnet(X,Y,family="gaussian",lambda=lambda.lasso.opt) fit.lasso<- predict(model.est,X,s=lambda.lasso.opt) res.lasso<- Y - fit.lasso sigma.est<- mad(Y-fit.lasso) beta.est<- as.numeric(model.est$beta) gamma.est<-rep(0,n) n.fold<- 5 n.cv <- n/n.fold CV.error2<-CV.error<-rep(NA,length(lambda.gamma.try)) Y.pred.cv<-matrix(NA,nrow=length(Y),ncol=length(lambda.gamma.try)) Y.new<- Y for (tt in 1:length(lambda.gamma.try)) { gamma.est.cv<-rep(0,n-n.cv) for (jj in 1:n.fold) { sample.out.index<- (1+n.cv*(jj-1)):(n.cv*(jj)) X.train<- X[-sample.out.index,] Y.train<- Y[-sample.out.index] X.test<- X[sample.out.index,] model.train.temp<- glmnet(X.train,Y.train,family="gaussian", lambda=lambda.lasso.opt) beta.pre<-beta.post<- c(model.train.temp$a0, as.numeric(model.train.temp$beta)) tol<-100; n.iter <- 0 while(tol>1e-6 & n.iter<100) { Contents 28 resid.temp<- Y.train-cbind(rep(1,n-n.cv),X.train)%*%beta.pre nonzero<-which(abs(resid.temp)>=sigma.est*lambda.gamma.try[tt]) gamma.est.cv[nonzero]<- resid.temp[nonzero] Y.train.new <- 
Y.train - gamma.est.cv model.train.temp<-glmnet(X.train,Y.train.new, family="gaussian",lambda=lambda.lasso.opt) beta.post <- c(model.train.temp$a0, as.numeric(model.train.temp$beta)) tol<- sum((beta.pre-beta.post)^2) n.iter<- n.iter+1 beta.pre<-beta.post } Y.pred.cv[sample.out.index,tt] <-cbind(rep(1,n.cv), X.test)%*%beta.post } CV.error[tt]<- mean((Y.pred.cv[,tt]-Y.orgn)^2) CV.error2[tt]<- mean(abs(Y.pred.cv[,tt]-Y.orgn)) } lambda.gamma.opt<- lambda.gamma.try[which.min(CV.error)] model.opt<- glmnet(X,Y.orgn,family="gaussian",lambda=lambda.lasso.opt) beta.pre<- beta.post<- c(model.opt$a0,as.numeric(model.opt$beta)) tol<-100; n.iter <- 0 while(tol>1e-6 & n.iter<100) { resid.opt<- Y.orgn-cbind(rep(1,n),X)%*%beta.pre nonzero<-which(abs(resid.opt)>=sigma.est*lambda.gamma.opt) gamma.est[nonzero]<- resid.opt[nonzero] Y.new <- Y.orgn - gamma.est model.opt<- glmnet(X,Y.new,family="gaussian",lambda=lambda.lasso.opt) beta.post <- c(model.opt$a0,as.numeric(model.opt$beta)) tol<- mean((beta.pre-beta.post)^2) n.iter<- n.iter+1 beta.pre<-beta.post Contents 29 } Y.fit<- cbind(rep(1,n),X)%*%beta.post object<- list(coefficient=beta.post,fit=fit.lasso,iter = n.iter, sigma.est = sigma.est,CV.error=CV.error, n.outlier=length(which(gamma.est!=0)), gamma.est=gamma.est,lambda.opt=lambda.gamma.opt) } Appendix D Custom R Statistical Functions: MM-Lasso and Adaptive MM-Lasso 1 Thefollowingcodewasimplementedpriortothesimulation-generatingfunctioninorder to use the MM-lasso and adaptive MM-lasso. ###MM Lasso/Adaptive MM Lasso #SRidge CVSE<-function(X,y,nfold,lam,gradlib,nkeep,niter.S){ #Performs nfold-CV for S-Ridge #INPUT #beta.ini, scale.ini: initial estimates of regression and scale #X,y: data #lam: penalization parameter #gradlib: degrees of freedom, used to calculate the delta for the M-scale #nkeep: number of candidates for full iterations of IWLS #niter.S : number of maximum iterations of IWLS #OUTPUT #mse: resulting MSE (estimated using a tau-scale) ###Segment data n<-nrow(X) p<-ncol(X) 1 generously provided by Smucler and Yohai (2017) 30 Contents 31 indin<-1:n nestim<-n*(1-1/nfold) lamcv<-lam deltaesc<-0.5*(1-(gradlib)/nestim) inint<-floor(seq(0,n,length.out=nfold+1)) resid<-vector(mode=’numeric’,length=n) ### for (kk in 1:nfold){ testk<-(inint[kk]+1):inint[kk+1] estik<-setdiff(indin,testk); Xtest<-X[testk,] Xesti<-X[estik,] ytest<-y[testk] yesti<-y[estik] se<-try(rr_se(Xesti,yesti,lamcv,deltaesc,cc_scale=1,nkeep,niter.S, epsilon=1e-4)) if (class(se)=="try-error"){ return(Inf) } bet<-se$coef bint<-bet[1] beta<-bet[2:(p+1)] fitk<-Xtest%*%beta+bint resid[testk]<-ytest-fitk } mse<-scale_tau(resid) return(mse) } findlam<-function(vals,r){ #Finds lamdas which yield edf=r p<-length(vals) nr<-length(r) lamr<-vector(mode=’numeric’,nr); lam1<-0 lam2<-max(vals)*(p/r-1) for (i in 1:nr){ lam0<-c(lam1, lam2[i]+0.5) #the value 0.5 is for lam=0 lamr[i]<-uniroot(sumrat,lam0,vals,r[i])$root Contents 32 } return(lamr) } sumrat<-function(lam,vals,r){ susu<-sum(vals/(vals+lam))-r return(susu) } const_marina<-function(delta){ integrand<- function(x,c){dnorm(x)*rho_marina(x,c)} expe<-function(c,delta){ integrate(integrand,lower=-Inf,upper=Inf,c)$value-delta } init<-c(0.1,100) try(cw<-uniroot(expe,init,delta)$root,silent=TRUE) if (class(cw)=="try-error"){ warning("Something’s wrong, could not find tuning constant for the scale") return(NULL) } return(cw) } sridge<-function(x,y,cualcv.S=5,numlam.S=30,niter.S=50,normin=0,denormout=0, alone=0,ncores=1){ #Solves n*s_n^2 +lam*||beta1||^2} = min. 
Adapted from Ricardo Maronna’s original MATLAB code. #INPUT #cualcv.S: method for estimating prediction error. cualcv-fold cross-validation #normin: normalize input data?. 0=no, default; 1=yes #denormout: denormalize output?. 0=no, default;1=yes #alone: are you calculating the estimator for its sake only? 0=no, default;1=yes #numlam.S: number of lambda values, default 30 #niter.S : number of maximum iterations of IWLS #ncores : number of cores to use for parallel computations. Default is one core. #OUTPUT Contents 33 #coef: (p+1)-vector of regression parameters, beta(1)=intercept #scale: M-estimate of scale of the final regression estimate #edf: final equivalent degrees of freedom #lamda: optimal lambda #delta: optimal delta ###Center and scale covariates and response using median and mad if (normin==1){ prep<-prepara(x,y) Xnor<-prep$Xnor ynor<-prep$ynor mux<-prep$mux sigx<-prep$sigx muy<-prep$muy }else{ Xnor<-x ynor<-y } #Spherical Principal Components (no centering) #privar, Beig= vector of robust "eigenvalues" and matrix of eigenvectors #Xnor is now = PCA scores pca<-SPCC(Xnor) privar<-pca$lamda Beig<-pca$b Xnor<-pca$scores n<-nrow(Xnor) p<-ncol(Xnor) nkeep<-5 #number of candidates to be kept for full iteration in the Pena-Yohai procedure privar<-privar*n #Makes the robust eigenvalues of the same order as those of classical PCA used for LS nlam<-min(c(p,numlam.S)) pmax<-min(c(p, n/2)) #edf<=n/2 to keep BDP >=0.25 pp<-seq(1,pmax,length=nlam) lamdas<-findlam(privar,pp) #find lambdas corresponding to the edf’s deltas<-0.5*(1-(pp)/n) #for the M-scale used with Penia-Yohai Contents 34 ###Reorder data for CV srt<-sort.int(sample(1:n),index.return=TRUE) #Random permutation tt<-srt$x orden<-srt$ix Xnord<-Xnor[orden,] ynord<-ynor[orden] ### if (alone==1){ ###Set-up cluster for parallel computations cores<-min(detectCores(),ncores) try(cl<-makeCluster(cores)) try(registerDoParallel(cl)) ### } ###Parallel CV exp.se<-c(’CVSE’) klam<-NULL mse<-foreach(klam=1:length(lamdas),.combine=c,.packages=c(’mmlasso’), .export=exp.se)%dopar%{ CVSE(Xnord,ynord,nfold=cualcv.S,lamdas[klam],pp[klam],nkeep,niter.S) } if(any(is.infinite(mse))){ warning(’IWLS for S-Ridge failed when lambda equaled:’) print(lamdas[which(is.infinite(mse))]) } indmin<-which.min(mse) lamin<-lamdas[indmin] delmin<-deltas[indmin] ### if(alone==1){ stopCluster(cl) } fin<-rr_se(Xnor,ynor,lamin,deltaesc=delmin,cc_scale=1,nkeep,niter.S, epsilon=1e-4) Contents 35 beta<-fin$coef betaslo<-beta[2:(length(beta))] bint<-beta[1] res<-ynor-Xnor%*%betaslo-as.vector(bint) edf<-fin$edf deltult<-0.5*(1-(edf+1)/n)#"delta" for final scale deltult<-max(c(deltult, 0.25)) #c0: constant for consistency of final scale c0<-const_marina(deltult) sigma<-Mscale_mar(res,deltult,c0) a_cor<-mean(psi_marina(res/sigma,c0)^2) b_cor<-mean(Mchi(res/sigma,c0,’bisquare’,2)) c_cor<-mean(psi_marina(res/sigma,c0)*(res/sigma)) corr<-1+(edf+1)/(2*n)*(a_cor/(b_cor*c_cor)) fscale<-sigma*corr #bias correction for final scale #Back from PC to ordinary coordinates betaslo<-Beig%*%betaslo #De-normalize beta if required by user if (normin==1 & denormout==1){ betaslo<-betaslo/sigx bint<-muy+bint-mux%*%betaslo } beta<-c(bint,betaslo) re<-list(coef=beta,scale=fscale,edf=edf,lamda=lamin,delta=delmin) return(re) } #Rcpp Exports rho_marina <- function(x, cw) { .Call(’mmlasso_rho_marina’, PACKAGE = ’mmlasso’, x, cw) } rho_bisq <- function(x, cw) { .Call(’mmlasso_rho_bisq’, PACKAGE = ’mmlasso’, x, cw) } psi_marina <- function(x, cw) { .Call(’mmlasso_psi_marina’, PACKAGE = ’mmlasso’, x, cw) } 
weight_marina <- function(x, cw) { .Call(’mmlasso_weight_marina’, PACKAGE = ’mmlasso’, x, cw) Contents 36 } Mscale_mar <- function(x, b, cc) { .Call(’mmlasso_Mscale_mar’, PACKAGE = ’mmlasso’, x, b, cc) } scale_tau <- function(x) { .Call(’mmlasso_scale_tau’, PACKAGE = ’mmlasso’, x) } my_svdecon <- function(x) { .Call(’mmlasso_my_svdecon’, PACKAGE = ’mmlasso’, x) } SPCC <- function(x) { .Call(’mmlasso_SPCC’, PACKAGE = ’mmlasso’, x) } MMLassoCpp_ini <- function(xx, y, beta_ini) { .Call(’mmlasso_MMLassoCpp_ini’, PACKAGE = ’mmlasso’, xx, y, beta_ini) } MMLassoCpp1 <- function(x, y, beta_ini, scale_ini, c1) { .Call(’mmlasso_MMLassoCpp1’, PACKAGE = ’mmlasso’, x, y, beta_ini, scale_ini, c1) } MMLassoCpp2 <- function(xjota, yast, beta_lars, beta_o, alpha) { .Call(’mmlasso_MMLassoCpp2’, PACKAGE = ’mmlasso’, xjota, yast, beta_lars, beta_o, alpha) } rr_se <- function(X, y, lambda2, deltaesc, cc_scale, nkeep, niter, epsilon) { .Call(’mmlasso_rr_se’, PACKAGE = ’mmlasso’, X, y, lambda2, deltaesc, cc_scale, nkeep, niter, epsilon) } #MMLassoPrep CVLasso<-function(X,y,beta.ini,scale.ini,nfold,lam,c1,niter.mm){ #Performs nfold-CV for MM-Lasso #INPUT #X,y: data #beta.ini, scale.ini: initial estimate of regression and scale #nfold: nfold-CV #lam: penalization parameter #c1: tuning constant for the rho function #niter.mm: maximum number of weighted Lasso iterations for MM-Lasso Contents 37 #OUTPUT #mse: resulting MSE (estimated using a tau-scale) ###Segment data n<-nrow(X) p<-ncol(X) indin<-1:n nestim<-n*(1-1/nfold) # Actual number of observations in an estim sample lamcv<-lam inint<-floor(seq(0,n,length.out=nfold+1)) resid<-vector(mode=’numeric’,length=n) ### for (kk in 1:nfold){ ###Get test and estimating samples testk<-(inint[kk]+1):inint[kk+1] estik<-setdiff(indin,testk); Xtest<-X[testk,] Xesti<-X[estik,] ytest<-y[testk] yesti<-y[estik] fit.MMBR<-try(MMLasso(Xesti,yesti,beta.ini,scale.ini,lambda=lamcv,c1=c1, niter.mm)) if (class(fit.MMBR)=="try-error"){ return(Inf) } beta.MMBR<-fit.MMBR$coef beta.MMBR.slo<-beta.MMBR[2:length(beta.MMBR)] beta.MMBR.int<-beta.MMBR[1] fitk<-Xtest%*%beta.MMBR.slo+beta.MMBR.int resid[testk]<-ytest-fitk } mse<-scale_tau(resid) return(mse) } MMLasso<-function(xx,y,beta.ini,scale.ini,lambda,c1,niter.mm){ #Performs iteratively re-weighted Lasso #INPUT Contents 38 #xx,yy: data #beta.ini, scale.ini: initial estimate of regression and scale #lambda: penalization parameter #c1: tuning constant for the rho function #niter.mm: maximum number of iterations #OUTPUT: #coef: final estimate MMcpp_ini<-MMLassoCpp_ini(xx,y,beta.ini) x<-MMcpp_ini$x resid.n<-MMcpp_ini$resid.n tol<-1 m<-0 beta.n<-beta.ini p<-length(beta.ini) while (tol>= 1e-4){ beta.o<-beta.n if(all(beta.o==0)){ return(list(coef=beta.o)) } MMcpp1<-MMLassoCpp1(x,y,beta.o,scale.ini,c1) xort<-MMcpp1$xort xjota<-MMcpp1$xjota yast<-MMcpp1$yast alpha<-MMcpp1$alpha u.Gram<- !(p>=500) if(all(xjota==0)){ return(list(coef=beta.o)) } try(res.Lasso <- robustHD:::fastLasso(xort, yast,lambda, intercept=FALSE, normalize=FALSE, use.Gram=u.Gram),silent=TRUE) if (class(res.Lasso)=="try-error"){ warning("fastLasso failed") return(list(coef=beta.o)) } beta.Lasso <- coef(res.Lasso) MMcpp2<-MMLassoCpp2(xjota,yast,beta.Lasso,beta.o,alpha) beta.n<-MMcpp2$beta.n Contents 39 tol<-MMcpp2$tol m<-m+1 if (m >= niter.mm) {tol<-0} } return(list(coef=beta.n)) } optim.lam<-function(x,y,beta.ini,scale.ini,lambdamax,c1,niter.mm){ #Finds (approximately) smallest lambda such that the slope of the MM-Lasso estimate is zero #INPUT #X, y: data set #beta.ini, scale.ini: 
initial estimates and scale #lambdamax: initial guess for lambda #c1: tuning constant for the rho function #niter.mm: maximum number of iterations #OUTPUT #lambda: smallest lambda such that the slope of the MM-Lasso estimate is zero n<-nrow(x) p<-ncol(x) lambda2<-lambdamax beta.n<-MMLasso(x,y,beta.ini,scale.ini,lambda2,c1,niter.mm)$coef zeros.n<-sum(beta.n[2:(p+1)]==0) while (zeros.n<p & lambda2<max(n,1e4)){ lambda2<-2*lambda2 beta.n<-MMLasso(x,y,beta.ini,scale.ini,lambda2,c1,niter.mm)$coef zeros.n<-sum(beta.n[2:(p+1)]==0) } if(lambda2>=max(n,1e4)) { return(lambda2) } lambda1<-lambdamax/2 beta.o<-MMLasso(x,y,beta.ini,scale.ini,lambda1,c1,niter.mm)$coef zeros.o<-sum(beta.o[2:(p+1)]==0) while(zeros.o==p){ lambda1<-lambda1/2 beta.o<-MMLasso(x,y,beta.ini,scale.ini,lambda1,c1,niter.mm)$coef zeros.o<-sum(beta.o[2:(p+1)]==0) Contents 40 } for (j in 1:5){ lambda<-0.5*(lambda1+lambda2) beta.n<-MMLasso(x,y,beta.ini,scale.ini,lambda,c1,niter.mm)$coef zeros.n<-sum(beta.n[2:(p+1)]==0) if (zeros.n<p){ lambda1<-lambda }else{ lambda2<-lambda } } return(lambda2) } prepara<-function(X,y){ #Centers y and the columns of X to zero median, and normalizes X to unit mad #INPUT #X, y: data set #OUTPUT #Xnor: centered normalized X #ycen: centered y #mux: median vector of X #scalex: mad vector of X #muy: median of y #sigy:scale of y muy<-median(y) ycen<-y-muy sigy<-mad(y) mux<-apply(X,2,’median’) sigx<-apply(X,2,’mad’) Xnor<-scale(X,center=mux,scale=sigx) res<-list(Xnor=Xnor,ynor=ycen,mux=mux,sigx=sigx,muy=muy,sigy=sigy) return(res) } Appendix E Custom R Statistical Functions: Adaptive Lasso and LAD-Lasso 1 The following code was implemented within the simulation-generating function in order to use the adaptive lasso and LAD-lasso. set.seed(seed) { #i <- 1 #cat("i = " , i , "\n") object1=lsa.linear(X,Y); adlasso=object1$coef; X=X Y=Y n=length(Y); grid=seq(log(0.01),log(1400),length.out=100); grid=exp(grid); rqob=rq(Y~0+X); BIC=rep(0,100); weights=1/abs(rqob$coef); for ( i in 1:100){ #cat("i = " , i , "\n") rqfit=rq.fit.lasso(X,Y,lambda=grid[i]*weights); betalad_tmp=rqfit$coef; betalad_tmp=betalad_tmp*(betalad_tmp>1e-8); 1 generously provided by Zheng et al. (2016) 41 Contents 42 mse=mean(abs(rqfit$resi)); mdsize=length(which(betalad_tmp!=0)); BIC[i]=log(mse)+mdsize*log(n)/n; } step=which.min(BIC); betalad=rq.fit.lasso(X,Y,lambda=grid[step]*weights)$coef; ladlasso=betalad*(betalad>1e-8); write(ladlasso, "ladlasso.txt", ncol=length(ladlasso)); write(adlasso, "adlasso.txt", ncol=length(adlasso)) #write(Y,"y.txt",ncol=1); #write(t(X), "x.txt", ncol=dim(X)[2]) } Appendix F R Simulation Code The following code was used to generate the data for each data condition as well as the coecient estimates for each method. A separate command was used for each data condition, setting the arguments as appropriate for each condition, and then writing a separate .csv file for each database and each collection of coecient estimates. 
Appendix F
R Simulation Code

The following code was used to generate the data for each data condition, as well as the coefficient estimates for each method. A separate command was used for each data condition, setting the arguments as appropriate for that condition, and then writing a separate .csv file for each data set and each collection of coefficient estimates.

#Required before use: glmnet (cv.glmnet), mvtnorm (rmvnorm), quantreg (rq, rq.fit.lasso),
#and the custom functions called below (mmlasso, H.lasso, OS.lasso, OSH.lasso, lsa.linear),
#defined or loaded elsewhere in these appendices
data.sim.all <- function(NSim = 500 , p , n , eta.x , eta.y , beta.str , Nmeth = 1) {
  set.seed(1)
  ###CV parameters
  lambda.lasso.try = seq(0.01,0.6,length.out=100)
  lambda.gamma.try = seq(1,4,length.out=50)
  data.mtx <- matrix(0 , nrow = NSim * n , ncol = p + 1)
  beta.mtx <- matrix(0 , nrow = NSim * p , ncol = Nmeth + 2)
  betas <- matrix(0 , nrow = p , ncol = 1)
  if(beta.str == 1) {
    betas[1,1] <- 0.5
    betas[2,1] <- 1.0
    betas[3,1] <- 1.5
    betas[4,1] <- 2.0
  }
  if(beta.str == 2) {
    betas[1,1] <- 3.0
    betas[2,1] <- 1.5
    betas[3,1] <- 0
    betas[4,1] <- 0
    betas[5,1] <- 2.0
  }
  if(beta.str == 3) {
    betas[1,1] <- 2.5
    betas[2,1] <- 2.5
    betas[3,1] <- 2.5
    betas[4,1] <- 2.5
    betas[5,1] <- 2.5
    betas[6,1] <- 0
    betas[7,1] <- 1.5
    betas[8,1] <- 1.5
  }
  for(j in 1:NSim){
    seed <- j + 1000
    set.seed(seed)
    cat("precovar, j =" , j , "\n")
    ###Generate predictors: clean rows plus (optionally) contaminated rows centered at 10
    covar.X <- matrix(rep(0,p^2),ncol = p)
    diag(covar.X) <- 1
    X.UC <- rmvnorm((1 - eta.x)*n , mean = rep(0 , p) , sigma = covar.X)
    seed <- j + 1000
    set.seed(seed)
    if(eta.x > 0) {
      X.C <- rmvnorm(eta.x*n , mean = rep(10 , p) , sigma = covar.X)
      X <- rbind(X.UC , X.C)
    } else {
      X.C <- 0
      X <- X.UC
    }
    data.mtx[( (j - (j - 1)) + ((n * j) - n) ):( (j - (j - 1)) + ((n * j) - 1) ) , 1:p] <- X
    cat("data.mtx = X, j =" , j , "\n")
    ###Generate errors: clean N(0,1) plus (optionally) contaminated errors with mean 2 and sd 5
    seed <- j + 1000
    set.seed(seed)
    err.UC <- rnorm((1-eta.y)*n , mean = 0 , sd = 1)
    seed <- j + 1000
    set.seed(seed)
    if(eta.y > 0) {
      err.C <- rnorm(eta.y*n , mean = 2 , sd = 5)
      err <- c(err.UC , err.C)
    } else {
      err.C <- 0
      err <- err.UC
    }
    Y <- X %*% betas[ , 1] + err
    cat("Y, j =" , j , "\n")
    ###Standard lasso, tuned by cross-validation
    seed <- j + 1000
    set.seed(seed)
    lasso.model <- cv.glmnet(X, Y, family="gaussian", lambda=lambda.lasso.try)
    lambda.lasso.opt <- lasso.model$lambda.min
    cat("lasso.model.ests, j =" , j , "\n")
    ###MM lasso and adaptive MM lasso
    seed <- j + 1000
    set.seed(seed)
    mm.ad.lasso.model <- mmlasso(x = X , y = Y)
    cat("mm.ad.lasso.model, j =" , j , "\n")
    ###Huber lasso
    seed <- j + 1000
    set.seed(seed)
    Huber.model <- H.lasso(X = X , Y = Y , lambda.lasso.try = lambda.lasso.try)
    cat("Huber.model, j =" , j , "\n")
    ###Adaptive lasso and LAD lasso (code provided by Zheng et al., 2016; see Appendix E)
    seed <- j + 1000
    set.seed(seed)
    {
      object1=lsa.linear(X,Y);
      adlasso=object1$coef;
      X=X
      Y=Y
      n=length(Y);
      grid=seq(log(0.01),log(1400),length.out=100);
      grid=exp(grid);
      rqob=rq(Y~0+X);
      BIC=rep(0,100);
      weights=1/abs(rqob$coef);
      for (i in 1:100){
        rqfit=rq.fit.lasso(X,Y,lambda=grid[i]*weights);
        betalad_tmp=rqfit$coef;
        betalad_tmp=betalad_tmp*(betalad_tmp>1e-8);
        mse=mean(abs(rqfit$resi));
        mdsize=length(which(betalad_tmp!=0));
        BIC[i]=log(mse)+mdsize*log(n)/n;
      }
      step=which.min(BIC);
      betalad=rq.fit.lasso(X,Y,lambda=grid[step]*weights)$coef;
      ladlasso=betalad*(betalad>1e-8);
      write(ladlasso, "ladlasso.txt", ncol=length(ladlasso));
      write(adlasso, "adlasso.txt", ncol=length(adlasso))
      #write(Y,"y.txt",ncol=1);
      #write(t(X), "x.txt", ncol=dim(X)[2])
    }
    ###OS Huber lasso
    seed <- j + 1000
    set.seed(seed)
    OSH.lasso.model <- OSH.lasso(X = X , Y = Y , lambda.lasso.try = lambda.lasso.try)
    cat("OSH.lasso.model, j =" , j , "\n")
    #OSH.lasso.cv.model <- OSH.lasso.cv(X = X , Y = Y , lambda.lasso.try = lambda.lasso.try ,
    #                                   lambda.gamma.try = lambda.gamma.try)
    #cat("OSH.lasso.cv.model, j =" , j , "\n")
    ###OS lasso
    seed <- j + 1000
    set.seed(seed)
    OS.lasso.model <- OS.lasso(X = X , Y = Y , lambda.lasso.try = lambda.lasso.try ,
                               lambda.gamma.try = lambda.gamma.try)
    cat("OS.lasso.model, j =" , j , "\n")
    data.mtx[( (j - (j - 1)) + ((n * j) - n) ):( (j - (j - 1)) + ((n * j) - 1) ) , (p+1) ] <- Y
    cat("data.mtx = Y, j =" , j , "\n")
    ###Collect the true and estimated slopes for replicate j
    seed <- j + 1000
    set.seed(seed)
    beta.mtx[( (j - (j - 1)) + ((p * j) - p) ):( (j - (j - 1)) + ((p * j) - 1) ) , 1:(Nmeth + 2) ] <-
      cbind(betas[1:p , 1] ,
            predict(lasso.model , type = "coefficients" , s = lambda.lasso.opt)[2:(p+1)] ,
            ladlasso ,
            adlasso ,
            Huber.model$coefficient[2:(p+1)] ,
            OS.lasso.model$coefficient[2:(p+1)] ,
            OSH.lasso.model$coefficient[2:(p+1)] ,
            mm.ad.lasso.model$coef.MMLasso[2:(p+1)] ,
            mm.ad.lasso.model$coef.MMLasso.ad[2:(p+1)] )
    cat("p = " , p , ", n = " , n , ", eta.x = " , eta.x , " , \n" ,
        "eta.y = ", eta.y , ", beta.str = " , beta.str , ", j = " , j, "\n")
  }
  data.mtx4 <<- data.frame(data.mtx)
  beta.mtx4 <<- data.frame(beta.mtx)
  colnames(beta.mtx4) = c("True Estimates" , "Lasso" , "LAD Lasso" , "Adaptive LAD Lasso" ,
                          "Huber Lasso" , "OS Lasso" , "OS Huber Lasso" #, "CV OS Huber Lasso"
                          , "MM Lasso" , "Adaptive MM Lasso")
  beta.mtx4 <<- beta.mtx4
}
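For reference, each data condition corresponds to one call to data.sim.all followed by writing out the two global matrices it creates. The call below is an illustration rather than a transcript of the exact commands run: the argument values and the file name are examples, and Nmeth = 7 simply matches the nine columns (true values, the standard lasso, and seven robust adaptations) bound together inside the function.

#Illustrative call for one condition: p = 8, n = 50, 10% contaminated predictors,
#no contaminated errors, first coefficient structure
data.sim.all(NSim = 500, p = 8, n = 50, eta.x = 0.1, eta.y = 0, beta.str = 1, Nmeth = 7)
#data.sim.all assigns data.mtx4 and beta.mtx4 in the global environment;
#each would then be written out, e.g. (file name follows the scheme used in Appendix G):
write.csv(beta.mtx4, "beta8_50_01_0_1.csv", row.names = FALSE)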
cat("data.mtx = Y, j =" , j , "\n") Contents 47 #cat("i = " , i , "\n") seed <- j + 1000 set.seed(seed) beta.mtx[( (j - (j - 1)) + ((p * j) - p) ):( (j - (j - 1)) + ((p * j) - 1) ) , 1:(Nmeth + 2) ] <- cbind(betas[1:p , 1] , predict(lasso.model , type = "coefficients" , s = lambda.lasso.opt)[2:(p+1)] , ladlasso , adlasso , Huber.model$coefficient[2:(p+1)] , OS.lasso.model$coefficient[2:(p+1)] , OSH.lasso.model$coefficient[2:(p+1)] , mm.ad.lasso.model$coef.MMLasso[2:(p+1)] , mm.ad.lasso.model$coef.MMLasso.ad[2:(p+1)] ) cat("p = " , p , ", n = " , n , ", eta.x = " , eta.x , " , \n" , "eta.y = ", eta.y , ", beta.str = " , beta.str , ", j = " , j, "\n") } data.mtx4 <<- data.frame(data.mtx) beta.mtx4 <<- data.frame(beta.mtx) colnames(beta.mtx4) = c("True Estimates" , "Lasso" , "LAD Lasso" , "Adaptive LAD Lasso" , "Huber Lasso" , "OS Lasso" , "OS Huber Lasso" #, "CV OS Huber Lasso" , "MM Lasso" , "Adaptive MM Lasso") beta.mtx4 <<- beta.mtx4 } Appendix G R Code: Generating Metrics The following code pulled together all coecient estimate .csv files into one list, cal- culated the average FPRs and FNRs for each data condition, and created variables indicating the levels of each data characteristic of interest. filenames1 <- sprintf("%s_%s.csv" , "beta8_25" , c("0_0_1" , "01_0_1" , "02_0_1" , "0_01_1" , "01_01_1" , "02_01_1" , "0_02_1" , "01_02_1" , "02_02_1")) filenames2 <- sprintf("%s_%s.csv" , "beta8_50" , c("0_0_1" , "01_0_1" , "02_0_1" , "0_01_1" , "01_01_1" , "02_01_1" , "0_02_1" , "01_02_1" , "02_02_1")) filenames3 <- sprintf("%s_%s.csv" , "beta8_100" , c("0_0_1" , "01_0_1" , "02_0_1" , "0_01_1" , "01_01_1" , "02_01_1" , "0_02_1" , "01_02_1" , "02_02_1")) filenames4 <- sprintf("%s_%s.csv" , "beta8_200" , c("0_0_1" , "01_0_1" , "02_0_1" , "0_01_1" , "01_01_1" , "02_01_1" , "0_02_1" , "01_02_1" , "02_02_1")) filenames5 <- sprintf("%s_%s.csv" , "beta30_25" , c("0_0_1" , "01_0_1" , "02_0_1" , "0_01_1" , "01_01_1" , "02_01_1" , "0_02_1" , "01_02_1" , "02_02_1")) filenames6 <- sprintf("%s_%s.csv" , "beta30_50" , c("0_0_1" , "01_0_1" , "02_0_1" , "0_01_1" , "01_01_1" , "02_01_1" , "0_02_1" , "01_02_1" , "02_02_1")) filenames7 <- sprintf("%s_%s.csv" , "beta30_100" , c("0_0_1" , "01_0_1" , "02_0_1" , "0_01_1" , "01_01_1" , "02_01_1" , "0_02_1" , "01_02_1" , "02_02_1")) filenames8 <- sprintf("%s_%s.csv" , "beta30_200" , c("0_0_1" , "01_0_1" , "02_0_1" , "0_01_1" , "01_01_1" , "02_01_1" , "0_02_1" , "01_02_1" , "02_02_1")) filenames9 <- sprintf("%s_%s.csv" , "beta50_25" , c("0_0_1" , "01_0_1" , "02_0_1" , "0_01_1" , "01_01_1" , "02_01_1" , "0_02_1" , "01_02_1" , "02_02_1")) filenames10 <- sprintf("%s_%s.csv" , "beta50_50" , c("0_0_1" , "01_0_1" , "02_0_1" , "0_01_1" , "01_01_1" , "02_01_1" , "0_02_1" , "01_02_1" , "02_02_1")) filenames11 <- sprintf("%s_%s.csv" , "beta50_100" , c("0_0_1" , "01_0_1" , "02_0_1" , "0_01_1" , "01_01_1" , "02_01_1" , "0_02_1" , "01_02_1" , "02_02_1")) 48 Contents 49 filenames12 <- sprintf("%s_%s.csv" , "beta50_200" , c("0_0_1" , "01_0_1" , "02_0_1" , "0_01_1" , "01_01_1" , "02_01_1" , "0_02_1" , "01_02_1" , "02_02_1")) filenames13 <- sprintf("%s_%s.csv" , "beta100_25" , c("0_0_1" , "01_0_1" , "02_0_1" , "0_01_1" , "01_01_1" , "02_01_1" , "0_02_1" , "01_02_1" , "02_02_1")) filenames14 <- sprintf("%s_%s.csv" , "beta100_50" , c("0_0_1" , "01_0_1" , "02_0_1" , "0_01_1" , "01_01_1" , "02_01_1" , "0_02_1" , "01_02_1" , "02_02_1")) filenames15 <- sprintf("%s_%s.csv" , "beta100_100" , c("0_0_1" , "01_0_1" , "02_0_1", "0_01_1" , "01_01_1" , "02_01_1" , "0_02_1" , "01_02_1" , 
"02_02_1")) filenames16 <- sprintf("%s_%s.csv" , "beta100_200" , c("0_0_1" , "01_0_1" , "02_0_1", "0_01_1" , "01_01_1" , "02_01_1" , "0_02_1" , "01_02_1" , "02_02_1")) filenames <- c(filenames1 , filenames2 , filenames3 , filenames4 , filenames5 , filenames6 , filenames7 , filenames8 , filenames9 , filenames10 , filenames11 , filenames12 , filenames13 , filenames14 , filenames15 , filenames16) metrics <- matrix(ncol = 20 , nrow = 144) colnames(metrics) <- c("Lasso.FNR" , "Lasso.FPR" , "LADLasso.FNR" , "LADLasso.FPR" , "AdLasso.FNR" , "AdLasso.FPR" , "HuberLasso.FNR" , "HuberLasso.FPR" , "OSLasso.FNR" , "OSLasso.FPR" , "OSHuberLasso.FNR" , "OSHuberLasso.FPR" , "MMLasso.FNR" , "MMLasso.FPR" , "AdMMLasso.FNR", "AdMMLasso.FPR" , "p" , "n" , "eta.x" , "eta.y") rownames(metrics) <- names(L) colnames(metrics)[17] #p metrics[1:36 , 17] <- 8 metrics[37:72 , 17] <- 30 metrics[73:108 , 17] <- 50 metrics[109:144 , 17] <- 100 #n metrics[1:9 , 18] <- 25 metrics[10:18 , 18] <- 50 metrics[19:27 , 18] <- 100 metrics[28:36 , 18] <- 200 metrics[37:45 , 18] <- 25 Contents 50 metrics[46:54 , 18] <- 50 metrics[55:63 , 18] <- 100 metrics[64:72 , 18] <- 200 metrics[73:81 , 18] <- 25 metrics[82:90 , 18] <- 50 metrics[91:99 , 18] <- 100 metrics[100:108 , 18] <- 200 metrics[109:117 , 18] <- 25 metrics[118:126 , 18] <- 50 metrics[127:135 , 18] <- 100 metrics[136:144 , 18] <- 200 #eta.x metrics[c(1,4,7,10,13,16,19,22,25,28,31,34,37,40,43,46,49,52,55,58 ,61,64,67,70,73,76,79,82,85,88,91,94,97,100,103,106,109,112 ,115,118,121,124,127,130,133,136,139,142) , 19] <- 0 metrics[c(etax1) , 19] <- 0.1 metrics[c(etax2) , 19] <- 0.2 #eta.y metrics[c(etay0) , 20] <- 0 etay0 <- c(1:3 , 10:12 , 19:21 , 28:30 , 37:39 , 46:48 , 55:57 , 64:66 , 73:75 , 82:84 , 91:93 , 100:102 , 109:111 , 118:120 , 127:129 , 136:138) etay1 <- etay0 + 3 etay2 <- etay0 + 6 metrics[c(etay1) , 20] <- 0.1 metrics[c(etay2) , 20] <- 0.2 Bibliography Alfons, A., Croux, C., and Gelper, S. (2013). Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics, 7(1):226–248. Alfons, A., Croux, C., and Gelper, S. (2016). Robust groupwise least angle regression. Computational Statistics & Data Analysis, 93(C):421–435. Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technomet- rics, 37:373–384. Chatterjee, S. and Hadi, A. (1988). Sensitivity Analysis in Linear Regression Analysis. New York, New York: Wiley. Derksen,S.andKeselman,H.(1992). Backward,forwardandstepwiseautomatedsubset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45:265–282. Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. New York, New York: Chapman and Hall. Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihod and its oracle properties. Journal of the American Statistical Association, 96:1348–1360. Gijbels, I. and Vrinssen, I. (2015). Robust nonnegative garrote variable selection in linear regression. Computational Statistics and Data Analysis, 85:1–22. Hastie,T.,Tibshirani,R.,andWainwright,M.(2015). StatisticalLearningwithSparsity. Taylor & Francis. Huber, P. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics, 35(1):73–101. Huber, P. (1981). Robust Statistics. New York, New York: Wiley. Huberty, C. (1989). Problems with stepwise methods - better alternatives. Advanced Social Science Methodology, 1:43–70. 
Jung, Y., Lee, S., and Hu, J. (2016). Robust regression for highly corrupted response by shifting outliers. Statistical Modelling, 16(1):1–23.

Kuo, L. and Mallick, B. (1998). Variable selection for regression methods. Sankhya, 60:65–81.

Lambert-Lacroix, S. and Zwald, L. (2011). Robust regression through the Huber's criterion and adaptive lasso penalty. Electronic Journal of Statistics, 5:1015–1053.

Maronna, R. (2011). Robust ridge regression for high-dimensional data. Technometrics, 53(1):44–53.

Meinshausen, N. and Bühlmann, P. (2004). Technical report. University of Wisconsin-Madison, Dept. of Statistics.

Miller, A. (1990). Subset Selection in Regression. London, UK: Chapman and Hall.

Rosset, S. and Zhu, J. (2007). Piecewise linear regularized solution paths. The Annals of Statistics, 35(3):1012–1030.

Smucler, E. and Yohai, V. (2017). Robust and sparse estimators for linear regression models. Computational Statistics and Data Analysis, 111:116–130.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288.

Turkmen, A. and Ozturk, O. (2016). Generalised rank regression estimator with standard error adjusted lasso. Australian and New Zealand Journal of Statistics, 58(1):121–135.

Wang, H., Li, G., and Jiang, G. (2007). Robust regression shrinkage and consistent variable selection through the LAD-lasso. Journal of Business & Economic Statistics, 25(3):347–355.

Yohai, V. (1987). High breakdown-point and high efficiency robust estimates for regression. Annals of Statistics, 15(2):642–656.

Zhang, H. and Zamar, R. (2014). Least angle regression for model selection. WIREs Computational Statistics, 6:116–123. doi:10.1002/wics.1288.

Zheng, Q., Gallagher, C., and Kulasekera, K. (2016). Robust adaptive lasso for variable selection. Communications in Statistics - Theory and Methods, 46(9):4642–4659.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429.