University of Southern California Dissertation

Difference-of-convex learning: optimization with non-convex sparsity functions

by Miju Ahn

Submitted to the University of Southern California Graduate School in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Industrial and Systems Engineering

Dissertation and proposal committee: Dr. Jong-Shi Pang (advisor and chair), Dr. Jinchi Lv, Dr. Meisam Razaviyayn
Proposal committee: Dr. Suvrajeet Sen, Dr. Jack Xin

August 2018

Acknowledgements

Throughout the duration of my study at the University of Southern California, Dr. Jong-Shi Pang has provided me with consistent support and professional guidance. I truly enjoyed our meetings, which were filled with passion for research, persistence, and the joy of discovery. I would like to sincerely thank him for being an inspirational advisor.

I would like to thank the following committee members, who shared their time and provided valuable advice on the dissertation work: Dr. Jinchi Lv, Dr. Meisam Razaviyayn, and Dr. Suvrajeet Sen. I would also like to thank my collaborators Dr. Hongbo Dong and Dr. Jack Xin, who also served on the proposal committee.

There are friends who walked through the journey of becoming a PhD ahead of me. Based on their own experiences, Semih Atakan and Youjin Kim guided me throughout my PhD study with helpful advice. I deeply appreciate their kindness and encouragement.

My family members have always provided me with unconditional support. I would like to show the greatest gratitude to my sister Jinyoung Kim and my brother-in-law Heedyo Kim, and to my father Taekeun Ahn and my mother Jaeyeon Sul.

Contents

1 Introduction
  1.1 Overview of Sparsity in Learning Problems
  1.2 Preview of the Dissertation
  1.3 Contributions of the Dissertation
  1.4 Background in Nonlinear Mathematical Optimization
    1.4.1 Review of the optimization theory
    1.4.2 Concepts of stationarity
2 Difference-of-convex Learning: Directional Stationarity, Sparsity and Optimality
  2.1 The Unified DC Program of Statistical Learning
    2.1.1 A model-control constrained formulation
    2.1.2 Stationary solutions
  2.2 Minimizing and Sparsity Properties
    2.2.1 Preliminary discussion of the main result
    2.2.2 The assumptions
    2.2.3 The case = 0
    2.2.4 The case of $g$ being a weighted $\ell_1$ function
  2.3 Sparsity Functions and Their DC Representations
  2.4 Exact Sparsity Functions
    2.4.1 K-sparsity function
    2.4.2 The $\ell_1 - \ell_2$ function
  2.5 Surrogate Sparsity Functions
    2.5.1 The SCAD family
    2.5.2 The MCP family
    2.5.3 The capped $\ell_1$ family
    2.5.4 The transformed $\ell_1$ family
    2.5.5 The logarithmic family
    2.5.6 Some comments on results
3 Structural Properties of Affine Sparsity Constraints
  3.1 ASC Systems: Introduction and Preliminary Discussion
    3.1.1 Source problems
  3.2 Closedness and Closure
    3.2.1 Necessary and sufficient conditions for closedness
    3.2.2 Closure of SOL-ASC
    3.2.3 Extension to SOL-xASC
  3.3 Continuous Approximation and Convergence
    3.3.1 Approximating functions
    3.3.2 Convergence
  3.4 Tangent Properties of SOL-ASC and its Approximation
    3.4.1 Tangent cone of approximated sets: fixed approximation parameter
  3.5 Convergence of B-Stationary Solutions
    3.5.1 The case $A \geq 0$
    3.5.2 The case of column-wise uni-sign
4 D-stationary Solutions of Sparse Sample Average Approximations
  4.1 Sparsity Regularized Sample Average Approximations
    4.1.1 The Lagrangian formulation
    4.1.2 Theoretical vs computable solutions
    4.1.3 A summary for the LASSO analysis
  4.2 Properties of the D-stationary Solution
    4.2.1 The neighborhood of strong convexity
    4.2.2 Consistency bound for the D-stationary solutions
    4.2.3 Bounds on the prediction error
    4.2.4 Support recovery
Bibliography

Chapter 1
Introduction

Statistical learning refers to the study of available data points (or samples) with the goal of constructing models that can be exploited to make future predictions. In the supervised learning setting, each data point contains information about features and an outcome, which is either a continuous value (e.g., the price of a house, the time to develop a distant metastasis) for regression problems or a categorical label (e.g., presence of a loan default history, malignant or benign cell) for classification problems. A combination of a model and a loss function is selected according to the type of the learning problem, and an optimization problem is then formulated based on the available data points.
The goal of the optimization problem is to find a set of model parameters that minimizes the error on the training data. Once the model is obtained, it predicts the outcome values for new data points that arrive with only feature information. The dissertation is motivated by the setting where some feature information is redundant, so that training a model involves the task of embedding sparsity in the model parameters. Such a problem is referred to as sparse learning, where optimization problems are solved to identify important features and set the coefficients of the insignificant variables to zero in order to construct a robust, efficient, and interpretable model.

1.1 Overview of Sparsity in Learning Problems

Sparse representation [47] is a fundamental methodology of data science for solving a broad range of problems, from statistical and machine learning in artificial intelligence, to physical sciences and engineering (e.g., imaging and sensing technologies), and to medical decision making (e.g., classification of healthy versus unhealthy patients, benign versus cancerous tumors). The task of constructing a sparse structure in the model parameters involves the $\ell_0$-function, where for a scalar $t$,
\[
|t|_0 \,\triangleq\, \begin{cases} 1 & \text{if } t \neq 0, \\ 0 & \text{if } t = 0. \end{cases}
\]
In the last decade, significant advances have been made on constructing intrinsically low-dimensional solutions of high-dimensional problems via convex programming. In linear regression problems, the least absolute shrinkage and selection operator (LASSO) [78] was introduced to perform variable selection based on the minimization of the $\ell_1$-norm $\|x\|_1 \triangleq \sum_{i=1}^n |x_i|$ of the model variable $x \in \mathbb{R}^n$, which is employed as a surrogate for $\|x\|_0 \triangleq \sum_{i=1}^n |x_i|_0$, the number of nonzero components, either subject to a certain prescribed residual constraint, leading to a constrained optimization problem, or employing such a residual as an additional criterion to be minimized, resulting in an unconstrained minimization problem. The theoretical analysis of LASSO and of its ability to recover the true support of the ground-truth vector was pioneered by Candès and Tao [16, 17]. This operator has many good statistical properties, such as model selection consistency [89], estimation consistency [9, 52], and the persistence property for prediction [45]. Many modified versions and algorithms have been proposed to improve the computational efficiency of LASSO and to achieve different properties of the solution, including the adaptive LASSO [92], Bregman iterative algorithms [83], the Dantzig selector [21, 18], the iteratively reweighted LASSO [20], the elastic net [93], and LARS (Least Angle Regression) [31].

Until now, owing to its favorable theoretical underpinnings and many efficient solution methods, convex optimization has been a principal venue for solving many statistical learning problems. Yet there is increasing evidence supporting the use of non-convex formulations to enhance the realism of the models and improve their generalization. For instance, in compressed sensing and imaging science [29], recent findings reported in [57, 82, 86] show that the difference of the $\ell_1$ and $\ell_2$ norms ($\ell_{1-2}$ for short) and a (non-convex) "transformed $\ell_1$" surrogate outperform $\ell_1$ and other known convex penalties in sparse signal recovery when the sensing matrix is highly coherent; such a regime occurs in super-resolution imaging, where one attempts to recover fine-scale structure from coarse-scale data; see [28, 15] and the references therein.
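To make the comparison among these penalties concrete, the short Python sketch below (added for illustration; it is not part of the dissertation) evaluates the $\ell_0$ count, the $\ell_1$ norm, the $\ell_{1-2}$ difference, and a transformed $\ell_1$ penalty on a small vector. The transformed $\ell_1$ is written here in one common parameterization, $\rho_a(t) = (a+1)|t|/(a+|t|)$ with a shape parameter $a > 0$; the exact scaling used in the cited references may differ, so treat the formula as an assumption.

```python
import numpy as np

def l0(x):
    # exact sparsity: number of nonzero entries
    return np.count_nonzero(x)

def l1(x):
    return np.sum(np.abs(x))

def l1_minus_l2(x):
    # the l1 - l2 penalty discussed above
    return np.sum(np.abs(x)) - np.linalg.norm(x)

def transformed_l1(x, a):
    # assumed parameterization: rho_a(t) = (a + 1)|t| / (a + |t|)
    t = np.abs(x)
    return np.sum((a + 1.0) * t / (a + t))

x = np.array([0.0, 0.0, 2.5, -0.01, 1.0])
print("l0      :", l0(x))
print("l1      :", l1(x))
print("l1 - l2 :", round(l1_minus_l2(x), 4))
for a in (10.0, 1.0, 0.1, 0.01):
    print(f"transformed l1 (a={a:>5}):", round(transformed_l1(x, a), 4))
```

In the limit $a \downarrow 0$ each nonzero entry contributes a value approaching one, so the transformed $\ell_1$ penalty tends to the $\ell_0$ count, while for large $a$ it behaves like a rescaled $\ell_1$ norm. Note also that the $\ell_1$ charge on the large coefficient grows linearly with its magnitude, whereas the non-convex surrogates saturate; this unbounded charge on large coefficients is the source of the estimation bias of LASSO recalled in the next paragraph.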
More broadly, important applications in other areas such as computational statistics, machine learning, bioinformatics, and portfolio selection [48] offer promising sources of problems for the employment of non-convex functionals to express model loss, promote sparsity, and enhance robustness.

The flurry of activity pertaining to the use of non-convex functionals for statistical learning problems dates back more than fifteen years in statistics, when Fan and Li [33] pointed out that LASSO is a biased estimator and postulated several desirable statistics-based properties for a good univariate surrogate of the $\ell_0$-function to have; the authors then introduced a univariate, butterfly-shaped, folded-concave [36], piecewise quadratic function called SCAD, for smoothly clipped absolute deviation, which is symmetric with respect to the origin and concave on $\mathbb{R}_+$; see also [35]. Besides SCAD and $\ell_{1-2}$, there exist today many penalized regression methods using non-convex penalties, including the $\ell_q$ (for $q \in (0,1)$) penalty [38, 42], the related $\ell_p - \ell_q$ penalty for $p \geq 1$ and $q \in [0,1)$ [22], the combination of $\ell_0$ and $\ell_1$ penalties [55], the smooth integration of counting and absolute deviation (SICA) [58], the minimax concave penalty (MCP) [85], the capped-$\ell_1$ penalty [87], the logarithmic penalty [60], the truncated $\ell_1$ penalty [77], the hard thresholding penalty [90], and the transformed $\ell_1$ [66, 58, 86] mentioned above.

1.2 Preview of the Dissertation

Incorporating the non-convex and non-differentiable $\ell_0$-surrogates in the formulation makes the learning problem challenging to analyze and solve. A major goal of this dissertation is to address these challenges and to provide analysis demonstrating the properties and benefits of such non-convex sparsity functions. We explore sparse learning problems from different perspectives, investigate formulations of varied emphases and the properties of the resulting solutions, and organize our findings in separate chapters. As a building block for a systematic analysis of most of the existing $\ell_0$-surrogates, we adopt the difference-of-convex (dc) representation of the mentioned sparsity functions. The use of such a representation is supported by recent work showing that all of the above non-convex sparsity functions belong to the class of dc functions [1, 41, 82, 86]. Formulations involving dc functionals may also benefit from computational advances developed in the dc literature. The difference-of-convex algorithm (DCA) introduced by Pham and Le Thi [71, 72] has received much attention for solving optimization problems involving sparsity surrogate functionals. It is an iterative scheme for dc minimization problems that solves, at each step, a convex subproblem obtained by linearizing the concave component of the objective function. While there exist analyses for specific surrogate functions using dc approaches [41, 91, 44, 54], this dissertation is an effort to understand various types of optimization problems through a unified dc formulation that is applicable to most of the existing sparsity functions, to analyze their stationary solutions that are practically computable, and to extend the existing results in the literature.

We consider two approaches for incorporating the non-convex sparsity functions in the formulations. The first approach is to introduce the sparsity function in the objective, in addition to the composite of the model and loss functions, resulting in a bi-criteria Lagrangian formulation.
The two criteria are the tness of the model measured by the loss and model combination and the model complexity determined by the sparsity function. The second approach is to formulate a constrained problem which either minimizes the sparsity function subject to a prescribed error on the loss, or minimizes the error made by the model and the loss functions on the available samples subject to some restrictions involving the sparsity functions. Due to our choice of the ` 0 -surrogates, existing studies based on the global optima in convex settings are unrealistic to both kinds of the non- convex problems. The overarching contention is that for the analysis of the resulting optimization problems, it is essential that we focus on solutions that are computable in practice, rather than on solutions that are not provably computable (such as global minimizers that cannot be guaranteed to be the outcome of an algorithm). Thus, our contribution is to provide the theoretical underpin- ning of what may be termed computable statistical learning, relying on the recent work [68] where numerical algorithms supported by convergence theories have been introduced for computing di- rectional stationary solutions to general nonsmooth dc programs; see also [69]. The d(-irectional) stationary solutions are of the `sharpest' realistic stationarity concept for non- smooth and non-convex problems. [See section 1.4.2 for the formal statement of the denition]. In [68], the authors showed that there is no hope for a point to be a local optimum without be- ing a d-stationary solution indicating that the d-stationarity is a necessary condition for local or global optimality. This raises the question of whether it is possible for the attained stationary solution of a non-convex program to achieve local or even global optimality properties. We address the question for the sparse representation problems by investigating the d-stationary solutions of the bi-criteria Lagrangian formulation with non-convex` 0 -surrogates. The Lagrangian formulation 4 typically involves a regularization parameter which controls the balance between the weights on tting the model and producing a sparse vector of the model coecients. In particular, the value of the regularization parameter aects the number of nonzero components in the solution, e.g., as the value of the parameter increases, the number of nonzero components in the solution can be expected to decrease, establishing embedded connections between the regularization parameter and the computed solutions. For practical concerns, it would be interesting to investigate how the value of the weight parameter aects the properties of the formulation and the resulting solutions. With the discussed motivations, we study the bi-criteria Lagrangian formulation and the resulting computationally achievable d-stationary solutions by exploring the fundamental structures of the problem with non-convex sparsity functions. The concept of d-stationarity is considered for the entire dissertational work as our primary interests lie in developing realistic analysis that can be readily applied for numerical studies carried out in practice. This will help narrowing the gap between the existing minimizer-based analysis and practical computations by relaxing the require- ment of optimality to a more realistic stationarity property of the computational models. In this vein, we investigate fundamental properties of the dc formulations in relation to the values of the weight controlling parameter. 
If such relationship is discovered, the resulting analysis can provide a guidance for tuning the parameter to achieve desired characteristics in the computed solutions. We are further interested in investigating the minimizing properties of the d-stationary solutions, and analyzing their sparsity structures hence providing meaningful insights about the solutions prior to actual computations. The original univariate ` 0 -function is discrete so that solving a formulation involving the function to nd a model with the `best t' requires performing the best subset selection among the model variables. The process is computationally intractable as the number of the subsets increases ex- ponentially as the number of model parameters increases. While the sparsity functions serve as continuous surrogates of the ` 0 -function, some logical conditions among the variables can be for- mulated exploiting the discrete attribute of the function. For example, the cardinality constraint which enforces the number of nonzero components in the solution to not exceed a prescribed bound can be formulated with the ` 0 -function [8, 14, 37]. An example from linear regression with in- 5 teraction terms is the hierarchical variable selection [10] where entrance of an interaction term to the model depends on the corresponding individual terms. The strong hierarchical condition means that an interaction term can be selected only if both of the individual terms are selected and the weak hierarchical condition means that an interaction term can be selected only if one of the individual terms is selected. The group variable selection is another example where each of the variables belongs to some predened groups [49, 48]. One version of group variable selection is to require all members within a group to become zero (or nonzero) simultaneously. Another version of the group selection is to choose a variable only if one of the groups containing the variable is selected. All of the above logical conditions among the model parameters can be faithfully formulated as a system of hard constraints using the discrete property of the ` 0 -function. We dene such sys- tem where each of the constraints is a linear inequality involving the ` 0 -function as a system of ane sparsity constraints (ASC). While the ASC serves its purpose by properly formulating many kinds of logical conditions into a unied framework, a recent work has reformulated optimization problems involving the ASC as mathematical programs with complementarity constraints. For practical computational considerations, the` 0 -function in the constraint set of the formulation can be replaced by non-convex sparsity functions, and the resulting set dened by the continuous ` 0 - surrogates approximates the original ASC. This gives rise to the question of whether it is possible for the solution of an approximated problem to be the solution of the original problem, and how we can relate the two such solutions. To address these questions, we investigate and analyze the fun- damental properties of the original ASC and the approximated system of constraints. Such analysis provides the foundation for possible subsequent optimization and algorithmic developments. [26] Widely used in practice, the sample average approximation (SAA) [76, Section 7.2.5] is a statistical method which constructs a nite dimensional model by minimizing the average training error mea- sured by a given model and a loss function over the available data points. 
The method is a realistic approach for tackling the learning problem while exploiting the limited information of the data. In statistical theory, it is often assumed that all of the collected samples are generated from an underlying data distribution which is not known in reality. The vector of coecients that minimizes the expected error of the given model over the entire data-generating distribution is known as the 6 ground truth, and many existing theories are developed based on the assumption of existence and uniqueness of the ground truth vector [16, 18, 47, 56]. As the method of SAA is often used as the practical alternative to solving the underlying expectation minimization problem that involves an unknown distribution, connections between the two problems under certain settings have been studied in the literature. In [76], the authors investigate the relationship between the sample average approximation and the expectation minimization problems in a convex and dierentiable setting. They present the connections between the optimal solutions and the optimal objective values of the two convex pro- grams showing asymptotic convergences for the outcomes of the SAA problem to the outcomes of the underlying problem as the number of samples goes to innity. The established theory for the formulations involving smooth and convex functions unfortunately does not apply to the SAA problems for sparse representations. Due to the non-convexity and non-dierentiability in the for- mulation introduced by employing the sparsity functions, it is uncertain whether the stationary (or even globally optimal) solutions of the non-convex approximation problem would achieve a desired level of performance over the general population of the data. For the special case of LASSO which minimizes the least squared error of the linear model with the convex ` 1 -norm, the connection between the ground truth and the optimal solution of the cor- responding convex SAA problem has been studied. In [16], Cand es and Tao showed that the exact recovery of the ground truth can be obtained by solving a problem formulated with a deterministic matrix under certain settings. The analysis of the properties of the optimal empirical solution has been developed from statistical perspectives [47]. Therein, the authors present a consistency bound which measures the distance between the two solutions, a prediction error bound which determines the dierence between the trained model's predictions for the sample data and the noise-free sam- ple outcomes, and the support recovery result comparing the sets of the nonzero components of the two solutions. Inspired by the above work, we investigate the same properties in terms of the d-stationary solutions for the non-convex ` 0 -surrogate regularized SAA problems. We aim to provide a generalization of the existing study of LASSO by extending the linear regression problem to general combinations of model and loss functions, and by incorporating dc sparsity functions 7 instead of the convex ` 1 -norm. The resulting analysis is our initial attempt to narrow the gap between the statistical inference and the computational optimization theory with the hope that the provided study promotes active discussions and stimulates collaborative research among the various academic elds. 1.3 Contributions of the Dissertation With the above background and the preview of the dissertation, we are ready to summarize the multi-fold contributions of the current work organized by the topics. 
The starting point of these contributions is the introduction of a unied formulation for sparse rep- resentations with a piecewise-smooth dc objective function for many well-known sparsity functions. Since the convex components and some of the concave summands of several objective functions are of the piecewise smooth, thus non-dierentiable kind, we build the theory on the concept of d-stationarity which has been shown as the least restrictive concept for non-smooth dc programs. A major contribution of the dissertation includes derivation of several theoretical results for the non-convex Lagrangian formulation of the bi-criteria statistical learning problem. We provide con- ditions under which the d-stationary solutions of the Lagrangian optimization problem are the global minima possibly of a restricted kind due to non-dierentiability. The sparsity properties of the d-stationary solutions are investigated based on the given model constants. Generalized and specialized results for such properties of the stationary solutions are provided according to the specic choices of the surrogate functions. We emphasize on developing analysis that can be applied for practical computations. Our analysis delivers realistic properties of the computational models while relaxing the requirement of optimality and knowledge about unattainable information. We next introduce a new system of constraints called ane sparsity constraints (ASC) dened by linear inequalities involving the ` 0 -function. By incorporating this discontinuous function in the formulations, the constraint system can accommodate logical conditions for variable selection in sparse statistical learning. Many existing problems including the hierarchical variable selection and (dierent kinds of) the group variable selection can be formulated as hard constraints under 8 the ASC framework. We investigate the new kind of constraint system, and provide mathematical analysis on the fundamental structural properties of the discrete set such as closedness and the characterization of the closure. With considerations for computational optimizations, we introduce a set which approximates the original ASC dened by the univariate non-convex ` 0 -surrogate func- tions. We relate the ASC and the approximation set in terms of their solution sets, and show that the solutions of the approximation set converge to the solutions of the original set. The charac- terization for the tangent cones of the above sets are provided. We believe that the fundamental structural properties of the ane sparsity constraints provided in this dissertation lay foundations for subsequent research of solving optimization problems subject to these constraints. The third contribution of the dissertation lies in investigating the relationship between the sample average approximation problems involving non-convex sparsity surrogates and the expectation min- imization problem over hidden data distribution. The connection is analyzed through two measures of interests: a d-stationary solution of the SAA problem and a near-optimal solution of the underly- ing problem. [Instead of using the concept of ground truth, we compare the d-stationary solutions to a vector which could be the global optimal solution of the latter problem with a high probability. See 4.1.2 for more details]. For the deterministic setting of a xed data matrix, we derive a bound for the` 2 -distance between the two solutions; the result is a complete generalization of the LASSO consistency result [47][Chapter 11]. 
We achieve the same bound while employing general loss and model functions, broadening the restricted setting of least-squares error for linear regression, and extending the $\ell_1$ regularizer to the folded-concave sparsity functions. Furthermore, we establish a bound for the prediction error, which is the difference between the model outcomes generated by the two solutions. For the linear regression problem, we achieve the same bound as the LASSO prediction bound while including the non-convex sparsity functions in the formulation. Due to the lack of linearity structure in the model, the prediction bound for Lipschitz continuous statistical models involves the average value of the Lipschitz constants (of the model function), which may depend on the data matrix. We investigate the conditions under which the support of a given d-stationary solution achieves no false inclusion and no false exclusion; these terms are defined to represent the inclusion relationships between the supports of the two solutions. A noteworthy point is that the former kind of inclusion can be shown for any arbitrarily given subset of indices. In summary, we provide a generalization of the existing LASSO theory by developing a deterministic analysis for the formulation involving general model and loss functions with existing non-convex $\ell_0$-surrogates.

A review of nonlinear optimization theory and concepts of stationarity immediately follows the current section. The organization of the remaining chapters is as follows. The minimizing and sparsity properties of the d-stationary solutions of the bi-criteria Lagrangian formulations are provided in Chapter 2. The analysis of the original ASC and of the approximating system, including fundamental structural properties of the two sets and the convergence theory for the solutions of the approximating set, is presented in Chapter 3. In the fourth and final chapter of the dissertation, we provide a summary of the statistical inference results for the LASSO solutions and present a generalization of the existing theory to the sparse SAA learning problem, providing consistency, prediction error bounds, and support recovery results for the d-stationary solutions of the resulting non-convex program. We note that each chapter of the dissertation contains an individual piece of research; as a result, some notations are defined only for the particular chapter to which they belong.

1.4 Background in Nonlinear Mathematical Optimization

We present a brief review of fundamental concepts of optimization theory, including the definition of convexity, constrained and unconstrained problems, and the concepts of stationarity for general nonlinear optimization problems. Provided are selected results from [5] and [68]; we refer to the cited references for more information about the background and for precise statements of the related theorems and their proofs.

1.4.1 Review of the optimization theory

Unconstrained optimization. For an $n$-dimensional vector $x$ and $f : \mathbb{R}^n \to \mathbb{R}$, we define the unconstrained minimization problem
\[
\operatorname*{minimize}_{x \in \mathbb{R}^n} \; f(x). \tag{1.1}
\]
A vector $\bar{x}$ is a global minimizer if $f(\bar{x}) \leq f(x)$ for all $x \in \mathbb{R}^n$, and a local minimizer if $f(\bar{x}) \leq f(x)$ for all $x$ in a neighborhood of $\bar{x}$, denoted $\mathcal{N}(\bar{x})$; i.e., for some constant $\varepsilon > 0$, $\mathcal{N}(\bar{x}) \triangleq \{x \mid \|x - \bar{x}\| \leq \varepsilon\}$ for some norm $\|\cdot\|$. Suppose the function $f$ is twice differentiable; then a necessary condition for $\bar{x}$ to be a local/global minimum is
\[
\nabla f(\bar{x}) \,=\, 0 \quad \text{and} \quad x^{T} \nabla^2 f(\bar{x})\, x \,\geq\, 0 \ \text{ for all } x. \tag{1.2}
\]
If the second inequality is strict for all nonzero $x$, then the condition (1.2) is also sufficient for $\bar{x}$ to be a local minimizer. The gradient condition in (1.2) is necessary and sufficient for $\bar{x}$ to be a global minimizer of $f$ if the function is convex (to be defined shortly).
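As a small numerical companion to (1.1)-(1.2) (added for illustration, not taken from the dissertation), the Python sketch below checks the two conditions by finite differences at the minimizer of a strongly convex quadratic; the particular quadratic and the tolerances are arbitrary choices.

```python
import numpy as np

# f(x) = 0.5 x^T Q x - b^T x with Q symmetric positive definite,
# so the unique global minimizer solves Q x = b.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ Q @ x - b @ x
x_bar = np.linalg.solve(Q, b)

eps = 1e-4
I = np.eye(2)
# central-difference gradient: should vanish at x_bar (first part of (1.2))
grad = np.array([(f(x_bar + eps * e) - f(x_bar - eps * e)) / (2 * eps) for e in I])
# second differences approximate the Hessian, which should be positive semidefinite
hess = np.array([[(f(x_bar + eps * (ei + ej)) - f(x_bar + eps * ei)
                   - f(x_bar + eps * ej) + f(x_bar)) / eps ** 2 for ej in I] for ei in I])

print("gradient at the minimizer :", grad)                      # ~ [0, 0]
print("Hessian eigenvalues       :", np.linalg.eigvalsh(hess))  # all positive here
```

Because the Hessian eigenvalues are strictly positive, the strict form of (1.2) holds at $\bar{x}$; since this particular $f$ is also convex, the gradient condition alone already certifies global optimality, as just noted.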
Convexity. A function $r : \mathbb{R}^n \to \mathbb{R}$ is convex if
\[
r(\lambda x + (1-\lambda) y) \,\leq\, \lambda\, r(x) + (1-\lambda)\, r(y)
\]
for any constant $\lambda \in [0,1]$ and for all $x$ and $y$. We can also determine the convexity of a smooth function using its gradient or Hessian; a function $r$ is convex if and only if one of the following holds:

1. $r(y) \geq r(x) + \nabla r(x)^{T}(y - x)$ for all $x$ and $y$;
2. $y^{T} \nabla^2 r(x)\, y \geq 0$ for all $x$ and $y$.

The function is strongly convex with modulus $\mu > 0$ if and only if one of the following holds:

1. $r(y) \geq r(x) + \nabla r(x)^{T}(y - x) + \dfrac{\mu}{2}\,\|x - y\|_2^2$ for all $x$ and $y$;
2. $y^{T} \nabla^2 r(x)\, y \geq \mu\, \|y\|_2^2$ for all $x$ and $y$.

The set $X$ is a convex set if, for any given subset of its elements $\{x^1, \ldots, x^m\} \subseteq X$ for some positive integer $m$, every convex combination $\sum_{i=1}^m \lambda_i x^i$, where each $\lambda_i$ is a nonnegative scalar such that $\sum_{i=1}^m \lambda_i = 1$, is also contained in the set. The empty set and the entire space $\mathbb{R}^n$ are by convention convex sets.

Constrained optimization. We define a constrained minimization problem
\[
\operatorname*{minimize}_{x} \; f(x) \quad \text{subject to} \quad x \in X, \tag{1.3}
\]
where $X$ is a closed subset of $\mathbb{R}^n$. Suppose $f$ is continuously differentiable. If $\bar{x}$ is a local minimizer, then
\[
\nabla f(\bar{x})^{T} d \,\geq\, 0, \qquad \forall\, d \in \mathcal{T}(X; \bar{x}), \tag{1.4}
\]
where the tangent cone of $X$ at the point $\bar{x}$ is defined as
\[
\mathcal{T}(X; \bar{x}) \,\triangleq\, \left\{\, v \;\middle|\; v = 0 \ \text{ or } \ \exists\, \{x^k\} \to \bar{x} \text{ such that } x^k \neq \bar{x} \ \forall\, k \text{ and } \lim_{k \to \infty} \frac{x^k - \bar{x}}{\|x^k - \bar{x}\|} = \frac{v}{\|v\|} \,\right\}.
\]
If the constraint set $X$ is convex, then (1.4) is equivalent to
\[
\nabla f(\bar{x})^{T} (x - \bar{x}) \,\geq\, 0 \quad \text{for all } x \in X. \tag{1.5}
\]
The condition (1.5) is a necessary and sufficient global optimality condition if the problem (1.3) is a convex program, that is, if the objective function and the constraint set are both convex.

1.4.2 Concepts of stationarity

We introduce the concepts of stationarity that serve as the foundation for the theory presented in this dissertation. Consider the constrained optimization problem (1.3) where the objective function is not differentiable and the constraint set $X$ is convex. A vector $x^*$ is formally a directional stationary solution of the problem if the objective function is directionally differentiable and the directional derivative at the point $x^*$ in the direction $x - x^*$ is nonnegative for every $x$ contained in the constraint set, i.e.,
\[
f'(x^*; x - x^*) \,\triangleq\, \lim_{\tau \downarrow 0} \frac{f(x^* + \tau (x - x^*)) - f(x^*)}{\tau} \,\geq\, 0 \quad \text{for all } x \in X,
\]
where $\tau$ is a positive scalar. Suppose $f$ is a difference-of-convex function, $f(x) = g(x) - h(x)$, where both $g$ and $h$ are convex functions; then a directional stationary point $x^*$ satisfies
\[
f'(x^*; x - x^*) \,=\, g'(x^*; x - x^*) - h'(x^*; x - x^*) \,\geq\, 0 \quad \text{for all } x \in X.
\]
The above is equivalent to
\[
g'(x^*; x - x^*) \,\geq\, \max_{v \,\in\, \partial h(x^*)} v^{T}(x - x^*)
\]
if $h$ is not differentiable, since $h'(x; d) = \max_{v \in \partial h(x)} v^{T} d$, where the subdifferential of the convex function $h$ at the point $x$ is defined as $\partial h(x) \triangleq \{\, v \mid h(x) - h(y) \leq v^{T}(x - y) \ \forall\, y \,\}$. For the constrained problem (1.3) with a general (not necessarily convex) constraint set, we say that a vector $\tilde{x} \in X$ is a Bouligand stationary solution of the problem if $f'(\tilde{x}; d) \geq 0$ for all $d \in \mathcal{T}(X; \tilde{x})$, where $\mathcal{T}(X; \tilde{x})$ is the tangent cone of $X$ at $\tilde{x}$.
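Because the gap between directional stationarity and weaker notions drives much of the analysis in Chapter 2, the following small Python check, added here purely as an illustration, contrasts d-stationarity with the usual critical-point condition from the DCA literature, $\nabla g(x) \in \partial h(x)$, on the univariate dc function $f(t) = t^2 - |t|$ with $g(t) = t^2$ and $h(t) = |t|$; this mirrors (with weight one) the example revisited in Section 2.2.3. The point $t = 0$ is critical yet both one-sided directional derivatives are strictly negative, so it is not d-stationary; the point $t = 1/2$ is d-stationary and, in this example, a global minimizer.

```python
import numpy as np

g = lambda t: t * t        # convex and differentiable
h = lambda t: abs(t)       # convex, nonsmooth at 0
f = lambda t: g(t) - h(t)  # dc function; its global minima are t = +/- 0.5

def dir_deriv(func, t, d, tau=1e-8):
    # one-sided directional derivative f'(t; d) ~ (f(t + tau d) - f(t)) / tau
    return (func(t + tau * d) - func(t)) / tau

for t_star in (0.0, 0.5):
    dd = {d: dir_deriv(f, t_star, d) for d in (+1.0, -1.0)}
    # DCA-style criticality for g - h: grad g(t) lies in the subdifferential of h at t
    grad_g = 2.0 * t_star
    lo, hi = (-1.0, 1.0) if t_star == 0.0 else (np.sign(t_star),) * 2
    critical = lo - 1e-12 <= grad_g <= hi + 1e-12
    d_stationary = all(v >= -1e-6 for v in dd.values())
    print(f"t = {t_star}: f'(t;+1) = {dd[1.0]:+.3f}, f'(t;-1) = {dd[-1.0]:+.3f}, "
          f"critical = {critical}, d-stationary = {d_stationary}")
```

The check makes concrete the point emphasized in Chapter 2: every local minimizer is d-stationary, and d-stationarity screens out points (such as $t = 0$ here) that weaker stationarity notions cannot.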
Chapter 2
Difference-of-convex Learning: Directional Stationarity, Sparsity and Optimality

In sparse learning, statistical models are trained by solving optimization formulations involving two criteria. Given choices of the model and the loss functions, we train the model so that the error (sometimes referred to as the loss) of the model outputs, compared with the known information about the samples, is minimized. It is also desired that the trained model achieve sparsity in the solution, which is the vector of model parameters; i.e., many components of the solution should be exactly zero. The task of promoting sparsity can be performed by formulating the problem with surrogates of the discrete $\ell_0$-function. As many of the existing sparsity functions are non-convex and non-differentiable, solving a formulation involving such surrogates introduces new challenges, such as understanding the properties of the resulting formulation and the properties of the (computable) stationary solutions. Furthermore, the sparsity functions have been developed and studied independently in the existing literature; many of these functions have been shown to possess desirable theoretical or numerical properties within frameworks designed for those specific functions. We introduce a unified formulation for learning problems involving sparsity functions that is applicable to most of the existing non-convex $\ell_0$-surrogates, and we provide a systematic analysis investigating the minimizing and sparsity properties of the d-stationary solutions.

2.1 The Unified DC Program of Statistical Learning

We consider the following Lagrangian form of the bi-criteria statistical learning optimization problem for selecting the model variable $w \in \mathbb{R}^d$:
\[
\operatorname*{minimize}_{w \in W} \; Z_\gamma(w) \,\triangleq\, \ell(w) + \gamma\, P(w), \tag{2.1}
\]
where $\gamma$ is a positive scalar parameter balancing the loss (or residual) function $\ell(w)$ and the model control (also called penalty) function $P(w)$, and $W$ is a closed convex (typically polyhedral) set in $\mathbb{R}^d$ constraining the model variable $w$. The unconstrained case where $W = \mathbb{R}^d$ is of significant interest in many applications. Throughout the chapter, we assume that

(A$_0$) $\ell(w)$ is a convex function on the closed convex set $W$ and $P(w)$ is a difference-of-convex (dc) function given by
\[
P(w) \,\triangleq\, g(w) - h(w), \quad \text{with } h(w) \,\triangleq\, \max_{1 \leq i \leq I} h_i(w) \ \text{ for some integer } I > 0, \tag{2.2}
\]
where $g$ is convex but not necessarily differentiable and each $h_i$ is convex and differentiable.

Thus the concave summand of $P(w)$, i.e., $-h(w)$, is the negative of a pointwise maximum of finitely many convex differentiable functions; as such, $h(w)$ is piecewise differentiable (PC$^1$) according to the definition of such a piecewise smooth function [32, Definition 4.5.1]. [Specifically, a continuous function $f(w)$ is PC$^1$ on an open domain $\Omega$ if there exist finitely many C$^1$ functions $\{f_i\}_{i=1}^M$, for some positive integer $M$, such that $f(w) \in \{f_i(w)\}_{i=1}^M$ for all $w \in \Omega$. If each $f_i$ is an affine function, we say that $f$ is piecewise affine on $\Omega$.] In many (inexact) surrogate sparsity functions, the function $P(w)$ is separable in its arguments; i.e., $P(w) = \sum_{i=1}^d p_i(w_i)$, where each $p_i(w_i)$ is a univariate dc function whose concave summand is the negative of a pointwise maximum of finitely many convex differentiable (univariate) functions, i.e., the univariate analog of (2.2). It is important to note that while it has been recognized in the literature (e.g., [41, 43, 54, 81]) that several classes of sparsity functions can be formulated as dc functions, the particular form (2.2) of the function $h(w)$ has not been identified in the general dc approach to sparsity optimization. Our work herein exploits this piecewise structure profitably.
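To see the structure (2.2) in a concrete surrogate, consider the univariate capped-$\ell_1$ function treated later in Section 2.5.3. Under the common parameterization $p(t) = \min(|t|, a)$ with a fixed knot $a > 0$ (the exact scaling used in Section 2.5.3 may differ, so this form is an assumption made for illustration), one dc decomposition is $p(t) = |t| - \max\{0,\, t - a,\, -t - a\}$: here $g$ is the absolute value and $h$ is a pointwise maximum of $I = 3$ affine, hence convex and differentiable, pieces, exactly as in (2.2). The short Python check below verifies the identity numerically on a grid.

```python
import numpy as np

a = 1.5  # knot of the capped-l1 surrogate (illustrative value)

def capped_l1(t):
    return np.minimum(np.abs(t), a)

def g(t):
    # convex part: the absolute value
    return np.abs(t)

def h(t):
    # pointwise maximum of I = 3 affine pieces, matching the form (2.2)
    return np.maximum.reduce([np.zeros_like(t), t - a, -t - a])

t = np.linspace(-5.0, 5.0, 2001)
gap = np.max(np.abs(capped_l1(t) - (g(t) - h(t))))
print("max |capped_l1 - (g - h)| on the grid:", gap)  # prints 0.0
```

A similar finite-max structure for $h$, typically paired with a weighted $\ell_1$ convex part $g$, is derived in Sections 2.3-2.5 for the SCAD, MCP, transformed $\ell_1$, and related surrogates, and it is precisely this structure that the d-stationarity analysis below exploits.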
Since every convex function can be represented as the pointwise maximum of a family of affine functions, possibly a continuum of them, it is natural to ask whether the results in our work can be extended to a general convex function $h$. A full treatment of this question is regrettably beyond the scope of this chapter, which is aimed at addressing problems in sparsity optimization and not general dc programs.

2.1.1 A model-control constrained formulation

With two criteria, the loss $\ell(w)$ and the model control $P(w)$, there are two alternative optimization formulations for the choice of the model variable $w$; one of them constrains the latter function while minimizing the former, in contrast to balancing the two criteria with a weighting parameter in a single combined objective function to be minimized. Specifically, given a prescribed scalar $\eta > 0$, consider
\[
\operatorname*{minimize}_{w \in W} \; \ell(w) \quad \text{subject to} \quad P(w) \,\leq\, \eta, \tag{2.3}
\]
which we call the model-control constrained version of (2.1). Since both (2.1) and (2.3) are non-convex problems and involve non-differentiable functions, the connections between them are not immediately clear. For one thing, from a computational point of view, it is not practical to relate the two problems via their global minima; the reason is simple: such minima are computationally intractable. Instead, one needs to relate these problems via their respective computable solutions. Toward this end, it seems reasonable to relate the d(irectional)-stationary solutions of (2.1) to the B(ouligand)-stationary solutions of (2.3), as both kinds of solutions can be computed by the majorization/linearization algorithms described in [68] and they are the "sharpest" kind of stationary solutions for directionally differentiable optimization problems. Before presenting the details of the theory relating the two problems (2.1) and (2.3), we note that a third formulation can be introduced wherein one minimizes the penalty function $P(w)$ while constraining the loss function $\ell(w)$ not to exceed a prescribed tolerance. As the loss function $\ell(w)$ is assumed to be convex in our treatment, the latter formulation is a convex-constrained dc program; this is unlike (2.3), which is a dc-constrained dc program, a problem that is considerably more challenging than a convex-constrained program. Thus we omit this loss-constrained formulation and focus only on relating the Lagrangian formulation (2.1) and the penalty-constrained formulation (2.3).

2.1.2 Stationary solutions

With $W$ being a closed convex set, a vector $w^* \in W$ is formally a d-stationary solution of (2.1) if the directional derivative
\[
Z_\gamma'(w^*; w - w^*) \,\triangleq\, \lim_{\tau \downarrow 0} \frac{Z_\gamma(w^* + \tau(w - w^*)) - Z_\gamma(w^*)}{\tau} \,\geq\, 0 \quad \text{for all } w \in W.
\]
Letting $\widehat{W} \triangleq \{\, w \in W \mid P(w) \leq \eta \,\}$ be the (non-convex) feasible set of (2.3), we say that a vector $\bar{w} \in \widehat{W}$ is a B-stationary solution of this problem if $\ell'(\bar{w}; dw) \geq 0$ for all $dw \in \mathcal{T}(\widehat{W}; \bar{w})$, where $\mathcal{T}(\widehat{W}; \bar{w})$ is the tangent cone of $\widehat{W}$ at $\bar{w}$; the latter cone consists of all vectors $dw$ for which there exist a sequence of vectors $\{w^k\} \subset \widehat{W}$ converging to $\bar{w}$ and a sequence of positive scalars $\{\tau_k\}$ converging to zero such that $dw = \lim_{k \to \infty} \dfrac{w^k - \bar{w}}{\tau_k}$.

This chapter focuses on the derivation of optimality and sparsity properties of the directional stationary solutions of the problem (2.1). Through the connections with the constrained formulation (2.3) established below, the results obtained for (2.1) can be adapted to the constrained problem. A natural question arises as to why we choose to focus on directional stationary solutions rather than stationary solutions of other kinds, such as a critical point of a dc program [43, 54, 70, 71].
For reasons given in [68, Section 3.3], see in particular the summary of implications therein, directional stationary solutions are the sharpest kind among stationary solutions of other kinds in the sense a directional stationary solution must be stationary according to other denitions of stationarity. Moreover, as shown in Proposition 1 below and Proposition 6 later, d-stationary solutions possess minimizing properties that are not in general satised by stationary solutions of other kinds. In essence, the reason for the sharpness of d-stationarity is because it captures all the active pieces as described by the index setA(w ) at the point under considerationw ; see (2.4). In contrast, a critical point of a dc program fails to capture all the active pieces. Thus any property that a critical point might have may not be applicable to all the active pieces; therefore, Our rst result is a characterization of a d-stationary solution of (2.1). Proposition 1. Under assumption (A 0 ), a vectorw 2W is a d-stationary solution of (2.1) if and only if w 2 argmin w2W 2 6 4`(w) + g(w)rh i (w ) T (ww ) | {z } convex function in w 3 7 5; 8i2A(w ); (2.4) whereA(w ),fij h i (w ) = h(w )g. Moreover, if the function h is piecewise ane, then any such d-stationary solution must be a local minimizer of Z onW. Proof. It suces to note that h 0 (w ;dw) = max i2A(w ) rh i (w ) T dw for all vectors dw2R d . The last assertion about the local minimizing property follows readily from the fact [32, Exercise 4.8.10] that we have h(w) = h(w ) +h 0 (w ;ww ) for any piecewise ane function h, provided that w is suciently near w . 17 Remark. The last assertion in the above proposition generalizes a result in [53] where an additional dierentiability assumption was assumed. The characterization of a B-stationary point of (2.3) is less straightforward, the reason being the non-convexity of the feasible set c W that makes it dicult to unveil the structure of the tangent coneT ( c W ; w). IfP ( w)<, thenT ( c W ; w) =T (W; w). The following result shows that this case is not particularly interesting when it pertains to the stationarity analysis of w. Proposition 2. Under assumption (A 0 ), if w2W is a B-stationary point of (2.3) such that P ( w)<, then w is a global minimizer of (2.3). Proof. As noted above, the fact that P ( w) < impliesT ( c W ; w) =T (W; w); hence w satises: ` 0 ( w;dw) 0 for all dw2T (W; w). Since ` is a convex function andW is a convex set, it follows that w is a global minimizer of ` onW. In turn, this implies that w is a global minimizer of ` on c W because c W is a subset ofW. . To understand the coneT ( c W ; w) whenP ( w) =, we need certain constraint qualications (CQs), under which it becomes possible to obtain a necessary and sucient condition for a B-stationary point of (2.3) similar to that in Proposition 1 for a d-stationary point of (2.1). We list two such CQs as follows, one of which is a pointwise condition while the other one pertains to the entire set c W : The pointwise Slater CQ is said to hold at w2 c W satisfying P ( w) = if d2T (W ; w) exists such that g 0 ( w; d)<rh j ( w) T d for all j2A( w). The piecewise ane CQ (PACQ) is said to hold for c W ifW is a polyhedron and the function g is piecewise ane. [This CQ implies neither the convexity nor the piecewise polyhedrality of c W because no linearity is assumed for the functions h i .] Under these CQs, we have the following characterization of a B-stationary point of the penalty constrained problem (2.3). See [68] for a proof. Proposition 3. 
Under assumption (A 0 ), if either the pointwise Slater CQ holds at w or the PACQ holds for the set c W , then w is a B-stationary point of (2.3) if and only if ` 0 ( w;dw) 0 for all dw2 b C j ( w), d2T (W; w)jg 0 ( w;d)rh j ( w) T d and every j2A(w ). With the above two propositions, we can formally relate the two problems (2.1) and (2.3) as follows. Proposition 4. Under assumption (A 0 ), the following two statements hold. 18 (a) Ifw is a d-stationary solution of (2.1) for some 0, and if the pointwise Slater CQ holds at w for the set c W where P (w ), then w is a B-stationary solution of (2.3). (b) If w is a B-stationary solution of (2.3) for some > 0, and if the pointwise Slater CQ holds at w for the set c W , then the following two statements hold: for each j2A( w), a scalar j 0 exists such that w is a d-stationary solution of minimize w2W [`(w) + j (g(w)h j (w) ) ] ; (2.5) nonnegative scalars andfb j g I j=1 exist with I X j=1 b j = 1 andb j (h(w)h j (w)) = 0 for all j such that w is a d-stationary solution of minimize w2W 2 4 `(w) + 0 @ g(w) I X j=1 b j h j (w) 1 A 3 5 : Finally, statements (a) and (b) remain valid if the PACQ holds instead of the pointwise Slater CQ. Proof. If w is a d-stationary solution of (2.1), then ` 0 (w ;ww ) + g 0 (w ;ww )rh i (w ) T (ww ) 0 for allw2W and alli2A(w ). Letdw2 b C j ( w) for somej2A(w ). We havedw = lim k!1 w k w k for some sequencefw k gW converging to w and sequencef k gR ++ converging to zero. For each k, we have ` 0 (w ;w k w ) + h g 0 (w ;w k w )rh i (w ) T (w k w ) i 0: Dividing by k and letting k!1, we deduce ` 0 (w ;dw) + g 0 (w ;dw)rh i (w ) T dw 0; which implies ` 0 (w ;dw) 0 because the term in the square bracket is non-positive. This proves parts (a) and (c). To prove (b), let w be as given. By Proposition 3, it follows that for everyj2A( w),` 0 ( w;dw) 0 for all dw2 b C j ( w), d2T (W; w)jg 0 ( w;d)rh j ( w) T d . Thus dw = 0 is a minimizer of ` 0 ( w;dw) for dw2 b C j ( w). Since the latter set is convex and satises the Slater CQ, a standard 19 result in nonlinear programming duality theory (see e.g. [5]) yields the existence of a scalar j such that dw = 0 is a minimizer of ` 0 ( w;dw) + j g 0 ( w;dw)rh j ( w) T dw onT (W; w). This proves statement (i) in (b). Adding up the inequalities: ` 0 ( w;dw) + j g 0 ( w;dw)rh j ( w) T dw 0; dw2T (W; w) for j2A( w), we deduce ` 0 ( w;dw) + 0 @ g 0 ( w;dw) X j2A( w) b j rh j ( w) T dw 1 A 0; dw2T (W; w); where , 1 jA( w)j X j2A( w) j andb j , j X j 0 2A( w) j 0 . This establishes (b). Part (b) of Proposition 4 falls short in establishing the converse of part (a), i.e, in establishing the full equivalence of the two families of problemsf(2.1)g 0 andf(2.3)g 0 in terms of their d-stationary solutions and B-stationary solutions, respectively. This adds to the known challenges of the latter dc-constrained family of dc problems over the former convex constrained dc programs. In the next section, we focus on the Lagrangian formulation (2.1) only and rely on the connections established in this section to obtain parallel results for the constrained formulation (2.3). Proposition 4 complements the penalty results in dc programming initially studied by [72] and recently expanded in [44, 68]. These previous penalty results address the question of when a xed constrained problem (2.3) has an exact penalty formulation as the problem (2.1) for sucient large in terms of their global minima. 
Furthermore, in the case of a quadratic loss function, the reference [44] derives a lower bound for the scalar for the penalty formulation to be exact. In contrast, allowing for a general convex loss function, Proposition 4 deals with stationary solutions that from a computational perspective are more reasonable as the results pertaining to global minima lack pragmatism and should at best be regarded as providing only theoretical insights and conceptual evidence about the connection between the two formulations. To close this subsection, we give a result pertaining to the case where the function h is piecewise ane; in this case, by Proposition 1, every d-stationary solution of (2.1) must be a local minimizer. Combining this fact with Proposition 4(b), we can establish a similar result for the problem (2.3). Proposition 5. Let assumption (A 0 ) hold with each h i being ane additionally. If w is a B- stationary solution of (2.3) for some > 0, such that P ( w) =, and if either the pointwise Slater 20 CQ holds at w for the set c W or the PACQ holds for the same set, then w is a local minimizer of (2.3). Proof. By Proposition 4(b) and (c), for each j2A( w), a scalar j 0 exists such that w is a d-stationary solution, thus minimizer, of (2.5), which is a convex program because h j is ane. To complete the proof, let w2 c W be suciently close to w such thatA(w)A( w). Let j2A(w). We then have `( w) = `( w) + j P ( w) j = `( w) + j (g( w)h j ( w) ) j because j2A( w) `(w) + j (g(w)h j (w) ) j by the minimizing property of w = `(w) + j (P (w) ) `(w) because j2A(w) and P (w): This establishes that w is a local minimizer of (2.3). 2.2 Minimizing and Sparsity Properties In general, a stationary solution of (2.1) is not guaranteed to possess any minimizing property. For smooth problems involving twice continuously dierentiable functions, it follows from classical nonlinear programming theory that with an appropriate second-order suciency condition, a sta- tionary solution can be shown to be strictly locally minimizing. Such a well-known result becomes highly complicated when the functions are non-dierentiable. Although one could in principle apply some generalized second derivatives, e.g., those based on Mordukhovich's non-smooth calculus [63], it would be preferable in our context to derive a simplied theory that is more readily applicable to particular sparsity functions, such as those to be discussed in Sections 2.4 and 2.5. 2.2.1 Preliminary discussion of the main result In a nutshell, our goal is to extend the characterization of a d-stationary solution of (2.1) in Proposition 1 by showing that under a set of assumptions, including a specic choice of the convex functiong, for a range of values of to be identied in the analysis, any such nonzero d-stationary solutionw either haskw k 0 bounded above by a scalar computable from certain model constants, 21 or w 2 argmin w2W 2 6 4`(w) + (g(w)h i (w) ) | {z } remains dc 3 7 5 8i2A(w ); (2.6) see Theorem 1 in Subsection 2.2.4 for details. In terms of the original function Z , the above condition implies a restricted global optimality property of w ; namely, w 2 argmin w2W Z (w); (2.7) whereW ,fw2W jA(w)\A(w )6=;g. To see this implication, it suces to note that if w2W , then letting i be a common index inA(w)\A(w ), we have Z (w)Z (w ) = [`(w) + (g(w)h i (w) ) ] [`(w ) + (g(w )h i (w ) ) ] 0: Borrowing a terminology from piecewise programming, we say that the vectors w and w share a common piece ifA(w)\A(w )6=;. 
Thus the restricted global optimality property (2.7) of w says thatw is a true global minimizer ofZ (w) among those vectorsw2W that share a common piece withw . The subsetW includes a neighborhood ofw inW; i.e., for a suitable scalar> 0, I B(w ;)\WW , where I B(w ;) is an Euclidean ball centered at w and with radius > 0. Thus, the optimality property (2.6) implies in particular that w is a local minimizer of Z onW. Another consequence of (2.6) is that if I = 1 (so that h is convex and dierentiable), then w is a global minimizer of Z onW. Besides the above special cases, our main result applies to many well-known sparsity functions in the statistical learning literature, plus the relatively new and largely un-explored class of exact K-sparsity functions to be discussed later. 2.2.2 The assumptions The analysis requires two main sets of assumptions: one on the model functions`(w) andfh i (w)g I i=1 and the other one on the constraint setW. We rst state the former set of assumptions, which introduce the key model constants Lip r` , ` min , Lip rh , and . In principle, we could localize these assumptions around a given d-stationary solution; instead, our intention of stating these assumptions globally on the setW is to use the model constants to identify the parameter that denes the optimization problem (2.1). 22 (A) In addition to (A 0 ), (A L ` ) the loss function`(w) is of class LC 1 (for continuously dierentiable with a Lipschitz gradient) onW; i.e., a positive scalar Lip r` exists such that for all w and w 0 inW, kr`(w)r`(w 0 )k 2 Lip r` kww 0 k 2 ; 8w;w 0 2W; (2.8) (A cvx ` ) a nonnegative constant ` min exists such that `(w)`(w 0 )r`(w 0 ) T (ww 0 ) ` min 2 kww 0 k 2 2 ; 8w;w 0 2W; (2.9) (A L h ) nonnegative constants Lip rh and exist such that for each i2f1; ;Ig and all w and w 0 inW, with 0=0 dened to be zero, 0 h i (w)h i (w 0 )rh i (w 0 ) T (ww 0 ) Lip rh 2 + kw 0 k 2 kww 0 k 2 2 : (2.10) (A B h ) max 1iI sup w2W krh i (w)k 2 , <1. By the mean-value theorem for multivariate functions [67, 3.2.12], we derive from the inequality (2.8), `(w)`(w 0 )r`(w 0 ) T (ww 0 ) Lip r` 2 kww 0 k 2 2 ; 8w;w 0 2W: Thus Lip r` ` min . When `(w) = 1 2 w T Qw +q T w is a strongly convex quadratic function with a symmetric positive denite matrix Q, the ratio ` min Lip r` = min (Q) max (Q) , where the numerator and denominator of the right-side ratio are the smallest and largest eigenvalues of Q, respectively, is exactly the reciprocal of the condition number of the matrix Q. We discuss a bit about the two assumptions (A cvx ` ) and (A L h ). The constant ` min expresses the con- vexity of the loss function `, and the strong convexity when ` min > 0. For certain non-polyhedral sparsity functions, this constant needs to be positive in order for interesting results to be obtained. One may argue that the latter positivity condition may be too restrictive to be satised in appli- cations. For instance, in sparse linear regression where `(w) = 1 2 kAwbk 2 2 , the matrix A can be expected to be column rank decient so that ` is only convex but not strongly. For this problem, the results below are applicable to the regularized loss function b ` (w) = `(w) + 2 kwk 2 2 for any positive scalar . More generally, any regularized convex function is strongly convex and thus sat- ises condition (A cvx ` ) with a positive ` min . 
Another way to soften the positivity of ` min satisfying 23 (2.9) is to require that this inequality (with a positive ` min ) holds only on a subset ofW, e.g., on the setL \W, where L , n w2 R d j w i = 0 whenever w i = 0 o : In this case, the minimizing property ofw will be restricted to the setL \W. It turns out that if the functionsh i are ane so that the function h(w) = max 1iI h i (w) is piecewise ane (and convex), the main result could hold for ` min = 0. Yet another localization of the global inequality (2.9) is to assume that the Hessianr 2 `(w ) exists and is positive denite (either on the entire spaceR d or on the subspaceL ). Under such a pointwise strong convexity condition (via the positive deniteness of the Hessian matrix), one obtains a locally strictly minimizing property ofw . The upshot of this discussion is that in general, if one desires a strong minimizing property to hold on the entire set W, then the inequality (2.9) with a positive ` min is essential to compensate for the non-convexity of the sparsity functionP whenh i is non-ane; relaxation of the condition (2.9) with a positive ` min is possible either with a piecewise linear function h or at the expense of weakening the minimizing conclusion when the functions h i are not ane. As we will see in the subsequent sessions, Assumption (A L h ) accommodates a host of convex functions h i . Foremost is the case when the functionsh i are of class LC 1 ; in this case, we may take = 0 and Lip rh to be the largest of the Lipschitz moduli of the gradientsfrh i (w)g I i=1 onW. In particular, when each function h i is ane, we may further take Lip rh to be zero or any positive scalar. The Assumption (A L h ) also includes the case where I = 1 and h(w) =kwk 2 . Indeed, this follows from the following inequality: provided w 6= 0, kwk 2 kw k 2 w kw k 2 T (ww ) 1 kw k 2 kww k 2 2 ; which is equivalent to: kwk 2 kw k 2 kw k 2 2 (w ) T (ww )kww k 2 2 : The left-hand side is equal tokwk 2 kw k 2 (w ) T w. We have kww k 2 2 kwk 2 kw k 2 (w ) T w =kwk 2 2 (w ) T w +kw k 2 2 kwk 2 kw k 2 kwk 2 2 2kw k 2 kwk 2 +kw k 2 2 = [kwk 2 kw k 2 ] 2 0; 24 which shows that we may take Lip rh = 0 and = 1 for the 2-norm function. It is important to remark that in general, if (2.10) holds with a zero Lip rh , it certainly holds with a positive Lip rh . Nevertheless, we refrain from taking this constant as always positive because it aects the selection of the constant as it will become clear in the main Theorem 1. Finally, (A L h ) also accommodates any linear combination of functions each satisfying this assumption. Such combination may be relevant in problems of group sparsity minimization where one may consider the use of dierent sparsity functions for dierent groups. The second assumption (B) below is on the constraint setW. Consistent with the previous assump- tions, we state assumption (B) with respect to an arbitrary vectorw 2W, making it a global-like condition; however, it will be clear from the analysis that the assumption is most specic to a sta- tionary solution being analyzed. The assumption is satised for example ifW = d Y i=1 W i where each W i is one of three sets: R (for an un-restricted variable) orR (for problems with sign restrictions on the variables w i ). Thus our results are applicable in particular to sign constrained problems, a departure from the literature which typically either deals with unconstrained problems, or makes an interiority assumption on w which essentially converts the analysis to the unconstrained case. 
(B) For a given vectorw 2W, there exists" > 0 such that for all"2 (0;" ] and for everyi with w i 6= 0, the vectors w ";i dened below belong toW: w ";i j , 8 > < > : w j " if j =i w j if j6=i j = 1; ;d: While our analysis is applicable to a general convex function g, a most interesting case is when g is a weighted ` 1 -function given by g(w), d X i=1 i jw i j; w2 R d ; (2.11) for some positive constants i . This choice of the function g was employed in the study [54] of DC approximations to sparsity minimization; Proposition 6 in this reference shows that a broad class of separable sparsity functions in the statistical learning literature has a dc decomposition with the above function g as the convex part. For this choice of g, if 2 @g(w) with @g(w) denoting the subdierential of g at w, then k 6=0 k 2 = s X i :w i 6= 0 2 i min 1id i p kwk 0 ; 25 where 6=0 denotes the sub-vector of with components corresponding to the nonzero components of w. Let min , min 1id i so thatk 6=0 k 2 min p kwk 0 for all w2R d . 2.2.3 The case = 0 We begin our analysis with a result pertaining to the case where condition (A L h ) holds with = 0. This is the case where eachh i is of class LC 1 . The special property (B) onW is not needed in this case; also, no assumption is imposed on g except for its convexity. Proposition 6. Suppose that (A cvx ` ) and (A L h ) with = 0 hold. For any scalar > 0 such that , ` min Lip rh 0, the following two statements hold: (a) If w is a d-stationary point of Z onW, then for all i2A(w ), [`(w) + (g(w)h i (w) ) ] [`(w ) + (g(w )h i (w ) ) ] 2 kww k 2 2 ; 8w2W: Hence w is a minimizer of Z (w) onW . Moreover, if > 0, then w is the unique minimizer of `(w) + (g(w)h i (w) ) onW for all i2A(w ), hence the unique minimizer of Z (w) onW . (b) If I = 1, then Z is convex onW with modulus (thus is strongly convex if > 0). Proof. To prove (a), let w2W be arbitrary. For any i2A(w ), we have [`(w) + (g(w)h i (w) ) ] [`(w ) + (g(w )h i (w ) ) ] [`(w) + (g(w)h i (w) ) ] [`(w ) + (g(w )h i (w ) ) ] r`(w ) T (ww ) + g 0 (w ;ww )rh i (w ) T (ww ) = [`(w)`(w )r`(w )(ww ) ] + [g(w)g(w )g 0 (w ;ww ) ] h i (w)h i (w )rh i (w ) T (ww ) 1 2 ` min Lip rh kww k 2 2 : Thus (a) follows. To prove (b), suppose I = 1. We have, for any w and w 0 inW, Z (w)Z (w 0 )Z 0 (w 0 ;ww 0 ) 1 2 ` min Lip rh kww 0 k 2 2 : Statement (b) follows readily from the above inequality. Statement (b) of the above result is in general not true if I > 1 as illustrated by the univariate function Z (t) = t 2 jtj = t 2 max(t;t). [This univariate function ts our framework with 26 g =jtj and h = 2jtj.] It can be seen that for any > 0 this function is not convex because Z (0) = 0 > 2 4 = 1 2 [Z ( =2) +Z ( =2) ]: Nevertheless, this function has two stationary points at =2 that are both global minima of Z . This simple result illustrates a couple noteworthy points. First, the convexity of the function Z cannot be expected for I > 1 even though the loss function ` may be strongly convex. Yet it is possible for a d-stationary solution of Z to be a global minimizer. The latter possibility is encouraging and serves as the motivation for the subsequent extended analysis with > 0, where an upper bound on will persist. 2.2.4 The case of g being a weighted ` 1 function In this subsection, we rene the proof of Proposition 6 by taking g to be the weighted ` 1 function (2.11) and including the constant . 
We have, for all w2W and all i2A(w ), [`(w) + (g(w)h i (w) ) ] [`(w ) + (g(w )h i (w ) ) ] [`(w)`(w )r`(w )(ww ) ] + [g(w)g(w )g 0 (w ;ww ) ] h i (w)h i (w )rh i (w ) T (ww ) ` min 2 Lip rh 2 + kw k 2 kww k 2 2 : The remaining analysis uses the d-stationarity of w to establish the nonnegativity of the multi- plicative factor M , ` min 2 Lip rh 2 + kw k 2 . By the characterization of d-stationarity in Proposition 1, there exists a subgradient i 2@g(w ) such that r`(w ) + i rh i (w ) T (ww ) 0; 8w2W: By Assumption (B) on the constraint setW, we may substitute w = w ";k for all k such that w k 6= 0 and for arbitrary "2 (0;" ] and obtain r`(w ) + i rh i (w ) k = 0 8k such that w k 6= 0: (2.12) Thus lettingr 6=0 h i (w) be the vector @h i (w) @w k k:w k 6=0 , we have kw k 2 Lip r` +kr`(0)k 2 kr`(w )k 2 = k 6=0 r 6=0 h i (w )k 2 [k 6=0 k 2 kr 6=0 h i (w )k 2 ]; 27 where the rst inequality is by the LC 1 property of ` and the triangle inequality which also yields the last inequality. Suppose ckr`(0)k 2 for a constant c > 0 to be determined momentarily. We deduce kw k 2 Lip r` k 6=0 k 2 kr 6=0 h i (w )k 2 c 1 : With g being the weighted ` 1 -function (2.11) and by recalling the scalar in condition (A B h ), we deduce kw k 2 Lip r` h min p kw k 0 c 1 i : (2.13) Moreover, provided that min p kw k 0 > +c 1 , we obtain kw k 2 Lip r` min p kw k 0 c 1 : Hence, if p kw k 0 > 2 min +c 1 , [the factor 2 can be replaced by any constant greater than 1] then kw k 2 < Lip r` +c 1 . Thus M 1 2 ` min Lip rh Lip r` +c 1 . Consequently, if (c), 1 2 ` min Lip rh Lip r` +c 1 0; (2.14) it follows thatM 0. It remains to determine the constantc. To ensure the validity of the above derivations, we need to have 1 2 ckr`(0)k 2 Lip rh ` min 2 Lip r` +c 1 ; (2.15) which is equivalent to q (c), c 2 kr`(0)k 2 Lip rh +c h kr`(0)k 2 Lip rh + 2 Lip r` ` min i ` min 0: Summarizing the above derivations, we have established the following main result which does not require further proof. Theorem 1. Under the assumptions in Subsection 2.2.2 with the constants ` min , Lip r` , Lip rh ,, , and min as given therein, letg be given by (2.11) andc> 0 be any constant satisfyingq (c) 0. Let > 0 satisfy ckr`(0)k 2 Lip rh Lip rh ` min 2 Lip r` +c 1 : (2.16) If w is a d-stationary solution of Z onW, then either p kw k 0 2 min +c 1 ; (2.17) 28 or for all i2A(w ) and all w2W, [`(w) + (g(w)h i (w) ) ] [`(w ) + (g(w )h i (w ) ) ] (c)kww k 2 2 : If the above inequality holds, then w is a minimizer ofZ (w) onW . Moreover, if (c)> 0, then w is the unique minimizer of `(w) + (g(w)h i (w) ) onW for all i2A(w ), hence the unique minimizer of Z (w) onW . Remark. As shown above,q (c) 0 if and only if (2.15) holds. With the latter inequality, we can choose > 0 so that (2.16) holds. In turn, the inequalities in (2.16) implicitly impose a condition on the four model constants: ` min , Lip r` , Lip rh , and and suggest a choice of in terms of them for the theorem to hold. As a (restricted) global minimizer of Z , w has the property that for every w 2 W , either `(w )`(w) orP (w )<P (w). In particular, ifw is not a global minimizer of the loss function ` onW , then we must haveP (w )<P (e w) where e w is the latter minimizer. In the language of multi- criteria optimization such a vector w is a Pareto point onW of the two criteria: loss and model control; specically, there does not exist aw2W such that the pair (`(w);P (w)) (`(w );P (w )). 
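As a numerical companion to the univariate example $Z_\gamma(t) = t^2 - \gamma|t|$ from the end of Subsection 2.2.3 (writing $\gamma$ for the penalty weight), the short sketch below (ours) confirms that the function is non-convex for $\gamma > 0$ while both d-stationary points $\pm\gamma/2$ attain the global minimum, the behavior that Theorem 1 formalizes under the stated conditions on the model constants.

import numpy as np

# Numerical companion (ours) to the univariate example Z_gamma(t) = t^2 - gamma*|t|
# (g(t) = |t|, h(t) = 2|t| = max(2t, -2t), so I = 2 pieces).
gamma = 1.0
Z = lambda t: t**2 - gamma * np.abs(t)

# Non-convexity: the midpoint inequality fails between -gamma/2 and gamma/2.
assert Z(0.0) > 0.5 * (Z(-gamma / 2) + Z(gamma / 2))          # 0 > -gamma^2/4

# Yet both d-stationary points +/- gamma/2 attain the global minimum value -gamma^2/4.
grid = np.linspace(-5.0, 5.0, 200001)
print("minimum over the grid     :", Z(grid).min())
print("Z(+gamma/2), Z(-gamma/2)  :", Z(gamma / 2), Z(-gamma / 2))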
The implication of the (restricted) global minimizing property of a d-stationary point is reminiscent of the class of \pseudo-convex functions" [59] which, by their denition, are dierentiable functions with the property that all its stationary points are global minimizers of the functions. There are two dierences, however. One dierence is that the function Z is only directionally dierentiable and its stationary solutions are dened in terms of the directional derivatives. Another dierence is that when > 0, a (non-)sparsity stipulation (i.e., the failure of (2.17)) on the d-stationary solution w is needed for it to be a (restricted) global minimizer. The following corollary of Theorem 1 is worth mentioning. For simplicity, we state the corollary in terms of Z on the subsetW . No proof is required. Corollary 1. Under the assumptions of Theorem 1, suppose that Lip rh = 0. Let g be given by (2.11). Let c > 0 satisfy: c 2 Lip r` ` min ` min . For any ckr`(0)k 2 , if w is a d-stationary solution of Z onW, then either Z (w)Z (w ) ` min 2 Lip r` +c 1 kww k 2 2 ; 8w2W ; (2.18) or p kw k 0 2 ( +c 1 ) min . 29 We close this section by giving a bound onkw k 0 based on the inequality (2.13). The validity of this bound does not require the positivity of ` min ; nevertheless, the sparsity bound pre-assumes a bound onkw k 2 ; this is in contrast to Theorem 1 which makes no assumption on w except for its d-stationarity. Proposition 7. Suppose that assumptions (A cvx ` ), (A L ` ), and (B) hold. Let g be given by (2.11). For every > ckr`(0)k 2 for some scalar c > 0, if w is a d-stationary solution of Z onW such thatkw k 2 ckr`(0)k 2 Lip r` h min p L + 1c 1 i for some integer L> 0 thenkw k 0 L. Proof. Assume for contradiction thatkw k 0 L+1. Thenw 6= 0; hence min p L + 1> +c 1 . We have ckr`(0)k 2 min p L + 1c 1 < min p L + 1c 1 by the choice of min kw k 0 c 1 by assumption onkw k 0 kw k 2 Lip r` by (2.13) ckr`(0)k 2 min p L + 1c 1 by assumption onkw k 2 : This contradiction establishes the desired bound onkw k 0 . 2.3 Sparsity Functions and Their DC Representations In this and the next section, we investigate the application of the results in Section 2.2 to a host of non-convex sparsity functions that have been studied extensively in the literature. These non- convex functions are deviations from the convex ` 1 -norm that is a well-known convex surrogate for the ` 0 -function. As mentioned in the introduction, we classify these sparsity functions into two categories: the exact ones and the surrogate ones. The next section discusses the class of exact sparsity functions; Section 2.5 discusses the surrogate sparsity functions. In each case, we identify the constants Lip rh , , and min of the sparsity functions and present the specialization of the results in the last section to these functions. For convenience, we summarize in Table 2.3 the sparsity functions being analyzed. The results obtained in the next two sections are summarized in Tables 2.5.6 and 2.5.6 appended at the end of the chapter. 
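As a computational preview of Table 2.3 below, the following sketch (ours, not part of the text) evaluates the two exact penalties directly and checks the tabulated dc representation $p = g - h$, together with the convexity of the subtracted part $h$, for the SCAD and capped $\ell_1$ surrogates; the names a and lam are our labels for the two SCAD parameters.

import numpy as np

# Exact penalties: P_[K](w) = ||w||_1 - (sum of K largest |w_i|) vanishes exactly on
# the vectors with at most K nonzeros; P_{l1-l2}(w) = ||w||_1 - ||w||_2 vanishes
# exactly on the vectors with at most one nonzero.
def P_K(w, K):
    absw = np.sort(np.abs(w))[::-1]
    return absw.sum() - absw[:K].sum()

def P_l1_l2(w):
    return np.abs(w).sum() - np.linalg.norm(w)

print(P_K(np.array([0.0, -2.0, 0.0, 0.7]), K=2))   # 0.0 : two nonzeros
print(P_K(np.array([0.3, -2.0, 0.1, 0.7]), K=2))   # > 0 : four nonzeros
print(P_l1_l2(np.array([0.0, 3.0, 0.0])))           # 0.0 : one nonzero

# Surrogate penalties: check p = g - h pointwise and convexity of h (nonnegative
# second differences) for SCAD and capped l1; a and lam are the two SCAD parameters.
a, lam = 3.7, 0.5

def p_scad(t):
    t = np.abs(t)
    return np.where(t <= lam, lam * t,
           np.where(t <= a * lam, (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
                    (a + 1) * lam**2 / 2))

def h_scad(t):
    t = np.abs(t)
    return np.where(t <= lam, 0.0,
           np.where(t <= a * lam, (t - lam)**2 / (2 * (a - 1)),
                    lam * t - (a + 1) * lam**2 / 2))

def p_cap(t): return np.minimum(np.abs(t) / a, 1.0)
def h_cap(t): return np.maximum.reduce([np.zeros_like(t), t / a - 1, -t / a - 1])

t = np.linspace(-5, 5, 100001)
assert np.allclose(p_scad(t), lam * np.abs(t) - h_scad(t))   # g_SCAD(t) = lam*|t|
assert np.allclose(p_cap(t), np.abs(t) / a - h_cap(t))       # g_cap(t)  = |t|/a
assert np.all(np.diff(h_scad(t), 2) >= -1e-9)                # h convex on the grid
assert np.all(np.diff(h_cap(t), 2) >= -1e-9)
print("dc representations verified")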
30 Penalty Dierence of Convex Representation K-sparsity jjwjj 1 K X k=1 jw [k] j = Lip rh = 0 ` 1 ` 2 jjwjj 1 jjwjj 2 = 1; Lip rh = 0 SCAD d X i=1 i jw i j d X i=1 8 > > > > > > < > > > > > > : 0 ifjw i j (jw i j ) 2 2 (a 1 ) if jw i j a jw i j (a + 1 ) 2 2 ifjw i j a = 0; Lip rh = 2 MCP 2 d X i=1 i jw i j d X i=1 8 > < > : w 2 i a i ifjw i j a i i 2 i jw i ja i 2 i ifjw i j a i i = 0; Lip rh = 1 Capped ` 1 d X i=1 jw i j a i d X i=1 max 0; w i a i 1; w i a i 1 = Lip rh = 0 Transformed ` 1 d X i=1 a i + 1 a i jw i j d X i=1 a i + 1 a i jw i j (a i + 1)jw i j a i +jw i j = 0; Lip rh = 2 max 1id a i + 1 a 2 i Logarithmic d X i=1 i " i jw i j d X i=1 i jw i j " i log(jw i j +" i ) + log" i = 0; Lip rh = max 1id i " i Table 2.3: Exact and surrogate penalty functions and their properties 2.4 Exact Sparsity Functions 2.4.1 K-sparsity function While the class of exact K-sparsity functions has been discussed exclusively in the context of matrix optimization, for the sake of completeness, we formally dene these functions for vectors and highlight their properties most relevant to our discussion. For an m-vector w = (w i ) d i=1 , 31 let [jwj], (w [i] ) d i=1 be derived from w by ranking the absolute values of its components in non- increasing order: max 1id jw i j,jw [1] jjw [2] jjw [m] j, min 1id jw i j; thusjw [k] j is thekth largest of the m components of w in absolute values andf[1]; ; [m]g is an arrangement of the index set f1; ;dg. For a xed positive integer K, let P [K] (w), kwk 1 |{z} g [K] (w) K X k=1 jw [k] j | {z } h [K] (w) = d X k=K+1 jw [k] j: Clearly P [K] (w) = 0 if and only if w has no more than K nonzero components. The convexity of h [K] , thus the dc-property of P [K] , follows from the value-function representation below: h [K] (w) , K X k=1 jw [k] j = maximum v2 K;m d X i=1 v i jw i j; where K;m , ( v2 [ 0; 1 ] d j d X i=1 v i = K ) : (2.19) Since h [K] is piecewise linear, it is LC 1 with Lip rh = 0; moreover, we have = 0. In order to calculate the constant associated with the function h [K] , it would be useful to express h [K] (w) as the pointwise maximum of nitely many linear functions. For this purpose, letE( K;m ) be the nite set of extreme points of the polytope K;m . We then have h [K] (w) = max ( d X i=1 v i i w i j v2E( K;m ); 2f1g d ) Note that if v2E( K;m ) then each component of v is either 0 or 1 and there are exactly K ones. For a given pair (v;)2E( K;m )f1g d , the linear function h v; :w7! d X i=1 v i i w i has gradient given by v, where is the notation for the Hadamard product of two vectors. Hence, kr 6=0 h v; (w)k 2 = s X i :w i 6=0 v 2 i 2 i = p min(K;kwk 0 ): Since for any nonzero vector w and for any 2@g [K] (w),k 6=0 k 2 = p kwk 0 , it follows that for all w2W, all 2@g [K] (w), and all pairs (v;)2E( K;m )f1g d , k 6=0 r 6=0 h v; (w)k 2 k 6=0 k 2 kr 6=0 h v; (w)k 2 8 > < > : = 0 ifkwk 0 K p K + 1 p K ifkwk 0 K + 1: 32 Since each function h v; is ane, Proposition 6 applies. In order to identify the vectors w2W that share a common piece with w , i.e., w2W , we arrange the components ofjw j as follows: jw [1] jjw [s ] j > jw [s +1] j = =jw [K] j = =jw [t 1] j | {z } all equal tojw [K] j > jw [t ] jjw [m] j; and let I > ,f [1]; [s ]g; I = ,f [s + 1]; ; [t 1]g; and I < ,f [t ]; ; [m]g: Vectors w that share a common piece with w are those such that P [K] (w) = X i2I < jw i j + X i2J = jw i j whereJ = is any subset ofI = with t (K + 1) elements. Let Z [K]; (w),`(w) + P [K] (w). Theorem 2. Assume that conditions (A L ` ), (A cvx ` ), and (B) hold. 
For a xed integer K 1 and scalar > 0, the following two statements hold for a d-stationary pointw of the functionZ [K]; (w) onW. (a) w is a global minimizer of Z [K]; (w) on the subset of vectors w2W that share a common piece with w . (b) Suppose that > ckr`(0)k 2 for some c > 0 satisfying p K + 1 > p K +c 1 . Ifkw k 2 ckr`(0)k 2 Lip r` h p K + 1 p Kc 1 i , thenkw k 0 K. Proof. Statement (a) follows from Proposition 6 with Lip rh = 0. Statement (b) follows from the inequalitykw k 2 Lip r` k 6=0 r 6=0 h v; (w)k 2 c 1 by a contrapositive argument and the lower bound onk 6=0 r 6=0 h v; (w)k 2 ; cf. the proof of Proposition 7. 2.4.2 The ` 1 ` 2 function When K = 1, the zeros of the function P [1] (w) = kwk 1 kwk 1 are the 1-sparse vectors. It turns out that the function P ` 12 (w),kwk 1 kwk 2 has the same property as P [1] (w) in this regard. Nevertheless, structurally, these two functions are dierent: specically, P [1] is a piecewise linear function whereas the` 2 -function inP ` 12 is not piecewise smooth although it is dierentiable everywhere except at the origin. As such, Corollary 1 is applicable. As shown previously, condition (A L h ) holds with Lip rh = 0 and = 1. Moreover, for any w6= 0,krkwk 2 k 2 = 1. Corresponding to P ` 12 , we write Z ` 12 ; (w),`(w) + P ` 12 (w). 33 Theorem 3. Assume conditions (A L ` ), (A cvx ` ), and (B). Let c> 0 satisfy: c h 2 Lip r` ` min i ` min : For any >ckr`(0)k 2 , if w is a d-stationary solution of Z ` 12 ; onW, then either p kw k 0 2 ( 1 +c 1 ), or Z ` 12 ; (w)Z ` 12 ; (w ) ` min 2 Lip r` 1 +c 1 kww k 2 2 ; 8w2W: (2.20) Remark. The inequality (2.20) is global on the entire setW. We should point out that there are other piecewise linear functions whose zeros are 1-sparse but are not necessarily exact sparsity functions. Indeed, for any nite subset V of the unit Euclidean ball, the function P V (w) =kwk 1 max v2V v T w is one such function. This can be seen from the inequality P V (w)kwk 1 kwk 2 , from which it follows that the zeros of P V (w) must be 1-sparse. Nevertheless, the 1-sparse vectors are not necessarily the zeros of P V (w) if the subset V is not properly chosen. Yet a result similar to Theorem 2 can be derived for the function P V . 2.5 Surrogate Sparsity Functions We next examine several inexact sparsity functions. All such functions to be examined are of the folded concave type [36] and separable so thatP (w) = d X i=1 p i (w i ) with eachp i (w i ) representable as the dierence of a convex and a pointwise maximum of nitely many dierentiable convex function; cf. 2.2); i.e.,p i (w i ) =g i (w i ) max 1jI i h i;j (w i ) whereg i and eachh i;j are univariate convex functions with h i;j dierentiable. For each sparsity function examined below, we discuss the applicability of the results in Section 2.2. 2.5.1 The SCAD family Foremost among the separable sparsity functions is the SCAD family (for smoothly clipped absolute deviation) [33, 35, 58]. Parameterized by two scalars a > 2 and > 0 and with the origin as its unique zero, this univariate function is once continuously dierentiable except at the origin and 34 given by: for all t2R, p SCAD a; (t), 8 > > > > > > > < > > > > > > > : jtj ifjtj (a + 1 ) 2 2 (ajtj ) 2 2 (a 1 ) if jtj a (a + 1 ) 2 2 ifjtj a: The representation of this function as a dc functiong SCAD (t)h SCAD a; (t) withh SCAD a; (t) being dier- entiable andg SCAD (t) being a multiple of the absolute-value function is known [41]. [Nevertheless, Theorem 4 below is new.] 
Specically, we may take g SCAD (t), jtj and h SCAD a; (t), 8 > > > > > > > < > > > > > > > : 0 ifjtj (jtj ) 2 2 (a 1 ) if jtj a jtj (a + 1 ) 2 2 ifjtj a The function h SCAD a; (t) is continuously dierentiable with derivative given by dh SCAD a; (t) dt = 8 > > > > > > < > > > > > > : 0 ifjtj jtj a 1 sign(t) if jtj a sign(t) ifjtj a It is not dicult to see that the function h SCAD a; (t) is LC 1 onR with its derivative being Lipschitz continuous with a Lipschitz constant of 2 (recall thata> 2). To verify the latter property, we need to show that dh SCAD a; (t) dt dh SCAD a; (t 0 ) dt 2jtt 0 j for two arbitrary scalars t and t 0 . The derivation below establishes this for t and t 0 satisfying at and t 0 a: dh SCAD a; (t) dt dh SCAD a; (t 0 ) dt = t + a 1 t 0 a 1 = tt 0 a 1 2 a 1 2jtt 0 j because a > 2. The same inequality can be derived for all other cases of the pair (t;t 0 ). As a consequence of this LC 1 property, Proposition 6 is applicable to the SCAD family of surrogate 35 sparsity functions Z SCAD ;a; (w), `(w) + P SCAD a; (w), where for given positive constantsfa i ; i g d i=1 with a i > 2 for all i, P SCAD a; (w), d X i=1 i jw i j | {z } weighted ` 1 d X i=1 h SCAD a i ; i (w i ) | {z } h SCAD a; (w) : The cited proposition yields the strong convexity of the objective Z SCAD ;a; (w) onR d , provided that ` min =2. To obtain a bound onkw k 0 from Theorem 1, we evaluatek 6=0 r 6=0 h SCAD a; (w)k 2 2 for a subgradient 2@g SCAD (w). We have k 6=0 r 6=0 h SCAD a; (w)k 2 2 = X k :w k 6= 0 k sign(w k ) dh SCAD a k ; k (w k ) dw k ! 2 = X k : 0<jw k j< k 2 k + X k : k jw k ja k k k jw k j k a k 1 2 = X k : 0<jw k j< k 2 k + X k : k jw k ja k k a k k jw k j a k 1 2 min 1jd 2 j provided that there is one k such that 0<jw k j< k (2.21) Theorem 4. Assume conditions (A L ` ), (A cvx ` ), and (B). Letfa k ; k g d k=1 be positive scalars such thata k > 2 for allk. For every positive scalar 1 2 ` min , ifw is a d-stationary point ofZ SCAD ;a; (w) onW, then w is a minimizer of this function onW; more precisely, Z SCAD ;a; (w)Z SCAD ;a; (w ) ` min 2 kww k 2 2 ; 8w2W: Moreover, if >ckr`(0)k 2 for some c> 0 and kw k 2 ckr`(0)k 2 Lip r` min 1jd j c 1 ; (2.22) then for everyk = 1; ;d, eitherw k = 0 orjw k j k ; hence p kw k 0 kr`(0)k 2 Lip r` 2 4 c 1 min 1jd j 3 5 . Proof. It suces to show the last two assertions of the theorem. Assume the choice of and the bound onkw k 2 . There is nothing to prove ifw = 0. Otherwise, we must have min 1jd j c 1 > 0. Moreover, if there existsk such that 0<jw k j< k , thenk 6=0 r 6=0 h SCAD a; (w)k 2 min 1jd j . This 36 contradicts the inequality: kw k 2 Lip r` h k 6=0 r 6=0 h SCAD a; (w)k 2 c 1 i . To complete the proof of the theorem, letkw k 0 =K. Then p K min 1jd j kw k 2 ckr`(0)k 2 Lip r` min 1jd j c 1 ; from which the desired bound on K follows. Remarks. The two conditions: ckr`(0)k 2 < 1 2 ` min and (2.22) together yield min 1jd j 2kr`(0)k 2 ` min , which oers a guide in choosing the parameters j in the SCAD function so that Theorem 4 is applicable to the function Z SCAD ;a; (w). The recipe of deriving a bound onkw k 0 from the individual components w k persists in the later results; details will not be repeated. 2.5.2 The MCP family Next we discuss the MCP (for minimax concave penalty) family of surrogate sparsity functions [85]. 
Also parameterized by two positive scalars a> 2 and , the building block of these functions is the univariate, piecewise quadratic function: for t2R, p MCP a; (t), a 2 [ (ajtj ) + ] 2 a : Similar to the SCAD decomposition, we may take g MCP (t), 2jtj and h MCP a; (t) , 8 > > < > > : t 2 a ifjtj a 2jtja 2 ifjtj a Moreover, the function h MCP a; (t) is convex and continuously dierentiable with derivative given by dh MCP a; (t) dt = 8 > > < > > : 2t a ifjtj a 2 sign(t) ifjtj a: Moreover, using the fact that a> 2, we can verify that, for any two scalars t and t 0 , dh MCP a; (t) dt dh MCP a; (t 0 ) dt jtt 0 j: 37 The MCP sparsity function is dened as follows: for some positive constantsfa i ; i g d i=1 witha i > 2 for all i, P MCP a; (w), 2 d X i=1 i jw i j d X i=1 h MCP a i ; i (w i ) | {z } h MCP a; (w) : At this point, we can proceed similarly to the SCAD function and obtain a result analogous to Theorem 4. In particular, since h MCP a; is dierentiable, the globally minimizing property of a stationary solution is on the entire setW. We omit the details. 2.5.3 The capped ` 1 family One distinction of this family from the previous two families is that the capped ` 1 functions are piecewise linear; thus Propositions 6 and 7 are applicable. The building block of this family of functions is: for a given scalar a> 0 and for all t2R, p capL1 a (t), min jtj a ; 1 = jtj a max 0; jtj a 1 : This leads to the surrogate penalty function: given positive scalarsfa i g d i=1 and for all t2R, P capL1 a (w), d X i=1 min jw i j a i ; 1 = d X i=1 jw i j a i | {z } g capL1 a (w) d X i=1 max 0; w i a i 1; w i a i 1 | {z } h capL1 a (w) We have the following result for the functionZ capL1 ;a (w),`(w) + P capL1 a (w). Due to the piecewise property of the function h capL1 a (w), the minimizing property is of the restricted kind for vectors sharing a common piece with the given stationary solution on hand. Theorem 5. Assume conditions (A L ` ), (A cvx ` ), and (B). Letfa k g d k=1 be positive scalars. The following two statements hold for a d-stationary point w of Z capL1 ;a (w) onW for any > 0. (a) w is a global minimizer ofZ capL1 ;a (w) on the subset of vectors ofW that share a common piece with w . (b) If >ckr`(0)k 2 for somec> 0 andkw k 2 ckr`(0)k 2 Lip r` min 1jd 1 a j c 1 , then for every k = 1; ;d, eitherw k = 0 orjw k ja k ; thus p kw k 0 ckr`(0)k 2 min 1jd a j Lip r` min 1jd 1 a j c 1 . 38 2.5.4 The transformed ` 1 -family Employed recently by [86], this function is given by: for a given a> 0 and for t2R, p TL1 a (t), (a + 1 )jtj a +jtj which has the dc decomposition p TL1 a (t) = a + 1 a jtj | {z } g TL1 a (t) a + 1 a jtj (a + 1 )jtj a +jtj | {z } h TL1 a (t) : The univariate function h TL1 a (t) is strictly convex and (innitely many times) dierentiable on the real line as can be seen from its rst and second derivatives: dh TL1 a (t) dt = a + 1 a a (a + 1 ) (a +jtj ) 2 sign(t) and d 2 h TL1 a (t) dt 2 = 2a (a + 1 ) (a +jtj ) 3 : The second derivative also shows that dh TL1 a (t) dt is Lipchitz continuous with modulus 2 (a + 1 ) a 2 . 
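The claimed Lipschitz modulus $2(a+1)/a^2$ of $dh^{\mathrm{TL1}}_a/dt$ can be checked numerically; the brief sketch below (ours) approximates the derivative on a fine grid and compares its largest difference quotient with the modulus.

import numpy as np

# Quick numerical check (ours) of the modulus 2*(a+1)/a**2 derived above for the
# derivative of h^{TL1}_a(t) = ((a+1)/a)*|t| - (a+1)*|t|/(a+|t|).
a = 2.0
h = lambda t: (a + 1) / a * np.abs(t) - (a + 1) * np.abs(t) / (a + np.abs(t))

t = np.linspace(-10.0, 10.0, 200001)
dh = np.gradient(h(t), t)                       # numerical derivative of h
slopes = np.abs(np.diff(dh)) / np.diff(t)       # difference quotients of dh/dt
print("largest observed slope of dh/dt:", slopes.max())
print("claimed modulus 2(a+1)/a^2     :", 2 * (a + 1) / a**2)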
For given positive parametersfa i g d i=1 , and with g TL1 a (w), d X i=1 a i + 1 a i jw i j and h TL1 a (w), d X i=1 a i + 1 a i jw i j (a i + 1 )jw i j a i +jw i j ; we have for any 2@g TL1 a (w), 6=0 r 6=0 h TL1 a (w) 2 = v u u t X i :w i 6=0 a i (a i + 1 ) (a i +jw i j ) 2 2 1 4 1 + min 1id 1 a i if9k such that 0 <jw k j <a k : Taking Lip rh = 2 max 1id a i + 1 a 2 i , we obtain a result for the function Z TL1 ;a (w), `(w) + P TL1 a (w), where P TL1 a (w), d X i=1 (a i + 1 )jw i j a i +jw i j that is similar to Theorem 4 for the function Z SCAD ;a; and Theorem 5 for the function Z capL1 ;a (w). Theorem 6. Assume conditions (A L ` ), (A cvx ` ), and (B). Letfa k g d k=1 be positive scalars. For every positive scalars c and such that ckr`(0)k 2 ` min 2 max 1id a i + 1 a 2 i ; 39 ifw is a d-stationary point ofZ TL1 ;a (w) onW, thenw is a minimizer of this function onW; more precisely, Z TL1 ;a (w)Z TL1 ;a (w ) ` min 2 max 1id a i + 1 a 2 i kww k 2 2 ; 8w2W: Moreover, if > ckr`(0)k 2 andkw k 2 ckr`(0)k 2 Lip r` 1 4 1 + min 1id 1 a i c 1 , then for every k = 1; ;d, either w k = 0 orjw k ja k . 2.5.5 The logarithmic family Introduced in [20] as an optimization formulation for the reweighted ` 1 procedure, and studied in particular in [41, 54], this family of functions is built from the univariate function: given positive scalars and ", p log ;" (t), log(jtj +") log"; t2 R which has the dc decomposition p log ;" (t) = " jtj jtj " log(jtj +") + log" | {z } h log ;" (t) : [Although this logarithmic function fails to satisfy the \continuity property" in the set of postulates of [33], we include it here to illustrate the breadth of our framework.] The univariate function h log ;" is strictly convex and (innitely many times) dierentiable on the real line as can be seen from its rst and second derivatives: dh log ;" (t) dt = t " (" +jtj ) and d 2 h log ;" (t) dt 2 = (" +jtj ) 2 ; 8t2 R: The second derivative also shows that dh log ;" (t) dt is Lipchitz continuous with modulus=" 2 . For given positive parametersf i ;" i g d i=1 , and with g log ;" (w), d X i=1 i " i jw i j and h log ;" (w), d X i=1 i jw i j " i log(jw i j +" i ) + log" i ; we have for any 2@g log ;" (w), 6=0 r 6=0 h log ;" (w) 2 = v u u t X i :w i 6=0 2 i (" i +jw i j ) 2 min 1id i 2" i if9k such that 0 <jw k j <" k : 40 Similar to previous results for the functionsZ SCAD ;a; (w) andZ TL1 ;a (w), the result below pertains to a d-stationary point of the functionZ log ;;" (w),`(w)+ P log ;" (w), whereP log ;" (w), d X i=1 i log(jw i j+" i ). Theorem 7. Assume conditions (A L ` ), (A cvx ` ), and (B). Letfa k ;" k g d k=1 be positive scalars. For every positive scalars c and such that ckr`(0)k 2 ` min max 1id i " 2 i ; ifw is a d-stationary point ofZ log ;;" (w) onW, thenw is a minimizer of this function onW; more precisely, Z log ;;" (w)Z log ;;" (w ) 1 2 ` min max 1id i " 2 i kww k 2 2 ; 8w2W: Moreover, if > ckr`(0)k 2 andkw k 2 ckr`(0)k 2 Lip r` min 1id i 2" i c 1 , then for every k = 1; ;d, either w k = 0 orjw k j" k . Hence, p kw k 0 ckr`(0)k 2 min 1d " i Lip r` min 1id i 2" i c 1 : 2.5.6 Some comments on results We note that in the statistical learning literature [90, 94] among others, the bounds on the estimators from minimizing a regularized cost function are probabilistic and typically in the limit of large sample size or dimension. In contrast, our sparsity bounds in Proposition 7, Theorems 2, 3, 4, 5, 6, and 7 on a d-stationary point are deterministic and hold in any xed sample size and dimension. 
Moreover, these results are explicit and helpful for analyzing numerical algorithms. In a set of regularized least-squares tests, we nd that our sparsity bounds are robust and can be still valid when the theoretical sucient conditions are not satised. In this regard, we should emphasize that these conditions are necessarily conservative and an advanced analysis could yield better sparsity bounds for targeted sparsity functions. 41 Result Condition on constants Conclusions Thm. 2 (K-sparsity) w is a minimizer onW Thm. 2 (K-sparsity) cjjr`(0)jj2 < p K + 1> p K +c 1 kw k2 ckr`(0)k2 Lip r` h p K + 1 p Kc 1 i ) kw k0K Thm. 3 (`1`2) cjjr`(0)jj2 < c ` min 2Lip r` ` min Either p kw k0 2(1 +c 1 ) or w is a global minimizer onW Thm. 4 (SCAD) 1 2 ` min w is a global minimizer onW Thm. 4 (SCAD) ckr`(0)k2 < 1 2 ` min kw k2 ckr`(0)k2 Lip r` min 1jd jc 1 ) p kw k0 kr`(0)k2 Lip r` " c min 1jd j 1 # Thm. 5 (Capped `1) w is a global minimizer onW Thm. 5 (Capped `1) ckr`(0)k2 < kw k2 ckr`(0)k2 Lip r` min 1jd 1 aj c 1 ) p kw k0 ckr`(0)k2 min 1jd aj Lip r` min 1jd 1 aj c 1 Thm. 6 (Transformed `1) cjjr`(0)jj2 ` min 2 max 1id ai + 1 a 2 i w is a global minimizer onW Thm. 6 (Transformed `1) cjjr`(0)jj2 < ` min 2 max 1id ai + 1 a 2 i kw k2 ckr`(0)k2 Lip r` 1 4 1 + min 1id 1 ai c 1 )8i = 1; ;d, either w i = 0 orjw i jai Thm. 7 (Logarithmic) cjjr`(0)jj2 ` min max 1id i "i w is a global minimizer onW Thm. 7 (Logarithmic) ckr`(0)k2 < ` min max 1id i "i kw k2 ckr`(0)k2 Lip r` min 1id i 2"i c 1 ) p kw k0 ckr`(0)k2 min 1d "i Lip r` min 1id i 2"i c 1 Table 2.5.6: Summary of results|Specialized q (c), c 2 kr`(0)k 2 Lip rh +c kr`(0)k 2 Lip rh + 2 Lip r` ` min ` min 42 Result Condition on constants Conclusions Thm. 1 cjjr`(0)jj2Lip rh Lip rh ` min 2Lip r` +c 1 q(c) 0 Either p jjw jj0 2 min +c 1 or w is a minimizer onW, unique if Lip rh < ` min 2Lip r` +c 1 Prop. 7 cjjr`(0)jj2 < kw k2 ckr`(0)k2 Lip r` h min p L + 1c 1 i ) kw k0L. Table 2.5.6: Summary of results{General 43 Chapter 3 Structural Properties of Ane Sparsity Constraints In many applications, logical conditions among the variables need to be considered, in order to ensure a meaningful and interpretable statistical model, or to exploit domain knowledge to increase model delity. Such logical conditions could be that certain variables are allowed to become active only if certain other variables are already in the model [10], or that variables in the model must come from a small number of pre-dened group [49]. Many of the existing constraints on the relationships among the model variables, including the above examples, can be formulated with the ` 0 -function. We introduce a new type of the constraint system called ane sparsity constraints (ASC) which is a system of linear inequalities involving the` 0 -function developed to faithfully formulate the logical conditions among the model variables as hard constraints. As the solution set of an optimization problem involving ASC may not be closed, we aim to study the fundamental properties of the new type of constraint system such as closedness of the set and the characterization of the closure. Fur- thermore, for practical computational considerations, such constraint system can be approximated by replacing the ` 0 -function with the continuous surrogate functions. The primary interest of our study lies in understanding the convergence of such an approximated continuous system to the given discontinuous ASC system. 
Overall, this chapter is devoted to the study of the fundamental structural properties of affine sparsity constraints in preparation for subsequent research on solving optimization problems subject to these constraints.

Notation. We define some notation used in this chapter. It is important to clarify the use of the notation $\|x\|_0$. In the previous chapter, this notation represents the cardinality of the nonzero components of the given vector $x$; i.e., $\|x\|_0$ is a scalar. In the current chapter, $\|x\|_0$ is a binary vector whose individual components are 1 (or 0) if the corresponding components of $x$ are nonzero (or zero). For a given subset $\mathcal{J}$ of $\{1,\dots,d\}$ and a vector $x \in \mathbb{R}^d$, $x_{\mathcal{J}}$ denotes the sub-vector of $x$ with components indexed by the elements of $\mathcal{J}$. For a matrix $A \in \mathbb{R}^{m \times d}$ and the same subset $\mathcal{J}$, $A_{\mathcal{J}}$ denotes the columns of $A$ indexed by $\mathcal{J}$. A similar notation applies to the rows of $A$. The support of a vector $x$ is the index set of the nonzero components of $x$ and is denoted by $\operatorname{supp}(x)$. For any matrix $A \in \mathbb{R}^{m \times d}$, the matrices $A^+$ and $A^-$ are the entry-wise nonnegative and nonpositive parts of $A$, such that for all $i$ and $j$,
$$[A^+]_{ij} \,\triangleq\, \max(A_{ij},\,0) \quad \text{and} \quad [A^-]_{ij} \,\triangleq\, \max(-A_{ij},\,0). \tag{3.1}$$
Thus we have the decomposition $A = A^+ - A^-$, where $A^{\pm}$ are nonnegative matrices. A similar definition applies to a vector $x$. For any $x \in \mathbb{R}^d$, $\mathbb{B}_n(x;r)$ is an open ball (in a suitable norm) centered at $x$ with radius $r$. For any set $S \subseteq \mathbb{R}^d$, $\operatorname{cl}(S)$ denotes its closure. A vector of all ones of a given dimension is written as $\mathbf{1}$, with the dimension omitted in the notation.

3.1 ASC Systems: Introduction and Preliminary Discussion

Given a matrix $A \in \mathbb{R}^{m \times d}$ and a vector $b \in \mathbb{R}^m$, the ASC system is defined as the problem of finding a vector $w \in \mathbb{R}^d$ such that
$$A \, \|w\|_0 \,\le\, b \;\Longleftrightarrow\; \left\{\, \sum_{j=1}^{d} A_{ij}\, |w_j|_0 \,\le\, b_i, \ \text{ for all } i = 1,\dots,m \right\}. \tag{3.2}$$
The (possibly empty) solution set of this system is denoted by SOL-ASC$(A,b)$. Clearly, the system (3.2) can be written as
$$A^+ \|w\|_0 \,\le\, A^- \|w\|_0 + b.$$
A particularly important special case is when $A$ is a 0-1 matrix, i.e., all its entries are either 0 or 1, and $b$ is a positive integral vector. In this case, the constraint $\sum_{j=1}^{d} A_{ij}\,|w_j|_0 \le b_i$ becomes
$$\sum_{j \in \mathcal{I}_i} |w_j|_0 \,\le\, b_i, \quad \text{where } \mathcal{I}_i \,\triangleq\, \{\, j \mid A_{ij} \neq 0 \,\},$$
which is an upper cardinality constraint stipulating that, for each $i = 1,\dots,m$, the number of nonzero components $w_j$ with $j \in \mathcal{I}_i$ is no more than the given cardinality $b_i$ (assumed integer). Cardinality constraints of this type have been studied in [8, 11, 14, 37, 91] using various reformulations; applications of such constraints to sparse portfolio selection can be found in [8, 11, 13, 23]. Needless to say, the ASC$(A,b)$ is closely related to the system of linear inequalities $Aw \le b$, $w \in \mathbb{R}^d$. Nevertheless, there are many important differences. One obvious difference is that the solution set of the former system is a cone, possibly with the origin omitted; i.e., the following clearly holds:
$$w \in \text{SOL-ASC}(A,b) \;\Longrightarrow\; \tau\, w \in \text{SOL-ASC}(A,b) \ \text{ for all scalars } \tau \neq 0;$$
thus, under no other restriction on $w$, a nonzero SOL-ASC$(A,b)$ is always an unbounded set. In contrast, a polyhedron, which is the solution set of a system of linear inequalities, does not have this scaling property in general.
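In computational terms, membership in SOL-ASC$(A,b)$ is straightforward to test once the 0/1 indicator vector is formed; the sketch below (ours; the helper names are not from the text) illustrates this for a single upper cardinality constraint.

import numpy as np

# Sketch (ours): membership in SOL-ASC(A, b) is tested by forming the 0/1 indicator
# vector ||w||_0 and checking the linear inequalities A @ ||w||_0 <= b componentwise.
def l0_vector(w):
    return (np.abs(w) > 0).astype(int)

def in_sol_asc(w, A, b):
    return bool(np.all(A @ l0_vector(w) <= b))

# Upper cardinality constraint |w_1|_0 + |w_2|_0 + |w_3|_0 <= 2 (a single 0-1 row):
A = np.array([[1, 1, 1]])
b = np.array([2])
print(in_sol_asc(np.array([1.0, 0.0, -3.0]), A, b))   # True : two nonzeros
print(in_sol_asc(np.array([1.0, 0.5, -3.0]), A, b))   # False: three nonzeros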
Further, an important feature that distinguishes the general ASC$(A,b)$, where $A$ can have positive and negative entries, from the special case of a cardinality constraint system, where $A$ is nonnegative, is that the solution set of the latter system must be closed (due to the lower semicontinuity of the $\ell_0$-function), while SOL-ASC$(A,b)$ is not necessarily so in the general case, as illustrated by the simple Example 8 below.

Example 8. Consider the simple case where $A = [\,1 \ \ {-1}\,]$ and $b = 0$, yielding the system $|w_1|_0 \le |w_2|_0$. It is not difficult to see that SOL-ASC$(A,b)$ is the entire plane $\mathbb{R}^2$ except for the two half $w_1$-axes; i.e., SOL-ASC$(A,b) = \{\, (w_1,w_2) \in \mathbb{R}^2 \mid w_2 \neq 0 \,\} \cup \{(0,0)\}$, which is obviously not closed.

As we shall see later, the non-closedness of SOL-ASC$(A,b)$ is due to the presence of some negative entries in the matrix $A$. An extreme case of this is when the entries of $A$ are all either 0 or $-1$ and $b$ is a negative integral vector. In this case, we obtain a lower cardinality constraint of the form
$$\sum_{j \in \mathcal{I}_i} |w_j|_0 \,\ge\, |b_i|, \quad \text{where } \mathcal{I}_i \,\triangleq\, \{\, j \mid A_{ij} < 0 \,\},$$
that has minimally been studied in the literature to date. By imposing no sign restrictions on the matrix $A$, our treatment goes far beyond these special cases and accommodates recent interests in statistical variable selection subject to logical constraints. Another important difference between SOL-ASC and a polyhedron is their respective tangent cones; see Section 3.4 for details.

It is natural to consider an extension of the ASC by including continuous variables; specifically, let $b : \mathbb{R}^k \to \mathbb{R}^m$ be a given mapping and let $\mathcal{M}$ be a closed convex set in $\mathbb{R}^k$. Defined by the triple $(A, b, \mathcal{M})$, the extended ASC (xASC) system is the problem of finding a pair $(w, \xi) \in \mathbb{R}^d \times \mathcal{M}$ such that
$$A \, \|w\|_0 \,\le\, b(\xi).$$
The (possibly empty) solution set of this system is denoted by SOL-xASC$(A, b, \mathcal{M})$. Subsequently, we will discuss how results for the ASC system can be extended to the xASC system, and show how these extended systems could arise in the approximation of the ASC (see Subsection 3.3.2).

3.1.1 Source problems

In general, logical conditions on the sparsity of the model variables can be modeled by affine constraints using the binary $\ell_0$-indicators of these variables. In what follows we present two models of statistical regression with logical conditions on the unknown parameters.

Hierarchical variable selection. Consider the following regression model with interaction terms [10]:
$$y \,=\, \sum_{i=1}^{d} w^{(1)}_i x_i \,+\, \sum_{1 \le i < j \le d} w^{(2)}_{ij} x_i x_j \,+\, \varepsilon, \tag{3.3}$$
where $x \in \mathbb{R}^d$ is the vector of model inputs, $y \in \mathbb{R}$ is the (univariate) model output, the $w^{(1)}_i$ and $w^{(2)}_{ij}$ are the unknown model parameters to be estimated, and $\varepsilon$ is the (random) error of the model. It is common practice in the variable selection process to maintain certain hierarchical conditions (also called "heredity constraints" or "marginality" in the literature [10, 46, 61]) between the coefficients of the linear terms, $w^{(1)}_i$, and those of the interaction terms, $w^{(2)}_{ij}$. There are two types of hierarchical conditions. The strong hierarchical condition means that an interaction term can be selected only if both of the corresponding linear terms are selected, i.e., $|w^{(2)}_{ij}|_0 \le \min\{ |w^{(1)}_i|_0, |w^{(1)}_j|_0 \}$ for any $i < j$, while the weak hierarchical condition means that an interaction term can be selected only if one of the corresponding linear terms is selected, i.e., $|w^{(2)}_{ij}|_0 \le |w^{(1)}_i|_0 + |w^{(1)}_j|_0$. Clearly, both conditions can be represented by linear inequalities in $\|w\|_0$, where $w$ is the concatenated vector of $w^{(1)}$ and $w^{(2)}$.
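To make the hierarchical conditions concrete, the following sketch (ours) assembles the rows of the corresponding ASC system $A\|w\|_0 \le 0$ for a small instance of model (3.3); the variable ordering and helper names are illustrative assumptions, not taken from the text.

import numpy as np

# Sketch (ours): encode the hierarchy conditions of model (3.3) as rows of an ASC
# system A ||w||_0 <= 0, with w the concatenation of (w^(1)_1,...,w^(1)_d) and the
# interaction coefficients w^(2)_{ij}, i < j.  Strong hierarchy needs two rows per
# interaction (one per parent), weak hierarchy needs one.
d = 3
pairs = [(i, j) for i in range(d) for j in range(i + 1, d)]
n_vars = d + len(pairs)

rows_strong, rows_weak = [], []
for k, (i, j) in enumerate(pairs):
    col = d + k                       # column of w^(2)_{ij} in the concatenation
    for parent in (i, j):             # |w2_ij|_0 - |w1_parent|_0 <= 0
        r = np.zeros(n_vars); r[col] = 1; r[parent] = -1
        rows_strong.append(r)
    r = np.zeros(n_vars); r[col] = 1; r[i] = -1; r[j] = -1
    rows_weak.append(r)               # |w2_ij|_0 - |w1_i|_0 - |w1_j|_0 <= 0

A_strong, A_weak = np.array(rows_strong), np.array(rows_weak)
z = lambda w: (np.abs(w) > 0).astype(int)

# w^(1) = (1, 0, 2) and only the interaction w^(2)_{13} nonzero: both hierarchies hold.
w = np.array([1.0, 0.0, 2.0, 0.0, 1.5, 0.0])
print(np.all(A_strong @ z(w) <= 0), np.all(A_weak @ z(w) <= 0))   # True True

# Make w^(2)_{12} nonzero while w^(1)_2 = 0: strong hierarchy fails, weak still holds.
w[3] = 0.3
print(np.all(A_strong @ z(w) <= 0), np.all(A_weak @ z(w) <= 0))   # False True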
Group variable selection. In many applications, variables are naturally divided into groups. There are dierent versions of grouped variable selection, and a selective review is given in [49]; see also [50, 84]. Here we consider two versions of this problem. One version is a variation of the group lasso discussed in [47, Section 4.3] where it is stated that: It is desirable to have all coecients within a group to become zero (or nonzero) simultaneously. The formulation in the cited reference is as follows: instead of the basic linear regression model: y = d X i=1 w i x i +"; (3.4) that expresses the model output directly as a combination of the core predictorsfx i g d i=1 , an aggre- gate model: y = J X j=1 (z j ) T j +" 0 in terms of theJ groupf j g J j=1 of (unknown) variates is postulated, where each j 2R p j represents a group ofp j regression coecients among thew i 's, with the vectorsz j being the known covariates in group j. A convex least-squares objective function [47, expression (4.5)] that contains the penalty J X j=1 k j k 2 is minimized whereby the unknown variables j are obtained. While this provides a plausible approach to the group selection process, it does not exactly model the desired grouping as stipulated above with reference to the basic model (3.4) that is at the level of the individual predictors x i 's for i = 1; ;d. Instead, we may minimize a combined (Lagrangian) objective f (w),`(w) +P (w), which comprises of a loss function weighed by a sparsity penalty function, both in terms of the original variable w, (w i ) d i=1 , subject to the grouping conditions that can be formulated as follows. For each j = 1; ;J, letG j be the subset off1;ng containing indices i such that the variable w i in group j; thusf1; ;dg = J [ j=1 G j . Consider the system with some auxiliary group variables j : jw i j 0 = j 8i2G j and j = 1; ;J; which can easily be seen to model exactly the desired grouping requirement. Clearly, the above is an xASC system in the pair of variables (w;). Note that no constraints are imposed on j . When 48 the` 0 -function is subsequently approximated by a surrogate function, properties of such a function (e.g., nonnegativity and upper bounds) will naturally transfer to restrictions on j . Consider the alternative stipulation of choosing the variables w i so that the number of groups covering all nonzero components is minimal. Together with the Lagrangian function f (w) , `(w) +P (w), the optimization problem of variable selection is: given positive coecientsfc j g J j=1 , minimize w; f (w) + J X j=1 c j j subject to jw i j 0 X j :i2G j j ; 8i = 1; ;d and j 2f 0; 1g; j = 1; J; (3.5) where each xASC models the coverage of the variate w i in the groups that contain them; i.e., the variable w i is selected only if at least one of the groups containing predictor i is selected. The minimization of the (weighed) sum of the binary variables j is a slight generalization of the goal of selecting the minimum number of groups in the coverage of all the nonzero w i 's which is the special case with equal weights. Clearly, the relaxation of the binary condition j 2f0; 1g to the continuous condition j 2 [0; 1] leads to an xASC system in the pair of variables (w;). A special case: Tree structure. In what follows, we discuss a special case of the group selection problem that justies the relaxation of the 0-1 restriction on the group indicator variables j in the problem (3.5). 
The key to this justication is the fact that for xed w, the constraints of the problem are those of a well-known set covering problem [65, Part III]. As such the theory of the latter problem can be applied. In turn, this application is based on the theory of balanced matrices which we quickly review; see [24, 25]. A 0-1 matrix is balanced matrix if and only if for any odd numberk 3, it does not contain akk submatrix whose row sums and column sums all equal 2 and which does not contain the 2 2 submatrix 2 4 1 1 1 1 3 5 . A 0-1 matrix is called totally balanced (TB) if such condition holds for all k 3 (not just odd k). A sucient condition for a matrix to be TB is that it does not contain a submatrix F = 2 4 1 1 1 0 3 5 [65, Part III, Proposition 4.5]. (In fact this condition is also necessary up to permutations of rows and columns [65, Part 49 III, Proposition 4.8].) Clearly the TB property is invariant under permutations of rows and columns. The model discussed below is related to the tree structured group lasso presented in [80]. Suppose that any two distinct members of the family G,fG j g J j=1 are such that either they do not intersect or one is a proper subset of the other. In this case (which clearly includes as a special case of non-overlapping groups; i.e.,G i \G j =; fori6=j), we can arrange the groups in a tree structure as follows. Let u, max 1jJ jG j j be the largest number of elements contained in the individual groups. For each integer k = 1; ;u, let T k =fG k 1 ; ;G k q k g be the sub-family of G consisting of all (distinct) groups each with exactly k variables; thus eachG k r is one of theG j . For simplicity, assume that each T k is a non-empty sub-family for k = 1; ;u; thus J = u X k=1 q k . It then follows that any two dierent members of T k do not overlap, i.e., for any k = 1; ;u and r6=r 0 both in f1; ;q k g,G k r \G k r 0 =;. Furthermore, for each pair (k;k 0 ) satisfying 1k 0 <ku, we callG k r a parent node ofG k 0 r 0 ifG k 0 r 0 is a proper subset ofG k r . Thus a parent node contains more elements than its descendent node(s). Note that it is not necessary for every element in the sub-family T k to have a parent node in the sub-family T k 0, nor is it necessary for every element in T 0 k to have a parent node in T k . We dene the element-group incidence matrix E2R nJ as follows. Arrange the columns of E in the order of ascendency of the groups in T k for k = 1; ;u; i.e., G 1 1 ; ;G 1 q 1 | {z } T 1 ; G 2 1 ; ;G 2 q 2 | {z } T 2 ; ; G u1 1 ; ;G u1 q u1 | {z } T u1 ;G u 1 ; ;G u qu | {z } T u ; then let E ij = 8 < : 1 if i is contained in group j 0 otherwise : In terms of this matrix E, the constraints: X j :i2G j j jw i j 0 8i2 supp(w) can be written simply as E S 1 jSj where S, supp(w) and E S denotes the rows of E indexed by S. 50 w 3 ; w 4 ; w 5 ; w 6 T 3 w 1 ; w 2 w 3 ; w 4 w 5 ; w 6 T 2 w 5 T 1 G 1 G 2 G 3 G 4 G 5 w 1 0 1 0 0 0 w 2 0 1 0 0 0 w 3 0 0 1 0 1 w 4 0 0 1 0 1 w 5 1 0 0 1 1 w 6 0 0 0 1 1 An illustration of tree structure and the element-group incidence matrix E 6 variables, 3 levels, and 5 groups: G 1 =fw 5 g,G 2 =fw 1 ;w 2 g,G 3 =fw 3 ;w 4 g,G 4 =fw 5 ;w 6 g, andG 5 =fw 3 ;w 4 ;w 5 ;w 6 g Proposition 8. Let the family of groupsfG j g J j=1 have the tree structure dened above. For any w2R d and any nonnegative coecientsfc j g J j=1 , it holds that min 8 < : J X j=1 c j j j 2f 0; 1g J ; X j :i2G j j 1; 8i2 supp(w) 9 = ; = min 8 < : J X j=1 c j j j 2 [ 0; 1 ] J ; X j :i2G j j 1; 8i2 supp(w) 9 = ; : Proof. Letw be given. 
Suppose that the matrixE S has a submatrixF, 2 4 1 1 1 0 3 5 corresponding to rows i<i 0 and columns j <j 0 . Then i is contained in both groups corresponding to columns j and j 0 . By the tree structure assumption and the ordering of the columns of E, this can only happen if the group at column j is a descendent node of the group at column j 0 . However, since E i 0 j = 1 and E i 0 j 0 = 0, element i 0 is in the descendent group but not in the parent group, hence a contradiction. Thus the constraint matrix E S is TB. By [65, Part III, Thm 4.13], the polyhedron 0jE S 1 jSj is integral; i.e., all its extreme points are binary. This is enough to establish the desired equality of the two minima. 51 3.2 Closedness and Closure The closedness of a feasible region is very important in constrained optimization because the min- imum value of a lower semi-continuous objective function may not be attained on a non-closed feasible region, even if the objective has bounded level sets in the region. This section addresses the closedness issue of an (x)ASC system and derives an expression of the closure of SOL-(x)ASC. 3.2.1 Necessary and sucient conditions for closedness It is convenient to dene two related sets of binary vectors which capture the \combinatorial structures" of SOL-ASC(A;b): Z(A;b) , z2f0; 1g d j Az b andZ (A;b) , z2f0; 1g d j9 z2f0; 1g d such that A z b and z z : (3.6) The latter setZ (A;b) contains binary vectors obtained by \zeroing out" some entries of the vectors in Z(A;b). These two sets provide the intermediate step to characterize the closed- ness of SOL-ASC(A;b). The following proposition states the relationship betweenZ(A;b) and SOL-ASC(A;b). No proof is needed. Proposition 9. It holds that f 0; 1g d \ SOL-ASC(A;b) =Z(A;b) SOL-ASC(A;b) = fwjkwk 0 2Z(A;b)g | {z } to be extended to closure of SOL-ASC(A;b) = [ d :d i 6=0;8i fd zj z2Z(A;b)g: Moreover, the inclusion is strict ifZ(A;b) is neither the empty set nor the singletonf0g. Clearly,Z(A;b) is always closed and bounded, while SOL-ASC(A;b) needs not be closed and must be unbounded unless it is the singletonf0g. The following result provides sucient and necessary conditions for SOL-ASC(A;b) to be a closed set. Proposition 10. Suppose SOL-ASC(A;b)6=;. The following three statements are equivalent. (a) SOL-ASC(A;b) is closed. (b) Z(A;b) =Z (A;b). 52 (c) SOL-ASC(A;b) =fwj A + kwk 0 bg andZ(A;b) = z2f0; 1g d j A + zb . Proof. (a)) (b). It suces to proveZ (A;b)Z(A;b). By way of contradiction, suppose there exist binary vectors z z with z2Z(A;b)63 z. Clearly supp(z)$ supp( z). Dene the binary vector w k with components w k i , 8 > < > : 1=k if i2 supp( z)n supp(z) z i otherwise. It then follows thatkw k k 0 = z2Z(A;b) SOL-ASC(A;b). Moreover, lim k!1 w k = z62Z(A;b). This contradicts the closedness of SOL-ASC(A;b). (b)) (c). It suces to show the inclusion: Z(A;b) z2f0; 1g d j A + zb . Let z2Z(A;b). We claim that d X j=1 max (A ij ; 0 ) z j b i for arbitrary i = 1; ;m. For a given i, dene a vector z such that z j , 8 < : z j if A ij 0 0 if A ij < 0 which clearly satises z z. By (b), it follows that z2Z(A;b). We have b i d X j=1 max (A ij ; 0 )z j d X j=1 max (A ij ; 0 )z j = d X j=1 max (A ij ; 0 ) z j : Thus the claim holds. (c)) (a). This is obvious because the mappingkk 0 is lower semi-continuous and A + contains only nonnegative entries. Remark. It is possible for SOL-ASC(A;b) to be a closed set when A contains negative entries. 
A trivial example is the setfwjkwk 0 0g = R d which corresponds to A being the negative identity matrix and b = 0. Example 9. We illustrate the two setsZ(A;b) andZ (A;b) using the regression model (3.3) with interaction terms that satisfy the strong hierarchical conditions:jw (2) ij j 0 min jw (1) i j 0 ;jw (1) j j 0 for all i<j and also a cardinality constraint on w (1) . In this case, we have, for some integer K > 0, Z(A;b) = 8 > > > < > > > : (z (1) ;z (2) ) j z (2) ij min z (1) i ; z (1) j ; 8i < j j d X i=1 z (1) i K; z (2) ij ; z (1) i all binary 9 > > > = > > > ; : 53 We claim that Z (A;b) = ( (z (1) ;z (2) )j d X i=1 max j<i<` z (1) i ; z (2) ji ; z (2) i` K; z (2) ij ; z (1) i all binary ) : (3.7) To see this, let (z (1) ;z (2) ) be a pair belonging to the right-hand set. Dene z (1) i , max j<i<` z (1) i ; z (2) ji ; z (2) i` : It is not dicult to see that ( z (1) ;z (2) )2Z(A;b). Conversely, let (z (1) ;z (2) ) be a binary pair such that there exists ( z (1) ; z (2) )2Z(A;b) such that (z (1) ;z (2) ) ( z (1) ; z (2) ). Then, d X i=1 max j<i<` z (1) i ; z (2) ji ; z (2) i` d X i=1 max j<i<` z (1) i ; z (2) ji ; z (2) i` = d X i=1 z (1) i K: This proves that (z (1) ;z (2) ) belongs to the right-hand set in (3.7). Hence the equality in this expression holds. 3.2.2 Closure of SOL-ASC The expression (3.7) is interesting because the result below shows that the setZ (A;b) determines the closure of the solution set of the ASC system, not only for the regression model with interaction terms under the strong hierarchical relation among its variates, but also in general. Proposition 11. It holds that cl [ SOL-ASC(A;b) ] = wjkwk 0 2Z (A;b) : Proof. Letfw k g be a sequence of vectors in SOL-ASC(A;b) converging to the limit w 1 . We then have z k ,kw k k 0 2Z(A;b). SinceZ(A;b) is a compact set, we may assume without loss of generality that the binary sequencefz k g converges to binary vector z 1 that must belong to Z(A;b). We then have, jw 1 j j 0 lim inf k!1 jw k j j 0 = z 1 j ; 8j = 1; ;d: By the denition ofZ (A;b), it follows thatkw 1 k 0 2Z (A;b). Thus cl [ SOL-ASC(A;b) ] is a subset of wjkwk 0 2Z (A;b) . To show the reverse inclusion, let w be such thatk wk 0 z for some z2Z(A;b). The sequence w k , where w k j , 8 > < > : z j ifj w j j 0 = z j 1=k otherwise 9 > = > ; j = 1; ;d; 54 converges to w. Moreover, sincekw k k 0 = z, it follows that w k 2 SOL-ASC(A;b) for all k. Hence w2 cl [ SOL-ASC(A;b) ] as desired. Propositions 9 and 11 have established the fundamental role the two setsZ(A;b) andZ (A;b) play in the study of the ASC(A;b). The former setZ(A;b) is the intersection of the polyhedron fzjAzbg with the setf0; 1g d of binary vectors, while a continuous relaxation of the latter set Z (A;b) is H (A;b), n z2 [ 0; 1 ] d j9 z2 [ 0; 1 ] d such that A z b and z z o : Clearly cl [ SOL-ASC(A;b) ] wjkwk 0 2H (A;b) . A natural question is whether equality holds. An armative answer to this question will simplify the task of verifying if a given vector w belongs to the left-hand closure, and facilitate the optimization over this closure. Indeed, ac- cording to Proposition 11, the former task can be accomplished by solving an integer program. If the equality in question holds, then this task amounts to solving a linear program. Furthermore, a representation in terms of the polytopeH (A;b) is key to the convergence of the approximation of the ` 0 -function by continuous surrogate functions. 
Before addressing the above question, we give an example to show that the desired equality of the two sets in question does not always hold without conditions. Example 10. Consider the ASC system in two variables (w 1 ;w 2 ):jw 1 j 0 0:8 andjw 2 j 0 2jw 1 j 0 . Clearly, the only solution is (0; 0). The setH (A;b) for this system is: (z 1 ;z 2 )2 [0; 1] 2 j9 ( z 1 ; z 2 )2 [0; 1] 2 such that z 1 0:8; z 2 2 z 1 ; and (z 1 ;z 2 ) ( z 1 ; z 2 ) which clearly contains the point (0; 1). Hencefwjkwk 0 2H (A;b)g is a superset of cl [ SOL-ASC(A;b) ]. Roughly speaking, the condition below has to do with the rounding of the elements on certain faces ofH (A;b) to integers without violating the linear system Azb. Assumption A. For any subsetI off1; ;dg, n z2 [0; 1 ] d j Az b and z I = 1 jIj o 6=; ) Z(A;b)\ zj z I = 1 jIj 6=;: 55 It is not dicult to see that Example 10 fails this assumption withI =f2g because the set (z 1 ;z 2 )2 [0; 1] 2 j z 1 0:8; z 2 2z 1 ; and z 2 = 1 is nonempty, but the constraints have no solutions with z 1 2f0; 1g. Proposition 12. The following two statements hold: (a) Under Assumption A, cl [ SOL-ASC(A;b) ] = wjkwk 0 2H (A;b) . (b) If Assumption A is violated by the index setI, then any vector w with supp(w) =I belongs to the right-hand but not the left-hand set in (a). Proof. It suces to show that if w is such thatkwk 0 z for some z2 [0; 1] d satisfying A z b, then w belongs to the closure of SOL-ASC(A;b). This follows readily by applying Assumption A to the index setI = supp( z). Thus (a) holds. To prove (b), supposeI violates Assumption A. Let z2 [0; 1 ] d be such that A zb and z I = 1 jIj but there does not exist z2Z(A;b) with z I = 1 jIj . Clearly zkwk 0 . Hencekwk 0 2H (A;b). Butkwk 0 62Z (A;b); for otherwise, there exists b z2Z(A;b) satisfying b zkwk 0 , which implies b z I = 1 jIj , which is a contradiction. By Proposition 11, it follows that w62 cl [ SOL-ASC(A;b) ]. Assumption A holds for a matrix A satisfying the column-wise uni-sign property; i.e., when each column ofA has either all nonpositive or all nonnegative entries (in particular, whenA has only one row). In this case, fractional components (other than those in a given setI) of a vector z2 [0; 1] d satisfying Azb can be rounded either up or down, depending on the sign of the elements in the corresponding column, without violating the inequalities Az b. This special sign property of A turns out to be important in the optimization subject to ASC's; see Subsection 3.5.2. Another case where Assumption A holds is if convex hull ofZ(A;b) = n z2 [ 0; 1 ] d j Az b o : (3.8) Indeed, if the above equality holds, and if z 2 [0; 1] d satisfying Az b is such that z I = 1 I , then there existsfz k g K k=1 Z(A;b) and positive scalarsf k g K k=1 summing to unity such that z = K X k=1 k z k . Since z I = 1 I and each z k is a binary vector, we must have z k I = 1 I for all 56 k = 1; ;K. Thus Assumption A is valid under (3.8). Part (b) of Proposition 12 shows that Assumption A is sharp for part (a) to be valid. 3.2.3 Extension to SOL-xASC In what follows, we extend two main results in the last subsection to an xASC system. In both extensions, we assume that b() is a continuous function andM is a closed set. We rst extend Proposition 10. Proposition 13. The set SOL-xASC(A;b;M) is closed if and only if SOL-xASC(A;b;M) = n (w;)2 R d Mj A + kwk 0 b() o : Proof. It suces to prove that if SOL-xASC(A;b;M) is closed, then SOL-xASC(A;b;M) is con- tained inf(w;)j A + kwk 0 b()g. Let ( w; )2 SOL-xASC(A;b;M). 
Then w2 SOL-ASC(A;b( )), which must be closed. Hence A + k wk 0 b( ) by Proposition 10. Next is an extension of part (a) of Proposition 12. Part (b) is similar and omitted. Proposition 14. Suppose that for every2M, Assumption A holds for the pair (A;b()). Then cl [ SOL-xASC(A;b;M) ] = n (w;)2 R d Mjkwk 0 2H (A;b()) o : (3.9) Proof. Take any ( w; ) in the left-hand closure in (3.9). We claim that for every " > 0, w belongs to the closure of the set fw j Akwk 0 b( ) + "1g, which must be a subset of wjkwk 0 2H (A; b " ) , where b " ,b( )+"1. Letf(w ; )g be a sequence in SOL-xASC(A;b;M) that converges to ( w; ) as !1. By denition, we have Akw k 0 b( ). Sincefb( )g con- verges to b( ), for any "> 0, there exists such that for all , Akw k 0 b( ) +"1. Since w is the limit offw g, the claim holds. Hence, for every "> 0 there exists z " 2 [0; 1] d such that A z " b( ) +"1 andkwk 0 z " : It then follows by a continuity property of polyhedra that there existsb z2 [0; 1] d such thatAb zb( ) andk wk 0 b z. In other words, ( w; ) belong to the right-hand set in (3.9). 57 Conversely, suppose that ( w; ) belong to the right-hand set in (3.9). By Proposition 12, it follows that w2 cl [ SOL-ASC(A;b( )) ]; so there exists a sequencefw g converging to w such that (w ; ) belongs to SOL-xASC(A;b;M) for every . Hence ( w; ) belong to the left-hand closure in (3.9). 3.3 Continuous Approximation and Convergence In this section, we investigate the approximation of the set SOL-xASC(A;b;M) by replacing the univariate ` 0 -function by surrogate sparsity functions introduced in Section 2.5, and for some of them, we provide modied versions such that they are parameterized by single scalar which controls the tightness of the approximation. Convergence of such approximated functions to the` 0 -function has been ascertained in [54] in the context of sparsity optimization where the ` 0 -function is part of the objective to be optimized. In contrast, our focus here is dierent from the latter reference in that we analyze the convergence of the approximated sets to the solution set of the xASC. This analysis is complicated by the fact that SOL-xASC is generally not closed; so the convergence pertains to the closure of this solution set whose characterization in Proposition 14 is key. Before introducing the approximation functions, we rst summarize two key notions of set convergence; see [74] for a comprehensive study of such convergence and the connection to optimization. For any two closed sets C and D in R d for some integer N > 0, the Pompeiu-Hausdor (PH) distance is dened as: dist PH (C;D), max sup x2C dist(x;D); sup x2D dist(x;C) ; where the point-set distance dist(x;D) is by denition equal to inf y2D dist(x;y) with dist(x;y), kxyk andkk being a given vector norm inR d . LetC() be a closed set inR d parameterized by 2 where is a closed set in some Euclidean space. The familyfC()g 2 is said to converge to C( 0 ) in the PH sense if dist PH (C();C( 0 ))! 0 as ! 0 2 . Equivalently, C() converges to C( 0 ) in the PH sense if for any "> 0, there is an open neighborhoodN of 0 , such that for all 2N , C() C( 0 ) + cl [I B N (0;")] and C( 0 ) C() + cl [I B N (0;")]: 58 The other notion of set convergence is that of Painlev e-Kuratowski (PK) dened as follows. Again consider the case of C() as ! 0 ; by denition, the outer and inner limits are, respectively, lim sup ! 0 C() , xj lim inf ! 0 dist(x;C()) = 0 lim inf ! 0 C() , xj lim sup ! 0 dist(x;C()) = 0 : In other words, the outer limit lim sup ! 
0 C() contains allx such that there exist sequencesf k g! 0 andfx k g!x such thatx k 2C( k ) for allk . The inner limit lim inf ! 0 C() contains all x such that for any sequencef k g! 0 , there exists a sequencefx k g! x such that x k 2 C( k ) for all k. It is easy to show that the inner limit is always a subset of the outer limit. The familyfC()g 2 is said to converge to C( 0 ) as! 0 in the PK sense if both of the outer and inner limits are equal to C( 0 ). It is proved in [74, Proposition 5.12] that PH convergence implies PK convergence; but the converse is not true. In later applications, we will be speaking about the convergence of a sequence of setsfC( k )g to a given set C( 1 ) as the sequencef k g converges to a limit 1 . The denition of such sequential convergence is similar to the above that applies to all near the base value 1 . In general, set convergence has an important role to play in the convergence of optimal solu- tions to an optimization problem when the constraints are being approximated, like in the case of SOL-ASC(A;b) to be discussed in the next section. This is made precise in the result below; see [74, Section E] where such convergence is discussed in the framework of functional epi-convergence. Proposition 15. Let f :R d !R be continuous. Letf k g be a sequence converging to 1 . If lim k!1 C( k ) = C( 1 ) in the PK sense; then lim sup k!1 argmin x2C( k ) f(x) argmin x2C( 1 ) f(x). Proof. Letf x k g be a sequence converging to x 1 such that x k 2 argmin x2C( k ) f(x) for all k. It suces to show two things: (i) x 1 2C( 1 ) and (ii) for every x2C( 1 ), there existsfx k g converging to x such that x k 2 C( k ) for all k. Both follow easily from the assumed convergence offC( k )g to C( 1 ). 59 While simple to prove and illustrative of the role of set convergence, Proposition 15 is only of conceptual importance because if the sets C( k ) are non-convex and/or the objective f(x) is non- convex, then since global minima of non-convex optimization problems are generally not computable in practice, the convergence of a sequence of minima cannot be used to deduce any property of the limit of a sequence of non-optimal solutions of the approximated problems. Instead, the convergence of computable stationary solutions should be investigated in order to eliminate the gap between practical computation and convergence of the computed solutions. This provides the motivation for Section 3.5 where we focus on a kind of stationary points that can be computed by methods of dc constrained optimization problems, such as those presented in the reference [68], and investigate the convergence of such stationary solutions. Before discussing this in detail, some preparatory work is needed that begins in the next subsection. 3.3.1 Approximating functions We consider the approximation of the univariate` 0 -functionjj 0 by various surrogate functions that are motivated by the families of surrogate sparsity functions summarized in [1, 54]. Specically, writing the constraint system Akwk 0 b() as d X j=1 A + ij jw j j 0 d X j=1 A ij jw j j 0 +b i (); i = 1; ;m; we approximate each ASC constraint by d X j=1 A + ij p + j (w j ; + j ) d X j=1 A ij p j (w j ; j ) +b i (); i = 1; ;m; (3.10) where j are positive scalars and each p j is a continuous bivariate function :R (0;1)! 
[0; 1] satisfying (R1) lim #0 (t;) =jtj 0 for any t2R [this limit is not required to be uniform in t]; (R2) for every > 0, (;) is symmetric on R (i.e., (t;) = (t;) for all t 0), and non- decreasing on [0;1); (R3) (t;) = 1 for all t such thatjtj. One special feature about the approximated system (3.10) is that we may use dierent approx- imation functions p j (; j ) corresponding to the individual entries of the matrix A. We let 60 SOL-xASC (A;b;M) denote the solution set of (3.10), emphasizing the control scalars j of the approximating functions. We use Example 10 to show that SOL-xASC (A;b;M) does not always converge to SOL-xASC(A;b;M) as # 0. When this happens, a limiting solution of an optimiza- tion problem subject to the approximated constraints may not even be feasible to the xASC system. Example 10 cont. We approximate the 2-variable system: jw 1 j 0 0:8 andjw 2 j 0 2jw 1 j 0 using the capped ` 1 -function (t;) = min jtj ; 1 for > 0, obtaining the approximated system consisting of the inequalities below: min jw 1 j ; 1 0:8 and min jw 2 j ; 1 2 min jw 1 j ; 1 : (3.11) The solution set of the ASC, which is closed in this example, and that of the approximated system are depicted in Figure 3.1 below. Algebraically, the solution set of (3.11), for xed > 0, is the union of two non-convex sets: f (w 1 ;w 2 )jjw 1 j 0:8; 2jw 2 jjw 1 jg [ f (w 1 ;w 2 )j 0:5jw 1 j 0:8g | {z } the vertical stripes in the right-hand gure : Since the two vertical stripes are always contained in the above union for any > 0 and shrink to the verticalw 2 -axis as# 0, it is not dicult to see that this vertical axis is the limit of SOL-ASC as # 0. Clearly, this axis is a proper superset of the solution set of the ASC system which is the singletonf(0; 0)g. z 1 z 2 0 1 1 Azb w 1 w 2 Figure 3.1: (a)Z =fz2f0; 1g 2 jAzbg =f(0; 0)g; (b) solution set of the approximated system (3.11) We present below various surrogate functions presented in Section 2.5 in their original form and the modied version, if necessary, using single scalar parameter > 0 and brie y discuss their 61 satisfaction of the assumptions (R1){(R3). The SCAD family. The original SCAD is parameterized by two scalars a> 2 and > 0, p SCAD a; (t), 8 > > > > > > > < > > > > > > > : jtj ifjtj (a + 1 ) 2 2 (ajtj ) 2 2 (a 1 ) if jtj a (a + 1 ) 2 2 ifjtj a: To conform to the stated assumptions, we scale the function by the reciprocal of (a + 1) 2 2 and identify =a and a being xed, obtaining SCAD a (t;) = 8 > > > > > > > > < > > > > > > > > : 2a (a + 1) jtj ifjtj a 1 (jtj ) 2 1 1 a 2 2 if a jtj 1 ifjtj Note that while lim #0 SCAD a (t;) =jtj 0 for anyt2R, this limit is not uniform int near 0. The same remark applies to the following families of functions. The MCP family. Also parameterized by two positive scalars a> 2 and , MCP is given by: p MCP a; (t), 1 2 ( a 2 [ (ajtj ) + ] 2 a ) for t2R: Again, to conform with the approximating conditions, we scale the function by 2 a 2 and identify =a and a being xed, obtaining MCP (t;) = 8 > > < > > : 2jtj t 2 2 ifjtj 1 ifjtj The capped ` 1 family. This has already been mentioned before; namely, for > 0, C` 1 (t;) = min jtj ; 1 ; for t2R: 62 The truncated transformed ` 1 family. Parameterized by a given scalar a > 0, this is a truncated modication of the transformed ` 1 -function in [1] taking into account the scaling factor > 0 that tends to zero: TTL` 1 a (t;) = min (a + )jtj (a +jtj ) ; 1 ; for t2R: It is easy to see that properties (R1) and (R2) hold for this function TL` 1 a (t;) for (t;)2R(0;1). 
Noting that (a + )jtj (a +jtj ) = 1 + a (jtj ) (a +jtj ) ; we may conclude that condition (R3) is also satised. The truncated logarithmic family. Derived similarly to the truncated transformed` 1 -functions and parameterized by a scalar"> 0, a truncated logarithmic penalty function is dened as follows: for > 0, Tlog " (t;), min 8 > > < > > : 1 log 1 + " [ log(jtj +" ) log(")]; 1 9 > > = > > ; ; for t2 R: It is not dicult to see that this function satises the desired conditions (R1), (R2), and (R3). In summary, we have identied a number of well-known univariate surrogate sparsity functions, suitably truncated outside a-neighborhood of the origin, that satisfy the basic requirements of an approximating function of the ` 0 -function. 3.3.2 Convergence The main convergence result of the family of approximated solution setsfSOL-xACS (A;b;M)g >0 as ! 0 is the following. Theorem 11. Suppose that for every 2M, Assumption A holds for the pair (A;b()). Assume further that the surrogate functions in the familyfp j (; j )g d j=1 satisfy in addition to (R1), (R2), and (R3), either d X j=1 A ij p j (w j ; j ) d X j=1 A ij p + j (w j ; + j ); 8w2 R d and i = 1; ;m: (3.12) 63 or d X j=1 A + ij p j (w j ; j ) d X j=1 A + ij p + j (w j ; + j ); 8w2 R d and i = 1; ;m: (3.13) Then the familyfSOL-xACS (A;b;M)g >0 converges to cl [SOL-xASC(A;b;M)] in the PH sense ask k 1 , max n j j j = 1;n o # 0. Proof. Let C( ), SOL-xACS (A;b;M) and C 0 , cl [SOL-xACS(A;b;M)]. We need to show that for every "> 0, a scalar > 0 exists such that for all 2 ( 0; ], C( ) C 0 + cl [I B n+k (0;")] and C 0 C( ) + cl [I B n+k (0;")]: (3.14) By Proposition 14, we have C 0 = n (w;)2 R d Mj9 z2 [ 0; 1 ] d such that A z b() andkwk 0 z o : Let ( w; )2C( ). Then 2M. For the rest of the proof, we assume that (3.12) holds. A similar proof can be applied when (3.13) holds. For all i = 1; ;m d X j=1 A + ij p + j ( w j ; + j ) d X j=1 A ij p j ( w j ; j ) +b i ( ) d X j=1 A ij p + j ( w j ; + j ) +b i (); By Assumption A, there exists z2f0; 1g d such thatA zb( ) and z j = 1 wheneverp + j ( w j ; + j ) = 1. Dene a vector w() with components given by w j (), 8 > < > : w j ifj w j j + j ; (thus p + j ( w j ; + ) = 1 ) sign( w j ) + j z j ifj w j j < + j ; where we dene sign(0) = 1. It is easily seen thatkw()k 0 = z. Hence Akw()k 0 b(); so w()2C 0 . For an index j such thatj w j j< + j , we have w j () w j = sign(w j ) h + j z j j w j j i ; implying thatjw j () w j j = + j z j j w j j + j . Hence the rst inclusion in (3.14) holds. To prove the second inclusion, let ( w; )2 C 0 be arbitrary. Dene a vector w() with components given by: w j (), 8 > < > : w j ifj w j j j or w j = 0 sign( w j ) j otherwise; 64 where j , max + j ; j . It then follows than p j (w j (); j ) = 1 =j w j j 0 unless w j = 0, in which case, p j ( w j ; j ) = 0 =j w j j 0 . Since Ak wk 0 b(), it follows that w()2 C( ). Since clearly jw j () w j j j , the second inclusion in (3.14) also holds. Two special cases worth noting are when A 0 or A 0. It is easy to verify that Assumption A holds in both cases (see discussion after Proposition 12); moreover, (3.12) holds in the former case and (3.13) holds in the latter case. Hence Theorem 11 is valid under the basic properties (R1), (R2) and (R3). In general, the two requirements (3.12) and (3.13) can be enforced by choosing p + j and p j such that p j (; j )p + j (; + j ) pointwise. 
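As a small numerical companion to the preceding discussion, the following Python sketch implements the scaled capped-$\ell_1$, MCP, and SCAD surrogates of Subsection 3.3.1 and spot-checks properties (R2) and (R3), the pointwise convergence in (R1) at a fixed nonzero point, and the pointwise ordering $p_j^-(\cdot;\delta_j^-) \le p_j^+(\cdot;\delta_j^+)$ that can be used to enforce (3.12) or (3.13). The function names, the shape parameter $a$, and the test grid are our own illustrative choices, not part of the dissertation's development.

```python
import numpy as np

def capped_l1(t, delta):
    """Scaled capped-l1 surrogate: min(|t|/delta, 1)."""
    return np.minimum(np.abs(t) / delta, 1.0)

def mcp(t, delta):
    """Scaled MCP surrogate: 2|t|/delta - (|t|/delta)^2 on |t| <= delta, and 1 otherwise."""
    s = np.abs(t)
    return np.where(s <= delta, 2.0 * s / delta - (s / delta) ** 2, 1.0)

def scad(t, delta, a=3.7):
    """Scaled SCAD surrogate with shape parameter a > 2 (delta plays the role of a*lambda)."""
    s = np.abs(t)
    inner = 2.0 * a / ((a + 1.0) * delta) * s
    middle = 1.0 - (s - delta) ** 2 / ((1.0 - 1.0 / a ** 2) * delta ** 2)
    return np.where(s <= delta / a, inner, np.where(s <= delta, middle, 1.0))

t = np.linspace(-2.0, 2.0, 2001)
for delta in [1.0, 0.5, 0.1, 0.01]:
    for p in (capped_l1, mcp, scad):
        v = p(t, delta)
        assert np.all((0.0 <= v) & (v <= 1.0))        # values stay in [0, 1]
        assert np.allclose(v, p(-t, delta))           # (R2): symmetry in t
        assert np.all(v[np.abs(t) >= delta] == 1.0)   # (R3): identically 1 outside the delta-band
    # (R1): pointwise convergence to |t|_0 at the fixed nonzero point t = 0.3
    print(delta, capped_l1(0.3, delta), mcp(0.3, delta))

# Enforcing (3.12)/(3.13): choose p^- below p^+ pointwise,
# e.g. the capped-l1 surrogate lies below the MCP surrogate for a common delta.
assert np.all(capped_l1(t, 0.5) <= mcp(t, 0.5) + 1e-12)
```

The last assertion illustrates one concrete way to satisfy the pointwise-ordering requirement: pair a smaller surrogate on the $A^-$ side with a larger one on the $A^+$ side for the same scalar $\delta$.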
Our next convergence result pertains to the situation where we fix $\delta_j^- > 0$ for all $j = 1, \ldots, d$ and consider only a one-sided approximation $|w_j|_0 \approx p_j^+(w_j; \delta_j^+)$ with $\delta_j^+ \downarrow 0$. Specifically, we consider the approximation of the system
$$\sum_{j=1}^d A_{ij}^+ \, |w_j|_0 \;\le\; \widehat b_i(w,\xi) \,\triangleq\, \sum_{j=1}^d A_{ij}^- \, p_j^-(w_j; \delta_j^-) + b_i(\xi), \qquad i = 1, \ldots, m,$$
where we fix $\delta_j^-$ and approximate $|\cdot|_0$ on the left side as said. In this case, convergence in the PK sense can be established without Assumption A and with no restriction on the choice of the approximating functions $p_j^+(\cdot;\delta_j^+)$ except for the basic properties (R1), (R2), and (R3). Recognizing that the above system is a constraint system $A\|w\|_0 \le \widehat b(w,\xi)$ with $A$ being a nonnegative matrix and $\widehat b$ being a continuous function of the pair $(w,\xi)$ whose dependence on $\{\delta_j^-\}_{j=1}^d$ we have suppressed, we state and prove a version of Theorem 11 for such a system.

Proposition 16. Let $A$ be a nonnegative matrix, $\widehat b(w,\xi)$ be a continuous function, and $M$ be a closed set. For each $j = 1, \ldots, d$, let $p_j^+(\cdot;\delta_j^+)$ be an approximating function satisfying conditions (R1), (R2), and (R3). Define, for any $\delta^+ \triangleq (\delta_j^+)_{j=1}^d > 0$, the sets
$$C(\delta^+) \,\triangleq\, \Big\{ (w,\xi) \in \mathbb{R}^d \times M \;\Big|\; \sum_{j=1}^d A_{\cdot j}\, p_j^+(w_j; \delta_j^+) \le \widehat b(w,\xi) \Big\}
\quad \text{and} \quad
C_0 \,\triangleq\, \Big\{ (w,\xi) \in \mathbb{R}^d \times M \;\Big|\; A\|w\|_0 \le \widehat b(w,\xi) \Big\}.$$
Then the family $\{C(\delta^+)\}_{\delta^+ > 0}$ converges to $C_0$ in the PK sense as $\|\delta^+\|_\infty \triangleq \max\{ \delta_j^+ \mid j = 1, \ldots, d \} \downarrow 0$.

Proof. Since $C_0$ is a subset of $C(\delta^+)$ for any $\delta^+ > 0$, it follows that the former set is a subset of $\liminf_{\delta^+ \downarrow 0} C(\delta^+)$. It remains to show that $\limsup_{\delta^+ \downarrow 0} C(\delta^+) \subseteq C_0$. Let $(w^\infty, \xi^\infty)$ be the limit of a sequence $\{(w^k, \xi^k)\}$ where, for each $k$, the pair $(w^k, \xi^k)$ satisfies
$$\sum_{j=1}^d A_{ij}\, p_j^+(w_j^k; \delta_j^{+,k}) \le \widehat b_i(w^k, \xi^k), \qquad i = 1, \ldots, m,$$
corresponding to a sequence of positive scalars $\{\delta^{+,k}\} \downarrow 0$. By the nonnegativity of $A$, property (R3), and the nonnegativity of the approximating functions $p_j^+(\cdot;\delta_j^+)$, we have for each $i = 1, \ldots, m$ and all $k$ sufficiently large,
$$\sum_{j=1}^d A_{ij}\, |w_j^\infty|_0 \;=\; \sum_{j \in \mathrm{supp}(w^\infty)} A_{ij} \;\le\; \sum_{j=1}^d A_{ij}\, p_j^+(w_j^k; \delta_j^{+,k}) \;\le\; \widehat b_i(w^k, \xi^k).$$
Passing to the limit $k \to \infty$ establishes the desired inclusion. $\square$

3.4 Tangent Properties of SOL-ASC and its Approximation

For a given closed set $C \subseteq \mathbb{R}^d$, the tangent cone of $C$ at a vector $\bar x \in C$, denoted $\mathcal{T}(\bar x; C)$, consists of the vectors $v$ for which there exist a sequence of vectors $\{x^k\} \subset C$ converging to $\bar x$ and a sequence of positive scalars $\{\tau_k\}$ converging to zero such that $v = \lim_{k \to \infty} \dfrac{x^k - \bar x}{\tau_k}$. Tangent vectors of closed sets play an important role in the stationarity conditions of optimization problems constrained by such sets. In this section, we characterize the tangent vectors of SOL-ASC$(A,b)$ and those of its approximation SOL-ASC$_\delta(A,b)$. We omit the extension $(A,b,M)$ in order to focus on the $\|\cdot\|_0$-function and its componentwise approximation by the penalty functions $p_j(\cdot;\delta_j)$. Recalling that a tangent cone must be a closed set, we have the following expression for the tangent cone of SOL-ASC, which shows in particular that the latter cone is also defined by an ASC. We also obtain a superset of the tangent cone in the case where the matrix $A$ is nonnegative, which will be useful subsequently. For a given $\bar w \in$ SOL-ASC$(A,b)$, we let $\mathcal{A}_{\rm ASC}(\bar w)$ be the set of indices $i \in \{1, \ldots, m\}$ such that $A_{i \cdot} \|\bar w\|_0 = b_i$.

Proposition 17. Let $\bar w \in$ SOL-ASC$(A,b)$. Let $\bar S^c$ be the complement of $\bar S \triangleq \mathrm{supp}(\bar w)$ in $\{1, \ldots, d\}$. It holds that
$$\mathcal{T}(\bar w; \text{SOL-ASC}(A,b)) \;=\; \mathrm{cl}\left[ \left\{ v \mid A_{\bar S^c} \|v_{\bar S^c}\|_0 \le b - A\|\bar w\|_0 \right\} \right]
\;=\; \mathrm{cl}\left[ \left\{ v \;\middle|\; \begin{pmatrix} \bar w_{\bar S} \\ v_{\bar S^c} \end{pmatrix} \in \text{SOL-ASC}(A,b) \right\} \right]. \tag{3.15}$$
Thus if $A \ge 0$, then
$$\mathcal{T}(\bar w; \text{SOL-ASC}(A,b)) \;\subseteq\; \{ v \mid v_j = 0 \text{ for all } j \text{ such that } \exists\, i \in \mathcal{A}_{\rm ASC}(\bar w) \text{ with } A_{ij} > 0 \}.$$
Proof.
The equality of the two closures is easy to see. To prove the equality of the tangent cone and the rst closure, write C, SOL-ASC(A;b). We rst show that fvj A S ckv S ck 0 bAk wk 0 gT ( w;C): This is enough to imply that the left-hand tangent cone in (3.15) contains the right-hand closures in the same expression. Let v be an arbitrary vector satisfying A S ckv S ck 0 bAk wk 0 . For all > 0 suciently small, we havek w +vk 0 = max (kwk 0 ;kvk 0 ). So Ak w +vk 0 = Ak wk 0 +A S ckv S ck 0 b; which implies that v is a tangent vector of C at w. Conversely, let d2T ( w;C) be given. Let fw k gC andf k g# 0 be such that lim k!1 w k = w and lim k!1 w k w k = d: Clearly, w k S c w S c k 0 =kw k S c k 0 ; moreover, for all k suciently large, b Akw k k 0 = Ak wk 0 +A S ckw k S c k 0 = Ak wk 0 +A S c w k S c w S c k 0 : Hence, it follows readily that d belongs to the right-hand closure in (3.15). To prove the last assertion of the proposition, let A 0. Let v satisfy A S ckv S ck 0 bAk wk 0 . For every i2A ASC ( w), we have X j62 S and v j 6=0 A ij 0: 67 Therefore if there is an i2A ASC ( w) with A ij > 0, then we must have v j = 0. Proposition 17 yields two interesting properties ofT ( w; SOL-ASC(A;b)) that can be contrasted with the tangent cone of the polyhedronP(A;b), w2R d jAwb . First, it is known that for a given w2P(A;b), we have T ( w;P(A;b)) = n v2 R d j A i v 08i2A( w) o whereA( w) is the index set of the active constraints at w2P(A;b). It follows from this represen- tation ofT ( w;P(A;b)) thatv is a tangent vector ofP(A;b) at w if and only if w +v2P(A;b) for all > 0 suciently small. In contrast, Proposition 17 shows that a vectorv is a tangent of the set SOL-ASC(A;b) at w if and only if it is the limit of a sequencefv k g for which a scalar > 0 and an integer k > 0 exist such that for all 2 (0; ] and all k k, w +v k 2 SOL-ASC(A;b). From this, it follows that if v is a tangent of the set SOL-ASC(A;b), then w +v2 cl[SOL-ASC(A,b)] for all > 0 suciently small. Thus the tangents of SOL-ASC(A;b) have exactly the same fea- sibility property as those ofP(A;b) provided that SOL-ASC(A;b) is closed. Another interesting consequence of Proposition 17 is that the entire set of constraints Akwk 0 b and the (in)activity of the sparsity constraintsjw j j 0, j = 1; ;d are all involved in the denition of the tangent vectors of SOL-ASC(A;b). In contrast,T ( w;P(A;b)) involves only the active constraints at w of the system Awb. 3.4.1 Tangent cone of approximated sets: Fixed Consider the approximated SOL-ASC (A;b) with each approximating function p j (; j ) given by: for some positive integer K j , p j (t; j ), j jtj max 1kK j g jk (t) | {z } ,g j (t) ; t2 R; (3.16) where each j is a positive scalar and each g jk is a univariate dierentiable convex function, all dependent on n j o d j=1 . As proved in [1], the surrogate sparsity functions in the SCAD, MCP, capped ` 1 , the transformed ` 1 , and the logarithmic families can all be expressed in the above dierence-of-convex (dc) form. From the expression: min (jtjg(t); 1 ) = jtj max (g(t); jtj 1 ); 68 we see that the truncation of a dc function of the above form can be represented in the same form; thus, dc functions given by (3.16) also include those in the truncated transformed ` 1 and truncated logarithmic families; thus all the functions discussed in Subsection 3.3.1 are covered by the form (3.16). 
With p j (; j ) as given, the inequality d X j=1 A + ij p + j (w j ; + j ) d X j=1 A ij p j (w j ; j ) +b i can be written very simply as i (w) i (w) | {z } i (w) b i 0; (3.17) where i (w), d X j=1 h A + ij + j jw j j +A ij g j (w j ) i and i (w), d X j=1 h A ij j jw j j +A + ij g + j (w j ) i are convex functions. Thus, SOL-ASC (A;b) is a \dc set"; i.e., it has the representation: SOL-ASC (A;b) =fwj i (w) b i ; 8i = 1; ;mg: We recall that the directional derivative of a function at a given vector w in the direction v is given by: 0 ( w;v), lim #0 ( w +v)( w) if the limit exists. The following representation of the tangent cone of SOL-ASC (A;b) is directly adopted from Corollary 6 and Proposition 7 in [68] where a proof can be found. Proposition 18. Let w2 SOL-ASC (A;b) be given. Let each p j (; j ) be given by (3.16). Let A ( w), 8 < : ij d X j=1 A + ij p + j ( w j ; + j ) = d X j=1 A ij p j ( w j ; j ) +b i 9 = ; : If each function g jk is linear for all j = 1; ;d and k = 1; ;K j , it holds that T ( w; SOL-ASC (A;b)) = vj 0 i ( w;v) 0; 8i2A ( w) ; (3.18) henceT ( w; SOL-ASC (A;b)) is the union of nitely many closed convex cones. 69 The linearity of the functions g jk implies that each approximating function p j (; j ) is piece- wise linear although no such piecewise linearity is required on p + j (; + j ). Among the ve families: SCAD, MCP, capped` 1 , truncated transformed` 1 , and truncated logarithmic, only the capped` 1 function is piecewise linear; the SCAD and MCP functions are dierentiable on the real line except at the origin; the latter two truncated functions have two additional non-dierentiable points at t =. In terms of the functions p j (; j ), the expression (3.18) yields the following: T ( w; SOL-ASC (A;b)) = 8 < : vj d X j=1 A + ij p + j (; + j ) 0 ( w j ;v j ) d X j=1 A ij p j (; j ) 0 ( w j ;v j ); 8i2A ( w) 9 = ; : (3.19) Unlike the tangent cone of SOL-ASC (cf. (3.15)), no closure is needed in the above right-hand set because this set is already closed. We provide an example showing that the equality (3.18) can fail without the linearity assumption on the functions g jk in Proposition 18. Example 8 cont. Consider the system jw 1 j 0 jw 2 j 0 at the feasible point w = (2; 2). We approximate both ` 0 functions by the MCP functions with a xed = 2 as follows: for i = 1; 2, i (w i ) =jw i jg i (w i ); where g i (t), 8 < : t 2 =4 ifjtj 2 jtj 1 ifjtj 2: It is easy to check that g i is convex and continuously dierentiable on R. Moreover, i (w i ) is dierentiable everywhere except at w i = 0. The dc set is (w 1 ;w 2 )2 R 2 j 1 (w 1 ) 2 (w 2 ) = (w 1 ;w 2 )2 R 2 jjw 1 jjw 2 j orjw 2 j 2 : (3.20) Since 0 i (2) = 0, it follows that (v 1 ;v 2 )2 R 2 j 0 1 (2)v 1 0 2 (2)v 2 = R 2 : (3.21) In contrast, the actual tangent cone of the set (3.20) at the given w = (2; 2) is: (d 1 ;d 2 )2 R 2 j d 1 d 2 or d 2 0 which is clearly a subset of (3.21). 70 3.5 Convergence of B-Stationary Solutions In this section, we apply the results derived in the above sections to address the convergence of stationary solutions. Consider the following optimization problem: minimize w f(w), h(w)g(w) subject to w2 C , cl [SOL-ASC(A,b)]; (3.22) where h is a convex function (not necessarily dierentiable) and g is a continuously dierentiable convex function, both dened on an open set containing the feasible setC. Thusf is a dierence- of-convex (dc) function. 
We recall that a feasible vector w2C is a B(ouligand) stationary solution [68] of (3.22) if h 0 ( w;v)rg( w) T v = f 0 ( w;v) 0; 8v2T ( w;C): We attempt to approximate such a stationary solution w by solving the approximated problem: minimize w f(w) subject to w2 C( ), SOL-ASC (A;b); (3.23) where the feasible region is dened by the family of approximating functions n p j (; j ) o d j=1 each satisfying assumptions (R1), (R2), and (R3) as well as condition (3.12) or (3.13). Letf ;k g be a sequence of positive scalars converging to zero. For each k, let w k be a B-stationary solution of (3.23) corresponding to ;k ; i.e., f 0 ( w k ;v) 0; 8v2T ( w k ;C( ;k )): Suppose that the sequencef w k g converges to w, which must necessarily belong to C by the con- vergence offC( ;k )g to C. The question is whether w is a B-stationary solution of (3.22). For this question to have an armative answer, it suces to identify for any b v 2 T ( w;C) an in- nite index set f1; 2;g and a sequence of tangentsfv k g k2 with v k 2 T ( w k ;C( ;k )) for each k such thatfv k g k2 converges to b v. Indeed, if such v k exist, then using the fact that h 0 ( w;b v) lim sup k(2)!1 h 0 ( w k ;v k ), by the convexity of h, we deduce, f 0 ( w;b v) lim sup k(2)!1 f 0 ( w k ;v k ); from which the desired B-stationarity of w follows readily. If f is dierentiable, then it suces for frf( w k ) T v k g!rf( w) T b v. Before constructing the desired sequencefv k g, we provide an example to illustrate that this may not always be possible, thus establishing the failed convergence of such stationary solutions in general. 71 Example 12. Consider the 2-variable optimization problem: minimize w 1 ;w 2 1 2 (w 1 1 ) 2 + (w 2 1 ) 2 subject to jw 1 j 0 +jw 2 j 0 1: (3.24) We approximate the constraint by the capped ` 1 surrogate function, obtaining the approximated problem: for b k > 0, minimize w 1 ;w 2 1 2 (w 1 1 ) 2 + (w 2 1 ) 2 subject to min jw 1 j b k ; 1 + min jw 2 j b k ; 1 1: (3.25) It is not dicult to verify that w( b k ), b k 2 ; b k 2 ! is a B-stationary solution of the latter problem. Yet the limit (0; 0) is not a B-stationary solution of the original problem (3.24), by noting that the tangent cone of the feasible region at this limit point is equal to the feasible region itself. For this example, we note that the constraint in (3.25) is binding at the approximated pair w( b k ) but the constraint in (3.24) is not binding at the limit (0; 0). Incidentally, this situation will not happen with a polyhedron under perturbation but happens here partly due to the discontinuity of the ` 0 function. w 1 w 2 contours of f C w C( b k ) w k rf( w k ) v 1 v 2 T ( w;C) T ( w k ;C( b k )) unique normal toT ( w;C) C =f(w 1 ;w 2 )jjw 1 j 0 +jw 2 j 0 1g C( b k ) = (w 1 ;w 2 )j min jw 1 j b k ; 1 + min jw 2 j b k ; 1 1 T ( w;C) =f(v 1 ;v 2 )jjv 1 j 0 +jv 2 j 0 1g T ( w k ;C( b k )) =f(v 1 ;v 2 )j v 1 +v 2 0g Figure 3.2: Illustration of Example 12 using iterates For operational purposes, we assume in the rest of the chapter that the representation (3.19) of the tangent coneT ( w k ;C( ;k )) of the approximated setC( ;k ) is valid. To facilitate the identication of the desired approximating tangents, we write the two conesT ( w;C) andT ( w k ;C( ;k )) as 72 follows. First, let S, supp( w) with complement S c . 
We have T ( w;C) = cl 2 4 8 < : vj X j2 S c A ij jv j j 0 b i d X j=1 A ij j w j j 0 ; 8i = 1; ;m 9 = ; 3 5 while T w k ;C( ;k ) = 8 < : vj d X j=1 A + ij p + j (; +;k j ) 0 ( w k j ;v j ) d X j=1 A ij p j (; ;k j ) 0 ( w k j ;v j ); 8i2A ;k( w k ) 9 = ; : For any vector v2R d , let V =0 , j 2 S c j j 62 supp(v) and V 6=0 , j 2 S c j j 2 supp(v) whose union is the complement S c of the support of the vector w. We divide the terms in the sum d X j=1 A ij p j (; ;k j ) 0 ( w k j ;v j ) according to a given vectorb v2T ( w;C) and the two associated index sets b V =0 and b V 6=0 : d X j=1 A ij p j (; ;k j ) 0 ( w k j ;v j ) = X j2 S A ij p j (; ;k j ) 0 ( w k j ;v j ) +T k;0 (v) +T k;6=0 (v); where T k;0 (v), X j2 b V =0 A ij p j (; ;k j ) 0 ( w k j ;v j ) and T k;6=0 (v), X j2 b V 6=0 A ij p j (; ;k j ) 0 ( w k j ;v j ): Letting ;k j = b k for all j = 1; ;d, we can write T k;0 (v) = 1 b k X j2 b V =0 A ij h b k p j (; b k ) 0 ( w k j ;v j ) i and T k;6=0 (v) = 1 b k X j2 b V 6=0 A ij h b k p j (; b k ) 0 ( w k j ;v j ) i : Under assumption (R3) of the functionsp j (;), we deduce the following two one-sided derivatives for all k suciently large and all j2 S: p j (; b k ) 0 (t;1) = 0; 8t such thatjtj > b k : 73 For an index j2 S, sincef w k j g! w j 6= 0 andf b k g! 0, it follows that for all but nitely many k, w k j > b k . Hence, under the stipulation that ;k j = b k for all j = 1; ;d, it follows that for all k suciently large, v2T w k ;C( ;k ) if and only if for all i2A ;k( w k ), X j2 b V =0 A + ij h b k p + j (; b k ) 0 ( w k j ;v j ) i + X j2 b V 6=0 A + ij h b k p + j (; b k ) 0 ( w k j ;v j ) i X j2 b V =0 A ij h b k p j (; b k ) 0 ( w k j ;v j ) i + X j2 b V 6=0 A ij h b k p j (; b k ) 0 ( w k j ;v j ) i (3.26) This can be contrasted with the necessary and sucient condition forb v2T ( w;C), which is: X j2 b V 6=0 A + ij X j2 b V 6=0 A ij + 2 4 b i X j2 S A ij 3 5 ; 8i = 1; ;m; (3.27) provided that the set of vectors b v satisfying the latter inequalities is closed. This is the case for instance when the matrix A is nonnegative as we will assume in Subsection 3.5.1. One obvious dierence between (3.26) and (3.27) is that the components v j , forj2 b V 6=0 , of the tangentv of the approximated set C( ;k ) appear explicitly in the former, whereas the same components of b v do not in the latter. At this point, it would be useful to provide the directional derivatives of the surrogate sparsity functions discussed in Subsection 3.3.1, in particular the derived expressions will verify the follow- ing properties of (;) 0 (t;1) for all > 0 and all nonzero s2R, namely, (R4a) sign [(;) 0 (t;s) ] 8 < : = sign(s) sign(t) if 0<jtj < = 1 if t = 0; (R4b) (;) 0 (t;s) 8 < : = 0 if st> 0 0 if st< 0 9 = ; forjtj =; and (R4c) (;) 0 (t;s) = 0 forjtj>. The function (;) is not dierentiable at the origin and possibly at (see e.g. the truncated transformed ` 1 -function below). By denition, we must have (;) 0 (t; 0) = 0 for all t. Moreover, for st6= 0, (;) 0 (t;s)> 0 if and only ifjtj< and st> 0. 74 In what follows, we give expressions of the directional derivatives of three such functions: SCAD, MCP, and the truncated transformed ` 1 , and omit the other two: capped ` 1 and truncated loga- rithmic. The SCAD family. We have SCAD a (;) 0 (t;1) = 8 > > > > > > > > > > > > > < > > > > > > > > > > > > > : sign(t) 2a (a + 1) if 0 <jtj a 2a (a + 1) if t = 0 sign(t) 2 (jtj ) 1 1 a 2 2 if a jtj 0 ifjtj ; which is continuously dierentiable at all nonzero t2R. 
Thus SCAD a (;) 0 (t;s) = 8 > > > > > > > > > > > > > > < > > > > > > > > > > > > > > : s sign(t) 2a a + 1 if 0 <jtj a jsj 2a a + 1 if t = 0 s sign(t) 2 1 jtj 1 1 a 2 if a jtj 0 ifjtj : The MCP family. We have MCP (;) 0 (t;1) = 8 > > > > > > > < > > > > > > > : sign(t) 2 2jtj 2 if 0 <jtj 2 if t = 0 0 ifjtj ; which has the same dierentiability properties as a SCAD function. Moreover, MCP (;) 0 (t;s) = 8 > > > > > > < > > > > > > : s sign(t) 2 2jtj if 0 <jtj 2jsj if t = 0 0 ifjtj : 75 The truncated transformed ` 1 family. We have TTL` 1 a (;) 0 (t;1) = 8 > > > > > > > > > > > > < > > > > > > > > > > > > : sign(t) a (a + ) (a +jtj ) 2 if 0 <jtj < a + a if t = 0 min sign(t) a (a + ) ; 0 ifjtj = 0 ifjtj>; yielding TTL` 1 a (;) 0 (t;s) = 8 > > > > > > > > > > > > < > > > > > > > > > > > > : s sign(t) a (a + ) (a +jtj ) 2 if 0 <jtj < jsj a + a if t = 0 min s sign(t) a a + ; 0 ifjtj = 0 ifjtj > : 3.5.1 The case A 0 Using the last inclusion of the tangent cone in Proposition (17), this nonnegative case is relatively easy to deal with. Proposition 19. Let A 0 and f = hg be a dc function with g and h both convex and g additionally continuously dierentiable. Let ;k j = b k for allj = 1; ;d and allk. Let lim k#0 b k = 0. For each k, let w k be a B-stationary solution of the problem: minimize w f(w) subject to w2 b C( b k ), 8 < : wj d X j=1 A ij p + j (w j ; b k ) b i ; 8i = 1; ;m 9 = ; : where each surrogate function p + j (; b k ) satises conditions (R1){(R4). If w is the limit off w k g satisfying the property that A 0 w if i is such that d X j=1 A ij p + j ( w k j ; b k ) = b i (3.28) 76 for innitely many k, then i2A ASC ( w), then w is a B-stationary solution of (3.22). Proof. Letb v2T ( w;C) with C, SOL-ASC(A;b). It suces to construct a sequencefv k g and identify an innite index set such that the subsequencefv k g k2 converges tob v and for all k2 suciently large, 0 X j2 b V =0 A ij h b k p + j (; b k ) 0 ( w k j ;v k j ) i + X j2 b V 6=0 A ij h b k p + j (; b k ) 0 ( w k j ;v k j ) i 9 > > > > > = > > > > > ; 8i such that (3.28) holds: (3.29) Dene the components v k j as follows: v k j , 8 > < > : b v j if either j2 S or j2 b V 6=0 w k j if j2 b V =0 For every k, there is a (possibly empty) index set A k of constraints i such that (3.28) holds corresponding to the pair (i;k). Since there are only nitely constraints, there exists an innite subset off1; 2;g such thatA k is a constant set, say A, for all k2. By assumption A 0 w , we have AA ASC ( w). It follows from Proposition 17 that b v j = 0 provided that there exists an i2A ASC ( w) such that A ij > 0. Thus, for every index j2 b V 6=0 , we must have A ij = 0 for all i2A ASC ( w). Hence, the requirement (3.29) for the sequencefv k g k2 reduces to 0 X j2 b V =0 A ij h b k p + j (; b k ) 0 ( w k j ;v k j ) i 8i2 A and all k2: By (R4), we have, forj2 b V =0 with w k j 6= 0, sign h p + j (; b k ) 0 ( w k j ;v k j ) i = sign(v k j ) sign( w k j ) =1, by the choice of v k j . Hence it follows that v k 2T w k ; SOL-ASC b k (A;b) for all k2. It remains to show thatfv k g k2 converges tob v. But this is clear from the denition of the components v k j and the fact that w k j ! w j = 0 =b v j for all j2 b V =0 . 3.5.2 The case of column-wise uni-sign In this subsection, we assume that the objective function f in (3.22) is continuously dierentiable (C 1 ) so that we can focus on the choice of the sequence of approximate tangents. Let W ( w) 77 be the set of vectorsb v satisfying the inequalities in (3.27). 
Since the closure of W ( w) is equal to T ( w; SOL-ASC(A;b)), it follows that a necessary and sucient condition for w to be a B-stationary solution of (3.22) is that 0rf( w) T b v = X j2 S @f( w) @w j b v j + X j2 b V 6=0 @f( w) @w j b v j ; 8b v2 W ( w): (3.30) In what follows, we obtain a necessary and sucient condition for a given w2 SOL-ASC(A;b) to be a B-stationary solution of (3.22), based on which we will address the convergence of the B-stationary solutions of the approximated problems. For this purpose, dene J ( w), 8 < : J S c j X j2J A j bAk wk 0 9 = ; : This familyJ ( w) collects all index sets b V 6=0 corresponding to the tangentsb v inT ( w; SOL-ASC(A;b)). Lemma 1. Let f be a C 1 function dened on an open set containing SOL-ASC(A;b). A vector w2 SOL-ASC(A;b) is a B-stationary solution of (3.22) if and only if @f( w) @w j = 0; 8j 2 supp( w)[ [ J2J ( w) J: (3.31) Proof. \If." Associated with every vector b v 2 W ( w) is an index set b J satisfying X j2 b J A j bAk wk 0 . By assumption, @f( w) @w j = 0 for allj2 b J. The equality in the expression forrf( w) T b v in (3.30) and the assumption (3.31) easily yields the B-stationarity of w. \Only if." Suppose that the inequality in (3.30) holds. Any vectorb v withV 6=0 =; belongs to the set W ( w). Thus, we must have X j2 S @f( w) @w j b v j = 0; 8b v S : This implies that @f( w) @w j = 0 for all j2 S. Hence the B-stationarity of w reduces to X j2 b V 6=0 @f( w) @w j b v j 0; 8 (b v j ) j2 b V 6=0 satisfying X j2 b V 6=0 A j bAk wk 0 : 78 Let b j2 b J for some b J2J ( w). We have X j 0 2 b J A j 0 bAk wk 0 . For any scalar "> 0, dene the vectorsb v "; as follows: b v "; j , 8 > > > > > < > > > > > : 0 if j62 b J " if j2 b J and j6= b j 1 if j = b j: It is not dicult to see that n j62 Sjb v "; j 6= 0 o = b J. Hence we have 0 " X j2 b Jnf b jg @f( w) @w j @f( w) @w b j : Since this holds for all "> 0, it follows by passing to the limit "# 0 that @f( w) @w b j 0; yielding @f( w) @w b j = 0; establishing that (3.31) is necessary for B-stationarity. Based on the above lemma, we can establish the following result when the matrix A satises the column-wise uni-sign property. With this special structure on A, we may divide the columns of A into two groups:C whose entries are all nonnegative, andC whose entries are all nonpositive. The noteworthy part in the proof of the result is that not all components of the sequencefv k g converge to the corresponding components of b v (those with indices in the sets b V ; 6=0 do not; see notation below), but we must haverf( w k ) T v k !rf( w) T b v restricted to an appropriate subsequence; this is sucient to establish the desired B-stationarity of the limit w. Theorem 13. Let A have the column-wise uni-sign property and f be a C 1 function dened on an open set containing SOL-ASC(A;b). Let ;k j = b k for allj = 1; ;d and allk. Let lim k#0 b k = 0. For each k, let w k be a B-stationary solution of the problem: minimize w f(w) subject to w2 b C( b k ), 8 < : wj d X j=1 A + ij p + j (w j ; b k ) b i + d X j=1 A ij p j (w j ; b k ); 8i = 1; ;m 9 = ; ; where each pair of surrogate functions p j (; b k ) satises conditions (R1){(R4) and either (3.12) or (3.13). 
If w is the limit off w k g satisfying the property that 79 (A w ) if i is such that d X j=1 A + ij p + j ( w k j ; b k ) = b i + d X j=1 A ij p j ( w k j ; b k ) (3.32) for innitely many k, then i2A ASC ( w), then w is a B-stationary solution of (3.22) if and only if @f( w) @w j = 0; 8j 2 [ J2J ( w) J such that9i2A ASC ( w) with A ij 6= 0: (3.33) Proof. It suces to prove the \if" statement. We proceed as in the proof of Proposition 19. For every k, there is a (possibly empty) index set A k of constraints i such that (3.32) holds corresponding to the pair (i;k). Since there are only nitely constraints, there exists an innite subset off1; 2;g such thatA k is a constant set, say A, for allk2. By assumption (A w ), we have AA ASC ( w). Dene the components v k j as follows: v k j , 8 > > > > > > > > > > > > > > > > < > > > > > > > > > > > > > > > > : b v j if j 2 S w k j if j 2 b V =0 \C , b V =0 w k j if j 2 b V =0 \C , b V =0 b v j if j 2 b V 6=0 and A ij = 08i2A ASC (A;b); denoted j 2 b V 0 6=0 0 if j 2 b V 6=0 \C and9i2A ASC (A;b) such that A ij 6= 0; denoted j 2 b V 6=0 w k j if j 2 b V 6=0 \C and9i2A ASC (A;b) such that A ij 6= 0; denoted j 2 b V 6=0 : Note that the components v k j for j2 b V ; 6=0 do not necessarily converge to b v j , whereas all other components do. Taking into account the column-wise uni-sign property of A, the inequality in (3.26) for i2 A, which is a subset ofA ASC (A;b), can be written equivalently as X j2 b V =0 A + ij h b k p + j (; b k ) 0 ( w k j ;v k j ) i + X j2 b V 6=0 A + ij h b k p + j (; b k ) 0 ( w k j ;v k j ) i X j2 b V =0 A ij h b k p j (; b k ) 0 ( w k j ;v k j ) i + X j2 b V 6=0 A ij h b k p j (; b k ) 0 ( w k j ;v k j ) i | {z } = 0 since v k j = 0 for all such j : By the denition of v k j and property (R4) of the directional derivatives p j (; b k ) 0 ( w j ;) of the surrogate functions, it follows that for all i2 A, the left-hand sum is nonpositive while the right- 80 hand sum is nonnegative. Thus, the above-dened vector v k 2T ( w k ; SOL-ASC b k (A;b)) for all k2 suciently large. Hence, we have for all such k, 0 rf( w k ) T v k = X j2 S f( w k ) @w j v k j + X j2 b V =0 f( w k ) @w j v k j + X j2 b V 6=0 f( w k ) @w j v k j = X j2 S f( w k ) @w j b v j | {z } converges to X j2 S f( w) @w j b v j + X j2 b V =0 f( w k ) @w j w k j X j2 b V =0 f( w k ) @w j w k j | {z } converges to 0 + X j2 b V 6=0 f( w k ) @w j v k j : We can write X j2 b V 6=0 f( w k ) @w j v k j = X j2 b V 6=0 f( w k ) @w j v k j + X j2 b V 6=0 f( w k ) @w j v k j + X j2 b V 0 6=0 f( w k ) @w j v k j = X j2 b V 6=0 f( w k ) @w j w k j + X j2 b V 0 6=0 f( w k ) @w j b v j converges to X j2 b V 0 6=0 f( w) @w j b v j ; by (3.33): Hence rf( w k ) T v k k2 !rf( w) T b v as desired. The condition (3.33) is void when A is a nonnegative matrix. At this time, we are not able to address the case where there are nonzero entries in some columns of A that have mixed signs. Remark 14. The assumptions (A ) and (A ) in Proposition 19 and Theorem 13, respectively, are not expected to be easily veriable in practice. Nevertheless, they provide an explanation of the failure of convergence in Example 12; more importantly, these results show that the persistent holding of binding constraints in the limit is essential for the convergence of the B-stationary solutions of the approximated problems to a desired B-stationary solution of the ASC constrained problem (3.22). 
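The failure of convergence described in Example 12 and revisited in Remark 14 can be checked with a few lines of arithmetic. The following Python sketch verifies that the reported points $w(\widehat\delta_k) = (\widehat\delta_k/2, \widehat\delta_k/2)$ are feasible for the capped-$\ell_1$ approximated problem with the surrogate constraint binding, and that their limit $(0,0)$, while feasible for the original problem, admits a feasible descent direction and therefore cannot be B-stationary. The helper names are ours, and the sketch only checks feasibility, activity, and the directional-derivative failure at the limit; it does not re-derive the B-stationarity of $w(\widehat\delta_k)$ itself.

```python
import numpy as np

def f(w):
    # objective of (3.24)/(3.25): 0.5 * [(w1 - 1)^2 + (w2 - 1)^2]
    return 0.5 * ((w[0] - 1.0) ** 2 + (w[1] - 1.0) ** 2)

def grad_f(w):
    return np.array([w[0] - 1.0, w[1] - 1.0])

def capped_l1_lhs(w, delta):
    # left-hand side of the surrogate constraint in (3.25)
    return np.minimum(np.abs(w) / delta, 1.0).sum()

for delta in [1.0, 0.1, 0.01, 0.001]:
    w_delta = np.array([delta / 2.0, delta / 2.0])   # the stationary points reported in Example 12
    print(delta, w_delta, capped_l1_lhs(w_delta, delta))   # constraint value equals 1 (binding)

# The limit (0, 0) is feasible for (3.24) but not B-stationary: the direction d = (1, 0)
# keeps |w1|_0 + |w2|_0 <= 1, yet the objective strictly decreases along it.
w_bar = np.zeros(2)
d = np.array([1.0, 0.0])
print(grad_f(w_bar) @ d)          # = -1 < 0, so f'(w_bar; d) < 0
print(f(np.array([1.0, 0.0])))    # = 0.5 < f(w_bar) = 1: a strictly better feasible point
```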
81 Chapter 4 D-stationary Solutions of Sparse Sample Average Approximations In practice, statistical models are often trained through the method of sample average approxima- tions (SAA) for the given loss and model functions such that the resulting models produce outcomes that yield minimal error measured by the available data points. Such methodology serves as a prac- tical treatment to t the `best' model while exploiting the restrictive information of limited sample points. Ideally, the ultimate goal of the statistical learning is to nd the model that accurately forecast outcomes for the future data points, i.e., the model achieves the minimum error on the unknown data-generating distribution. Obtaining the latter model involves solving an expectation minimization problem where the combination of the loss and the model function is minimized sub- ject to the entire data population. The relationship between the SAA problem and the (theoretical) expectation minimization problem has been established when the shared objective function of the model and the loss composition is convex and dierentiable [76]. Involving sparsity functions of the folded-concave kind, the objective function of the SAA problem is no longer convex or dierentiable hence the connections between the two problems built by existing analysis fail to apply. We aim to investigate the properties of the d-stationary solutions for the sparse SAA learning problem con- necting the practically achievable solutions of the problem to a vector of interest which is related to the underlying optimal solution of the theoretical problem. The main goal of our study is to provide a generalization for the existing analysis of statistical inferences of the LASSO provided in [47, Chapter 11] by extending the special case of the ` 1 -regularized convex learning problem to 82 general settings. We provide a summary of the existing theory and present comparable analysis developed based on the stationary solutions of the non-convex programs formulated with general model, loss and sparsity functions. 4.1 Sparsity Regularized Sample Average Approximations Given a statistical modelm :R d R d !R and data points, where each of them consists of feature information x2R d and an outcome value y2R, we want to nd a vector of model parameters w2R d which minimizes the prediction error with respect to the general data population. Assume that every sample (x;y) is generated by an underlying distributionD. Ideally, we want to solve minimize w2R d I E (x;y)D ` (m(w; x ); y , L(w) (4.1) where` :RR!R is the loss function that controls the model tness to the data. This problem is only theoretical, i.e., it can not be solved in practice, because the expectation is taken over the hidden data distribution. In reality, we observe a nite number of samples that are believed to be generated from the distributionD and wish to extract the best knowledge from the limited information. A statistical method that is designed to perform such task is the sample average approximation which solves an approximation of the problem (4.1) exploiting the available samples. The problem is dened as minimize w2R d 1 N N X s=1 ` (m(w; x s ); y s , b L N (w) (4.2) whereN is the number of data points. The two problems are related as solving (4.2) is the realistic approach to tackle the problem (4.1). In [76], the relationships between their optimal objective values and the optimal solutions have been studied. 
Therein, the authors showed that the objec- tive value and the solutions of problem (4.2) converge to the outcomes of the problem (4.1) as the number of sample size grows to innity when their (shared) objective functions are convex and dierentiable. The current chapter aims to investigate the connections between the two problems when the ground truth is a sparse vector, i.e., there exists an underlying vector of parameters that minimizes the model error on the hidden data distribution with majority of its components at exactly zero. Under 83 such assumption, sparsity functions are involved in the sample average approximation to reduce the values of insignicant components to zero in the computed solutions. We choose the penalization approach to formulate the problem with the sample average approximation and the non-convex folded-concave sparsity functions. Such choice of formulation is suitable to make comparisons between the analysis to be provided and the existing results to be summarized in section 4.1.3. 4.1.1 The Lagrangian formulation The bi-criteria Lagrangian formulation for sparse representations is dened with the sample average approximation term b L N (w) and a sparsity function P :R d !R, minimize w2R d N (w) , b L N (w) + N P (w); (4.3) where the penalty parameter N > 0 controls the level of model complexity or the number of nonzero components in the computed solution. Throughout the chapter, we assume that Assumption (A 0 ). The composite function ` (m(; x); y is convex and dierentiable for any given date point (x;y)2R d+1 and the sparsity function P (w) is a dierence-of-convex function of the form given by P (w), d X j=1 c j jw j j | {z } g j (w j ) | {z } g(w) d X j=1 max 1kJ j h jk (w j ) | {z } h j (w j ) | {z } h(w) for some integers J j > 0 for all j; (4.4) where all c j are positive integers and each of h jk is a convex and dierentiable function. ThusP (w) is the sum of univariate dc functions such thatP (w) = d P j=1 p j (w j ) = d P j=1 fg j (w j )h j (w j )g where its concave summand is the negative of a pointwise maximum of nitely many convex and dierentiable functions. In chapter 2, we showed that most of the existing folded-concave sparsity surrogates have the form (4.4). Moreover, most of such functions have dierentiable concave term with J j = 1, thus resulting h(w) is a smooth function. A noteworthy point of adopting such dc representation is that the embedded convexity and dierentiability of the sparsity functions in their original form are explicitly splitted into two terms, allowing us to provide separate treatments. We 84 refer to Table 4.1.1 for a list of some of existing univariate sparsity surrogates and their dc repre- sentations. Surrogate Dierence-of-convex representation for t2R SCAD p SCAD a;> 0 (t),jtj 8 > > > > > > > > > > < > > > > > > > > > > : 0 ifjtj (jtj ) 2 2 (a 1 ) if jtj a jtj (a + 1 ) 2 2 ifjtj a MCP p MCP a;> 0 (t), 2jtj 8 > > > > < > > > > : t 2 a ifjtj a 2jtja 2 ifjtj a Capped ` 1 p CL1 a> 0 (t), jtj a max 0; t a 1; t a 1 Transformed ` 1 p TL1 a> 0 (t), a + 1 a jtj a + 1 a jtj (a + 1)jtj a +jtj Logarithmic p Log ;> 0 (t), jtj jtj log(jtj +) + log Table 4.1.1 Dierence-of-convex representation for univariate sparsity functions 4.1.2 Theoretical vs computable solutions The expectation minimization problem (4.1) is convex and dierentiable by the assumption made on the composite function`(m(w;x); y ), hence any local minimum is a global minimizer provided that the function is bounded below. 
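To make the formulation (4.3) with a surrogate written in the dc form (4.4) concrete, the following Python sketch evaluates the penalized sample average objective for a linear model with squared loss, using the capped-$\ell_1$ row of Table 4.1.1 split explicitly into its convex summands $g$ and $h$. The choice of squared loss, the linear model, the toy data, and the values of $a$ and $\lambda_N$ are our own illustrative assumptions, not prescriptions of the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data for a linear model m(w; x) = x^T w with squared loss (illustrative choice)
N, d = 50, 10
X = rng.standard_normal((N, d))
w_true = np.zeros(d)
w_true[:3] = [1.0, -2.0, 0.5]            # sparse vector used to generate the samples
y = X @ w_true + 0.1 * rng.standard_normal(N)

def L_hat(w):
    # sample average approximation \hat{L}_N(w) with squared loss
    return 0.5 * np.mean((X @ w - y) ** 2)

# capped-l1 surrogate written in the dc form (4.4): p(t) = g(t) - h(t) with
# g(t) = |t|/a and h(t) = max{0, t/a - 1, -t/a - 1}
a = 0.5
def g(t): return np.abs(t) / a
def h(t): return np.maximum.reduce([np.zeros_like(t), t / a - 1.0, -t / a - 1.0])
def P(w): return np.sum(g(w) - h(w))

lam_N = 0.05
def Phi_N(w):
    # penalized SAA objective (4.3)
    return L_hat(w) + lam_N * P(w)

w = rng.standard_normal(d)
# the dc split reproduces the usual capped-l1 value min(|t|/a, 1)
assert np.allclose(g(w) - h(w), np.minimum(np.abs(w) / a, 1.0))
print(Phi_N(w), Phi_N(w_true))
```

The point of the split is visible in the code: the nonsmoothness of the surrogate is confined to the two convex pieces, each of which can be handled separately by the dc algorithms cited later in the chapter.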
In many of the existing work, it is assumed that there exists a possibly unique global optimal solution of the problem, and the analysis is developed exploiting properties of the hypothetical solution. Such an assumption is unrealistic since it requires feeding the learning problem with innite number of data points in order to obtain such optimal solution thus the concept is only theoretical. Instead of developing analysis based on an unattainable solu- tion, we introduce the point of interest denoted by w which satises the following assumption: 85 Assumption (A N ). For any "> 0, let w satisfykr b L N (w )k 1 " for a given sample size N. We dene S as the support set of w , i.e., S,fij w i 6= 0g, in order to derive meaningful results and provide comparable analysis to the existing work. The assumption is motivated by a bound provided in [76, Theorem 7.77] which states that under suitable conditions for a given w, we have the bound sup kdk 2 =1 r b L N ( w) T drL( w) T d " for any"> 0 with a high probability. Assuming existence of the ground truth for the problem (4.1), the gradient ofL at the underlying solution is exactly zero. Thus the bound indicates that there is a good chance that the global optimal solution of the problem (4.1) satises the assumption. We point out that the global minimum of b L N (w) is also eligible forw since the gradient of the sample average approximation is zero by the rst order optimality condition. Moreover, any point that is suciently close to the minimizer of b L N may satisfy (A N ). By making the assumption, we bypass involving additional factors such as the noise vector or related probability distribution. Throughout the chapter, we aim to provide a deterministic analysis that is realistic for practical computations. In this vein, we comparew with the d(-irectional) stationary solution of the sparse SAA problem (4.3) in terms of the distance between the two vectors, their model prediction errors and the ability of the d-stationary solution to recover the support setS. Due to the employment of the sparsity surrogates, the objective function N (w) in (4.3) is non-convex and non-dierentiable. For such settings, d-stationary solutions are the `sharpest' kind of stationarity concept such that it is a necessary condition for local optimality, i.e., there is no hope for a point to be a local optimum without being a d-stationary point. In [68], the authors developed deterministic and randomized algorithms to compute d-stationary solutions for certain kinds of dc programs where the problem (4.3) is a special case, showing that the concept is provably computable using practical algorithms. For the rest of this work, we denote b w N as a d-stationary solution of the problem (4.3). 4.1.3 A summary for the LASSO analysis This section is to provide a literature review of one particular reference which presented a chapter on the statistical inferences analysis of the LASSO problem [47]. Originated from linear regression, 86 LASSO became the pioneer for sparse learning problems; it adopts the convex ` 1 -norm, dened askxk 1 , n P i=1 jx i j for x2R n , as a regularizer for the regression problem and penalizes nonzero components in the model parameters. Followed by many related theoretical and computational studies, LASSO showed its ability to produce sparse solutions as desired for problems in a wide range of applications. 
We formally dene the LASSO problem: minimize w2R d 1 2N kYXwk 2 2 + Lasso N kwk 1 , Lasso N (w) (4.5) where each row ofX2R nd andY 2R n1 are the feature information samplex T and its outcome y respectively. [See section 4.1.1 for their denitions]. LASSO is a convex program thus any local minimizer is a global optimum as proven from the convex optimization theory. Exploiting the fact, LASSO analysis given in [47, Chapter 7] compares the optimal solution of the sample average approximation problem (4.5), denoted by b w, with the underlying ground truth vector w 0 . The as- sumption onw 0 is that all attained samples are generated from the vector then perturbed by some random Gaussian noise, i.e., Y = Xw 0 + where each component i N (0; 2 ) for 1 i N. Moreover, the authors assume that the ground truth is a sparse vector with nonzero components dening its support set S 0 ,fij w 0 i 6= 0g. The topics the authors address in the referenced work are the following: how close the empirical solution is to the ground truth; if the ultimate goal of the problem is to make future predictions, is it possible to compare the empirical model outputs with the noise-free outputs produced by the ground truth on the available samples; and whether the empirical solution can recover the indices of nonzero components that are contained in S 0 . To answer these questions, the authors derive a region, which contains all vectors such that the dierence between the vector and w 0 is contained within the set. They dene V Lasso ,fv2R d kv S c 0 k 1 3kv S 0 k 1 g; where S c 0 is the complement of S 0 . Therein, the authors assume that Lasso N is strongly convex at the point w 0 with respect toV Lasso , making a connection between the empirical solution and the ground truth by utilizing the region. Though (4.5) is a convex program, strong convexity of the entire objective function can not be expected in general, e.g., high-dimensional data where the number of samples is smaller than the number of unknowns. Such assumption guarantees that the 87 submatrix of the Hessian,X T X, corresponding to the indices inS 0 has a full rank. The statement of the restricted eigenvalues assumption, analogous to restricted strong convexity [to be dened in section 4.2.1] for the special case of linear regression, is as follows: there exists a constant Lasso > 0 such that 1 N v T X T Xv kvk 2 2 Lasso for all nonzero v2V Lasso : (4.6) We list the results provided in the reference: a basic consistency bound, a bound on the prediction error, and the support recovery of b w. The assumptions imposed for each theorem and the key ideas for the proof will be discussed. Consistency result [47, Theorem 11.1]: Suppose the model matrix X satises the restricted eigenvalue bound (4.6) with respect to the setV Lasso . Given a regularization parameter Lasso N 2 N kX T k 1 > 0, any solution b w of (4.5) satises the bound kb ww 0 k 2 3 Lasso r jS 0 j N p N Lasso N : (4.7) Exploiting the fact that b w is the global minimizer of the LASSO problem (4.5), the proof of the theorem starts from Lasso N (b w) Lasso N (w 0 ). We substitute the assumption on the ground truth, Y = Xw 0 +, to both sides of the inequality, then apply the assumption on the regularization parameter Lasso N . These steps yield a key inequality given by, kX(b ww 0 )k 2 2 2N 3 2 p jS 0 j Lasso N kb ww 0 k 2 ; (4.8) which serves as a building block to derive the current theorem and the prediction error bound result to be shown. 
It can be veried that by lettingv = b ww 0 , the proof is complete provided that the restricted eigenvalue condition holds; the last step requires a lemma which shows that any error b ww 0 , associated with the LASSO solution b w belongs to the setV Lasso if the condition on Lasso N holds. Bounds on the prediction error [47, Theorem 11.2]: Suppose the matrix X satises the re- stricted eigenvalue condition (4.6) over the setV Lasso . Given a regularization parameter Lasso N 2 N kX T k 1 > 0, any solution b w of (4.5) satises the bound kX(b ww 0 )k 2 2 N 9 Lasso jS 0 j ( Lasso N ) 2 : (4.9) 88 The proof for the prediction error bound is straightforward; it can be shown by combining the restricted eigenvalue assumption (4.6) and the inequality (4.8) which is derived in the process of proving the consistency result. Assumptions for variable-selection consistency result: To address variable selection consistency of the LASSO solution b w, the authors provide a distinct set of assumptions which are related to the structures of the matrixX. The mutual incoherence (sometimes also referred as irrepresentability) condition states that there must exist some Lasso > 0 such that max j2S c 0 k (X T S 0 X S 0 ) 1 X T S 0 x j k 1 1 Lasso : (4.10) The authors point out that in the most desirable case, anyj th columnx j wherej belongs to the set of indices consisting zero components ofw 0 would be orthogonal to the columns ofX S 0 2R NjS 0 j which is the submatrix ofX that consists of columns belonging to the support setS 0 . As such is not attainable for high-dimensional linear regression, the assumption ensures that `near orthogonality' to hold for the design matrix. In addition, they assume max 1jd 1 p N kx j k 2 K clm (4.11) for someK clm > 0, which can be interpreted as the matrixX has normalized columns. For example, the matrix can be normalized such thatkx j k 2 is equal to p N for any j, resulting the value of constant K clm to be 1. The last assumption made on the matrix X is min X T S 0 X S 0 N ! C min (4.12) for some positive constant C min , where min denotes the minimum eigenvalue of the given matrix. The authors note that if this condition is violated then the columns of X S 0 are linearly dependent, and it is not possible to recover w 0 even if its supporting indices are known. Variable-selection consistency [47, Theorem 11.3]: Suppose the matrix X satises the mutual incoherence condition (4.10) with parameter Lasso > 0, the column normalization condition (4.11) and the eigenvalue condition (4.12). For a noise vector 2R N with i.i.dN (0; 2 ) entries, consider the LASSO problem (4.5) with a regularization parameter N 8K clm Lasso r logd N : 89 Then with a probability greater than 1c 1 e c 2 N 2 N , the Lasso has the following properties: 1. Uniqueness: the optimal solution b w is unique; 2. No false inclusion: The unique optimal solution has its support contained within S 0 , i.e., support(b w) support(w 0 ); 3. ` 1 - bound: the error b ww 0 satises the ` 1 bound kb w S 0 w 0 S 0 k 1 N 4 p C min +k (X T S 0 X S 0 =N) 1 k 1 | {z } B( N ;;X) wherekAk 1 for a matrix A is dened as max kuk1=1 kAuk 1 ; 4. No false exclusion: the nonzero components of the LASSO solution b w include all indices j2 S 0 such thatjw 0 j j > B( N ; ; X), and hence is variable selection consistent as long as min j2S 0 jw 0 j j>B( N ; ; X). The uniqueness of the LASSO solution is important because without such property, it would not make much sense to compare its support with the support of the ground truth. 
Showing the unique- ness involves solving a hypothetical problem; they set b w S c 0 = 0 and solve a reduced-size LASSO problem where you minimize (4.5) with respect to w S 0 2 R jS 0 j . By properties of convexity and the rst order optimality condition (referred as zero-subgradient condition in the reference), the authors show that all optimal solutions of the original LASSO are supported only on S 0 thus the solutions can be obtained by solving the reduced problem. The lower eigenvalue condition (4.12) then is used to show the uniqueness. By the rst order optimality condition for the convex and non-dierentiable problem, there exists a subgradient ofkk 1 , denoted byb z, such that 1 N X T (YX b w ) + N b z = 0. This equation can be rewritten in a block-matrix form by substituting the denition of Y : 1 N 2 6 4 X T S 0 X S 0 X T S 0 X S c 0 X T S c 0 X S 0 X T S c 0 X S c 0 3 7 5 2 6 4 b w S 0 w 0 S 0 0 3 7 5 + 1 N 2 6 4 X T S 0 X T S c 0 3 7 5 + N 2 6 4 b z S 0 b z S c 0 3 7 5 = 2 6 4 0 0 3 7 5: This is the key equation which is used to show the remaining parts of the theorem. By applying the assumptions, the authors investigate the quantity b w S 0 w 0 S 0 after solving the above equation. 90 Due to the existence of the error vector, probability is introduced in the statement; the error is a zero-mean Gaussian random noise hence the authors apply related probabilistic bounds to achieve the third part of the theorem. We have summarized Chapter 11 of the reference [47]; we have presented three theorems that show the statistical properties of the LASSO solution, along with the assumptions made for each theorem and the ideas behind their proofs. 4.2 Properties of the D-stationary Solution The LASSO problem (4.5) is a special case of the sparsity regularized sample average approximation problem (4.3) where a linear model and a quadratic loss are selected for regression and the convex ` 1 -norm is chosen to promote sparsity in the solution. Inspired by the statistical properties of the empirical LASSO solution presented in section 4.1.3, we aim to provide a comparable theory that is applicable to the formulation involving a general form of model/loss composition and most of the existing non-convex and non-dierentiable sparsity functions as discussed in section 4.1.1. Some of the assumptions and conditions imposed in the current chapter are analogous to the ones dened for the LASSO analysis. We provide a detailed comparison between our work and the theory of LASSO, and discuss about limitations which possibly arise by generalizing the problem, due to the non-convexity, or from the absence of probabilistic measures such as the underlying random noise vector. 4.2.1 The neighborhood of strong convexity We derive a region using the property of the directional stationary solutions. The region contains all vectors (b w N w ) where b w N is any d-stationary solution of the problem (4.3). Such set is used to dene the strong convexity assumption to be presented. By denition of the d-stationary solutions, b w N satises 0 (b w N ; w b w N ) 0 for any w2R d . Since b L N is dierentiable, the inequality can be rewritten as 0 1 N N X s=1 ` 0 (m(b w N ;x s ); y s )rm(b w N ; x s | {z } r b L N (b w N ) ) T (w b w N ) + N P 0 (b w N ; w b w N ) (4.13) 91 with a substitution of w for w. 
Since we assume that the composite function $\ell(m(w; x); y)$ is convex for every given sample, the gradient inequality for convex functions yields
\[
\nabla \widehat{L}_N(\widehat{w}^N)^{\rm T}(w^* - \widehat{w}^N) \;\le\; \nabla \widehat{L}_N(w^*)^{\rm T}(w^* - \widehat{w}^N). \tag{4.14}
\]
To investigate the directional derivative of the regularization term, we define a set of active indices for each component of the function $h(w) = \sum_{j=1}^d h_j(w_j)$, where each $h_j(w_j)$ is a maximum of finitely many differentiable functions, i.e., $h_j(w_j) = \max_{1 \le k \le J_j} h_{jk}(w_j)$ for some $J_j > 0$. For a given component $j$, let $\widehat{J}^N_j$ be the set of indices $\{\,k' \mid h_{jk'}(\widehat{w}^N_j) = h_j(\widehat{w}^N_j)\,\}$. Then the directional derivative of $h_j$ at the corresponding component of $\widehat{w}^N$ in the direction $d_j$ equals $\max_{k \in \widehat{J}^N_j} h'_{jk}(\widehat{w}^N_j)\,d_j$ by Danskin's theorem.

With the above discussion, we examine the directional derivative of the sparsity function $P$ at $\widehat{w}^N$ in the direction $w^* - \widehat{w}^N$:
\[
P'(\widehat{w}^N; w^* - \widehat{w}^N) \;=\; g'(\widehat{w}^N; w^* - \widehat{w}^N) - h'(\widehat{w}^N; w^* - \widehat{w}^N)
\;\le\; g(w^*) - g(\widehat{w}^N) - h'(\widehat{w}^N; w^* - \widehat{w}^N) \quad \text{by convexity of } g.
\]
Since $g$ and $h$ are both sums of univariate functions, we split the sum into two groups according to whether the index belongs to $S$ or to its complement. Knowing that $w^*_j = 0$ for any $j \notin S$, the above is written as
\[
\begin{array}{rl}
= & \displaystyle \sum_{j \notin S} \bigg[ -c_j\,|\widehat{w}^N_j| - \max_{k \in \widehat{J}^N_j} \Big\{ h'_{jk}(\widehat{w}^N_j)\,(-\widehat{w}^N_j) \Big\} \bigg] + \sum_{j \in S} \bigg[ c_j\,|w^*_j| - c_j\,|\widehat{w}^N_j| - \max_{k \in \widehat{J}^N_j} h'_{jk}(\widehat{w}^N_j)\,(w^*_j - \widehat{w}^N_j) \bigg] \hspace{2em} (4.15) \\[2ex]
\le & \displaystyle \sum_{j \notin S} \bigg[ -c_j + \max_{k \in \widehat{J}^N_j} |h'_{jk}(\widehat{w}^N_j)| \bigg]\,|\widehat{w}^N_j| + \sum_{j \in S} \bigg[ c_j + \min_{k \in \widehat{J}^N_j} |h'_{jk}(\widehat{w}^N_j)| \bigg]\,|w^*_j - \widehat{w}^N_j|,
\end{array}
\]
where the equality is obtained from the definition of $g$ and the differentiability of the $h_{jk}$. Hence, combining the d-stationarity of $\widehat{w}^N$ given in (4.13), the gradient inequality for $\ell$ from (4.14), and the derivation (4.15) of the directional derivative, we have
\[
\begin{array}{rl}
\displaystyle \sum_{j \notin S} \bigg[ c_j - \max_{k \in \widehat{J}^N_j} |h'_{jk}(\widehat{w}^N_j)| \bigg]\,|\widehat{w}^N_j|
& \le\; \displaystyle \sum_{j \in S} \bigg[ c_j + \min_{k \in \widehat{J}^N_j} |h'_{jk}(\widehat{w}^N_j)| \bigg]\,|w^*_j - \widehat{w}^N_j| + \frac{1}{\lambda_N}\,\nabla \widehat{L}_N(w^*)^{\rm T}(w^* - \widehat{w}^N) \\[2ex]
& \le\; \displaystyle \sum_{j \in S} \bigg[ c_j + \min_{k \in \widehat{J}^N_j} |h'_{jk}(\widehat{w}^N_j)| \bigg]\,|w^*_j - \widehat{w}^N_j| + \frac{1}{\lambda_N}\,\|\nabla \widehat{L}_N(w^*)\|_\infty\,\|w^* - \widehat{w}^N\|_1, \hspace{2em} (4.16)
\end{array}
\]
where the last inequality is obtained by applying Hölder's inequality. Based on the inequality (4.16), we define the region which serves as the neighborhood of strong convexity (to be formally stated). Before providing the definition of the region, we make additional assumptions on the model constants:

Assumption (A$_\gamma$). For any scalar $0 < \gamma < 1$, let $\displaystyle \lambda_N \ge \frac{\epsilon}{\gamma\,\min_{1 \le j \le d} c_j}$, where $\epsilon$ is the constant from assumption (A$_N$);

Assumption (A$_{\rm s}$). For every $j$th component, the sparsity function $P$ satisfies $\displaystyle \sup_{w_j} |h'_{jk}(w_j)| \le c_j$, where $h'_{jk}(w_j) \triangleq \dfrac{d\,h_{jk}(w_j)}{d\,w_j}$, for all $1 \le k \le J_j$.

To compare (A$_\gamma$) with the condition imposed on the regularization parameter for the LASSO, take $c_j = 1$ for all components. If (A$_N$) holds, then assumption (A$_\gamma$) implies
\[
\lambda_N\,\gamma \;\ge\; \|\nabla \widehat{L}_N(w^*)\|_\infty. \tag{4.17}
\]
For the special case of the LASSO, the last quantity equals $\|\tfrac{1}{N}X^{\rm T}\varepsilon\|_\infty$ if we substitute $w^* = w^0$, the underlying vector that generates the samples for the LASSO problem [see Section 4.1.3]. We verify that by taking $\gamma = 1/2$, the inequality (4.17) yields the condition $\lambda_N^{\rm Lasso} \ge 2\,\tfrac{1}{N}\|X^{\rm T}\varepsilon\|_\infty$ imposed on the regularization parameter in the LASSO consistency theorem, whereas we allow $\gamma$ to be any value between 0 and 1. The constant $\gamma$ is required to define the region to be presented; we provide additional discussion of this constant when we compare the two regions.
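The directional derivative $P'(\widehat{w}^N;\cdot)$ used in (4.15)–(4.16) is straightforward to evaluate numerically for any dc surrogate written as $P(w) = \sum_j \big[c_j|w_j| - \max_k h_{jk}(w_j)\big]$. The following is a minimal sketch in Python (our own illustrative code; the pieces $h_{jk}$ and their derivatives are supplied by the caller) implementing the active-index rule from Danskin's theorem; the usage example employs the capped-$\ell_1$ decomposition listed below.

```python
import numpy as np

def dc_sparsity_dirderiv(w, d, c, h_funcs, h_derivs, tol=1e-12):
    """Directional derivative P'(w; d) of P(w) = sum_j [ c_j|w_j| - max_k h_jk(w_j) ].

    h_funcs[j] / h_derivs[j] list the pieces h_jk and their derivatives for
    component j (hypothetical inputs used purely for illustration).
    """
    val = 0.0
    for j, (wj, dj) in enumerate(zip(w, d)):
        # directional derivative of g_j(w_j) = c_j |w_j|
        g_dir = c[j] * (np.sign(wj) * dj if wj != 0 else abs(dj))
        # active pieces of h_j at w_j, then Danskin's rule
        vals = [f(wj) for f in h_funcs[j]]
        active = [k for k, v in enumerate(vals) if v >= max(vals) - tol]
        h_dir = max(h_derivs[j][k](wj) * dj for k in active)
        val += g_dir - h_dir
    return val

# Capped-l1 with a = 1: g(t) = |t|/a, h(t) = max{0, t/a - 1, -t/a - 1}.
a = 1.0
pieces = [lambda t: 0.0, lambda t: t / a - 1, lambda t: -t / a - 1]
derivs = [lambda t: 0.0, lambda t: 1 / a,     lambda t: -1 / a]
w, d = np.array([0.3, -2.0]), np.array([1.0, 1.0])
print(dc_sparsity_dirderiv(w, d, c=[1 / a, 1 / a],
                           h_funcs=[pieces, pieces], h_derivs=[derivs, derivs]))
```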
The assumption (A$_{\rm s}$) is related to the choice of sparsity function in the formulation (4.3). It can be interpreted as requiring that the slope of the concave component of the selected univariate sparsity function never dominates the slope of its convex component, uniformly over all $w$. While such a condition may seem restrictive, we examine the sparsity surrogates listed in Table 4.1.1 to verify that the assumption is applicable. Consider univariate folded-concave functions of $t \in \mathbb{R}$ of the form $p(t) = c\,|t| - \max_{1 \le k \le J} h_k(t)$, defined by some positive parameters. By definition, we have the following.

Capped $\ell_1$, with three components in the concave term ($J = 3$):
\[
c = \frac{1}{a}; \qquad {h^{\rm capL1}_a}'(t) \in \Big\{\, 0,\ \frac{1}{a},\ -\frac{1}{a} \,\Big\};
\]
MCP ($J = 1$):
\[
c = 2\lambda; \qquad {h^{\rm MCP}_{a,\lambda}}'(t) = \begin{cases} \dfrac{2t}{a} & \text{if } |t| \le a\lambda \\[1ex] 2\lambda\,{\rm sign}(t) & \text{if } |t| \ge a\lambda; \end{cases}
\]
SCAD ($J = 1$):
\[
c = \lambda; \qquad {h^{\rm SCAD}_{a,\lambda}}'(t) = \begin{cases} 0 & \text{if } |t| \le \lambda \\[1ex] \dfrac{|t| - \lambda}{a - 1}\,{\rm sign}(t) & \text{if } \lambda \le |t| \le a\lambda \\[1ex] \lambda\,{\rm sign}(t) & \text{if } |t| \ge a\lambda. \end{cases}
\]
For the above surrogates, we observe that $\sup_{w_j} |h'_{jk}(w_j)| = c_j$ for any $k$ and for all components. The following surrogates have infinitely differentiable $h$. We verify that

Log ($J = 1$):
\[
c = \frac{\lambda}{\epsilon}; \qquad {h^{\rm Log}_{\epsilon,\lambda}}'(t) = \frac{\lambda\,t}{\epsilon\,(\epsilon + |t|)};
\]
Transformed $\ell_1$ ($J = 1$):
\[
c = \frac{a+1}{a}; \qquad {h^{\rm TL1}_a}'(t) = \left[ \frac{a+1}{a} - \frac{a\,(a+1)}{(a + |t|)^2} \right] {\rm sign}(t).
\]
For the last two surrogates, $|h'(w_j)|$ is strictly less than $c_j$ at every $w_j$, validating that the assumption is suitable for the existing sparsity functions provided in Table 4.1.1.

To complete the derivation of the region, we apply the stated assumptions to the inequality (4.16). Suppose that for some $0 < \gamma < \gamma' < 1$ we have $\max_{k \in \widehat{J}^N_j} |h'_{jk}(\widehat{w}^N_j)| \le c_j\,(\gamma' - \gamma)$ for every $j$ that is not in the support set $S$. Then by assumptions (A$_N$), (A$_\gamma$) and (A$_{\rm s}$), the inequality (4.16) reduces to
\[
\sum_{j \notin S} (1 - \gamma')\,c_j\,|\widehat{w}^N_j| \;\le\; \sum_{j \in S} (2 + \gamma)\,c_j\,|w^*_j - \widehat{w}^N_j|.
\]
Summarizing the above derivation, we define the region on which restricted strong convexity will be imposed; the formal statement of the assumption follows the discussion of the region. For the given support set $S$ of $w^*$, let $\gamma$ and $\gamma'$ be given scalars satisfying $0 < \gamma < \gamma' < 1$. We define the described region as
\[
\mathcal{V}_{\gamma,\gamma'} \;\triangleq\; \left\{\, v \in \mathbb{R}^d \ \middle|\ \sum_{j \notin S} c_j\,|v_j| \;\le\; \frac{2 + \gamma}{1 - \gamma'} \sum_{j \in S} c_j\,|v_j| \,\right\}.
\]
It is clear that for any fixed constants $\gamma$ and $\gamma'$ the region is a cone: for any $\tau > 0$, we have $\tau u \in \mathcal{V}_{\gamma,\gamma'}$ for any $u \in \mathcal{V}_{\gamma,\gamma'}$. Moreover, it is a non-convex cone, which can be illustrated by a simple example. Let $u = (1, 1)$ and $u' = (7, -1)$, both contained in $\mathcal{V}_{0.2,\,0.7}$ (with $S = \{2\}$ and $c_j = 1$). The convex combination of the 2-dimensional vectors, $\alpha u + (1 - \alpha) u'$ with $\alpha = 0.1$, is not in the region, violating the definition of a convex set. The provided set is comparable to the region derived for the LASSO analysis; the only difference between the two is the constant $\frac{2+\gamma}{1-\gamma'}$ appearing in front of the summation on the right-hand side. We recall the definition of $\mathcal{V}_{\rm Lasso}$ given in Section 4.1.3:
\[
\mathcal{V}_{\rm Lasso} \;\triangleq\; \left\{\, v \in \mathbb{R}^d \ \middle|\ \sum_{j \notin S} |v_j| \;\le\; 3 \sum_{j \in S} |v_j| \,\right\}.
\]
The quotient in the definition of $\mathcal{V}_{\gamma,\gamma'}$ is strictly greater than 2 and can be made arbitrarily large by choosing $\gamma'$ close to 1. If the value of the ratio is strictly greater than 3, then any vector contained in the LASSO region is contained in our region. Nevertheless, we can choose the constants $\gamma$ and $\gamma'$ so that $\mathcal{V}_{\gamma,\gamma'}$ is a subset of $\mathcal{V}_{\rm Lasso}$. For example, let us choose $\gamma' = 2\gamma$, and consider $u = (u_1, u_2)$ contained in $\mathcal{V}_{\gamma,\gamma'}$ where $S = \{2\}$. We can verify that the value of the quotient $\frac{2+\gamma}{1-2\gamma}$ ranges as follows:
\[
2 < \frac{2+\gamma}{1-2\gamma} \le 3 \quad \text{if } 0 < \gamma \le \frac{1}{7}; \qquad\qquad 3 < \frac{2+\gamma}{1-2\gamma} \quad \text{if } \frac{1}{7} < \gamma < \frac{1}{2};
\]
indicating that our region is a subset of $\mathcal{V}_{\rm Lasso}$ in the former case and a superset of $\mathcal{V}_{\rm Lasso}$ in the latter case.
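These membership claims are easy to verify numerically. Below is a minimal sketch (our own helper functions, with unit weights $c_j = 1$) that reproduces the non-convexity example and evaluates the quotient $\frac{2+\gamma}{1-\gamma'}$ for the two parameter choices compared with $\mathcal{V}_{\rm Lasso}$ above; the specific points and parameters follow the example in the text.

```python
import numpy as np

def in_V(v, S, gamma, gamma_p, c=None):
    """Membership test for the cone V_{gamma,gamma'} defined above (weights c_j)."""
    v = np.asarray(v, float)
    c = np.ones_like(v) if c is None else np.asarray(c, float)
    mask = np.zeros(len(v), dtype=bool)
    mask[list(S)] = True
    quotient = (2 + gamma) / (1 - gamma_p)
    return np.sum(c[~mask] * np.abs(v[~mask])) <= quotient * np.sum(c[mask] * np.abs(v[mask]))

def in_V_lasso(v, S):
    """Membership test for V_Lasso (quotient 3, unit weights)."""
    v = np.asarray(v, float)
    mask = np.zeros(len(v), dtype=bool)
    mask[list(S)] = True
    return np.sum(np.abs(v[~mask])) <= 3 * np.sum(np.abs(v[mask]))

# Non-convexity example: S = {2} is index 1 in 0-based indexing.
u, u2, alpha = np.array([1.0, 1.0]), np.array([7.0, -1.0]), 0.1
print(in_V(u, {1}, 0.2, 0.7), in_V(u2, {1}, 0.2, 0.7),
      in_V(alpha * u + (1 - alpha) * u2, {1}, 0.2, 0.7))   # True True False

# Quotients for the two choices gamma' = 2*gamma discussed above.
for g in (1 / 100, 1 / 3):
    print((2 + g) / (1 - 2 * g))   # ~2.05 (subset of V_Lasso) and 7.0 (superset)
```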
The example is illustrated in Figure 4.1: we plot the LASSO region and compare it with $\mathcal{V}_{\gamma,\gamma'}$ for two different choices of $\gamma$. In the figure, we present the case where the constant $\gamma$ takes the value $1/100$, so that the resulting $\mathcal{V}_{1/100,\,2/100}$ is a subset of the LASSO region. We also present a plot of our region for $\gamma = 1/3$, showing that the resulting $\mathcal{V}_{1/3,\,2/3}$ is a superset of $\mathcal{V}_{\rm Lasso}$.

[Figure 4.1: 2-dimensional examples of the regions (axes $u_1$, $u_2$): (a) $\mathcal{V}_{\rm Lasso}$, with boundary point $(1, \tfrac{1}{3})$; (b) $\mathcal{V}_{1/100,\,2/100}$, with boundary point $(1, 0.4876)$; (c) $\mathcal{V}_{1/3,\,2/3}$, with boundary point $(1, \tfrac{1}{7})$.]

The region $\mathcal{V}_{\gamma,\gamma'}$ is required to stipulate the scope of strong convexity for the sparse SAA problem (4.3). In the LASSO analysis, the authors motivate such an assumption by examining the high-dimensional linear regression problem. Consider minimizing $\widehat{L}^{\rm lse}_N(w) = \frac{1}{2N}\|Xw - Y\|_2^2$, where the number of columns of $X$ exceeds the number of rows. It can be verified that this function is strongly convex if and only if all the eigenvalues of its Hessian, $\frac{1}{N}X^{\rm T}X$, are strictly positive. Thus we cannot expect strong convexity of the least-squares error for linear regression, even though the function is convex, since the Hessian is always rank deficient for high-dimensional data matrices. Therefore, the authors assume that the function is strongly convex only with respect to the set $\mathcal{V}_{\rm Lasso}$; the assumption is given in (4.6) for the special case $\widehat{L}_N(w) = \widehat{L}^{\rm lse}_N(w)$. We make an analogous assumption for the general loss/model composite $\widehat{L}_N(w)$:

Assumption (A$_{\rm RSC}$). There exists $\ell_{\min} > 0$ such that
\[
\ell_{\min}\,\|w^* - w\|_2^2 \;\le\; \widehat{L}_N(w^*) - \widehat{L}_N(w) - \nabla \widehat{L}_N(w)^{\rm T}(w^* - w)
\]
for all $w$ such that $(w^* - w) \in \mathcal{V}_{\gamma,\gamma'}$.

The assumption indicates that the (convex) sample average approximation $\widehat{L}_N$ is strongly convex near the vector $w^*$, with the range of strong convexity determined by the derived region $\mathcal{V}_{\gamma,\gamma'}$.

4.2.2 Consistency bound for the d-stationary solutions

The empirical solutions of statistical learning problems are in general computed so as to minimize the error on the available data points for the given statistical model and loss function. While this minimizing property is of great interest from a practical computational perspective, it is desirable to compare the achieved solution to the underlying solution that minimizes the error over the general data population. As discussed in Section 4.1.2, our analysis compares the d-stationary solutions of problem (4.3) with a point of interest $w^*$. The following theorem investigates the distance between the two vectors in terms of the $\ell_2$-norm and provides the consistency bound.

Theorem 15. Let assumptions (A$_0$), (A$_N$), (A$_\gamma$), (A$_{\rm s}$), and (A$_{\rm RSC}$) hold. Let $\widehat{w}^N$ be a directional stationary solution of (4.3). If, for some $0 < \gamma < \gamma' < 1$, it holds that
\[
\max_{k \in \widehat{J}^N_j} |h'_{jk}(\widehat{w}^N_j)| \;\le\; c_j\,(\gamma' - \gamma) \quad \text{for every } j \notin S, \tag{4.18}
\]
where $S$ is the support of $w^*$, then
\[
\|w^* - \widehat{w}^N\|_2 \;\le\; \frac{3}{\ell_{\min}}\,\lambda_N \max_{1 \le j \le d} c_j\,\sqrt{|S|}. \tag{4.19}
\]
Proof. We start from the restricted strong convexity assumption (A$_{\rm RSC}$):
\[
\begin{array}{rl}
\ell_{\min}\,\|w^* - \widehat{w}^N\|_2^2 & \le\; \widehat{L}_N(w^*) - \widehat{L}_N(\widehat{w}^N) - \nabla \widehat{L}_N(\widehat{w}^N)^{\rm T}(w^* - \widehat{w}^N) \qquad \text{since } (w^* - \widehat{w}^N) \in \mathcal{V}_{\gamma,\gamma'} \\[1ex]
& =\; \widehat{L}_N(w^*) - \widehat{L}_N(\widehat{w}^N) + \lambda_N P'(\widehat{w}^N; w^* - \widehat{w}^N) \\[0.5ex]
& \qquad -\; \Big[\, \nabla \widehat{L}_N(\widehat{w}^N)^{\rm T}(w^* - \widehat{w}^N) + \lambda_N P'(\widehat{w}^N; w^* - \widehat{w}^N) \,\Big].
\end{array}
\]
By convexity of $\widehat{L}_N$ and the d-stationarity of $\widehat{w}^N$, the above reduces to
\[
\begin{array}{rl}
& \le\; \nabla \widehat{L}_N(w^*)^{\rm T}(w^* - \widehat{w}^N) + \lambda_N P'(\widehat{w}^N; w^* - \widehat{w}^N) \\[1ex]
& \le\; \displaystyle \sum_{j \notin S} \Big[\, \epsilon - \lambda_N c_j + \lambda_N \max_{k \in \widehat{J}^N_j} |h'_{jk}(\widehat{w}^N_j)| \,\Big]\,|\widehat{w}^N_j| \;+\; \sum_{j \in S} \Big[\, \epsilon + \lambda_N c_j - \lambda_N \min_{k \in \widehat{J}^N_j} |h'_{jk}(\widehat{w}^N_j)| \,\Big]\,|w^*_j - \widehat{w}^N_j|
\end{array}
\]
by the inequality (4.16) and the assumption (A$_N$).
We verify that the coefficient of each summand in the first sum of the last expression is strictly negative by assumptions (A$_\gamma$) and (A$_{\rm s}$) and by the condition of the theorem. Therefore the above continues as
\[
\le\; 3\,\lambda_N \max_{1 \le j \le d} c_j \sum_{j \in S} |w^*_j - \widehat{w}^N_j| \;\le\; 3\,\lambda_N \max_{1 \le j \le d} c_j\,\sqrt{|S|}\;\|w^* - \widehat{w}^N\|_2,
\]
where the last inequality is obtained by using the properties of the vector norm; dividing both sides by $\ell_{\min}\,\|w^* - \widehat{w}^N\|_2$ yields the bound (4.19). $\square$

The assumption (A$_{\rm s}$) ensures that at each $j$th component of the stationary solution $\widehat{w}^N$, the slope of $h_j$ at $\widehat{w}^N_j$ is bounded above by $c_j$, the coefficient in $g_j$ for the selected sparsity function. In addition to this assumption, the condition of the theorem requires $\widehat{w}^N_j$ to satisfy a strict inequality, $\min_{k \in \widehat{J}^N_j} |h'_{jk}(\widehat{w}^N_j)| < c_j$, for the components that are not in the support of $w^*$. Note that this condition is automatically satisfied if a convex sparsity function is employed in problem (4.3), since the sparsity term $P(w)$ then consists only of the convex function $g(w)$, i.e., $h(w) \equiv 0$. To understand the condition, we examine dc surrogates with differentiable, piecewise-defined $h$. For SCAD, we verify that the strict inequality only holds if $|\widehat{w}^N_j| < a\lambda$, where the function parameters are positive scalars. Likewise, the strict inequality in the condition is only achieved if $|\widehat{w}^N_j| < a\lambda$ for MCP, where $a$ and $\lambda$ are both positive constants. Thus, at least for the two mentioned surrogates, the condition can be interpreted as requiring the value of $\widehat{w}^N_j$ to be near the origin whenever the corresponding component of $w^*$ is zero. In the discussion provided in Section 4.2.1, we verified that the logarithmic and transformed $\ell_1$ functions always yield the strict inequality for any value of $w_j$.

We now examine the constants that appear in the bound (4.19). The strong convexity modulus $\ell_{\min}$ is inversely proportional to the bound, indicating that if the neighborhood of $w^*$ is more strongly convex, then the achieved bound is tighter. When the vector $w^*$ is the optimal solution of the expectation minimization problem (4.1), the quantity $\|\nabla \widehat{L}_N(w^*)\|_\infty$ can be interpreted as a measure of the error caused by the sample average approximation, since $\nabla L(w^*) = 0$. Such an error is expected to appear because we are comparing $w^*$ with the solution of the approximated problem. The right-hand side of (4.19) implicitly contains this quantity, since the product $\lambda_N \max_{1 \le j \le d} c_j$ is strictly greater than $\|\nabla \widehat{L}_N(w^*)\|_\infty$ by assumptions (A$_N$) and (A$_\gamma$).

The derived result achieves the same bound as the LASSO consistency result shown in Section 4.1.3. It remains to compare the conditions imposed on the regularization parameters $\lambda_N$ and $\lambda_N^{\rm Lasso}$. Let $c_j = 1$ for all components. By combining the assumptions (A$_N$) and (A$_\gamma$), we have $\lambda_N\,\gamma \ge \|\nabla \widehat{L}_N(w^*)\|_\infty$. If $w^*$ satisfies the LASSO setting, with $y = Xw^* + \varepsilon$, then $\|\nabla \widehat{L}_N(w^*)\|_\infty$ is equal to $\frac{1}{N}\|X^{\rm T}\varepsilon\|_\infty$. Thus the assumptions imply the condition imposed on the LASSO regularization parameter $\lambda_N^{\rm Lasso}$, with $\gamma$ taking the specific value $\tfrac{1}{2}$. The obtained bound for $w^* - \widehat{w}^N$ is a deterministic result that applies to a fixed data set. It is worth pointing out that we achieve a bound as good as the existing result while generalizing the model and loss functions from the specific choices of a linear model and a quadratic loss. We also relax the requirement of the convex $\ell_1$-norm as the sparsity function to dc surrogates that satisfy the assumption (A$_0$). Another noteworthy point is that our result is developed in terms of stationary points, unlike the existing analysis, which requires optimality of the computed solutions.
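As a small numerical companion to Theorem 15, the sketch below evaluates the right-hand side of (4.19) and checks condition (4.18) for the logarithmic surrogate from Section 4.2.1. The point $\widehat{w}^N$, the modulus $\ell_{\min}$, and all parameter values are hypothetical inputs chosen for illustration; in practice they would come from a solver and from the problem data.

```python
import numpy as np

def consistency_bound(lambda_N, c, S, ell_min):
    """Right-hand side of (4.19): (3/ell_min) * lambda_N * max_j c_j * sqrt(|S|)."""
    return 3.0 / ell_min * lambda_N * max(c) * np.sqrt(len(S))

def check_condition_418(w_hat, S, c, gamma, gamma_p, h_abs_deriv):
    """max_k |h'_jk(w_hat_j)| <= c_j * (gamma' - gamma) for every j not in S."""
    return all(h_abs_deriv(w_hat[j]) <= c[j] * (gamma_p - gamma)
               for j in range(len(w_hat)) if j not in S)

# Log surrogate (Section 4.2.1): c = lam/eps and |h'(t)| = lam*|t| / (eps*(eps+|t|)).
lam, eps = 1.0, 0.5
c_log = lam / eps
h_abs = lambda t: lam * abs(t) / (eps * (eps + abs(t)))

w_hat = np.array([0.9, 0.0, 0.02])        # hypothetical d-stationary point, S = {0}
gamma, gamma_p = 0.1, 0.6
print(check_condition_418(w_hat, S={0}, c=[c_log] * 3, gamma=gamma,
                          gamma_p=gamma_p, h_abs_deriv=h_abs))          # True here
print(consistency_bound(lambda_N=0.05, c=[c_log] * 3, S={0}, ell_min=0.25))
```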
4.2.3 Bounds on the prediction error

The above analysis shows how close empirically computable d-stationary solutions can be to the vector $w^*$ that satisfies the assumption (A$_N$). While the trained models built from the stationary solutions may achieve the desired level of prediction accuracy on the training data, e.g., by overfitting, it is not clear whether such models would perform well on general (unobserved) data. In this vein, we aim to derive a bound on the quantity $m(w^*;\cdot) - m(\widehat{w}^N;\cdot)$; we refer to this quantity as the prediction error. Since the model function is composed with a loss function, only the composite function $\widehat{L}_N(\cdot)$ has been explored thus far; to address the goal of the current section, we examine the model function and its properties independently of the loss function. We note that the mentioned prediction error is only meaningful for regression problems, since a model output in a classification setting is the probability of the given point belonging to a certain class. For this reason, we restrict the results presented in the current section to regression problems.

We first explore some loss/model combinations that fit the framework of the current chapter. In Section 4.1.1, we assume that the composite function $\ell(m(w; x^s); y^s)$ is convex and differentiable for any given $(x^s, y^s)$. One may argue that this assumption is too restrictive; however, such loss/model combinations are common in practice. For regression problems, the linear model is often composed with differentiable convex functions such as the quadratic loss, the Huber loss, or the log-cosh function. For a non-linear convex model, such convexity is obtained when the loss is convex and non-decreasing; concave models may satisfy the requirement in a similar way, with a convex and non-increasing loss. Even more complex models, which are neither convex nor concave, can satisfy the assumption. One example is logistic binary classification: the (non-convex and non-concave) sigmoid model is composed with the log function, resulting in the cross-entropy error objective, which is convex. These examples indicate that functions of various classes can play the role of the model function.
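The logistic example can be checked directly: for the sigmoid model composed with the cross-entropy (log) loss, the Hessian of the per-sample objective in $w$ is $p(1-p)\,x x^{\rm T}$, where $p$ is the predicted probability, which is positive semidefinite. A minimal sketch of this check (our own illustrative code, with labels in $\{-1,+1\}$):

```python
import numpy as np

def logistic_loss(w, x, y):
    """Cross-entropy loss of the sigmoid model m(w; x) = 1/(1 + exp(-w^T x))."""
    return np.logaddexp(0.0, -y * (x @ w))   # = log(1 + exp(-y * w^T x)), stable form

def logistic_hessian(w, x, y):
    p = 1.0 / (1.0 + np.exp(-(x @ w)))       # predicted probability of the +1 class
    return p * (1 - p) * np.outer(x, x)      # PSD for every (x, y): composite is convex

rng = np.random.default_rng(2)
x, y, w = rng.standard_normal(5), 1.0, rng.standard_normal(5)
print(logistic_loss(w, x, y),
      np.all(np.linalg.eigvalsh(logistic_hessian(w, x, y)) >= -1e-10))   # ..., True
```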
In fact, the challenge in developing a prediction bound lies in deciding what structure to assume for the model function, since exploiting structural properties is essential to deriving such a bound. A good starting point is to investigate the case of a linear model. Consider the sparse least-squares error minimization problem involving the loss
\[
\widehat{L}^{\rm lse}_N(w) \;\triangleq\; \frac{1}{N}\sum_{s=1}^N \big(w^{\rm T} x^s - y^s\big)^2 \;=\; \frac{1}{N}\,\|Xw - y\|_2^2, \tag{4.20}
\]
and dc sparsity surrogates. Assuming that the samples are generated by $w^*$ and then perturbed by a random Gaussian noise vector $\varepsilon$, as stated in the LASSO analysis, we can rewrite the restricted strong convexity assumption (A$_{\rm RSC}$) at the d-stationary point $\widehat{w}^N$ as
\[
\begin{array}{rl}
\ell_{\min}\,\|w^* - \widehat{w}^N\|_2^2 & \le\; \widehat{L}^{\rm lse}_N(w^*) - \widehat{L}^{\rm lse}_N(\widehat{w}^N) - \nabla \widehat{L}^{\rm lse}_N(\widehat{w}^N)^{\rm T}(w^* - \widehat{w}^N) \\[1ex]
& =\; \dfrac{1}{N}\,\|Xw^* - y\|_2^2 - \dfrac{1}{N}\,\|X\widehat{w}^N - y\|_2^2 - \dfrac{2}{N}\,(w^* - \widehat{w}^N)^{\rm T} X^{\rm T}(X\widehat{w}^N - y) \\[1ex]
& =\; \dfrac{1}{N}\,\|X(w^* - \widehat{w}^N)\|_2^2 \qquad \text{since } y = Xw^* + \varepsilon \\[1ex]
& \le\; 3\,\lambda_N \max_{1 \le j \le d} c_j\,\sqrt{|S|}\;\|w^* - \widehat{w}^N\|_2 \qquad \text{from the proof of Theorem 15.}
\end{array} \tag{4.21}
\]
With the above derivation, we are ready to present the prediction error bound for the special case of problem (4.3) obtained by selecting a linear model and a quadratic loss.

Theorem 16. Suppose $y = Xw^* + \varepsilon$, where each component of $\varepsilon$ follows $\mathcal{N}(0, \sigma^2)$. Let assumptions (A$_0$), (A$_N$), (A$_\gamma$), (A$_{\rm s}$), and (A$_{\rm RSC}$) hold. Let $\widehat{w}^N$ be a d-stationary solution of problem (4.3) with $\widehat{L}_N(w)$ of the form (4.20). If condition (4.18) holds, then
\[
\frac{1}{N}\,\|X(\widehat{w}^N - w^*)\|_2^2 \;\le\; \frac{9}{\ell_{\min}}\,\Big(\lambda_N \max_{1 \le j \le d} c_j\Big)^2\,|S|. \tag{4.22}
\]
Proof. From (4.21), it follows that
\[
\left( \frac{1}{N}\,\|X(w^* - \widehat{w}^N)\|_2^2 \right)^2 \;\le\; \Big( 3\,\lambda_N \max_{1 \le j \le d} c_j\,\sqrt{|S|} \Big)^2 \|w^* - \widehat{w}^N\|_2^2 \;\le\; \Big( 3\,\lambda_N \max_{1 \le j \le d} c_j\,\sqrt{|S|} \Big)^2 \frac{1}{\ell_{\min}\,N}\,\|X(w^* - \widehat{w}^N)\|_2^2. \qquad \square
\]
The presented result achieves a bound as tight as the LASSO prediction error bound in [47, Theorem 11.2]. Derived for d-stationary solutions, the bound is applicable to least-squares error minimization problems formulated with dc sparsity functions. The current result and its proof are analogous to the bound and derivation for the LASSO prediction error, as both are consequences of the consistency bound presented in the previous section. The key step in achieving the bound is to examine the right-hand side expression of the restricted strong convexity assumption by expressing its terms explicitly through the model function. To understand the importance of the structure of the model, we inspect this expression for the case where a general $m$ is employed with a quadratic loss:
\[
\begin{array}{l}
\widehat{L}_N(w^*) - \widehat{L}_N(\widehat{w}^N) - \nabla \widehat{L}_N(\widehat{w}^N)^{\rm T}(w^* - \widehat{w}^N) \\[1ex]
\quad =\; \displaystyle \frac{1}{N}\sum_{s=1}^N \big( m(w^*; x^s) - m(\widehat{w}^N; x^s) \big)^2 \;-\; \frac{2}{N}\sum_{s=1}^N \big( y^s - m(\widehat{w}^N; x^s) \big)\Big( m(w^*; x^s) - m(\widehat{w}^N; x^s) - \nabla m(\widehat{w}^N; x^s)^{\rm T}(w^* - \widehat{w}^N) \Big). \hspace{2em} (4.23)
\end{array}
\]
We observe that the first term in (4.23) represents the prediction error between the two models, while the second term is a residual cross term, which vanishes when $m$ is linear in $w$. The equation clearly shows that even in this restricted setting, where the choice of loss function is limited, the bound (4.22) is hard to achieve unless $m$ is a linear model. Additional conditions could be imposed, e.g., a bound on the training error or strong convexity of the model, to handle the last term; nevertheless, such remedies would introduce a trade-off by loosening the bound.

For learning problems with general models, the first quantity in (4.23) represents the sample average of squared errors between the models generated by the stationary solution and the vector of comparison. The aim of the current chapter is to provide a comprehensive analysis applicable to a wide range of settings while minimally exploiting the structure of the loss and model functions. With this intent, we introduce the following assumption on the model function.

Assumption (A$_{\rm m}$). For any given sample $x^s$, the model $m : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ satisfies
\[
|m(w; x^s) - m(w'; x^s)| \;\le\; {\rm Lip}_{m_s}\,\|w - w'\|_2 \quad \text{for some } {\rm Lip}_{m_s} > 0,
\]
for any $w$ and $w'$.

This assumption on the model function is broad enough that any function with a bounded gradient satisfies it. When $m$ is a linear model, the Lipschitz constant ${\rm Lip}_{m_s}$ equals $\|x^s\|_2$, showing that the constant may depend on the samples. With the above assumption, we present the following theorem.

Theorem 17. Let assumptions (A$_0$), (A$_N$), (A$_\gamma$), (A$_{\rm s}$), (A$_{\rm RSC}$), and (A$_{\rm m}$) hold. If condition (4.18) holds, then any directional stationary solution $\widehat{w}^N$ of problem (4.3) satisfies
\[
\frac{1}{N}\sum_{s=1}^N \big| m(w^*; x^s) - m(\widehat{w}^N; x^s) \big| \;\le\; \frac{3}{\ell_{\min}}\,\lambda_N \max_{1 \le j \le d} c_j\,\sqrt{|S|}\;L_m, \tag{4.24}
\]
where $L_m$ is defined as $\frac{1}{N}\sum_{s=1}^N {\rm Lip}_{m_s}$. The proof is straightforward and hence omitted. The presented result is a deterministic bound for regression problems involving Lipschitz statistical models.
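For the linear model the constant $L_m$ in (4.24) is simply the average sample norm, so the right-hand side of the bound can be computed directly from the data. A minimal sketch (the modulus $\ell_{\min}$ and the remaining constants are hypothetical inputs chosen for illustration):

```python
import numpy as np

def prediction_error_bound(X, lambda_N, c, S, ell_min):
    """Right-hand side of (4.24) for a linear model m(w; x) = w^T x.

    For the linear model Lip_ms = ||x^s||_2, so L_m is the average sample norm.
    The restricted strong convexity modulus ell_min is supplied by the caller.
    """
    L_m = np.linalg.norm(X, axis=1).mean()        # (1/N) * sum_s ||x^s||_2
    return 3.0 / ell_min * lambda_N * max(c) * np.sqrt(len(S)) * L_m

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 20))
print(prediction_error_bound(X, lambda_N=0.05, c=[1.0] * 20, S={0, 3}, ell_min=0.3))
```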
For classification, each model output is the probability of a point belonging to a specific class; e.g., binary classification sets a threshold of 0.5, and any point with a model output below this value is identified as one class, whereas a point with a value above the threshold is identified as the other class. It is therefore not meaningful to compare the absolute difference between two model outputs as in (4.24). By assuming the Lipschitz property of the model function, we are able to obtain the model prediction error bound without requiring any structure of the loss function. The bound (4.24) is the consistency bound (4.19) multiplied by a factor of $L_m$. Assuming that the samples are independent and identically distributed, the constant $L_m$ represents the sample average of the Lipschitz constant of the model function. For the specific case of a linear model, we verify that $L_m$ is the average $\ell_2$-norm of the samples.

4.2.4 Support recovery

Up to this point, we have examined the distance between the two vectors of interest and the prediction error bound for the difference between their model outputs. While these results provide insight into the d-stationary solution of the sparse SAA problem, a different kind of information may be sought in some applications. For example, in bioinformatics a fundamental assumption is that only a few variables influence the outcomes of the data; for such settings, the primary interest in solving the learning problem is to correctly identify those significant variables. The property associated with recovering the supporting indices of the underlying optimal solution (of the expectation minimization problem (4.1)) is called variable-selection consistency. Consider the case where $w^*$ is the optimal solution of the underlying problem. Even with the tightest consistency or prediction error bounds, the relationship between the support of the d-stationary solution and the support of $w^*$ remains ambiguous. In the current section we develop an analysis investigating the inclusion relationships between the two sets. To establish complete support recovery of the d-stationary solution $\widehat{w}^N$, we need to show the following two properties:
1. No false inclusion: ${\rm support}(\widehat{w}^N) \subseteq {\rm support}(w^*)$;
2. No false exclusion: ${\rm support}(\widehat{w}^N) \supseteq {\rm support}(w^*)$.
No false inclusion means that whenever $w^*_j = 0$ for some component $j$, then $\widehat{w}^N_j = 0$ for the corresponding component. Conversely, no false exclusion means that whenever $w^*_j \neq 0$, then $\widehat{w}^N_j \neq 0$. The following theorem presents variable-selection consistency of the d-stationary solution.

Theorem 18. Let assumptions (A$_0$), (A$_N$), (A$_\gamma$), (A$_{\rm s}$), and (A$_{\rm RSC}$) hold. Let $\widehat{w}^N$ be a d-stationary solution of problem (4.3). Given any set of indices $\bar{S}$, if it holds that
\[
\frac{1}{\lambda_N\,N}\,\left| \sum_{s=1}^N \frac{\partial\,\ell\big(m(\widehat{w}^N; x^s); y^s\big)}{\partial w_j} \right| \;+\; \min_{k \in \widehat{J}^N_j} |h'_{jk}(\widehat{w}^N_j)| \;<\; c_j \quad \text{for all } j \notin \bar{S}, \tag{4.25}
\]
then the following statements hold:
(a) The support of $\widehat{w}^N$ is contained in $\bar{S}$;
(b) Suppose $\bar{S} = S$, where $S$ is the support of $w^*$. If condition (4.18) holds, the error $w^* - \widehat{w}^N$ satisfies the $\ell_\infty$ bound
\[
\|w^*_S - \widehat{w}^N_S\|_\infty \;\le\; \underbrace{\frac{3}{\ell_{\min}}\,\lambda_N \max_{1 \le j \le d} c_j\,\sqrt{|S|}}_{B(\lambda_N,\,\ell_{\min})};
\]
(c) Furthermore, if $\min_{j \in S} |w^*_j| > B(\lambda_N, \ell_{\min})$, then $S$ is contained in the support of $\widehat{w}^N$.

Proof. To show part (a), let $\bar{S}$ be a given set of indices. By definition of the d-stationary solution and from the derivation of the region (cf. (4.15)), $\widehat{w}^N$ satisfies
\[
0 \;\le\; \nabla \widehat{L}_N(\widehat{w}^N)^{\rm T}(\tilde{w} - \widehat{w}^N) \;+\; \lambda_N \sum_{j=1}^d \bigg[ c_j\big( |\tilde{w}_j| - |\widehat{w}^N_j| \big) - \max_{k \in \widehat{J}^N_j} h'_{jk}(\widehat{w}^N_j)\,(\tilde{w}_j - \widehat{w}^N_j) \bigg] \tag{4.26}
\]
for any vector $\tilde{w}$.
Let us choose $\tilde{w}$ such that
\[
\tilde{w}_j = \begin{cases} \widehat{w}^N_j & \text{for } j \in \bar{S} \\ 0 & \text{for } j \notin \bar{S}. \end{cases}
\]
Substituting the components of $\tilde{w}$ into the inequality (4.26), we obtain
\[
\begin{array}{rl}
0 & \le\; \displaystyle \frac{1}{N}\sum_{s=1}^N \Bigg[ \sum_{j=1}^d \frac{\partial\,\ell\big(m(\widehat{w}^N; x^s); y^s\big)}{\partial w_j}\,(\tilde{w}_j - \widehat{w}^N_j) \Bigg] + \lambda_N \sum_{j=1}^d \bigg[ c_j\big( |\tilde{w}_j| - |\widehat{w}^N_j| \big) - \max_{k \in \widehat{J}^N_j} h'_{jk}(\widehat{w}^N_j)\,(\tilde{w}_j - \widehat{w}^N_j) \bigg] \\[2ex]
& =\; \displaystyle \sum_{j \notin \bar{S}} \Bigg[ \frac{1}{N}\sum_{s=1}^N \frac{\partial\,\ell\big(m(\widehat{w}^N; x^s); y^s\big)}{\partial w_j}\,(-\widehat{w}^N_j) + \lambda_N \bigg( -c_j\,|\widehat{w}^N_j| - \max_{k \in \widehat{J}^N_j} h'_{jk}(\widehat{w}^N_j)\,(-\widehat{w}^N_j) \bigg) \Bigg] \\[2ex]
& \le\; \displaystyle \sum_{j \notin \bar{S}} \Bigg[ \frac{1}{N}\left|\sum_{s=1}^N \frac{\partial\,\ell\big(m(\widehat{w}^N; x^s); y^s\big)}{\partial w_j}\right| |\widehat{w}^N_j| + \lambda_N \bigg( -c_j\,|\widehat{w}^N_j| + \min_{k \in \widehat{J}^N_j} |h'_{jk}(\widehat{w}^N_j)|\,|\widehat{w}^N_j| \bigg) \Bigg].
\end{array}
\]
The above derivation shows that, for any $j$th component not contained in $\bar{S}$, the corresponding summand is bounded above by
\[
\Bigg[ \frac{1}{N}\left|\sum_{s=1}^N \frac{\partial\,\ell\big(m(\widehat{w}^N; x^s); y^s\big)}{\partial w_j}\right| - \lambda_N \bigg( c_j - \min_{k \in \widehat{J}^N_j} |h'_{jk}(\widehat{w}^N_j)| \bigg) \Bigg]\,|\widehat{w}^N_j|,
\]
whose bracketed coefficient is strictly negative by the condition (4.25) of the theorem. Therefore we must have $\widehat{w}^N_j = 0$ for all $j \notin \bar{S}$. This concludes the proof of no false inclusion. The $\ell_\infty$ bound for the error $w^* - \widehat{w}^N$ is a consequence of part (a) and the consistency bound (4.19) provided in Section 4.2.2. It follows from (4.19) that
\[
\left( \frac{3}{\ell_{\min}}\,\lambda_N \max_{1 \le j \le d} c_j\,\sqrt{|S|} \right)^2 \;\ge\; \|w^* - \widehat{w}^N\|_2^2 \;=\; \sum_{j \in S} (w^*_j - \widehat{w}^N_j)^2 + \sum_{j \notin S} (\widehat{w}^N_j)^2.
\]
Since $\bar{S} = S$, any $\widehat{w}^N_j$ whose component $j$ is not contained in the support of $w^*$ must equal zero by part (a). By applying the norm inequalities, we obtain the bound
\[
\frac{3}{\ell_{\min}}\,\lambda_N \max_{1 \le j \le d} c_j\,\sqrt{|S|} \;\ge\; \|w^*_S - \widehat{w}^N_S\|_\infty.
\]
Part (c) follows from the previous parts, and no further proof is required. $\square$

The theorem establishes no false inclusion for any given set of indices. This differs from the existing result in [47, Theorem 11.3], where both kinds of inclusion relationships are shown with respect to the support of the ground truth. To be clear, the support of $\widehat{w}^N$ is known by the time the solution is computed, so support recovery with respect to a given set could already be verified at that point; the result rather serves as a quantitative statement that formally expresses the structural properties of the problem needed to achieve such inclusion. Consider the case $\bar{S} = S$ to understand the meaning of condition (4.25). The first term indicates that $\widehat{w}^N$ should be sufficiently close to $w^*$, since we want $\|\nabla \widehat{L}_N(\widehat{w}^N)\|_\infty$ to be small. The second term of the inequality confirms the intuition discussed in Section 4.2.2 from examining the piecewise sparsity surrogates: the value of $|\widehat{w}^N_j|$ has to be near the origin whenever $w^*_j = 0$, since the condition is violated otherwise. The remaining parts of the theorem follow from the consistency bound (4.19).

Bibliography

[1] M. Ahn, J.S. Pang and J. Xin. Difference-of-convex learning: directional stationarity, optimality and sparsity. SIAM Journal on Optimization 27(3): 1637–1665 (2017).
[2] A. d'Aspremont, O. Banerjee, and L. El Ghaoui. First-order methods for sparse covariance selection. SIAM Journal on Matrix Analysis and its Applications 30(1): 55–66 (2008).
[3] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Structured sparsity through convex optimization. Statistical Science 27(4): 450–468 (2012).
[4] P. Belotti, C. Kirches, S. Leyffer, J. Linderoth, J. Luedtke, and A. Mahajan. Mixed-integer nonlinear optimization. Acta Numerica 22: 1–131 (2013).
[5] D.P. Bertsekas. Nonlinear Programming. Second Edition. Athena Scientific (Belmont 1999).
[6] D. Bertsimas, M. Copenhaver and R. Mazumder. The Trimmed Lasso: sparsity and robustness. Manuscript. Submitted (August 2017).
[7] D. Bertsimas, A. King, and R. Mazumder. Best subset selection via a modern optimization lens.
The Annals of Statistics 44(2): 813–852 (2016).
[8] D. Bertsimas and R. Shioda. Algorithm for cardinality-constrained quadratic optimization. Computational Optimization and Applications 43(1): 1–22 (2009).
[9] B.J. Bickel, Y. Ritov, and A.B. Tsybakov. Simultaneous analysis of LASSO and Dantzig selector. The Annals of Statistics 37(4): 1705–1732 (2009).
[10] J. Bien, J. Taylor, and R. Tibshirani. A lasso for hierarchical interactions. Annals of Statistics 43(3): 1111–1141 (2013).
[11] D. Bienstock. Computational study of a family of mixed-integer quadratic programming problems. Mathematical Programming, Series A 74(2): 121–140 (1996).
[12] P. Breheny and J. Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics 5(1): 232–253 (2011).
[13] J. Brodie, I. Daubechies, C. De Mol, D. Giannone, and I. Loris. Sparse and stable Markowitz portfolios. Proceedings of the National Academy of Sciences 106(30): 12267–12272 (2009).
[14] O.P. Burdakov, C. Kanzow, and A. Schwartz. Mathematical programs with cardinality constraints: Reformulation by complementarity-type conditions and a regularization method. SIAM Journal on Optimization 26(1): 397–425 (2016).
[15] E. Candès and C. Fernandez-Granda. Towards a mathematical theory of super-resolution. Communications on Pure and Applied Mathematics 67(6): 906–956 (2014).
[16] E. Candès and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory 51(12): 4203–4216 (2005).
[17] E. Candès and T. Tao. Near optimal signal recovery from random projections: universal encoding strategies. IEEE Transactions on Information Theory 52(12): 5406–5425 (2006).
[18] E. Candès and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics 35(6): 2313–2351 (2007).
[19] E. Candès, J. Romberg and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics 59(8): 1207–1223 (2006).
[20] E. Candès, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications 14(5): 877–905 (2008).
[21] S.S. Chen, D.L. Donoho, and M.A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing 20: 33–61 (1998).
[22] X. Chen, D. Ge, Z. Wang, and Y. Ye. Complexity of unconstrained ℓ2-ℓp minimization. Mathematical Programming, Series A 143: 371–383 (2014).
[23] C. Chen, X. Li, C. Tolman, S. Wang, and Y. Ye. Sparse portfolio selection via quasi-norm regularization. https://arxiv.org/abs/1312.6350 (2013).
[24] M. Conforti and G. Cornuéjols. A class of logic problems solvable by linear programming. Journal of the ACM 42(5): 1107–1112 (1995).
[25] M. Conforti and G. Cornuéjols. Balanced matrices. In: K. Aardal, G.L. Nemhauser, and R. Weismantel (Eds.) Discrete Optimization. Handbooks in Operations Research and Management Science. Volume 12 (Elsevier, Amsterdam 2005) pp. 277–320.
[26] H. Dong. On integer and MPCC representability of affine sparsity. Optimization Online (April 2018).
[27] H. Dong, M. Ahn, and J.S. Pang. Structural properties of affine sparsity constraints. Mathematical Programming Series B http://doi.org/10.1007/s10107-018-1283-3 (2018).
[28] D. Donoho. Super-resolution via sparsity constraints. SIAM Journal on Mathematical Analysis 23(5): 1309–1331 (1992).
[29] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory 52(4): 1289–1306 (2006).
[30] D. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. Proceedings of the National Academy of Sciences USA 100: 2197–2202 (2003).
[31] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics 32(2): 407–499 (2004).
[32] F. Facchinei and J.S. Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems, Volumes I and II. Springer-Verlag (New York 2003).
[33] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456): 1348–1360 (2001).
[34] J. Fan and J. Lv. A selective overview of variable selection in high dimensional feature space. Statistica Sinica 20: 101–148 (2010).
[35] J. Fan and J. Lv. Nonconcave penalized likelihood with NP-dimensionality. IEEE Transactions on Information Theory 57(8): 5467–5484 (2011).
[36] J. Fan, L. Xue, and H. Zou. Strong oracle optimality of folded concave penalized estimation. The Annals of Statistics 42(3): 819–849 (2014).
[37] M. Feng, J.E. Mitchell, J.S. Pang, A. Waechter, and X. Shen. Complementarity formulations of ℓ0-norm optimization problems. Pacific Journal of Optimization 14(2): 273–305 (2018).
[38] L.E. Frank and J.H. Friedman. A statistical view of some chemometrics regression tools. Technometrics 35(2): 109–135 (1993).
[39] J.H. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical Lasso. Biostatistics 9(3): 432–441 (2013).
[40] Y. Gao and D. Sun. A majorized penalty approach for calibrating rank constrained correlation matrix problems. Manuscript, Department of Mathematics, National University of Singapore (Revised May 2010).
[41] G. Gasso, A. Rakotomamonjy, and S. Canu. Recovering sparse signals with a certain family of nonconvex penalties and DC programming. IEEE Transactions on Signal Processing 57(12): 4686–4698 (2009).
[42] D. Ge, X. Jiang, and Y. Ye. A note on the complexity of Lp minimization. Mathematical Programming 129: 285–299 (2011).
[43] P. Gong, C. Zhang, Z. Lu, J. Huang, and J. Ye. A general iterative shrinkage and thresholding algorithm for nonconvex regularized optimization problems. In Proceedings of the 30th International Conference on Machine Learning pp. 37–45 (2013).
[44] J.Y. Gotoh, A. Takeda, and K. Tono. DC formulations and algorithms for sparse optimization problems. Manuscript. Department of Mathematical Informatics, The University of Tokyo (August 2015).
[45] E. Greenshtein. Best subset selection, persistence in high-dimensional statistical learning and optimization under ℓ1 constraint. The Annals of Statistics 34(5): 2367–2386 (2006).
[46] M. Hamada and C.F.J. Wu. Analysis of designed experiments with complex aliasing. Journal of Quality Technology 24: 130–137 (1992).
[47] T. Hastie, R. Tibshirani and M.J. Wainwright. Statistical Learning with Sparsity: the Lasso and Generalizations. Chapman and Hall/CRC Press, Series in Statistics and Applied Probability (2015).
[48] M. Ho, Z. Sun, and J. Xin. Weighted elastic net penalized mean-variance portfolio design and computation. SIAM Journal on Financial Mathematics 6: 1220–1244 (2015).
[49] J. Huang, P. Breheny, and S. Ma. A selective review of group selection in high-dimensional models. Statistical Science 27(4): 481–499 (2012).
[50] L. Jacob, G. Obozinski, and J.P. Vert. Group lasso with overlap and graph lasso. Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, Canada (ICML '09, ACM New York) pp. 433–440 (2009).
[51] C. Kanzow and A. Schwartz. A new regularization method for mathematical programs with complementarity constraints with strong convergence properties. SIAM Journal on Optimization 23(2): 770–798 (2013).
[52] K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics 28(5): 1356–1378 (2000).
[53] H.A. Le Thi and D.T. Pham. The DC programming and DCA revised with DC models of real world nonconvex optimization problems. Annals of Operations Research 133: 25–46 (2005).
[54] H.A. Le Thi, D.T. Pham, and X.T. Vo. DC approximation approaches for sparse optimization. European Journal of Operations Research 244: 26–46 (2015).
[55] Y. Liu and Y. Wu. Variable selection via a combination of the L0 and L1 penalties. Journal of Computational and Graphical Statistics 16(4): 782–798 (2007).
[56] P.L. Loh and M.J. Wainwright. Support recovery without incoherence: A case for nonconvex regularization. Annals of Statistics 45(6): 2455–2482 (2017).
[57] Y. Lou, P. Yin, and J. Xin. Point source super-resolution via nonconvex L1 based methods. Journal of Scientific Computing 68(3): 1082–1100 (2016).
[58] J. Lv and Y. Fan. A unified approach to model selection and sparse recovery using regularized least squares. The Annals of Statistics 37(6A): 3498–3528 (2009).
[59] O.L. Mangasarian. Pseudo-convex functions. SIAM Journal on Control 3: 281–290 (1965).
[60] R. Mazumder, J.H. Friedman, and T. Hastie. SparseNet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association 106(495): 1125–1138 (2011).
[61] P. McCullagh and J.A. Nelder. Generalized Linear Models. Chapman & Hall (London 1983).
[62] A.-V. de Miguel, M. Friedlander, F.J. Nogales, and S. Scholtes. A two-sided relaxation scheme for mathematical programs with equilibrium constraints. SIAM Journal on Optimization 16(2): 587–609 (2006).
[63] B. Mordukhovich. Variational Analysis and Generalized Differentiation. Springer-Verlag (Heidelberg 2005). http://www.springer.com/series/138
[64] B.K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal of Computation 24(2): 227–234 (1995).
[65] G. Nemhauser and L. Wolsey. Integer and Combinatorial Optimization. John Wiley & Sons (1999).
[66] M. Nikolova. Local strong homogeneity of a regularized estimator. SIAM Journal on Applied Mathematics 61(2): 633–658 (2000).
[67] J.M. Ortega and W.C. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. Classics in Applied Mathematics, SIAM Publications (Philadelphia 2000).
[68] J.S. Pang, M. Razaviyayn and A. Alvarado. Computing B-stationary points of nonsmooth DC programs. Mathematics of Operations Research 42(1): 95–118 (2017).
[69] J.S. Pang and M. Tao. Decomposition methods for computing directional stationary solutions of a class of non-smooth non-convex optimization problems. SIAM Journal on Optimization 28(2): 1640–1669 (2018).
[70] D.T. Pham and H.A. Le Thi. Recent advances in dc programming and dca. Transactions on Computational Collective Intelligence 8342: 1–37 (2014).
[71] D.T. Pham and H.A. Le Thi. Convex analysis approach to D.C. programming: Theory, algorithms and applications. ACTA Mathematica Vietnamica 22(1): 289–355 (1997).
[72] D.T. Pham, H.A. Le Thi, and D.M. Le. Exact penalty in d.c. programming. Vietnam Journal of Mathematics 27(2): 169–178 (1999).
[73] D. Ralph and S.J. Wright. Some properties of regularization and penalization schemes for MPECs. Optimization Methods and Software 19(5): 527–556 (2004).
[74] R.T. Rockafellar and R.J.-B. Wets. Variational Analysis. (Springer 1998).
[75] S. Scholtes. Convergence properties of a regularisation scheme for mathematical programs with complementarity constraints. SIAM Journal on Optimization 11(4): 918–936 (2001).
[76] A. Shapiro, D. Dentcheva, and A. Ruszczynski. Lectures on Stochastic Programming: Modeling and Theory. SIAM Publications (Philadelphia 2009).
[77] X. Shen, W. Pan, and Y. Zhu. Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association 107(497): 223–232 (2012).
[78] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society 58(1): 267–288 (1996).
[79] R. Tibshirani, M.A. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(1): 91–108 (2005).
[80] J. Wang and J. Ye. Multi-layer feature reduction for tree structured group lasso via hierarchical projection. Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, Canada (December 2015) pp. 1279–1287.
[81] Q. Yao and J.T. Kwok. Efficient learning with a family of nonconvex regularizers by redistributing non-convexity. In M.F. Balcan and K.Q. Weinberger, editors. Proceedings of the 33rd International Conference on Machine Learning (2016).
[82] P. Yin, Y. Lou, Q. He and J. Xin. Minimization of L1-L2 for compressed sensing. SIAM Journal on Scientific Computing 37(1): 536–563 (2015).
[83] W. Yin, S. Osher, D. Goldfarb and J. Darbon. Bregman iterative algorithms for L1-minimization with applications to compressed sensing. SIAM Journal on Imaging Sciences 1(1): 143–168 (2008).
[84] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B: Statistical Methods 68(1): 49–67 (2006).
[85] C. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 38(2): 894–942 (2010).
[86] S. Zhang and J. Xin. Minimization of transformed L1 penalty: theory, difference of convex function algorithm, and robust application in compressed sensing. ftp://ftp.math.ucla.edu/pub/camreport/cam14-86.pdf. Submitted (2016).
[87] T. Zhang. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research 11: 1081–1107 (2010).
[88] P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics 37(6A): 3468–3497 (2009).
[89] P. Zhao and B. Yu. On model selection consistency of LASSO. Journal of Machine Learning Research 7: 2541–2563 (2006).
[90] Z. Zheng, Y. Fan, and J. Lv. High dimensional thresholded regression and shrinkage effect. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(31): 627–649 (2014).
[91] X. Zheng, X. Sun, D. Li, and J. Sun. Successive convex approximations to cardinality-constrained convex programs: a piecewise-linear DC approach. Computational Optimization and Applications 59(1-2): 379–397 (2014).
[92] H. Zou. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association 101(476): 1418–1429 (2006).
[93] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 67(2): 301–320 (2005).
[94] H. Zou and R. Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics 36(4): 1509–1533 (2008).
Abstract
This dissertation contains three individual collaborative studies for sparse learning problems and presents each work as a separate chapter. In Chapter 2, followed by Introduction, we study a bi-criteria Lagrangian formulation involving non-convex and non-differentiable functions which surrogate the discrete sparsity function. We introduce a unified difference-of-convex formulation applicable to most of the existing such surrogates and study d(irectional)-stationary solutions of the non-convex program. We provide conditions under which the d-stationary solutions of the problem are the global minima possibly of a restricted kind due to non-differentiability. The sparsity properties of the d-stationary solutions are investigated based on the given model constants. Generalized and specialized results for such properties of the stationary solutions are provided according to the specific choices of the sparsity functions. In Chapter 3, we introduce a new constraint system for sparse variable selection problems. Such a system arises when there are logical conditions among certain model variables that need to be incorporated into their selection process. Formally, extending a cardinality constraint, an affine sparsity constraint (ASC) is defined by a linear inequality with two sets of variables: one set of continuous variables and the other set represented by their nonzero patterns. We study an ASC system consisting of finitely many affine sparsity constraints. We investigate fundamental structural properties of the solution set of such a non-standard system of inequalities, including its closedness and the description of its closure, continuous approximations and their set convergence, and characterizations of its tangent cones for use in optimization. Based on the obtained structural properties of an ASC system, we investigate the convergence of B(ouligand)-stationary solutions when the ASC is approximated by the sparsity functions. Our study lays a solid mathematical foundation for solving optimization problems involving these affine sparsity constraints through their continuous approximations. In the fourth and final chapter, we investigate the relationship between the sparse sample average approximation (SAA) problem and the expectation minimization problem where the expected model loss is minimized over an unknown data distribution. The connection is analyzed through two measures of interests: a d-stationary solution of the sparse SAA problem and a point which satisfies a verifiable condition and is possibly the optimal solution of the underlying problem. We derive a bound for the distance between the solutions and a prediction error bound measuring the difference between the model outcomes generated by the two solutions. We further investigate conditions under which the support of a given d-stationary solution achieves no false inclusion and no false exclusion. A noteworthy point is that the former kind of inclusion is shown for any arbitrarily given subset of indices. The presented results generalize the existing theory developed based on the optimal solution of a convex program by relaxing requirement of the global optimum and employing general model and loss functions with non-convex sparsity surrogates.