Landscape Analysis and Algorithms for Large Scale Non-Convex Optimization

A Dissertation Submitted by Maher Nouiehed to the Department of Industrial & Systems Engineering and the Faculty of the USC Graduate School, in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy.

University of Southern California, August 2019

Contents

Abstract
Acknowledgment
1 Introduction
1.1 Landscape Analysis for Deep Non-Convex Models
1.2 Algorithms for Constrained Non-Convex Optimization
1.3 Robust Algorithms
1.4 Application in Fair Statistical Inference
2 Analyzing the Landscape of Optimization Problems Arising in Deep Learning
2.1 Notation
2.2 Motivation and Contributions
2.3 Mathematical Framework
2.4 Local Openness of the Symmetric and Non-Symmetric Matrix Multiplication Mappings
2.5 Non-linear Deep Neural Network with a Pyramidal Structure
2.6 Two-Layer Linear Neural Network
2.7 Multi-Layer Linear Neural Network
2.8 Future Work
3 Algorithms for Constrained Non-Convex Optimization
3.1 First and Second Order Stationarity
3.2 Projected Gradient Descent with Random Initialization May Converge to Strict Saddle Points with Positive Probability
3.3 Finding or Checking (Approximate) Second Order Stationarity is NP-Hard Even in the Presence of Linear Constraints
3.4 Easy Instances of Finding Second Order Stationarity in Constrained Optimization: A Second Order Frank-Wolfe Algorithm
3.4.1 Algorithm Description
3.4.2 Convergence Results
3.5 A Trust Region Algorithm for Solving Linearly-Constrained Non-convex Optimization Problems
3.5.1 Background on Traditional Trust Region Algorithm and TRACE
3.5.2 Difference Between LC-TRACE and TRACE
3.5.3 Description of LC-TRACE
3.5.4 Convergence of First-Order-LC-TRACE to First-order Stationarity
3.6 Second-Order-LC-TRACE Algorithm
3.7 Future Work
4 Algorithms for Non-convex Min-Max Problems with Applications in Robust and Fair Learning
4.1 Two-player Min-Max Games and First-Order Nash Equilibrium
4.2 Non-Convex PL-Game
4.2.1 Multi-step gradient descent ascent for PL-games
4.2.2 Convergence analysis of Multi-Step Gradient Descent Ascent Algorithm for PL games
4.3 Non-Convex Concave Games
4.3.1 Algorithm Description
4.4 Min-Max Games for Fair Statistical Learning
4.4.1 Rényi Correlation As a Measure of Dependence
4.4.2 A General Min-Max Framework for Rényi Fair Inference
4.4.3 Rényi Fair Classification
4.4.4 Convergence Analysis
4.4.5 Fair K-means Clustering
4.4.6 Numerical Experiments
A Proofs for results in Chapter 2
A.1 Proof of Theorem 2.6.1
A.2 Proof of Corollary 2.6.2
A.3 Proof of Lemma 2.7.1
A.4 Proof of Theorem 2.4.2
A.5 Proof of Theorem 2.4.3
A.6 Proof of Theorem 2.7.2
A.7 Proof of Corollary 2.7.4
B Proofs for results in Chapter 3
B.1 Proof of Lemma 3.2.1
B.2 Proof of Theorem 3.2.2
B.3 Proof of Theorem 3.3.1
B.4 Proof of Theorem 3.4.2
B.5 Proof of Theorem 3.4.3
C Proofs for results in Section 3.5
C.1 Proof of Theorem 3.5.2
C.2 Proof of Theorem 3.5.4
C.3 Proof of Theorem 3.6.1
D Proofs for results in Chapter 4
D.1 Proofs for results in Section 4.2
D.1.1 Proof of Lemma 4.2.3
D.1.2 Proof of Theorem 4.2.5
D.2 Proofs for results in Section 4.3
D.2.1 Proof of Lemma 4.3.2
D.2.2 Proof of Theorem 4.3.3

List of Figures

2.1 Local openness relates the local minima of the original and auxiliary problems
2.2 Local minima collapse under local openness (1-dim example)
2.3 Local minima collapse under local openness (2-dim example)
3.1 Landscape and negative gradient mapping of f
4.1 Trade-off between accuracy and fairness

Abstract

Solving non-convex optimization problems is becoming increasingly important in various science and engineering disciplines. In this dissertation, we develop a general framework for analyzing the landscape of these non-convex objective functions, and propose algorithms for computing solutions with strong theoretical guarantees.

We first develop a unifying theoretical framework for identifying "nice" non-convex optimization problems. In particular, motivated by the simple structure of the landscape of simple non-convex problems arising in statistics and machine learning, we develop a methodology for checking whether every local optimum of a given function is globally optimal. Our theoretical framework harnesses the concept of local openness from differential geometry to provide sufficient conditions under which local optima of the objective function are globally optimal. We use our general condition to develop a complete characterization of the local/global optima equivalence of multi-layer linear neural networks. Moreover, we provide sufficient conditions under which no spurious local optima exist for over-parameterized hierarchical non-linear deep neural networks.

Although the equivalence between the sets of local and global optima establishes significant understanding of the underlying loss surface, finding a local optimum point may be NP-Hard in general. Given this hardness result, recent focus has shifted to computing first or second-order stationary points of the objective function. While almost all existing results study the problem in the absence of constraints, we consider the problem of finding an approximate second-order stationary point for constrained non-convex optimization problems. We show that, unlike the unconstrained scenario, the vanilla projected gradient descent algorithm may converge to a strict saddle point even in the presence of a single linear constraint. We then provide a hardness result by showing that checking approximate second order stationarity is NP-hard even for linearly constrained problems. Despite our hardness result, we identify instances of the problem for which checking second order stationarity can be done efficiently. For such instances, we propose a dynamic second order Frank-Wolfe algorithm which converges to $(\epsilon_g, \epsilon_H)$-second order stationary points in $O(\max\{\epsilon_g^{-2}, \epsilon_H^{-3}\})$ iterations. The proposed algorithm can be used in general constrained non-convex optimization as long as the constrained quadratic sub-problems can be solved efficiently. Since the iteration complexity of our Frank-Wolfe method does not match the lower bound, we restrict our focus to linearly constrained problems with a constant number of constraints, for which we propose an algorithm that matches the lower bound.
More specifically, we propose a trust region algorithm that finds an $(\epsilon_g, \epsilon_H)$-second order stationary point for linearly constrained problems in $\tilde{O}(\max\{\epsilon_g^{-3/2}, \epsilon_H^{-3}\})$ iterations. This iteration complexity matches the lower bound and hence is order optimal.

Finally, due to practical concerns about the reliability of non-convex models arising in critical applications, we study the problem of finding robust solutions for these non-convex problems. More specifically, we study the min-max optimization problem in the non-convex-concave regime. Using a simple smoothing technique, we propose an alternative multi-step gradient descent-ascent algorithm that finds an $\varepsilon$-first order stationary solution in $\tilde{O}(\varepsilon^{-3.5})$ gradient evaluations. Moreover, we extend our algorithm to the case where the objective of the inner optimization problem satisfies the Polyak-Łojasiewicz (PL) condition. Under this condition, we show that the worst-case complexity of our algorithm is $\tilde{O}(\varepsilon^{-2})$. Lastly, we evaluate the performance of our algorithm on a fair statistical learning problem.

Acknowledgment

First and foremost, I would like to express my deepest appreciation to my advisor, Professor Meisam Razaviyayn. Throughout my years as a Ph.D. student, he has provided me with unparalleled support, encouragement, and inspiration. His constant, invaluable advice and feedback continuously push me towards having a creative and independent mindset. Meisam's passion for tackling difficult yet practical problems instilled within me a drive to conduct purposeful research that establishes fundamental solutions and methodologies. His advice on scientific writing and presentation has taught me how to present rigorous theoretical concepts in a succinct yet clear manner. As an individual, Meisam has a supportive and motivational persona, and I am honored to call him my academic mentor.

I would also like to extend my deepest gratitude to my second advisor, Professor Jong-Shi Pang. His wise and inspirational advice has shaped my research philosophy. Jong-Shi's dedication to rigorous and fundamental research has ingrained within me a desire to "honestly" handle difficult problems. After discussing various research topics with Jong-Shi, I have learned that good research starts by asking the right questions.

I would like to extend my sincere thanks to Professor Sheldon Ross, with whom I have had the honor of working. He has taught me how to simplify difficult problems by approaching them from a different perspective. His vast depth of knowledge and his unorthodox thinking make him an ideal academic figure whom I look up to. I am also grateful to Professor Maged Dessouky for providing me with the opportunity to work on my first research project at the University of Southern California. It was a very enriching and rewarding experience. I also want to thank Professors Suvrajeet Sen, Mahdi Soltanolkotabi, Mihailo Jovanovic, and Rahul Jain for serving on my qualification and dissertation committees. Many thanks to the other USC professors who have directly impacted my academic motives through inspiring lectures and presentations.

Throughout my years at USC, I was fortunate to work alongside distinguished and bright peers. A special thanks goes to my fellow lab mates: Babak Barazandeh, Sina Baharlouei, Tianjian Huang, and Ali Ghafelebashi. Our weekly group meetings and discussions established an enjoyable yet fruitful academic research environment.
Last but certainly not least, I am deeply indebted to my family for their unconditional love and support. Without your continuous encouragement, I would not be the person I am today.

Chapter 1
Introduction

Machine learning has become an indispensable part of our modern society. Despite its recent popularity, the foundations of various methods in machine learning date back several decades. The recent rapid growth of this field can be explained by the exponential increase in available digital data, the recent advances in hardware computational platforms, and the advancements in the fields of statistics and optimization. In particular, a vital pillar of machine learning's success is mathematical optimization, in both its computational and theoretical aspects.

Modern experiments and observations generate petabytes of data across different science disciplines, ranging from genomics to astronomy. Using these massive datasets for statistical inference and model fitting requires solving large-scale optimization problems. In particular, in statistics and machine learning, the objective is to recover a statistical model that can effectively predict a response variable as a function of the input data. Many of these models, such as deep neural networks, low-rank matrix recovery, and matrix completion, lead to non-convex optimization problems. Unfortunately, due to non-convexity, traditional optimization tools and techniques fail to solve these problems with reasonable computational resources. While many algorithms and analyses have been developed for studying each of these problems, almost all existing results are problem-specific. In particular, algorithms and analyses for one problem cannot be easily extended to other non-convex problems.

In this dissertation, we develop a general framework for analyzing the landscape of these non-convex objective functions, and propose efficient algorithms for computing solutions with strong theoretical guarantees under certain sufficient conditions. We first develop a landscape analysis framework for a class of non-convex optimization problems. While we narrow our focus to a certain class of non-convex problems, the studied class is general enough to cover many practical applications such as matrix completion, low-rank matrix recovery, and deep learning. Our analysis provides checkable sufficient conditions under which every local optimal point of the non-convex optimization is globally optimal. We then propose optimal-rate algorithms that can escape saddle points and find local optima of such non-convex optimization problems. Lastly, we focus on finding robust solutions for these problems. Let us first briefly highlight each of these parts.

1.1 Landscape Analysis for Deep Non-Convex Models

Deep learning is an inference tool that has recently led to significant practical success in various fields ranging from computer vision to natural language processing. Despite its wide empirical use, the theoretical understanding of the landscapes of the optimization problems corresponding to the underlying neural network architectures is still very limited. While some recent works have tried to explain these successes through the lens of expressivity, by showing the power of these models in learning large classes of mappings, other works find the root of the success in the generalizability of these models from a learning perspective.
From an optimization perspective, training deep models requires solving non-convex optimization problems, where the non-convexity arises from the "deep" structure of the model. In fact, it has been shown by [25] that training neural networks to global optimality is NP-complete in the worst case, even for simple three-node networks. Despite this hardness result, the practical success of deep learning may suggest that the local optima of these models are close to the global optima. In particular, [45] uses spin glass theory and empirical experiments to show that the local optima of deep neural network optimization problems are close to the global optima.

In an effort to better understand the landscape of training deep neural networks, [97, 106, 157, 80] studied deep linear neural networks and provided sufficient conditions under which critical points (or local optimal points) of the training optimization problems are globally optimal. By providing simple examples of one-hidden-layer non-linear networks, [158] shows that this local/global equivalence cannot generally be extended to deep non-linear networks. Despite the existence of spurious local minima in non-linear networks, multiple works have shown that with over-parameterization and proper random initialization, local optima of the resulting optimization problems can be easily found using local search procedures. However, most of these results either assume quadratic activation functions [153, 144, 58] or apply to unrealistically wide networks for which the number of nodes is polynomial in the number of sampled data points [5, 103, 4, 60, 165, 9]. Focusing on shallow neural networks with smooth activation functions, [129] shows that (stochastic) gradient descent with random initialization converges to a nearly global solution with modest over-parameterization. However, these results are algorithm-dependent and require assumptions on the input data.

Despite the growing interest in studying the landscape of deep optimization problems, many of the results and mathematical analyses are problem-specific and cannot easily be generalized to other problems and network structures. In [125], we developed a rigorous theoretical framework for characterizing the landscape of general optimization problems arising in statistical machine learning settings, and studied the implications of this framework for the training of deep neural networks. The theoretical framework harnesses the concept of local openness from differential geometry to provide sufficient conditions under which local optima of the objective function are global.

Due to the deep structure of these models, the matrix multiplication mapping (multiplication of two matrices) naturally appears in the formulation of these problems. Moreover, this mapping is widely used as a non-convex factorization approach for rank-constrained problems. In [125], we completely characterize the local openness of this mapping in its range, i.e., we provide necessary and sufficient conditions under which the matrix multiplication mapping is locally open. Based on this theoretical result, we develop a complete characterization of the local/global optima equivalence of multi-layer linear neural networks and provide sufficient conditions under which no spurious local optima exist for hierarchical non-linear deep neural networks. This work was done in collaboration with Professor Meisam Razaviyayn. The detailed results are presented in Chapter 2.
Knowing that all local optima of an optimization problem are global does not necessarily imply that finding the global solution is tractable. In fact, it is well known that computing a local optimum for a general non-convex problem is NP-Hard [117]. In the next section, we consider the problem of developing efficient algorithms for solving unconstrained/constrained non-convex optimization problems.

1.2 Algorithms for Constrained Non-Convex Optimization

Due to its wide application in machine learning, solving non-convex optimization problems has attracted significant attention in recent years [6, 38, 40, 41, 48, 49, 110]. While this topic has been studied for decades, recent applications and modern analytical and computational tools have revived this area of research. In particular, a wide variety of numerical methods for solving non-convex problems have been proposed in recent years [102, 85, 121, 35, 36, 37, 127].

For general non-convex optimization problems, it is well known that computing a local optimum is NP-Hard [117]. Given this hardness result, recent focus has shifted toward computing (approximate) first and second-order stationary points of the objective function. The latter provides stronger guarantees, as it constitutes a smaller subset of the critical points of the objective function that includes local and global optima. Therefore, when applied to problems with "nice" geometrical properties, the set of second order stationary points could even coincide with the set of global optima; see [13, 14, 27, 125, 147, 68, 148, 149] for examples of such objective functions.

Convergence to second-order stationarity in smooth unconstrained settings has been thoroughly investigated in the optimization literature [69, 47, 36, 37, 121, 48, 49, 68]. As a second-order algorithm, [121] proposed a cubic regularization method that converges to approximate second-order stationarity in a finite number of steps. More recently, [36, 37] proposed the Adaptive Regularization Cubic algorithm (ARC), which computes an approximate solution of a local cubic model at each iteration. They established convergence to first and second order stationary points with optimal complexity rates. Motivated by these rates, [49] proposed an adaptive trust region method, entitled TRACE, that can find an $\epsilon$-first-order stationary point with worst-case iteration complexity $O(\epsilon^{-3/2})$, and can find an $(\epsilon_g, \epsilon_H)$-second-order stationary point with worst-case complexity $O(\max\{\epsilon_g^{-3/2}, \epsilon_H^{-3}\})$. This method alters the acceptance criteria adopted by traditional trust region methods and implements a new mechanism for updating the trust region radius. A more recent second-order algorithm that uses a dynamic choice of direction and step-size was proposed in [47]. This method computes first and second order descent directions and chooses the direction that predicts a more significant reduction in the objective value. All of the above methods satisfy the set of generic conditions of a general framework proposed in [48].

Recent results show that for smooth unconstrained optimization problems, even first order methods can converge to second-order stationarity, almost surely. For instance, [68] shows that noisy stochastic gradient descent escapes strict saddle points with probability one. Therefore, when applied to problems satisfying the strict saddle property, this method converges to a local minimum. A similar result was shown for the vanilla gradient descent algorithm in [102]. The approximate stationarity notions referred to throughout this section are recalled below.
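For reference, in the smooth unconstrained setting these are the standard quantitative notions: for a twice differentiable objective $f$, a point $x$ is called $\epsilon_g$-first-order stationary, and additionally $(\epsilon_g, \epsilon_H)$-second-order stationary, when

$$\|\nabla f(x)\| \le \epsilon_g, \qquad \text{and} \qquad \lambda_{\min}\big(\nabla^2 f(x)\big) \ge -\epsilon_H,$$

respectively. The constrained analogues, which must also account for the geometry of the feasible set, are developed in Section 3.1.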
A negative result provided by [57] shows that vanilla gradient descent can take an exponential number of steps to converge to second-order stationarity. This computational inefficiency can be overcome by a smart perturbed form of gradient descent proposed in [88]. Most of the above results can be extended to smooth constrained optimization in the presence of simple manifold constraints. In this case, [101] shows that manifold gradient descent converges to second-order stationarity, almost surely. More recently, [85] established similar results for gradient primal-dual algorithms applied to linearly constrained optimization problems. When the constraints are not of manifold type, projected gradient descent is a natural replacement for gradient descent. As a negative result, we provide an example showing that projected gradient descent with random initialization may converge to a strict saddle point with positive probability, even in the presence of a single linear constraint. This raises the question of whether there exists a first order method that guarantees convergence to second-order stationarity in the presence of inequality constraints. To our knowledge, no affirmative answer has been given to this question to date.

The answer to the question above is obvious when replacing first-order methods with second-order methods. In fact, convergence to second-order stationarity in the presence of convex constraints has been established by adapting many of the aforementioned second-order algorithms [35, 40, 39]. The work in [35] adapts the ARC algorithm and shows convergence to $\epsilon_g$-first-order stationarity in at most $O(\epsilon_g^{-3/2})$ iterations. [24] uses an active set method and cubic regularization to achieve this rate for special types of constraints. The work [23] uses an interior point method to achieve second order stationarity in $O(\max\{\epsilon_g^{-3/2}, \epsilon_H^{-3}\})$ iterations for box constraints. For general constraints, [41] proposed a conceptual trust region algorithm that can compute an $\epsilon$-$q$th-order stationary point in at most $O(\epsilon^{-(q+1)})$ iterations. The iteration complexity bounds computed for these methods hide the per-iteration complexity of solving the sub-problem. These sub-problems are either quadratic or cubic constrained optimization problems, which are in general NP-complete; see Section 3.3 for more details.

More recently, [115] proposed a general framework for computing $(\epsilon_g, \epsilon_H)$-second-order stationary points for convex-constrained optimization problems with worst-case complexity $O(\max\{\epsilon_g^{-2}, \epsilon_H^{-3}\})$. More specifically, the framework uses a first order method, such as Frank-Wolfe or projected gradient descent, to converge to an approximate first order stationary point, and then computes a second order descent direction if one exists. Since solving the quadratic sub-problem to optimality is NP-Hard, they suggest approximately solving these sub-problems. In [124], we show that for linearly constrained non-convex problems, even checking whether a given point is an approximate second-order stationary point is NP-Hard. Despite this hardness result, we propose in [124] a second-order Frank-Wolfe algorithm that adapts the dynamic method introduced in [47], and we identify instances for which the constrained quadratic sub-problem can be solved in polynomial time. The algorithm converges to approximate first and second-order stationarity with a worst-case complexity similar to [47]; a schematic of the two-phase structure such methods share is sketched below.
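The following minimal sketch illustrates that shared two-phase structure over a box constraint: take a first order step while the Frank-Wolfe gap is large, then probe for feasible negative curvature. The short-step rule, tolerances, and curvature probe are illustrative simplifications, not the method of [124] or [47] (and, as noted above, the exact quadratic subproblem over a general polytope is NP-hard).

```python
import numpy as np

def second_order_fw_step(x, grad, hess, lo=-1.0, hi=1.0,
                         eps_g=1e-4, eps_h=1e-4):
    # First-order linear oracle over the box: argmin_s <grad, s>.
    s = np.where(grad > 0, lo, hi)
    fw_gap = grad @ (x - s)                  # Frank-Wolfe optimality gap
    if fw_gap > eps_g:
        # Short-step rule; a line search or the dynamic rule of [47]
        # would be used in practice.
        gamma = min(1.0, fw_gap / (np.dot(s - x, s - x) + 1e-12))
        return x + gamma * (s - x), "first-order step"
    # Otherwise look for negative curvature; here we only probe the
    # eigenvector of the smallest Hessian eigenvalue and project back.
    w, V = np.linalg.eigh(hess)
    if w[0] < -eps_h:
        return np.clip(x + 0.1 * V[:, 0], lo, hi), "second-order step"
    return x, "approximately second-order stationary"
```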
However, second-order information as utilized in the adapted ARC algorithm yields better iteration com- plexity rates. Motivated by this result, we propose a trust region algorithm, entitled LC-TRACE, that adapts TRACE to linearly-constrained non-convex problems. We establish the convergence of our algorithm to ( g ; H )-second order stationarity in at most e O( 3=2 g ; 3 H ) iterations. Details of the algorithms and the- oretical results are demonstrated in Section 3. This work was done in collaboration with Professor Meisam 13 Razaviyayn and Professor Jason D. Lee. In the next section, we focus on computing robust solutions for unconstrained/constrained optimization problems. 1.3 Robust Algorithms Recent years have witnessed a wide range of machine learning and robust optimization applications being formulated as a min-max saddle point game; see [140, 51, 50, 135, 71, 142] and the references therein. Although this formulation was previously studied for two-player zero sum-games, the new applications have brought signicant attention to this class of problems. Examples of problems that are formulated under this framework include generative adversarial networks (GANs) [140], reinforcement learning [51], adversarial learning [142], learning exponential families [50], fair statistical inference [63, 156, 141, 111], generative adversarial imitation learning [31, 84], distributed non-convex optimization [108] and many others. These applications require solving an optimization problem of the form min 2 max 2A f(;): (1.1) One can view this optimization problem as a zero-sum game between two players where the goal of the rst player is to minimize the objective function f(;) over the set of strategies , while the other player's objective is to maximize the objective function over the set of strategiesA. Gradient-based methods, espe- cially gradient descent-ascent (GDA), are widely used in practice to solve these problems. GDA algorithm alternates between a gradient ascent step on and a gradient descent step on . Despite its popularity, this algorithm fails to converge even for simple bilinear zero-sum games [114]. This failure of simple GDA for bilinear games was overcome by adding negative momentum or by using primal-dual methods proposed by [73, 72, 43, 52, 53]. A more general setting of the bilnear problem, we denote as convex-concave min-max saddle point games, is when the objective f is convex in and concave in . This setting has been extensively studied in the literature and based on the corresponding monotone variational inequality, dierent algorithms have been developed for nding a Nash equilibrium of such convex-concave games [118, 72, 116, 90, 113]. In another thread, several methods were designed to directly solve the convex-concave saddle point game. Among these methods is an extension of Frank-Wolfe method proposed by [74], and a randomized primal-dual method proposed by [79]. While the convex-concave setting has been extensively studied in the literature, recent machine learning applications urge the necessity of moving beyond this classical setting. An example of these applications 14 is GANs which constitutes of two neural networks (generator and discriminator) competing in a zero-sum game framework [75]. For general non-convex non-concave games, [89, Proposition 10] provided an example for which local Nash equilibrium does not exist. 
Similarly, we can show that a second-order Nash equilibrium may not exist for non-convex non-concave games with quadratic objectives; see Section 4.1 for more details. Therefore, inspired by the successes of first order algorithms in non-convex optimization, researchers have focused on finding first order Nash equilibria of such games; see the definitions and discussion in Section 4.1. The first order Nash equilibrium can be viewed as a direct extension of the concept of first order stationarity in optimization to the above min-max game setting.

While $\varepsilon$-first order stationarity in the context of optimization can be found efficiently in $O(\varepsilon^{-2})$ iterations with the gradient descent algorithm [119], the question of whether it is possible to design a polynomial-time algorithm that can find an $\varepsilon$-first order Nash equilibrium for general non-convex saddle point games remains unresolved. Several recent results provide an affirmative answer to the latter question when the underlying game has some special structure. For instance, [140] proposed a stochastic gradient descent algorithm for the case when the objective function is non-convex in $\theta$ and strongly concave in $\alpha$. They show convergence of the algorithm to an $\varepsilon$-first-order Nash equilibrium point with $\tilde{O}(\varepsilon^{-4})$ gradient evaluations. More recently, [108, 109] considered the problem when the objective $f$ is non-convex in $\theta$ and concave in $\alpha$, a setting motivated by several applications [108]. They developed a gradient-based descent-ascent alternating algorithm that computes an $\varepsilon$-first-order Nash equilibrium point with worst-case complexity $\tilde{O}(\varepsilon^{-4})$. In this non-convex concave setting, [135] proposed a stochastic sub-gradient descent method and showed convergence to an $\varepsilon$-first-order Nash equilibrium point with worst-case complexity $\tilde{O}(\varepsilon^{-6})$.

Under the same concavity assumption on $f$, we propose an alternative multi-step framework that finds an $\varepsilon$-first order Nash equilibrium/stationary point with $\tilde{O}(\varepsilon^{-3.5})$ gradient evaluations. At each iteration, our algorithm runs multiple steps of Nesterov's accelerated projected gradient ascent [120] to estimate the solution of a regularized version of the maximization problem. This solution is then used to estimate the gradient of the value function of the maximization problem, which directly provides a descent direction. To our knowledge, under similar settings, the complexity rate achieved by our method is superior to all other known methods.

In an effort to crack the more general non-convex non-concave setting, [104] developed a framework that successively solves strongly monotone variational inequalities. They show convergence to an $\varepsilon$-first order stationary point/Nash equilibrium under the assumption that there exists a solution to the Minty variational inequality. Although this is among the first algorithms with theoretical convergence guarantees in the non-convex non-concave setting, the conditions required are strong and difficult to check. To the best of our knowledge, there is no practical problem for which the Minty variational inequality condition has been proven.

With the motivation of exploring the non-convex non-concave setting, we propose a simple multi-step gradient descent ascent algorithm that finds an $\varepsilon$-first order Nash equilibrium when the objective of one of the players satisfies the Polyak-Łojasiewicz (PL) condition. We show that the worst-case complexity of our algorithm is $\tilde{O}(\varepsilon^{-2})$. It is worth noting that this rate is optimal in terms of the dependency on $\varepsilon$ up to logarithmic factors, as discussed in Section 4.2. A simplified sketch of the multi-step descent-ascent structure is given below.
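The following minimal sketch alternates several inner ascent steps with one outer descent step. It omits the projection, acceleration, and regularization that the actual algorithms of Chapter 4 use, so it should be read as a schematic rather than as the analyzed method; the step sizes and iteration counts are arbitrary.

```python
def multistep_gda(theta, alpha, grad_theta, grad_alpha,
                  eta=0.01, inner_eta=0.1, K=10, T=1000):
    """Schematic multi-step GDA for min_theta max_alpha f(theta, alpha).

    grad_theta / grad_alpha are callables returning the partial gradients;
    each outer iteration takes K inner ascent steps on alpha, then one
    descent step on theta using the resulting approximate inner maximizer.
    """
    for _ in range(T):
        for _ in range(K):                              # inner maximization
            alpha = alpha + inner_eta * grad_alpha(theta, alpha)
        theta = theta - eta * grad_theta(theta, alpha)  # outer descent step
    return theta, alpha

# Toy usage: f(theta, alpha) = theta*alpha - alpha^2/2 is concave in alpha;
# the unique first order Nash equilibrium is (0, 0).
th, al = multistep_gda(1.0, 0.0,
                       grad_theta=lambda t, a: a,
                       grad_alpha=lambda t, a: t - a)
print(th, al)   # both approach 0
```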
Compared to the Minty variational inequality condition used in [104], the PL condition is very well studied in the literature and has been theoretically verified for the objectives of optimization problems arising in many practical settings. For example, it has been proven to hold for the objectives of over-parametrized deep networks [59], learning LQR models [65], phase retrieval [150], and many other simple problems discussed in [95]. In the context of min-max games, it has also proven useful in generative adversarial imitation learning with LQR dynamics [31], where our result can be applied, as we discuss in Section 4.2. Another significant application of our results appears in fair statistical inference, which we detail in the next section. This work was done in collaboration with Maziar Sanjabi, Professor Meisam Razaviyayn, and Professor Jason D. Lee.

1.4 Application in Fair Statistical Inference

As we experience the widespread adoption of machine learning models in automated decision making, we have witnessed increased reports of instances in which the employed model results in discrimination against certain groups of individuals [54, 152, 26, 7]. In this context, we refer to discrimination as the unwanted distinction against individuals based on their membership in a certain tribe or group. For instance, [7] presents an example of a computer-based risk assessment model for recidivism that is biased against certain ethnicities. In another example, [54] demonstrates gender discrimination in online advertisements for web pages associated with employment. In addition to its ethical standpoint, equal treatment of different groups is legally required by many countries [1]. Due to its wide application and direct effect on societal balance, research on fairness in machine learning has attracted significant attention in recent years; see [33, 66, 81, 162, 156, 62, 67, 155, 159, 160, 133, 15].

Anti-discrimination laws imposed by many countries typically evaluate fairness by the notions of disparate treatment and disparate impact. We say a decision-making process suffers from disparate treatment if its decisions discriminate against individuals of a certain protected group (e.g., females, males, Black, white) based on their sensitive/protected attribute information. On the other hand, we say it has disparate impact if the decisions adversely affect a protected group of individuals with a certain sensitive attribute [160]. In simpler words, disparate treatment is intentional discrimination against a protected group, while disparate impact is an unintentional disproportionate outcome that hurts a protected group. To quantify these definitions, several notions of fairness have been proposed in recent works [32, 81]. Examples of these notions include demographic parity, equalized odds, and equalized opportunity.

The demographic parity condition requires that the model output (for example, the assigned label) be independent of the sensitive attributes. This condition formalizes the legal doctrine of disparate impact. However, it might not be desirable when the base rates of the two groups differ. This shortcoming can be avoided by using the equalized odds notion of fairness [81]. The equalized odds condition requires that the model output be conditionally independent of the sensitive attributes given the ground-truth label. In the case of binary sensitive attributes, this translates to having equal false positive and false negative rates across protected groups.
Finally, equalized opportunity requires having equal false positive or false negative rates across protected groups. All of the above fairness notions require (conditional) independence between the model output and the sensitive attribute.

Many approaches have been adopted for imposing fairness; they can be broadly classified into three main categories: pre-processing methods, post-processing methods, and in-processing methods. Pre-processing methods modify the training data to remove the discriminatory effects before passing the data to the classifier [32, 66, 92, 91, 93, 61, 33, 139]. Other recently proposed pre-processing methods use adversarial training to learn a fair representation of the data. More specifically, these methods map the training data to a transformed space in which the dependencies between the class label and the sensitive attributes are removed [63, 81, 156, 141, 134, 111, 161, 105]. On the other hand, post-processing methods adjust a pre-trained classifier to remove discrimination while maintaining high classification accuracy [67, 62, 155]. The third category is the in-processing approach, which enforces fairness on classifiers by either introducing constraints or adding a regularization term to the objective [159, 160, 133, 15, 19, 2, 42, 56, 137, 94, 162, 99, 112, 3]. The method we propose is in line with the in-processing approaches.

Several forms of constraints and regularization terms have recently been proposed to enforce fairness in classifiers. For instance, [94] proposed a regularization framework that uses a relaxed normalized mutual information measure to penalize the dependency between the model output and the sensitive attributes. Another regularization framework, which penalizes a relaxation of the difference in false positive and false negative rates across groups, was proposed in [15]. Unlike these regularization methods, [56] proposed adding constraints to impose the equalized odds condition. A common drawback of these methods is that they do not provide any bound on the relaxations used.

Along the same line of work, [160, 159] imposed constraints to bound the Pearson correlation between the sensitive attributes and the signed distance from users to the decision boundary. The formulated constrained problem was solved for logistic regression and linear SVM machine learning models. A major drawback of using Pearson correlation as an independence measure is that it only captures linear dependence between random variables. To capture second-order dependencies, [133] proposed using the Hilbert-Schmidt independence criterion (HSIC) as a fairness regularization term. The HSIC term is zero if and only if there is no second-order dependence between the random variables (model output and sensitive attribute). A common limitation of the proposed formulations is that the independence measures used (Pearson and HSIC) can be zero even if the two random variables have dependencies; see Section 4.4.1 for more details.

Motivated by this limitation, we introduce the Hirschfeld-Gebelein-Rényi (HGR) correlation principle as a tool for imposing several known group fairness measures. Unlike Pearson correlation and HSIC, the HGR correlation captures high order dependencies between random variables. Moreover, the HGR correlation coefficient of two random variables is zero if and only if the two random variables are independent, and is one if the two random variables are strictly dependent. For other interesting properties of the HGR correlation, we refer the reader to [64]; the standard definition is recalled below.
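For concreteness, the HGR (maximal) correlation of two random variables $U$ and $V$ admits the standard variational characterization

$$\rho_{\mathrm{HGR}}(U, V) \;=\; \sup_{f,\, g}\ \mathbb{E}\big[f(U)\, g(V)\big] \quad \text{s.t.} \quad \mathbb{E}[f(U)] = \mathbb{E}[g(V)] = 0, \ \ \mathbb{E}[f(U)^2] = \mathbb{E}[g(V)^2] = 1,$$

where the supremum is over measurable functions. Because of this inner supremum, penalizing $\rho_{\mathrm{HGR}}$ between the model output and the sensitive attribute turns a regularized learning problem into a min-max game of the form (1.1); the precise formulation we use is developed in Section 4.4.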
Using the HGR correlation coefficient as a regularization term, we propose a min-max game formulation for fair statistical inference. The goal of the first player is to minimize the loss objective over the set of model parameters, while the adversarial player's goal is to maximize the fairness loss, formulated using the maximal HGR correlation. By applying our HGR min-max regularization framework to classification, we show that the maximization problem has a closed form solution if the ground-truth labels are binary and the sensitive attribute is discrete (or vice versa). Motivated by the algorithm proposed in [127], we use an alternating gradient method to solve the min-max optimization problem to first-order stationarity. When the objective loss is convex, our method computes a global solution.

In addition to classification, we deploy our regularization method on the popular K-means clustering problem. Most of the recent works on fair clustering propose a two-phase algorithm for solving the problem [44, 146]. In the first phase, the set of data points is partitioned into small subsets, referred to as fairlets, that satisfy the fairness requirements. Then, in the second phase, these fairlets are merged into K clusters by one of the existing clustering algorithms. This method imposes fairness as a hard constraint, which can increase the loss significantly. Moreover, the pre-processing step (phase one) dominates the running time of the algorithm, limiting the scalability of the method. We use our HGR regularization framework to propose an efficient fair K-means clustering algorithm. More specifically, we add the HGR correlation regularization term to K-means to penalize the dependence between the allocation of points to clusters and their corresponding sensitive attributes. To solve the formulated regularized problem, we propose a fair K-means algorithm. We show that when the regularization coefficient is sufficiently large, our algorithm can achieve perfect fairness under the disparate impact doctrine. Details of our results are presented in Section 4.4. This work was done in collaboration with Sina Baharlouei and Professor Meisam Razaviyayn.

Chapter 2
Analyzing the Landscape of Optimization Problems Arising in Deep Learning

Deep learning models have recently led to significant practical successes in various fields of science and engineering. Despite these significant empirical successes, the theoretical understanding of the landscape of the optimization problems arising from learning deep non-convex models is still very limited. In an attempt to demystify the existing gap between empirical success and theoretical results, many recent works study the optimization problem under over-parameterization schemes [153, 143, 144, 58, 5, 103, 9, 122]. Under different assumptions, these results agree that over-parameterized deep neural networks can be trained to nearly global minima using (stochastic) gradient descent methods. However, many of these results require unrealistic over-parameterization, for which the number of nodes is polynomial in the number of sampled data points. Moreover, these results hold for a specific algorithm (stochastic gradient descent) and require strong assumptions on the input data. Attempting to understand the landscape of deep neural networks, [97, 98, 80, 157] analyzed the landscape of deep linear networks, providing sufficient conditions under which local optimal points of the training optimization problems are globally optimal.
Many of the existing results and mathematical analyses are problem-specific and do not generalize to the landscapes of other problems and network structures. In this chapter, we develop a unifying theoretical framework for studying the local/global optima equivalence of the optimization problems arising from training non-convex deep models. Before proceeding to the results, we clarify some notation and definitions that are used often in later parts. Other notation and terminology will be defined as needed.

2.1 Notation

First, we use $A_{l,:}$ and $A_{:,l}$ to denote the $l$-th row and $l$-th column of the matrix $A$, respectively. We denote by $I_d \in \mathbb{R}^{d\times d}$ the $d \times d$-dimensional identity matrix. Let $\|A\|$, $\mathcal{N}(A)$, $\mathcal{C}(A)$, and $\operatorname{rank}(A)$ be, respectively, the Frobenius norm, null space, column space, and rank of the matrix $A$. Given subspaces $\mathcal{U}$ and $\mathcal{V}$, we say $\mathcal{U} \perp \mathcal{V}$ if $\mathcal{U}$ is orthogonal to $\mathcal{V}$, and $\mathcal{U} = \mathcal{V}^{\perp}$ if $\mathcal{U}$ is the orthogonal complement of $\mathcal{V}$. We say a matrix $A \in \mathbb{R}^{d_1\times d_0}$ is rank deficient if $\operatorname{rank}(A) < \min\{d_1, d_0\}$, and full rank if $\operatorname{rank}(A) = \min\{d_1, d_0\}$. We use $\mathbb{R}$ and $\mathbb{N}$ to denote the set of real numbers and the set of integers, respectively.

Definition 2.1.1. We call a point $W = (W_h, \ldots, W_1)$, with $W_i \in \mathbb{R}^{d_i\times d_{i-1}}$, non-degenerate if $\operatorname{rank}(\prod_{i=1}^{h} W_i) = \min_{0\le i\le h} d_i$, and degenerate if $\operatorname{rank}(\prod_{i=1}^{h} W_i) < \min_{0\le i\le h} d_i$.

Definition 2.1.2. We say a point $W$ is a second order saddle point (strict saddle point) of an unconstrained optimization problem if the gradient of the objective function is zero at $W$ and the Hessian of the objective function at $W$ has a negative eigenvalue.

2.2 Motivation and Contributions

To understand the landscape of non-convex deep models, we study the general optimization problem

$$\min_{w\in\mathcal{W}}\ \ell(\mathcal{F}(w)), \tag{2.1}$$

where $\ell(\cdot)$ is the loss function and $\mathcal{F}(\cdot)$ represents a statistical model with parameter $w$ that needs to be learned by solving the above optimization problem. A simple example is the popular linear regression problem $\min_w \|Xw - y\|_2^2$, where $y$ is a given constant response vector and $X$ is a given constant feature matrix. In this example, the loss function is the $\ell_2$ loss, i.e., $\ell(z) = \|z - y\|_2^2$, and the fitted model $\mathcal{F}$ is a linear model, i.e., $\mathcal{F}(w) = Xw$. While this linear regression problem is convex and easy, fitting many practical models, such as deep neural networks, requires solving non-trivial non-convex optimization problems. Our result uses the local openness property of the mapping $\mathcal{F}$ to provide sufficient conditions under which every local optimum of (2.1) is global. We start by briefly describing some motivating examples.

Example 1: Training feedforward neural networks. Consider the following multi-layer feedforward neural network optimization problem

$$\min_{W}\ \frac{1}{2}\big\|\mathcal{F}_h(W) - Y\big\|^2,$$

where $\mathcal{F}_h$ is defined in a recursive manner:

$$\mathcal{F}_k(W) \triangleq \sigma_k\big(W_k\, \mathcal{F}_{k-1}(W)\big) \ \ \text{for } k\in\{2,\ldots,h\}, \qquad \text{with} \ \ \mathcal{F}_1(W) \triangleq \sigma_1(W_1 X).$$

Here $h$ is the number of hidden layers in the network, $\sigma_k(\cdot)$ denotes the activation function of layer $k$, and the matrix $W_k \in \mathbb{R}^{d_k\times d_{k-1}}$ is the weight of layer $k$, with $W \triangleq (W_i)_{i=1}^{h}$ being the optimization variable. The matrix $X \in \mathbb{R}^{d_0\times n}$ is the input training data and $Y \in \mathbb{R}^{d_h\times n}$ is the target training data, where $n$ is the number of samples; see, e.g., [76]. A short sketch of this recursive map is given below. Notice that this problem is a special case of the optimization problem in (2.1), obtained simply by setting our loss function to the $\ell_2$ loss and setting $\mathcal{F} = \mathcal{F}_h$. A special instance of this optimization problem was studied in [123], which considers non-linear neural networks with a pyramidal structure (i.e., $d_i \le d_{i-1}$ for all $i = 1,\ldots,h$ and $d_0 \ge n$).
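The recursive map $\mathcal{F}_h$ above is straightforward to realize in code. The following minimal sketch uses ReLU for every $\sigma_k$ and random data purely for illustration; the chapter's results do not depend on these choices.

```python
import numpy as np

def forward(Ws, X, sigma=lambda z: np.maximum(z, 0.0)):
    # F_1(W) = sigma_1(W_1 X); F_k(W) = sigma_k(W_k F_{k-1}(W)).
    F = sigma(Ws[0] @ X)
    for Wk in Ws[1:]:
        F = sigma(Wk @ F)
    return F

def loss(Ws, X, Y):
    # The l2 training objective 0.5 * ||F_h(W) - Y||^2 from Example 1.
    return 0.5 * np.linalg.norm(forward(Ws, X) - Y) ** 2

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((4, 5)), rng.standard_normal((2, 5))
Ws = [rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]  # h = 2
print(loss(Ws, X, Y))
```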
The results in [123, Theorem 3.8] show that under some conditions, among which are the differentiability of the loss function $\ell(\cdot)$ and the activation function $\sigma(\cdot)$, every critical point $W$ (with the $W_i$'s being full row rank) is a global minimum. Under the framework we introduce, we reproduce these results, showing that any local optimum of the objective function is a global optimum. This result holds even when the differentiability assumption on both $\ell(\cdot)$ and $\sigma(\cdot)$ is relaxed.

Another special instance is the linear feedforward network, where the mapping $\sigma_k(\cdot)$ is the identity map in all layers. This yields the following optimization problem:

$$\min_{W}\ \frac{1}{2}\big\|W_h \cdots W_1 X - Y\big\|^2. \tag{2.2}$$

For this optimization problem, [106, 157] show that every local optimum of the objective function is globally optimal under the assumption that $X$ and $Y$ are both full row rank. It is in fact not hard to see that one cannot relax the full rankness assumption on $Y$, due to the following simple counterexample:

$$X = I, \quad \bar{W}_3 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \bar{W}_2 = \begin{bmatrix} 0 \end{bmatrix}, \quad \bar{W}_1 = \begin{bmatrix} 1 & 0 \end{bmatrix}, \quad Y = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}.$$

In this example, the point $\bar{W} = (\bar{W}_1, \bar{W}_2, \bar{W}_3)$ is a local optimum of the 3-layer deep linear model problem (2.2) with $h = 3$ that is not global. Despite this counterexample, we show that if a given local optimum is non-degenerate, the full rankness of $Y$ can be relaxed. Moreover, for degenerate local optima, we provide necessary and sufficient conditions under which all local minima are global. Hence, using local openness, we develop a complete characterization of the local/global optima equivalence of multi-layer linear neural networks. To date, all existing works in the literature have focused on positive results, i.e., they provide assumptions under which every local optimum is globally optimal. In [125], however, we provide necessary and sufficient conditions on the architecture of the network: when these conditions are met, all local optimal points are globally optimal; otherwise, we construct local optima that are not global. Unlike many existing results that focus on a special algorithm (e.g., gradient descent), our result depends only on the landscape, with no assumption on the probability distribution of the input data.

Example 2: Matrix factorization and matrix completion. Matrix completion is the problem of recovering a low-rank matrix from partially observed entries. This problem requires solving a non-convex optimization problem that falls under the framework defined in (2.1). [69] shows that symmetric matrix completion has no spurious local optima. If all entries of the matrix are observed, the matrix completion problem reduces to

$$\min_{W_1, W_2}\ \frac{1}{2}\big\|W_2 W_1 - Y\big\|^2, \tag{2.3}$$

which is also referred to as the low-rank matrix estimation problem [145]. This can also be viewed as a 2-layer linear neural network optimization problem with input data matrix $X = I$; a small numerical sketch of gradient descent on (2.3) is given below. This is also a special case of (2.1), with the loss function being the $\ell_2$ loss and the mapping $\mathcal{F}$ defined as $\mathcal{F}(W_1, W_2) = W_2 W_1$. Without any assumption on the data matrix $X$ or the label matrix $Y$, we show that every critical point of (2.3) is either a global minimum or a second-order saddle point. This result can be generalized to a general convex loss function $\ell(\cdot)$ for degenerate critical points. Our results harness the complete characterization of the local openness of the matrix multiplication mapping in its range, which is one of our major contributions. This complete characterization can potentially be used in many other optimization problems for establishing local/global equivalence.
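As a quick illustration of problem (2.3), the following sketch runs plain gradient descent on a fully observed low-rank target. The dimensions, initialization scale, step size, and iteration count are arbitrary illustrative choices, not ones from this chapter's analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 10, 8, 3
Y = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))  # rank-k target
W2 = 0.1 * rng.standard_normal((m, k))
W1 = 0.1 * rng.standard_normal((k, n))

eta = 0.01
for _ in range(20000):
    R = W2 @ W1 - Y                 # residual of the current factorization
    G2, G1 = R @ W1.T, W2.T @ R     # gradients of 0.5 * ||W2 W1 - Y||_F^2
    W2, W1 = W2 - eta * G2, W1 - eta * G1

print(0.5 * np.linalg.norm(W2 @ W1 - Y) ** 2)  # typically driven near 0
```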
Example 3: The Burer-Monteiro approach for solving semidefinite programs. The seminal work of Burer and Monteiro [30, 29] studies the SDP problem

$$\min_{Z\in\mathbb{R}^{n\times n}}\ \langle C, Z\rangle \quad \text{s.t.} \quad \langle A_i, Z\rangle = b_i, \ \forall i = 1,\ldots,m, \ \text{and} \ Z \succeq 0, \tag{2.4}$$

and suggests solving it through the non-convex reformulation

$$\min_{W\in\mathbb{R}^{n\times k}}\ \langle C, WW^T\rangle \quad \text{s.t.} \quad \langle A_i, WW^T\rangle = b_i, \ \forall i = 1,\ldots,m. \tag{2.5}$$

The result in [30, Proposition 2.3] shows that every local optimum of (2.5) is a local optimum of the following optimization problem:

$$\min_{Z\in\mathbb{R}^{n\times n}}\ \langle C, Z\rangle \quad \text{s.t.} \quad \langle A_i, Z\rangle = b_i, \ \forall i = 1,\ldots,m, \ \ Z \succeq 0, \ \ \operatorname{rank}(Z) \le k. \tag{2.6}$$

Moreover, if $k$ is chosen large enough, then the two optimization problems (2.6) and (2.4) are equivalent; see [132]. Considering the mapping $\mathcal{F}(W) = WW^T$, this problem becomes a special case of our original problem statement (2.1) with a linear loss function $\ell(\cdot)$. In [125], we establish the local openness of the symmetric matrix multiplication mapping $\mathcal{F}(W) = WW^T$. This result provides a simple and intuitive proof of the relation between the local optima of problems (2.5) and (2.6). Moreover, it extends this relation to non-linear and even non-continuous loss functions.

Contributions: In this chapter, we develop a rigorous theoretical framework for characterizing the landscape of general optimization problems arising in statistical machine learning settings, and study the implications of this framework for the training of deep neural networks. The theoretical framework harnesses the concept of local openness from differential geometry to provide sufficient conditions under which local optima of the objective function are global. Due to the deep structure of these models, the matrix multiplication mapping (multiplication of two matrices) naturally appears in the formulation of these problems. Moreover, this mapping is widely used as a non-convex factorization approach for rank-constrained problems. In this chapter, we completely characterize the local openness of this mapping in its range, i.e., we provide necessary and sufficient conditions under which the matrix multiplication mapping is locally open. Based on this theoretical result, we develop a complete characterization of the local/global optima equivalence of multi-layer linear neural networks and provide sufficient conditions under which no spurious local optima exist for hierarchical non-linear deep neural networks.

2.3 Mathematical Framework

As discussed in the previous section, we are interested in solving

$$\min_{w\in\mathcal{W}}\ \ell(\mathcal{F}(w)), \tag{2.7}$$

where $\mathcal{F} : \mathcal{W} \mapsto \mathcal{Z}$ is a mapping and $\ell : \mathcal{Z} \mapsto \mathbb{R}$ is the loss function. Here we assume that the set $\mathcal{W}$ is closed and the mapping $\mathcal{F}$ is continuous. To proceed, let us define the auxiliary optimization problem

$$\min_{z\in\mathcal{Z}}\ \ell(z), \tag{2.8}$$

where $\mathcal{Z}$ is the range of the mapping $\mathcal{F}$. Since problem (2.8) minimizes the function $\ell(\cdot)$ over the range of the mapping $\mathcal{F}$, the global optimal objective values of problems (2.7) and (2.8) are the same. Moreover, there is a clear relation between the global optimal points of the two optimization problems through the mapping $\mathcal{F}$. However, the connection between the local optima of the two optimization problems is not clear. This connection is particularly important when the local optima of (2.8) are "nice" (e.g., globally optimal or close to optimal). In what follows, we establish the connection between the local optima of the optimization problems (2.7) and (2.8) under simple sufficient conditions. This connection is then used to study the relation between local and global optima of (2.7) and (2.8) for various non-convex learning models.
In summary, our strategy is to map the original optimization problem (2.7) to a generally simpler auxiliary problem (2.8) for which the underlying landscape has a special structure (e.g., all local optima are global), and then pass this special structure back to (2.7) by establishing the connection between the local optima of the two problems. We start by defining the following important concepts.

Open mapping: A mapping $\mathcal{F} : \mathcal{W} \to \mathcal{Z}$ is said to be open if for every open set $U \subseteq \mathcal{W}$, $\mathcal{F}(U)$ is (relatively) open in $\mathcal{Z}$.

Locally open mapping: A mapping $\mathcal{F}(\cdot)$ is said to be locally open at $w$ if for every $\epsilon > 0$, there exists $\delta > 0$ such that $\mathbb{B}_\delta(\mathcal{F}(w)) \subseteq \mathcal{F}(\mathbb{B}_\epsilon(w))$. Here $\mathbb{B}_\epsilon(w) \subseteq \mathcal{W}$ is the open ball of radius $\epsilon$ centered at $w$, and $\mathbb{B}_\delta(\mathcal{F}(w)) \subseteq \mathcal{Z}$ is the ball of radius $\delta$ centered at $\mathcal{F}(w)$.

A mapping is locally open everywhere if and only if it is open. A useful property of (locally) open mappings is that the composition of two (locally) open maps is (locally) open at a given point. The following simple, intuitive observation, which establishes the connection between the local optima of (2.7) and (2.8), is a major building block of our analyses.

Observation 2.3.1. Suppose $\mathcal{F}(\cdot)$ is locally open at $\bar{W}$. If $\bar{W}$ is a local minimum of problem (2.7), then $\bar{z} = \mathcal{F}(\bar{W})$ is a local minimum of problem (2.8).

Figure 2.1: Sketch of the proof of Observation 2.3.1.

Proof. Let $\bar{W}$ be a local minimum of problem (2.7). Then there exists an $\epsilon > 0$ such that $\ell(\mathcal{F}(\bar{W})) \le \ell(\mathcal{F}(w))$ for all $w \in \mathbb{B}_\epsilon(\bar{W})$. By the definition of local openness, there exists $\delta > 0$ such that $\mathbb{B}_\delta(\bar{z}) \subseteq \mathcal{F}(\mathbb{B}_\epsilon(\bar{W}))$ with $\bar{z} = \mathcal{F}(\bar{W})$. Therefore, $\ell(\bar{z}) \le \ell(z)$ for all $z \in \mathbb{B}_\delta(\bar{z})$, which implies $\bar{z}$ is a local minimum of problem (2.8).

The above observation can be used to map multiple local optima of the original problem (2.7) to one local optimum of the auxiliary problem (2.8), potentially making the problem easier to analyze. This mapping is particularly interesting in neural networks, since permuting the neurons and the corresponding weights in each layer does not change the objective function. Hence, by nature, the underlying landscapes of these optimization problems have multiple (disconnected) local/global optima. However, collapsing these multiple local optima to a single point can simplify the problem considerably. In other words, instead of analyzing the original landscape with multiple disconnected local optima, we analyze the simpler landscape of the resulting mapping. Let us clarify this point through the following simple examples.

Figure 2.2: (a) Original problem; (b) auxiliary problem. The two local minima $w = -1$ and $w = +1$ in (a) are mapped to a single local minimum $z = 1$ in (b).

Example 2.3.1. Consider the optimization problem

$$\min_{w\in\mathbb{R}}\ (w^2 - 1)^2, \tag{2.9}$$

and its corresponding auxiliary problem

$$\min_{z\ge 0}\ (z - 1)^2. \tag{2.10}$$

Plots of these two problems can be found in Figures 2.2a and 2.2b. Since $\mathcal{F}(w) \triangleq w^2$ is an open mapping in its range, it follows from Observation 2.3.1 that every local minimum of problem (2.9) is a local minimum of problem (2.10). Thus the two local minima $w = -1$ and $w = +1$ of (2.9) are mapped to a single local minimum $z = 1$ of problem (2.10). Moreover, since the optimization problem (2.10) is convex, the local minimum is global; hence the original local optima $w = -1$ and $w = +1$ must both be global despite the non-convexity of (2.9).

Figure 2.3: (a) Original problem; (b) auxiliary problem. All points in the set $\{(w_1, w_2) \mid w_1 w_2 = 1\}$ are local minima in (a) and are mapped to the single local minimum $z = 1$ in (b).

Example 2.3.2. Another example is related to the widely used matrix multiplication mapping $W_1 W_2$.
Let ( W 1 ; W 2 ) be a local minimum of the optimization problem min W1;W2 `(W 1 W 2 ): Then, any point in the setS ,f( W 1 Q 1 ; Q 2 W 2 ) with Q 1 Q 2 = Ig is also a local minimum. If the matrix product W 1 W 2 is locally open at the point ( W 1 ; W 2 ), then all points inS are mapped to a single local minimum Z = W 1 W 2 in the corresponding auxiliary problem. A simple one dimensional example is plotted 27 in Figures 2.3a and 2.3b. Observation 2.3.1 motivates us to study the local openness of mappings appearing in various widely-used optimization problems. One example that is used in the famous Burer-Monteiro approach for semi-denite programming [30], is the symmetric matrix multiplication mapping dened as M + :R nk 7!R M+ with M + (W), WW T : HereR M+ ,f Z2R nn j Z 0; rank(Z) minfn;kgg is the range ofM + . Another mapping that is widely used in many optimization problems, such as deep neural networks (2.2) and matrix completion (2.3), is the matrix multiplication mapping dened as M :R mk R kn 7!R M with M(W 1 ; W 2 ), W 1 W 2 ; (2.11) whereR M ,f Z2R mn j rank(Z) min(m;n;k)g is the range of the mappingM. Although, the matrix multiplication mappingM(W 1 ; W 2 ) naturally appears in deep models and is widely used as a non-convex factorization for rank constrained problems, see [154, 22, 69, 145, 151], to our knowledge, the complete characterization of the local openness of this mapping has not been studied in the optimization literature. Similarly, the symmetric matrix multiplication mappingM + (W) is widely used as a non-convex factorization in semi-denite programming (SDP), see [30, 163, 131, 28], and the characterization of the openness of this mapping remains unsolved. While the classical open mapping theorem in [138] states that surjective continuous linear operators are open, this is not true for general bilinear mappings. In fact, by providing a simple counterexample of a bilinear mapping that is not open, [86] shows that the linear case cannot be generally extended to multilinear maps. Several papers, see [10, 11, 17], investigate this bilinear mapping and provide a characterization of the points where this mapping is open. The more general matrix multiplication mappingM was studied in [18] that provides necessary and sucient conditions under which the mapping is locally open in R mn . However, in our framework the (relative) local openness should be studied with respect to the range of the mappingR M which can be dierent fromR mn when k< minfm;ng. As an intuitive (and unocial) denition of local openness ofM() at ( W 1 ; W 2 ) inR M , we say the mul- tiplication mapping is locally open at ( W 1 ; W 2 ) if for any small perturbation e Z 2 R M of Z = W 1 W 2 , there exists a pair ( f W 1 ; f W 2 ), a small perturbation of ( W 1 ; W 2 ), such that e Z = f W 1 f W 2 : In the latter case, the mapping is denitely not locally open in R mn , but can still be locally open inR M . 28 For a simple example, consider W 1 = 2 4 1 2 3 5 and W 2 = h 1 1 i . In this example there does not exist f W 1 , f W 2 perturbations of W 1 and W 2 respectively such that f W 1 f W 2 = e Z when e Z is a full rank perturbation of Z = W 1 W 2 . However, for any rank 1 perturbation e Z, we can nd a perturbed pair ( f W 1 ; f W 2 ) such that e Z = f W 1 f W 2 . Motivated by Observation 2.3.1, we next study the local openness/openness of the mappings M(;) andM + (). We later use these results to analyze the behavior of local optima of deep neural networks. 
2.4 Local Openness of the Symmetric and Non-Symmetric Matrix Multiplication Mappings When W 1 2R mk and W 2 2R kn withk minfm;ng, the range of the mappingM(W 1 ; W 2 ) = W 1 W 2 is the entire space R mn . In this case, which we refer to as the full rank case, [18, Theorem 2.5] provides a complete characterization of the pairs (W 1 ; W 2 ) for which the mapping is locally open. However, when k< minfm;ng, which we refer to as the rank-decient case, the characterization of the set of points for which the mapping is locally open remains unresolved. We settled this question in Theorem 2.4.2 by providing a complete characterization of points (W 1 ; W 2 ) for which the mappingM is locally open whenk< minfm;ng. Moreover, we show in Theorem 2.4.3 that the symmetric matrix multiplicationM + is open in its rangeR M+ . The proofs of these theorems can be found in Appendices A.4 and A.5, respectively. We start by restating the main result in [18]: Proposition 2.4.1. [18, Theorem 2.5 Rephrased] LetM(W 1 ; W 2 ) = W 1 W 2 denote the matrix multiplica- tion mapping with W 1 2R mk and W 2 2R kn . Assumek minfm;ng. Then the the following statements are equivalent: 1. M(;) is locally open at (W 1 ; W 2 ). 2. 8 > > > > < > > > > : 9 f W 1 2R mk such that f W 1 W 2 = 0 and W 1 + f W 1 is full row rank: or 9 f W 2 2R kn such that W 1 f W 2 = 0 and W 2 + f W 2 is full column rank. 3. dim N ( W 1 )\C( W 2 ) km or n (rank( W 2 ) dim N ( W 1 )\C( W 2 ) k rank( W 1 ). The above proposition provides a checkable condition which completely characterizes the local openness of the mappingM at dierent points when the range of the mapping is the entire space. Now, let us state our result that characterizes the local openness of the mappingM in its range when k< minfm;ng. 29 Theorem 2.4.2. LetM(W 1 ; W 2 ) = W 1 W 2 denote the matrix multiplication mapping with W 1 2R mk and W 2 2 R kn . Assume k < minfm;ng. Then if rank(W 1 )6= rank(W 2 ),M(;) is not locally open at (W 1 ; W 2 ). Else, if rank(W 1 ) = rank(W 2 ), then the following statements are equivalent: i) 9 f W 1 2R mk such that f W 1 W 2 = 0 and W 1 + f W 1 is full column rank. ii) 9 f W 2 2R kn such that W 1 f W 2 = 0 and W 2 + f W 2 is full row rank. iii) dim N (W 1 )\C(W 2 ) = 0. iv) dim N (W T 2 )\C(W T 1 ) = 0. v) M(;) is locally open at (W 1 ; W 2 ) in its rangeR M . Note that the proof of Theorem 2.4.2, which can be found in Appendix A.4, is dierent than the proof of Proposition 2.4.1, as in the former we need to work with the set of low rank matrices. Besides, the conditions in Theorem 2.4.2 are dierent than the ones in Proposition 2.4.1. For example, while conditions i) and ii) are equivalent in the rank-decient case, they are not equivalent in the full-rank case. Moreover, unlike the full-rank case, the condition rank(W 1 ) = rank(W 2 ) is necessary for local openness in the former result. How much perturbation is needed? As previously mentioned, local openness can be described in terms of perturbation analysis. For example,M(;) is locally open at (W 1 ; W 2 ) if for a given > 0, there exists > 0 such that for any e Z = Z +R 2R M withkR k, there exists f W 1 , f W 2 withk f W 1 k,k f W 2 k, such that e Z = (W 1 + f W 1 )(W 2 + f W 2 ). As a perturbation bound on , we show that for any locally open pair (W 1 ; W 2 ), given an > 0, the chosen is of order , i.e., =O(). The details of our analysis can be found in the proof of Theorem 2.4.2 in Appendix A.4. Now we state our result for the mappingM + . Theorem 2.4.3. 
LetM + (W) = WW T be the symmetric matrix multiplication mapping. ThenM + () is open in its rangeR M+ . How much perturbation is needed? A perturbation bound for the symmetric matrix multiplication was also derived, details in Appendix A.5. We show that for any W, given an > 0, the chosen is of order , i.e., =O(). Remark 2.4.4. Since mappingM + is open inR M+ , then by Observation 2.3.1, any local minimum of the optimization problem min W2W `(WW T ) leads to a local minimum in the optimization problem min Z2Z `(Z) whereZ =fZ j Z 0; Z = WW T ; W2Wg. Consequently, if the optimization problem on Z is convex, 30 then every local minimum of the rst optimization problem is global. This provides a simple and intuitive proof for the Burrer-Monteiro result [30, Proposition 2.3]; moreover, it extends it by relaxing the continuity assumption on `(). Remark 2.4.5. It follows from Theorem 2.4.2 that when W 1 is full column rank and W 2 is full row rank, the mappingM(;) is locally open at (W 1 ; W 2 ). This result was observed in other works; see, e.g., [151, Proposition 4.2]. In the next sections, we use our local openness result to characterize the cases where the local optima of various training optimization problems of the form (2.7) are globally optimal. 2.5 Non-linear Deep Neural Network with a Pyramidal Structure Consider the non-linear deep neural network optimization problem with a pyramidal structure min W ` F h (W) with F i (W), i W i F i1 (W) ; for i2f2;:::;hg; (2.12) andF 1 (W) , 1 (W 1 X) where i () : R7! R is a continuous and strictly monotone activation function applied component-wise to the entries of each layer, i.e., i (A) = [ i (A jk )] j;k . Here W = W i h i=1 where W i 2R didi1 is the weight matrix of layeri, and X2R d0n is the input training data. In this section, we consider the pyramidal network structure with d 0 >n andd i d i1 for 1ih; see [123] for more details on these types of networks. First notice that when X is full column rank and the functions i 's are all continuous and strictly monotone, the image set of the mappingF h is convex and hence every local optimum of the auxiliary optimization problem (2.8) is global. We now show that when W i 's are all full row rank and the functions i 's are all strictly monotone, the mappingF h is locally open at W. Lemma 2.5.1. Assume the functions i () : R7! R are all continuous and strictly monotone. Then the mappingF h dened in (2.12) is locally open at the point W = (W 1 ;:::; W h ) if W i 's are all full row rank. Before proving this result, we would like to remark that many of the popular activation functions such as logit, tangent hyperbolic, and leaky ReLu are strictly monotone and satisfy the assumptions of this lemma. Proof. We prove the result by means of induction. Using the local openness property of linear maps and the strict monotonicity of 1 (), the composition property of open maps directly implies thatF 1 is open. Now assumeF k1 W i k1 i=1 is locally open at W i k1 i=1 , then using Proposition 2.4.1 for a full rank W k , the mapping W k F k1 W i k1 i=1 is locally open at W k ; (W i ) k1 i=1 . Finally, the composition property of open maps and strict monotonicity of k () imply thatF k (W i ) k i=1 is locally open at W i k i=1 . 31 Theorem 2.5.2. If W is a local optimum of problem (2.12) with W i 's being full row rank, then it is also a global optimum. Proof. 
Lemma 2.5.1 in conjunction with Observation 2.3.1 imply that if W is a local optimum of problem (2.12) with W i 's being full row rank, then Z =F h ( W) is a local optimum of the corresponding auxiliary problem minimize Z2Z `(Z) whereZ is a convex set. Consequently, Z is a global optimum of problem (2.12) when the loss function `() is convex. [123] show that every critical point W of problem (2.12) with W i 's being full row rank is a global optimum when both () and `() are dierentiable. Our result relaxes the dierentiability assumption on both the activation and loss functions. However, we can only show all local optima are global. A popular activation function that is strictly monotonic and not dierentiable is the Leaky ReLU, for which our result follows. It is also worth mentioning that [123] allow wide intermediate layers in parts of their result. Extending this result to non-dierentiable activation functions remains unsolved. 2.6 Two-Layer Linear Neural Network Consider the two layer linear neural network optimization problem min W 1 2 kW 2 W 1 X Yk 2 ; (2.13) where W 2 2R d2d1 , and W 1 2R d1d0 are weight matrices, X2R d0n is the input data, and Y2R d2n is the target training data. Using our transformation, the corresponding auxiliary optimization problem can be written as min Z 1 2 jjZX Yjj 2 s:t: rank(Z) minfd 2 ; d 1 ; d 0 g : (2.14) [97, Theorem 2.3] shows that when XX T and YX T are full rank, d 2 d 0 , and when YX T (XX T ) 1 XY T has d 2 distinct eigenvalues, every local optimum is global and all saddle points are second order saddles. While the local/global equivalence result holds for deeper networks, the property that all saddles are second order does not hold in that case. Another result by [157, Theorem 2.2] shows that when XX T , YX T , and YX T (XX T ) 1 XY T are full rank, every local optimum of a linear deep network is global. However, in their proof, the full rankness assumption of YX T was not used in showing the result for non-degenerate critical points and thus can be relaxed in that case. In this section, without any assumptions on both X and Y, we reconstruct the proof that shows the latter result for 2-layer networks using local openness, and then show 32 a similar result for the degenerate case. The result for the degenerate case holds when replacing the square loss error by a general convex loss function as we will see in Corollary 2.6.2. The proofs of the theorem and corollary stated below can be found in Appendices A.1 and A.2, respectively. Theorem 2.6.1. Every local minimum of problem (2.13) is global. Moreover, every degenerate saddle point of problem (2.13) is a second order saddle. Proof. The proof of the theorem is relegated to Appendix A.1. Corollary 2.6.2. Let the square loss error in (2.13) be replaced by a general convex loss function `(). Then every degenerate critical point is either a global minimum or a second order saddle. Proof. The proof of the corollary is relegated to Appendix A.2. [12] and [145] show the same result when both X and Y are full row rank. Theorem 2.6.1 generalizes their results by relaxing the assumptions on both X and Y. 2.7 Multi-Layer Linear Neural Network Consider the training problem of multi-layer deep linear neural networks: min W 1 2 kW h W 1 X Yk 2 : (2.15) Here W = W i h i=1 , W i 2 R didi1 are the weight matrices, X 2 R d0n is the input training data, and Y2 R d h n is the target training data. 
Based on our general framework, the corresponding auxiliary optimization problem is given by min Z2R d h n 1 2 jjZX Yjj 2 s:t: rank(Z) d p , min 0ih d i : (2.16) Paper [106] showed that when X and Y are full row rank, every local minimum of (2.15) is global. By providing counterexamples, we show that this result does not hold when relaxing the full rankness assumption on Y. However, for certain network architectures we show that regardless of whether Y is full rank, every local minimum of (2.15) is global. Before proceeding to the proof we dene the mapping M i;j (W i ;:::; W j ) :fW i ;:::; W j g!R Mi;j for i>j; whereR Mi;j ,fZ = W i W j 2R didj1 j rank(Z) min j1li d l g. We now show that under a set of necessary conditions, every local minimum of problem (2.15) is global. 33 Lemma 2.7.1. Every non-degenerate local minimum of (2.15) is a global minimum. Proof. The proof of the Lemma is relegated to Appendix A.3. As previously mentioned, due to a simple counterexample, we cannot in general relax the full rankness assumption on Y. We now determine problems structures for which every degenerate local minimum is global, i.e. (due to Lemma 4) problem structures for which every local minimum is global. Theorem 2.7.2. If there exist p 1 and p 2 , 1p 1 <p 2 h 1 with d h >d p2 and d 0 >d p1 , we can nd a rank decient Y such that problem (2.15) has a local minimum that is not global. Otherwise, given any X and Y, every local minimum of problem (2.15) is a global minimum. Proof. The proof of the lemma is relegated to Appendix A.6. Various counterexamples are presented in the proof to show the necessity of the assumption in Theorem 2.7.2. Remark 2.7.3. Following the same steps of the proof of Theorem 2.7.2, we get the same result when replacing the square error loss by a general convex and dierentiable function `(). Moreover, if the range of the mappingM h is the entire space, i.e. min 0ih d i = minfd h ;d 0 g, the auxiliary problem (2.16) is unconstrained and convex. Then, as we show in corollary 2.7.4, every non-degenerate critical point is global, and every degenerate critical point is either a saddle point or a global minimum; which generalizes [157, Theorem 2.1]. Corollary 2.7.4. Consider problem (2.15) with general convex and dierentiable loss function `(). When min i d i = min(d h ;d 0 ), every non-degenerate critical point is global, and every degenerate critical point is either a saddle point or a global minimum. Proof. Proof of the corollary is relegated to Appendix A.7. 2.8 Future Work In this chapter, we provided a unied theoretical framework that harness the concept of local openness for analyzing the landscape of optimization problems arising from learning deep non-convex models. Our strategy is to map the original optimization problem to another domain. In that domain, we show that the landscape of the resulting optimization problem acquires some special structure (all local optima are global). Then by connecting the local optima of the two problems using local openness, we reveal a hidden special structure in the original problem. 34 Our proposed theoretical framework for studying the landscape of non-convex optimization problems, re- vealed a host of new intriguing open questions. Although the equivalence between the sets of local and global optimal points establishes signicant understanding of the underlying loss surface, computing these points can be NP-Hard in general. 
In the future, we plan to study similar equivalence relations between com- putable points (for example rst and second-order critical points) and global optimal points; thus providing conditions under which local search methods are able to compute the global optima, despite non-convexity. Leveraging on our result, [164] shows a similar relation for low-rank matrix factorization. In particular, they show that every critical point of the low-rank matrix factorization problem is global. Extending these results to other statistical learning optimization problems is among my future goals. Another interesting future direction relies on observations we made in our constructed proofs A.6. In par- ticular, our proof develops gradient-free descent directions for optimizing the parameters of a deep neural network. Utilizing these directions to develop ecient and scalable gradient-free algorithms that can e- ciently train deep models is a topic I plan to explore. 35 Chapter 3 Algorithms for Constrained Non-Convex Optimization Designing ecient algorithms for non-convex optimization has been an active area of research in recent years, see [6, 35, 36, 37, 38, 40, 41, 48, 49, 110, 121]. For a general non-convex problem, even nding a local optimum of the objective function is NP-Hard in the worst-case scenario [117]. Given this hardness result, recent focus has shifted to computing (approximate) rst and second-order stationary points of the objective function. The latter set of points provides stronger guarantees as it constitutes a smaller subset of critical points that includes local and global optima. For problems with nice geometrical properties, the set of second order stationary points could even be the same as the set of global optima; see [13, 14, 27, 125, 147, 148, 149] for examples of such objective functions. Almost all existing results in the eld study the problem in the absence of constraints. In this chapter, we consider the problem of nding an approximate second-order stationary point for constrained non-convex op- timization problems. We rst show that, unlike the gradient descent method for unconstrained optimization, the vanilla projected gradient descent algorithm may converge with positive probability to a strict saddle point even when there is only a single linear constraint. We then provide a hardness result by showing that checking ( g ; H )-second order stationarity is NP-hard even in the presence of linear constraints. Despite our hardness result, we identify instances of the problem for which checking second order stationarity can be done eciently. For such instances, we propose a dynamic second order Frank-Wolfe algorithm which converges to ( g ; H )-second order stationary points inO maxf 2 g ; 3 H g iterations. The proposed algorithm can be used in general constrained non-convex optimization as long as the constrained quadratic sub-problem can be solved eciently. However, the complexity rate attained by this algorithm is not order optimal [34]. In [126], we propose a trust region algorithm that achieves optimal iteration complexity. To our knowledge this is the rst order optimal algorithm that solves linearly constrained non-convex optimization problems while considering the complexity of solving the sub-problem. 36 3.1 First and Second Order Stationarity To understand the denitions of rst and second order stationarity, let us rst start by considering the unconstrained optimization problem min x2R n f(x); (3.1) wheref :R n 7!R is a twice continuously dierentiable function. 
We say a point x is a rst order stationary point (FOSP) of (3.1) ifrf( x) = 0. Similarly, a point x is said to be a second-order stationary point (SOSP) of (3.1) ifrf( x) = 0 andr 2 f( x) 0. In practice, most of the algorithms used for nding stationary points are iterative. Therefore, we dene the concept of approximate rst and second order stationarity. We say a point x is an g -rst-order stationary point if krf( x)k 2 g : (3.2) Moreover, we say a point x is an ( g ; H )-second-order stationary point if krf( x)k 2 g andr 2 f( x) H I: (3.3) We now extend these denitions to the constrained optimization problem min x2P f(x); (3.4) wherePR n is a closed convex set. As dened in [21], we say x2P is a FOSP of (3.4) if hrf( x); x xi 0 8 x2P: (3.5) Similarly, we say a point x is a SOSP of the optimization problem (3.4) if x2P is a rst order stationary point and 0 d T r 2 f( x)d; 8 d s:t:hd;rf( x)i = 0 and x + d2P: (3.6) Notice that whenP =R n , the denitions above obviously correspond to the denitions in the unconstrained case. In both cases, we say x is a strict saddle point if it is a FOSP that is not a SOSP. Moreover, we say 3.4 satises the strict saddle property if every SOSP is also a local minimizer. Motivated by (3.5) and (3.6), given a feasible point x, we dene the following rst and second order station- 37 arity measures X (x), min s hrf(x); si s:t: x + s2P;ksk 2 1: (3.7) and (x), min d d T r 2 f(x)d s:t: x + d2P;kdk 2 1 hrf(x); di 0: (3.8) Notice that since x is feasible,X (x) 0 and (x) 0. Moreover, these optimality measures can be linked to the standard denitions in [21] by the following Lemma. Lemma 3.1.1. If x2P then X ( x) = 0 if and only if x is a rst order stationary point. X ( x) = ( x) = 0 if and only if x is a second order stationary point. Using this lemma, we dene the approximate rst and second order stationarity. Denition 3.1.2. Approximate Stationary Point: For problem (3.4), A point x2P is said to be an g -rst order stationary point ifX ( x) g . A point x2P is said to be an ( g ; H )-second order stationary point ifX ( x) g and ( x) H . In the unconstrained scenario, these denitions correspond to the standard denitions (3.3) and (3.2). Remark 3.1.3. Notice that our denition of ( g ; H )-second order stationarity is dierent than the denition in [115]. In particular, there are two major dierences: 1) The denition used for approximate rst and second order stationarity in [115] does not include the normalization constraintsksk 1 andkdk 1 in (3.7) and (3.8). 2) The second order optimality measure in [115] is dened using equality constrainthrf(x); di = 0 in (3.8) instead of the inequality constrainthrf(x); di 0. To understand the necessity of using normalization, consider the optimization problem minx 2 and the point x = with being (arbitrary) small. Clearly, x is close to optimal, while the optimality measure (3.7) does not re ect this approximate optimality if we do not include the normalization constraint in (3.7). 38 To understand the importance of using inequality constrainthrf(x); di 0 instead of equality constraint in (3.8), consider the scalar optimization problem min x 1 2 x 2 s:t: 0x 10: Let us look at the point x => 0. Using second order information, one can say that x is not a reasonable point to terminate your algorithm at. This is because the Hessian provides a descent direction with large amount of improvement in the second order approximation of the objective value. This fact is also re ected in the value of ( x) = 1. 
However, if we had used equality constrainthrf(x); di = 0 in the denition of () in (3.8), then the value of () would have been zero. Remark 3.1.4. There are other denitions of second order stationarity in the literature. For example, the works [23, 24] use a scaled version of the Hessian in dierent directions to dene second order stationarity for box constraints. Recently, [128] carefully revised it to account for the coordinates which are very far from the boundary. Another related denition of second order stationary, which leads to a practical perturbed gradient descent algorithm, is provided in [107] for general linearly constrained optimization problems. In the unconstrained scenario, it is known that gradient descent with random initialization converges to second order stationary points with probability one [102]. Moreover, there exist various ecient algorithms for nding an ( g ; H )-second order stationary point of the objective function [121, 49, 47, 48, 36, 37]. In what follows, we study whether these results can be directly extended to the constraint scenario by answering the following questions: Q1: Does projected gradient descent with random initialization converge to second order sta- tionary points with probability one? Q2: Does there exist a polynomial-time algorithm for nding an ( g ; H )-second order stationary point of the general constrained optimization problem (3.4)? 39 3.2 Projected Gradient Descent with Random Initialization May Converge to Strict Saddle Points with Positive Probability It is known that gradient descent with xed step size can converge to an -rst order stationary point in O( 2 ) iterations for unconstrained smooth optimization problems [119]. Moreover, with random initial- ization, this method escapes strict saddle points for general smooth unconstrained optimization problems with probability one [102]. In the general constrained optimization problem (3.4), projected gradient de- scent algorithm is a natural replacement for gradient descent. The iterates of the projected gradient descent algorithm are obtained by x k+1 P F (x k k rf(x k )); where k is the step-size, k is the iteration number, andP F is the projection operator onto the feasible set F. A natural question about projected gradient descent is whether it has the same behavior as the gradient descent algorithm. More specically, can projected gradient descent escape strict saddle points? To answer this question, we provide an example where projected gradient descent fails to converge to second order stationary points even in the presence of a single linear constraint. Consider the following optimization problem: min x;y2R f(x;y),xye x 2 y 2 + 1 2 y 2 s.t. x +y 0: (3.9) The landscape of the functionf and its corresponding negative gradient mapping are plotted in Figures 3.1a and 3.1b. Notice that the function f() has the following rst and second order derivatives: rf(x;y) = 0 @ (1 2x 2 )ye x 2 y 2 (1 2y 2 )xe x 2 y 2 +y 1 A r 2 f(x;y) = 0 @ 2xy(3 2x 2 )e x 2 y 2 (1 2x 2 )(1 2y 2 )e x 2 y 2 (1 2x 2 )(1 2y 2 )e x 2 y 2 2xy(3 2y 2 )e x 2 y 2 + 1 1 A : 40 (a) Landscape of f() (b) Negative Gradient Flow for f() Figure 3.1: The landscape and negative gradient mapping of f. The red box in 3.1b shows a non-zero measure set that converges to the origin when projected gradient descent is used. First of all, it is not hard to check thatrf(0; 0) = 0,r 2 f(0; 0) = 0 @ 0 1 1 1 1 A , and for the feasible direction v = (1;1), we have v T r 2 f(0; 0)v =1. 
Hence, the point (0; 0) is a saddle point that is not second order stationary. Therefore, the origin is a strict saddle point. However, as one can see in Figure 3.1b, projected gradient descent algorithm may converge to the origin if initialized around the lower right corner of the gure. This observation is true for various step-size selection rules. To formalize this observation, in what follows, we show that projected gradient descent converges to the strict saddle point (0; 0) if initialized inside the red box in Figure 3.1b. First, we show that if the sequence generated by projected gradient descent method intersects a subset of the boundary of the constraint in (3.9), then the algorithm will eventually converge to the origin. Lemma 3.2.1. If for anyk2N + , the iterate (x k ;y k ) of the sequence generated by projected gradient descent method with constant step-size k = with 0< < 2=3 applied to (3.9) satises x k 0; y k =x k ; (3.10) thenf(x k ;y k )g converges to the origin. Proof. Proof of this lemma is relegated to Appendix B.1. It remains to show that there exists a non-zero measure region so that if we initialize the projected gradient descent algorithm in this region, the iterates converge to a point on the boundary satisfying the conditions in Lemma 3.2.1. 41 Theorem 3.2.2. For any given constant step-size k = with 0< < 2 3 , there exists > 0 so that if we initialize in the set I B ,f(x;y)j 0:5x 0:5; 0:5y0:5g; then the projected gradient descent method with xed step-size converges to the origin when applied to (3.9). Proof. Proof of this Theorem is relegated to Appendix B.2. This result shows that there is a positive probability that projected gradient descent with random ini- tialization converges to a strict saddle point of the objective. Based on our example, we conjecture that even perturbed/stochastic projected gradient descent algorithm cannot help in escaping strict saddle points. Therefore, a natural question to ask is whether there exists a polynomial-time algorithm for nding an (approximate) second order stationary point. This question is the focus of the next section. 3.3 Finding or Checking (Approximate) Second Order Stationar- ity is NP-Hard Even in the Presence of Linear Constraints The classical result of [117] shows that checking whether a point d is a second-order stationary point of a non-convex quadratic optimization problem is in general NP-hard. This results was shown by considering the quadratic co-positivity problem min d2R n 1 2 d T Qd s:t: d 0: (3.11) In particular, [117, Lemma 2] shows that, by adding a ball constraintkdk 2 1, the optimal objective value of (3.11) is either 0 or2 n . Thus, checking whether d = 0 is an ( g ; H )-second order stationary point requires choosing H 2 n . In that case [117, Theorem 2] shows that the problem is NP-Hard. Combining these two results, we conclude that checking whether d = 0 is an ( g ; H )-second order stationary point is NP-hard in (n; log (1= H )). In this section, we show that even a less ambitious goal is NP-hard. More precisely, we show that checking ( g ; H )-second order stationarity is NP-hard in (n; 1= H ). Theorem 3.3.1. For the co-positivity problem min d2R n 1 2 d T Qd s:t: d 0; (3.12) there is no algorithm which can check whether d = 0 is an ( g ; H )-second order stationary point in polynomial time in (n; 1 H ), unless P = NP. 42 Proof. The proof of the Theorem is relegated to Appendix B.3. Remark 3.3.2. 
The result in [117] shows that checking for ( g ; H )-second order stationarity is NP-hard in (n; log(1= H )). In other words, there is no algorithm which can check whether d is an ( g ; H )-second order stationarity point in polynomial time in (n; log(1=" H )), unless P=NP. The result of Theorem 3.3.1 is stronger as it shows that checking for ( g ; H )-second order stationarity is NP-hard in (n; 1= H ). This negative result shows that we should not expect to have a polynomial-time iterative descent algorithm which can converge to second order stationary points of general convex constrained optimization problems. If such an algorithm exists, one can run that algorithm from the initial point d 0 = 0 and see if it can nd a point with negative objective value. This observation shows that in order to have a reasonable descent algorithm (with polynomial per-iteration complexity), we must put the general convex constrained case behind; and develop algorithms for special type of constraints. This transition is the focus of the next section. 3.4 Easy Instances of Finding Second Order Stationarity in Con- strained Optimization: A Second Order Frank-Wolfe Algo- rithm As discussed in previous sections, although designing polynomial time algorithms for nding second order stationary points is easy when the optimization problem is unconstrained, the same problem becomes very hard in the general convex constrained case. In particular, even for checking second order stationarity, one needs to (approximately) solve a quadratic constrained optimization problem (3.8), which is NP-hard as shown in Section 3.3. However, for some special constraint setsF, the quadratic constrained optimization problem (3.8) can be solved eciently. For example, whenF is formed by a xed number of linear constraints, [87] presents a backtracking approach which can nd the solution of (3.8) in polynomial-time. More precisely, by doing an exhaustive backtracking search on the set of constraints, one can nd the solution of the problem min d d T r 2 f(x)d s:t: x + d2F;kdk 2 1 hrf(x); di 0: (3.13) in polynomial-time whenF =fx j a T i x b i ; for i = 1;:::;mg assuming that m is small and one can aord a search which is exponentially large in m. Assuming that (3.13) can be solved eciently for a givenF, a natural question to ask is as follows: 43 Assume that the constraint setF is such that the quadratic optimization problem (3.13) can be solved eciently. For such a constraint setF, can we nd an ( g ; H )-second order stationary point of the general smooth optimization problem (3.4) eciently? In this section, we answer this question armatively by proposing a polynomial time algorithm for nding an ( g ; H ) second order stationary point of problem (3.4) assuming that a quadratic optimization problem of the form (3.13) can be solved eciently at each iteration. The proposed algorithm can be viewed as a simple second order generalization of the Frank-Wolfe algorithm proposed in [100]. In particular, in addition to the rst order Frank-Wolfe direction computed by solving (3.7) at x k , we also compute a second-order descent direction by solving (3.8) at each iteration. Then we dynamically choose the direction that potentially oers a more signicant predicted reduction in the objective value. This dynamic method was used in [47] to design an algorithm for unconstrained optimization problems. They show convergence to ( g ; H )-second- order stationary points with complexityO(maxf 2 g ; 3 H )g). 
Our proposed algorithm adapts this method to the constrained scenario while maintaining the same convergence guarantees and complexity bounds. Notations. Given a sequence of iterates fx k g computed by an algorithm for solving (3.4), we dene X k ,X (x k ) and k , (x k ), whereX () and () are dened in (3.7) and (3.8). Throughout this section, we make the following assumption. Assumption 3.4.1. The objective function f is twice continuously dierentiable and bounded below by a scalar f min onF. The constraint setF is closed and convex. We assume that functionsrf() andr 2 f() are Lipschitz continuous on the path dened by the iterates computed in algorithm 1, with Lipschitz constants L and, respectively. Furthermore, the gradient sequencefrf(x k )g is bounded such that there exists a scalar constant g max 2 R ++ such thatkrf(x k )k 2 g max for all k2 N. Moreover, we assume that the Hessian sequencefr 2 f(x k )g is bounded in norm, that is, there exists a scalar constant H max 2 R ++ such that kr 2 f(x k )k 2 H max for all k2N. 3.4.1 Algorithm Description Let x k be the iterate in our algorithm at iteration k. Given point x k , we dene the following rst order and second order descent directions b s k , arg min s hrf(x k ); si s:t: x k + s2F;ksk 2 1: (3.14) 44 and b d k , arg min d d T r 2 f(x k )d s:t: x k + d2F;kdk 2 1 hrf(x k ); di 0: (3.15) Notice that in the unconstrained scenario,b s k =rf(x k ) and b d k is the eigenvector corresponding to the leftmost eigenvalue of the Hessian matrixr 2 f(x k ), which lead to the simple directions proposed in [47] for the unconstrained scenario. The algorithm described below follows a dynamic strategy of choosing betweenb s k and b d k for allk2N. The choice is done based on which direction predicts a larger reduction in the objective. Ifb s k is always chosen, then the algorithm resembles Frank-Wolfe algorithm [100]. Hence, our algorithm can be seen as a second order extension of Frank-Wolfe algorithm. Algorithm 1 Second Order Frank-Wolfe with Fixed Step-size Require: The constants ~ L, maxfL;g max g, ~ , maxf;H max g. 1: procedure 2: Choose x 0 2F. 3: ComputeX 0 and 0 by solving (3.7) and (3.8), respectively. 4: for k = 0; 1; 2;::: do Computeb s k ;X k and b d k ; k using (3.14) and (3.15), respectively. 5: if k =X k = 0 then 6: terminate and return x k . 7: end if 8: if X 2 k 2 ~ L 2 3 k 3~ 2 then 9: set x k+1 x k + X k ~ L b s k 10: else 11: set x k+1 x k + 2 k ~ b d k 12: end if 13: end for 14: end procedure 3.4.2 Convergence Results We rst note that regardless of the direction we choose, the step size is either X k ~ L or 2 k ~ , which by Cauchy Schwartz and the denitions of ~ L and ~ are both less than or equal to 1. Thus, the iterates generated by the algorithm are always feasible. Also notice that, unlike the algorithms proposed in [100, 115], our algorithm does not require any boundedness assumption on the feasible setF. Another advantage of the proposed 45 algorithm is that it does not require the knowledge of the desired accuracy level ( g ; H ). This allows us to modify our termination rule when running the algorithm if needed. Next, we show that Algorithm 1 asymptotically converges to a second order stationary point. Theorem 3.4.2. Under Assumption 3.4.1, lim k!1 X k = lim k!1 k = 0: In other words, any limit point of the iterates is a second order stationary point. Proof. The proof of the Theorem is relegated to Appendix B.4. 
The next result computes the worst-case complexity required to reach an g -rst order stationary point and to reach an ( g ; H )-second order stationary point. Theorem 3.4.3. Let g , H > 0. The number of iterations required for Algorithm 1 to nd an g -rst order stationary point is at most 2 ~ L(f(x 0 )f min ) 2 g : Moreover, the number of iterations required to nd an ( g ; H )-second order stationary point is at most f(x 0 )f min min ( 2 g 2 ~ L ; 2 3 H 3~ 2 ): Proof. The proof of the Theorem is relegated to Appendix B.5. The complexity order of the proposed algorithm is the same as the algorithm proposed in [47, 115]. In particular, the complexity of nding an g -rst-order stationary point isO( 2 g ), which is not optimal under our assumptions. Second order information in conjunction with smoothness of the Hessian has been used to design algorithms with better complexity orders for reaching rst order stationarity [35, 40, 41]. For instance, the ARC algorithm as proposed in [35] guarantee convergence to g -rst-order stationarity with complexity rate e O 3=2 g . In the next section, we study the existence of an optimal rate algorithm for computing approximate second-order stationary points for constrained optimization problems by answering the following question: Can we nd an algorithm that guarantees convergence to ( g ; H )-second-order stationarity with optimal complexity rateO maxf 3=2 g ; 3 H g , while still eciently solving the sub-problem? 46 3.5 A Trust Region Algorithm for Solving Linearly-Constrained Non-convex Optimization Problems In this section, we propose a trust region algorithm, entitled LC-TRACE (Linearly Constrained TRACE), that adapts TRACE [49] to linearly-constrained non-convex problems. Moreover, we establish its conver- gence to g -rst order stationarity in at mostO( 3=2 g ), up to a logarithmic factor. This method is then implemented in the framework proposed in [115] to guarantee convergence to ( g ; H )-second-order station- arity with complexity rate e O maxf 3=2 g ; 3 H g . In addition, we identify instances for which the quadratic sub-problems can be solved eciently. For such instances, the proposed algorithm can be used to compute an approximate second-order stationary solution for linearly-constrained non-convex problems. Consider the optimization problem min x2R n f(x); s:t: Ax b; (3.16) where A2 R mn and b2 R m . In this section, we propose a trust region algorithm, entitled LC-TRACE (Linearly Constrained TRACE), that adapts TRACE [49] to the above linearly-constrained non-convex prob- lem. We establish its convergence to g -rst order stationarity with iteration complexity order e O( 3=2 g ). This method is then used to develop an algorithm to converge to ( g ; H )-second-order stationarity with the iteration complexity e O maxf 3=2 g ; 3 H g . LC-TRACE is dierent from the traditional trust region method proposed in [41] for constrained opti- mization. More specically, LC-TRACE utilizes the mechanisms used in TRACE [49] to provide a faster convergence rate compared to [41]. The improved convergence rate matches the rates achieved by adapted ARC [35] and TRACE [49], up to logarithmic factors. Since applying TRACE directly to constrained op- timization fails (as will be discussed later), we introduced modications to adapt this method to linearly constrained problems. Our modications are not the result of a \simple extension" of unconstrained to constrained scenario. 
Before explaining LC-TRACE, let us rst provide an overview of the classical trust region and TRACE algorithms. 3.5.1 Background on Traditional Trust Region Algorithm and TRACE In traditional trust region methods, the trial step s k at iterationk is computed by solving the standard trust region sub-problem min s2R n q k (s); s.t.ksk 2 k ; (3.17) 47 where q k (s) :R n 7!R is the second-order Taylor approximation of f around x k , i.e., q k (s),f k + g T k s + 1 2 s T H k s: Heref k =f(x k ), g k =rf(x k ), and H k =r 2 f(x k ). Based on the resulting trial step, an acceptance criteria is used to either accept or reject the step. In particular, if the ratio of actual-to-predicted reduction f k f(x k + s k ) f k q k (s k ) is greater than a prescribed constant, the step is accepted, otherwise it is rejected. The iterate x k+1 and trust region radius are updated accordingly. Traditional trust region methods use a geometric update rule for the trust region radius k , i.e., k+1 is some constant factor of k . TRACE algorithm, on the other hand, modies the acceptance criteria and this linear update rule for k to match the rate achieved by the ARC algorithm [36, 37]. In particular, the authors in [49] observed that ARC computes a positive sequence of cubic regularization coecients k 2 [; ] that satisfy f k f k+1 c 1 k ks k k 3 2 and ks k k 2 c 2 +c 3 1=2 kg k+1 k 1=2 2 ; (3.18) for some given positive constants c 1 ;c 2 ;c 3 . TRACE designed a modied acceptance criteria and a new mechanism for updating the trust region radius to satisfy the conditions provided in (3.18). Some of these ideas are discussed next. Sucient Decrease Acceptance Criteria. TRACE denes the ratio k , f k f(x k + s k ) ks k k 3 2 ; (3.19) as a measure to decide whether to accept or reject a trial step. For some prescribed2 (0; 1), a trial step s k can only be accepted if k . By noticing that a smallks k k 2 may satisfy only the rst condition in (3.18), the developers of TRACE realize that an acceptance criteria that only involves (3.19) is not sucient. To avoid such cases, TRACE denes a sequencef k g to estimate an upper bound for the ratio k =ks k k 2 used for acceptance. Heref k g is the sequence of dual variables corresponding to the constraintksk 2 k in sub-problem (3.17). In short, TRACE accepts a trial pair (s k ; k ) if it satises the following conditions: k and k =ks k k 2 k : (3.20) Trust Region Radius Update Procedure. In contrast to the linear update rule utilized in traditional 48 trust region algorithms, TRACE uses a CONTRACT subroutine that allows for sub-linear updates. In particular, this subroutine compares the radius obtained by the linear update scheme to the norm of the trial step computed using min s2R n f k + g T k s + 1 2 s T (H k +I)s; (3.21) for a carefully chosen. If the norm of this trial step falls within a desired range, then it is chosen to be the new trust region radius. This subroutine is called at iteration k if k <. TRACE is designed to solve unconstrained smooth optimization problems. A direct implementation of this algorithm fails in the constrained setting. In the next section, we describe two fundamental diculties introduced in the presence of constraints and discuss the necessary modications. 3.5.2 Dierence Between LC-TRACE and TRACE In the constrained setting, we dene the trust region sub-problem and its regularized Lagrangian form as Q k , min s2R n q k (s); s.t. 8 > < > : As b Ax k ksk 2 k (3.22) and Q k (), min s2R n f k + g T k s + 1 2 s T (H k +I)s; s.t. 
As b Ax k : (3.23) A major diculty introduced by the constraints is related to the optimality conditions of the sub-problem. In the unconstrained case, it is known that H k + k I 0 at every iteration [46, Corollary 7.2.2] for optimal Lagrange multiplier k . Along with the fact that > k in the CONTRACT subroutine of TRACE, we conclude that Q k () is a strongly convex quadratic optimization problem which has a unique global minimizer. Let s () be the solution of Q k (). It follows that the function s () is continuous in in the unconstrained scenario. However, in the linearly constrained scenario, the regularized sub-problem (3.23) might have multiple optimal solutions. Moreover, s () and the ratio =ks ()k 2 , which are core quantities in TRACE, might not even be continuous. To clarify this diculty, consider the following simple example Q() = min s15;s20 s 2 1 s 2 2 +(s 2 1 +s 2 2 ) s.t. s 2 3s 1 12: (3.24) 49 It is not hard to see that the optimal solution of (3.24) is given by s () = 8 > > > > < > > > > : (5; 3) if < 0 (5; 3); (4; 0) if = 0 (4; 0) Otherwise. : Thus, a small increases in may lead to a huge change in the ratio =ks ()k 2 . Therefore, the luxury of having an arbitrarily choice for the bounds and of the ratio =ksk 2 is not present in the constrained case. In LC-TRACEC, we resolved this issue by dening = C min + maxf max ; 0 g and = 2; (3.25) and altering the update rule of in the CONTRACT sub-routine. Here > 0 is the threshold used for the termination of the algorithm, C min is dened in Lemma C.1.10, max is dened in Lemma C.1.9, and is dened in Lemma C.1.8. Another major diculty in the constrained scenario is related to the standard trust region theory on the relationship between sub-problem solutions and their corresponding dual variables. In the unconstrained case, 1 > 2 , impliesks ( 1 )k 2 <ks ( 1 )k 2 , see [46, Chapter 7]. This relationship was used in [49], to show that CONTRACT subroutine reduces the radius of the trust region. However, it can be seen from example (3.24), that this relation may not hold in the constrained case. To account for this issue, we modied the CONTRACT sub-routine to guarantee a reduction in the trust region radius (See Lemma C.1.2). In summary, the dierences between LC-TRACE and TRACE are mainly in the CONTRACT sub-routine. Next, we describe the steps of the algorithm. 3.5.3 Description of LC-TRACE Our proposed algorithm LC-TRACE has two main building blocks: First-Order-LC-TRACE and Second- Order-LC-TRACE. We rst present First-Order-LC-TRACE which can converge to g -rst order stationarity in ~ O( 3=2 g ). Then, we use this algorithm in Second-Order-LC-TRACE to nd an ( g ; H )-Second Order stationarity in ~ O(maxf 3=2 g ; 3 H g) iterations. The First-Order-LC-TRACE algorithm is outlined in Algorithm 2. At each iteration x k , this iterative algorithm computes the values s k , k , and k by solving the optimization problem (3.22) and using the equation (3.19). Depending on the obtained values, it decides to either accept the trial point s k , or reject it. When rejecting the trial point, it either goes to contraction or expansion procedures. Thus, the main decisions include: Acceptance, Contraction, or Expansion. 
We distinguish the iterations by partitioning the 50 set of iteration numbers into what we refer to as the sets of accepted (A), contraction (C), and expansion (E) steps: A,fk2N : k and either k k ks k k 2 orks k k 2 = k g; C,fk2N : k <g; and E,fk2N :k = 2A[Cg: Hence, step k is accepted if the computed pair (s k ; k ) satises the sucient decrease criteria k , and either the norm of s k is large enough (ks k k 2 = k ) or the ratio k =ks k k 2 is smaller than an upper-bound k . We also partition the set of accepted steps into two disjoint subsets A ,fk2A :ks k k 2 = k g andA ,fk2A :k = 2A g: The sequence k is used in the algorithm as an upper bound on the norm ofks k k 2 . From steps 7, 12, and 16, we notice that this sequence is non-decreasing. We now describe the update mechanism used in a contraction step of First-Order-LC-TRACE which is the main dierence between TRACE and our proposed algorithm. When CONTRACT subroutine is called, two dierent cases may occur in Algorithm 3. The rst case is reached whenever conditions in Step 3 in the CONTRACT subroutine tests true. In that case, we carefully choose choose > k to ensure that the pair (s;) with s being the solution of Q k () satises =ksk 2 ; where and are prescribed positive constants dened in (3.25). The second case is reached whenever the conditions in Step 3 tests false. In that case, we choose 2 ( k ;C k ] with C > 1 is a constant scalar, to ensure that the pair (s;) with s being the solution of Q k () satises the following ksk 2 < max ( ; C H Lip + 2 2 ) ; where 2 (0; 1] is a constant scalars, and H Lip is dened in assumption 3.5.1. In what follows, we rst present our results about the convergence of First-Order-LC-TRACE algorithm and its iteration complexity. 51 Algorithm 2 First-Order-LC-TRACE Require: an acceptance constant 2 (0; 1). Require: update constantsf C ; E ; g with C 2 (0; 1) and ; E > 1. Require: ratio bound constants and dened in (3.25). 1: procedure First-Order-LC-TRACE 2: Choose a feasible point x 0 , a pairf 0 ; 0 g with 0< 0 0 , and 0 with 0 . 3: Compute (s 0 ; 0 ) by solving Q 0 , then compute 0 using the denition in (3.19) . 4: for k = 0; 1; 2;::: do 5: if k and either k =ks k k 2 k orks k k 2 = k then (Acceptance) 6: set x k+1 x k + s k 7: set k+1 maxf k ; E ks k k 2 g 8: set k+1 minf k+1 ; maxf k ; E ks k k 2 gg 9: set k+1 maxf k ; k =ks k k 2 g 10: else if k < then (Contraction) 11: set x k+1 x k 12: set k+1 k 13: set k+1 CONTRACT(x k ; k ; k ; s k ; k ) dened in Algorithm (3) 14: else if k , k =ks k k 2 > k , andks k k 2 < k then (Expansion) 15: set x k+1 x k 16: set k+1 k 17: set k+1 minf k+1 ; k = k g 18: set k+1 k 19: end if 20: Compute (s k+1 ; k+1 ) by solving Q k+1 , then compute k+1 using (3.19) 21: if k < then 22: set k+1 maxf k ; k+1 =ks k+1 k 2 g 23: end if 24: end for 25: end procedure 52 Algorithm 3 CONTRACT Sub-routine Require: update constant C 2 (0; 1). Require: ratio bound constants and dened in (3.25). 1: procedure CONTRACT(x k ; k ; k ; s k ; k ) 2: set k + k and set s as the solution of Q k ( ). 3: ifk sk 2 <ks k k 2 and k <ks k k 2 then 4: set +H max + X k 1=2 and set s as the solution of Q k (). 
5: if =ksk 2 then 6: return k+1 ksk 2 7: else 8: set 9: return k+1 k sk 2 10: end if 11: else 12: ifk sk 2 =ks k k 2 then 13: set and set s as the solution of Q k () 14: else 15: set and set s as the solution of Q k () 16: end if 17: whileksk 2 =ks k k 2 do 18: and set s as the solution of Q k () 19: end while 20: if ksk 2 C ks k k 2 then 21: return k+1 ksk 2 22: else 23: return k+1 C ks k k 2 24: end if 25: end if 26: end procedure 53 3.5.4 Convergence of First-Order-LC-TRACE to First-order Stationarity Throughout this section, we make the following assumptions that are standard for global convergence theory of trust region methods. Assumption 3.5.1. The objective function f is twice continuously dierentiable and bounded below by a scalarf min onP. We assume that the functions g(),rf() and H(),r 2 f() are Lipschitz continuous on the path dened by the iterates computed in the Algorithm, with Lipschitz constants L andH Lip , respectively. Furthermore, we assume the gradient sequencefg k g is bounded in norm, that is, there exists a scalar constant g max > 0 such thatkg k k 2 ,krf(x k )k 2 g max for allk2N. Moreover, we assume that the Hessian sequence fH k g is bounded in norm, that is, there exist a scalar constant H max > 0, such thatkH k k 2 ,kr 2 f(x k )k 2 H max for all k2N. We next state the main results for convergence of Frist-Order-LC-TRACE. Theorem 3.5.2. Under Assumption 3.5.1, any limit point of the iterates generated by First-Order-LC- TRACE algorithm is a rst-order stationary point . Proof. The proof of the Theorem is relegated to Appendix C.1. Unfortunately, Assumption 3.5.1 is not sucient to obtain the desired rate of convergence in the presence of constraints; in particular, Assumption 3.5.1 may not ensure a model decrease of the form f k q k (s k ) =g T k s k 1 2 s T k H k s k k ks k k 2 2 ; (3.26) for some constant 2 (0; 1). To understand this, let us rst review the same result for the unconstrained scenario: it is known that H k + k I 0 at every iteration [46, Corollary 7.2.2]. Thus, by Lemma C.1.1, we get f k q k (s k ) =g T k s k 1 2 s T k H k s k 1 2 k ks k k 2 2 : (3.27) However, in contrast to the unconstrained case, there is no guarantee that the step s k satises (3.26) in the constrained scenario. More specically, in the presence of constraints, the condition is not guaranteed when the step s k provides ascent rst-order direction with negative curvature. To account for this case, we assume the following assumption holds. Assumption 3.5.3. If g T k s k 0 and s T k H k s k 0, there exists a sequence of feasible pointsfx k;i g l k i=0 with 54 0l k l, x k;0 = x k , s k;i = x k;i x k and x k;l k = x k + s k such that for i = 1;:::;l k , q k (s k;i )q k (s k;i1 ); g T k (x k;i x k;i1 ) + s T k;i H k (x k;i x k;i1 ) k s T k;i (x k;i x k;i1 ); g T k (x k;i x k;i1 ) + s T k;i1 H k (x k;i x k;i1 ) k s T k;i1 (x k;i x k;i1 ): This assumption was also used in [35] to show that the number of iterations required to reach an-rst order stationary point when adaptive ARC algorithm is used isO( 3=2 ). As mentioned in [35], this assumptions holds if x k;l k is the rst minimizer of the model q k along the piecewise linear pathP k , l k S i=1 [x k;i1 ; x i ]: Using Assumption 3.5.3, we obtain the desired model decrease (3.26) and we have the following Theorem. Theorem 3.5.4. Under Assumptions 3.5.1 and 3.5.3, for any given scalar 2 (0;1), the total number of sub-problem routines of First-Order-LC-TRACE required to reach an -rst order stationary point of (3.16) isO( 3=2 log 3 (1=)). 
Proof. The proof of the Theorem is relegated to Appendix C.2. In the next section, we use this rst order result to develop an algorithm for nding second order stationary points. 3.6 Second-Order-LC-TRACE Algorithm Leveraging the convergence result of First-Order-LC-TRACE, we propose algorithm 4 for converging to second order stationary points. Algorithm 4 Second-Order-LC-TRACE Require: The constants ~ L, maxfL;g max g, ~ H, maxfH Lip ;H max g, g > 0, H > 0. 1: procedure 2: Choose a feasible point x 0 . 3: ComputeX 0 and 0 by solving (3.7) and (3.8), respectively. 4: for k = 0; 1; 2;::: do 5: ifX k > g then 6: Compute x k+1 by running one iteration of First-Order-LC-TRACE starting with x k . 7: else 8: Compute b d k and k by solving (3.8). 9: set x k+1 x k + 2 k ~ H b d k . 10: end if 11: end for 12: end procedure 55 We now show that this algorithm can nd an ( g ; H )-second-order stationary point of problem (3.16). Theorem 3.6.1. Under Assumptions 3.5.1 and 3.5.3, for any given scalars g > 0 and H > 0, the total number of iterations required to reach an ( g ; H )-second-order stationary point of (3.16) when running Algorithm 4 isO log 3 ( 1 g ) max 3=2 g ; 3 H . Proof. The proof of the Theorem is relegated to Appendix C.3. 3.7 Future Work The algorithms we proposed require computing the Hessian of the objective function at each iteration, which can be computationally expensive for large-scale problems. Many recent methods avoid this computational burden by using the Hessian vector product. I plan to investigate the possibility of using the Hessian vector product to nd second order statiaonarity in linearly-constrained optimization problems. Another intriguing direction is motivated by our hardness result in Theorem 3.3.1. This result restrains us from developing algorithms that can compute an approximate second-order stationary point for linearly constrained non-convex problems in general. I am currently working on dening a concept of partial-second- order stationary set that constitute a subset of critical points that include second-order stationary points. This new concept opens a wide range of research directions. Developing ecient (stochastic) algorithms that can compute partial-second-order stationary points for large-scale constrained/unconstrained optimization problems is an important research direction. 56 Chapter 4 Algorithms for Non-convex Min-Max Problems with Applications in Robust and Fair Learning Recent applications that arise in machine learning have surged signicant interest in solving min-max saddle point games; see [140, 51, 50, 135, 71, 142, 108]. These applications require solving an optimization problem of the form min 2 max 2A f(;): (4.1) Game Perspective: One can view this optimization problem as a zero-sum game between two players where the goal of the rst player is to minimize the objective function f(;) over the set of strategies , while the other player's objective is to maximize the objective function over the set of strategiesA. In practice, gradient-based alternating descent-ascent are used to solve these problems [108, 109, 135, 52, 53]. Classical Optimization perspective: Through the lens of classical optimization, one can view (4.1) as a constrained minimization problem min 2 g() (4.2) with the objective g being a value function g(), max 2A f(;): (4.3) Wheng() is Lipschitz-smooth, rst-order methods (e.g. projected gradient descent) theoretically guarantee convergence to a rst-order stationary point of the objective in (4.2). 
3.7 Future Work

The algorithms we proposed require computing the Hessian of the objective function at each iteration, which can be computationally expensive for large-scale problems. Many recent methods avoid this computational burden by using Hessian-vector products. I plan to investigate the possibility of using Hessian-vector products to find second-order stationarity in linearly-constrained optimization problems.

Another intriguing direction is motivated by our hardness result in Theorem 3.3.1. This result restrains us from developing algorithms that can compute an approximate second-order stationary point for linearly constrained non-convex problems in general. I am currently working on defining a concept of a partial-second-order stationary set that constitutes a subset of critical points including the second-order stationary points. This new concept opens a wide range of research directions. Developing efficient (stochastic) algorithms that can compute partial-second-order stationary points for large-scale constrained/unconstrained optimization problems is an important research direction.

Chapter 4

Algorithms for Non-convex Min-Max Problems with Applications in Robust and Fair Learning

Recent applications that arise in machine learning have spurred significant interest in solving min-max saddle point games; see [140, 51, 50, 135, 71, 142, 108]. These applications require solving an optimization problem of the form

  min_{θ∈Θ} max_{α∈A} f(θ, α).   (4.1)

Game perspective: One can view this optimization problem as a zero-sum game between two players, where the goal of the first player is to minimize the objective function f(θ, α) over the set of strategies Θ, while the other player's objective is to maximize the objective function over the set of strategies A. In practice, gradient-based alternating descent-ascent methods are used to solve these problems [108, 109, 135, 52, 53].

Classical optimization perspective: Through the lens of classical optimization, one can view (4.1) as a constrained minimization problem

  min_{θ∈Θ} g(θ)   (4.2)

with the objective g being a value function

  g(θ) ≜ max_{α∈A} f(θ, α).   (4.3)

When g(·) is Lipschitz smooth, first-order methods (e.g., projected gradient descent) theoretically guarantee convergence to a first-order stationary point of the objective in (4.2). However, in practice, even evaluating the function g requires solving a maximization problem.

A famous classical result in optimization is Danskin's theorem [20], which provides a sufficient condition under which the gradient of the value function max_{α∈A} f(θ, α) can be directly evaluated using the gradient of the objective f(θ, ·) evaluated at the optimal solution α*. This result requires the optimizer to be unique. When f(θ, ·) is non-concave, computing the optimizer may in general be NP-hard [117, 124]. When the objective is concave in α, the inner concave maximization problem may not have a unique solution. Hence, Danskin's theorem does not apply, and the gradient of the value function cannot be directly evaluated using the gradient of the objective. For a simple example, consider the single-variable smooth maximization problem max_{0≤α≤1} θ(2α−1), which is concave in α. The value function is simply the non-smooth function |θ|.

In this chapter, we start by studying the problem in the non-convex concave regime. Using a regularization smoothing technique, we propose an alternative multi-step framework that finds an ε-first-order Nash equilibrium/stationary point with Õ(ε^{−3.5}) gradient evaluations. At each iteration, our algorithm runs multiple steps of Nesterov's accelerated projected gradient ascent [120] to estimate the solution of a regularized version of the maximization problem. This solution is then used to estimate the gradient of the value function of the maximization problem, which directly provides a descent direction. We show that our framework can also be applied to compute an ε-first-order stationary point of the game when the inner maximization problem can be optimized to global optimality. In particular, we show that our proposed multi-step gradient descent ascent algorithm computes an ε-first-order Nash equilibrium when the objective of one of the players satisfies the Polyak-Łojasiewicz (PL) condition. Under the latter condition, we show that the worst-case complexity of our algorithm is Õ(ε^{−2}). In addition to the theoretical guarantees, we apply our framework to the recently growing field of fair statistical learning; see details in Section 4.4.

4.1 Two-player Min-Max Games and First-Order Nash Equilibrium

Consider the zero-sum min-max game

  min_{θ∈Θ} max_{α∈A} f(θ, α),   (4.4)

where Θ and A are both convex sets and f(θ, α) is a continuously differentiable function. We say (θ*, α*) ∈ Θ×A is a Nash equilibrium of the game if

  f(θ*, α) ≤ f(θ*, α*) ≤ f(θ, α*)   ∀θ ∈ Θ, ∀α ∈ A.

For convex-concave games, these Nash equilibrium points always exist [89], and several algorithms have been proposed to find such points [74, 79]. However, in the non-convex non-concave regime, computing these points is in general NP-hard. As a local definition, we say (θ*, α*) ∈ Θ×A is a local Nash equilibrium of the game if there exists δ > 0 such that for any (θ, α) ∈ Θ×A with ‖θ − θ*‖ ≤ δ and ‖α − α*‖ ≤ δ, we have

  f(θ*, α) ≤ f(θ*, α*) ≤ f(θ, α*).

As shown by [89, Proposition 10], local Nash equilibrium points for general non-convex non-concave games may not exist. A necessary condition for local Nash equilibrium is the second-order Nash equilibrium condition, defined in the sequel. We say (θ*, α*) ∈ Θ×A is a second-order Nash equilibrium of the game (4.4) if

  dᵀ ∇²_{θθ} f(θ*, α*) d ≥ 0 for all d such that ⟨∇_θ f(θ*, α*), d⟩ = 0 and θ* + d ∈ Θ, and
  dᵀ ∇²_{αα} f(θ*, α*) d ≤ 0 for all d such that ⟨∇_α f(θ*, α*), d⟩ = 0 and α* + d ∈ A.

By providing a simple example, we show that a second-order Nash equilibrium may not exist for non-convex non-concave games with quadratic objectives.
Consider the game

  min_{−1≤θ≤1} max_{−2≤α≤2} −θ² − α² + 4θα.

Then (0, 0) is the only first-order stationary point and is not a second-order Nash equilibrium. Motivated by the above discussion, we consider the problem of finding a first-order Nash equilibrium (FNE), which we define in the sequel. We say (θ*, α*) ∈ Θ×A is a first-order Nash equilibrium (FNE) of the game (4.4) if

  ⟨∇_θ f(θ*, α*), θ − θ*⟩ ≥ 0 ∀θ ∈ Θ and ⟨∇_α f(θ*, α*), α − α*⟩ ≤ 0 ∀α ∈ A,

or equivalently if

  0 = min_s ⟨∇_θ f(θ*, α*), s⟩ s.t. θ* + s ∈ Θ, ‖s‖ ≤ 1,
  and 0 = max_s ⟨∇_α f(θ*, α*), s⟩ s.t. α* + s ∈ A, ‖s‖ ≤ 1.   (4.5)

Notice that these conditions are the first-order necessary optimality conditions for the objective of each player [130, 21]. Motivated by (4.5), we can formally define the FNE for the constrained game (4.4) as follows.

Definition 4.1.1 (FNE). A point (θ*, α*) is said to be a first-order Nash equilibrium (FNE) of the game (4.4) if

  X(θ*, α*) = 0 and Y(θ*, α*) = 0,

where

  X(θ, α) ≜ −min_s ⟨∇_θ f(θ, α), s⟩ s.t. θ + s ∈ Θ, ‖s‖ ≤ 1,   (4.6)

and

  Y(θ, α) ≜ max_s ⟨∇_α f(θ, α), s⟩ s.t. α + s ∈ A, ‖s‖ ≤ 1,   (4.7)

are first-order measures of the stationarity of the min and max sub-problems, respectively. (It is worth noting that the optimality measures (4.6) and (4.7) have been previously used in the literature for evaluating solutions to non-convex constrained optimization problems [46, 40, 35, 124].)

In the absence of constraints, the above definition simplifies to the conditions ∇_θ f(θ*, α*) = 0 and ∇_α f(θ*, α*) = 0, which are the well-known unconstrained first-order optimality conditions. Moreover, the FNE condition is a necessary condition for a (local) Nash equilibrium.

Remark 4.1.2. According to Definition 4.1.1, a point is a FNE if the min/max players cannot use gradient information to decrease/increase the objective function. In the presence of constraints, Definition 4.1.1 differentiates between a min-max game FNE and first-order stationary points that minimize or maximize with respect to both variables θ and α. More specifically, computing a FNE according to Definition 4.1.1 guarantees that the computed point does not simply solve the problems

  min_{θ∈Θ, α∈A} f(θ, α) and max_{θ∈Θ, α∈A} f(θ, α).

In the absence of constraints, even in the context of simple optimization problems, the first-order stationarity conditions for unconstrained minimization and maximization are the same.

Since X(θ, α) and Y(θ, α) are non-negative and continuous [124], we can define an ε-approximate FNE of the min-max game (4.4) as follows.

Definition 4.1.3 (Approximate FNE). A point (θ*, α*) is said to be an ε-first-order Nash equilibrium (ε-FNE) of the game (4.4) if

  X(θ*, α*) ≤ ε and Y(θ*, α*) ≤ ε.

Notice that in the absence of constraints, the ε-FNE conditions in Definition 4.1.3 reduce to

  ‖∇_θ f(θ*, α*)‖ ≤ ε and ‖∇_α f(θ*, α*)‖ ≤ ε.

Our goal in this chapter is to find an ε-FNE of the game (4.4) using iterative methods. To proceed, we make the following standard assumptions about the smoothness of the objective function f.

Assumption 4.1.4. The function f is continuously differentiable in both θ and α, and there exist constants L₁₁, L₂₂ and L₁₂ such that for every α, α₁, α₂ ∈ A and θ, θ₁, θ₂ ∈ Θ, we have

  ‖∇_θ f(θ₁, α) − ∇_θ f(θ₂, α)‖ ≤ L₁₁ ‖θ₁ − θ₂‖,
  ‖∇_α f(θ, α₁) − ∇_α f(θ, α₂)‖ ≤ L₂₂ ‖α₁ − α₂‖,
  ‖∇_α f(θ₁, α) − ∇_α f(θ₂, α)‖ ≤ L₁₂ ‖θ₁ − θ₂‖,
  ‖∇_θ f(θ, α₁) − ∇_θ f(θ, α₂)‖ ≤ L₁₂ ‖α₁ − α₂‖.

In the next section, we propose an algorithm for computing an ε-FNE for min-max games with one player satisfying the Polyak-Łojasiewicz (PL) condition.
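In the unconstrained case, checking Definition 4.1.3 amounts to evaluating two gradient norms. A minimal sketch (assuming numpy; the toy game is the quadratic example above with the box constraints dropped so that the gradient-norm test applies):

import numpy as np

def is_eps_fne(grad_theta, grad_alpha, theta, alpha, eps):
    # Unconstrained eps-FNE test: ||grad_theta f|| <= eps and ||grad_alpha f|| <= eps
    return (np.linalg.norm(grad_theta(theta, alpha)) <= eps
            and np.linalg.norm(grad_alpha(theta, alpha)) <= eps)

# f(theta, alpha) = -theta^2 - alpha^2 + 4*theta*alpha
g_theta = lambda t, a: -2.0 * t + 4.0 * a
g_alpha = lambda t, a: -2.0 * a + 4.0 * t
print(is_eps_fne(g_theta, g_alpha, theta=0.0, alpha=0.0, eps=1e-6))  # True: (0, 0) is a FNE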
4.2 Non-Convex PL-Games

In this section, we consider the problem of developing an "efficient" algorithm for finding an ε-FNE of (4.4). First notice that any non-convex minimization problem min_θ g(θ) can be trivially written as a game min_θ max_α f(θ, α), where f(θ, α) = g(θ) for all α. Thus, finding an ε-FNE of the game is equivalent to finding an ε-stationary point of the non-convex function g(θ). It is well known that for general non-convex smooth problems, finding an ε-stationary solution requires at least Ω(ε^{−2}) gradient evaluations, and this lower bound can be achieved by the simple gradient descent method [34, 119]. Clearly, this lower bound is also valid for finding an ε-FNE of games using first-order information. In this section, we show that this lower bound can also be achieved (up to logarithmic factors) for a class of non-convex non-concave min-max game problems in which the objective of one of the players satisfies the PL condition. To proceed, let us first formally define the Polyak-Łojasiewicz (PL) condition.

Definition 4.2.1 (Polyak-Łojasiewicz Condition). A differentiable function h(x) with minimum value h* = min_x h(x) is said to be μ-Polyak-Łojasiewicz (μ-PL) if

  (1/2) ‖∇h(x)‖² ≥ μ (h(x) − h*), ∀x.   (4.8)

The PL condition has been established and utilized for analyzing many practical modern problems [95, 65, 59, 150, 31]. Moreover, it is well known that a function can be non-convex and still satisfy the PL condition [95]. Based on the definition above, we define a class of min-max PL-games.

Definition 4.2.2 (PL-Game). We say that the min-max game (4.4) is a PL-game if the max player is unconstrained, i.e., A = Rⁿ, and there exists a constant μ > 0 such that the function h_θ(α) ≜ −f(θ, α) is μ-PL for any fixed value of θ ∈ Θ.

A simple example of a practical PL-game is detailed next.

Example 4.2.1 (Generative adversarial imitation learning (GAIL) of linear quadratic regulators). Imitation learning is a paradigm that aims to learn from an expert's demonstration of performing a task [31]. It is known that this learning process can be formulated as a min-max game [84]. In such a game, the minimization is performed over all policies, and the goal is to minimize the discrepancy between the accumulated reward of the expert's policy and that of the proposed policy. On the other hand, the maximization is done over the parameters of the reward function and aims at maximizing this discrepancy. This approach is also referred to as generative adversarial imitation learning (GAIL) [84]. Generative adversarial imitation learning for linear quadratic regulators [31] refers to solving this problem in the specific case where the underlying dynamics and the reward function come from a linear quadratic regulator [65]. To be more specific, this problem can be formulated [31] as min_K max_{ψ∈Ψ} m(K, ψ), where K represents the choice of the policy and ψ represents the parameters of the dynamics and the reward function. Under the discussed setting, m is strongly concave in ψ and PL in K (see [31] for more details). Note that since m is strongly concave in ψ and PL in K, any FNE of the game would also be a Nash equilibrium point. Also note that the notion of FNE does not depend on the ordering of the min and max. Thus, to be consistent with our notion of PL-games, we can formulate the problem as

  min_{ψ∈Ψ} max_K −m(K, ψ).   (4.9)

Thus, generative adversarial imitation learning of linear quadratic regulators is an example of finding a FNE for a min-max PL-game.

In what follows, we present a simple iterative method for computing an ε-FNE of PL-games.
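Before describing the method, a concrete instance of Definition 4.2.1 may be helpful: h(x) = x² + 3 sin²(x) is non-convex yet satisfies (4.8), and [95] reports the PL constant μ = 1/32 for this function. The sketch below (assuming numpy) checks the PL inequality on a grid:

import numpy as np

h = lambda x: x**2 + 3.0 * np.sin(x)**2          # non-convex: h'' changes sign
dh = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)   # h'(x), using 2 sin(x) cos(x) = sin(2x)
h_star, mu = 0.0, 1.0 / 32.0                     # global minimum value and PL constant from [95]

x = np.linspace(-10.0, 10.0, 100001)
print(bool(np.all(0.5 * dh(x)**2 >= mu * (h(x) - h_star))))  # True: (4.8) holds on the grid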
4.2.1 Multi-step gradient descent ascent for PL-games

In this section, we propose a multi-step gradient descent ascent algorithm that finds an ε-FNE point for PL-games. At each iteration, our method runs multiple projected gradient ascent steps to estimate the solution of the inner maximization problem. This solution is then used to estimate the gradient of the inner maximization value function, which directly provides a descent direction. In a nutshell, our proposed algorithm is a gradient-descent-like algorithm on the inner maximization value function. To present the ideas of our multi-step algorithm, let us re-write (4.4) as

  min_{θ∈Θ} g(θ),   (4.10)

where

  g(θ) ≜ max_{α∈A} f(θ, α).   (4.11)

A famous result for estimating the gradient of optimization value functions is Danskin's theorem [20], which provides a sufficient condition under which the gradient of the value function (4.10) can be directly evaluated using the gradient of the corresponding objective function (4.11) evaluated at the optimal solution. However, this result requires the optimizer to be unique. Under our PL assumption on f(θ, ·), the inner maximization problem (4.11) may have multiple optimal solutions. Hence, Danskin's theorem does not hold. In what follows, we show that under the PL assumption,

  ∇g(θ) = ∇_θ f(θ, α*(θ)), with α*(θ) ∈ arg max_{α∈A} f(θ, α),

despite the non-uniqueness of the optimal solution.

Lemma 4.2.3. Under Assumption 4.1.4 and the PL-game assumption,

  ∇g(θ) = ∇_θ f(θ, α*(θ)), where α*(θ) ∈ arg max_{α∈A} f(θ, α).

Moreover, g is L-Lipschitz smooth with L = L₁₁ + L₁₂²/(2μ).

Proof. The proof of this lemma is relegated to Appendix D.1.1.

Motivated by this result, we propose a multi-step gradient descent ascent algorithm that solves the inner maximization problem to "approximate" the gradient of the value function g. This gradient direction is then used to descend on θ. More specifically, the inner loop (Step 4) in Algorithm 5 solves the maximization problem (4.11) for a given fixed value θ = θ_t. The computed solution of this optimization problem provides an approximation of the gradient of the function g(θ); see Lemma D.1.5 in Appendix D.1. This gradient is then used in Step 7 to descend on θ.

Algorithm 5 Multi-step Gradient Descent Ascent
1: INPUT: K, T, η₁ = 1/L₂₂, η₂ = 1/L, α₀ ∈ A and θ₀ ∈ Θ
2: for t = 0, ..., T−1 do
3:   Set α₀(θ_t) = α_t
4:   for k = 0, ..., K−1 do
5:     Set α_{k+1}(θ_t) = α_k(θ_t) + η₁ ∇_α f(θ_t, α_k(θ_t))
6:   end for
7:   Set θ_{t+1} = proj_Θ( θ_t − η₂ ∇_θ f(θ_t, α_K(θ_t)) )
8: end for
9: Return (θ_t, α_K(θ_t)) for t = 0, ..., T−1.
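A minimal numpy rendering of Algorithm 5 follows (the gradients, the projection onto Θ, and the toy problem are stand-ins; the step sizes mirror the listing):

import numpy as np

def multistep_gda(grad_theta, grad_alpha, proj_theta, theta0, alpha0, L22, L, K, T):
    # K gradient ascent steps on alpha, then one projected descent step on theta.
    eta1, eta2 = 1.0 / L22, 1.0 / L
    theta, alpha = np.asarray(theta0, float), np.asarray(alpha0, float)
    for _ in range(T):
        for _ in range(K):
            alpha = alpha + eta1 * grad_alpha(theta, alpha)   # inner ascent loop
        g_hat = grad_theta(theta, alpha)                      # approximates grad g(theta_t)
        theta = proj_theta(theta - eta2 * g_hat)              # descent step on theta
    return theta, alpha

# Toy PL-game: f(theta, alpha) = -(alpha - theta)^2 with Theta = [-1, 1], A = R.
gt = lambda t, a: 2.0 * (a - t)
ga = lambda t, a: -2.0 * (a - t)
proj = lambda t: np.clip(t, -1.0, 1.0)
print(multistep_gda(gt, ga, proj, theta0=0.5, alpha0=0.0, L22=2.0, L=4.0, K=10, T=5))

Here −f(θ, ·) = (α − θ)² is PL in α for every fixed θ, so the toy problem is a PL-game in the sense of Definition 4.2.2.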
4.2.2 Convergence analysis of the Multi-step Gradient Descent Ascent algorithm for PL-games

Throughout this section, we make the following assumption.

Assumption 4.2.4. The constraint set Θ is convex and compact. Moreover, there exists a ball of radius R, denoted by B_R, such that Θ ⊆ B_R.

We are now ready to state the main result of this section.

Theorem 4.2.5. Under Assumptions 4.1.4 and 4.2.4, for any given scalar ε ∈ (0,1), if we choose K and T large enough that

  T ≥ N_T(ε) ≜ O(ε^{−2}) and K ≥ N_K(ε) ≜ O(log(ε^{−1})),

then there exists an iteration t ∈ {0, ..., T} such that (θ_t, α_{t+1}) is an ε-FNE of (4.4).

Proof. The details of the proof are relegated to Appendix D.1.2, but we provide a sketch here. The first step of our proof is to show that the inner loop in Algorithm 5 computes an approximate gradient of g(θ); in other words, ∇_θ f(θ_t, α_{t+1}) ≈ ∇g(θ_t). This implies that Algorithm 5 behaves similarly to the simple vanilla gradient descent method applied to problem (4.10). The second step of the proof establishes convergence to a stationary point of the optimization problem (4.10).

Remark 4.2.6. Theorem 4.2.5 shows that under the PL assumption, the pair (θ_t, α_K(θ_t)) computed by Algorithm 5 is an ε-FNE of the game (4.4). Since α_K(θ_t) is an approximate solution of the inner maximization problem, we get that θ_t is concurrently an ε-first-order stationary solution of the optimization problem (4.10).

Corollary 4.2.7. Under Assumptions 4.1.4 and 4.2.4, Algorithm 5 finds an ε-FNE of the game (4.4) with O(ε^{−2}) gradient evaluations of the objective with respect to θ and O(ε^{−2} log(ε^{−1})) gradient evaluations with respect to α. If the two gradient oracles have the same complexity, the overall complexity of the method is O(ε^{−2} log(ε^{−1})). As discussed at the beginning of this section, this rate is optimal up to logarithmic factors.

Remark 4.2.8. In [140, Theorem 4.2], a similar result was shown for the case when f(θ, α) is strongly concave in α. Hence, Theorem 4.2.5 can be viewed as an extension of [140, Theorem 4.2]. Similar to [140, Theorem 4.2], one can easily extend the result of Theorem 4.2.5 to the stochastic setting by replacing the gradients with stochastic gradients.

In the next section we consider the non-convex concave min-max saddle game. It is well known that convexity/concavity does not imply the PL condition, and the PL condition does not imply convexity/concavity [95]. Therefore, the problems we consider in the next section are neither a restriction nor an extension of our results on PL-games.

4.3 Non-Convex Concave Games

In this section, we study "non-convex concave" min-max games, where the objective of the max player is concave while the objective of the min player is non-convex. Under this setting, we propose an algorithm that finds an ε-FNE point in at most Õ(ε^{−3.5}) gradient evaluations. Throughout this section, we make the following assumption.

Assumption 4.3.1. The objective function f(θ, α) is concave in α for any fixed value of θ. Moreover, the set A is convex and compact, and there exists a ball of radius R that contains the feasible set A.

For PL-games we have shown that g(θ) is Lipschitz smooth; see Lemma 4.2.3. Unlike the PL case, one can show that in the non-convex concave setting the function g(θ) is not necessarily differentiable. Using an arbitrarily small regularization smoothing term, we approximate the function g(θ) by the differentiable function

  g_β(θ) ≜ max_{α∈A} f_β(θ, α),   (4.12)

where

  f_β(θ, α) ≜ f(θ, α) − (β/2) ‖α − α₀‖².   (4.13)

Here α₀ ∈ A is some given fixed point and β > 0 is a regularization parameter that we will specify later. Since f(θ, ·) is concave in α, we obtain

  f_β(θ, α₂) ≤ f_β(θ, α₁) + ⟨∇_α f_β(θ, α₁), α₂ − α₁⟩ − (β/2) ‖α₁ − α₂‖²

for any α₁, α₂ ∈ A. Hence, f_β(θ, ·) is β-strongly concave. We propose an algorithm that at each iteration runs multiple steps of Nesterov's accelerated projected gradient ascent to estimate the solution of (4.13). This solution is then used to estimate the gradient of (4.12), which directly provides a descent direction on θ. Our algorithm computes an ε-FNE point for non-convex concave games with Õ(ε^{−3.5}) gradient evaluations. In a nutshell, our proposed algorithm applies the multi-step framework proposed in Section 4.2 to the regularized objective function f_β. Then, for a sufficiently small regularization coefficient, we show that the computed point is an ε-FNE. Despite the simplicity of our proposed method, we achieve a rate superior to all other known methods applied to the same problem setting. Before proceeding to our algorithm and its convergence analysis, we first show that g_β(θ) is L-Lipschitz smooth.

Lemma 4.3.2. Under Assumptions 4.1.4 and 4.3.1, the function g_β is L-Lipschitz smooth with L = L₁₁ + L₁₂²/β.

Proof. The proof is relegated to Appendix D.2.1.
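To see the effect of the regularization in (4.12)-(4.13) concretely, recall the earlier example f(θ, α) = θ(2α−1) on A = [0,1], whose value function is the non-smooth g(θ) = |θ|. A minimal sketch (assuming numpy and the stand-in choice α₀ = 1/2), using the closed-form maximizer of the β-strongly-concave inner problem:

import numpy as np

def g_beta(theta, beta, alpha0=0.5):
    # g_beta(theta) = max_{0 <= a <= 1} theta*(2a - 1) - (beta/2)*(a - alpha0)^2
    a_star = np.clip(alpha0 + 2.0 * theta / beta, 0.0, 1.0)  # clipped maximizer
    return theta * (2.0 * a_star - 1.0) - 0.5 * beta * (a_star - alpha0)**2

theta = np.linspace(-1.0, 1.0, 5)
print(g_beta(theta, beta=0.5))  # a smooth, Huber-like approximation of |theta|
print(np.abs(theta))

For |θ| ≤ β/4 the maximizer is interior and g_β(θ) = 2θ²/β, so the kink of |θ| at zero is smoothed out; for larger |θ|, g_β(θ) = |θ| − β/8.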
Based on this result and the compactness assumption, we can define

  ĝ ≜ max_{θ∈Θ} ‖∇g_β(θ)‖, g̃ ≜ max_{θ∈Θ} ‖∇_θ f_β(θ, α_β(θ))‖, and g_max ≜ max{ĝ, g̃, 1},   (4.14)

where α_β(θ) ≜ arg max_{α∈A} f_β(θ, α). We are now ready to describe our proposed algorithm.

4.3.1 Algorithm Description

Let (θ_t, α_K(θ_t)) be the point computed at iterate t of the outer loop (Step 1) in Algorithm 6. Given (θ_t, α_K(θ_t)), we define the stationarity measure

  X_t ≜ −min_s ⟨∇_θ f_β(θ_t, α_K(θ_t)), s⟩ s.t. θ_t + s ∈ Θ, ‖s‖ ≤ 1,   (4.15)

and the first-order descent direction

  ŝ_t ≜ arg min_s ⟨∇_θ f_β(θ_t, α_K(θ_t)), s⟩ s.t. θ_t + s ∈ Θ, ‖s‖ ≤ 1.   (4.16)

In the unconstrained case, the descent direction is ŝ_t = −∇_θ f_β(θ_t, α_K(θ_t)), which leads to an approximate gradient descent method. Our proposed framework incorporates two first-order algorithms. The first algorithm is run for K iterations and updates α_k(θ_t) (Step 4), and the second algorithm is then run for one iteration to update θ_t (Step 7) using the final iterate α_K(θ_t). This procedure is repeated until we find an ε-FNE point.

Algorithm 6 Multi-Step Frank-Wolfe / Projected Gradient Framework
Require: the constants L̃ ≜ max{L, L₁₂, g_max}, N ≜ √(8L₂₂/β) − 1, K, T, β, σ, θ₀ ∈ Θ, and α₀ ∈ A
1: for t = 0, 1, 2, ..., T do
2:   Set α₀(t) = α_t
3:   for k = 0, ..., K/N − 1 do
4:     Set α_{N(k+1)}(t) = APGA(α_{Nk}(t), θ_t, σ, N)
5:   end for
6:   Set α_{t+1} = α_K(t)
7:   Compute θ_{t+1} using first-order information (Frank-Wolfe or projected gradient descent).
8: end for

Algorithm 7 APGA: Accelerated Projected Gradient Ascent
Require: the constants α₀, θ_t, σ, and N.
1: Set ᾱ₁ = α₀, τ₁ = 1
2: for k = 1, 2, ..., ⌈N⌉ do
3:   Set α_k = proj_A( ᾱ_k + σ ∇_α f_β(θ_t, ᾱ_k) )
4:   Set τ_{k+1} = (1 + √(1 + 4τ_k²)) / 2
5:   Set ᾱ_{k+1} = α_k + ((τ_k − 1)/τ_{k+1}) (α_k − α_{k−1})
6: end for
7: Return α_⌈N⌉

In Step 7 of Algorithm 6, we can update θ either by projected gradient descent with the update rule

  θ_{t+1} ≜ proj_Θ( θ_t − (1/(L₁₁ + L₁₂²/β)) ∇_θ f_β(θ_t, α_K(t)) ),

or by the Frank-Wolfe method with the update rule

  θ_{t+1} = θ_t + (X_t/L̃) ŝ_t.

We show convergence of the algorithm to an ε-FNE in Theorem 4.3.3.

Theorem 4.3.3. Given a scalar ε ∈ (0,1), assume that Step 7 in Algorithm 6 sets

  θ_{t+1} = θ_t + (X_t/L̃) ŝ_t or θ_{t+1} = proj_Θ( θ_t − (1/L) ∇_θ f_β(θ_t, α_K(t)) ).

Under Assumptions 4.3.1 and 4.1.4, if we apply Algorithm 6 with

  σ = 1/L₂₂, β ≜ ε/(4R), T ≥ N_T(ε) ≜ O(ε^{−3}), and K ≥ N_K(ε) ≜ O(ε^{−1/2} log(ε^{−1})),

then there exists t ∈ {0, ..., T} such that (θ_t, α_{t+1}) is an ε-first-order stationary point of problem (4.4).

Proof. The proof is relegated to Appendix D.2.2, but we provide a sketch here. The first step of our proof is to show that the inner loop in Algorithm 6 computes an approximate gradient of g_β(θ); in other words, ∇_θ f_β(θ_t, α_{t+1}) ≈ ∇g_β(θ_t). This implies that Algorithm 6 behaves similarly to the simple vanilla gradient descent method applied to problem (4.12). The second step of the proof establishes convergence to a stationary point of the optimization problem (4.10) when β is sufficiently small.

This theorem directly implies the following corollary.

Corollary 4.3.4. Under Assumptions 4.1.4 and 4.3.1, Algorithm 6 finds an ε-first-order stationary solution of the game (4.4) with O(ε^{−3}) gradient evaluations of the objective with respect to θ and O(ε^{−3.5} log(ε^{−1})) gradient evaluations with respect to α. If the two oracles have the same complexity, the overall complexity of the method is O(ε^{−3.5} log(ε^{−1})).
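A compact numpy sketch of the framework follows (the accelerated inner ascent of Algorithm 7 plus the projected-gradient variant of Step 7; the problem data, projections, and constants are placeholders):

import numpy as np

def apga(alpha0, grad_alpha, proj_A, sigma, N):
    # Algorithm 7 sketch: Nesterov-accelerated projected gradient ascent on the inner problem.
    alpha_prev, y, tau = alpha0, alpha0, 1.0
    for _ in range(int(np.ceil(N))):
        alpha = proj_A(y + sigma * grad_alpha(y))             # ascent step + projection
        tau_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * tau**2))
        y = alpha + ((tau - 1.0) / tau_next) * (alpha - alpha_prev)  # momentum extrapolation
        alpha_prev, tau = alpha, tau_next
    return alpha_prev

def multistep_framework(theta0, alpha0, grad_theta_f, grad_alpha_f, proj_Theta, proj_A,
                        L22, L, K, N, T):
    # Algorithm 6 sketch with the projected-gradient version of Step 7.
    # grad_alpha_f must be the gradient of the regularized objective f_beta in (4.13).
    theta, alpha = theta0, alpha0
    for _ in range(T):
        for _ in range(int(K // N)):
            alpha = apga(alpha, lambda a: grad_alpha_f(theta, a), proj_A, 1.0 / L22, N)
        theta = proj_Theta(theta - (1.0 / L) * grad_theta_f(theta, alpha))
    return theta, alpha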
A direct application of our proposed framework arises when g is a finite maximum of smooth functions, i.e.,

  g(θ) ≜ max{f_i(θ)}_{i=1}^n.

To see this, we re-write the latter problem as

  g(θ) ≜ max_t Σ_{i=1}^n t_i f_i(θ) s.t. t_i ≥ 0 ∀i = 1,...,n, Σ_{i=1}^n t_i = 1.

Then solving the minimization problem min_θ g(θ) is equivalent to solving the min-max game

  min_θ max_t Σ_{i=1}^n t_i f_i(θ) s.t. t_i ≥ 0 ∀i = 1,...,n, Σ_{i=1}^n t_i = 1.

Clearly, the inner maximization problem is concave. Using our proposed framework, we solve at each iteration the regularized strongly concave problem

  max_t Σ_{i=1}^n t_i f_i(θ) − (β/2) Σ_{i=1}^n t_i² s.t. t_i ≥ 0 ∀i = 1,...,n, Σ_{i=1}^n t_i = 1.   (4.17)

Then we use the computed solution to perform a descent step with respect to θ. We note that the special structure of the maximization problem in (4.17) can be utilized to improve on the complexity rate of Corollary 4.3.4 and obtain a rate of O(ε^{−3}).

4.4 Min-Max Games for Fair Statistical Learning

Machine learning algorithms have been increasingly deployed in critical automated decisions that directly affect human lives. Several reported instances have demonstrated that these algorithms may suffer from systematic discrimination against individuals based on their sensitive attributes (gender, race, ...). In this section, we introduce the Hirschfeld-Gebelein-Rényi (HGR) correlation principle as a tool to impose several known group fairness measures (demographic parity, equalized odds, equalized opportunity). Using the HGR coefficient as a regularization term, we propose a min-max game formulation for fair statistical inference. Motivated by Algorithm 6, we compute a first-order stationary point of the min-max game. Our contributions are summarized below.

Contributions:

We introduce the HGR correlation principle as a tool to impose several notions of group fairness: demographic parity, equalized odds and equalized opportunity. These fairness notions require (conditional) independence between the model output and the sensitive attribute. Unlike Pearson correlation and HSIC, the HGR correlation provides a necessary and sufficient condition for independence of random variables.

Using a regularization framework, we propose a min-max game formulation for fair statistical inference. Unlike methods that use an adversarial neural network to impose fairness, we show that in special problems (like binary classification) it suffices to use a simple concave quadratic function as the adversarial objective. For these cases, we use a multi-step gradient descent ascent algorithm for computing a stationary solution of the optimization problem.

We apply our regularization framework to K-means clustering. For this problem we propose an algorithm that penalizes the dependence between the allocation of points to clusters and their corresponding sensitive attribute. We show that for a sufficiently large regularization coefficient we achieve perfect fairness under the disparate impact doctrine. Unlike the two-phase methods proposed in [44, 146], our method does not require any pre-processing and allows for regulating the trade-off between accuracy and fairness.
4.4.1 Rényi Correlation as a Measure of Dependence

The most widely used notions of group fairness in machine learning are demographic parity, equalized odds, and equalized opportunity. These notions require (conditional) independence between the model output and a certain sensitive attribute. This independence is typically imposed by adding fairness constraints or regularization terms to the training process. For instance, [94] added a regularization term that uses mutual information as a measure of dependence between the model output and the sensitive attribute. Since estimating the joint probability density function between random variables is not computationally tractable, [94] approximated this probability density function using a logistic regression model. In an attempt to better estimate the joint probability density function, [137] proposed an adversarial approach that estimates it using a parameterized neural network. Although these works start from a very well-justified objective function, they end up solving approximations of that objective due to computational barriers. Thus, no fairness guarantee can be provided even when the resulting optimization problems are solved to global optimality in the large-sample-size scenario.

A simpler and more tractable measure of dependence between two random variables is the Pearson correlation coefficient. The Pearson correlation coefficient between two random variables A and B is defined as

  ρ_P(A, B) = Cov(A, B) / ( √Var(A) √Var(B) ),   (4.18)

where Cov(·,·) is the covariance of the two random variables and Var(·) denotes the variance. The Pearson correlation coefficient is used in [160] to decorrelate the binary sensitive attribute and the decision boundary of the classifier. A major drawback of Pearson correlation as a dependence measure is that it only captures linear dependencies between random variables. It is known that if two random variables A and B are independent, then ρ_P(A, B) = 0. However, the converse is not necessarily true; i.e., ρ_P(A, B) may be zero even when the random variables A and B have significant dependencies. This property raises concerns about the use of Pearson correlation for imposing fairness. Related to Pearson correlation, [133] proposes the Hilbert-Schmidt Independence Criterion (HSIC) as a fairness measure in training. Similar to Pearson correlation, the HSIC of two random variables may be zero even if the two variables have dependencies. While universal kernels can be used to resolve this issue, they come at the expense of computational intractability. In addition, HSIC is not a normalized dependence measure [78, 77], raising concerns about its appropriateness as a dependence measure.

In this thesis, we suggest using the Hirschfeld-Gebelein-Rényi (HGR) correlation [136, 83, 70] as a dependence measure between random variables for fair inference. This measure of dependence is normalized, captures higher-order dependence, and is computationally tractable in certain cases, as will be seen later. The HGR correlation, also known as maximal correlation or Rényi correlation, between two random variables A and B is defined as

  ρ_R(A, B) = sup_{f, g} E[f(A) g(B)]
  s.t. E[f(A)] = E[g(B)] = 0, E[f²(A)] = E[g²(B)] = 1,   (4.19)

where the optimization is over the set of measurable functions f(·) and g(·) satisfying the constraints. Unlike HSIC and Pearson correlation, Rényi correlation captures higher-order dependencies between random variables. Moreover, the Rényi correlation coefficient between two random variables is zero if and only if the random variables are independent. For other interesting properties of the HGR correlation, we refer the reader to [64]. In addition to its desirable statistical properties, as we will discuss later, ρ_R can be computed "efficiently" for discrete random variables. These computational and statistical benefits make Rényi correlation a powerful tool in the context of fair inference. In the next section, we will discuss how Rényi correlation can be used to achieve some popular notions of fairness.
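For finite-alphabet random variables, a classical way to compute (4.19) is via the singular values of the normalized joint-distribution matrix M with M[a,b] = P(a,b)/√(P(a)P(b)): ρ_R(A, B) equals the second-largest singular value of M (the largest is always 1). A minimal sketch using this fact (the joint distribution below is a stand-in chosen so that Pearson correlation is zero while the variables are dependent):

import numpy as np

def hgr_discrete(P):
    # HGR/maximal correlation of a joint pmf P[a, b]: second-largest singular value
    # of M[a, b] = P[a, b] / sqrt(P_A[a] * P_B[b]).
    pa, pb = P.sum(axis=1), P.sum(axis=0)
    M = P / np.sqrt(np.outer(pa, pb))
    return np.linalg.svd(M, compute_uv=False)[1]

# A uniform on {-1, 0, 1} and B = A**2: Pearson correlation is 0, yet B is determined by A.
P = np.array([[0.0, 1/3],    # A = -1 -> B = 1
              [1/3, 0.0],    # A =  0 -> B = 0
              [0.0, 1/3]])   # A =  1 -> B = 1
print(hgr_discrete(P))       # 1.0: HGR detects the dependence that Pearson misses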
4.4.2 A General Min-Max Framework for Rényi Fair Inference

Consider a learning task over a given random variable Z. Our goal is to minimize the average inference loss L(θ), where the loss function is parameterized by the parameter θ. Hence, to find the parameter θ with the smallest average loss, we need to solve the following optimization problem:

  min_θ E[L(θ, Z)].

Here the expectation is taken over the random variable Z. Notice that the above formulation is quite general and can include regression, classification, clustering, or dimensionality reduction tasks as special cases. Now assume that, in addition to minimizing the average loss, we need to bring fairness to our learning task. Let S be the sensitive attribute and Ŷ_θ(X) be the result of our inference task using parameter θ. Assume we are interested in reducing the dependence of the random variable Ŷ_θ(X) on the sensitive attribute S. To balance goodness-of-fit and fairness in our learning task, we propose to use the following optimization problem:

  min_θ E[L(θ, Z)] + λ ρ_R²( Ŷ_θ(Z), S ),   (4.20)

where λ is a positive scalar balancing fairness and goodness-of-fit in our learning task. Notice that the above framework is quite general. For example, Ŷ_θ may be the assigned labels in a classification task, the assigned clusters in a clustering task, or the output of a regressor in a regression task. Using the definition of Rényi correlation, we can rewrite this optimization problem as

  min_θ sup_{f,g} E[L(θ, Z)] + λ ( E[f(Ŷ_θ(Z)) g(S)] )²
  s.t. E[f(Ŷ_θ(Z))] = E[g(S)] = 0, E[f²(Ŷ_θ(Z))] = E[g²(S)] = 1,   (4.21)

where the supremum is taken over the set of measurable functions. The next natural question is whether this optimization problem can be solved efficiently in practice. In the next two sections, we narrow our focus to special cases and study the practical tractability of the above min-max optimization problem.

4.4.3 Rényi Fair Classification

In a typical fair classification problem, we are given input data X ∈ R^{n×d}, with each row representing a data sample; ground-truth labels Y ∈ 𝒴 ≜ {0, 1, ..., C}; and a sensitive attribute S ∈ Rⁿ. Here C is the number of classes considered in the classification. The model seeks a prediction Ŷ(X) ∈ 𝒴 that accurately predicts the ground-truth label Y while being fair to the sensitive attribute S. More specifically, in addition to being accurate, the predictions should not be biased in favor of individuals of a certain protected group. We denote by P̂_θ(X_i = j) the probability of classifying data sample i into class j, and let P̂_θ(X_i) ≜ ( P̂_θ(X_i = j) )_{j=1}^C ∈ R^C. Using this notation, a predictor Ŷ_θ(X) satisfies demographic parity if it is independent of the sensitive attribute S. This condition formalizes the legal doctrine of disparate impact. Motivated by the desirable statistical properties of the HGR correlation, we impose demographic parity by adopting the HGR independence measure in an adversarial learning approach. The problem can be formulated as a two-player game:

  Player 1: min_θ L(θ)   (4.22a)

  Player 2: ρ_R( Ŷ_θ(X), S ) ≜ sup_{f,g} E[f(Ŷ_θ(X)) g(S)]
  s.t. E[f(Ŷ_θ(X))] = E[g(S)] = 0, E[f²(Ŷ_θ(X))] = E[g²(S)] = 1.   (4.22b)

The goal of the first player is to learn a model that best fits the data pair (X, Y). More specifically, (4.22a) computes parameters θ that minimize the loss (measured by the function L(θ)) of the model. Typical examples of L are the cross-entropy and hinge losses.
Given the parameters θ, the goal of the second player is to find the functions f(·) and g(·) that maximize the correlation between the prediction model output Ŷ_θ(X) and the sensitive attribute S. To solve the described two-player game, we use a regularization framework that penalizes the Rényi correlation:

  DP-fair classification: min_θ L(θ) + λ ρ_R²( Ŷ_θ(X), S ).   (4.23)

The terms of the above formulation correspond to the objective functions of players 1 and 2 defined in (4.22a) and (4.22b), respectively. Here λ is a regularization hyper-parameter balancing fairness and accuracy of the classifier. The regularization term added in (4.23) penalizes the dependence between the prediction model output Ŷ_θ(X) and the sensitive attribute S. As discussed earlier, formulation (4.23) imposes the demographic parity group fairness notion.

Another popular notion of group fairness is equalized odds. We say a predictor Ŷ_θ(X) satisfies the equalized odds condition if Ŷ_θ(X) is conditionally independent of the sensitive attribute S given the true label Y, i.e.,

  P( Ŷ_θ(X) = ŷ | S, Y = y ) = P( Ŷ_θ(X) = ŷ | Y = y ) for all y and ŷ.

Similar to formulation (4.23), the equalized odds fairness notion can be imposed by the following min-max regularized problem:

  EOD-fair classification: min_θ L(θ) + Σ_{y∈𝒴} λ_y ρ_R²( Ŷ_θ([X_i]_{i: Y_i = y}), [S_i]_{i: Y_i = y} ).   (4.24)

Here [X_i]_{i: Y_i = y} are the data samples with ground-truth label y, and [S_i]_{i: Y_i = y} denotes their corresponding sensitive attributes. Finally, it might be desirable to impose this constraint only for a particular true label Y = y. We say a predictor Ŷ_θ(X) satisfies equalized opportunity if Ŷ_θ(X) is conditionally independent of the sensitive attribute S given a specific true label Y = y, i.e., given y,

  P( Ŷ_θ(X) = ŷ | S, Y = y ) = P( Ŷ_θ(X) = ŷ | Y = y ) for all ŷ.

The equalized opportunity fairness notion can be imposed by the following min-max regularized problem:

  EOP-fair classification: min_θ L(θ) + λ ρ_R²( Ŷ_θ([X_i]_{i: Y_i = y}), [S_i]_{i: Y_i = y} ).   (4.25)

Note that in the case of multiple discrete sensitive attributes S_j ∈ 𝒮_j with j ∈ {1, ..., k}, we can include one sensitive attribute that considers all possible combinations S ∈ {(s_i)_{i=1}^k | s_i ∈ 𝒮_i}. For instance, when we have two sensitive attributes S₁ ∈ {0,1} and S₂ ∈ {0,1}, we can include one attribute S ∈ {0,1,2,3} corresponding to the four combinations {(0,0), (0,1), (1,0), (1,1)}. Hence, our formulations (4.23), (4.24), and (4.25) can be used to impose fairness across multiple sensitive attributes. Based on the definition of ρ_R in (4.22b), the problems defined in (4.23), (4.24), and (4.25) fall under the min-max optimization framework. In the next section, we study the problem of finding polynomial-time algorithms for solving these min-max optimization problems with theoretical convergence guarantees.

4.4.4 Convergence Analysis

Several algorithms have recently been proposed for solving min-max optimization problems [140, 127, 89]. These methods require solving the inner maximization problem to (approximate) global optimality. The optimization problem described in (4.19) maximizes the objective function over a functional space, which is in general computationally intractable. In this section, we show that in the case of discrete random variables, ρ_R can be computed "efficiently" to global optimality. Motivated by the algorithms proposed in [127, 89], we use the computed global solution to develop an algorithm that solves the min-max optimization problem to approximate first-order optimality.
We first consider the case where S is binary. In this case, [64] showed that solving the optimization problem in (4.19) is equivalent to solving a concave quadratic maximization problem; see the theorem below.

Theorem 4.4.1 (Theorem 2, [64]). Suppose that Ŷ_θ(X) ∈ {1, ..., C}ⁿ is a discrete random vector and S ∈ {0,1}ⁿ is a binary random variable. Then,

  ρ_R²( Ŷ_θ(X), S ) ≜ 1 − γ_θ / ( P(S = 1) P(S = 0) ),   (4.26)

where

  γ_θ ≜ min_v vᵀ Q_θ v + d_θᵀ v + 1/4,   (4.27)

with Q_θ ∈ R^{C×C} and d_θ ∈ R^C defined as

  (Q_θ)_{i,j} = (1/n) Σ_{k=1}^n P̂_θ(X_k = j) if i = j, and 0 otherwise,
  (d_θ)_j = (1/n) Σ_{k=1}^n (2S_k − 1) P̂_θ(X_k = j),

for every i, j = 1, ..., C.

The above result implies that when S is binary, the maximal HGR correlation can be computed exactly by solving a convex unconstrained quadratic optimization problem, which has a closed-form solution. Based on Theorem 4.4.1, the problem defined in (4.23) can be re-written as

  min_θ max_v f(θ, v),   (4.28)

where

  f(θ, v) ≜ L(θ) + λ ( 1 − c ( vᵀ Q_θ v + d_θᵀ v + 1/4 ) ).   (4.29)

Here c = 1/( P(S = 1) P(S = 0) ) is a positive scalar, where P(S = 1) and P(S = 0) can be estimated from the given dataset. Under a Lipschitz smoothness assumption on P̂_θ(X), we use the framework proposed in Section 4.3 to solve at each iteration the regularized strongly concave problem

  max_v −c ( vᵀ Q_θ v + d_θᵀ v ) − (β/2) ‖v‖².

Then we use the computed solution to perform a gradient descent step on θ. It is worth mentioning that this strongly concave quadratic problem has a closed-form solution. Hence, by Corollary 4.3.4, we find an ε-FNE of the min-max problem in O(ε^{−3}) iterations. Also note that the inner maximization problem in (4.28) is (the negative of) a convex quadratic function composed with a linear function. Hence, for a given θ, using [96], f(θ, ·) is σ₊(θ)-PL, where σ₊(θ) is the smallest non-zero singular value of Q_θ. If there exists σ > 0 such that σ₊(θ) ≥ σ for all θ, then under Lipschitz smoothness assumptions on f, we can use Algorithm 5 to find an ε-FNE in O(ε^{−2}) iterations. In the next section, we apply our framework to impose fairness in clustering problems.
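Because Q_θ is diagonal, the quadratic in (4.27) separates across coordinates and, under the reconstruction of Theorem 4.4.1 above, γ_θ = 1/4 − Σ_j (d_θ)_j² / (4 (Q_θ)_{jj}), where coordinates with (Q_θ)_{jj} = 0 also have (d_θ)_j = 0 and can be skipped. A minimal sketch of this computation (assuming numpy; the classifier probabilities below are stand-ins):

import numpy as np

def renyi_sq_binary(P_hat, S):
    # rho_R^2(Yhat, S) per Theorem 4.4.1: P_hat is n x C with P_hat[k, j] = P(X_k = j),
    # S is a length-n binary vector; uses the closed form of the separable quadratic (4.27).
    q = P_hat.mean(axis=0)                           # diagonal of Q_theta
    d = ((2 * S - 1)[:, None] * P_hat).mean(axis=0)  # d_theta
    pos = q > 0                                      # Q_jj = 0 forces d_j = 0
    gamma = 0.25 - np.sum(d[pos]**2 / (4.0 * q[pos]))
    p1 = S.mean()
    return 1.0 - gamma / (p1 * (1.0 - p1))

# Stand-in data: a maximally "unfair" classifier whose prediction is determined by S.
S = np.array([0, 0, 1, 1])
P_hat = np.eye(2)[S]               # one-hot rows: class 1 if S = 0, class 2 if S = 1
print(renyi_sq_binary(P_hat, S))   # 1.0: the prediction is fully determined by S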
Hence, the problem of nding a fair K-means partition P can be formulated as the following Min-Max problem min A;C max v2R K P n i=1 P K k=1 kX i A i;k c j k 2 P n i=1 (A T i v S i ) 2 s:t: P K k A i;k = 1 8i = 1;:::;n A i;k 2f0; 1g 8i = 1;:::;n; k = 1;:::;K: (4.32) Here, we removed the constant terms and absorbed 1 P(S = 1)P(S = 0) into the hyper-parameter. Motivated by algorithm 6, we propose an Algorithm that solves the inner maximization problem at each iteration. Then uses the computed solution to update A. In problem (4.32), given A, the inner maximization problem has the closed-form solution v k (A), P n i=1 S i A i;k P n i=1 A i;k 8 k = 1;:::;K: Hence, v k () is the ratio of data points with S = 1 in clusterk. Under disparate impact doctrine, we dene the notion of balance, minfv k (A)g K k=1 to be a measure of fairness. Larger balance corresponds to a more fair solution. When all elements of v are the same, the proportion of data points with S = 1 (or S = 0) in each cluster is the same as the proportion of data points with S = 1 (or S = 0) in the whole dataset. Therefore, the smaller the variance of the elements of v computed by solving (4.32), the larger the balance measure; which represents a more fair solution. To solve the fair K-means clustering problem, we propose Algorithm 8 that adapts the popular k-means Algorithm [82] to solve the Min-Max optimization problem (4.32). More specically, in addition to updating the assignments A and centers c j , our algorithm updates v at each iteration. Moreover, when updating A our algorithm considers the fairness regularization term for computing the euclidean distance from data 76 points to the centers. Algorithm 8 Fair R enyi K-means 1: Input: X and S 2: Initialize: RandomA s:t: P K k=1 A i;k = 18 i; and A i;k 2f0; 1g: A prev = 0 3: while A prev 6= A do 4: A prev = A 5: 6: for i = 1;:::;n do . Update A 7: A i = 0 8: t = arg min k kX i c k jk 2 (v k S i ) 2 9: A it = 1 10: for k = 1;:::;K do . Update v 11: v k = P n i=1 S i A i;k P n i=1 A i;k 12: end for 13: end for 14: 15: for k = 1;:::;K do . Update c 16: c k = P n i=1 A i;k X i P n i=1 A i;k 17: end for 18: end while For S i = 1, Step 8 penalizes the clusters with larger proportion of points with S = 1, i.e. clusters with larger v k . On the other hand, for S i = 0, Step 8 penalizes the clusters with smaller proportion of points with S = 1, i.e. clusters with smaller v k . Hence the role of the regularization term in Step 8 is to try to assign points with S = 1 to clusters with small v k enforcing a more balanced output. In the next section, we evaluate our algorithm on a real dataset. 4.4.6 Numerical Experiments In this section, we evaluate our algorithm by performing experiments on the standard Bank 2 and Adult 3 real datasets. Bank dataset: This dataset contains the information of individuals contacted by a Portuguese bank insti- tution. For this dataset, we sampled 3 continuous features: Age, balance, and duration. We consider the sensitive attribute to be the marital status of the individuals. Adult dataset: This dataset contains the census information of individuals including education, gender, capital-gain, and etc. We selected 5 continuous features, and sampled 10000 data samples. We consider the 2 https://archive.ics.uci.edu/ml/datasets/Bank%20Marketing. 3 https://archive.ics.uci.edu/ml/datasets/adult. 77 sensitivity attribute to be the gender of the individuals. 
We implemented Algorithm 8 as described in Section 4.4.5 to solve the fair K-means clustering problem for the described datasets. We chose the number of clustersK to be 14. The results are summarized in Figure 4.1. Result: Figure 4.1 shows the results for the two dataets. Each plot depicts the minimum, maximum, average, average plus standard deviation, and average minus standard deviation of the elements of vector v. These measures are used to show the variation in the proportion of individuals with S = 1 across dierent clusters. When = 0, algorithm 8 is just the traditional K-means algorithm. Observe that as we increase , the deviation of the elements of v decreases imposing more fairness. Moreover, the red plots show that as we increase the total loss increases. This demonstrates the trade-o between accuracy and fairness. (a) Fair R enyi K-means for Adult dataset for K = 14 (b) Fair R enyi K-means for Bank dataset for K = 14 Figure 4.1: Trade-o between accuracy and fairness. 78 Bibliography [1] C. R. Act. Civil rights act of 1964, title vii, equal employment opportunities,. 1964. [2] A. Agarwal, A. Beygelzimer, M. Dud k, J. Langford, and H. Wallach. A reductions approach to fair classication. arXiv preprint arXiv:1803.02453, 2018. [3] D. Alabi, N. Immorlica, and A. Kalai. Unleashing linear optimizers for group-fair learning and opti- mization. In Conference On Learning Theory, pages 2043{2066, 2018. [4] Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018. [5] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018. [6] A. Anandkumar and R. Ge. Ecient approaches for escaping higher order saddle points in non-convex optimization. In Conference on Learning Theory, pages 81{102, 2016. [7] J. Angwin, J. Larson, S. Mattu, and L. Kirchner. Machine bias. ProPublica, 2016. [8] M. Anitescu. Degenerate nonlinear programming with a quadratic growth condition. SIAM Journal on Optimization, 10(4):1116{1135, 2000. [9] S. Arora, N. Cohen, N. Golowich, and W. Hu. A convergence analysis of gradient descent for deep linear neural networks. arXiv preprint arXiv:1810.02281, 2018. [10] M. Balcerzak, A. Majchrzycki, and A. Wachowicz. Openness of multiplication in some function spaces. Taiwanese J. Math, 17:1115{1126, 2013. [11] M. Balcerzak, A. Wachowicz, and W. Wilczy nski. Multiplying balls in the space of continuous functions on [0; 1]. Studia Mathematica, 170:203{209, 2005. [12] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural networks, 2(1):53{58, 1989. 79 [13] A. S. Bandeira, N. Boumal, and V. Voroninski. On the low-rank approach for semidenite programs arising in synchronization and community detection. In Conference on Learning Theory, pages 361{382, 2016. [14] B. Barazandeh and M. Razaviyayn. On the behavior of the expectation-maximization algorithm for mixture models. arXiv preprint arXiv:1809.08705, 2018. [15] Y. Bechavod and K. Ligett. Penalizing unfairness in binary classication. arXiv preprint arXiv:1707.00044, 2017. [16] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183{202, 2009. [17] E. Behrends. Products of n open subsets in the space of continuous functions on [0; 1]. Studia Mathe- matica, 204:73{95, 2011. [18] E. Behrends. 
[19] R. Berk, H. Heidari, S. Jabbari, M. Joseph, M. Kearns, J. Morgenstern, S. Neel, and A. Roth. A convex framework for fair regression. arXiv preprint arXiv:1706.02409, 2017.
[20] P. Bernhard and A. Rapaport. On a theorem of Danskin with an application to a theorem of von Neumann-Sion. Nonlinear Analysis, 24(8):1163-1182, 1995.
[21] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, 1999.
[22] S. Bhojanapalli, B. Neyshabur, and N. Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873-3881, 2016.
[23] W. Bian, X. Chen, and Y. Ye. Complexity analysis of interior point algorithms for non-Lipschitz and nonconvex minimization. Mathematical Programming, 149(1-2):301-327, 2015.
[24] E. Birgin and J. Martínez. On regularization and active-set methods with complexity for constrained optimization. SIAM Journal on Optimization, 28(2):1367-1395, 2018.
[25] A. Blum and R. L. Rivest. Training a 3-node neural network is NP-complete. In Advances in Neural Information Processing Systems, pages 494-501, 1989.
[26] T. Bolukbasi, K.-W. Chang, J. Y. Zou, V. Saligrama, and A. T. Kalai. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, pages 4349-4357, 2016.
[27] N. Boumal. Nonconvex phase synchronization. SIAM Journal on Optimization, 26(4):2355-2377, 2016.
[28] N. Boumal, V. Voroninski, and A. Bandeira. The non-convex Burer-Monteiro approach works on smooth semidefinite programs. In Advances in Neural Information Processing Systems, pages 2757-2765, 2016.
[29] S. Burer and R. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329-357, 2003.
[30] S. Burer and R. D. C. Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming, 103(3):427-444, 2005.
[31] Q. Cai, M. Hong, Y. Chen, and Z. Wang. On the global convergence of imitation learning: A case for linear quadratic regulator. arXiv preprint arXiv:1901.03674, 2019.
[32] T. Calders, F. Kamiran, and M. Pechenizkiy. Building classifiers with independency constraints. In 2009 IEEE International Conference on Data Mining Workshops, pages 13-18. IEEE, 2009.
[33] F. Calmon, D. Wei, B. Vinzamuri, K. N. Ramamurthy, and K. R. Varshney. Optimized pre-processing for discrimination prevention. In Advances in Neural Information Processing Systems, pages 3992-4001, 2017.
[34] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I. arXiv preprint arXiv:1710.11606, 2017.
[35] C. Cartis, N. Gould, and P. L. Toint. An adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its function-evaluation complexity. IMA Journal of Numerical Analysis, 32(4):1662-1695, 2012.
[36] C. Cartis, N. I. Gould, and P. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming, 127(2):245-295, 2011.
[37] C. Cartis, N. I. Gould, and P. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Mathematical Programming, 130(2):295-319, 2011.
[38] C. Cartis, N. I. Gould, and P. L. Toint. On the evaluation complexity of cubic regularization methods for potentially rank-deficient nonlinear least-squares problems and its relevance to constrained nonlinear optimization. SIAM Journal on Optimization, 23(3):1553-1574, 2013.
[39] C. Cartis, N. I. Gould, and P. L. Toint. On the complexity of finding first-order critical points in constrained nonlinear optimization. Mathematical Programming, 144(1-2):93-106, 2014.
[40] C. Cartis, N. I. Gould, and P. L. Toint. On the evaluation complexity of constrained nonlinear least-squares and general constrained nonlinear optimization using second-order methods. SIAM Journal on Numerical Analysis, 53(2):836-851, 2015.
[41] C. Cartis, N. I. Gould, and P. L. Toint. Second-order optimality and beyond: Characterization and evaluation complexity in convexly constrained nonlinear optimization. Foundations of Computational Mathematics, pages 1-35, 2017.
[42] L. E. Celis, L. Huang, V. Keswani, and N. K. Vishnoi. Classification with fairness constraints: A meta-algorithm with provable guarantees. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 319-328. ACM, 2019.
[43] A. Chambolle and T. Pock. On the ergodic convergence rates of a first-order primal-dual algorithm. Mathematical Programming, 159(1-2):253-287, 2016.
[44] F. Chierichetti, R. Kumar, S. Lattanzi, and S. Vassilvitskii. Fair clustering through fairlets. In Advances in Neural Information Processing Systems, pages 5029-5037, 2017.
[45] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192-204, 2015.
[46] A. R. Conn, N. I. Gould, and P. L. Toint. Trust Region Methods, volume 1. SIAM, 2000.
[47] F. E. Curtis and D. P. Robinson. Exploiting negative curvature in deterministic and stochastic optimization. arXiv preprint arXiv:1703.00412, 2017.
[48] F. E. Curtis, D. P. Robinson, and M. Samadi. An inexact regularized Newton framework with a worst-case iteration complexity of O(ε^{-3/2}) for nonconvex optimization. arXiv preprint arXiv:1708.00475, 2017.
[49] F. E. Curtis, D. P. Robinson, and M. Samadi. A trust region algorithm with a worst-case iteration complexity of O(ε^{-3/2}) for nonconvex optimization. Mathematical Programming, 162(1-2):1-32, 2017.
[50] B. Dai, H. Dai, A. Gretton, L. Song, D. Schuurmans, and N. He. Kernel exponential family estimation via doubly dual embedding. arXiv preprint arXiv:1811.02228, 2018.
[51] B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning, pages 1133-1142, 2018.
[52] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. arXiv preprint arXiv:1711.00141, 2017.
[53] C. Daskalakis and I. Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems, pages 9236-9246, 2018.
[54] A. Datta, M. C. Tschantz, and A. Datta. Automated experiments on ad privacy settings. Proceedings on Privacy Enhancing Technologies, 2015(1):92-112, 2015.
[55] P. J. Dickinson and L. Gijben. On the computational complexity of membership problems for the completely positive cone and its dual. Computational Optimization and Applications, 57(2):403-415, 2014.
[56] M. Donini, L. Oneto, S. Ben-David, J. S. Shawe-Taylor, and M. Pontil. Empirical risk minimization under fairness constraints. In Advances in Neural Information Processing Systems, pages 2791-2801, 2018.
[57] S. S. Du, C. Jin, J. D. Lee, M. I. Jordan, A. Singh, and B. Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, pages 1067-1077, 2017.
[58] S. S. Du and J. D. Lee. On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206, 2018.
[59] S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.
[60] S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
[61] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214-226. ACM, 2012.
[62] C. Dwork, N. Immorlica, A. T. Kalai, and M. Leiserson. Decoupled classifiers for group-fair and efficient machine learning. In Conference on Fairness, Accountability and Transparency, pages 119-133, 2018.
[63] H. Edwards and A. Storkey. Censoring representations with an adversary. arXiv preprint arXiv:1511.05897, 2015.
[64] F. Farnia, M. Razaviyayn, S. Kannan, and D. Tse. Minimum HGR correlation principle: From marginals to joint distribution. In 2015 IEEE International Symposium on Information Theory (ISIT), pages 1377-1381. IEEE, 2015.
[65] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1466-1475, 2018.
[66] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259-268. ACM, 2015.
[67] B. Fish, J. Kun, and A. D. Lelkes. A confidence-based approach for balancing fairness and accuracy. In Proceedings of the 2016 SIAM International Conference on Data Mining, pages 144-152. SIAM, 2016.
[68] R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797-842, 2015.
[69] R. Ge, J. D. Lee, and T. Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973-2981, 2016.
[70] H. Gebelein. Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung. ZAMM-Journal of Applied Mathematics and Mechanics/Zeitschrift für Angewandte Mathematik und Mechanik, 21(6):364-379, 1941.
[71] S. Ghosh, M. Squillante, and E. Wollega. Efficient stochastic gradient descent for distributionally robust learning. arXiv preprint arXiv:1805.08728, 2018.
[72] G. Gidel, H. Berard, G. Vignoud, P. Vincent, and S. Lacoste-Julien. A variational inequality perspective on generative adversarial networks. arXiv preprint arXiv:1802.10551, 2018.
[73] G. Gidel, R. A. Hemmat, M. Pezeshki, G. Huang, R. Lepriol, S. Lacoste-Julien, and I. Mitliagkas. Negative momentum for improved game dynamics. arXiv preprint arXiv:1807.04740, 2018.
[74] G. Gidel, T. Jebara, and S. Lacoste-Julien. Frank-Wolfe algorithms for saddle point problems. arXiv preprint arXiv:1610.07797, 2016.
[75] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio. Deep Learning, volume 1. MIT Press, Cambridge, 2016.
[76] I. Goodfellow and A. Courville. Deep learning. Book in preparation for MIT Press, Cambridge, 2016.
[77] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In International Conference on Algorithmic Learning Theory, pages 63-77. Springer, 2005.
[78] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Schölkopf. Kernel methods for measuring independence. Journal of Machine Learning Research, 6(Dec):2075-2129, 2005.
[79] E. Y. Hamedani, A. Jalilzadeh, N. Aybat, and U. Shanbhag. Iteration complexity of randomized primal-dual methods for convex-concave saddle point problems. arXiv preprint arXiv:1806.04118, 2018.
[80] M. Hardt and T. Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.
[81] M. Hardt, E. Price, N. Srebro, et al. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pages 3315-3323, 2016.
[82] J. A. Hartigan and M. A. Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):100-108, 1979.
[83] H. O. Hirschfeld. A connection between correlation and contingency. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 31, pages 520-524. Cambridge University Press, 1935.
[84] J. Ho and S. Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565-4573, 2016.
[85] M. Hong, J. D. Lee, and M. Razaviyayn. Gradient primal-dual algorithm converges to second-order stationary solutions for nonconvex distributed optimization. arXiv preprint arXiv:1802.08941, 2018.
[86] C. Horowitz. An elementary counterexample to the open mapping principle for bilinear maps. Proceedings of the American Mathematical Society, 53(2):293-294, 1975.
[87] Y. Hsia and R.-L. Sheu. Trust region subproblem with a fixed number of additional linear inequality constraints has polynomial complexity. arXiv preprint arXiv:1312.1398, 2013.
[88] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017.
[89] C. Jin, P. Netrapalli, and M. I. Jordan. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618, 2019.
[90] A. Juditsky and A. Nemirovski. Solving variational inequalities with monotone operators on domains given by linear minimization oracles. Mathematical Programming, 156(1-2):221-256, 2016.
[91] F. Kamiran and T. Calders. Classifying without discriminating. In 2009 2nd International Conference on Computer, Control and Communication, pages 1-6. IEEE, 2009.
[92] F. Kamiran and T. Calders. Classification with no discrimination by preferential sampling. In Proc. 19th Machine Learning Conf. Belgium and The Netherlands, pages 1-6. Citeseer, 2010.
[93] F. Kamiran and T. Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1-33, 2012.
[94] T. Kamishima, S. Akaho, and J. Sakuma. Fairness-aware learning through regularization approach. In 2011 IEEE 11th International Conference on Data Mining Workshops, pages 643-650. IEEE, 2011.
[95] H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795-811. Springer, 2016.
[96] H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.
[97] K. Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
[98] K. Kawaguchi and Y. Bengio. Depth with nonlinearity creates no bad local minima in ResNets. arXiv preprint arXiv:1810.09038, 2018.
[99] M. Kearns, S. Neel, A. Roth, and Z. S. Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. arXiv preprint arXiv:1711.05144, 2017.
[100] S. Lacoste-Julien. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv preprint arXiv:1607.00345, 2016.
[101] J. D. Lee, I. Panageas, G. Piliouras, M. Simchowitz, M. I. Jordan, and B. Recht. First-order methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406, 2017.
[102] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.
[103] Y. Li and Y. Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166, 2018.
[104] Q. Lin, M. Liu, H. Rafique, and T. Yang. Solving weakly-convex-weakly-concave saddle-point problems as weakly-monotone variational inequality. arXiv preprint arXiv:1810.10207, 2018.
[105] C. Louizos, K. Swersky, Y. Li, M. Welling, and R. Zemel. The variational fair autoencoder. arXiv preprint arXiv:1511.00830, 2015.
[106] H. Lu and K. Kawaguchi. Depth creates no bad local minima. arXiv preprint arXiv:1702.08580, 2017.
[107] S. Lu, M. Razaviyayn, B. Yang, K. Huang, and M. Hong. Finding second-order stationary solutions using a perturbed projected gradient descent algorithm for non-convex linearly constrained problems. In preparation, 2019.
[108] S. Lu, I. Tsaknakis, and M. Hong. Block alternating optimization for non-convex min-max problems: algorithms and applications in signal processing and communications. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
[109] S. Lu, I. Tsaknakis, M. Hong, and Y. Chen. Hybrid block successive approximation for one-sided non-convex min-max problems: Algorithms and applications. arXiv preprint arXiv:1902.08294, 2019.
[110] S. Lu, Z. Wei, and L. Li. A trust region algorithm with adaptive cubic regularization methods for nonsmooth convex minimization. Computational Optimization and Applications, 51(2):551–573, 2012.
[111] D. Madras, E. Creager, T. Pitassi, and R. Zemel. Learning adversarially fair and transferable representations. arXiv preprint arXiv:1802.06309, 2018.
[112] A. K. Menon and R. C. Williamson. The cost of fairness in binary classification. In Conference on Fairness, Accountability and Transparency, pages 107–118, 2018.
[113] P. Mertikopoulos, H. Zenati, B. Lecouat, C.-S. Foo, V. Chandrasekhar, and G. Piliouras. Mirror descent in saddle-point problems: Going the extra (gradient) mile. arXiv preprint arXiv:1807.02629, 2018.
[114] L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for GANs do actually converge? In International Conference on Machine Learning, pages 3478–3487, 2018.
[115] A. Mokhtari, A. Ozdaglar, and A. Jadbabaie. Escaping saddle points in constrained optimization. arXiv preprint arXiv:1809.02162, 2018.
[116] R. D. Monteiro and B. F. Svaiter. On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean. SIAM Journal on Optimization, 20(6):2755–2787, 2010.
[117] K. G. Murty and S. N. Kabadi. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117–129, 1987.
[118] Y. Nesterov. Dual extrapolation and its applications to solving variational inequalities and related problems. Mathematical Programming, 109(2-3):319–344, 2007.
[119] Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
[120] Y. Nesterov et al. Gradient methods for minimizing composite objective function, 2007.
[121] Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
[122] Q. Nguyen. On connected sublevel sets in deep learning. arXiv preprint arXiv:1901.07417, 2019.
[123] Q. Nguyen and M. Hein. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045, 2017.
[124] M. Nouiehed, J. D. Lee, and M. Razaviyayn. Convergence to second-order stationarity for constrained non-convex optimization. arXiv preprint arXiv:1810.02024, 2018.
[125] M. Nouiehed and M. Razaviyayn. Learning deep models: Critical points and local openness. arXiv preprint arXiv:1803.02968, 2018.
[126] M. Nouiehed and M. Razaviyayn. A trust region method for finding second-order stationarity in linearly constrained non-convex optimization. arXiv preprint arXiv:1904.06784, 2019.
[127] M. Nouiehed, M. Sanjabi, J. D. Lee, and M. Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. arXiv preprint arXiv:1902.08297, 2019.
[128] M. O'Neill and S. J. Wright. A log-barrier Newton-CG method for bound constrained optimization with complexity guarantees. arXiv preprint arXiv:1904.03563, 2019.
[129] S. Oymak and M. Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. arXiv preprint arXiv:1902.04674, 2019.
[130] J. S. Pang and M. Razaviyayn. A unified distributed algorithm for non-cooperative games, 2016.
[131] D. Park, A. Kyrillidis, C. Caramanis, and S. Sanghavi. Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach. arXiv preprint arXiv:1609.03240, 2016.
[132] G. Pataki. On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Mathematics of Operations Research, 23(2):339–358, 1998.
[133] A. Pérez-Suay, V. Laparra, G. Mateo-García, J. Muñoz-Marí, L. Gómez-Chova, and G. Camps-Valls. Fair kernel learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 339–355. Springer, 2017.
[134] E. Raff and J. Sylvester. Gradient reversal against discrimination: A fair neural network learning approach. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 189–198. IEEE, 2018.
[135] H. Rafique, M. Liu, Q. Lin, and T. Yang. Non-convex min-max optimization: Provable algorithms and applications in machine learning. arXiv preprint arXiv:1810.02060, 2018.
[136] A. Rényi. On measures of dependence. Acta Mathematica Hungarica, 10(3-4):441–451, 1959.
[137] A. Rezaei, R. Fathony, O. Memarrast, and B. Ziebart. Fair logistic regression: An adversarial perspective. arXiv preprint arXiv:1903.03910, 2019.
[138] W. Rudin. Functional analysis. McGraw-Hill Series in Higher Mathematics, 1973.
[139] S. Ruggieri. Using t-closeness anonymity to control for non-discrimination. Trans. Data Privacy, 7(2):99–129, 2014.
[140] M. Sanjabi, J. Ba, M. Razaviyayn, and J. D. Lee. On the convergence and robustness of training GANs with regularized optimal transport. In Advances in Neural Information Processing Systems, pages 7091–7101, 2018.
[141] P. Sattigeri, S. C. Hoffman, V. Chenthamarakshan, and K. R. Varshney. Fairness GAN. arXiv preprint arXiv:1805.09910, 2018.
[142] A. Sinha, H. Namkoong, and J. Duchi. Certifying some distributional robustness with principled adversarial training. 2018.
[143] M. Soltanolkotabi, A. Javanmard, and J. D. Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.
[144] M. Soltanolkotabi, A. Javanmard, and J. D. Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2019.
[145] N. Srebro and T. Jaakkola. Weighted low-rank approximations. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 720–727, 2003.
[146] K. B. Suman, D. Chakrabarty, and M. Negahbani. Fair algorithms for clustering. arXiv preprint arXiv:1901.02393, 2019.
[147] J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. In Information Theory (ISIT), 2016 IEEE International Symposium on, pages 2379–2383. IEEE, 2016.
[148] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2):853–884, 2017.
[149] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere II: Recovery by Riemannian trust-region method. IEEE Transactions on Information Theory, 63(2):885–914, 2017.
[150] J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18(5):1131–1198, 2018.
[151] R. Sun. Matrix completion via nonconvex factorization: Algorithms and theory. PhD thesis, University of Minnesota, 2015.
[152] L. Sweeney. Discrimination in online ad delivery. arXiv preprint arXiv:1301.6822, 2013.
[153] L. Venturi, A. S. Bandeira, and J. Bruna. Spurious valleys in two-layer neural network optimization landscapes. arXiv preprint arXiv:1802.06384, 2018.
[154] L. Wang, X. Zhang, and Q. Gu. A unified computational and statistical framework for nonconvex low-rank matrix estimation. arXiv preprint arXiv:1610.05275, 2016.
[155] B. Woodworth, S. Gunasekar, M. I. Ohannessian, and N. Srebro. Learning non-discriminatory predictors. arXiv preprint arXiv:1702.06081, 2017.
[156] D. Xu, S. Yuan, L. Zhang, and X. Wu. FairGAN: Fairness-aware generative adversarial networks. In 2018 IEEE International Conference on Big Data (Big Data), pages 570–575. IEEE, 2018.
[157] C. Yun, S. Sra, and A. Jadbabaie. Global optimality conditions for deep neural networks. arXiv preprint arXiv:1707.02444, 2017.
[158] C. Yun, S. Sra, and A. Jadbabaie. Small nonlinearities in activation functions create bad local minima in neural networks. arXiv preprint arXiv:1802.03487, 2018.
[159] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pages 1171–1180. International World Wide Web Conferences Steering Committee, 2017.
[160] M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi. Fairness constraints: Mechanisms for fair classification. arXiv preprint arXiv:1507.05259, 2015.
[161] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. Learning fair representations. In International Conference on Machine Learning, pages 325–333, 2013.
[162] B. H. Zhang, B. Lemoine, and M. Mitchell. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 335–340. ACM, 2018.
[163] Q. Zheng and J. Lafferty. Convergence analysis for rectangular matrix completion using Burer-Monteiro factorization and gradient descent. arXiv preprint arXiv:1605.07051, 2016.
[164] Z. Zhu, Q. Li, G. Tang, and M. B. Wakin. Global optimality in distributed low-rank matrix factorization. arXiv preprint arXiv:1811.03129, 2018.
[165] D. Zou, Y. Cao, D. Zhou, and Q. Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.

Appendix A

Proofs for results in Chapter 2

A.1 Proof of Theorem 2.6.1

Proof. We start by showing that the full rankness assumption on $X$ in [106, Theorem 2.2] is unnecessary.

Lemma A.1.1. Every local minimum of problem (2.14) is global.

Proof. Let $r_X = \mathrm{rank}(X)$ and let $U_X\Sigma_XV_X^T$, with $U_X\in\mathbb{R}^{d_0\times d_0}$, $\Sigma_X\in\mathbb{R}^{d_0\times n}$, and $V_X\in\mathbb{R}^{n\times n}$, be a singular value decomposition of $X$. Then
$$\|ZX-Y\|^2 = \big\|ZU_X\big[(\Sigma_X)_{:,1:r_X}\ \ 0\big]-YV_X\big\|^2 = \big\|ZU_X(\Sigma_X)_{:,1:r_X}-(YV_X)_{:,1:r_X}\big\|^2 + \underbrace{\big\|(YV_X)_{:,r_X+1:n}\big\|^2}_{\text{constant in problem (2.14)}}.$$
Since $U_X(\Sigma_X)_{:,1:r_X}$ is full column rank, the linear mapping $Z\mapsto ZU_X(\Sigma_X)_{:,1:r_X}$ is open, and
$$\mathrm{rank}\big(ZU_X(\Sigma_X)_{:,1:r_X}\big)\le\min\{\mathrm{rank}(Z),\ r_X\}\le\min\{d_2,d_1,d_0,r_X\}.$$
Consequently, every local minimum of (2.14) corresponds to a local minimum of the problem
$$\min_{\bar Z\in\mathbb{R}^{d_2\times r_X}}\ \tfrac12\|\bar Z-\bar Y\|^2\quad\text{s.t.}\quad\mathrm{rank}(\bar Z)\le\min\{d_2,d_1,d_0,r_X\},\tag{A.1}$$
where $\bar Y = (YV_X)_{:,1:r_X}$. The result follows using [106, Theorem 2.2].

We are now ready to show Theorem 2.6.1. The proof for the degenerate case proceeds by constructing a descent direction whenever the point is critical but not global. Let $(\bar W_2,\bar W_1)$ be a degenerate critical point, i.e., $\mathrm{rank}(\bar W_2\bar W_1)<\min\{d_2,d_1,d_0\}$. Then, based on the dimensions $d_0$, $d_1$, and $d_2$, one of the following cases holds:

- $d_2<d_1$: there exists $b\ne0$ such that $b\in\mathcal N(\bar W_2)$.
- $d_0<d_1$: there exists $b\ne0$ such that $b\in\mathcal N(\bar W_1^T)$.
- $d_1\le d_2$ and $d_1\le d_0$: either $\bar W_2$ is rank deficient and there exists $b\ne0$ with $b\in\mathcal N(\bar W_2)$, or $\bar W_1$ is rank deficient and there exists $b\ne0$ with $b\in\mathcal N(\bar W_1^T)$.

So in all cases either $\mathcal N(\bar W_2)$ or $\mathcal N(\bar W_1^T)$ is non-trivial (i.e., contains a non-zero vector). Let $\Delta = \bar W_2\bar W_1X-Y$. If $\Delta X^T=0$, then by convexity of the squared loss, the point $(\bar W_2,\bar W_1)$ is a global minimum of (2.13). Otherwise, there exists a pair $(i,j)$ such that $\langle X_{i,:},\Delta_{j,:}\rangle\ne0$. We now use first and second order optimality conditions to construct a descent direction when the current critical point is not global.

First order optimality condition: considering perturbations in the directions $A\in\mathbb{R}^{d_2\times d_1}$ and $B\in\mathbb{R}^{d_1\times d_0}$ for the optimization problem
$$\min_t\ \tfrac12\|(\bar W_2+tA)(\bar W_1+tB)X-Y\|^2,\tag{A.2}$$
we obtain the first order optimality condition
$$\langle A\bar W_1X+\bar W_2BX,\ \Delta\rangle = 0,\qquad\forall A\in\mathbb{R}^{d_2\times d_1},\ B\in\mathbb{R}^{d_1\times d_0}.$$

Second order optimality condition:
$$2\langle ABX,\Delta\rangle+\|A\bar W_1X+\bar W_2BX\|^2\ \ge\ 0\qquad\forall A\in\mathbb{R}^{d_2\times d_1},\ B\in\mathbb{R}^{d_1\times d_0}.$$

Suppose $(\bar W_2,\bar W_1)$ is a critical point and there exists $b\ne0$ with $b\in\mathcal N(\bar W_2)$. Define
$$B_{:,l}\triangleq\begin{cases}b & l=i\\ 0 & \text{otherwise}\end{cases},\qquad A_{l,:}\triangleq\begin{cases}\beta b^T & l=j\\ 0 & \text{otherwise}\end{cases},$$
where $\beta$ is a scalar constant. Then, using the second order optimality condition, for $c = \|A\bar W_1X\|^2$ we get
$$2\beta\underbrace{\|b\|^2}_{\ne0}\underbrace{\langle X_{i,:},\Delta_{j,:}\rangle}_{\ne0}+c\ \ge\ 0.$$
Since this would have to hold for every value of $\beta$, $b$ would have to be zero, which contradicts the choice of $b$; hence no local minimum can have $\mathcal N(\bar W_2)$ non-trivial, and the chosen $\beta$ yields a descent direction. Similarly, when $(\bar W_2,\bar W_1)$ is a critical point and there exists $a\ne0$ with $a^T\in\mathcal N(\bar W_1^T)$, we can show that $(\bar W_2,\bar W_1)$ is a second order saddle point of (2.13). Combining these results, every degenerate critical point that is not a global optimum is a second-order saddle point.

We now show the result for the non-degenerate case. Let $(\bar W_2,\bar W_1)$ be a non-degenerate local minimum, i.e., $\mathrm{rank}(\bar W_2\bar W_1)=\min\{d_2,d_1,d_0\}$. Then it follows by Lemma A.3.1 that the matrix multiplication $\mathcal M(\cdot,\cdot)$ is locally open at $(\bar W_2,\bar W_1)$. By Observation 2.3.1, $\bar Z=\bar W_2\bar W_1$ is a local optimum of problem (2.14), which is in fact global by Lemma A.1.1.

A.2 Proof of Corollary 2.6.2

Proof. We follow the same steps used in the proof of Theorem 2.6.1. As there, we obtain the first and second order optimality conditions
$$\langle A\bar W_1X+\bar W_2BX,\ \nabla\ell(\bar W_2\bar W_1X-Y)\rangle = 0\qquad\forall A\in\mathbb{R}^{d_2\times d_1},\ B\in\mathbb{R}^{d_1\times d_0},$$
$$2\langle ABX,\ \nabla\ell(\bar W_2\bar W_1X-Y)\rangle+h\big(A\bar W_1X,\ \bar W_2BX,\ \bar W_2\bar W_1X\big)\ \ge\ 0\qquad\forall A,\ B,$$
where $h(\cdot)$ is a function that has a tensor representation; we only need to know that it is a function of $A\bar W_1X$, $\bar W_2BX$, and $\bar W_2\bar W_1X$. If $\nabla\ell(\bar W_2\bar W_1X-Y)X^T=0$, then by convexity of $\ell(\cdot)$, $(\bar W_2,\bar W_1)$ is a global minimum. Otherwise, there exists $(i,j)$ such that $\langle X_{i,:},\ \nabla\ell(\bar W_2\bar W_1X-Y)_{j,:}\rangle\ne0$. Using the same argument as in the proof of Theorem 2.6.1, we choose $A$ and $B$ such that $h(A\bar W_1X,\bar W_2BX,\bar W_2\bar W_1X)$ is a constant that does not depend on $\beta$, and
$$\langle ABX,\ \nabla\ell(\bar W_2\bar W_1X-Y)\rangle = \beta\underbrace{\langle X_{i,:},\ \nabla\ell(\bar W_2\bar W_1X-Y)_{j,:}\rangle}_{\ne0}.$$
Then by a proper choice of $\beta$ we show that the point $(\bar W_2,\bar W_1)$ is a second order saddle point.

A.3 Proof of Lemma 2.7.1

We start by re-stating Theorem 3.1 of [106] using our notation.

Lemma A.3.1. If $W$ is non-degenerate, then $\mathcal M_{h,1}(W)=W_h\cdots W_1$ is locally open at $W$.

Proof. We construct a proof by induction on $h$. When $h=2$, we either have $d_1<\min\{d_2,d_0\}$ or $d_1\ge\min\{d_2,d_0\}$. In the first case,
$$d_1 = \mathrm{rank}(W_2W_1)\le\mathrm{rank}(W_1)\le d_1\ \Rightarrow\ \mathrm{rank}(W_1)=d_1,$$
and
$$d_1 = \mathrm{rank}(W_2W_1)\le\mathrm{rank}(W_2)\le d_1\ \Rightarrow\ \mathrm{rank}(W_2)=d_1.$$
Since $W_1$ is full row rank and $W_2$ is full column rank, Theorem 2.4.2 implies $\mathcal M_{2,1}(\cdot)$ is locally open at $(W_2,W_1)$. In the second case, either
$$d_2 = \mathrm{rank}(W_2W_1)\le\mathrm{rank}(W_2)\le d_2\ \Rightarrow\ \mathrm{rank}(W_2)=d_2,$$
or
$$d_0 = \mathrm{rank}(W_2W_1)\le\mathrm{rank}(W_1)\le d_0\ \Rightarrow\ \mathrm{rank}(W_1)=d_0.$$
Thus either $W_2$ is full row rank or $W_1$ is full column rank, and by Proposition 2.4.1, $\mathcal M_{2,1}(\cdot)$ is locally open at $(W_2,W_1)$. Now assume the result holds for the product of $h$ matrices $\mathcal M_{h,1}(W)$; we show it is true for $\mathcal M_{h+1,1}(W)$. Since
$$d_p = \mathrm{rank}(W_h\cdots W_1)\le\mathrm{rank}(W_{p+1}W_p)\le d_p\ \Rightarrow\ \mathrm{rank}(W_{p+1}W_p)=d_p,$$
Proposition 2.4.1 gives that $\mathcal M_{p+1,p}(\cdot)$ is locally open at $(W_{p+1},W_p)$. So we can replace $W_{p+1}W_p$ by a new matrix $Z_p$ with rank $d_p$. Then by the induction hypothesis, the product mapping $\mathcal M_{h+1,1}(W)=W_{h+1}\cdots W_{p+2}Z_pW_{p-1}\cdots W_1$ is locally open at $W$. Since the composition of locally open maps is locally open, the result follows.

We now show Lemma 2.7.1. Suppose $\bar W=(\bar W_h,\ldots,\bar W_1)$ is a non-degenerate local minimum. Then it follows by Lemma A.3.1 that $\mathcal M_{h,1}$ is locally open at $\bar W$. By Observation 2.3.1, $\bar Z=\mathcal M_h(\bar W_h,\ldots,\bar W_1)$ is a local optimum of problem (2.16), which is in fact global by Lemma A.1.1.
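The reduction in Lemma A.1.1 is easy to sanity-check numerically. The following minimal Python sketch (not part of the original proof; names and dimensions are illustrative) draws a random rank-deficient $X$ and verifies that $\|ZX-Y\|^2$ splits into the reduced term plus a constant that $Z$ cannot influence:

```python
import numpy as np

rng = np.random.default_rng(1)
d2, d0, n, rX = 4, 5, 7, 3
X = rng.standard_normal((d0, rX)) @ rng.standard_normal((rX, n))  # rank(X) = rX < d0
Z = rng.standard_normal((d2, d0))
Y = rng.standard_normal((d2, n))

U, svals, Vt = np.linalg.svd(X)               # X = U @ Sigma @ Vt
V = Vt.T
lhs = np.linalg.norm(Z @ X - Y)**2
Sig_r = U[:, :rX] * svals[:rX]                # U_X (Sigma_X)_{:,1:rX}, full column rank
reduced = np.linalg.norm(Z @ Sig_r - (Y @ V)[:, :rX])**2
const = np.linalg.norm((Y @ V)[:, rX:])**2    # term that is constant in problem (2.14)
print(np.isclose(lhs, reduced + const))       # True
```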
A.4 Proof of Theorem 2.4.2

Before proceeding to the proof of Theorem 2.4.2, we need to state and prove a few lemmas.

Lemma A.4.1. Let $V\in\mathbb{R}^{m\times n}$ be a matrix with $\mathrm{rank}(V)=r<m$. Then there exist an index set $\mathcal B=\{i_1,\ldots,i_r\}\subseteq\{1,\ldots,m\}$ and a matrix $A\in\mathbb{R}^{(m-r)\times r}$ such that
$$\|A\|_\infty = \max_{i,j}|A_{ij}|\le 2^{m-r-1}\qquad\text{and}\qquad V_{\mathcal B^c} = AV_{\mathcal B},$$
where $V_{\mathcal B}\in\mathbb{R}^{r\times n}$ is the matrix with rows $\{V_{i,:}\}_{i\in\mathcal B}$ and $V_{\mathcal B^c}\in\mathbb{R}^{(m-r)\times n}$ is the matrix with rows $\{V_{i,:}\}_{i\in\mathcal B^c}$.

Notice that in the above lemma, the bound on the norm of the matrix $A$ is independent of the dimension $n$ and of the choice of the matrix $V$.

Proof. To ease the notation, we denote the $i$-th row of $V$ by $v_i$. We use induction on $m$ to show that there exists a basis $\mathcal B=\{i_1,\ldots,i_r\}$ and vectors $a_j\in\mathbb{R}^r$ such that for all $j\in\mathcal B^c$, $v_j=\sum_{i\in\mathcal B}a_{j,i}v_i$ with $|a_{j,i}|\le2^{m-r-1}$ for all $i\in\mathcal B$.

Induction base case $m=r+1$: Without loss of generality, assume $\mathcal B=\{1,\ldots,r\}$. Since the case $v_{r+1}=0$ trivially holds, we consider $v_{r+1}\ne0$. By the property of a basis, there exists a non-zero vector $a_{r+1}\in\mathbb{R}^r$ such that $v_{r+1}=\sum_{i=1}^r a_{r+1,i}v_i$. Let $i^*=\arg\max_{i\in\mathcal B}|a_{r+1,i}|$. If $|a_{r+1,i^*}|\le1$, the induction hypothesis is true. Otherwise, when $|a_{r+1,i^*}|>1$, we have
$$v_{i^*} = \underbrace{\frac{1}{a_{r+1,i^*}}}_{\bar a_{r+1,r+1}}v_{r+1}-\sum_{i=1,\,i\ne i^*}^r\underbrace{\frac{a_{r+1,i}}{a_{r+1,i^*}}}_{\bar a_{r+1,i}}v_i = \sum_{i\in\bar{\mathcal B}}\bar a_{r+1,i}v_i,$$
where $\bar{\mathcal B}=(\mathcal B\cup\{r+1\})\setminus\{i^*\}$; i.e., we remove item $i^*$ from $\mathcal B$ and include item $r+1$ instead. Since $|\bar a_{r+1,i}|\le1$, the induction base case holds.

Inductive step: Assume the induction hypothesis is true for $m>r$; we show it is also true for $m+1$. Without loss of generality $\mathcal B=\{1,\ldots,r\}$. By the induction hypothesis, $v_j=\sum_{i=1}^r a_{j,i}v_i$ with $|a_{j,i}|\le2^{m-r-1}$ for all $j\in\{r+1,\ldots,m\}$. Since the case $v_{m+1}=0$ trivially holds, we consider $v_{m+1}\ne0$. Since $\mathcal B$ is a basis, there exists $a_{m+1}\ne0$ such that $v_{m+1}=\sum_{i=1}^r a_{m+1,i}v_i$. Let $i^*=\arg\max_{i\in\mathcal B}|a_{m+1,i}|$. If $|a_{m+1,i^*}|\le2^{m-r}$, the induction step is done. Otherwise, for $|a_{m+1,i^*}|>2^{m-r}$ we have
$$v_{i^*} = \underbrace{\frac{1}{a_{m+1,i^*}}}_{\bar a_{m+1,m+1}}v_{m+1}-\sum_{i=1,\,i\ne i^*}^r\underbrace{\frac{a_{m+1,i}}{a_{m+1,i^*}}}_{\bar a_{m+1,i}}v_i = \sum_{i\in\bar{\mathcal B}}\bar a_{m+1,i}v_i,$$
where $\bar{\mathcal B}=(\mathcal B\cup\{m+1\})\setminus\{i^*\}$, and clearly $|\bar a_{m+1,i}|\le1$ for all $i\in\bar{\mathcal B}$ by the definition of $i^*$. For all $j\in\{r+1,\ldots,m\}$,
$$v_j = \sum_{i=1,\,i\ne i^*}^r\Big(\underbrace{a_{j,i}-\frac{a_{j,i^*}a_{m+1,i}}{a_{m+1,i^*}}}_{\bar a_{j,i}}\Big)v_i+\underbrace{\frac{a_{j,i^*}}{a_{m+1,i^*}}}_{\bar a_{j,m+1}}v_{m+1} = \sum_{i\in\bar{\mathcal B}}\bar a_{j,i}v_i.$$
It remains to show that $|\bar a_{j,i}|\le2^{m-r}$ for all $i\in\bar{\mathcal B}$, $j\in\{r+1,\ldots,m\}$. Let us first consider $i\in\bar{\mathcal B}\setminus\{m+1\}$ and $j\in\{r+1,\ldots,m\}$:
$$|\bar a_{j,i}|\le|a_{j,i}|+\Big|\frac{a_{j,i^*}a_{m+1,i}}{a_{m+1,i^*}}\Big|\le2^{m-r-1}+2^{m-r-1}\Big|\frac{a_{m+1,i}}{a_{m+1,i^*}}\Big|\le2^{m-r},$$
where the first inequality holds by the triangle inequality, the second by the induction hypothesis, and the last by the definition of $i^*$. For $i=m+1$, $|\bar a_{j,m+1}|=\big|a_{j,i^*}/a_{m+1,i^*}\big|\le2^{m-r-1}/|a_{m+1,i^*}|\le2^{m-r}$. This concludes the inductive step and completes our proof.

Lemma A.4.2. Let $W_1\in\mathbb{R}^{m\times k}$ and $W_2\in\mathbb{R}^{k\times n}$. Assume further that $W_1W_2=U\Sigma V^T$ is a singular value decomposition of the matrix product $W_1W_2$ with $U\in\mathbb{R}^{m\times m}$, $V\in\mathbb{R}^{n\times n}$, and $\Sigma\in\mathbb{R}^{m\times n}$. Then $\mathcal M(\cdot,\cdot)$ is locally open at $(W_1,W_2)$ if and only if $\mathcal M(\cdot,\cdot)$ is locally open at $(U^TW_1,\ W_2V)$.

The proof of this lemma is a direct consequence of the definition of local openness.

Lemma A.4.3. Let $W_1\in\mathbb{R}^{m\times k}$ and $W_2\in\mathbb{R}^{k\times n}$. Assume further that $W_1W_2=U\Sigma V^T$ is a singular value decomposition of the matrix product $W_1W_2$ with $U\in\mathbb{R}^{m\times m}$, $V\in\mathbb{R}^{n\times n}$, and $\Sigma\in\mathbb{R}^{m\times n}$. Define $\bar W_1\triangleq U^TW_1$ and $\bar W_2\triangleq W_2V$. Then condition (A) below holds true if and only if condition (B) is true; similarly, condition (C) is true if and only if condition (D) is true.

(A) $\exists\,\widehat W_1\in\mathbb{R}^{m\times k}$ such that $\widehat W_1W_2=0$ and $W_1+\widehat W_1$ is full column rank.
(B) $\exists\,\widetilde W_1\in\mathbb{R}^{m\times k}$ such that $\widetilde W_1\bar W_2=0$ and $\bar W_1+\widetilde W_1$ is full column rank.
(C) $\exists\,\widehat W_2\in\mathbb{R}^{k\times n}$ such that $W_1\widehat W_2=0$ and $W_2+\widehat W_2$ is full row rank.
(D) $\exists\,\widetilde W_2\in\mathbb{R}^{k\times n}$ such that $\bar W_1\widetilde W_2=0$ and $\bar W_2+\widetilde W_2$ is full row rank.

Proof. Setting $\widetilde W_1=U^T\widehat W_1$ and $\widetilde W_2=\widehat W_2V$ leads to the desired result.

Lemma A.4.2 and Lemma A.4.3 imply that for proving Theorem 2.4.2, without loss of generality, we can assume that the product $W_1W_2$ is a diagonal matrix. We next show in Lemma A.4.4 that if $k<\min\{m,n\}$ and $\mathrm{rank}(W_1)=\mathrm{rank}(W_2)$, then statements i, ii, iii, and iv in Theorem 2.4.2 are all equivalent.

Lemma A.4.4. Let $W_1\in\mathbb{R}^{m\times k}$, $W_2\in\mathbb{R}^{k\times n}$ with $\mathrm{rank}(W_1)=\mathrm{rank}(W_2)$. Assume further that $k<\min\{m,n\}$. Then the following conditions are equivalent:

i) $\exists\,\widetilde W_1\in\mathbb{R}^{m\times k}$ such that $\widetilde W_1W_2=0$ and $W_1+\widetilde W_1$ is full column rank.
ii) $\exists\,\widetilde W_2\in\mathbb{R}^{k\times n}$ such that $W_1\widetilde W_2=0$ and $W_2+\widetilde W_2$ is full row rank.
iii) $\dim\big(\mathcal N(W_1)\cap\mathcal C(W_2)\big)=0$.
iv) $\dim\big(\mathcal N(W_2^T)\cap\mathcal C(W_1^T)\big)=0$.

Proof. To prove the desired result we show the equivalences ii $\Leftrightarrow$ iii and i $\Leftrightarrow$ iv, and then complete the proof by showing iii $\Leftrightarrow$ iv.

We first show the direction "ii $\Rightarrow$ iii". Consider $W_1\in\mathbb{R}^{m\times k}$, $W_2\in\mathbb{R}^{k\times n}$, both of rank $r$. Suppose ii holds; then $\mathcal C(\widetilde W_2)\subseteq\mathcal N(W_1)$, which implies
$$\mathrm{rank}(\widetilde W_2)\le\dim\big(\mathcal N(W_1)\big)=k-r.\tag{A.3}$$
Also,
$$k = \mathrm{rank}(W_2+\widetilde W_2)\le\mathrm{rank}(W_2)+\mathrm{rank}(\widetilde W_2)=r+\mathrm{rank}(\widetilde W_2).$$
This inequality combined with (A.3) implies $\mathrm{rank}(\widetilde W_2)=k-r$. Note that $\dim\mathcal C(\widetilde W_2)=\dim\mathcal N(W_1)$ and $\mathcal C(\widetilde W_2)\subseteq\mathcal N(W_1)$, which implies $\mathcal C(\widetilde W_2)=\mathcal N(W_1)$. Then, since $\mathrm{rank}(W_2+\widetilde W_2)=\mathrm{rank}(W_2)+\mathrm{rank}(\widetilde W_2)$, we get
$$\{0\}=\mathcal C(\widetilde W_2)\cap\mathcal C(W_2)=\mathcal N(W_1)\cap\mathcal C(W_2)\ \Rightarrow\ \dim\big(\mathcal N(W_1)\cap\mathcal C(W_2)\big)=0.$$

We now show the other direction "ii $\Leftarrow$ iii". Without loss of generality, let $W_2=[\,W_2^0\ \ W_2^0A\,]$, where the columns of $W_2^0\in\mathbb{R}^{k\times r}$ are linearly independent and $A\in\mathbb{R}^{r\times(n-r)}$, and let $\widetilde W_2=\epsilon\,[\,w^1,\ldots,w^{k-r},0,\ldots,0\,]\in\mathbb{R}^{k\times n}$ be a rank $k-r$ matrix where the $w^i$ form a unit basis of $\mathcal N(W_1)$, which yields $\mathcal C(\widetilde W_2)=\mathcal N(W_1)$. Then, since $\dim(\mathcal N(W_1)\cap\mathcal C(W_2))=0$, we get $\mathrm{rank}(W_2+\widetilde W_2)=k$ for a generic choice of $\epsilon$. This completes the proof of the equivalence. Note that by setting $W_1=W_2^T$ and $W_2=W_1^T$, the same proof can be used to show i $\Leftrightarrow$ iv.

Next, we prove the equivalence iii $\Leftrightarrow$ iv. Notice that
$$\dim\,\mathrm{span}\big(\mathcal N(W_1)\cup\mathcal C(W_2)\big) = \dim\mathcal N(W_1)+\dim\mathcal C(W_2)-\dim\big(\mathcal N(W_1)\cap\mathcal C(W_2)\big) = k-r+r-\dim\big(\mathcal N(W_1)\cap\mathcal C(W_2)\big) = k-\dim\big(\mathcal N(W_1)\cap\mathcal C(W_2)\big).$$
Thus,
$$\dim\big(\mathcal N(W_1)\cap\mathcal C(W_2)\big)\ne0\ \Leftrightarrow\ \dim\,\mathrm{span}\big(\mathcal N(W_1)\cup\mathcal C(W_2)\big)<k\ \Leftrightarrow\ \exists\,a\ne0\ \text{such that}\ a\perp\mathcal C(W_2)\ \text{and}\ a\perp\mathcal N(W_1)\ \Leftrightarrow\ \exists\,a\ne0\ \text{such that}\ a\in\mathcal N(W_2^T)\ \text{and}\ a\in\mathcal C(W_1^T)\ \Leftrightarrow\ \dim\big(\mathcal N(W_2^T)\cap\mathcal C(W_1^T)\big)\ne0,$$
which completes the proof.
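Conditions iii and iv of Lemma A.4.4 reduce to rank computations, which makes the equivalence easy to probe numerically. A minimal sketch, assuming generic random factors of equal rank (the helper `dim_intersection` is an illustrative name, not from the dissertation):

```python
import numpy as np
from scipy.linalg import null_space, orth

def dim_intersection(S, T):
    # dim(span(S) ∩ span(T)) = dim(S) + dim(T) - dim(S + T), for basis columns S, T
    return S.shape[1] + T.shape[1] - np.linalg.matrix_rank(np.hstack([S, T]))

rng = np.random.default_rng(0)
m, k, n, r = 6, 4, 7, 2                       # k < min(m, n), rank(W1) = rank(W2) = r
W1 = rng.standard_normal((m, r)) @ rng.standard_normal((r, k))
W2 = rng.standard_normal((k, r)) @ rng.standard_normal((r, n))

cond_iii = dim_intersection(null_space(W1), orth(W2))      # dim(N(W1) ∩ C(W2))
cond_iv  = dim_intersection(null_space(W2.T), orth(W1.T))  # dim(N(W2^T) ∩ C(W1^T))
print(cond_iii, cond_iv)   # equal (generically both 0), as Lemma A.4.4 asserts
```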
Lemma A.4.5. Let $W_1\in\mathbb{R}^{m\times k}$, $W_2\in\mathbb{R}^{k\times n}$ with $k<\min\{m,n\}$, and let $r\triangleq\mathrm{rank}(W_1W_2)$. Assume further that $W_1W_2=U\Sigma V^T$ is an SVD of $W_1W_2$ with $U\in\mathbb{R}^{m\times m}$, $V\in\mathbb{R}^{n\times n}$, and $\Sigma\in\mathbb{R}^{m\times n}$. If

i) $\exists\,\widetilde W_1\in\mathbb{R}^{m\times k}$ such that $\widetilde W_1W_2=0$ and $W_1+\widetilde W_1$ is full column rank, and
ii) $\exists\,\widetilde W_2\in\mathbb{R}^{k\times n}$ such that $W_1\widetilde W_2=0$ and $W_2+\widetilde W_2$ is full row rank,

then $\mathrm{rank}(W_1)=\mathrm{rank}(W_2)$, $W_2V_{:,r+1:n}=0$, and $(U^TW_1)_{r+1:m,:}=0$.

Proof. Suppose that ii) holds; then $\mathcal C(\widetilde W_2)\subseteq\mathcal N(W_1)$, so
$$\mathrm{rank}(\widetilde W_2)\le\dim\mathcal N(W_1)=k-\mathrm{rank}(W_1).\tag{A.4}$$
Also, $k=\mathrm{rank}(W_2+\widetilde W_2)\le\mathrm{rank}(W_2)+\mathrm{rank}(\widetilde W_2)$. This inequality combined with (A.4) implies
$$k-\mathrm{rank}(W_2)\le\mathrm{rank}(\widetilde W_2)\le k-\mathrm{rank}(W_1)\ \Rightarrow\ \mathrm{rank}(W_2)\ge\mathrm{rank}(W_1).\tag{A.5}$$
Similarly, condition i) implies $\mathrm{rank}(W_1)\ge\mathrm{rank}(W_2)$. Combined with (A.5), we obtain $\mathrm{rank}(W_1)=\mathrm{rank}(W_2)$. Therefore, Lemma A.4.4 implies $\dim(\mathcal N(W_1)\cap\mathcal C(W_2))=0$. It follows from the SVD of $W_1W_2$ that $(U^TW_1W_2V)_{:,r+1:n}=\Sigma_{:,r+1:n}=0$, or equivalently $W_1W_2V_{:,r+1:n}=0$. On the other hand, since $\mathcal C(W_2V_{:,r+1:n})\subseteq\mathcal C(W_2)$ and $\mathcal N(W_1)\cap\mathcal C(W_2)=\{0\}$, we have $W_2V_{:,r+1:n}=0$. Similarly, we can show that $(U^TW_1)_{r+1:m,:}=0$.

Proposition A.4.6. Let $\mathcal M(W_1,W_2)=W_1W_2$ be the matrix product mapping with $W_1\in\mathbb{R}^{m\times k}$, $W_2\in\mathbb{R}^{k\times n}$, and $k<\min\{m,n\}$. Then $\mathcal M(\cdot,\cdot)$ is locally open in its range $\mathcal R_{\mathcal M}\triangleq\{Z\in\mathbb{R}^{m\times n}:\ \mathrm{rank}(Z)\le k\}$ at the point $(\bar W_1,\bar W_2)$ if and only if the following two conditions are satisfied:

i) $\exists\,\widetilde W_1\in\mathbb{R}^{m\times k}$ such that $\widetilde W_1\bar W_2=0$ and $\bar W_1+\widetilde W_1$ is full column rank; and
ii) $\exists\,\widetilde W_2\in\mathbb{R}^{k\times n}$ such that $\bar W_1\widetilde W_2=0$ and $\bar W_2+\widetilde W_2$ is full row rank.

Proof. First of all, according to Lemma A.4.2 and Lemma A.4.3, without loss of generality we can assume that the matrix product $\bar W_1\bar W_2$ is of diagonal form. Let us start by proving the "only if" direction. Notice that the result clearly holds when $\mathrm{rank}(\bar W_1)=\mathrm{rank}(\bar W_2)=k$, by choosing $\widetilde W_1=\widetilde W_2=0$. Moreover, the mapping $\mathcal M(\cdot,\cdot)$ cannot be locally open if only one of the matrices $\bar W_1$ or $\bar W_2$ is rank deficient. To see this, assume that $\bar W_1$ is full column rank while $\bar W_2$ is rank deficient, and assume further that $\mathcal M(\cdot,\cdot)$ is locally open at $(\bar W_1,\bar W_2)$. It follows from the definition of openness that the mapping $\mathcal M^1(W_1,W_2^1)\triangleq W_1W_2^1$ is locally open at $(\bar W_1,\bar W_2^1)$, where $\bar W_2^1\triangleq(\bar W_2)_{:,1:k}$ contains only the first $k$ columns of $\bar W_2$. Since the range of the mapping $\mathcal M^1$ at $(\bar W_1,\bar W_2^1)$ is the entire space $\mathbb{R}^{m\times k}$, Proposition 2.4.1 implies that either

- there exists $\widetilde W_1$ such that $\widetilde W_1\bar W_2^1=0$ and $\bar W_1+\widetilde W_1$ is full row rank; or
- there exists $\widetilde W_2^1$ such that $\bar W_1\widetilde W_2^1=0$ and $\bar W_2^1+\widetilde W_2^1$ is full rank.

Since $\bar W_1\in\mathbb{R}^{m\times k}$ and $m>k$, it is impossible for $\bar W_1+\widetilde W_1$ to be full row rank. On the other hand, since $\bar W_1$ is full column rank, $\bar W_1\widetilde W_2^1=0$ implies $\widetilde W_2^1=0$, and hence $\bar W_2^1+\widetilde W_2^1$ is not full rank. Hence neither of the two conditions can hold, and consequently $\mathcal M(\cdot,\cdot)$ cannot be open at the point $(\bar W_1,\bar W_2)$ in this case. Similarly, we can show that when $\bar W_1$ is rank deficient and $\bar W_2$ is full row rank, the mapping $\mathcal M(\cdot,\cdot)$ cannot be locally open. Hence, if $\bar W_1$ and $\bar W_2$ are not both full rank, then they both must be rank deficient.

Assume that the matrices $\bar W_1$ and $\bar W_2$ are both rank deficient and $\mathcal M(\cdot,\cdot)$ is locally open at $(\bar W_1,\bar W_2)$. It follows that $\mathcal M^1(W_1,W_2^1)\triangleq W_1W_2^1$ is locally open at $(\bar W_1,\bar W_2^1)$. By Proposition 2.4.1, and since there does not exist $\widetilde W_1$ such that $\bar W_1+\widetilde W_1$ is full row rank, there must exist $\widetilde W_2^1$ such that $\bar W_1\widetilde W_2^1=0$ and $\bar W_2^1+\widetilde W_2^1$ is full rank. Defining $\widetilde W_2\triangleq[\,\widetilde W_2^1\ \ 0\,]$, we satisfy the desired condition ii). Similarly, by looking at the transpose of the mapping $\mathcal M$, we can show that condition i) is true when $\mathcal M$ is locally open.

We now prove the "if" direction. Suppose i) and ii) hold. Let $\bar\Sigma=\bar W_1\bar W_2=[\,\bar\Sigma_{:,1:r}\ \ 0\,]$ be a rank $r$ matrix. Lemma A.4.5 implies that $\mathrm{rank}(\bar W_1)=\mathrm{rank}(\bar W_2)$ and that the last $n-r$ columns of $\bar W_2$ are all zero. We need to show that for any given $\epsilon>0$ there exists $\delta>0$ such that
$$\mathcal B_\delta(\bar W_1\bar W_2)\cap\mathcal R_{\mathcal M}\ \subseteq\ \mathcal M\big(\mathcal B_\epsilon(\bar W_1),\ \mathcal B_\epsilon(\bar W_2)\big).$$
Consider a perturbed matrix $\widetilde\Sigma\in\mathcal B_\delta(\bar\Sigma)\cap\mathcal R_{\mathcal M}$; we show $\widetilde\Sigma\in\mathcal M(\mathcal B_\epsilon(\bar W_1),\mathcal B_\epsilon(\bar W_2))$. Without loss of generality, and by permuting the columns of $\widetilde\Sigma$ if necessary, $\widetilde\Sigma$ can be expressed as
$$\widetilde\Sigma = \Big[\underbrace{\bar\Sigma_{:,1:r}+R_1^\delta}_{m\times r}\ \ \underbrace{R_2^\delta}_{m\times(k-r)}\ \ \underbrace{(\bar\Sigma_{:,1:r}+R_1^\delta)A_1+R_2^\delta A_2}_{m\times(n-k)}\Big].$$
Here $A_1\in\mathbb{R}^{r\times(n-k)}$ and $A_2\in\mathbb{R}^{(k-r)\times(n-k)}$ exist since $\mathrm{rank}(\widetilde\Sigma)\le k$. Moreover, $\|\widetilde\Sigma-\bar\Sigma\|\le\delta$ implies that the perturbation matrix $R^\delta\triangleq[\,R_1^\delta\ \ R_2^\delta\ \ (\bar\Sigma_{:,1:r}+R_1^\delta)A_1+R_2^\delta A_2\,]$ has norm at most $\delta$, i.e., $\|R^\delta\|\le\delta$.

Since $\mathrm{rank}(\bar W_2+\widetilde W_2)=k$, there exists a unit basis set $\{\widetilde w_2^1,\ldots,\widetilde w_2^{k-r}\}$ for $\widetilde W_2$ such that $\mathrm{span}\{\widetilde w_2^1,\ldots,\widetilde w_2^{k-r}\}\cap\mathcal C(\bar W_2)=\{0\}$. Define
$$\widetilde W_2^1\triangleq\frac{\delta}{n2^{n+1}}\Big[\underbrace{0}_{k\times r}\ \ \underbrace{\widetilde w_2^1\ \cdots\ \widetilde w_2^{k-r}}_{k\times(k-r)}\Big],\tag{A.6}$$
and let us form the matrix $\bar W_2^1\in\mathbb{R}^{k\times k}$ using the first $k$ columns of $\bar W_2$. Since the last $n-r$ columns of $\bar W_2$ are zero, $\widetilde W_2^1+\bar W_2^1$ is a full rank $k\times k$ matrix and $\bar W_1\widetilde W_2^1=0$. Let us define
$$W_1'\triangleq[\,R_1^\delta\ \ R_2^\delta\,](\widetilde W_2^1+\bar W_2^1)^{-1},\qquad W_2'\triangleq\Big[\widetilde W_2^1\ \ \ (\bar W_2^1+\widetilde W_2^1)_{:,1:r}A_1+(\bar W_2^1+\widetilde W_2^1)_{:,r+1:k}A_2\Big].$$
Using these definitions, the first $k$ columns of the product satisfy
$$(\bar W_1+W_1')(\bar W_2^1+\widetilde W_2^1) = \bar\Sigma_{:,1:k}+\underbrace{\bar W_1\widetilde W_2^1}_{=0}+[\,R_1^\delta\ \ R_2^\delta\,](\bar W_2^1+\widetilde W_2^1)^{-1}(\bar W_2^1+\widetilde W_2^1) = [\,\bar\Sigma_{:,1:r}+R_1^\delta\ \ R_2^\delta\,],$$
and hence, expanding the remaining $n-k$ columns the same way,
$$(\bar W_1+W_1')(\bar W_2+W_2') = \bar W_1\bar W_2+R^\delta = \widetilde\Sigma.\tag{A.7}$$

To complete the proof, it remains to show that for any $\epsilon>0$ we can choose $\delta$ small enough that $\|W_1'\|\le\epsilon$ and $\|W_2'\|\le\epsilon$; in other words, $\widetilde\Sigma\in\mathcal M(\mathcal B_\epsilon(\bar W_1),\mathcal B_\epsilon(\bar W_2))$. Let $\tilde r$, with $k\ge\tilde r\ge r$, be the rank of $\widetilde\Sigma$. According to Lemma A.4.1 and by possibly permuting the columns, $\widetilde\Sigma$ can be expressed as $\widetilde\Sigma=[\,\widetilde\Sigma_1\ \ \widetilde\Sigma_1\bar A\,]$, where $\widetilde\Sigma_1\in\mathbb{R}^{m\times\tilde r}$ is full column rank and $\bar A$ has a bounded norm $\|\bar A\|\le n2^{n-\tilde r-1}$. Notice that for given $W_1'$ and $W_2'$ satisfying (A.7), permuting the columns of $\widetilde\Sigma$ corresponds to permuting the columns of $(\bar W_2+W_2')$. If we can show that the first $r$ columns are not among the permuted ones, then using the fact that $\bar W_2$ has only its first $r$ columns non-zero, it follows that the permutation of the columns of $\widetilde\Sigma$ corresponds to the same permutation of the columns of $W_2'$. Moreover, if the first $r$ columns are not among the permuted ones, then without loss of generality we can express the perturbed matrix as
$$\widetilde\Sigma = \big[\,\bar\Sigma_{:,1:r}+R_1^\delta\ \ \ R_2^\delta\ \ \ (\bar\Sigma_{:,1:r}+R_1^\delta)\bar A_1+R_2^\delta\bar A_2\,\big],$$
with perturbation matrix $R^\delta=[\,R_1^\delta\ \ R_2^\delta\ \ (\bar\Sigma_{:,1:r}+R_1^\delta)\bar A_1+R_2^\delta\bar A_2\,]$, where $\begin{bmatrix}\bar A_1\\ \bar A_2\end{bmatrix}=\bar A$ has a bounded norm.

We now show that the first $r$ columns of $\widetilde\Sigma$ are not among the permuted columns. Assume the contrary; then there exists at least one column $\bar\Sigma_{:,j}+(R_1^\delta)_{:,j}$ with $j\le r$ that is not a column of $\widetilde\Sigma_1$ and is thus a column of $\widetilde\Sigma_1\bar A$. Without loss of generality, let $\bar\Sigma_{:,j}+(R_1^\delta)_{:,j}=\widetilde\Sigma_1\bar A_{:,1}$. It follows that $\bar\Sigma_{j,j}+(R_1^\delta)_{j,j}=(\widetilde\Sigma_1)_{j,:}\bar A_{:,1}$. But since $\bar\Sigma_{j,j}+(R_1^\delta)_{j,j}$ is a non-zero perturbed singular value, and since the elements of $(\widetilde\Sigma_1)_{j,:}$ are all of order $\delta$, then by choosing $\delta$ sufficiently small we force $\|\bar A\|>n2^{n-\tilde r-1}$, which contradicts the bound we have on $\bar A$.

We now obtain an upper bound on $\|W_2'\|$. Since the norm of $\bar A$ is bounded, the norm of $\bar A_2$ is also bounded by the constant $K\triangleq n2^n>n2^{n-\tilde r-1}$. Hence,
$$\|R^\delta\|\ \ge\ \|(\bar\Sigma_{:,1:r}+R_1^\delta)\bar A_1+R_2^\delta\bar A_2\|\ \ge\ \|(\bar\Sigma_{:,1:r}+R_1^\delta)\bar A_1\|-\|R_2^\delta\bar A_2\|\ \ge\ \|(\bar\Sigma_{:,1:r}+R_1^\delta)\bar A_1\|-\delta K\ \ge\ \frac{\sigma_{\min}}{2}\|\bar A_1\|-\delta K,$$
where $\sigma_{\min}$ is the minimum singular value of the full column rank matrix $\bar\Sigma_{:,1:r}$, which is bounded away from zero; here we have chosen $\delta<\sigma_{\min}/2$ so that $\|(\bar\Sigma_{:,1:r}+R_1^\delta)\bar A_1\|\ge\frac{\sigma_{\min}}{2}\|\bar A_1\|$. Rearranging the terms, we obtain $\|\bar A_1\|\le\frac{2\delta(1+K)}{\sigma_{\min}}$. Thus, for the constant $C\triangleq\|\bar W_2^1\|$, we obtain
$$\|W_2'\|^2\ \le\ \|\widetilde W_2^1\|^2+\|\bar W_2^1\|^2\|\bar A_1\|^2+\|\widetilde W_2^1\|^2\|\bar A_2\|^2\ \le\ \frac{\delta^2}{4n^22^{2n}}+\delta^2C^2\Big(\frac{2+2K}{\sigma_{\min}}\Big)^2+\frac{\delta^2K^2}{4n^22^{2n}}\ \le\ \frac{\delta^2}{2}+\delta^2C^2\Big(\frac{2+2K}{\sigma_{\min}}\Big)^2,$$
where the first inequality holds by Cauchy-Schwarz and the triangle inequality. Thus, for a given $\epsilon>0$, we can choose
$$\delta\ \le\ \min\left\{\frac{\epsilon}{1+\max\Big\{\|(\bar W_2^1+\widetilde W_2^1)^{-1}\|,\ \sqrt2\,C\Big(\dfrac{2+2K}{\sigma_{\min}}\Big)\Big\}},\ \ \frac{\sigma_{\min}}{2}\right\}.$$
This choice of $\delta$ leads to $\|W_2'\|\le\epsilon$. Moreover,
$$\|W_1'\|\ \le\ \|R^\delta\|\,\|(\bar W_2^1+\widetilde W_2^1)^{-1}\|\ \le\ \delta\,\|(\bar W_2^1+\widetilde W_2^1)^{-1}\|\ \le\ \frac{\epsilon\,\|(\bar W_2^1+\widetilde W_2^1)^{-1}\|}{1+\|(\bar W_2^1+\widetilde W_2^1)^{-1}\|}\ \le\ \epsilon,$$
which completes the proof.

We now use Proposition A.4.6, Lemma A.4.4, and Lemma A.4.5 to complete the proof of Theorem 2.4.2.

Proof. First of all, if $\mathcal M(\cdot,\cdot)$ is locally open at $(\bar W_1,\bar W_2)$, then according to Proposition A.4.6 the conditions i) and ii) must hold, and hence $\mathrm{rank}(\bar W_1)=\mathrm{rank}(\bar W_2)$ due to Lemma A.4.5. Thus $\mathcal M(\cdot,\cdot)$ cannot be locally open if $\mathrm{rank}(\bar W_1)\ne\mathrm{rank}(\bar W_2)$. On the other hand, when $\mathrm{rank}(\bar W_1)=\mathrm{rank}(\bar W_2)$, the conditions i), ii), iii), and iv) are equivalent due to Lemma A.4.4. Moreover, these conditions imply local openness according to Proposition A.4.6.
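The certificate $\widetilde W_2$ used in the "ii $\Leftarrow$ iii" step of Lemma A.4.4 above is also simple to build in code. The sketch below is illustrative (the fixed scale $10^{-6}$ stands in for the generic $\epsilon$): it pads $W_2$ with a basis of $\mathcal N(W_1)$ and checks the two requirements of condition ii of Theorem 2.4.2.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(3)
m, k, n, r = 6, 4, 7, 2
W1 = rng.standard_normal((m, r)) @ rng.standard_normal((r, k))
W2 = rng.standard_normal((k, r)) @ rng.standard_normal((r, n))

N = null_space(W1)                         # k x (k-r) orthonormal basis of N(W1)
W2_tilde = np.zeros((k, n))
W2_tilde[:, :N.shape[1]] = 1e-6 * N        # epsilon * [w^1, ..., w^{k-r}, 0, ..., 0]

print(np.linalg.norm(W1 @ W2_tilde))               # ~0: W1 @ W2_tilde = 0
print(np.linalg.matrix_rank(W2 + W2_tilde) == k)   # True: W2 + W2_tilde full row rank
```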
A.5 Proof of Theorem 2.4.3

Before proceeding with the proof of Theorem 2.4.3, we recall the definition of the symmetric matrix multiplication mapping $\mathcal M_+:\mathbb{R}^{n\times k}\mapsto\mathcal R_{\mathcal M_+}$ with
$$\mathcal M_+(W)\triangleq WW^T,$$
where $\mathcal R_{\mathcal M_+}\triangleq\{Z\in\mathbb{R}^{n\times n}\mid Z\succeq0,\ \mathrm{rank}(Z)\le k\}$. In this section we show that $\mathcal M_+$ is open in $\mathcal R_{\mathcal M_+}$. In particular, we show that given a matrix $W\in\mathbb{R}^{n\times k}$ and a small perturbation $\widetilde Z\in\mathcal R_{\mathcal M_+}$ of $Z\triangleq WW^T$, there exists a small perturbation $\widetilde W$ of $W$ such that $\widetilde Z=\widetilde W\widetilde W^T$. Similar to the previous proof scheme, we first show that local openness of $\mathcal M_+(\cdot)$ at $W$ is equivalent to local openness of $\mathcal M_+(\cdot)$ at $U^TW$, where $U\Sigma U^T$ is a symmetric singular value decomposition of the product $WW^T$.

Lemma A.5.1. Consider $W\in\mathbb{R}^{n\times k}$ and assume that $WW^T=U\Sigma U^T$ is a symmetric singular value decomposition of the matrix product $WW^T$ with $U\in\mathbb{R}^{n\times n}$ and $\Sigma\in\mathbb{R}^{n\times n}$. Then $\mathcal M_+(\cdot)$ is locally open at $W$ if and only if $\mathcal M_+(\cdot)$ is locally open at $U^TW$.

The proof of this lemma is a direct consequence of the definition of local openness. According to Lemma A.5.1, proving local openness of $\mathcal M_+(\cdot)$ at $W$ is equivalent to proving local openness of $\mathcal M_+(\cdot)$ at $U^TW$. To ease the notation, denote $U^TW$ by $\bar W$. Notice that when $\bar W\in\mathbb{R}^{n\times n}$ is a full rank square matrix, for any symmetric perturbation $R^\delta$ with $\|R^\delta\|$ sufficiently small, $\widetilde\Sigma=\bar W\bar W^T+R^\delta$ is a full rank symmetric positive definite matrix. Then finding a perturbation $\bar W+A^\epsilon$ of $\bar W$ such that $(\bar W+A^\epsilon)(\bar W+A^\epsilon)^T=\widetilde\Sigma$ is equivalent to solving the matrix equation
$$A^\epsilon\bar W^T+\bar W(A^\epsilon)^T+A^\epsilon(A^\epsilon)^T = R^\delta.$$
Substituting $A^\epsilon=P(\bar W^{-1})^T$ for some matrix $P\in\mathbb{R}^{n\times n}$, we obtain the quadratic matrix equation
$$P+P^T+P\Sigma^{-1}P^T = R,\tag{A.8}$$
where $\Sigma=\bar W\bar W^T$. In the next lemma, we show how to find a solution matrix $P$ with $\|P\|=O(\epsilon)$ that satisfies (A.8), thus proving local openness of $\mathcal M_+(\cdot)$ at any full rank square matrix $\bar W$.

Lemma A.5.2. Let $\Sigma\in\mathbb{R}^{n\times n}$ be a full rank diagonal positive definite matrix. There exists $\epsilon_0>0$ such that for any positive $\epsilon<\epsilon_0$ and any symmetric matrix $R\in\mathbb{R}^{n\times n}$ with $\|R\|_\infty\le\epsilon$, there exists an upper-triangular matrix $P\in\mathbb{R}^{n\times n}$ with $\|P\|_\infty\le3\epsilon$ satisfying the equation
$$P+P^T+P\Sigma^{-1}P^T = R.$$

Before proving this lemma, let us emphasize that the value of $\epsilon_0$ depends on $\Sigma$, but is independent of the choice of $R$.

Proof. Let us start by simplifying the equation of interest. For all $i=1,\ldots,n$, let $s_i=\Sigma_{ii}^{-1}$, which is positive by the positive definiteness of $\Sigma$. Then
$$P+P^T+P\Sigma^{-1}P^T = R\ \Longleftrightarrow\ \begin{cases}2P_{ii}+\displaystyle\sum_l s_lP_{il}^2 = R_{ii} & \forall i\\[2pt] P_{ij}+P_{ji}+\displaystyle\sum_l s_lP_{il}P_{jl} = R_{ij} & \forall i<j\end{cases}$$
$$\Longleftrightarrow\ \begin{cases}\big(s_iP_{ii}+1\big)^2+\displaystyle\sum_{l\ne i}s_is_lP_{il}^2 = s_iR_{ii}+1 & \forall i\\[2pt] P_{ij}\big(s_jP_{jj}+1\big)+P_{ji}\big(s_iP_{ii}+1\big)+\displaystyle\sum_{l\ne i,j}s_lP_{il}P_{jl} = R_{ij} & \forall i<j\end{cases}$$
$$\Longleftrightarrow\ \begin{cases}P_{ii} = \dfrac{1}{s_i}\Bigg(\sqrt{s_iR_{ii}+1-\displaystyle\sum_{l\ne i}s_is_lP_{il}^2}-1\Bigg) & \forall i\\[2pt] P_{ij}\big(s_jP_{jj}+1\big)+P_{ji}\big(s_iP_{ii}+1\big)+\displaystyle\sum_{l\ne i,j}s_lP_{il}P_{jl} = R_{ij} & \forall i<j.\end{cases}$$
An upper-triangular solution $P$ can be generated using the following pseudo-code:

Algorithm 9 Pseudo-code for generating the matrix $P$
1: For all $(i,j)$ with $i>j$, set $P_{ij}=0$.
2: for $j=n\to1$ do
   $$P_{jj} = \frac{1}{s_j}\Bigg(\sqrt{s_jR_{jj}+1-\sum_{l>j}s_js_lP_{jl}^2}-1\Bigg)\tag{A.9}$$
3:   for $i=j-1\to1$ do
   $$P_{ij} = \frac{R_{ij}-\sum_{l>j}s_lP_{il}P_{jl}}{s_jP_{jj}+1}\tag{A.10}$$
4:   end for
5: end for

Notice that at each iteration of the algorithm corresponding to the $(i,j)$-th index, the corresponding equation is satisfied. Moreover, once an equation is satisfied, the variables in that equation do not change any more, and thus it remains satisfied. We proceed by showing that Algorithm 9 generates a matrix $P$ with $\|P\|_\infty\le3\epsilon$ for $\epsilon$ small enough. In particular, we show that for sufficiently small $\epsilon>0$, $|P_{ij}|\le2\epsilon+O(\epsilon^2)$ for all $i\le j$. We prove our result by a reverse induction on $j$.

Base step, $j=n$ (last column of $P$): Using (A.9),
$$|P_{nn}| = \frac{1}{s_n}\Big|\sqrt{s_nR_{nn}+1}-1\Big|\ \le\ \frac{1}{s_n}\big(s_n|R_{nn}|+1-1\big) = |R_{nn}|\le\epsilon.$$
Moreover, (A.10) implies $|P_{in}|=\dfrac{|R_{in}|}{|s_nP_{nn}+1|}$. For sufficiently small $\epsilon$, $|s_nP_{nn}+1|\ge\frac12$, so $|P_{in}|\le2|R_{in}|\le2\epsilon$.

Induction hypothesis: Assume $|P_{ij}|\le2\epsilon+O(\epsilon^2)$ for all $i\le j$, $j=n,\ldots,k$. We show the result holds for $k-1$. First of all, (A.9) implies
$$|P_{(k-1)(k-1)}|\ \le\ \frac{1}{s_{k-1}}\Big(s_{k-1}|R_{(k-1)(k-1)}|+1+\sum_{l>k-1}s_{k-1}s_lP_{(k-1)l}^2-1\Big)\ \le\ |R_{(k-1)(k-1)}|+O(\epsilon^2).$$
Also, $P_{i(k-1)}=\dfrac{R_{i(k-1)}-\sum_{l>k-1}s_lP_{il}P_{(k-1)l}}{s_{k-1}P_{(k-1)(k-1)}+1}$, which implies
$$|P_{i(k-1)}|\ \le\ \frac{|R_{i(k-1)}|+\big|\sum_{l>k-1}s_l\big(4\epsilon^2+O(\epsilon^3)\big)\big|}{|s_{k-1}P_{(k-1)(k-1)}+1|}.$$
For sufficiently small $\epsilon$, we have $|s_{k-1}P_{(k-1)(k-1)}+1|\ge\frac12$. Consequently,
$$|P_{i(k-1)}|\ \le\ 2|R_{i(k-1)}|+O(\epsilon^2)\ \le\ 2\epsilon+O(\epsilon^2).$$

We now use the above lemmas to complete the proof of Theorem 2.4.3.

Proof. To show the openness of the mapping, it suffices to show that it is locally open everywhere. Consider an arbitrary point $W\in\mathbb{R}^{n\times k}$, and let $U\Sigma U^T$ be a singular value decomposition of the symmetric matrix product $WW^T$. To ease the notation, denote $U^TW$ by $\bar W$. By Lemma A.5.1, $\mathcal M_+(\cdot)$ is locally open at $W$ if and only if $\mathcal M_+(\cdot)$ is locally open at $\bar W$. When $\bar W\bar W^T$ is rank deficient, we can write
$$\Sigma = \bar W\bar W^T = \begin{bmatrix}\Sigma_1 & 0\\ 0 & 0\end{bmatrix},$$
where $\Sigma_1\in\mathbb{R}^{r\times r}$ is a positive definite diagonal matrix and $r$ is the rank of $\bar W\bar W^T$. It is easy to show that the last $n-r$ rows of $\bar W$ are all zeros: for all $j>r$, $\Sigma_{jj}=\langle\bar W_{j,:},\bar W_{j,:}\rangle=\|\bar W_{j,:}\|^2=0$, or equivalently, $\bar W_{j,:}=0$. To show local openness of $\mathcal M_+(\cdot)$ at $\bar W$, we consider a perturbation $\widetilde\Sigma\triangleq\Sigma+R^\delta$ of $\Sigma$ in the range $\mathcal R_{\mathcal M_+}$, and show that there exists a small perturbation $\bar W+A^\epsilon$ of $\bar W$ such that $(\bar W+A^\epsilon)(\bar W+A^\epsilon)^T=\widetilde\Sigma$. By possibly permuting the columns of $\widetilde\Sigma$, the perturbed matrix, which we know is symmetric positive semi-definite with rank at most $k$, can be expressed in block form (with row and column blocks of sizes $r$, $k-r$, and $n-k$) as
By possibly permuting the columns of e , the perturbed matrix which we know is symmetric positive semi-denite with rank at most k can be expressed as 106 r columns kr columns nk columns r rows kr rows nk rows 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 1 + R 1 R 2 h 1 + R 1 R 2 i B R T 2 R 3 h R T 2 R 3 i B B T 2 6 6 6 4 1 + R T 1 R T 2 3 7 7 7 5 B T 2 6 6 6 4 R 2 R T 3 3 7 7 7 5 B T 2 6 6 6 4 1 + R T 1 R 2 R T 2 R T 3 3 7 7 7 5 B 3 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 5 : where R3 = 2 6 6 6 6 6 6 6 6 6 6 4 R3 h R T 2 R3 i B B T 2 6 6 6 4 R2 R T 3 3 7 7 7 5 B T 2 6 6 6 4 1 + R T 1 R2 R T 2 R T 3 3 7 7 7 5 B 3 7 7 7 7 7 7 7 7 7 7 5 ; and R 2 = h R 2 h 1 + R 1 R 2 i B i : Here B2R k(nk) exists since rank( e ) k. Moreover, e 0 for small enough perturbation. Therefore, the Schur complement theorem implies R 3 R T 2 ( 1 + R 1 ) 1 R 2 . Thus e 2R M+ requires R 1 to be a symmetric R rr matrix, R 2 to be an R rnr matrix, and R 3 to be a symmetric R (nr)(nr) matrix with R 3 R T 2 ( 1 + R 1 ) 1 R 2 . For every small perturbation R = 2 4 R 1 R 2 R T 2 R 3 3 5 , withkR k , we need to nd A 2R nk such that ( W + A )( W + A ) T = e or equivalently WA T + A W T + A A T = R : (A.11) Since the last nr rows of W are all zeros, we obtain WA T = 2 4 W1 0 3 5 h (A 1 ) T (A 2 ) T i = 2 4 W1(A 1 ) T W1(A 2 ) T 0 0 3 5 ; and A A T = 2 4 A 1 A 2 3 5 h (A 1 ) T (A 2 ) T i = 2 4 A 1 (A 1 ) T A 1 (A 2 ) T A 2 (A 1 ) T A 2 (A 2 ) T 3 5 : where W 1 = W 1:r;: 2 R rk is a full row rank matrix, A 1 2 R rk , and A 2 2 R (nr)k . From Equation 107 (A.11), we get the following three expressions: W 1 (A 1 ) T + A 1 W T 1 + A 1 (A 1 ) T = R 1 ; (A.12) W 1 (A 2 ) T + A 1 (A 2 ) T = R 2 ; (A.13) A 2 (A 2 ) T = R 3 : (A.14) Setting A 1 , P( W y 1 ) T , where ( W 1 ) y , W T 1 ( W 1 W T 1 ) 1 , we obtain W1(A 1 ) T + A 1 W T 1 + A 1 (A 1 ) T = W1 W T 1 1 1 P T + P 1 1 W1 W T 1 + P 1 1 W1 W T 1 1 1 P T = P T + P + P 1 1 P T : Using Lemma A.5.2, we can choose small enough so that for any perturbation matrix R withkRk<, there exists a solution P withkPk = O(). More precisely, we can generate P2 R rr that satises expression (A.12), withkPk 1 3. Also, since ( W y 1 ) T W y 1 = 1 1 , we obtaink W y 1 :;j k 2 1 min 8j r; where min is the minimum singular value for 1 . Then by denition of A 1 , we can bound its norm: kA 1 kk W y 1 kkPk p r p min 3r 2 = 3r 2:5 p min : (A.15) Note thatkA 1 k is of order which can be chosen arbitrarily small so that W 1 + A 1 is full row rank. Dene (A 2 ) T , ( W 1 + A 1 ) y R 2 + M; where MfM2R k(nr) kMk; C(M)N W 1 + A 1 g; and ( W 1 + A 1 ) y , ( W 1 + A 1 ) T [( W 1 + A 1 )( W 1 + A 1 ) T ] 1 = ( W 1 + A 1 ) T ( 1 + R 1 ) 1 with the last equality obtained using (A.12). Substituting A 2 in (A.13), we obtain ( W 1 + A 1 )(A 2 ) T = ( W 1 + A 1 )( W 1 + A 1 ) y R 2 + ( W 1 + A 1 )M = R 2 : where the last equality is valid sinceC(M)N ( W 1 + A 1 ). Substituting A 2 in (A.14), we obtain A 2 (A 2 ) T = R T 2 ( 1 + R 1 ) 1 ( 1 + R 1 )( 1 + R 1 ) 1 R 2 + M T ( W 1 + A 1 ) T ( 1 + R 1 ) 1 R 2 + R T 2 ( 1 + R 1 ) 1 ( W 1 + A 1 )M + M T M = R T 2 ( 1 + R 1 ) 1 ( 1 + R 1 )( 1 + R 1 ) 1 R 2 + M T M = R T 2 ( 1 + R 1 ) 1 R 2 + M T M; 108 where the second inequality holds sinceC(M)N ( W 1 + A 1 ). Expression (A.14) can be satised if for any symmetric R 3 R T 2 ( 1 + R 1 ) 1 R 2 , there exists M such that M T M = R 3 R T 2 ( 1 + R 1 ) 1 R 2 . Since ( W 1 + A 1 )2R rk is a full row rank matrix, then dim N ( W 1 + A 1 ) =kr. Let Q2R k(kr) be a basis forN ( W 1 + A 1 ). 
Then for every MfM2 R k(nr) kMk ; C(M) N W 1 + A 1 g, there exist N 2 R (kr)(nr) with M = QN, which implies M T M = N T Q T QN = N T N: Since R 3 R T 2 ( 1 + R 1 ) 1 R 2 is the schur complement of + R , then by the Guttman rank additivity for- mula, we get k rank( e ) = rank( 1 + R 1 ) + rank( R 3 R T 2 ( 1 + R 1 ) 1 R 2 ); which implies rank( R 3 R T 2 ( 1 +R 1 ) 1 R 2 )kr: Thus for any symmetric positive semi-denite matrix R 3 R T 2 ( 1 +R 1 ) 1 R 2 , there exist a matrix N2 R (kr)(nr) such that N T N = R 3 R T 2 ( 1 + R 1 ) 1 R 2 . It follows that there exist a matrix M2R k(nr) , M, QN, with M T M = N T N = R 3 R T 2 ( 1 + R 1 ) 1 R 2 : We have dened A = 2 4 A 1 A 2 3 5 such that ( W + A )( W + A ) T = e . We now obtain an upper-bound onkA k. Since ( W 1 + A 1 ) y T ( W 1 + A 1 ) y = ( 1 + R 1 ) 1 ( W 1 + A 1 )( W 1 + A 1 ) T ( 1 + R 1 ) 1 = ( 1 + R 1 ) 1 ; we obtain k ( W 1 + A 1 ) y :;j k 2 1 min 8jr: Then by the denition of A 2 , we can bound its norm as follows kA 2 kk( W 1 + A 1 ) y kk R 2 k +kMk p r p min +: (A.16) Using (A.15) and (A.16), we obtain kA k 3r 2:5 p min + p r p min + 3r 2:5 p min + p 2r p min + 3r 2:5 + p 2r + p min p min ; where the second inequality assumes min =2. Now, for a given > 0, choose min ( p min 3r 2:5 + p 2r + p min ; min =2 ) : This choice of leads tokA k, which completes the proof. 109 A.6 Proof of Theorem 2.7.2 Consider the training problem of a multi-layer deep linear neural network: minimize W 1 2 kW h W 1 X Yk 2 : (A.17) Here W = W i h i=1 , W i 2 R didi1 are the weight matrices, X 2 R d0n is the input training data, and Y2 R d h n is the target training data. Based on our general framework, the corresponding auxiliary optimization problem is given by minimum Z2R d h d 0 1 2 jjZX Yjj 2 subject to rank(Z)d p , min 0ih d i : (A.18) Lemma A.6.1. Consider a degenerate critical point W = ( W h ;:::; W 1 ) with N ( W i ) andN W T i for h 1i 2 all non-empty. If N W h is non-empty or N W T 1 is non-empty; then W is either a global minimum or a saddle point of problem (A.17). Proof. Suppose thatN W h is non-empty. Let = W h W 1 X Y. If X T = 0, by convexity of the loss function, the point W = ( W h ;:::; W 1 ) is a global minimum of (A.17). Else, there exist (i;j) such that X i;: ; j;: 6= 0. We dene the setK,fk2Nj 3kh; N ( W k )?N ( W k1 W k2 W 2 ) T g. We split the rest of the proof into two cases that correspond toK being empty and non-empty. Case a: AssumeK is non-empty. We dene k , maximum k2K k. By denition of the setK and choice of k , the null spaceN W k is orthogonal to the null-spaceN ( W k 1 W 2 ) T . This implies there exists a non-zero b2R d k 1 such that b2N W k \C W k 1 W 2 . By considering perturbation in directions A = (A h ;:::; A 1 ), A i 2R didi1 for the optimization problem minimize t g(t), 1 2 k( W h +tA h ) ( W 1 +tA 1 )X Yk 2 ; (A.19) we examine the optimality conditions for a specic direction A. Let ( A h ) l;: , 8 > < > : h p T h if l =j; 0 otherwise ( A 1 ) :;l , 8 > < > : 1 b 1 if l =i; 0 otherwise 110 A k , 8 > > > > < > > > > : b k p T k if k + 1kh 1 b k b T if k =k 0 if 2kk 1; where h and 1 are scalar constants, b 1 2R d1 such that W k 1 W 2 b 1 = b, and p k 2N ( W k1 W 2 ) T ; b k1 2N W k ; andh p k ; b k1 i6= 08 k + 1kh: (A.20) Notice that such p k and b k1 exist from the denition ofK and choice of k . For this particular choice of A = ( A h ;:::; A 1 ), we obtain W k+1 A k = 0 for k kh 1; and A k W k1 W 2 = 0 for k + 1kh: (A.21) We now show that ( A h ;:::; A 1 ) is in fact a descent direction. 
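The recursion in Algorithm 9 above is short enough to transcribe directly. Below is a minimal NumPy sketch (the function name is illustrative) that produces the upper-triangular $P$ and checks both the equation and the $\|P\|_\infty\le3\epsilon$ bound of Lemma A.5.2 on a random small symmetric $R$:

```python
import numpy as np

def solve_quadratic_matrix_eq(sigma_diag, R):
    """Upper-triangular P with P + P^T + P Sigma^{-1} P^T = R (Algorithm 9)."""
    n = len(sigma_diag)
    s = 1.0 / np.asarray(sigma_diag)              # s_i = 1 / Sigma_ii
    P = np.zeros((n, n))
    for j in range(n - 1, -1, -1):
        # diagonal entry, equation (A.9)
        rad = s[j] * R[j, j] + 1.0 - s[j] * np.sum(s[j+1:] * P[j, j+1:]**2)
        P[j, j] = (np.sqrt(rad) - 1.0) / s[j]
        # off-diagonal entries of column j, equation (A.10)
        for i in range(j - 1, -1, -1):
            P[i, j] = (R[i, j] - np.sum(s[j+1:] * P[i, j+1:] * P[j, j+1:])) \
                      / (s[j] * P[j, j] + 1.0)
    return P

rng = np.random.default_rng(0)
n, eps = 5, 1e-3
sigma = rng.uniform(0.5, 2.0, n)                  # diagonal of a positive definite Sigma
R = rng.uniform(-eps, eps, (n, n)); R = (R + R.T) / 2
P = solve_quadratic_matrix_eq(sigma, R)
residual = P + P.T + P @ np.diag(1.0 / sigma) @ P.T - R
print(np.max(np.abs(residual)))                   # ~ machine precision
print(np.max(np.abs(P)) <= 3 * eps)               # True, matching Lemma A.5.2
```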
Before proceeding, let us dene some notation to ease the expressions of the optimality conditions. LetV be an index set that is a subset off1;:::;hg. We dene the function f( A V ; W V ) which is the matrix product attained from W h W 1 X by replacing matrices W v by matrices A v for every v2V. For instance, if h = 5 andV =f2; 3; 5g, then f( A V ; W V ) = A 5 W 4 A 3 A 2 W 1 X. We now determine index setsV, withjVj 1, that correspond to non-zerof( A V ; W V ). First note by denition of A, ifV \fk 1;:::; 2g6=;, then f( A V ; W V ) = 0. Also by (A.21), for any k v h 1, if v 2 V then either fk ;:::;hg 2 V or f( A V ; W V ) = 0. This implies that A h A k W k 1 W 1 X and A h A k W k 1 W 2 A 1 X are the only terms that can take non-zero values. Using the denition equation (A.19) we obtain g(t) = 1 2 kt hk +1 A h A k W k 1 W 1 X +t hk +2 A h A k W k 1 W 2 A 1 X + k 2 : It follows that @ r g(t) @t r t=0 = 0 for all rhk and @ hk +1 g(t) @t hk +1 t=0 =c 1 A h A k W k 1 W 1 X; ; where c 1 > 0 is a scalar. If A h A k W k 1 W 1 X; 6= 0, then by properly choosing the sign of h such that 111 A h A k W k 1 W 1 X; < 0, we get a descent direction. Otherwise, @ hk +2 g(t) @t hk +2 t=0 =c 1 A h A k W k 1 W 2 A 1 X; +h( A h A k W k 1 W 1 X): where c 1 > 0 is a scalar, and h() is a function of A h A k W k 1 W 1 X. We now evaluate the term A h A k W k 1 W 2 A 1 X; . Since ( A h ) l;: = 0 for alll6=j and ( A 1 ) :;l = 0 for alll6=i, we only need to compute the (j;i) index A h A k W k 1 W 2 A 1 (j;i) as all other indices are zero. For some constant c = p T h b h1 p T h1 b h2 p T k +1 b k b T b, we obtain c 1 A h A k W k 1 W 2 A 1 (j;i) = c 1 h 1 p T h b h1 p T h1 b h2 p T k +1 b k b T W k 1 W 2 b 1 = c 1 h 1 p T h b h1 p T h1 b h2 p T k +1 b k b T b = h 1 c; where c is non-zero by our choice of b, p k and b k1 for k + 1 k h as dened in (A.20). For a xed h 6= 0, h( A h A k W k 1 W 1 X) is a constant scalar we denote by c . Then by properly choosing 1 such that h |{z} 6=0 1 c |{z} 6=0 X i;: ; j;: | {z } 6=0 +c < 0; we get a descent direction. This completes the rst case. Case b: AssumeK is empty. We consider ( A h ) l;: , 8 > < > : h p T h if l =j; 0 otherwise ( A 1 ) :;l , 8 > < > : 1 b 1 if l =i; 0 otherwise A k , 8 > < > : b k p T k if 3kh 1 b k b T 1 if k = 2; where h and 1 are scalar constants, b 1 2N W 2 , and p k 2N ( W k1 W 2 ) T ; b k1 2N W k ; andh p k ; b k1 i6= 0;8 3kh: (A.22) For this particular choice of A = ( A h ;:::; A 1 ), we obtain W k+1 A k = 0 for 2 k h 1; and A k W k1 W 2 = 0 for 3 k h: We now determine index setsV, with V 1, that correspond to non-zero f( A V ; W V ). By (A.22), for any 2 v h 1, if v 2V then eitherf2;:::;hg2V or f( A V ; W V ) = 0. This directly imply that A h A 2 W 1 X and A h A 1 X are the only terms that can take non-zero values. Using the denition of equation (A.19) we obtain 112 g(t) = 1 2 kt h1 A h A 2 W 1 X +t h A h A 1 X + k 2 : It follows that @ r g(t) @t r t=0 = 0 for all rh 2; and @ h1 g(t) @t h1 t=0 = c 1 A h A 2 W 1 X; ; where c 1 > 0 is a scalar. If A h A 2 W 1 X; 6= 0, then by properly choosing the sign of h such that A h A 2 W 1 X; < 0, we get a descent direction. Otherwise, @ h g(t) @t h t=0 =c 1 A h A 1 X; +h( A h A 2 W 1 X); wherec 1 > 0 is a scalar, andh() is a function of A h A 2 W 1 X. We now evaluate the term A h A 1 X; . Since ( A h ) l;: = 0 for all l6= j and ( A 1 ) :;l = 0 for all l6= i, we only need to compute the (j;i) index A h A 1 (j;i) as all other indices are zero. 
For some constant c = p T h b h1 p T h1 b h2 p T 3 b 2 b T 1 b 1 ; we obtain c 1 A h A 1 (j;i) =c 1 h 1 p T h b h1 p T h1 b h2 p T 3 b 2 b T 1 b 1 = h 1 c; where c is non-zero by our choice of b, p k and b k1 for 3kh as dened in (A.22). For a xed h 6= 0, h( A h A 2 W 1 X) is a constant scalar we denote by c . Then by properly choosing 1 such that h |{z} 6=0 1 c |{z} 6=0 X i;: ; j;: | {z } 6=0 +c < 0; we get a descent direction. This completes the second case. Now ifN W T 1 is non-empty, we dene the set K,fkj 1kh 2; N ( W h1 W k+1 )?N ( W T k )g; and use a similar proof scheme to show the result. More specically, we split the proof into two cases that correspond toK being empty and non-empty. Case a: AssumeK is non-empty. We dene k , minimum k2K k. By denition of the setK and choice of k , the null spaceN W T k is orthogonal to the null-spaceN W h1 W k +1 . This implies there exists a non-zero p2 R d k such that p2N W T k \C ( W h1 W k +1 ) T . By considering perturbation in 113 directions A = (A h ;:::; A 1 ), A i 2R didi1 for the optimization problem minimize t g(t), 1 2 k( W h +tA h ) ( W 1 +tA 1 )X Yk 2 ; (A.23) we examine the optimality conditions for a specic direction A. Let ( A h ) l;: , 8 > < > : h p T h if l =j; 0 otherwise ( A 1 ) :;l , 8 > < > : 1 b 1 if l =i; 0 otherwise A k , 8 > > > > < > > > > : b k p T k if 2kk 1 pp T k if k =k 0 if k + 1kh 1; where h and 1 are constants and p h 2R d h1 with p T h W h1 W k +1 = p T ; p k 2N W T k1 ; b k1 2N W h1 W k ; andh p k ; b k1 i6= 0 for all 2kk . Notice that such p k and b k1 exist from the denition ofK and choice of k . For this particular choice of A = ( A h ;:::; A 1 ), we obtain A k W k1 = 0 for 2kk ; and W h1 W k+1 A k = 0 for 1kk 1: (A.24) The same argument used above can be used to show that ( A h ;:::; A 1 ) is actually a descent direction. This completes the proof of the rst case. Case b: AssumeK is empty. We consider ( A h ) l;: , 8 > < > : h p T h if l =j; 0 otherwise ( A 1 ) :;l , 8 > < > : 1 b 1 if l =i; 0 otherwise A k , 8 > < > : b k p T k if 2kh 2 p h p T k if k =h 1; where h and 1 are scalar constants, p h 2N W T h1 , and p k 2N W T k1 ; b k1 2N W h1 W k ; andh p k ; b k1 i6= 08 2kh 1: For this particular choice of A, we obtain A k W k1 = 0 for 2k h 1 and W h1 W k+1 A k = 0 for 1 k h 2: The same argument used above can be used to show that ( A h ;:::; A 1 ) is actually a descent direction. This completes the second case and thus completes 114 the proof. Following the same steps of the proof in Lemma A.6.1, we get the same result when replacing the square loss error by a general convex and dierentiable function `(). We are now ready to prove the main result restated below. Theorem 2.7.2 If there exist p 1 and p 2 , 1 p 1 < p 2 h 1 with d h > d p2 and d 0 > d p1 , we can nd a rank decient Y such that problem (2.15) has a local minimum that is not global. Otherwise, given any X and Y, every local minimum of problem (2.15) is a global minimum. Proof. Suppose there exist such a pairfp 1 ;p 2 g. Let X, I, Y (i;j) , 8 > < > : 1 if (i;j) = (d h ;d 0 ) 0 otherwise ; W k , 8 > > > > > > > > > < > > > > > > > > > : I d k 0 if d k d k1 ; 2 6 4 I d k1 0 3 7 5 if d k >d k1 ; for k 2fh;:::;p 2 + 1g[fp 1 ;:::; 1g, and W k = 0 for k 2fp 2 ;:::;p 1 + 1g. Since W h W p2+1 and W p1 W 1 are both full rank, Lemma A.3.1 implies the matrix productsM h;p2+1 andM p1;1 are locally open at ( W h ;:::; W p2+1 ) and ( W p1 ;:::; W 1 ), respectively. 
Moreover, using Proposition 2.4.1, Theorem 2.4.2, and the composition property of open maps, the matrix product mapping $\mathcal M_{p_2,p_1+1}$ is locally open at $(\bar W_{p_2},\ldots,\bar W_{p_1+1})$. It follows by Observation 2.3.1 that if $\bar W$ is a local minimum of
$$\underset{W}{\text{minimize}}\ \frac12\|W_h\cdots W_1-Y\|^2,\tag{A.25}$$
then $(\bar Z_3,\bar Z_2,\bar Z_1)$ is a local minimum of
$$\underset{Z_3\in\mathbb{R}^{d_h\times d_{p_2}},\ Z_2\in\mathbb{R}^{d_{p_2}\times d_{p_1}},\ Z_1\in\mathbb{R}^{d_{p_1}\times d_0}}{\text{minimize}}\ \frac12\|Z_3Z_2Z_1-Y\|^2.\tag{A.26}$$
Let
$$\bar Z_3 = \bar W_h\cdots\bar W_{p_2+1} = \begin{bmatrix}I_{d_{p_2}}\\ 0\end{bmatrix},\qquad\bar Z_2 = 0,\qquad\bar Z_1 = \bar W_{p_1}\cdots\bar W_1 = \big[\,I_{d_{p_1}}\ \ 0\,\big].$$
The point $(\bar Z_3,\bar Z_2,\bar Z_1)$ is obviously not global; we show using optimality conditions that the point is a local minimum. Consider perturbations in the directions $\bar A=(\bar A_3,\bar A_2,\bar A_1)$ for the optimization problem
$$\underset{t}{\text{minimize}}\ g(t)\triangleq\frac12\big\|(\bar Z_3+t\bar A_3)(\bar Z_2+t\bar A_2)(\bar Z_1+t\bar A_1)-Y\big\|^2 = \frac12\big\|t(\bar Z_3+t\bar A_3)\bar A_2(\bar Z_1+t\bar A_1)-Y\big\|^2.\tag{A.27}$$
It follows that
$$\frac{\partial g(t)}{\partial t}\Big|_{t=0} = -\big\langle\bar Z_3\bar A_2\bar Z_1,\ Y\big\rangle = -\big(\bar Z_3\bar A_2\bar Z_1\big)_{d_h,d_0}Y_{d_h,d_0} = 0,\tag{A.28}$$
where the last equality holds since the last ($d_h$-th) row of $\bar Z_3$ is zero. Also,
$$\frac{\partial^2g(t)}{\partial t^2}\Big|_{t=0} = -2\big\langle\bar Z_3\bar A_2\bar A_1+\bar A_3\bar A_2\bar Z_1,\ Y\big\rangle+\|\bar Z_3\bar A_2\bar Z_1\|^2 = -2\big(\bar Z_3\bar A_2\bar A_1\big)_{d_h,d_0}Y_{d_h,d_0}-2\big(\bar A_3\bar A_2\bar Z_1\big)_{d_h,d_0}Y_{d_h,d_0}+\|\bar Z_3\bar A_2\bar Z_1\|^2 = \|\bar A_2\|^2,\tag{A.29}$$
where the last equality holds since the last row ($d_h$-th row) of $\bar Z_3$ and the last column ($d_0$-th column) of $\bar Z_1$ are both zero. Then, for $\|\bar A_2\|\ne0$, the second-order optimality condition implies that the point is a local minimum, and if $\|\bar A_2\|=0$ we get $g(t)=\frac12\|Y\|^2=\frac12$, which implies $(\bar Z_3,\bar Z_2,\bar Z_1)$ is a local optimum that is not global.

Note that the same method used to construct the example above can be used to find a local minimum that is not global whenever $\mathrm{rank}(Y)\le\min\{d_h-d_{p_2},\ d_0-d_{p_1}\}$. When $Y$ is full rank, we know from the results of [106, 157] that every local minimum is global. To have a complete characterization of problems for which every local minimum is global, it remains to either prove or disprove the statement when $Y$ is a rank deficient matrix with $\mathrm{rank}(Y)>\min\{d_h-d_{p_2},\ d_0-d_{p_1}\}$. We now provide a counterexample that disproves the statement. In particular, we construct a three-layer network with input $X$ and output $Y$ with $\mathrm{rank}(Y)>\min\{d_h-d_{p_2},\ d_0-d_{p_1}\}$, and then find a local minimum $(\bar W_3,\bar W_2,\bar W_1)$ that is not global. Let $X=I$,
$$Y\triangleq\begin{bmatrix}1 & 0 & -1\\ 0 & 4 & 0\\ -1 & 0 & 1\end{bmatrix},\qquad\bar W_3\triangleq\begin{bmatrix}1 & 1\\ 1 & -1\\ 1 & 1\end{bmatrix},\qquad\bar W_2\triangleq\begin{bmatrix}1 & -1\\ -1 & 1\end{bmatrix},\qquad\bar W_1\triangleq\bar W_3^T.$$
Clearly, $(\bar W_3,\bar W_2,\bar W_1)$ is not a global minimum. Define $\Delta\triangleq\bar W_3\bar W_2\bar W_1-Y$. Then,
$$\bar W_3^T\Delta = \Delta\bar W_1^T = 0.\tag{A.30}$$
Considering perturbations in the directions $A=(A_3,A_2,A_1)$ for the optimization problem
$$\underset{t}{\text{minimize}}\ g(t;A)\triangleq\frac12\big\|(\bar W_3+tA_3)(\bar W_2+tA_2)(\bar W_1+tA_1)-Y\big\|^2,$$
it follows that
$$\frac{\partial g(t;A)}{\partial t}\Big|_{t=0} = \big\langle\bar W_3\bar W_2A_1+\bar W_3A_2\bar W_1+A_3\bar W_2\bar W_1,\ \Delta\big\rangle = 0,$$
where the last equality is directly implied by (A.30). Also,
$$g^{(2)}(0;A)\triangleq\frac{\partial^2g(t;A)}{\partial t^2}\Big|_{t=0} = 2\big\langle A_3A_2\bar W_1+A_3\bar W_2A_1+\bar W_3A_2A_1,\ \Delta\big\rangle+\big\|\bar W_3\bar W_2A_1+\bar W_3A_2\bar W_1+A_3\bar W_2\bar W_1\big\|^2 = 2\big\langle A_3\bar W_2A_1,\ \Delta\big\rangle+\big\|\bar W_3\bar W_2A_1+\bar W_3A_2\bar W_1+A_3\bar W_2\bar W_1\big\|^2,$$
which is a quadratic function of $A$, denoted by $f_A\triangleq\frac12a^TH_Aa$. Here $a\in\mathbb{R}^{16\times1}$ is a vectorization of the matrices $A_3$, $A_2$, and $A_1$, and $H_A$ is the Hessian of $f_A$. By computing the eigenvalues of $H_A$ we get $H_A\succeq0$, which directly implies $g^{(2)}(0;A)\ge0$ for all $A$. Moreover, let $a^{\mathrm{opt}}$ be the optimal solution set of the problem $\min_a f_A$. Then $a^{\mathrm{opt}}=\{a\mid a\in\mathcal N(H_A)\}$. We notice that for any $a\in a^{\mathrm{opt}}$, the corresponding direction $A$ satisfies
$$\bar W_3\bar W_2A_1+\bar W_3A_2\bar W_1+A_3\bar W_2\bar W_1 = 0\quad\text{and}\quad\big\langle A_3A_2A_1,\ \Delta\big\rangle = 0.$$
Then it follows that
$$g^{(3)}(0;A)\triangleq\frac{\partial^3g(t;A)}{\partial t^3}\Big|_{t=0} = 6\big\langle A_3A_2A_1,\ \Delta\big\rangle = 0,$$
and
$$g^{(4)}(0;A)\triangleq\frac{\partial^4g(t;A)}{\partial t^4}\Big|_{t=0} = 12\big\|A_3A_2\bar W_1+A_3\bar W_2A_1+\bar W_3A_2A_1\big\|^2\ \ge\ 0.$$
If $A_3A_2\bar W_1+A_3\bar W_2A_1+\bar W_3A_2A_1\ne0$, then using the fourth order optimality conditions, $(\bar W_3,\bar W_2,\bar W_1)$ is a local minimum. Otherwise, we get
$$g^{(5)}(0;A)\triangleq\frac{\partial^5g(t;A)}{\partial t^5}\Big|_{t=0} = 0,\qquad\text{and}\qquad g^{(6)}(0;A)\triangleq\frac{\partial^6g(t;A)}{\partial t^6}\Big|_{t=0} = 360\,\|A_3A_2A_1\|^2\ \ge\ 0,$$
which also implies that $(\bar W_3,\bar W_2,\bar W_1)$ is a local minimum.

We now show that if such a pair $\{p_2,p_1\}$ does not exist, then every local minimum of (A.17) is global. In particular, we show that for any $X$ and $Y$, if $\bar W$ is not a global minimum, we can construct a descent direction. First notice that if, for some $1\le i\le h-1$, $\bar W_i$ is full column rank, then using Proposition 2.4.1, $\mathcal M_{i+1,i}(\cdot)$ is locally open at $(\bar W_{i+1},\bar W_i)$ and $\bar W_{i+1}\bar W_i\in\mathbb{R}^{d_{i+1}\times d_{i-1}}$. Using Observation 2.3.1, we conclude that any local minimum of problem (A.17) corresponds to a local minimum of the problem obtained by replacing $W_{i+1}W_i$ by $Z_{i+1,i}\in\mathbb{R}^{d_{i+1}\times d_{i-1}}$. By a similar argument, we conclude that if $\bar W_i$ is full row rank for some $2\le i\le h$, any local minimum of problem (A.17) corresponds to a local minimum of the problem obtained by replacing $W_iW_{i-1}$ by $Z_{i,i-1}\in\mathbb{R}^{d_i\times d_{i-2}}$. Thus, if $\bar W=(\bar W_h,\ldots,\bar W_1)$ is a local minimum of problem (A.17), the new point $\bar Z=(\bar Z'_{h'},\ldots,\bar Z'_1)$, where $\bar Z'_i\in\mathbb{R}^{d'_i\times d'_{i-1}}$ and $h'\le h$, is a local minimum of the problem obtained by applying the replacements discussed above. If $h'=1$, we get the desired result from Lemma A.1.1. Else, if $h'=2$, the auxiliary problem becomes a two-layer linear network, for which Theorem 2.6.1 provides the desired result. When $h'>2$, examine $d'_{h'}$, $d'_{h'-1}$, $d'_1$, and $d'_0$. If $d'_{h'}>d'_{h'-1}$ and $d'_0>d'_1$, then there exist $1\le p_1<p_2\le h-1$ with $d_h>d_{p_2}$ and $d_0>d_{p_1}$, which contradicts our assumption. It follows by the construction of $\bar Z_i$ that either $d'_{h'}\le d'_{h'-1}$ and $\bar Z'_{h'}$ is not full row rank, or $d'_0\le d'_1$ and $\bar Z'_1$ is not full column rank; thus at least one of the null spaces $\mathcal N(\bar Z'_{h'})$, $\mathcal N\big((\bar Z'_1)^T\big)$ is non-trivial. Moreover, $\bar Z_i$ has non-trivial right and left null spaces for $2\le i\le h-1$. The result follows using Lemma A.6.1.

A.7 Proof of Corollary 2.7.4

Proof. Suppose $\bar W$ is a degenerate critical point. Then, by replacing the squared loss with a general convex and differentiable function $\ell(\cdot)$ in Theorem 2.7.2, we get that $\bar W$ is either a saddle or a global minimum. Suppose $\bar W=(\bar W_h,\ldots,\bar W_1)$ is a non-degenerate critical point and $k\triangleq\min_id_i=\min(d_h,d_0)$; we follow the same steps as the proof of [157, Theorem 2.1] to show the desired result. First note that
$$\frac{\partial\ell(W_h\cdots W_1X)}{\partial W_1}\Big|_{W=\bar W} = \bar W_2^T\cdots\bar W_h^T\,\nabla\ell(\bar W_h\cdots\bar W_1X)\,X^T,$$
and
$$\frac{\partial\ell(W_h\cdots W_1X)}{\partial W_h}\Big|_{W=\bar W} = \nabla\ell(\bar W_h\cdots\bar W_1X)\,X^T\,\bar W_1^T\cdots\bar W_{h-1}^T,$$
where $\nabla\ell$ is the gradient mapping of the function $\ell(\cdot)$. If $k=d_h$, let $S=\bar W_2^T\cdots\bar W_h^T\in\mathbb{R}^{d_1\times k}$ and $T=\nabla\ell(\bar W_h\cdots\bar W_1X)X^T$. It follows that
$$k = \mathrm{rank}(\bar W_h\cdots\bar W_1)\ \le\ \mathrm{rank}(S^T)\ \le\ k\ \Rightarrow\ \mathrm{rank}(S)=k.$$
Since $\bar W$ is a critical point and $S^T$ is full row rank, we get
$$0 = \bigg\|\frac{\partial\ell(W_h\cdots W_1X)}{\partial W_1}\Big|_{W=\bar W}\bigg\|^2 = \mathrm{tr}\big(T^TS^TST\big)\ \ge\ \sigma_{\min}^2(S)\,\|T\|^2.$$
Thus $T=\nabla\ell(\bar W_h\cdots\bar W_1X)X^T=0$, which by convexity of $\ell(\cdot)$ implies that $\bar W$ is a global minimum. Similarly, we can show that the case $k=d_0$ results in the global optimality of $\bar W$ as well.
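Before moving to Appendix B, note that the three-layer counterexample in the proof of Theorem 2.7.2 can be checked mechanically. The signs of the entries below are a reconstruction (the scanned source drops minus signs), chosen so that the conditions in (A.30) hold exactly; the check itself is a minimal sketch:

```python
import numpy as np

W3 = np.array([[1., 1.], [1., -1.], [1., 1.]])
W2 = np.array([[1., -1.], [-1., 1.]])
W1 = W3.T
Y  = np.array([[1., 0., -1.], [0., 4., 0.], [-1., 0., 1.]])

D = W3 @ W2 @ W1 - Y                   # Delta, with X = I
print(np.allclose(W3.T @ D, 0.0))      # True: first condition in (A.30)
print(np.allclose(D @ W1.T, 0.0))      # True: second condition in (A.30)
print(0.5 * np.linalg.norm(D)**2)      # 2.0 > 0, so the point is not global
print(np.linalg.matrix_rank(Y))        # 2: Y itself meets the rank bound, so the global cost is 0
```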
We rst show that iterate k + 1 generated by projected gradient descent satises x k+1 +y k+1 = 0: (B.1) Then we show that x k+1 0; x k x k+1 0: (B.2) Combining (B.1) and (B.2), we will complete our proof. First note that if x k = 0, then the result trivially holds. Assume that x k > 0, we dene x k+1 ,x k r x f(x k ;x k ) =x k (1 2x 2 k )x k e 2x 2 k and y k+1 ,y k r y f(x k ;x k ) =x k + (1 2x 2 k )x k e 2x 2 k +x k : Since x k+1 + y k+1 = x k > 0, the point ( x k+1 ; y k+1 ) is not feasible. By projecting ( x k+1 ; y k+1 ) to the feasible setf(x;y)jy +x 0g, we obtain x k+1 =x k (1 2x 2 k )x k e 2x 2 k 2 x k and y k+1 =x k + (1 2x 2 k )x k e 2x 2 k + 1 2 x k : (B.3) 119 Obviously (B.1) holds. We now show that x k+1 =x k 1 2 (1 2x 2 k )e 2x 2 k 0 and x k x k+1 =x k (1 2x 2 k )e 2x 2 k + 2 0: Let g(x), (1 2x 2 )e 2x 2 . This function has two global minima x =1, and one global maximum x = 0. Hence, e 2 g(x) 1; 8 x: (B.4) Using (B.4), we get x k+1 =x k 1 2 g(x k ) x k 1 2 0; where the second inequality holds since < 2=3 and x k 0. Also, x k x k+1 =x k g(x k ) + 1 2 x k 1 2 e 2 0: Combining (B.1) and (B.2), we conclude that for any kk, (x k ;y k ) belongs to the compact setf(x;y)j 0 xx k ; y =xg which guarantees convergence. Thus x, lim k!1 x k+1 = lim k!1 h x k (1 2x 2 k )x k e 2x 2 k 2 x k i = x h 1 2 g( x) i ; which implies x = 0; or g( x) = 1 2 : Since max x g(x)>e 2 , we get that x = 0 which completes the proof. B.2 Proof of Theorem 3.2.2 Consider the initial point (x 0 ;y 0 ). If we can show that y 1 =x 1 and x 1 0, then using Lemma 3.2.1, we conclude that the sequence of iteratesf(x k ;y k )g eventually converges to the origin. Thus it suces to show that there exist an > 0 such that if 0:5x 0 0:5; and 0:5y 0 0:5; (B.5) 120 then the next iterate (x 1 ;y 1 ) satises x 1 =x 0 + y 0 (1 2x 2 0 )e x 2 0 y 2 0 0; (B.6a) y 1 +x 1 =y 0 + x 0 (1 2y 2 0 )e x 2 0 y 2 0 y 0 +x 0 + y 0 (1 2x 2 0 )e x 2 0 y 2 0 0: (B.6b) The rst condition (B.6a) is satised when the step-size is small enough. To prove (B.6b) we utilize the conditions in (B.5) to obtain the following inequalities 2x 0 +y 0 0; 0:25 + (0:5) 2 x 2 0 +y 2 0 (0:5 +) 2 + 0:25; 0:51 2x 2 0 1 2(0:5) 2 ; 1 2(0:5 +) 2 1 2y 2 0 0:5; which implies that x 0 (1 2y 2 0 ) +y 0 (1 2x 2 0 ) (0:5) 1 2(0:5 +) 2 (0:5 +) 1 2(0:5) 2 =3 + 4 3 : (B.7) Then using (B.7), we get x 0 +y 0 y 0 + h x 0 (1 2y 2 0 ) +y 0 (1 2x 2 0 ) i e x 2 0 y 2 0 2 + 0:5 + 3 + 4 3 e 0:25 e (0:5+) 2 : Note that the right hand side is 0:5 +O() which is greater than or equal to zero for suciently small . This shows that condition (B.6b) holds, and the proof follows by Lemma 3.2.1. B.3 Proof for Theorem 3.3.1 Before proceeding to the result, let us dene some notations. LetG(V;E) be a graph with the set of vertices V and the set of edges E. Also letjVj be the cardinality of V and A G be the adjacency matrix of graph G. We deneC n ,fQ2R nn j x T Qx 0; 8 x 0g to be the set of co-positive matrices. We denote the all-one matrix of size n by 1 n . We say graph G has a stable set of size t if it contains a subset of t vertices, from which no two vertices in the subset are connected by an edge. 121 Lemma B.3.1. Let G = (V;E) be a graph withjVj =n. Given a scalar t with tn, dene Q = (I n + A G )(t 1 2 ) 1 n ; and = 1 2n + 1 : Then the following are equivalent: i. min d0;kdk21 d T Qd p n : ii. G contains a stable set of size t Proof. We rst show that i implies ii. By the denition of the setC n , the condition min kdk21; d0 d T Qd p n implies that Q = 2C n . 
Therefore, by [55, Lemma 4.1], G contains a stable set of size t. To show the converse, we use [55, lemma 4.5]. Suppose that G contains a stable set of size t. By [55, lemma 4.1], Q = 2C n . Moreover, by [55, lemma 4.5] it is far away fromC n . In other words, there exists a > 0 such that kY Qk F >; 8 Y2C n : (B.8) Let d2 arg min d0; kdk21 d T Qd. Ifk dk 2 < 1, we can clearly scale x and reduce the objective value. Hence, k dk 2 = 1. We now dene, e Q = Q + p n I n : By (B.8), we get that e Q = 2C n . If d2 arg min d0; kdk21 d T e Qd; (B.9) then we have, min d0; kdk21 d T Qd = d T Q d = d T e Q d p n k dk 2 = min d0; kdk21 d T e Qd p n p n 122 where the third equality holds by (B.9). This directly implies ii. To complete the proof, we still need to show that d2 arg min d0; kdk21 d T e Qd: Assume the contrary, then we can dene e d6= d to be the minimizer e d2 arg min d0; kdk1 d T e Qd: By the same previous argument,k e dk 2 = 1. Hence, e d T e Q e d = e d T Q e d + p n k e dk 2 2 = e d T Q e d + p n d T Q d + p n = d T Q d + p n k dk 2 2 ; where the rst and last equalities hold sincek dk 2 =k e dk 2 = 1, and the inequality holds by denition of d. Hence, e d is not an optimal solution for min d0; kdk1 d T e Qd; thus arriving to a contradiction. Theorem 3.3.1 is directly implied by Lemma B.3.1 and [55, Theorem 4.2] which shows that checking whether a graph G contains a stable set is an NP-Hard problem. B.4 Proof of Theorem 3.4.2 Proof. We rst show the following reduction bound in the objective value f(x k )f(x k+1 ) max X 2 k 2 ~ L ; 2 3 k 3~ 2 : First of all, notice that if Step 6 is reached, then clearly the reduction bound is satised. Otherwise, x k+1 is set in Step 9 or Step 11. 123 If Step 9 is reached, then using descent lemma [21, Appendix A.24], we obtain f(x k+1 )f(x k ) +hrf(x k ); x k+1 x k i + L 2 kx k+1 x k k 2 2 f(x k ) +hrf(x k ); x k+1 x k i + ~ L 2 kx k+1 x k k 2 2 =f(x k ) + X k ~ L hrf(x k );b s k i + X 2 k 2 ~ L kb s k k 2 2 f(x k ) X 2 k ~ L + X 2 k 2 ~ L =f(x k ) X 2 k 2 ~ L : (B.10) Otherwise, if Step 11 is reached, then using second-order descent lemma [21], we obtain f(x k+1 )f(x k ) +hrf(x k ); x k+1 x k i + 1 2 (x k+1 x) T r 2 f(x k )(x k+1 x k ) + 6 kx k+1 x k k 3 2 f(x k ) +hrf(x k ); x k+1 x k i + 1 2 (x k+1 x k ) T r 2 f(x k )(x k+1 x k ) + ~ 6 kx k+1 x k k 3 2 =f(x k ) + 2 k ~ hrf(x k ); b d k i + 2 2 k ~ 2 ( b d k ) T r 2 f(x k )( b d k ) + 4 3 k 3~ 2 k b d k k 3 2 f(x k ) 2 3 k ~ 2 + 4 3 k 3~ 2 =f(x k ) 2 3 k 3~ 2 ; (B.11) where the fourth inequality holds sincehrf(x k ); b d k i 0,k b d k k 1, and k =( b d k ) T r 2 f(x k ) b d k . Combin- ing (B.10) and (B.11) with Step 8, we obtain the following reduction in the objective value f(x k+1 )f(x k ) max X 2 k 2 ~ L ; 2 3 k 3~ 2 : (B.12) By summing over the iterations, we get f(x `+1 )f(x 0 ) = ` X k=0 f(x k+1 )f(x k ) ` X k=0 max X 2 k 2 ~ L ; 2 3 k 3~ 2 : (B.13) Hence, since f is bounded below by f min , we have 0 ` X k=0 max X 2 k 2 ~ L ; 2 3 k 3~ 2 f(x 0 )f min : 124 Thus, lim k!1 X k = lim k!1 k = 0: Moreover, the continuity of the functionsX () and () implies that every limit point of the iterates is a second order stationary point. B.5 Proof of Theorem 3.4.3 Proof. First notice that the sucient decrease bound (B.13) implies that ` X k=0 max X 2 k 2 ~ L ; 2 3 k 3~ 2 f(x 0 )f min ; (B.14) for every iteration `. 
Dene the index sets G( g ),fkjX k > g g and H( H ),fkj k > H g: According to the bound (B.14), it is easy to show that the cardinality of the above two sets is bounded by G( g ) 2 ~ L(f(x 0 )f min ) 2 g G( g )[H( H ) f(x 0 )f min min ( 2 g 2 ~ L ; 2 3 H 3~ 2 ): (B.15) 125 Appendix C Proofs for results in Section 3.5 Consider the following optimization problem minimize x2P f(x); (C.1) whereP ,fx2 R n j Ax bg is a polyhedron with nite number of linear constraints. In this section we generalize results from [49] to adapt for the linear constraints. For the sake of completeness of the manuscript, some Lemmas and proofs are restated from [49]. Recall the sub-problem Q k with trust region k , Q k , min s q k (s),f k + g T k s + 1 2 s T H k s; subject to 8 > < > : As b Ax k ksk 2 k : Let C k be the multiplier corresponding to the linear constraint As b Ax k , and k be the multiplier for the trust region constraintksk 2 k . The rst order K.K.T optimality conditions for the above problem are state below [21] g k + (H k + k I)s k + A T C k = 0; (C.2) 0 C k ? b Ax k As k 0; (C.3) 0 k ? k ks k k 2 2 0: (C.4) C.1 Proof of Theorem 3.5.2 To show convergence to rst-order stationarity, we rst provide in Lemma C.1.1 a sucient decrease condi- tion. Then, in Lemma C.1.7 we show that the number of accepted stepsjAj is innite. Combining these two 126 results with the assumption that f is lower bounded, we get the desired convergence result. In practice, the algorithm terminates whenX k is below a prescribed positive threshold > 0. Hence, we assume, without loss of generality thatX k for all k2N. Lemma C.1.1. For any k2N, the trial step s k and dual variable k satisfy f k q k (s k ) 1 2 s T k (H k + k I)s k + 1 2 k ks k k 2 2 : (C.5) In addition, for any k2N + , the trial step s k satises f k q k (s k )CX k min n k ; X k kH k k 2 ; 1 o : (C.6) Proof. By denition of q k , f k q k (s k ) =g T k s k 1 2 s T k H k s k = s T k H k s k + k ks k k 2 2 + s T k A T ( C k ) 1 2 s T k H k s k = 1 2 s T k (H k + k I)s k + 1 2 k ks k k 2 + s T k A T ( C k ) 1 2 s T k (H k + k I)s k + 1 2 k ks k k 2 ; (C.7) where the second equality follows by KKT condition (C.2), and the last inequality follows from the feasibility of x k and the complementary slackness (C.3). Also, using [46, Theorem 12.2.2], we obtain f k q k (s k )CX k min n k ; X k kH k k 2 ; 1 o : To prove the innite cardinality of the setA, we need some intermediate Lemmas. The next result shows that the trust region radius is reduced when the CONTRACT subroutine is called. Lemma C.1.2. For any k2N, if k2C, then k+1 < k . Proof. Suppose that k2C. We prove the result by considering the various cases that may occur within the CONTRACT subroutine. If Step 23 is reached, the subroutine returns k+1 = C ks k k 2 < k . Otherwise, if Step 6 is reached, the subroutine returns k+1 =ksk 2 where s solves Q k () for . Hence, k+1 =ksk 2 <ks k k 2 k ; 127 where the strict inequality follows from Step 3. Similarly, if Step 9 is reached, the subroutine returns k+1 =k sk 2 where s solves Q k ( ). Hence, k+1 =k sk 2 <ks k k 2 k : Otherwise, Step 21 is reached. In which case, the subroutine returns k+1 =ksk 2 where s solves Q k () for > k . The result follows using the while loop condition Step 17 along with the inverse relationship of andksk. We now show that for all iterationsk, the trust region region radius k is upper bounded by a non-decreasing sequencef k g. Also, if k2A[E, we show that k+1 k . Lemma C.1.3. For any k2N, there holds k k k+1 . Moreover, k+1 k for all k2A[E. Proof. 
The fact that k k+1 for all k2 N follows from the computations in Steps 7, 12, and 16 of Algorithm 2. It remains to show that k k for all k2N. We prove the result by means of induction. The inequality holds for k = 0 by the initialization of quantities in Step 2 of Algorithm 2. Assume the induction hypothesis holds for iteration k. By the computations in Steps 7, 8, 16, 17 and by Lemma C.1.2, the result holds for iteration k + 1. We next show that k+1 k for all k2A[E. Suppose k2A. It follows from Steps 7 and 8 that k+1 = minfmaxf k ; E ks k k 2 g; maxf k ; E ks k k 2 gg k : Here the inequality follows since k k k+1 . Now suppose k2E. By the conditions indicated in Step 14, we have k > k ks k k 2 0. It follows by (C.4) thatks k k 2 = k . We obtain k+1 = minf k+1 ; k = k g minf k ;ks k k 2 g = k ; where the inequality follows since k k k+1 . The next result, shows that we cannot have two consecutive expansion steps. Lemma C.1.4. For any k2N, if k2C[E, then k + 1 = 2E. Proof. Observe that if k+1 = 0, then conditions in Steps 14 of Algorithm 2 ensure that (k + 1) = 2E. Thus, by (C.4), we may proceed under the assumptions thatks k+1 k 2 = k+1 and k+1 > 0. Suppose that k2C, i.e. k <. It follows that Step 22 sets k+1 k+1 =ks k+1 k 2 . Therefore, if k+1 , we have (k + 1)2A. Otherwise, k+1 <; which implies that (k + 1)2C. 128 Now suppose that k2E. It follows that k > k ks k k 2 ; k+1 = minf k ; k = k g; and k+1 = k : (C.8) Combined with (C.4), we getks k k 2 = k . We now consider two dierent cases: 1. Suppose k k = k . It follows from (C.8) that k+1 = k = k >ks k k 2 = k : (C.9) Therefore, by the relationship between the trust region radius and its corresponding multiplier, we get k+1 k . Combined with (C.8) and (C.9), we obtain k+1 k = k k+1 = k+1 ks k+1 k 2 : Hence (k + 1) = 2E. 2. Suppose k < k = k . Using (C.8) ks k+1 k 2 = k+1 = k = k+1 ; where the last equality holds by Step 16. If k+1 , then (k + 1)2A A. Otherwise, k+1 <, from which it follows that (k + 1)2C. Hence, in both cases (k + 1) = 2E. Next, we show that if the dual variable for the trust region constraint k is suciently large, then the constraint is active and the sucient decrease criteria is met. Lemma C.1.5. For any k2N, if the trial step s k and dual variable k satisfy k g Lip +H max +ks k k 2 ; (C.10) thenks k k 2 = k and k . Proof. By the denition of the objective function of the model q k , there exists a point x k 2R n on the line segment [x k ; x k + s k ] such that q k (s k )f(x k + s k ) = g k g( x k ) T s k + 1 2 s T k H k s k kg k g( x k )k 2 ks k k 2 1 2 kH k k 2 ks k k 2 2 : (C.11) 129 Therefore, f k f(x k + s k ) =f k q k (s k ) +q k (s k )f(x k + s k ) 1 2 s T k H k s k + k ks k k 2 2 kg k g( x k )k 2 ks k k 2 1 2 kH k k 2 ks k k 2 2 kH k k 2 ks k k 2 2 + k ks k k 2 2 g Lip ks k k 2 2 ( k g Lip H max )ks k k 2 2 ks k k 3 2 : Here the rst inequality holds from Lemma C.1.1 and expression (C.11). The resultks k k = k follows directly from (C.10) and (C.4). We now use the previous results to show that if from some iteration onward, all the steps are contraction steps, then the sequence of trust region radii converge to zero, and the sequence of dual variables converge to innity. Lemma C.1.6. If k2C for all kk 0 , thenf k g! 0 andf k g!1. Proof. Assume, without loss of generality, that k2C for all k2 N. It follows from Lemma C.1.2 that f k g is monotonically strictly decreasing. Combined with the fact thatf k g is bounded below by zero, we have thatf k g converges. 
We may now observe that if Step 23 of the CONTRACT subroutine is reached innitely often, then clearly,f k g! 0. Hence, it follows by the relationship between the trust region radius and its corresponding multiplier thatf k g!1. Therefore, let us assume that Step 23 of the CONTRACT subroutine does not occur innitely often, i.e., that there exists k C 2N such that Step 6, 9, or 21 is reached for all kk C . Consider iteration k C . Steps 2, 4, 13, 15, 18 in the CONTRACT subroutine will set k+1 = minf k + k ; k g> k for all kk C + 1: Therefore, sincek2C for allkk C , we have x k = x k C (and soX k =X k C ) for allkk C , which implies that f k g!1. It follows by the relationship between the trust region radius and its corresponding multiplier thatks k k 2 = k ! 0. We now prove that the set of accepted steps is innite. Lemma C.1.7. The setA has innite cardinality. Proof. To derive a contradiction, suppose thatjAj <1. We claim that this impliesjCj =1. Indeed, if jCj <1, then there exist some k E 2 N such that k2E for all k k E , which contradicts Lemma C.1.4. Thus,jCj =1. Combining this with the result of Lemma C.1.4, we conclude that there exists somek C 2N + 130 such that k2C for all k k C . It follows from Lemma C.1.6 thatfks k k 2 gf k g! 0 andf k g!1. In combination with Lemma C.1.5, we conclude that there exists some k k C such that k , which contradicts the fact that k2C for all kk C . Having arrived at a contradiction under the supposition that jAj<1, the result follows. We now provide an upper bound for the sequencef k g and the trial stepsfs k g. Moreover, we show that the number ofA steps computed by the algorithm is nite. Lemma C.1.8. There exists a scalar constant > 0 and k A 2 N, such that k = for all k k A . Moreover, the setA has nite cardinality, and there exists a scalar constant s max > 0 such thatks k k 2 s max for all k2N. Proof. For all k2A, we have k , which implies by Step 6 of Algorithm 2 that f(x k )f(x k+1 )ks k k 3 2 : Combining this with Lemma C.1.7 and the fact that f is bounded below, it follows thatfs k g k2A ! 0. In particular, there exists k A 2N such that for all k2A with kk A , we have E ks k k 2 0 k ; (C.12) where the latter inequality follows from Lemma C.1.3. Combined with the update in Steps 7, 12 and 16 of LC-TRACE, we get k+1 = k for all kk A : This proves the rst part of the lemma. The second part also follows from (C.12) which implies that ks k k 2 < k for all k2A with k k A . Finally, the last part of the lemma follows from the rst part and the fact that Lemma C.1.3 ensuresks k k 2 k k = for all suciently large k2N. We now show that there exits a uniform upper bound on the termkg k + A T C k k 2 . Lemma C.1.9. For all k2N,kg k + A T C k k 2 G max , where G max > 0 is a constant scalar. Moreover, k maxf 0 ; max g 8k2N; where max , maxfg Lip + 2H max + ( +) + (g max ) 1=2 ; (g Lip +H max +)g. Proof. By (C.2) and Lemma C.1.3, kg k + A T C k k 2 =kH k s k + k s k k 2 (H max + k ) k (H max + k ): 131 Thus, it suces to nd a constant upper bound for k to get the desired result. Ifks k+1 k 2 < k+1 , then by (C.4), k+1 = 0. Therefore, we may proceed under the assumption that ks k+1 k 2 = k+1 . Suppose k2C, then by Lemma C.1.5, k <g Lip +H max +. If Step 3 in the CONTRACT subroutine tests true, we get k+1 k +H max + k + (X k ) 1=2 g Lip + 2H max + ( +) + (g max ) 1=2 : (C.13) Otherwise, if Step 3 tests false, we claim that k+1 (g Lip +H max +): (C.14) To show our claim, we assume the contrary, i.e. 
k+1 > (g Lip +H max +): Then the condition of the while loop in Step 17 of the CONTRACT subroutine tested true for some ^ s being a solution of Q k ( ^ ) for ^ g Lip +H max +. There exist ^ x on the line segment [x k ; x k + ^ s] such that q k (^ s)f(^ x + ^ s) = g k g(^ x k ) T s k + 1 2 ^ s T H k ^ sg Lip k^ sk 2 2 1 2 H max k^ sk 2 2 : (C.15) Therefore, f(^ x)f(^ x + ^ s) k^ sk 3 2 = f(^ x)q k (^ s) +q k (^ s)f(^ x + ^ s) k^ sk 3 2 kH k k 2 + 2 ^ 2g Lip H max 2k^ sk 2 ^ g Lip H max k^ sk 2 ; where the rst inequality holds by Lemma C.1.1 and (C.15). Since k = f k f(x k + s k ) ks k k 3 2 <; it follows thatk^ sk 2 6=ks k k 2 which contradicts the condition of the while loop in Step 17 which tested true for ^ s generated by solving Q k ( ^ ). 132 Combining (C.13) and (C.14), we get that for all k2C k+1 max ; (C.16) where max , maxfg Lip + 2H max + ( +) + (g max ) 1=2 ; (g Lip +H max +)g. Now, suppose that k2A[E. By Lemma C.1.3, we haveks k+1 k 2 = k+1 k ks k k 2 : Hence, by the relationship between the trust region radius and its corresponding multiplier, we obtain k+1 k : (C.17) Let k C , minfk2Njk2Cg be the rst contract step. By (C.17), k 0 for all kk C . Moreover, using (C.16) and (C.17), k max 8k>k C : Combining these results yield k maxf 0 ; max g 8k2N; which completes the proof. Notice that in the proof of Lemma C.1.9, we have shown that there exists a uniform upper bound for the dual variables k . Our next result shows that the ratio X k ksk 2 is upper bounded by C min + k , where C min is a scalar constant. Lemma C.1.10. For any k2N, it holds X k (C min + k )ks k k 2 ; (C.18) where C min ,H max +G max +g max is a scalar constant. Proof. Let k;1 be the largest singular value of H k . For all d satisfying Ad b Ax k , we have g T k d =d T (H k + k I)s T k ( C k ) T Ad d T (H k + k I)s T k ( C k ) T As k ( k;1 + k )kdk 2 ks k k 2 ( C k ) T As k ; (C.19) where the rst equality holds by (C.2), and the rst inequality holds by complementary slackness (C.3). 133 Minimizing over all such d, we obtain min AdbAx k ;kdk1 g T k d( k;1 + k )ks k k 2 k( C k ) T Ak 2 ks k k 2 ( k;1 + k )ks k k 2 (kg k + ( C k ) T Ak 2 +kg k k 2 )ks k k 2 (H max + k )ks k k 2 (G max +g max )ks k k 2 ; where the last inequality uses Lemma C.1.9. Then denition ofX k yields X k (H max + k +G max +g max )ks k k 2 = (C min + k )ks k k 2 ; where C min ,H max +G max +g max is a scalar constant. We now show that the limit inferior of stationarity measureX k is equal to zero. Lemma C.1.11. There holds lim inf k2N;k!1 X k = 0: Proof. Suppose the contrary that there exists a scalar constantX min > 0 such thatX k X min for allk2N. Then by Lemmas C.1.9 and C.1.10, for scalar s min = X min C min + maxf max ; 0 g ; we have thatks k k 2 s min > 0 for allk2N. Moreover, for allk2A we havef k f k+1 ks k k 3 2 > 0. Given the lower boundedness off and Lemma C.1.7 that insures innite cardinality of setA, we havefs k g k2A ! 0. This contradicts the existence of s min > 0. Theorem C.1.12. Under Assumption 3.5.1, it holds that lim k2N;k!1 X k = 0: (C.20) Proof. Suppose the contrary that (C.20) does not hold. Combined with Lemmas C.1.7 and C.1.11, it implies that there exist an innite sub-sequenceft i gA (indexed overi2N) such thatX ti 2 X for some X > 0 and alli2N. 
Additionally, Lemmas C.1.7 and C.1.11 imply that there exist an innite subsequencefl i gA such that X k X andX li < X 8 i2N; k2N; t i k<l i : (C.21) 134 We claim that for all k2N + , the trial step s k satises the following ks k k 2 min n k ; X k C min o : (C.22) The proof of this claim follows directly from Lemma C.1.10. If ks k k 2 = k , the result trivially holds. Otherwise, using KKT condition (C.4), k = 0 which proves our claim when combined with Lemma C.1.10. We now restrict our attention to indices in the innite index set K,fk2A :t i k<l i for some i2Ng: Observe from (C.21) and (C.22) that f k f k+1 ks k k 3 2 min n k ; X C min o 3 : (C.23) Sinceff k g is monotonically decreasing and bounded below, we know that f k ! f for some f2 R. When combined with (C.23), we obtain lim k2K;k!1 k = 0: (C.24) Using this fact and Lemma C.1.1, we have for all suciently large k2K that f k f k+1 =f k q k (s k ) +q k (s k )f k+1 CX k min n k ; X k kH k k 2 ; 1 o (g Lip + 1 2 H max )ks k k 2 2 C X min n k ; X H max ; 1 o (g Lip + 1 2 H max )ks k k 2 2 C X k (g Lip + 1 2 H max ) 2 k C 2 X k : Consequently, for all suciently large i2N, we have kx ti x li k 2 li1 X k2K;k=ti kx k x k+1 k 2 li1 X k2K;k=ti k li1 X k2K;k=ti 2 C X (f k f k+1 ) = 2 C X (f ti f li ): Sinceff ti f li g! 0, we getfkx ti x li k 2 g! 0, which, in turn, implies thatfX ti X li g! 0. This contradicts (C.21). 135 C.2 Proof of Theorem 3.5.4 In this section we show that the number of iterations required to reach an -rst order stationary point is O( 3=2 log 3 1 ). To that end, we start by showing the desired model decrease using Assumption 3.5.3. Lemma C.2.1. Consider the directions s, s + and points x = x k + s, x + = x k + s + . If for some 2 (0; 1] f k q k (s) k ksk 2 2 ; (C.25) q k (s + )q k (s); (C.26) g T k (s + s) + (s + ) T H k (s + s) k (s + ) T (s + s); (C.27) g T k (s + s) + s T H k (s + s) k s T (s + s); (C.28) then f k q k (s + ) 1 3 k ks + k 2 2 : Proof. Suppose that for a given constant scalar2 (0; 1),ksk 2 ks + k 2 . Then, it directly follows by (C.25) and (C.26) that f k q k (s + ) =f k q k (s) +q k (s)q k (s + ) k ksk 2 2 2 k ks + k 2 2 : (C.29) Now consider the case thatksk 2 <ks + k 2 . First note that by (C.27), 0 (g k + H k s + ) T (s + s) + k (s + ) T (s + s) = (g k + H k s) T (s + s) + (s + s) T H k (s + s) + k (s + ) T (s + s): (C.30) Also, by (C.28) (g k + H k s) T (s + s) + k s T (s + s) 0: (C.31) Adding (C.30) and (C.31), we get q k (s + )q k (s) = (g k + H k s) T (s + s) + 1 2 (s + s) T H k (s + s) 1 2 k (ks + k 2 2 ksk 2 2 ): (C.32) Sinceksk 2 <ks + k 2 , it follows that f k q k (s + )q k (s)q k (s + ) 1 2 k (ks + k 2 2 ksk 2 2 ) 1 2 k ks + k 2 2 (1 2 ); (C.33) where the rst inequality holds by (C.25), the second inequality holds by (C.32), and the last inequality 136 holds becauseksk 2 <ks + k 2 . We now choose the value of for which the lower bounds (C.29) and (C.33) are equal; i.e. 2 = 1 2 1 2 , equivalently = r 1 3 . We next show that sucient model decrease is satised when either g T k s k 0 or s T k H k s k 0. Lemma C.2.2. Suppose that g T k s k 0 or s T k H k s k 0. Then, f k q k (s k ) =g T k s k 1 2 s T k H k s k 1 2 k ks k k 2 2 : (C.34) Proof. First notice that since the origin is feasible in Q k , g T k s k + 1 2 s T k H k s k 0: Hence, s T k H k s k 0) g T k s k 0: On the other hand, if g T k s k 0, by (C.2), 2 g T k s k + 1 2 s T k H k s k + 1 2 k ks k k 2 2 = g T k s k s T k A T C k g T k s k 0; (C.35) where the rst inequality holds due to the complementary slackness condition (C.3). Lemma C.2.3. 
Suppose Assumption 3.5.3 holds at iteration k. Then there exist a constant > 0 indepen- dent of k such that f k q k (s k ) k ks k k 2 2 : Proof. If g T k s k 0 or s T k H k s k 0, the result follows by Lemma C.2.2. Thus, we may proceed under the assumption that g T k s k 0 and s T k H k s k 0. Using Assumption 3.5.3, we proceed with a proof by induction onl k . Ifl k = 1, the last condition in Assumption 3.5.3 implies that g T k s k 0, thus Lemma C.2.2 implies the desired result. Now assume the result holds for l k =i< l, we next show that the result holds for l k =i + 1. By the induction step and Assumption 3.5.3, we have f k q k (s k;i ) k ks k;i k 2 2 ; q k (s k;i+1 )q k (s k;i ); g k + H k s k;i+1 ; x k;i+1 x k;i k s T k;i+1 (x k;i+1 x k;i+1 ); g k + H k s k;i ; x k;i+1 x k;i k s T k;i (x k;i+1 x k;i+1 ): 137 Then using Lemma C.2.1, with x = x k;i and x + = x k;i+1 we obtain f k q k (s k;i+1 ) + k ks k;i+1 k 2 2 ; for some + 2 (0; 1) independent of k. Our next result provides a bound on the ratio k+1 =ks k+1 k 2 when k2C. Lemma C.2.4. Assume Assumption 3.5.3 holds at iteration k2C. Then, If Step 6, 9, or 21 of Algorithm 3 is reached, then k+1 ks k+1 k 2 max n ; C H Lip + 2 2 o : If Step 23 of Algorithm 3 is reached, then k+1 ks k+1 k 2 max n ; C H Lip + 2 2 o : Proof. Let k2C and consider the three possible cases. The rst two correspond to situations in which the conditions in Step 3 in the CONTRACT subroutine tests true. Suppose that Step 6 is reached. Then, k+1 =ksk 2 where (; s) is computed in Step 4. It follows that Step 22 in Algorithm 2 will then produce the primal-dual pair (s k+1 ; k+1 ) = (s;) with > 0. Since the condition in Step 5 tested true, we have k + k k k+1 ks k+1 k 2 = ksk 2 ; (C.36) where the second inequality holds sinceks k+1 k 2 = k+1 ks k k 2 k . Suppose that Step 9 is reached. Then, k+1 =k sk 2 where ( ; s) is computed in Step 2. Similar to the previous case, it follows that Step 22 in Algorithm 2 will produce the primal-dual pair (s k+1 ; k+1 ) = ( s; ) with = k + k . We rst show thatk sk 2 . Assume the contrary, then by Lemma C.1.10 and the fact thatX k+1 =X k for all k2C, X k (C min +maxf max ; 0 g)k sk 2 (C min +maxf max ; 0 g) =; which contradicts our assumption onX k . Here the last inequality uses the denition (3.25) of . Combined with Lemma C.1.3 we obtain k sk 2 ks k k 2 k : 138 Therefore, k + k k k sk 2 k + k (ks k k 2 + k ) 2 k ; (C.37) where the fourth inequality holds by the condition of Step 3 and the last inequality holds by Lemmas C.1.3, C.1.8 and the denition of in 3.25. The other case correspond to situations in which the condition in Step 3 tests false. It follows by Steps 2 and 3 that ksk 2 ; (C.38) Finally, using the argument of Lemma C.1.9, we claim that k+1 H Lip + 2 2 ks k k 2 : (C.39) To show our claim, we assume the contrary, i.e. k+1 > H Lip + 2 2 ks k k 2 : Then the condition of the while loop in Step 17 tested true for some ^ s computed by solving Q k ( ^ ) for ^ ( H Lip + 2 2 )k^ sk 2 . There exists ^ x on the line segment [x k ; x k + ^ s] such that q k (^ s)f(^ x + ^ s) = 1 2 ^ s T H k H(x k ) ^ s 1 2 H Lip k^ sk 3 2 : (C.40) Therefore, ^ ff(^ x + ^ s) k^ sk 3 2 = ^ fq k (^ s) +q k (^ s)f(^ x + ^ s) k^ sk 3 2 ^ k^ sk 2 2 0:5H Lip k^ sk 3 2 k^ sk 3 2 ks k k 2 k^ sk 2 ; where the rst inequality holds by Lemma C.2.3 and (C.40), and the last inequality holds sinceks k k 2 k^ sk 2 . However k = f k f(x k + s k ) ks k k 3 2 <. 
It follows thatk^ sk 2 6=ks k k 2 which contradicts the condition of the while loop in Step 17 for ^ s computed by solving Q k ( ^ ). Suppose that Step 21 is reached. Then, k+1 =ksk 2 . It follows that Step 22 in Algorithm 2 will produce the primal-dual pair (s k+1 ; k+1 ) solving Q k+1 such that s k+1 = s and k+1 =. In conjunction with 139 (C.38), (C.39), and the condition in Step 20 of the CONTRACT sub-routine, we observe that k+1 ks k+1 k 2 = ksk 2 C H Lip + 2 2 (C.41) Suppose that Step 23 is reached. Then, k+1 = C ks k k 2 . It follows that Step 22 in Algorithm 2 will produce the primal-dual pair (s k+1 ; k+1 ) = (s;). Ifksk 2 < k+1 = C ks k k 2 , then k+1 = 0. Otherwise,ksk 2 = k+1 = C ks k k 2 and k+1 > 0. Combined with (C.38) and (C.39), we obtain k+1 ks k+1 k 2 = ksk 2 C H Lip + 2 2 : The result follows since we have obtained the desired inequalities in all cases. We now provide an upper bound for the sequencef max g. Lemma C.2.5. Assume Assumption 3.5.3 holds. There exists a scalar constant max > 0 such that k max 8 k2N: Proof. First note that by Lemma C.1.8, the cardinality of the setA is nite. Hence, there exist k A 2N such thatk = 2A for allkk A . We continue by showing that k is upper bounded for allkk A . Consider the following three cases: If k2A , then by denition k k ks k k 2 , which implies by Step 9 of Algorithm 2 that k+1 = maxf k ; k =ks k k 2 g = k : If k2C, by Step 22 of Algorithm 2 and Lemma C.2.4, it follows that k+1 = max n k ; k+1 ks k+1 k 2 o max n k ;; C H Lip + 2 2 o : (C.42) If k2E, then Step 18 of Algorithm 2 implies that k+1 = k . Combining the results of these three cases, the desired result follows. We now establish an upper bound on the norm trial steps s k when k2A . Lemma C.2.6. For all k2A , the accepted step s k satises ks k k 2 (H Lip + max ) 1=2 X 1=2 k+1 : 140 Proof. For all k2A , there exists x k on the line segment [x k ; x k + s k ] such that g T k+1 d = g T k+1 d g T k d d T (H k + k I)s k d T A T C k = g(x k + s k ) g k T d d T (H k + k I)s k d T A T C k = d T (H( x k ) H k )s k k d T s k d T A T C k H Lip ks k k 2 2 kdk 2 k ks k k 2 kdk 2 d T A T C k H Lip ks k k 2 2 k ks k k 2 2 d T A T C k for all d withkdk 2 1; where the rst equation follows from (C.2) and the last inequality follows since k k ks k k 2 for allk2A . Thus min s.t. d2D k+1 g T k+1 dH Lip ks k k 2 2 k ks k k 2 2 max s.t. d2D k+1 d T A T C k ; (C.43) whereD k+1 ,fd2 R n jkdk 2 1; Ad b Ax k+1 g. Note that since k2A the updated step will be x k+1 = x k + s k . Now, letI k ,fij a T i s k = b i a T i x k g, then A I k d b I k A I k x k+1 = b I k A I k x k A I k s k = 0) ( C k ) T Ad = ( C k ) T I k A I k d 0; for all d2D k+1 . Substituting in (C.43), we obtain X k+1 H Lip ks k k 2 2 + k ks k k 2 2 ; which along with Lemma C.2.5, implies the result. We are now ready to compute a worst-case upper bound on the number of steps inA for which the rst-order criticality measureX k is larger than a prescribed > 0. Lemma C.2.7. Assume Assumption 3.5.3 holds. For a scalar 2 (0;1), the total number of elements in the index set K ,fk2N + :k 1; (k 1)2A ;X k >g is at most & f 0 f min (H Lip + max ) 3=2 ! 3=2 ' ,K () 0: (C.44) Proof. By Lemma C.2.6, we have for all k2K that f k1 f k ks k1 k 3 2 (H Lip + max ) 3=2 X 3=2 k (H Lip + max ) 3=2 3=2 : 141 In addition, we have by Theorem C.1.12 thatjK j<1. Hence, we have that f 0 f min X k2K (f k1 f k )jK j(H Lip + max ) 3=2 3=2 : Rearranging this inequality to yield an upper bound forjK j we obtain the desired result. 
It remains to compute a worst-case upper bound for the number of iterations inA for whichX k is larger than a prescribed > 0, and the number of contraction and expansion iterations that may occur between two acceptance steps. We compute these bounds separately in Lemmas C.2.8 and C.2.11. Lemma C.2.8. The cardinality of the setA is upper-bounded by & f 0 f min 3 0 ' ,K 0: (C.45) Proof. For all k2A , it follows by Lemma C.1.3 that f k f k+1 ks k k 3 2 = 3 k 3 0 : Hence, we have that f 0 f min 1 X k=0 (f k f k+1 ) X k2A (f k f k+1 )jA j 3 0 ; from which the desired result follows. So far, we have obtained upper-bound on the number of acceptance iterations. To obtain upper-bounds on the number of contraction and expansion iterations that may occur until the next accepted step, let us dene, for a given ^ k2A, k A ( ^ k), minfk2A :k> ^ kg; I( ^ k),fk2N + : ^ k<k<k A ( ^ k)g; Using this notation, the following result shows that the number of expansion iterations between the rst iteration and the rst accepted step, or between consecutive accepted steps, is never greater than one. More- over, when such an expansion iteration occurs, it must take place immediately. Lemma C.2.9. For any ^ k2N + , if ^ k2A; thenE\I( ^ k)f ^ k + 1g. 142 Proof. By the denition of k A ( ^ k), we have under the conditions of the lemma thatI( ^ k)\A =;, which means thatI( ^ k)C[E. It then follows from Lemma C.1.4 that thatE\I( ^ k)f ^ k + 1g, as desired. Lemma C.2.10. For any k2N + , if k2C and Step 21 in Algorithm 3 is reached, then k+1 ks k+1 k 2 k ks k k 2 : Proof. If Step 21 is reached, thenksk 2 C ks k k 2 . It follows that Step 22 in Algorithm 2 will produce the primal-dual pair (s k+1 ; k+1 ) solving Q k+1 such thatks k+1 k 2 = k+1 <ks k k 2 = k and k+1 k , i.e., k+1 ks k+1 k 2 k ks k k 2 : (C.46) Lemma C.2.11. Assume Assumptions 3.5.1 and 3.5.3 hold. For any ^ k2N + , if ^ k2A, then jC\I( ^ k)jK C ; where K C , 1 + & 2 + 1 log( ) log max log 1 (C min + maxf max ; 0 ) log(1= C ) ' : Proof. The result holds trivially ifjC\I( ^ k)j = 0. Thus, we may assumejC\I( ^ k)j 1. To proceed with our proof, we rst claim that the number of iterations k2C\I( ^ k) with Step 23 in CONTRACT sub-routine reached, we denote by K C;1 , satises K C;1 log 1 (C min + maxf max ; 0 log(1= C ) : By Lemma C.1.10, and the assumption thatX k A ( ^ k) (optimality not reached yet), (C min + maxf max ; 0 g) 1 ks k A ( ^ k)1 k 2 k A ( ^ k)1 K C;1 C ; where the last inequality holds by Lemma C.1.3, Lemma C.1.2, and the fact that each time Step 23 is reached the radius of the trust region is multiplied by C . The proof of the claim follows by rearranging the inequality to yield an upper bound for K C;1 . It remains to compute the number of iterations between Steps in k2C\I( ^ k) for which Step 23 is reached. 143 For a given ^ k2A, We dene I C ( ^ k),fk2C\I( ^ k) : Step 23 is reached in Algorithm 3g which correspond to indices inC\I( ^ k) for which step 23 is reached. Let k C;1 ( ^ k) and k C;2 ( ^ k) be any two consecutive indices inI C ( ^ k) with k C;2 ( ^ k)k C;1 ( ^ k)> 2. By Lemma C.2.4, we have k ks k k 2 ; 8k C;1 ( ^ k) + 2kk C;2 ( ^ k); which implies that step 21 of CONTRACT subroutine is reached for every k C;1 ( ^ k) + 2 < k < k C;2 ( ^ k). By Lemmas C.2.5 and C.2.10, we then get max k C;2 ( ^ k) ks k C;2 ( ^ k) k 2 k C;2 ( ^ k)k C;1 ( ^ k)2 : Hence, k C;2 ( ^ k)k C;1 ( ^ k) 1 log( ) log max + 2: Now let k C;last ( ^ k)k C;1 ( ^ k) be the last element ofI C ( ^ k). 
Similarly, we can show that k A ( ^ k)k C;last ( ^ k) 1 log( ) log max + 2: The desired result follows s sincejC\I( ^ k)j = 1 +K C;1 1 log( ) log max + 2 . Notice that since =O( 1 ), the number of contract stepsK C is of orderO(log 2 1 ). Due to the while loop in Step 17 of the CONTRACT sub-routine, completing a contract step may require solving more than one sub-problem. Our next result provides an upper bound on the number of subproblem routine calls required in one contract step. Lemma C.2.12. Assume Assumptions 3.5.1 and 3.5.3 hold. For a scalar 2 (0;1), the total number of sub-problems we are required to solve in a step k2C withX k , is at most K 1 C , log (C min + maxf max ; 0 g H max +g Lip + : 144 Proof. We prove the result by contradiction. Assume the contrary, k log Cmin+maxfmax;0g H max +g Lip + ks k k 2 C min + maxf max ; 0 g H max +g Lip + H max +g Lip +; where the last inequality holds by Lemma C.1.10 and the fact thatX k . Hence, by Lemma C.1.5 f(x k + s)f k ksk 3 2 > f(x k + s k )f k ks k k 3 2 ; where s is computed by solving Q k () and the strict inequality holds since k 2 C. We conclude that ks k k 2 6=ksk 2 which contradicts the condition of the while loop. This completes the proof. Notice that from the denition of in the algorithm, K 1 C =O(log 1 ). Theorem C.2.13. Under Assumptions 3.5.1 and 3.5.3, for a scalar2 (0;1), the total number of elements in the index set fk2N + :X k >g is at most K(), 1 + (K () +K )(1 +K C K 1 C ); (C.47) whereK (), K , K 1 C , andK C are dened in Lemmas C.2.7, C.2.8, C.2.11, and C.2.12 respectively. Hence, for g > 0, the number of sub-problem routines required for First-Order-LC-TRACE to nd an g -rst-order stationary point is at most K( g ) =O 3=2 g log 3 1 g . Proof. Without loss of generality, we may assume that at least one iteration is performed. Lemmas C.2.7 and C.2.8 guarantee that the total number of elements in the index setfk2Ajk 1;X k >g is at most K () +K . Also, immediately prior to each of the corresponding accepted steps, Lemmas C.2.9, C.2.11, and C.2.12 guarantee that at most 1 +K C K 1 C sub-problem routine calls are required in expansion and contraction. Accounting for the rst iteration, the desired result follows. C.3 Proof of Theorem 3.6.1 . Proof. Let us rst dene K ,fk2N + :k 1; (k 1)2A ;X k >g 145 and V,fkj Step 7 in Algorithm 4 is reached at iteration kg: To proceed with our proof, we rst show that if k2V[K , the following reduction bound on the objective value holds f k1 f k+1 min ( X 3=2 k (H Lip + max ) 3=2 ; 2 3 k 3 ~ H 2 ) (C.48) If k2V, then using second-order descent lemma, we obtain f k+1 f k +hg k ; x k+1 x k i + 1 2 (x k+1 x) T H k (x k+1 x k ) + H Lip 6 kx k+1 x k k 3 2 f k +hg k ; x k+1 x k i + 1 2 (x k+1 x k ) T H k (x k+1 x k ) + ~ H 6 kx k+1 x k k 3 2 =f k + 2 k ~ H hg k ; b d k i + 2 2 k ~ H 2 ( b d k ) T H k ( b d k ) + 4 3 k 3 ~ H 2 k b d k k 3 2 f k 2 3 k ~ H 2 + 4 3 k 3 ~ H 2 =f k 2 3 k 3 ~ H 2 : (C.49) Now if a rst-order stationary point was reached at k 1, i.e. k 12V, it directly follows from (C.49) that f k1 f k+1 f k f k+1 2 3 k 3 ~ H 2 ; 8 k2V; k 12V: (C.50) Otherwise, Step 5 of Algorithm 4 was reached at iteration k 1 and First-Order-LC-TRACE was called. The former algorithm by denition is a monotone algorithm, i.e. it generates a sequence of iterates for which the corresponding sequence of objective values is decreasing. 
This property combined with (C.49) implies that f k1 f k+1 f k f k+1 2 3 k 3 ~ H 2 ; 8 k2V; First-Order-LC-TRACE called at k 1: (C.51) Combining (C.50) and (C.51), we get f k1 f k+1 f k f k+1 2 3 k 3 ~ H 2 ; 8 k2V: (C.52) We next show a lower bound on the reduction of the objective value if k2K . By Lemma C.2.7, we have f k1 f k+1 f k1 f k X 3=2 k (H Lip + max ) 3=2 ; (C.53) where the rst inequality again holds by the monotonicity of First-Order-LC-TRACE and (C.49). Combin- 146 ing (C.52) and (C.53), we get f k1 f k+1 min ( X 3=2 k (H Lip + max ) 3=2 ; 2 3 k 3 ~ H 2 ) ; 8k2K [V: By summing over the iterations we get 2(f 0 f min ) X k2K[V;k1 (f k1 f k+1 )jK [Vj min ( X 3=2 k (H Lip + max ) 3=2 ; 2 3 k 3 ~ H 2 ) : Rearranging this inequality yields jK [Vj 2(f 0 f min ) max ( X 3=2 k (H Lip + max ) 3=2 ; 2 3 k 3 ~ H 2 ) (C.54) LetH( g ; H ) ,fkjX k > g ; k > H g: Using Lemmas C.2.9, C.2.11 and C.2.12, the number of sub- problem routine calls required in expansion and contraction between two acceptance steps is upper bounded by 1 +K C K 1 C . Hence, jH( g ; H )j (jK [Vj +K )(1 +K C K 1 C ); which concludes our proof. 147 Appendix D Proofs for results in Chapter 4 D.1 Proofs for results in Section 4.2 Before proceeding to the proofs of the main results, we need some intermediate lemmas and preliminary denitions. Denition D.1.1. [8] A functionh(x) is said to satisfy the Quadratic Growth (QG) condition with constant > 0 if h(x)h 2 dist(x) 2 ; 8x; where h is the minimum value of the function, and dist(x) is the distance of the point x to the optimal solution set. The following lemma shows that PL implies QG [95]. Lemma D.1.2 (Corollary of Theorem 2 in [95]). If function f is PL with constant , then f satises the quadratic growth condition with constant = 4. The next Lemma shows the stability of arg max f(;) with respect to under PL condition. Lemma D.1.3. Assume thatfh () =f(;)j g is a class of -PL functions in . Dene A() = arg max f(;) and assumeA() is closed. Then for any 1 , 2 and 1 2A( 1 ), there exists an 2 2A( 2 ) such that k 1 2 k L 12 2 k 1 2 k (D.1) Proof. Based on the Lipchitzness of the gradients, we have thatkr f( 2 ; 1 )kL 12 k 1 2 k. Then using 148 the PL condition, we know that g( 2 ) +h 2 ( 1 ) L 2 12 2 k 1 2 k 2 : (D.2) Now we use the result of Lemma D.1.2 to show that there exists 2 = arg min 2A(2) k 1 k 2 2A( 2 ) such that 2k 1 2 k 2 L 2 12 2 k 1 2 k 2 (D.3) re-arranging the terms, we get the desired result that k 1 2 k L 12 2 k 1 2 k: Finally, the following lemma would be useful in the proof of Theorem 4.2.5. Lemma D.1.4 (See Theorem 5 in [95]). Assume h(x) is -PL and L-smooth. Then, by applying gradient descent with step-size 1=L from point x 0 for K iterations we get an x K such that h(x)h 1 L K (h(x 0 )h ); (D.4) where h = min x h(x). We are now ready to prove the results in Section 4.2. D.1.1 Proof of Lemma 4.2.3 Proof. Let 2 arg max 2A f(;). By Lemma D.1.3, for any scalar and direction d, there exists ()2 arg max f( +d;) such that k () k L 12 2 kdk: Using Taylor expansion, we get g( +d)g() =f( +d; ())f(; ) =r f(; ) T d +r f(; ) T | {z } 0 ( () ) + O( 2 ): 149 Thus, by denition of the directional derivative of g, we obtain g 0 (;d) = lim !0 + g( +d)g() =r f(; ) T d: (D.5) Note that this relationship holds for any d. Thus,rg() =r f(; ) for any 2 arg max 2A f(;) = A(). Interestingly, the directional derivative does not depend on the choice of . This means that r f(; 1 ) =r f(; 2 ) for any 1 and 2 in arg max 2A f(;). We nally show that functiong is Lipschitz smooth. 
Let 1 2A( 1 ) and 2 = arg min 2A(2) k 1 k 2 2 A( 2 ), then krg( 1 )rg( 2 )k =kr f( 1 ; 1 )r f( 2 ; 2 )k =kr f( 1 ; 1 )r f( 2 ; 1 ) +r f( 2 ; 1 )r f( 2 ; 2 )k L 11 k 1 2 k +L 12 k 1 2 k L 11 + L 2 12 2 k 1 2 k; where the last inequality holds by Lemma D.1.3. Using Lemma 4.2.3 and Assumption 4.2.4, we can dene g , max 2 krg()k and g max , maxfg ; 1g: (D.6) The next result shows that the inner loop in Algorithm 5 computes an approximate gradient ofg(). In other words,r f( t ; t+1 )rg( t ). Lemma D.1.5. Dene = L22 1 and = 1 1 < 1 and assume g( t )f( t ; 0 ( t ))< , then for any prescribed "2 (0; 1) if we choose K large enough such that KN K ("), 1 log 1= 4 log(1=") + log(2 15 L 6 R 6 =L 2 ) ; (D.7) where L = maxfL 12 ;L 22 ;L;g max ; 1g and R = maxfR; 1g, then the error e t ,r f( t ; K ( t ))rg( t ) has a norm ke t k, L" 2 2 6 R(g max +LR) 2 and kr f( t ; K ( t ))k": (D.8) 150 Proof. First of all, Lemma D.1.4 implies that g( t )f( t ; K ( t )) K : (D.9) Thus, using the QG result of Lemma D.1.2, we know that there exists an 2A( t ) such that k K ( t ) k K=2 s 2 (D.10) Thus, ke t k =kr f( t ; K ( t ))rg()kL 12 k K ( t ) k L 12 K=2 s 2 L" 2 2 6 R(g max +LR) 2 ; (D.11) where the last inequality holds by our choice of K which yields K 2L 2 " 4 2 12 R 2 L 2 (2 L R) 4 2L 2 " 4 2 12 R 2 L 2 ( L + LR) 4 2L 2 " 4 2 12 R 2 L 2 (g max +LR) 4 : Here the second inequality holds since R 1, and the third inequality holds since g max L. To prove the second part, note that kr f( t ; K ( t ))r f( t ; ) | {z } 0 kL 22 k K ( t ) kL 22 K=2 s 2 "; (D.12) where the last inequality holds by our choice of K which yields K " 2 L 2 " 2 L 2 2 15 L 4 R 4 | {z } 1 " 2 L 2 : Here the second inequality holds since "< 1, L; R 1, and L L. The above lemma implies that Algorithm 5 behaves similar to the simple vanilla gradient descent method applied to problem (4.10). Notice that the assumption g( t )f( t ; 0 ( t )) ;8t could be justied by Lemma D.1.3. More speci- 151 cally, by Lemma D.1.3, k t+1 t k L 12 2 k t+1 t k; where t+1 , arg max f( t+1 ;) and t , arg max f( t ;). Hence, the dierence between consecutive optimal solutions computed by the inner loop of the algorithm, are upper bounded by the dierence between corresponding's. Since is a compact set, we can nd an upper bound such thatg( t )f( t ; 0 ( t )) , for all t. We are now ready to show Theorem 4.2.5 D.1.2 Proof of Theorem 4.2.5 Proof. We start by dening g =g( 0 )g ; where g , min g() is the optimal value of g. Note that by the compactness assumption of the set , we have g =g( 0 )g <1. Based on the projection property, we know that t 1 L r f( t ; t+1 ) t+1 ; t+1 0 8 2 : Therefore, by setting = t , we get r f( t ; t+1 ); t+1 t Lk t t+1 k 2 ; which implies r f t ; ( t ) ; t+1 t Lk t t+1 k 2 + r f t ; ( t ) r f t ; t+1 ; t+1 t =Lk t t+1 k 2 +he t ; t t+1 i (D.13) where ( t )2 arg max 2A f( t ;) and e t ,r f t ; t+1 r f t ; ( t ) . By Taylor expansion, we have g( t+1 )g( t ) + r f t ; ( t ) ; t+1 t + L 2 k t+1 t k 2 g( t ) L 2 k t+1 t k 2 +he t ; t t+1 i: (D.14) where the last inequality holds by (D.13). Moreover, by the projection property, we know that r f( t ; t+1 ); t+1 L t t+1 ; t+1 82 ; 152 which implies r f( t ; t+1 ); t r f( t ; t+1 ); t+1 t +L t t+1 ; t+1 (g max + 2LR +ke t k)k t+1 t k 2(g max +LR)k t+1 t k: (D.15) Here the second inequality holds by Cauchy-Schwartz, the denition of e t and our assumption that B R . 
Moreover, the last inequality holds by our choice of K in Lemma D.1.5 which yields ke t k =kr f( t ; K ( t ))rg()k (D.16) L 12 k K ( t ) k L 12 K=2 s 2 1 (D.17) g max : (D.18) Hence, X t 2(g max +LR)k t+1 t k; or equivalently k t+1 t k X t 2(g max +LR) : (D.19) Combined with (D.14), we get g( t+1 )g( t ) L 8 X 2 t g max +LR 2 + 2ke t kR; where the inequality holds by using Cauchy Schwartz and our assumption that is in a ball of radius R. 153 Hence, 1 T P T1 t=0 X 2 t 8 g (g max +LR) 2 LT + 16R(g max +LR) 2 L " 2 2 ; where the last inequality holds by using Lemma D.1.5 and choosing K and T : TN T , 32 g (g max +LR) 2 L" 2 ; KN K ("), 1 log 1= 4 log(1=") + log(2 15 L 6 R 6 =L 2 ) ;: Therefore, using Lemma D.1.5, there exists at least one index b t for which X b t " and kr f( b t ; b t+1 )k": (D.20) This completes the proof of the theorem. D.2 Proofs for results in in Section 4.3 D.2.1 Proof of Lemma 4.3.2 Proof. First notice that the dierentiability of the functiong () follows directly from Danskin's Theorem [20]. It remains to show that g is a Lipschitz smooth function. Let 1 , arg max 2A f ( 1 ;) and 2 , arg max 2A f ( 2 ;): Then by strong convexity off (;), we have f ( 2 ; 2 ) f ( 2 ; 1 ) +hr f ( 2 ; 1 ); 2 1 i 2 k 2 1 k 2 ; and f ( 2 ; 1 ) f ( 2 ; 2 ) +hr f ( 2 ; 2 ); 1 2 i | {z } 0; by optimality of 2 2 k 2 1 k 2 : Adding the two inequalities, we get hr f ( 2 ; 1 ); 2 1 ik 2 1 k 2 : (D.21) 154 Moreover, due to optimality of 1 , we have hr f ( 1 ; 1 ); 2 1 i 0: (D.22) Combining (D.21) and (D.22) we obtain k 2 1 k 2 hr f ( 2 ; 1 )r f ( 1 ; 1 ); 2 1 i L 12 k 1 2 kk 2 1 k; (D.23) where the last inequality holds by Cauchy-Schwartz and the Lipschtizness assumption. We nally show that g is Lipschitz smooth. krg ( 1 )rg ( 2 )k =kr f ( 1 ; 1 )r f ( 2 ; 2 )k =kr f ( 1 ; 1 )r f ( 2 ; 1 ) +r f ( 2 ; 1 )r f ( 2 ; 2 )k L 11 k 1 2 k +L 12 k 1 2 k L 11 + L 2 12 k 1 2 k; where the last inequality holds by (D.23). Our Algorithm 6 solves the inner maximization problem using accelerated projected gradient descent meth- ods. The next lemma is known for accelerated projected gradient descent when applied to strongly convex functions. Lemma D.2.1. Assume h(x) is -strongly convex and L-smooth. Then, applying accelerated projected gradient descent algorithm [16] with step-size 1=L and restart parameter N , p 8L= 1 for K iterations, we get x K such that h(x K )h(x ) 1 2 K=N (h(x 0 )h(x )); (D.24) where x , arg min x2F h(x). 155 Proof. According to [16, Theorem 4.4], we have h(x iN )h(x ) 2L (N + 1) 2 kx (i1)N x k 2 4L (N + 1) 2 h(x (i1)N )h(x ) 1 2 h(x (i1)N )h(x ) ; (D.25) where the second inequality holds by the strong convexity of h and the optimality condition of x , and the last inequality holds by or choice of N. This yields, h(x K )h(x ) ( 1 2 ) K=N h(x 0 )h(x ) ; (D.26) which completes our proof. The next result shows that the inner loop in Algorithm 6 computes an approximate gradient of g (). In other words,r f ( t ; t+1 )rg ( t ). Lemma D.2.2. Dene = L 22 1 and assume g ( t )f ( t ; 0 ( t )) < , then for any prescribed "2 (0; 1) if we choose K large enough such that KN K ("), p 8 log 2 4 log(1=") + log(2 17 L 6 R 6 =L 2 ) ; (D.27) where L, maxfL 12 ;L 22 ;L;g max ; 1g and R = maxfR; 1g, then the error e t ,r f ( t ; K ( t ))rg () has a norm ke t k, L" 2 2 6 R(g max +LR) 2 (D.28) and " 2 Y t;K , max s r f ( t ; K ( t ));s s:t: K ( t ) +s2A;ksk 1 : (D.29) Proof. Starting from Lemma D.2.1, we have that g ( t )f ( t ; K ( t )) 1 2 K p 8 : (D.30) Let ( t ), arg max 2A f ( t ;). 
Then by strong convexity off( t ;), we get 156 2 k K ( t ) ( t )k 2 g ( t )f ( t ; K ( t )) 1 2 K p 8 : (D.31) Combined with the Lipschitz smoothness property of the objective, we obtain ke t k =kr f ( t ; K ( t ))rg ( t )k =kr f ( t ; K ( t ))r f ( t ; ( t ))k L 12 k K ( t ) ( t )k L 12 2 K=2 p 8 r 2 L" 2 2 6 R(g max +LR) 2 (D.32) where the second inequality uses (D.31), and the third inequality uses the choice of K in (D.27) which yields 1 2 K L 2 " 4 2 13 R 2 L 2 (2 L R) 4 2 2 " 4 2 13 R 2 L 2 ( L + LR) 4 L 2 " 4 2 13 R 2 L 2 (g max +LR) 4 : Here the second inequality holds since R 1, and the third inequality holds since g max L. To prove the second argument of the lemma, we also use the Lipschitz smoothness property of the objective to get r f ( t ; K ( t ));s = r f ( t ; K ( t ))r f ( t ; ( t ));s + r f ( t ; ( t ));s kr f ( t ; K ( t ))r f ( t ; ( t ))kksk + r f ( t ; ( t ));s (L 22 +)k ( t ) K ( t ))kksk + r f ( t ; ( t ));s : 2L 22 k ( t ) K ( t ))kksk + r f ( t ; ( t ));s ; (D.33) where the second inequality holds by our Lipschitzness assumption and the last inequality holds by our 157 assumption that L 22 = 1. Moreover, min s r f ( t ; ( t ));s s:t: K ( t ) +s2A;ksk 1 = min r f ( t ; ( t )); K ( t ) s:t: 2A;k K ( t )k 1 = r f ( t ; ( t )); ( t ) K ( t ) max r f ( t ; ( t )); ( t ) s:t: 2A;k K ( t )k 1 | {z } 0 = r f ( t ; ( t )); ( t ) K ( t ) ; (D.34) where the last equality holds since ( t ) is optimal andk ( t ) K ( t )k 1: Combining (D.33) and (D.34), we get min s r f ( t ; K ( t ));s s:t: K ( t ) +s2A;ksk 1 kr f ( t ; ( t ))k + 2L 22 k K ( t ) k: (D.35) Hence, using (4.14), we get Y t;K 2L 22 +g max k K ( t ) k 3 L 2 K=2 p 8 r 2 " 2 ; (D.36) where the second inequality uses (D.31), and the last inequality holds by our choice of K in (D.27) and since "2 (0; 1). The above lemma implies thatkr f ( t ; K ( t ))rg ( t )k , L" 2 64 R 3 L 2 . We now show that our assumption g( t )f( t ; 0 ( t )) for all t in the above Lemma holds. Let t+1 , arg max 2A f ( t+1 ;) and t , arg max 2A f ( t ;): 158 Then by strong convexity off (;), we have f ( t+1 ; t+1 ) f ( t+1 ; t ) +hr f ( t+1 ; t ); t+1 t i 2 k t+1 t k 2 ; and f ( t+1 ; t ) f ( t+1 ; t+1 ) +hr f ( t+1 ; t+1 ); t t+1 i | {z } 0; by optimality of t+1 2 k t+1 t k 2 : Adding the two inequalities, we get hr f ( t+1 ; t ); t+1 t ik t+1 t k 2 : (D.37) Moreover, due to optimality of t , we have hr f ( t ; t ); t+1 t i 0: (D.38) Combining (D.37) and (D.38) we obtain k t+1 t k 2 hr f ( t+1 ; t )r f ( t ; t ); t+1 t i L 12 k t t+1 kk t+1 t k; (D.39) Thus, k t+1 t k L 12 k t+1 t k: Hence, the dierence between consecutive optimal solutions computed by the inner loop of the algorithm, are upper bounded by the dierence between corresponding 's. Since is a compact set, we can nd an upper bound such that g( t )f( t ; 0 ( t )) , for all t. We are now ready to show the main theorem that implies convergence of our proposed algorithm to an "{ rst-order stationary solution of problem (4.4). In particular, we show that usingr f ( t ; K ( t )) instead ofrg ( t ) for a small enough in the Frank-Wolfe or projected descent algorithms applied to g , nds an "{FNE. D.2.2 Proof of Theorem 4.3.3 Proof. 
Frank-Wolfe Steps: We now show the result when Step 7 of Algorithm 6 sets t+1 = t + X t e L b s t : 159 Using descent lemma on g and the denition of e L in Algorithm 2, we have g ( t+1 )g ( t ) + rg ( t ); t+1 t + e L 2 k t+1 t k 2 =g ( t ) + X t e L rg ( t );b s t + X 2 t 2 e L kb s t k 2 g ( t ) + X t e L rg ( t );b s t + X 2 t 2 e L =g ( t ) X t e L r f t ; K ( t ) rg ( t ) | {z } et ;b s t X 2 t 2 e L g ( t ) + X t e L ke t k X 2 t 2 e L (D.40) whereb s t andX t are dened in equations (4.15) and (4.16) of the manuscript, and the second and last in- equalities use the fact thatkb s t k 1. Summing up these inequalities for all values of t leads to 1 T T1 X t=0 X 2 t 2 e L T + 4ke t kg max 2 e L T + " 2 4 " 2 2 ; where the rst inequality holds since X t = r f t ; K ( t ) r f t ; ( t ) +r f t ; ( t ) ;b s t g max +ke t k 2g max : Here the rst inequality holds by (4.14), Cauchy-Schwartz, and the fact thatkb s t k 1. The last inequality holds since by choice of K in Lemma D.2.2 KN K ("), p 8 log 2 4 log(1=") + log(2 17 L 6 R 6 =L 2 ) ; 160 which yieldske t k 1g max and by choosing T such that TN T ("), 8 e L " 2 : Therefore, using Lemma D.2.2, there exists at least one index b t for which X b t " and Y b t;K " 2 : (D.41) Hence, Y( b t ; K ( b t )) = max s hr f( b t ; K ( b t ));si s:t: K ( b t ) + s2A;ksk 1 = max s hr f ( b t ; K ( b t ));si +( K ( b t ) ) T s s:t: K ( b t ) + s2A;ksk 1 Y b t;K +k K ( b t ) k " ; (D.42) where the rst inequality uses Cauchy Shwartz and the fact thatksk 1, and the last inequality holds due to (D.41), the choice of in the theorem and our assumption thatk K ( b t ) k 2R. Projected Gradient Descent: We start by dening g =g ( 0 )g ; where g , min g () is the optimal value of g . Note that by the compactness assumption of the set , we have g =g ( 0 )g <1. We now show the result when Step 7 of Algorithm 6 sets t+1 = proj t 1 L r f ( t ; K (t)) ; 161 Based on the projection property, we know that t 1 L r f( t ; t+1 ) t+1 ; t+1 0 8 2 : Therefore, by setting = t , we get r f( t ; t+1 ); t+1 t Lk t t+1 k 2 ; which implies r f t ; ( t ) ; t+1 t Lk t t+1 k 2 + r f t ; ( t ) r f t ; t+1 ; t+1 t =Lk t t+1 k 2 +he t ; t t+1 i (D.43) where ( t ), arg max 2A f ( t ;) and e t ,r f t ; t+1 r f t ; ( t ) . By Taylor expansion, we have g ( t+1 )g ( t ) + r f t ; ( t ) ; t+1 t + L 2 k t+1 t k 2 g ( t ) L 2 k t+1 t k 2 +he t ; t t+1 i: (D.44) Moreover, by the projection property, we know that r f( t ; t+1 ); t+1 L t t+1 ; t+1 ; which implies r f( t ; t+1 ); t r f( t ; t+1 ); t+1 t +L t t+1 ; t+1 (g max + 2LR +ke t k)k t+1 t k 2(g max +LR)k t+1 t k: (D.45) 162 Here the second inequality holds by Cauchy-Schwartz, the denition of e t and our assumption that B R . Moreover, the last inequality holds by our choice of K in Lemma D.1.5 which yields ke t k =kr f( t ; K ( t ))rg()k (D.46) L 12 k K ( t ) k L 12 K=2 s 2 1 (D.47) g max : (D.48) Hence, X t 2(g max +LR)k t+1 t k; or equivalently k t+1 t k X t 2(g max +LR) : (D.49) Combined with (D.44), we get g ( t+1 )g ( t ) L 8 X 2 t g max +LR 2 + 2ke t kR; where the inequality holds by using Cauchy Schwartz and our assumption that is in a ball of radius R. 
Hence, 1 T P T1 t=0 X 2 t 8 g (g max +LR) 2 LT + 16R(g max +LR) 2 L " 2 2 ; where the last inequality holds by using Lemma D.2.2 and choosing K and T : TN T ("), 32 g (g max +LR) 2 L" 2 ; and KN K ("), p 8 log 2 4 log(1=") + log(2 17 L 6 R 6 =L 2 ) ; Therefore, using Lemma D.2.2, there exists at least one index b t for which X b t " and Y b t;K " 2 : (D.50) Hence, 163 Y( b t ; K ( b t )) = max s hr f( b t ; K ( b t ));si s:t: K ( b t ) + s2A;ksk 1 = max s hr f ( b t ; K ( b t ));si +( K ( b t ) ) T s s:t: K ( b t ) + s2A;ksk 1 Y b t;K +k K ( b t ) k " ; (D.51) where the rst inequality uses Cauchy Shwartz and the fact thatksk 1, and the last inequality holds due to (D.50), the choice of in the theorem and our assumption thatk K ( b t ) k 2R. 164
Abstract (if available)
Abstract
Solving non-convex optimization problems is becoming increasingly important in various science and engineering disciplines. In this dissertation, we develop a general framework for analyzing the landscape of these non-convex objective functions, and we propose algorithms for computing solutions with strong theoretical guarantees.

We first develop a unifying theoretical framework for identifying “nice” non-convex optimization problems. In particular, motivated by the simple structure of the landscape of non-convex problems arising in statistics and machine learning, we develop a methodology for checking whether every local optimum of a given function is globally optimal. Our theoretical framework harnesses the concept of local openness from differential geometry to provide sufficient conditions under which local optima of the objective function are globally optimal. We use our general condition to develop a complete characterization of the local/global optima equivalence of multi-layer linear neural networks. Moreover, we provide sufficient conditions under which no spurious local optima exist for over-parameterized hierarchical non-linear deep neural networks.

Although the equivalence between the sets of local and global optima yields significant insight into the underlying loss surface, finding a local optimum may be NP-hard in general. Given this hardness result, recent focus has shifted to computing first- or second-order stationary points of the objective function. While almost all existing results study the problem in the absence of constraints, we consider the problem of finding an approximate second-order stationary point for constrained non-convex optimization problems. We show that, unlike in the unconstrained scenario, the vanilla projected gradient descent algorithm may converge to a strict saddle point even in the presence of a single linear constraint. We then provide a hardness result by showing that checking approximate second-order stationarity is NP-hard even for linearly constrained problems. Despite this hardness result, we identify instances of the problem for which checking second-order stationarity can be done efficiently. For such instances, we propose a dynamic second-order Frank-Wolfe algorithm that converges to $(\epsilon_g, \epsilon_H)$-second-order stationary points in ${\cal O}(\max\{\epsilon_g^{-2}, \epsilon_H^{-3}\})$ iterations. The proposed algorithm can be used in general constrained non-convex optimization as long as the constrained quadratic subproblems can be solved efficiently. Since the iteration complexity of our Frank-Wolfe method does not match the lower bound, we restrict our focus to linearly constrained problems with a constant number of constraints, for which we propose an algorithm that matches the lower bound. More specifically, we propose a trust region algorithm that finds an $(\epsilon_g, \epsilon_H)$-second-order stationary point for linearly constrained problems in $\tilde{\cal O}(\max\{\epsilon_g^{-3/2}, \epsilon_H^{-3}\})$ iterations; this iteration complexity matches the lower bound and hence is order optimal.

Finally, due to practical concerns about the reliability of non-convex models arising in critical applications, we study the problem of finding robust solutions for these non-convex problems. More specifically, we study the min-max optimization problem in the non-convex-concave regime. Using a simple smoothing technique, we propose an alternative multi-step gradient descent-ascent algorithm that finds an $\epsilon$-first-order stationary solution in ${\cal O}(\epsilon^{-3.5})$ gradient evaluations. Moreover, we extend our algorithm to the case where the objective of the inner optimization problem satisfies the Polyak-Łojasiewicz (PL) condition. Under this condition, we show that the worst-case complexity of our algorithm is $\tilde{\cal O}(\epsilon^{-2})$. Lastly, we evaluate the performance of our algorithm on a fair statistical learning problem.
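To make the multi-step scheme described above concrete, here is a minimal sketch of a multi-step gradient descent-ascent loop: K projected ascent steps on the inner variable followed by one descent step on the outer variable per iteration. The callables grad_theta and grad_alpha, the Euclidean-ball constraint sets, the projection helper, and the step sizes are illustrative assumptions, not the dissertation's exact algorithm or parameter choices.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto the ball of the given radius."""
    n = np.linalg.norm(x)
    return x if n <= radius else (radius / n) * x

def multistep_gda(grad_theta, grad_alpha, theta, alpha,
                  T=1000, K=10, eta_theta=1e-2, eta_alpha=1e-1):
    """Multi-step gradient descent-ascent: per outer iteration, take K
    projected ascent steps on alpha, then one descent step on theta."""
    for _ in range(T):
        for _ in range(K):  # approximately solve the inner maximization
            alpha = project_ball(alpha + eta_alpha * grad_alpha(theta, alpha))
        theta = project_ball(theta - eta_theta * grad_theta(theta, alpha))
    return theta, alpha
```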
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Difference-of-convex learning: optimization with non-convex sparsity functions
Algorithms and landscape analysis for generative and adversarial learning
Topics in algorithms for new classes of non-cooperative games
Mixed-integer nonlinear programming with binary variables
Performance trade-offs of accelerated first-order optimization algorithms
Provable reinforcement learning for constrained and multi-agent control systems
Differentially private and fair optimization for machine learning: tight error bounds and efficient algorithms
Modern nonconvex optimization: parametric extensions and stochastic programming
Online learning algorithms for network optimization with unknown variables
Stochastic games with expected-value constraints
On the interplay between stochastic programming, non-parametric statistics, and nonconvex optimization
New Lagrangian methods for constrained convex programs and their applications
The fusion of predictive and prescriptive analytics via stochastic programming
Train scheduling and routing under dynamic headway control
Robustness of gradient methods for data-driven decision making
Scalable optimization for trustworthy AI: robust and fair machine learning
Elements of robustness and optimal control for infrastructure networks
The next generation of power-system operations: modeling and optimization innovations to mitigate renewable uncertainty
Algorithmic aspects of energy efficient transmission in multihop cooperative wireless networks
Optimizing task assignment for collaborative computing over heterogeneous network devices
Asset Metadata
Creator
Nouiehed, Maher (author)
Core Title
Landscape analysis and algorithms for large scale non-convex optimization
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Industrial and Systems Engineering
Publication Date
07/12/2019
Defense Date
05/23/2019
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
deep learning, fairness in machine learning, Frank-Wolfe algorithm, local openness, local/global, min-max games, OAI-PMH Harvest, saddle point optimization problems, second-order stationary, trust region algorithm
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Razaviyayn, Meisam (committee chair), Jain, Rahul (committee member), Jovanovic, Mihailo (committee member), Pang, Jong-Shi (committee member)
Creator Email
maher.nouiehed@gmail.com, nouiehed@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-183773
Unique identifier
UC11660527
Identifier
etd-NouiehedMa-7547.pdf (filename), usctheses-c89-183773 (legacy record id)
Legacy Identifier
etd-NouiehedMa-7547.pdf
Dmrecord
183773
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Nouiehed, Maher
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA