Algorithms and Landscape Analysis for Generative and Adversarial Learning

by

Babak Barazandeh

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(INDUSTRIAL AND SYSTEMS ENGINEERING)

May 2021

Copyright 2021 Babak Barazandeh

Dedication

Everything in my life is dedicated to my parents, Mohammad and Zahra, including this thesis.

Acknowledgements

I would like to thank my family, especially my parents, Mohammad and Zahra, and my older brother and sisters, Ebrahim, Khadijeh and Solmaz, who always supported me during my studies. My beautiful nieces, Hannaneh, Yasna and Reyhaneh, are the main motivation for my work, and I am sure that when they are old enough to read this thesis, they will understand how hard it was for me not to be with them on their birthdays to celebrate with them. It is really hard for me to find words to express my love and thanks to my uncle, Ali, who sets the bar and standard for me; I always look up to him and try to solve any difficulty in my life the way he would solve it. I am also very thankful for the support of my advisor, Dr. Meisam Razaviyayn, who made everything possible for me during my studies. He was always there for me, and it was great discussing the details of problems with him, sometimes for hours. I would also like to thank my committee members, Professors Jong-Shi Pang, Aiichiro Nakano, Shaddin Dughmi, Qiang Huang and Sze-chuan Suen. It was my greatest honor to learn from Professor Pang the true values and morals that a researcher has to focus on in addition to the scientific contributions. I am also thankful to Professor George Michailidis for giving me the chance to join the University of Florida as a visiting scholar. Of course, this thesis would not be complete without the support and help from all my colleagues and friends at the University of Southern California, University of Florida, Virginia Tech and Sharif University of Technology. I would like to thank all of them, especially Mohammadhussein Rafieisakhaei for being a great friend who is always there for me. I would like to finish this section by thanking Professor Scotland C. Leman from Virginia Tech who, without a doubt, has played a major role in my career.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Algorithm and Landscape Analysis for Maximum Likelihood Estimation in Mixture Models
  2.1 Maximum Likelihood Estimation (MLE) for Gaussian/Laplacian Mixture Models
    2.1.1 Problem Formulation
    2.1.2 EM for the Case of K = 2
    2.1.3 Modified EM for the Case of K >= 3
  2.2 MLE for Mixed Linear Regression
    2.2.1 Problem Setup
    2.2.2 Review of the Benchmark Algorithms
    2.2.3 Expectation-Maximization Algorithm
    2.2.4 Alternating Direction Method of Multipliers (ADMM)
    2.2.5 ADMM for Maximum Likelihood Estimation of MLR Parameters
    2.2.6 Numerical Experiments
Chapter 3: Algorithms for Generative Models and Adversarial Learning
  3.1 Adversarial Learning and Solving Nonconvex-Concave Nondifferentiable Min-Max Games
    3.1.1 Problem Formulation
    3.1.2 Nonconvex-Strongly-Concave Games
    3.1.3 Nonconvex-Concave Games
    3.1.4 Numerical Experiments
  3.2 Nonconvex Stochastic Games under Minty VI Condition
    3.2.1 Problem Formulation
    3.2.2 ADAM: from Minimization to Min-Max Problems
    3.2.3 Convergence Analysis
    3.2.4 Numerical Studies
  3.3 Min-Max Problems for Small Adversary Region
    3.3.1 Problem Formulation
    3.3.2 Proposed Algorithm
    3.3.3 Convergence Analysis
    3.3.4 Numerical Experiments
  3.4 Do We Always Need to Solve Min-Max Games for Training GANs?
    3.4.1 Training Generator via Random Discriminator
    3.4.2 Problem Formulation
    3.4.3 Optimal Transport Distance
    3.4.4 Training in Different Feature Domains
    3.4.5 Numerical Experiments
Chapter 4: Conclusion and Future Works
  4.1 Landscape Analysis for Maximum Likelihood Estimation
    4.1.1 Extensions to the General Maximum Likelihood Estimation Problem
    4.1.2 Landscape Analysis and Algorithm Developments for the Mixed Linear Regression Problem
  4.2 Adversarial Learning
    4.2.1 Adaptive Min-Max Optimizer
    4.2.2 Zeroth-order Methods for Min-Max Problems
    4.2.3 Solving Min-Max Saddle Point Games for Small Region Problems using Quadratic Approximation
    4.2.4 Solving Min-Max Saddle Point Games using Gradient Unrolling
    4.2.5 Decentralized Algorithms for Solving Min-Max Games
References
Appendix
  A.1 Proofs of Chapter 2
    A.1.1 Proof of Theorem 1
    A.1.2 Proof of Theorem 2
  A.2 Proofs of Chapter 3
    A.2.1 Discussions on Remark 1
    A.2.2 Proof of Lemma 3
    A.2.3 Proof of Theorem 4
    A.2.4 Proof of Theorem 5
    A.2.5 Proof of Theorem 6
    A.2.6 Proof of Lemma 4
    A.2.7 Proof of Theorem 7
    A.2.8 Proof of Lemma 5

List of Tables

3.1 Average computational time of different algorithms for adversarial attack against the LASSO estimator
3.2 Model architecture for the MNIST dataset
3.3 Classification accuracy after adversarial attack against a neural network. The proposed algorithm trains a neural network that is more robust against attacks compared to benchmark methods
4.1 Conjectures/known results for the 2-component Gaussian/Laplacian mixture model. These conjectures have been obtained by extensive numerical experiments. As can be seen in this table, expanding the search space could lead to computational advantage

List of Figures

1.1 A discriminative model (left) versus a generative model (right)
1.2 Generating an image explained in the text [7] (left); generative art works [8] (right)
1.3 Using a single Gaussian to model data (left); mixture of Gaussians (right)
1.4 Landscape of the maximum likelihood estimation problem for finding the parameters of a 3-component single-dimensional Gaussian mixture model
1.5 Optimization landscape of the maximum likelihood estimation problem for a 2-component single-dimensional equally weighted Laplacian and Gaussian distribution in the left and right panels, respectively. In both cases, all local optima are globally optimal
1.6 Likelihood landscape for a 2-component single-dimensional unequally weighted distribution; (left): Laplacian distribution, (right): Gaussian distribution. The landscape has spurious local optima
1.7 Structure of Generative Adversarial Networks (GANs) [6]
1.8 Structure of the adversarial approach for training fair machine learning algorithms; the classifier aims to minimize the loss value while the adversary aims to maximize it [20, 21]
1.9 Solving the min-max optimization problem $\min_\theta \max_\alpha \theta\alpha$. The optimal solution is the point (0,0). However, none of the popular algorithms such as simultaneous gradient descent-ascent (left) or alternating gradient descent-ascent (right) converge to the optimal solution. See Example 1 in Section 3.1 for more details
1.10 The simultaneous ADAM algorithm does not converge to the optimal point. The normalized error is always above 0.9965; see Numerical Experiment 3.2.4 for more details
2.1 Naïve EM fails to recover the ground truth parameter
2.2 Modified EM based on the regularized MLE (2.5)
2.3 Multi-objective function
2.4 Performance of stochastic multi-objective EM
2.5 An example dataset for the MLR problem with 100 data points
2.6 Average recovery error for (a) Gaussian and (b) Laplacian cases. Blue numbers represent the performance of the proposed method and the numbers in parentheses are the standard deviations
2.7 Difference in computation time over the 1950 experimental data points, computed as (computation time of the EM algorithm - computation time of Algorithm 4), for (a) the Gaussian and (b) the Laplacian case. EM is very slow in the Laplacian case
3.1 Simultaneous gradient descent-ascent (left) and alternating gradient descent-ascent (right); neither algorithm converges to the only Nash equilibrium point, (0,0)
3.2 (left): Convergence behavior of different algorithms in terms of the objective value, where the objective value at iteration t is defined as $g(A_t) \triangleq \min_x \|A_t x - b\|_2^2 + \xi \|x\|_1$; (right): convergence behavior of different algorithms in terms of the stationarity measures $\mathcal{X}(A_t, x_{t+1})$ and $\mathcal{Y}(A_t, x_{t+1})$ (logarithmic scale). The algorithms used in the comparison are: Proposed Algorithm (PA), Subgradient Descent-Ascent (SDA), and Proximal Descent-Ascent (PDA)
3.3 Left/right y-axis: error rate $e_t$ / average norm of gradient $R_k$. S-Adam misses the unique FNE point
3.4 Inception score for generated CIFAR-10 images using Adam and OAdagrad
3.5 Generating hand-written digits using the MNIST dataset
3.6 Generating fashion products using the Fashion-MNIST dataset
1 Relation between different notions of stationarity
2 (left): $\gamma = \pi$, the two measures have the largest deviation; (right): $\gamma = \pi/2$, the two measures coincide

Abstract

Generative models are among the most popular statistical tools; they are used not only to learn the underlying distribution of data and generate more samples from it, but also to help in classifying normal data from anomalies. Despite their popularity, from an optimization perspective they result in training problems that are challenging to solve. This is mainly due to the nonconvexity and nonsmoothness of the landscape of these problems. In this dissertation, we study the landscape of, and design algorithms for, learning via Mixture Models and Generative Adversarial Networks (GANs), two of the most popular generative models. Regarding mixture models, we show that when the number of components is two and the components are equally weighted, despite nonconvexity, all local optima of the maximum likelihood estimation problem are globally optimal. In addition, for the general case of mixture models, we propose a multi-objective algorithm which has a higher chance of escaping spurious local optima compared to the benchmark methods. Next, we study the training problem of GANs, which requires solving min-max saddle point games. We first study a special class of deterministic nonconvex-concave min-max games, define a notion of stationarity for these problems (motivated by the necessary first-order optimality condition), and propose an algorithm that is capable of finding these stationary points. We then study the stochastic version of min-max saddle point games that satisfy the Minty variational inequality condition, propose an algorithm, and obtain its convergence rate for finding stochastic first-order Nash equilibria.
We generalize our algorithms and analysis to nonconvex-nonconcave games in which the feasible set for one of the players is small. Our analysis shows that under this assumption, solving a linear approximation of the objective function can find a stationary point of the original problem. Finally, due to the difficulty of solving the general class of nonconvex-nonconcave games that arise from the GAN problem, we revisit its formulation and propose a new generative model that does not require solving min-max games and instead uses a random discriminator. Our new model results in a more stable optimization mechanism.

Chapter 1

Introduction

Generative models are commonly used in machine learning to learn the underlying distribution of data and generate more data instances. In contrast to discriminative models, which classify data instances by learning conditional distributions, generative models learn the joint distribution of the independent and target variables. To be more specific, consider a pair of random variables $(X, Y)$, with $X$ being the independent variable and $Y$ being the corresponding label/target variable. Generative models learn the joint distribution of the data, $p(X = x, Y = y)$, while discriminative models capture the conditional distribution, $p(Y = y \mid X = x)$, which represents the probability of assigning a label/target variable $y$ given the input instance $x$. Figure 1.1 shows the difference between these two models using data of handwritten digits. In this figure, the data consists of handwritten digits 0 and 1, leading to labels $y = 0$ and $y = 1$, respectively. The left figure is a discriminative model that only aims at finding the class that the data belongs to via learning the conditional distribution $p(Y = y \mid X = x)$. On the other hand, the right figure is a generative model that models how data points are distributed throughout the data space via learning the joint distribution $p(X = x, Y = y)$.

Figure 1.1: A discriminative model (left) versus a generative model (right)

Generative models have a wide range of applications, such as detecting outliers in data [1], translating text to an image, or generating artworks (Figure 1.2).

Figure 1.2: Generating an image explained in the text [7] (left); generative art works [8] (right)

There are different types of generative models, such as Mixture Models [2, 3], Hidden Markov Models [4], Variational Autoencoders [5], and Generative Adversarial Networks (GANs) [6]. In modern applications, training these models requires solving large-scale (and in most cases nonconvex) optimization problems. The focus of my research is to better understand the algorithms and optimization landscape for these models. In particular, this thesis describes my results and future work on designing algorithms and analyzing the landscape of optimization problems appearing in modeling via Mixture Models and Generative Adversarial Networks. In the remainder of this chapter, we briefly discuss these two problems.

Mixture Models: A general mixture model distribution is defined as

$$P(x; w, K, \boldsymbol{\theta}) = \sum_{k=1}^{K} w_k f(x; \boldsymbol{\theta}_k),$$

where $K$ is the number of mixture components, $w = (w_1, w_2, \ldots, w_K)$ is the non-negative mixing weight with $\sum_{k=1}^{K} w_k = 1$, $f(\cdot)$ is a probability density function, and $\boldsymbol{\theta} = (\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \ldots, \boldsymbol{\theta}_K)$ is the distribution's parameter vector. Fitting finite mixture models to data is one of the most traditional approaches for learning the underlying data distribution.
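As a concrete illustration of this definition, here is a minimal Python sketch that evaluates a one-dimensional Gaussian mixture density; all component parameters below are hypothetical choices, not values from this thesis.

```python
# A minimal sketch: evaluate the mixture density P(x) = sum_k w_k f(x; theta_k)
# for a one-dimensional Gaussian mixture. All parameter values are hypothetical.
import numpy as np
from scipy.stats import norm

def mixture_pdf(x, weights, means, scales):
    """Density of sum_k w_k * N(x; mu_k, s_k^2), evaluated pointwise."""
    x = np.asarray(x, dtype=float)
    return sum(w * norm.pdf(x, loc=m, scale=s)
               for w, m, s in zip(weights, means, scales))

weights = [0.5, 0.5]                       # non-negative, sums to one
means, scales = [-2.0, 2.0], [1.0, 1.0]    # theta_k = (mu_k, sigma_k)
print(mixture_pdf(np.linspace(-4.0, 4.0, 5), weights, means, scales))
```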
This popularity is mainly due to the ability of mixture models to capture the existence of sub-populations within an overall population [9], and to their capability of approximating any distribution, provided that the number of mixture components is sufficiently large [10–14]. Figure 1.3 illustrates this ability using a simple bimodal dataset. The left figure uses a single Gaussian distribution to model the data, while the right figure is the result of fitting a (Gaussian) mixture model. As seen in the figure, fitting a simple Gaussian distribution fails drastically to explain the data, while the mixture model can capture the existence of the underlying sub-populations in the overall population.

Figure 1.3: Using a single Gaussian to model data (left); mixture of Gaussians (right)

From the optimization perspective, training mixture models often requires solving challenging nonconvex optimization problems. For example, consider the problem of fitting a 3-component Gaussian mixture model to a single-dimensional dataset. Figure 1.4 shows the landscape of the optimization problem resulting from maximum likelihood estimation. This landscape has multiple spurious local optima. As a result, using first-order methods (or any other local-search optimization procedure) for solving this problem fails to recover the ground-truth distribution.

Figure 1.4: Landscape of the maximum likelihood estimation problem for finding the parameters of a 3-component single-dimensional Gaussian mixture model.

In Chapter 2 of this dissertation, we study the landscape of maximum likelihood estimation (MLE) for two popular types of mixture models: Gaussian and Laplacian. To be more specific, we start by analyzing Laplacian/Gaussian mixture models with two equally weighted components. We show that in this scenario, the maximum likelihood estimation of the parameters of the model results in an optimization problem that, despite its nonconvexity, does not have spurious local optima. In other words, all local optima of the MLE problem are globally optimal. Figure 1.5 shows examples of such landscapes for equally weighted Laplacian (left) and Gaussian (right) distributions. As seen in this figure, all local optima are globally optimal. However, when the mixture model is unequally weighted or the number of components is more than two, estimating mixture models can result in optimization problems that have local optima which are arbitrarily far in value from the global optimum [2]*. Figure 1.6 shows landscape instances for such scenarios. In this experiment, our mixture models have two unequally weighted components, and this leads to an optimization problem with spurious local optima. For such scenarios, we propose an algorithm that has a higher chance of escaping spurious local optima.

* The existence of spurious local optima of the MLE of a k-component Gaussian mixture model (with k >= 3) has also been observed in [15].

Figure 1.5: Optimization landscape of the maximum likelihood estimation problem for a 2-component single-dimensional equally weighted Laplacian and Gaussian distribution in the left and right panels, respectively. In both cases, all local optima are globally optimal.

Practically, mixture models suffer from two deficiencies. First, although mixture models can approximate any distribution with a sufficiently large number of components, it is easy to construct distributions for which this number grows exponentially with the dimension of the data [9, 16]. Second, fitting mixture models might result in a computationally heavy optimization procedure [10].
Figure 1.6: Likelihood landscape for a 2-component single-dimensional unequally weighted distribution; (left): Laplacian distribution, (right): Gaussian distribution. The landscape has spurious local optima.

For example, fitting a Gaussian mixture model to 64x64 color images requires the inversion of a covariance matrix with $(64 \times 64 \times 3)^2 \times 0.5 \approx 7.5 \times 10^7$ distinct entries, which is computationally cumbersome [16]. Recently, Generative Adversarial Networks (GANs) [6] were proposed to overcome these shortcomings. Due to their practical success, GANs have drawn a lot of attention for learning the underlying distribution of data. GANs consist of two neural networks: a Generator and a Discriminator. The Generator receives a random input, such as a Gaussian random vector, and generates samples that are similar to the real data samples. The Discriminator, on the other hand, is responsible for distinguishing the fake data generated by the Generator from the real data. Over the training process, both the Generator and the Discriminator improve their performance: the Generator generates more realistic samples, and the Discriminator becomes more capable of discriminating between the real data and the fake data generated by the Generator. During each iteration of the training process, a batch of real data and fake data (generated by passing a random vector through the Generator) is fed into the Discriminator. The output of the Discriminator is passed through a loss function that measures the dissimilarity of these two batches of data. Then, the Discriminator maximizes this loss function over its own parameters (the weights of its deep neural network), while the Generator minimizes this loss function over its own parameters. Figure 1.7 shows the details of the GAN training framework.

Figure 1.7: Structure of Generative Adversarial Networks (GANs) [6]

Despite the tremendous recent success of GANs, the training procedure for GANs is notoriously difficult. This difficulty is due to the fact that training GANs requires solving nonconvex min-max saddle point games between the Generator and the Discriminator. From a game-theoretic point of view, the objective of GANs is sometimes viewed as finding a Nash equilibrium [17], in which no player can do better by unilaterally changing its strategy. Unfortunately, finding such Nash equilibria is hard in general [18]. Moreover, such Nash equilibria might not even exist for a general nonconvex game [19]. In Chapter 3, we study min-max optimization problems defined as

$$\min_{\boldsymbol{\theta} \in \Theta} \ \max_{\boldsymbol{\alpha} \in \mathcal{A}} \ f(\boldsymbol{\theta}, \boldsymbol{\alpha}), \qquad (1.1)$$

where the feasible sets $\Theta$ and $\mathcal{A}$ are assumed to be closed and convex. Problems of this form, also called min-max saddle point games, appear in a wide range of applications. As an example, as mentioned above, the training procedure of GANs requires solving min-max saddle point games. Besides GANs, recent studies have shown that machine learning algorithms might be unfair to minority groups, i.e., their outcomes can depend on given protected attributes such as gender, ethnicity and disability. To overcome this issue and make machine learning algorithms fair, recent studies have developed new formulations that require solving problems of the form (1.1) [20–22]. Figure 1.8 represents the structure of these algorithms; the value $y$ ($\hat{y}$) represents the true (predicted) data label, while $p$ ($\hat{p}$) is the true (predicted) protected attribute value.

Figure 1.8: Structure of the adversarial approach for training fair machine learning algorithms; the classifier aims to minimize the loss value while the adversary aims to maximize it [20, 21].
These new formulations of fairness can also be used in other similar problems, such as fair sensor scheduling, where the goal is to design a fair schedule for sensors that monitor processes over time [23]. There are also similar attempts in resource allocation for wireless communication systems that aim to maximize the minimum rate among all users [24–26]. Solving problems of the form (1.1) is a challenging task, and most benchmark studies are limited to the special case of convex-concave problems, in which the objective function is convex (concave) with respect to the parameters of the minimization (maximization) problem when the other parameters are fixed. The works [27–34] study the problem in this regime and propose efficient algorithms for solving it. However, solving min-max saddle point games in general is a difficult task. This difficulty mainly arises from the fact that intuitive modifications of classical optimization tools fail even on simple min-max problems; see Figure 1.9.

Figure 1.9: Solving the min-max optimization problem $\min_\theta \max_\alpha \theta\alpha$. The optimal solution is the point (0,0). However, none of the popular algorithms, such as simultaneous gradient descent-ascent (left) or alternating gradient descent-ascent (right), converge to the optimal solution. See Example 1 in Section 3.1 for more details.
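This failure is easy to reproduce. The following minimal sketch, with a hypothetical initialization and step size, runs simultaneous gradient descent-ascent on $f(\theta, \alpha) = \theta\alpha$; each step multiplies the distance to the solution $(0, 0)$ by $\sqrt{1 + \eta^2}$, so the iterates spiral outward.

```python
# A minimal sketch of the failure in Figure 1.9: simultaneous gradient
# descent-ascent on f(theta, alpha) = theta * alpha. The unique solution is
# (0, 0), yet every step multiplies the distance to it by sqrt(1 + eta^2).
theta, alpha, eta = 0.5, 0.5, 0.1   # hypothetical initialization and step size
for t in range(1, 101):
    grad_theta, grad_alpha = alpha, theta   # partial derivatives of theta * alpha
    theta, alpha = theta - eta * grad_theta, alpha + eta * grad_alpha
    if t % 25 == 0:
        print(t, round((theta ** 2 + alpha ** 2) ** 0.5, 4))  # distance grows
```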
In Section 3.1, we study a new class of min-max saddle point games. In this class, we assume that the objective function has a nonconvex-concave structure and is (potentially) non-differentiable. In this regime, unlike the convex-concave case, the solution (a global Nash equilibrium) may not even exist. As a result, we start by defining a new notion of (approximate) first-order Nash equilibrium that always exists under mild assumptions. Then, we show that our proposed notion is stronger than the commonly used notion in the literature, which is based on the norm of the proximal gradient. Additionally, we propose a simple iterative algorithm that converges to the proposed notion of first-order Nash equilibrium. We then evaluate the performance of the proposed algorithm on an adversarial attack against a LASSO estimator, where it outperforms benchmark methods. While our algorithm is a strong first step toward understanding such nonconvex min-max games, it requires the assumption that one of the players' objective functions is easy to optimize. Besides, the proposed algorithm requires prior knowledge about the step-size, and acquiring this knowledge is in general time-consuming. In addition, many applications call for stochastic solvers for such nonconvex min-max optimization problems. Addressing these issues is the next step of this dissertation.

In Section 3.2, we study stochastic nonconvex-nonconcave min-max problems. There is a wide range of studies on solving classical minimization problems using adaptive algorithms; as an example, [35, 36] proposed algorithms for the convex case, and [37–40] developed new algorithms for nonconvex problems. However, similar to the previous part, intuitive generalizations of these methods to min-max saddle point games can fail drastically in practice; see Figure 1.10.

Figure 1.10: The simultaneous ADAM algorithm does not converge to the optimal point. The normalized error is always above 0.9965; see Numerical Experiment 3.2.4 for more details.

In Section 3.2, we propose an algorithm for a class of stochastic min-max problems that satisfy the Minty variational inequality condition [41, 42], which is widely used in recent min-max optimization studies because of its practicality [29, 41–45]. We theoretically analyze the proposed algorithm and obtain its convergence rate. In addition, we deploy the proposed algorithm for training GANs, where it outperforms benchmark methods. In summary, in the first two sections of Chapter 3, we study special classes of nonconvex-concave games and of stochastic nonconvex-nonconcave min-max saddle point problems that satisfy the Minty variational inequality condition. Despite the broad class of problems that these studies cover, they are not suited for the general class of nonconvex-nonconcave problems. Solving a general nonconvex-nonconcave problem without any further assumptions is a challenging task [46, 47]. However, in some problems we can further assume that the inner maximization problem has some special structure in its feasible set. Problems in this regime are the topic of the next section of Chapter 3.

In Section 3.3, we aim to solve a general class of nonconvex-nonconcave min-max problems under the assumption that the inner maximization problem is optimized over a small region, i.e., its radius is small (of the same order as the desired optimality accuracy). This problem widely appears in the training of neural networks that are robust against adversarial attacks [47]. It is known that general nonconvex-nonconcave min-max problems are difficult to solve. This study is the first work that solves this class of optimization problems under the "small region" assumption and shows the computational tractability of finding a first-order stationary point under this assumption. The proposed algorithm for solving this problem is based on optimizing a linear Taylor approximation of the objective function instead of the original function. The fact that the inner problem is solved over a small region helps us relate the obtained solution of the linear approximation problem to the solution of the original problem. In addition to the theoretical analysis, we evaluate the performance of the proposed algorithm in training neural networks that are robust to adversarial attacks.

In Section 3.4, we revisit the training process of GANs. This problem requires solving a general nonconvex-nonconcave min-max saddle point problem without any further assumptions, and this generality makes the training process notoriously difficult. As a result, we propose a new formulation for GANs that is based on the idea of using a random discriminator. In this approach, the discriminator need not be trained; in each iteration of the training process, it is simply selected at random from a set of possible discriminators. In other words, in each iteration we sample a batch of real and fake data. This data is fed into a randomly generated discriminator. The output of the discriminator is then used to measure the dissimilarity of the fake and real data, and the generator optimizes over its own parameters to reduce this dissimilarity. In contrast to the standard GAN formulation, this model is easy to solve. In addition, since there is no need to optimize the discriminator, it results in a faster and computationally more efficient training process.

Finally, we discuss future directions and open problems related to this dissertation in Chapter 4.
Chapter 2

Algorithm and Landscape Analysis for Maximum Likelihood Estimation in Mixture Models

Maximum likelihood estimation (MLE) is one of the most commonly used approaches for estimating the parameters of statistical models. This is mainly due to the fact that MLE is robust to inconsistency between the model and the data [48]. However, from an optimization perspective, solving MLE problems is in general challenging, mainly due to their nonconvex and combinatorial structure [49]. The Expectation-Maximization (EM) algorithm [3] is one of the most popular iterative algorithms for solving the resulting MLE problem. Each iteration of this algorithm consists of two main steps: in the first step (expectation step), the algorithm finds a tight lower bound for the MLE objective function at the current estimate, and in the second step (maximization step), it optimizes this bound instead of the original MLE objective. The main reason for the popularity of the EM algorithm is the second step, which, under some conditions, results in an objective function that is easy to solve or might even have a closed-form solution.

In this chapter, we revisit the landscape of the MLE problem for mixture models, which can be used to capture the existence of sub-populations within an overall population and also to model non-linear measurements using simple linear structures. These capabilities make them applicable to different engineering disciplines such as health care [50], trajectory clustering [51], medical image denoising [52] and video retrieval [53–55].

In the first section, we study the Gaussian/Laplacian mixture model. We show that when the number of components is two and the components are equally weighted, the MLE has no spurious local optima, i.e., all local optima are globally optimal, and the EM algorithm converges to the global optimum of the problem. However, when the number of components is larger than two, the problem might suffer from spurious local optima whose objective value can be arbitrarily far from the global one. This causes the EM algorithm to fail drastically. To overcome this issue, we propose an algorithm that has a higher chance of escaping these spurious local optima. The proposed algorithm optimizes a modified version of the MLE objective function that exploits first-moment information of the data.

In the second section, we study MLE for the mixed linear regression problem. We show that when the data is contaminated by a non-Gaussian additive noise, the sub-problem arising in the maximization step (second step) of the EM algorithm does not have a closed-form solution and, despite its convexity, is challenging to optimize in each iteration, mainly due to its non-differentiability. This increases the cost of applying EM to problems of large dimension. To overcome this issue, we reformulate the MLE for the mixed linear regression problem in a way that makes it amenable to a variant of the Alternating Direction Method of Multipliers (ADMM). The proposed algorithm has a closed-form solution for each of its sub-problems in every iteration. As a result, it is computationally efficient for large-scale problems.

2.1 Maximum Likelihood Estimation (MLE) for Gaussian/Laplacian Mixture Models

The ability of finite mixture distributions [9] to model the presence of sub-populations within an overall population has made them popular across almost all engineering and scientific disciplines [12, 13, 56, 57].
While statistical identifiability for various mixture models has been widely studied [58, 59], the Gaussian mixture model (GMM) has drawn more attention due to its wide applicability [60, 61]. Starting with Dasgupta [62], there have been multiple efforts to find algorithms with polynomial sample/time complexity for estimating GMM parameters [10, 63–68]. Despite their statistical guarantees, these methods are not computationally efficient enough for many large-scale problems. Moreover, these results assume that the data is generated from an exact generative model, which never happens in reality. In contrast, methods based on solving the maximum likelihood estimation (MLE) problem are very popular due to their computational efficiency and the robustness of MLE against perturbations of the generative model [48]. Although MLE-based methods are popular in practice, the theory behind their optimization algorithms (such as the EM method) is little understood. Most existing algorithms with theoretical performance guarantees are not scalable to modern applications of massive size. This is mainly due to the combinatorial and nonconvex nature of the underlying optimization problems.

Recent advances in the field of nonconvex optimization have led to a better understanding of mixture model inference algorithms such as the EM algorithm. For example, [69] proves that under proper initialization, the EM algorithm converges exponentially fast to the ground truth parameters; however, no computationally efficient initialization approach is provided. [70] globally analyzes the EM algorithm applied to the mixture of two equally weighted Gaussian distributions. While [71] provides global convergence guarantees for the EM algorithm, [15] studies the landscape of the GMM likelihood function with 3 or more components and shows that there might be some spurious local optima even in the simple case of an equally weighted GMM.

In this chapter, we revisit the EM algorithm under Laplacian and Gaussian mixture models. We first show that, similar to the Gaussian case, the maximum likelihood estimation objective has no spurious local optima in the symmetric Laplacian mixture model (LMM) with K = 2 components. This Laplacian mixture structure has a wide range of applications in medical image denoising, video retrieval and blind source separation [53–55, 72, 73]. For the case of a mixture model with $K \ge 3$ components, we propose a stochastic algorithm which utilizes the likelihood function as well as moment information of the mixture model distribution. Our numerical experiments show that our algorithm outperforms the naïve EM algorithm in almost all scenarios.

2.1.1 Problem Formulation

The general mixture model distribution is defined as

$$P(x; w, K, \boldsymbol{\theta}) = \sum_{k=1}^{K} w_k f(x; \boldsymbol{\theta}_k),$$

where $K$ is the number of mixture components, $w = (w_1, w_2, \ldots, w_K)$ is the non-negative mixing weight with $\sum_{k=1}^{K} w_k = 1$, and $\boldsymbol{\theta} = (\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \ldots, \boldsymbol{\theta}_K)$ is the distribution's parameter vector. Estimating the parameters of the mixture model $(w, \boldsymbol{\theta}, K)$ is central in many applications. This estimation is typically done by solving the MLE problem due to its intuitive justification and its robust behavior [48]. As an example, the result of [70] shows that despite the nonconvex nature of the inference problem for the equally weighted Gaussian mixture model, all of its local optima are global; this illustrates the difficulty of the problem even in a simple case. The focus of our work is on population likelihood maximization, i.e., when the number of samples is very large.
When the parameters $w$ and $K$ are known, using the law of large numbers, the MLE problem leads to the following population risk optimization problem [15, 70, 71]:

$$\boldsymbol{\theta}^* = \operatorname*{argmax}_{\boldsymbol{\theta}} \ \mathbb{E}\left[\log\left(\sum_{k=1}^{K} w_k f(x; \boldsymbol{\theta}_k)\right)\right]. \qquad (2.1)$$

In this chapter, we focus on the case of equally weighted mixture components, i.e., $w_k = 1/K, \ \forall k$ [15, 70, 71, 74]. We also restrict ourselves to the two widely used families of Gaussian and Laplacian mixture models [53–55, 63–66, 72, 73]. It is worth mentioning that even in these restricted scenarios, the above MLE problem is nonconvex and highly challenging to solve.

2.1.2 EM for the Case of K = 2

Recently, it has been shown that the EM algorithm recovers the ground truth distribution for the equally weighted Gaussian mixture model with K = 2 components [70, 71]. These works show that if the two Gaussian components are equally weighted, then, independent of the dimension of the data, all local optima are globally optimal. Here we extend this result to single-dimensional Laplacian mixture models. Define the Laplacian distribution with the probability density function $L(x; \mu, b) = \frac{1}{2b} e^{-|x - \mu|/b}$, where $\mu$ and $b$ control the mean and variance of the distribution. Thus, the equally weighted Laplacian mixture model with two components has the probability density function

$$P(x; \mu_1, \mu_2, b) = \frac{1}{2} L(x; \mu_1, b) + \frac{1}{2} L(x; \mu_2, b).$$

In the population-level estimation, the overall mean of the data, i.e., $\frac{\mu_1 + \mu_2}{2}$, can be estimated accurately. Hence, without loss of generality, we only need to estimate the normalized difference of the two means, i.e., $\mu^* \triangleq \frac{\mu_1 - \mu_2}{2}$. Under this generic assumption, our observations are drawn from the distribution

$$P(x; \mu^*, b) = \frac{1}{2} L(x; \mu^*, b) + \frac{1}{2} L(x; -\mu^*, b).$$

Our goal is to estimate the parameter $\mu^*$ from the observations $x$ at the population level. Without loss of generality, and for simplicity of presentation, we set $b = 1$ and define $p_\mu(x) \triangleq P(x; \mu, 1)$ and $L(x; \mu) \triangleq L(x; \mu, 1)$. Thus, the $t$-th step of the EM algorithm for estimating the ground truth parameter $\mu^*$ is

$$\lambda_{t+1} = \frac{\mathbb{E}_{x \sim p_{\mu^*}}\left[x \, \frac{0.5\, L(x; \lambda_t)}{p_{\lambda_t}(x)}\right]}{\mathbb{E}_{x \sim p_{\mu^*}}\left[\frac{0.5\, L(x; \lambda_t)}{p_{\lambda_t}(x)}\right]}, \qquad (2.2)$$

where $\lambda_t$ is the estimate of $\mu^*$ in the $t$-th iteration; see [15, 70, 71] for the similar Gaussian case. In the rest of this section, without loss of generality, we assume that $\lambda_0, \mu^* > 0$. Further, to simplify our analysis, we define the mapping

$$M(\lambda, \mu) \triangleq \frac{\mathbb{E}_{x \sim p_\mu}\left[x \, \frac{0.5\, L(x; \lambda)}{p_\lambda(x)}\right]}{\mathbb{E}_{x \sim p_\mu}\left[\frac{0.5\, L(x; \lambda)}{p_\lambda(x)}\right]}.$$

It is easy to verify that $M(\mu^*, \mu^*) = \mu^*$, $M(-\mu^*, \mu^*) = -\mu^*$, $M(0, \mu^*) = 0$, and $\lambda_{t+1} = M(\lambda_t, \mu^*)$. In other words, $\lambda \in \{\mu^*, -\mu^*, 0\}$ are the fixed points of the EM algorithm. Using symmetry, we can simplify $M(\cdot, \cdot)$ as

$$M(\lambda, \mu) = \mathbb{E}_{x \sim L(x; \mu)}\left[x \, \frac{L(x; \lambda) - L(x; -\lambda)}{L(x; \lambda) + L(x; -\lambda)}\right].$$

Let us first establish a few lemmas on the behavior of the mapping $M(\cdot, \cdot)$.

Lemma 1. The derivative of the mapping $M$ with respect to $\lambda$ is positive, i.e., $0 < \frac{\partial}{\partial \lambda} M(\lambda, \mu)$.

Lemma 2. For $0 < \lambda < \eta$, we have

$$\frac{\partial}{\partial \eta} M(\lambda, \eta) = 1 - 2\,\frac{\lambda e^{-\eta} + e^{-\lambda}}{e^{\lambda} + e^{-\lambda}} > 1 - 2\,\frac{\lambda e^{-\lambda} + e^{-\lambda}}{e^{\lambda} + e^{-\lambda}} > 0.$$

The proofs of the above two lemmas can be found in Appendix A.1.

Theorem 1. Without loss of generality, assume that $\lambda_0, \mu^* > 0$. Then the EM iterate defined in (2.2) is a contraction, i.e.,

$$\frac{\lambda_{t+1} - \mu^*}{\lambda_t - \mu^*} < \kappa < 1, \quad \forall t,$$

where $\kappa = \max\{\kappa_1, \kappa_2\}$, $\kappa_1 = \frac{(\mu^* + 1)\, e^{-\mu^*}}{\cosh(\mu^*)}$, and $\kappa_2 = 2\,\frac{\lambda_0 e^{-\lambda_0} + e^{-\lambda_0}}{e^{\lambda_0} + e^{-\lambda_0}}$.

Theorem 1, which is proved in Appendix A.1, shows that the EM iterates converge to the ground truth parameter, which is the global optimum of the MLE.
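The contraction in Theorem 1 can be seen numerically with a minimal sketch that runs the EM map (2.2), with the population expectation replaced by a large Monte Carlo sample; the values of $\mu^*$ and the initial $\lambda$ below are hypothetical.

```python
# A minimal sketch of the EM map (2.2) for the symmetric two-component
# Laplacian mixture, with the population expectation replaced by a large
# Monte Carlo sample. The values of mu_star and the initial lambda are
# hypothetical.
import numpy as np

rng = np.random.default_rng(0)
mu_star, n = 1.5, 200_000
signs = rng.choice([-1.0, 1.0], size=n)              # pick a component uniformly
x = signs * mu_star + rng.laplace(0.0, 1.0, size=n)  # x ~ p_{mu*}

def em_step(lam, x):
    # Responsibility of the +lam component: 0.5 * L(x; lam) / p_lam(x).
    w = 1.0 / (1.0 + np.exp(np.abs(x - lam) - np.abs(x + lam)))
    return np.sum(w * x) / np.sum(w)                 # sample version of (2.2)

lam = 0.2
for _ in range(50):
    lam = em_step(lam, x)
print(lam)   # approaches mu_star = 1.5, consistent with Theorem 1
```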
2.1.3 Modified EM for the Case of K >= 3

In [74] it is conjectured that all local optima of the population-level MLE problem for any equally weighted GMM are globally optimal. Recently, [15] rejected this conjecture by providing a counterexample with K = 3 components. Moreover, they showed that local optima can be arbitrarily far from the ground truth parameters and that, with random initialization, the EM algorithm converges to these spurious local optima with positive probability. Motivated by [15], we numerically study the performance of the EM algorithm in both GMMs and LMMs.

Numerical Experiment 1: Figure 2.1 presents the convergence plots of the EM algorithm with four different initializations. Two of these initializations converge to the global optima, while the other two fail to recover the ground truth parameters and are trapped in spurious local optima. To understand the performance of the EM algorithm with random initialization, we ran the EM algorithm for different numbers of components K and dimensions d. First, we generate the d-dimensional mean vectors $\boldsymbol{\mu}_k \sim \mathcal{N}(0, 5I), \ k = 1, \ldots, K$. These vectors are the mean values of the different Gaussian components. For each Gaussian component, the variance is set to 1. Thus the vectors $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_K$ completely characterize the distribution of the GMM. Then, 30,000 samples are randomly drawn from the generated GMM and the EM algorithm is run with 1000 different initializations, each for 3000 iterations. The table in Figure 2.1 shows the percentage of the times that the EM algorithm converges to the ground truth parameters (the globally optimal point) for different values of K and d. As can be seen in this table, EM fails dramatically, especially for larger values of K.

Figure 2.1: Naïve EM fails to recover the ground truth parameter.

By examining the spurious local optima in the previous numerical experiment, we noticed that many of these local optima fail to satisfy the first moment condition. More specifically, we know that any global optimum of the MLE problem (2.1) should recover the ground truth parameters up to permutation [58, 59]. Hence, any global optimum $\hat{\boldsymbol{\mu}} = (\hat{\boldsymbol{\mu}}_1, \ldots, \hat{\boldsymbol{\mu}}_K)$ has to satisfy the first moment condition $\mathbb{E}(x) = \sum_{k=1}^{K} \frac{1}{K} \hat{\boldsymbol{\mu}}_k$. Without loss of generality, and by shifting all data points, we can assume that $\mathbb{E}(x) = 0$. Thus, $\hat{\boldsymbol{\mu}}$ must satisfy the condition

$$\sum_{k=1}^{K} \hat{\boldsymbol{\mu}}_k = 0. \qquad (2.3)$$

However, according to our numerical experiments, many spurious local optima fail to satisfy (2.3). To enforce condition (2.3), one can regularize the MLE cost function with the first-order moment condition and solve

$$\max_{\boldsymbol{\mu}} \ \mathbb{E}_{\boldsymbol{\mu}^*}\left[\log\left(\sum_{k=1}^{K} \frac{1}{K} f(x; \boldsymbol{\mu}_k)\right)\right] - \frac{M}{2}\left\|\sum_{k=1}^{K} \boldsymbol{\mu}_k\right\|^2, \qquad (2.4)$$

where $M > 0$ is the regularization coefficient. To solve (2.4), we propose the following iterative algorithm:

$$\boldsymbol{\mu}_k^{t+1} = \frac{\mathbb{E}_{\boldsymbol{\mu}^*}[x\, w_k^t(x)] + MK\boldsymbol{\mu}_k^t - M\sum_{j=1}^{K} \boldsymbol{\mu}_j^t}{MK + \mathbb{E}_{\boldsymbol{\mu}^*}[w_k^t(x)]}, \quad \forall k, \qquad (2.5)$$

where $w_k^t(x) \triangleq \frac{f(x; \boldsymbol{\mu}_k^t)}{\sum_{j=1}^{K} f(x; \boldsymbol{\mu}_j^t)}, \ \forall k = 1, \ldots, K$. This algorithm is based on the successive upper-bound minimization framework [75–77]. Notice that if we set M = 0 in (2.5), we obtain the naïve EM algorithm.

Figure 2.2: Modified EM based on the regularized MLE (2.5).
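A finite-sample version of the update (2.5) can be sketched as follows; the dataset and the value M = 0.1 are hypothetical, and setting M = 0 recovers the naïve EM update.

```python
# A minimal sketch of the regularized update (2.5) for a one-dimensional GMM,
# with population expectations replaced by sample averages.
import numpy as np
from scipy.stats import norm

def regularized_em_step(mu, x, M):
    """One step of (2.5). mu: (K,) current means, x: (n,) data, M >= 0."""
    K = mu.shape[0]
    dens = norm.pdf(x[:, None], loc=mu[None, :], scale=1.0)  # f(x_i; mu_k)
    w = dens / dens.sum(axis=1, keepdims=True)               # posteriors w_k(x_i)
    num = (w * x[:, None]).mean(axis=0) + M * K * mu - M * mu.sum()
    return num / (M * K + w.mean(axis=0))

rng = np.random.default_rng(1)
true_mu = np.array([-3.0, 0.0, 3.0])                         # satisfies (2.3)
x = rng.normal(true_mu[rng.integers(3, size=20_000)], 1.0)   # hypothetical data
mu = rng.normal(size=3)                                      # random initialization
for _ in range(300):
    mu = regularized_em_step(mu, x, M=0.1)
print(np.sort(mu))   # may still land in a spurious local optimum
```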
The following theorem establishes the convergence of the iterates in (2.5).

Theorem 2. Any limit point of the iterates generated by (2.5) is a stationary point of (2.4).

The details of the proof of the above theorem can be found in Appendix A.1.

Numerical Experiment 2: To evaluate the performance of the algorithm defined in (2.5), we repeat Numerical Experiment 1 with the proposed iterative method (2.5). Figure 2.2 shows the performance of the proposed iterative method (2.5). As can be seen from the table, the regularized method still fails to recover the ground truth parameters in many scenarios. More specifically, although the regularization term enforces (2.3), it changes the likelihood landscape and hence introduces some new spurious local optima.

In Numerical Experiment 2, we observed that many of the spurious local optima are tied to a fixed value of M. In other words, after getting stuck in a spurious local optimum, changing the value of M helps us escape from that local optimum. Notice that the globally optimal parameter $\boldsymbol{\mu}^*$ is the solution of (2.4) for any value of M. To better understand this approach, consider the set of 3 different functions in Figure 2.3. Each of the functions in the figure has its own separate spurious local optima, but they all share the same global optimum point. Now, if during the optimization process we get stuck at a local optimum that is not globally optimal, we can escape that point by changing the objective function. In more general cases, this approach may increase the chance of converging to the global optimum point. We evaluate this conjecture in Numerical Experiment 3.

Figure 2.3: Multi-objective function

Motivated by this observation, we consider the following objective function:

$$\max_{\boldsymbol{\mu}} \ \mathbb{E}_{\lambda \sim \Lambda}\left[\mathbb{E}_{\boldsymbol{\mu}^*}\left[\log\left(\sum_{k=1}^{K} \frac{1}{K} f(x; \boldsymbol{\mu}_k)\right)\right] - \frac{\lambda}{2}\left\|\sum_{k=1}^{K} \boldsymbol{\mu}_k\right\|^2\right], \qquad (2.6)$$

where $\Lambda$ is some continuous distribution defined over $\lambda$. The idea behind this objective is that each sampled value of $\lambda$ leads to a different set of spurious local optima. However, if a point $\hat{\boldsymbol{\mu}}$ is a fixed point of the EM algorithm for every value of $\lambda$, it must be a stationary point of the MLE function and it should also satisfy the first moment condition (2.3). Based on this objective function, we propose Algorithm 1 for estimating the ground truth parameter.

Algorithm 1 Stochastic multi-objective EM
1: Input: Number of iterations: N_Itr, distribution $\Lambda$, initial estimate: $\boldsymbol{\mu}^0$
2: for t = 1 : N_Itr do
3:   Sample $\lambda \sim \Lambda$
4:   for k = 1 : K do
5:     $w_k^{t-1}(x) \triangleq f(x; \boldsymbol{\mu}_k^{t-1}) \big/ \sum_{j=1}^{K} f(x; \boldsymbol{\mu}_j^{t-1})$
6:     Update $\hat{\boldsymbol{\mu}}_k^t$ via (2.5), using $M = \lambda$ and the weights $w_k^{t-1}$
7:   end for
8:   $\hat{\boldsymbol{\mu}}^t = (\hat{\boldsymbol{\mu}}_1^t, \ldots, \hat{\boldsymbol{\mu}}_K^t)$
9: end for

Numerical Experiment 3: To evaluate the performance of Algorithm 1, we repeat the data generating procedure of Numerical Experiment 1 and run Algorithm 1 on the generated data. Figure 2.4 shows the performance of this algorithm. As can be seen from this figure, the proposed method significantly improves the percentage of times that a random initialization converges to the ground truth parameter. For example, the proposed method converges to the globally optimal parameter 70% of the time for K = 9, d = 3, while the naïve EM converges for 19% of initializations (compare Figure 2.1 and Figure 2.4).

Figure 2.4: Performance of stochastic multi-objective EM.

Remark: While the results in this section are only presented for the GMM model, we have observed similar results for the LMM model. These results are omitted due to lack of space.
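Before moving to mixed linear regression, note that Algorithm 1 amounts to a small change to the sketch of (2.5) given above: resample the regularization weight before every update, so that the spurious local optima of any single objective move around while the shared global optimum stays fixed. The snippet below continues that sketch (reusing regularized_em_step, rng, x and mu), and the choice of $\Lambda$ is hypothetical.

```python
# A minimal sketch of Algorithm 1, reusing regularized_em_step, rng, x and mu
# from the previous sketch: resample lambda ~ Lambda before every update.
for t in range(300):
    lam = rng.exponential(0.1)              # hypothetical choice of Lambda
    mu = regularized_em_step(mu, x, M=lam)
print(np.sort(mu))
```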
2.2 MLE for Mixed Linear Regression

Mixed linear regression (MLR) can be viewed as a generalization of the simple linear regression model in which each observation is generated by a mixture of multiple linear models [78–81]. The ability of MLR to model complex non-linear measurements with simple models has made it popular in different engineering disciplines such as health care [50], time-series analysis [82] and trajectory clustering [51]. Despite the wide applicability of MLR, recovering the parameters of MLR is NP-hard in general [79]. Various studies in the literature consider different simplifying assumptions for inference under MLR models. As an example, [79, 83] assume that the measurement vectors are generated from a uniform Gaussian vector and that the data is not contaminated by any noise. [84] studies the problem when the data is contaminated by Gaussian noise and the model has only two components. Moreover, all of these studies consider a well-specified scenario in which the data is guaranteed to be generated from the exact model. This assumption is not justifiable in practice, as the ground-truth distribution does not exactly follow an MLR model with a given number of components. In such scenarios, one is interested in simply finding the best fit by solving the Maximum Likelihood Estimation (MLE) problem. Such maximum likelihood estimators are robust to inconsistencies between the data and the model [48]. However, MLE-based approaches may result in optimization problems that are hard to solve. In the case of mixture models, this difficulty is mainly due to the nonconvex and combinatorial structure of the resulting problem [49].

One of the most commonly used approaches to solve MLE problems is the Expectation-Maximization (EM) algorithm [2, 3], which is a special case of the successive upper-bound minimization method [76]. To be more specific, this algorithm is an iterative method that consists of two main steps: in the first step (expectation step), it finds a tight lower bound for the MLE problem, and in the second step (maximization step), it maximizes this lower bound [76, 85]. In addition to being efficient and reliable in practice, this algorithm does not require tuning any hyper-parameters. These properties make the algorithm popular among practitioners. However, when applied to mixture models, this algorithm leads to closed-form update rules only for specific noise distributions. In particular, with the exception of a few works [2, 86], most works that use EM for solving the MLE problem assume that the data is contaminated by Gaussian noise. On the other hand, in a wide range of recent applications, such as medical image denoising and video retrieval [53–55], the data is contaminated by noise with a non-Gaussian distribution, such as the Laplacian. Using EM for solving problems under Laplacian noise results in sub-problems that do not have a closed-form update rule. As we discuss later in this work, when the noise has a Laplacian distribution, EM requires solving a linear program in each iteration. When the dimension of the problem is large, this increases the computational cost of applying EM in these settings.

To overcome the barriers of applying EM to MLR models with non-Gaussian noise, we first reformulate the MLE problem for MLR. This reformulation makes the problem amenable to (a modified version of) the Alternating Direction Method of Multipliers (ADMM), where each iteration of the algorithm has a closed form; thus it can be used for solving large-scale problems.
2.2.1 Problem Setup

MLR is a generalization of the simple linear regression model in which the data points $\{(y_i, x_i) \in \mathbb{R}^{d+1}\}_{i=1}^{N}$ are generated by a mixture of linear components, i.e.,

$$y_i = \langle \boldsymbol{\beta}^*_{\alpha_i}, x_i \rangle + \epsilon_i, \quad \forall i \in \{1, \ldots, N\}, \qquad (2.7)$$

where $\{\boldsymbol{\beta}_k^* \in \mathbb{R}^d\}_{k=1}^{K}$ are the ground-truth regression parameters, $\epsilon_i$ is the $i$-th additive noise with probability density function $f_\epsilon(\cdot)$, and $\alpha_i \in \{1, \ldots, K\}$ with $P(\alpha_i = k) = p_k$ and $\sum_{k=1}^{K} p_k = 1$. For simplicity of notation, we define $\boldsymbol{\beta}^* = [\boldsymbol{\beta}_1^*, \ldots, \boldsymbol{\beta}_K^*]$. In this work, we assume that $p_k = \frac{1}{K}, \ \forall k \in \{1, \ldots, K\}$, and that $\{\epsilon_i\}_{i=1}^{N}$ are independent and identically distributed with probability density function $f_\epsilon(\cdot)$ that has a Gaussian or Laplacian distribution, i.e.,

$$f_\epsilon(\epsilon) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\epsilon^2}{2\sigma^2}} \ \text{(Gaussian scenario)}, \qquad f_\epsilon(\epsilon) = \frac{1}{2b}\, e^{-\frac{|\epsilon|}{b}}, \ b = \frac{\sigma}{\sqrt{2}} \ \text{(Laplacian scenario)},$$

where $\sigma$ is the standard deviation of each distribution, assumed to be known a priori. This limited choice of the additive noise is based on the fact that these two distributions cover a wide range of applications, such as medical image denoising [55, 72], video retrieval [53] and clustering trajectories [51]. Figure 2.5 shows an example dataset for a 1-dimensional MLR problem. This dataset consists of 100 data points. Each data point $y$ is generated by multiplying $x \sim \mathcal{N}(0, 1)$ by a randomly selected regression parameter from $\boldsymbol{\beta} = [2, 0.5, 3]$ and then contaminating it with $\epsilon \sim \mathcal{N}(0, 0.01)$.

Figure 2.5: An example dataset for the MLR problem with 100 data points.

Our goal is to infer $\boldsymbol{\beta}^*$ given $\{(y_i, x_i)\}_{i=1}^{N}$ via the maximum likelihood estimator (MLE), which is commonly used in practice [83]. Given the described model, the MLE $\hat{\boldsymbol{\beta}}$ can be computed by solving

$$\hat{\boldsymbol{\beta}} = \operatorname*{argmax}_{\boldsymbol{\beta}} \ \log P(y_1, \ldots, y_N \mid X, \boldsymbol{\beta}) = \operatorname*{argmax}_{\boldsymbol{\beta}} \ \sum_{i=1}^{N} \log P(y_i \mid x_i, \boldsymbol{\beta}) = \operatorname*{argmax}_{\boldsymbol{\beta}} \ \sum_{i=1}^{N} \log \sum_{k=1}^{K} p_k\, f_\epsilon(y_i - \langle x_i, \boldsymbol{\beta}_k \rangle). \qquad (2.8)$$

Next, we discuss how to solve this problem.
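Before turning to the algorithms, a minimal sketch of the generative model (2.7) may help; all sizes and parameter values below are hypothetical.

```python
# A minimal sketch of the generative model (2.7): each response comes from one
# of K linear models chosen uniformly at random, plus Laplacian noise with
# scale b = sigma / sqrt(2). All sizes and parameter values are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
N, d, K, sigma = 20_000, 5, 3, 1.0
beta_true = rng.normal(size=(K, d))                  # ground-truth regressors
X = rng.normal(size=(N, d))                          # measurement vectors x_i
labels = rng.integers(K, size=N)                     # alpha_i ~ Uniform{1,...,K}
eps = rng.laplace(0.0, sigma / np.sqrt(2.0), size=N)
y = np.einsum('nd,nd->n', X, beta_true[labels]) + eps
```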
2.2.2 Review of the Benchmark Algorithms

The Expectation-Maximization (EM) algorithm [2, 3, 85, 87] is a popular method for solving (2.8) due to its simplicity and lack of tuning requirements when the additive noise is Gaussian. However, when the additive noise follows other distributions, such as the Laplacian distribution, EM requires solving sub-problems that are not easy to solve in closed form [88]. In what follows, we first derive the steps of the EM algorithm for solving (2.8) in general form and then show the difficulties that arise when the additive noise has a non-Gaussian distribution such as the Laplacian.

2.2.3 Expectation-Maximization Algorithm

The EM algorithm is an iterative method that in each iteration finds a tight lower bound for the objective function of the MLE problem and maximizes that lower bound [76, 85]. More precisely, the first step (E-step) involves updating the latent data labels, and the second step (M-step) updates the parameters. That is, the first step updates the probability of each data point belonging to each label given the estimated coefficients, and the second step updates the coefficients given the labels of all data points. Let $\boldsymbol{\beta}^t = (\boldsymbol{\beta}_1^t, \ldots, \boldsymbol{\beta}_K^t)$ be the estimated regressors and $w_{k,i}^t$ be the probability that the $i$-th data point belongs to the $k$-th component at iteration $t$. Starting from the initial points $\boldsymbol{\beta}^0$ and $w_{k,i}^0$, the two major steps of the EM algorithm are as follows.

E-step:

$$w_{k,i}^{t+1} = \frac{f_\epsilon(y_i - \langle x_i, \boldsymbol{\beta}_k^t \rangle)}{\sum_{j=1}^{K} f_\epsilon(y_i - \langle x_i, \boldsymbol{\beta}_j^t \rangle)}, \quad \forall k, i.$$

M-step:

$$\boldsymbol{\beta}^{t+1} = \operatorname*{argmin}_{\boldsymbol{\beta}} \ -\sum_{i=1}^{N} \sum_{k=1}^{K} w_{k,i}^{t+1} \log f_\epsilon(y_i - \langle \boldsymbol{\beta}_k, x_i \rangle) = \operatorname*{argmin}_{\boldsymbol{\beta}} \ -\sum_{k=1}^{K} \sum_{i=1}^{N} w_{k,i}^{t+1} \log f_\epsilon(y_i - \langle \boldsymbol{\beta}_k, x_i \rangle). \qquad (2.9)$$

The problem in (2.9) is separable with respect to the $\boldsymbol{\beta}_k$'s. Thus, we can estimate the $\boldsymbol{\beta}_k^{t+1}$'s in parallel by solving

$$\boldsymbol{\beta}_k^{t+1} = \operatorname*{argmin}_{\boldsymbol{\beta}_k} \ -\sum_{i=1}^{N} w_{k,i}^{t+1} \log f_\epsilon(y_i - \langle \boldsymbol{\beta}_k, x_i \rangle), \quad \forall k. \qquad (2.10)$$

Let us discuss this optimization problem for the two cases of Gaussian and Laplacian noise.

Additive Gaussian noise: When the additive noise has a Gaussian distribution, problem (2.10) is equivalent to

$$\boldsymbol{\beta}_k^{t+1} = \operatorname*{argmin}_{\boldsymbol{\beta}_k} \ \sum_{i=1}^{N} w_{k,i}^{t+1} (y_i - \langle \boldsymbol{\beta}_k, x_i \rangle)^2, \quad \forall k.$$

It can easily be shown that this problem has the closed-form solution

$$\boldsymbol{\beta}_k^{t+1} = \left(\sum_{i=1}^{N} w_{k,i}^{t+1} x_i x_i^T\right)^{-1} \sum_{i=1}^{N} w_{k,i}^{t+1} y_i x_i, \quad \forall k. \qquad (2.11)$$

Additive Laplacian noise: For the Laplacian case, the problem in (2.10) is equivalent to

$$\boldsymbol{\beta}_k^{t+1} = \operatorname*{argmin}_{\boldsymbol{\beta}_k} \ \sum_{i=1}^{N} w_{k,i}^{t+1} |y_i - \langle \boldsymbol{\beta}_k, x_i \rangle|, \quad \forall k. \qquad (2.12)$$

Despite its convexity, this optimization problem is non-smooth. Thus, one needs to use sub-gradient or other iterative methods for solving it. However, these methods suffer from a slow rate of convergence and are sensitive to the tuning of hyperparameters such as the step-size [89]. Another potential approach for solving (2.12) is to reformulate it as a linear program:

$$\boldsymbol{\beta}_k^{t+1} = \operatorname*{argmin}_{\boldsymbol{\beta}_k, \{h_i\}_{i=1}^{N}} \ \sum_{i=1}^{N} w_{k,i}^{t+1} h_i \quad \text{s.t.} \quad h_i \ge y_i - \langle \boldsymbol{\beta}_k, x_i \rangle, \ \ h_i \ge -(y_i - \langle \boldsymbol{\beta}_k, x_i \rangle), \quad \forall i = 1, \ldots, N.$$

However, this linear program has to be solved in each iteration of the EM algorithm, which makes EM computationally expensive in the presence of Laplacian noise (especially in large-scale problems). The following pseudo-code summarizes the steps of the EM algorithm for both the Gaussian and Laplacian cases.

Algorithm 2 EM Algorithm for the MLR problem
1: Input: $\boldsymbol{\beta}^0$: initial value for $\boldsymbol{\beta}$, N_Itr: maximum number of iterations
2: for t = 1 : N_Itr do
3:   $\hat{w}_{k,i}^t = f_\epsilon(y_i - \langle x_i, \hat{\boldsymbol{\beta}}_k^{t-1} \rangle) \big/ \sum_{j=1}^{K} f_\epsilon(y_i - \langle x_i, \hat{\boldsymbol{\beta}}_j^{t-1} \rangle), \quad \forall k, i$
4:   if $\epsilon$ has a Gaussian distribution:
5:     $\hat{\boldsymbol{\beta}}_k^t = \left(\sum_{i=1}^{N} \hat{w}_{k,i}^t x_i x_i^T\right)^{-1} \sum_{i=1}^{N} \hat{w}_{k,i}^t y_i x_i, \quad \forall k$
6:   if $\epsilon$ has a Laplacian distribution:
7:     $\hat{\boldsymbol{\beta}}_k^t = \operatorname{argmin}_{\boldsymbol{\beta}_k} \sum_{i=1}^{N} \hat{w}_{k,i}^t |y_i - \langle \boldsymbol{\beta}_k, x_i \rangle|, \quad \forall k$
8: end for
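As a reference point for the Gaussian case, one EM iteration with the closed-form M-step (2.11) can be sketched as follows; the function and variable names are ours.

```python
# A minimal sketch of one EM iteration for MLR under Gaussian noise: the
# E-step computes the responsibilities w_{k,i}, and the M-step solves the
# weighted least-squares problems (2.11) in closed form, one per component.
import numpy as np

def em_iteration_gaussian(X, y, beta, sigma=1.0):
    """X: (N, d), y: (N,), beta: (K, d) current estimate. Returns new beta."""
    resid = y[:, None] - X @ beta.T                  # y_i - <x_i, beta_k>
    logf = -resid ** 2 / (2.0 * sigma ** 2)          # Gaussian log-density (+ const)
    logf -= logf.max(axis=1, keepdims=True)          # numerical stabilization
    w = np.exp(logf)
    w /= w.sum(axis=1, keepdims=True)                # E-step: w_{k,i}
    new_beta = np.empty_like(beta)
    for k in range(beta.shape[0]):                   # M-step, per component
        Xw = X * w[:, k:k + 1]
        # (sum_i w x_i x_i^T)^{-1} (sum_i w y_i x_i), as in (2.11)
        new_beta[k] = np.linalg.solve(Xw.T @ X, Xw.T @ y)
    return new_beta
```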
Since the M-step of the above EM algorithm does not have a closed form in the Laplacian case, we next propose an ADMM-based approach for solving this problem. In order to describe our ADMM-based algorithm, we first review the ADMM algorithm.

2.2.4 Alternating Direction Method of Multipliers (ADMM)

The Alternating Direction Method of Multipliers (ADMM) [90–92] is one of the most commonly used approaches for solving problems of the form
$$\min_{\boldsymbol{\beta}, \mathbf{Z}}\ f(\boldsymbol{\beta}) + g(\mathbf{Z}) \quad \text{s.t.} \quad \mathbf{X}\boldsymbol{\beta} + \mathbf{M}\mathbf{Z} = \mathbf{N}, \qquad (2.13)$$
with variables $\boldsymbol{\beta} \in \mathbb{R}^{d\times K}$ and $\mathbf{Z} \in \mathbb{R}^{n \times K}$, given $\mathbf{X} \in \mathbb{R}^{l\times d}$, $\mathbf{M} \in \mathbb{R}^{l\times n}$, $\mathbf{N} \in \mathbb{R}^{l\times K}$, $f(\cdot): \mathbb{R}^{d\times K} \to \mathbb{R}$, and $g(\cdot): \mathbb{R}^{n\times K} \to \mathbb{R}$. The method first defines the augmented Lagrangian
$$\mathcal{L}(\boldsymbol{\beta}, \mathbf{Z}, \boldsymbol{\lambda}) = f(\boldsymbol{\beta}) + g(\mathbf{Z}) + \langle \boldsymbol{\lambda},\, \mathbf{X}\boldsymbol{\beta} + \mathbf{M}\mathbf{Z} - \mathbf{N} \rangle + \frac{\rho}{2}\,\|\mathbf{X}\boldsymbol{\beta} + \mathbf{M}\mathbf{Z} - \mathbf{N}\|_F^2,$$
where $\langle \cdot, \cdot \rangle$ is the Euclidean inner product and $\rho > 0$ is a given constant. It then updates the variables iteratively as summarized in Algorithm 3. This algorithm has been used in the literature in both convex and nonconvex scenarios [90, 91].

Algorithm 3 General ADMM algorithm
1: Input: initial values $\boldsymbol{\beta}^0, \mathbf{Z}^0, \boldsymbol{\lambda}^0$; dual update step $\rho$; number of iterations $N_{\text{Itr}}$
2: for $t = 0 : N_{\text{Itr}} - 1$ do
3:   $\mathbf{Z}^{t+1} = \operatorname{argmin}_{\mathbf{Z}}\, \mathcal{L}(\boldsymbol{\beta}^t, \mathbf{Z}, \boldsymbol{\lambda}^t)$
4:   $\boldsymbol{\beta}^{t+1} = \operatorname{argmin}_{\boldsymbol{\beta}}\, \mathcal{L}(\boldsymbol{\beta}, \mathbf{Z}^{t+1}, \boldsymbol{\lambda}^t)$
5:   $\boldsymbol{\lambda}^{t+1} = \boldsymbol{\lambda}^t + \rho(\mathbf{X}\boldsymbol{\beta}^{t+1} + \mathbf{M}\mathbf{Z}^{t+1} - \mathbf{N})$
6: end for
7: return $\boldsymbol{\beta}^{N_{\text{Itr}}}, \mathbf{Z}^{N_{\text{Itr}}}$

2.2.5 ADMM for Maximum Likelihood Estimation of MLR Parameters

Defining $z_{k,i} = \langle \mathbf{x}_i, \boldsymbol{\beta}_k \rangle$, let us reformulate (2.8) as
$$\min_{\boldsymbol{\beta}}\ -\sum_{i=1}^N \log \sum_{k=1}^K p_k\, f_\epsilon(y_i - z_{k,i}) \quad \text{s.t.} \quad z_{k,i} = \langle \mathbf{x}_i, \boldsymbol{\beta}_k \rangle, \ \forall i,k. \qquad (2.14)$$
Now define $\mathbf{X} = [\mathbf{x}_1; \cdots; \mathbf{x}_N]^\top$, $\mathbf{z}_k = [z_{k,1},\dots,z_{k,N}]^\top$, $\boldsymbol{\lambda}_k = [\lambda_{k,1},\dots,\lambda_{k,N}]^\top$, $\mathbf{Z} = [\mathbf{z}_1,\dots,\mathbf{z}_K]$, and $\boldsymbol{\lambda} = [\boldsymbol{\lambda}_1,\dots,\boldsymbol{\lambda}_K]$. The problem in (2.14) is of the form (2.13) with $f(\boldsymbol{\beta}) \triangleq 0$, $\mathbf{M} = -\mathbf{I}$, and $\mathbf{N} = 0$. As a result, the augmented Lagrangian takes the form
$$\mathcal{L}(\boldsymbol{\beta}, \mathbf{Z}, \boldsymbol{\lambda}) = -\sum_{i=1}^N \log\sum_{k=1}^K p_k\, f_\epsilon(y_i - z_{k,i}) + \langle \boldsymbol{\lambda},\, \mathbf{X}\boldsymbol{\beta} - \mathbf{Z} \rangle + \frac{\rho}{2}\,\|\mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\|_F^2. \qquad (2.15)$$
Applying the ADMM algorithm summarized in Algorithm 3 yields the iterations
$$\mathbf{Z}^{t+1} = \operatorname*{argmin}_{\mathbf{Z}}\, \mathcal{L}(\boldsymbol{\beta}^t, \mathbf{Z}, \boldsymbol{\lambda}^t), \qquad \boldsymbol{\beta}^{t+1} = \operatorname*{argmin}_{\boldsymbol{\beta}}\, \mathcal{L}(\boldsymbol{\beta}, \mathbf{Z}^{t+1}, \boldsymbol{\lambda}^t), \qquad \boldsymbol{\lambda}^{t+1} = \boldsymbol{\lambda}^t + \rho(\mathbf{X}\boldsymbol{\beta}^{t+1} - \mathbf{Z}^{t+1}). \qquad (2.16)$$
The steps in (2.16) are simplified as follows. The update rule for $\mathbf{Z}$ in (2.16) can be written as
$$\mathbf{Z}^{t+1} = \operatorname*{argmin}_{\mathbf{Z}}\, \mathcal{L}(\boldsymbol{\beta}^t, \mathbf{Z}, \boldsymbol{\lambda}^t) = \operatorname*{argmin}_{\mathbf{Z}}\ -\sum_{i=1}^N \log\sum_{k=1}^K p_k\, f_\epsilon(y_i - z_{k,i}) + \frac{\rho}{2}\,\|\mathbf{X}\boldsymbol{\beta}^t - \mathbf{Z} + \rho^{-1}\boldsymbol{\lambda}^t\|_F^2.$$
Unfortunately, this sub-problem does not have a closed-form solution since it is not decomposable across the coordinates of $\mathbf{Z}$. To overcome this issue, following the BSUM framework [76, 93] (see also [94, Section 2.2.6]), we update $\mathbf{Z}^{t+1}$ by minimizing a locally tight upper bound of $\mathcal{L}(\boldsymbol{\beta}^t, \mathbf{Z}, \boldsymbol{\lambda}^t)$:
$$\mathcal{L}(\boldsymbol{\beta}^t, \mathbf{Z}, \boldsymbol{\lambda}^t) \le \hat{\mathcal{L}}(\boldsymbol{\beta}^t, \mathbf{Z}, \boldsymbol{\lambda}^t) = -\sum_{i=1}^N\sum_{k=1}^K w^{t+1}_{k,i}\log f_\epsilon(y_i - z_{k,i}) + C + \langle \boldsymbol{\lambda}^t,\, \mathbf{X}\boldsymbol{\beta}^t - \mathbf{Z} \rangle + \frac{\rho}{2}\,\|\mathbf{X}\boldsymbol{\beta}^t - \mathbf{Z}\|_F^2, \qquad (2.17)$$
where $w^{t+1}_{k,i} = f_\epsilon(y_i - \langle \mathbf{x}_i, \boldsymbol{\beta}^t_k\rangle) \big/ \sum_{j=1}^K f_\epsilon(y_i - \langle\mathbf{x}_i,\boldsymbol{\beta}^t_j\rangle)$ for all $k,i$, and $C = \sum_{i=1}^N \big[\sum_{k=1}^K w^{t+1}_{k,i}\log f_\epsilon(y_i - z^t_{k,i}) - \log\sum_{k=1}^K p_k f_\epsilon(y_i - z^t_{k,i})\big]$ is a constant. Here, the upper bound is derived using Jensen's inequality and the concavity of the logarithm. Unlike the original function, $\hat{\mathcal{L}}(\boldsymbol{\beta}^t, \mathbf{Z}, \boldsymbol{\lambda}^t)$ is separable in the $z_{k,i}$'s; that is, $\hat{\mathcal{L}}(\boldsymbol{\beta}^t, \mathbf{Z}, \boldsymbol{\lambda}^t) = \text{Constant} + \sum_{i=1}^N\sum_{k=1}^K \hat\ell(z_{k,i})$, where
$$\hat\ell(z_{k,i}) = -w^{t+1}_{k,i}\log f_\epsilon(y_i - z_{k,i}) - \lambda_{k,i}\, z_{k,i} + \frac{\rho}{2}\big(\mathbf{x}_i^T\boldsymbol{\beta}^t_k - z_{k,i}\big)^2.$$
For simplicity of presentation, we do not explicitly show the dependence of $\hat\ell$ on $\boldsymbol{\beta}$ and $\mathbf{Z}$. In the following, we analyze this optimization problem for both the Gaussian and the Laplacian case.

Gaussian case: When the additive noise has a Gaussian distribution, the proposed upper bound in (2.17) is easily optimized by setting the derivative to zero:
$$\frac{\partial \hat\ell(z_{k,i})}{\partial z_{k,i}} = 0 \ \Rightarrow\ z^{t+1}_{k,i} = \frac{w^{t+1}_{k,i}\, y_i + \sigma^2\rho\, \mathbf{x}_i^T\boldsymbol{\beta}^t_k + \sigma^2\lambda^t_{k,i}}{w^{t+1}_{k,i} + \sigma^2\rho}.$$

Laplacian case: In the Laplacian noise case, we have
$$\hat\ell(z_{k,i}) = \frac{w^{t+1}_{k,i}\,|y_i - z_{k,i}|}{b} - \lambda_{k,i}\, z_{k,i} + \frac{\rho}{2}\big(\mathbf{x}_i^T\boldsymbol{\beta}^t_k - z_{k,i}\big)^2 = \begin{cases} \dfrac{w^{t+1}_{k,i}(y_i - z_{k,i})}{b} - \lambda_{k,i} z_{k,i} + \dfrac{\rho}{2}(\mathbf{x}_i^T\boldsymbol{\beta}^t_k - z_{k,i})^2, & z_{k,i} < y_i,\\[2mm] -\dfrac{w^{t+1}_{k,i}(y_i - z_{k,i})}{b} - \lambda_{k,i} z_{k,i} + \dfrac{\rho}{2}(\mathbf{x}_i^T\boldsymbol{\beta}^t_k - z_{k,i})^2, & z_{k,i} > y_i. \end{cases}$$
The optimal solutions over the two intervals are clearly among the three points $\{y_i, \bar z_{k,i}, \tilde z_{k,i}\}$, where
$$\bar z_{k,i} = \mathbf{x}_i^T\boldsymbol{\beta}^t_k + \frac{\lambda_{k,i} b + w_{k,i}}{b\rho}, \qquad \tilde z_{k,i} = \mathbf{x}_i^T\boldsymbol{\beta}^t_k - \frac{-\lambda_{k,i} b + w_{k,i}}{b\rho}.$$
As a result, $z_{k,i}$ is updated via
$$z^{t+1}_{k,i} = \operatorname*{argmin}_{z_{k,i}\in\{y_i,\, \bar z_{k,i},\, \tilde z_{k,i}\}} \hat\ell(z_{k,i}). \qquad (2.18)$$
Notice that the minimization problem in (2.18) is over only three points and can be solved efficiently.
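A minimal sketch of the Laplacian z-update (2.18) is given below; variable names are ours, and the routine simply evaluates the surrogate $\hat\ell$ at the three candidate points derived above and keeps the minimizer.

    import numpy as np

    def z_update_laplacian(y_i, a, w, lam, rho, b):
        """Update z_{k,i} by (2.18). a = x_i^T beta_k^t; w and lam are the
        current BSUM weight and dual multiplier; b is the Laplacian scale."""
        z_bar = a + (lam * b + w) / (b * rho)      # stationary point on {z < y_i}
        z_tilde = a - (w - lam * b) / (b * rho)    # stationary point on {z > y_i}
        candidates = np.array([y_i, z_bar, z_tilde])
        ell = (w * np.abs(y_i - candidates) / b - lam * candidates
               + 0.5 * rho * (a - candidates) ** 2)
        return candidates[np.argmin(ell)]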
In both the Laplacian and the Gaussian case, the update of $\boldsymbol{\beta}$ can be carried out through
$$\nabla_{\boldsymbol{\beta}}\mathcal{L}(\boldsymbol{\beta}^{t+1}, \mathbf{Z}^{t+1}, \boldsymbol{\lambda}^t) = 0 \ \Rightarrow\ \mathbf{X}^\top\boldsymbol{\lambda}^t + \rho\,\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}^{t+1} - \rho\,\mathbf{X}^\top\mathbf{Z}^{t+1} = 0 \ \Rightarrow\ \boldsymbol{\beta}^{t+1} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\big(\mathbf{Z}^{t+1} - \rho^{-1}\boldsymbol{\lambda}^t\big).$$
The steps of our proposed algorithm are summarized in Algorithm 4. Unlike the EM algorithm, the proposed method has a closed-form solution in each of its sub-problems and can therefore be executed efficiently under either noise model.

Algorithm 4 Proposed ADMM-based Algorithm
1: Input: $\boldsymbol{\lambda}^0$: initial value for $\boldsymbol{\lambda}$; $\boldsymbol{\beta}^0$: initial value for $\boldsymbol{\beta}$; $\rho$: a positive constant
2: for $t = 0 : N_{\text{Itr}} - 1$ do
3:   $w^{t+1}_{k,i} = f_\epsilon(y_i - \langle\mathbf{x}_i, \boldsymbol{\beta}^t_k\rangle) \big/ \sum_{j=1}^K f_\epsilon(y_i - \langle\mathbf{x}_i, \boldsymbol{\beta}^t_j\rangle)$, $\forall k, i$
4:   Gaussian case: $z^{t+1}_{k,i} = \dfrac{y_i w^{t+1}_{k,i} + \sigma^2\rho\,\mathbf{x}_i^T\boldsymbol{\beta}^t_k + \sigma^2\lambda^t_{k,i}}{w^{t+1}_{k,i} + \sigma^2\rho}$, $\forall k, i$
5:   Laplacian case: $z^{t+1}_{k,i}$ is updated using (2.18), $\forall k, i$
6:   $\boldsymbol{\beta}^{t+1} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top(\mathbf{Z}^{t+1} - \rho^{-1}\boldsymbol{\lambda}^t)$
7:   $\boldsymbol{\lambda}^{t+1} = \boldsymbol{\lambda}^t + \rho(\mathbf{X}\boldsymbol{\beta}^{t+1} - \mathbf{Z}^{t+1})$
8: end for
9: return $\boldsymbol{\beta}^{N_{\text{Itr}}}$

2.2.6 Numerical Experiments

In this section, we evaluate the performance of the proposed method in estimating the regression components of the MLR problem under different noise structures. We consider $K \in \{2,\dots,14\}$ components and dimension $d \in \{1,\dots,5\}$. For each pair $(K,d)$, we first generate the $K$ regressors, $\boldsymbol{\beta}^*_k \sim \mathcal{N}(0, \mathbf{I}_d)$, and $N = 20000$ samples $\mathbf{x}_i \sim \mathcal{N}(0, \mathbf{I}_d)$, $\forall i \in \{1,\dots,N\}$, $k \in \{1,\dots,K\}$. Then we generate the response by $y_i = \langle \boldsymbol{\beta}^*\mathbf{z}_i, \mathbf{x}_i\rangle + \epsilon_i$, where $\mathbf{z}_i$ takes values in $\{\mathbf{e}_1,\dots,\mathbf{e}_K\}$ uniformly ($\mathbf{e}_j$ being the $j$-th unit vector) and $\epsilon_i$ is the additive noise. We run both the proposed Algorithm 4 and the EM algorithm for estimating the coefficients, starting from the same initial points and with $N_{\text{Itr}} = 1000$ iterations. The above experiment is repeated 30 times for each pair $(K,d)$. Additionally, the whole procedure is repeated separately for Gaussian and Laplacian additive noise with $\sigma = 1$. To compute the estimation error, after both algorithms terminate, we find the assignment between the ground-truth parameters $\{\boldsymbol{\beta}^*_k\}_{k=1}^K$ and the estimated parameters $\{\hat{\boldsymbol{\beta}}_k\}_{k=1}^K$ with minimum distance and report that distance as the recovery error. Fig. 2.6 shows the results of the experiment. The figure consists of two tables: the numbers are the average recovery error and the numbers in parentheses are the standard deviations over the 30 runs for each pair $(K,d)$. In these tables, blue numbers show the performance of the proposed method and the numbers below them show the performance of the EM method for the same pair $(K,d)$. For the Laplacian case, a simple paired t-test with $\alpha = 0.05$ on our results rejects the hypothesis that "EM outperforms our method," while the hypothesis that "our method outperforms EM" is not rejected. To compare computational cost, we record the difference in computation time, (computation time of the EM algorithm minus computation time of Algorithm 4), for each of the 30 repetitions of each pair $(K,d)$, for a total of $1950 = 30\times 13\times 5$ data points. The histograms in Fig. 2.7 show the statistics of these differences. As seen in the figure, the EM method is very slow when the noise follows a Laplacian distribution.
In this case, the difference in computation time ranges from 1.6 minutes to 10 minutes depending on the size of the problem. This is mainly because EM leads to sub-problems that are computationally expensive to solve, while the proposed algorithm enjoys a closed-form solution in each of its steps under any noise scenario.

Figure 2.6: Average recovery error for (a) Gaussian and (b) Laplacian cases. Blue numbers represent the performance of the proposed method, and numbers in parentheses are the standard deviations.

Figure 2.7: Difference in computation time for the 1950 experimental data points, generated by (computation time of the EM algorithm minus computation time of Algorithm 4), for (a) Gaussian and (b) Laplacian cases. EM is very slow in the Laplacian case.

Chapter 3
Algorithms for Generative Models and Adversarial Learning

Generative Adversarial Networks (GANs) have drawn a lot of attention recently, mainly due to their successful performance in a wide range of applications such as generating high-resolution images, creative language generation, and commonsense reasoning [95]. A GAN is composed of two competing neural network models called the Generator and the Discriminator. The Generator's task is to create fake data instances that are as indistinguishable as possible from the real data. The Discriminator, on the contrary, attempts to distinguish between fake data and real data. In practice, training GANs is time-consuming and numerically unstable, and it usually requires tedious fine-tuning. In this chapter, we focus on nonconvex min-max optimization problems, which are the main problems arising in training GANs.

To better understand the training procedure of GANs, let the real data distribution be denoted by $P_x$ with $\mathbf{x}\in\mathbb{R}^d$. Define a random variable $\mathbf{y}\in\mathbb{R}^n$ with an arbitrary given distribution $P_y$, such as a Gaussian, and a parametric function $G_{\boldsymbol{\theta}}(\mathbf{y}): \mathbf{y}\mapsto\mathbb{R}^d$ that is usually modeled by a deep neural network. Additionally, consider a distance measure $\rho(\cdot,\cdot)$ that takes two distributions as input and outputs a non-negative number proportional to the dissimilarity between the two distributions. From an optimization perspective, the GAN problem can be formulated as
$$\min_{\boldsymbol{\theta}}\ \rho\big(P_x,\, P_{G_{\boldsymbol{\theta}}(\mathbf{y})}\big).$$
In other words, GANs aim at optimizing the parameterized function $G_{\boldsymbol{\theta}}(\cdot)$ so that the distribution of $G_{\boldsymbol{\theta}}(\mathbf{y})$ is as close as possible to the distribution of the real data under the distance metric $\rho(\cdot,\cdot)$. For two distributions $p$ and $q$, one of the most commonly used measures is the optimal transport distance, defined as
$$\rho(p,q) = \min_{\pi\in\Pi(p,q)} \int_Y\!\int_X \pi(x,y)\, c(x,y)\, dx\, dy,$$
where $\Pi(p,q)$ is the set of all joint distributions having marginal distributions $p$ and $q$, and $c:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$ is a given cost function. As a result, the GAN problem leads to the following optimization problem:
$$\min_{\boldsymbol{\theta}}\ \min_{\pi\in\Pi(P_x,\, P_{G_{\boldsymbol{\theta}}(\mathbf{y})})} \int_Y\!\int_X \pi\big(G_{\boldsymbol{\theta}}(y), x\big)\, c\big(G_{\boldsymbol{\theta}}(y), x\big)\, dx\, dy. \qquad (3.1)$$
However, since parametrizing the transport plan $\pi$ is difficult, in practice the dual of (3.1) is used, which results in the following optimization problem:
$$\min_{\boldsymbol{\theta}}\ \max_{\boldsymbol{\alpha},\boldsymbol{\beta}}\ \ \mathbb{E}_{\mathbf{y}\sim P_y}\big[\phi_{\boldsymbol{\alpha}}\big(G_{\boldsymbol{\theta}}(\mathbf{y})\big)\big] - \mathbb{E}_{\mathbf{x}\sim P_x}\big[\psi_{\boldsymbol{\beta}}(\mathbf{x})\big] \quad \text{s.t.} \quad \phi_{\boldsymbol{\alpha}}\big(G_{\boldsymbol{\theta}}(\mathbf{y})\big) - \psi_{\boldsymbol{\beta}}(\mathbf{x}) \le c\big(G_{\boldsymbol{\theta}}(\mathbf{y}), \mathbf{x}\big), \ \forall(\mathbf{x},\mathbf{y}), \qquad (3.2)$$
where, for practical considerations, we have assumed that the dual functions $\phi$ and $\psi$ belong to sets of parametric functions with parameters $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, respectively.

As a consequence, optimizing GANs requires solving a min-max saddle point game that is hard to solve in general. To address these challenges, we start by studying a general class of non-differentiable nonconvex min-max games. Later, we show that the GAN problem requires solving a broader class of nonconvex-nonconcave min-max games. As a result, in the final part of this chapter, we propose a new formulation of the GAN problem that does not require solving a zero-sum game. The idea behind this new formulation is to replace adversarial training of the discriminator with a computationally cheap random projection procedure. This leads to a simpler and more stable training procedure by removing the optimization step needed for learning the discriminator. Moreover, the resulting generator, whose quality is of main interest in applications, can outperform models trained in the conventional way, especially when the data has an implicit low-dimensional structure.

Besides GANs, many other problems in machine learning and signal processing require solving a min-max saddle point game. As an example, consider the recent efforts on "fair machine learning." These efforts are motivated by studies showing that machine learning algorithms can result in systematic discrimination against people in minority groups [22, 96]. Several methods proposed in the literature aim to overcome this issue using an adversarial approach, which requires solving a min-max saddle point game [22]. Similar approaches are used for the resource allocation problem in wireless communication systems, where the goal is to maximize the minimum rate among all users, which again requires solving min-max games [25].

3.1 Adversarial Learning and Solving Nonconvex-Concave Nondifferentiable Min-Max Games

Min-max saddle point games appear in a wide range of applications such as training Generative Adversarial Networks [6, 97–99], fair statistical inference [22, 96, 100], and training robust neural networks and systems [2, 101, 102]. In such a game, the goal is to solve an optimization problem of the form
$$\min_{\boldsymbol{\theta}\in\Theta}\ \max_{\boldsymbol{\alpha}\in\mathcal{A}}\ f(\boldsymbol{\theta},\boldsymbol{\alpha}), \qquad (3.3)$$
which can be viewed as a two-player game in which one player aims to increase the objective while the other tries to decrease it. From a game-theoretic point of view, we may aim to find Nash equilibria [17], at which no player can do better by unilaterally changing its strategy. Unfortunately, finding (or even checking) such Nash equilibria is hard in general [18]. Moreover, such Nash equilibria might not even exist. One of the main reasons for this difficulty is that intuitive generalizations of classical optimization tools fail in practice even for simple min-max games. For example, consider the following simultaneous gradient descent-ascent updates for solving (3.3):
$$\boldsymbol{\alpha}_t = \boldsymbol{\alpha}_{t-1} + \eta\,\nabla_{\boldsymbol{\alpha}} f(\boldsymbol{\theta}_{t-1}, \boldsymbol{\alpha}_{t-1}), \qquad \boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta\,\nabla_{\boldsymbol{\theta}} f(\boldsymbol{\theta}_{t-1}, \boldsymbol{\alpha}_{t-1}).$$
The following example shows the performance of this method on a simple game.

Example 1. Consider the two-player zero-sum min-max game $\min_\theta\max_\alpha f(\theta,\alpha) = \theta\alpha$. The point $(0,0)$ is the only Nash equilibrium of this game. Let us use simple simultaneous gradient descent-ascent to find this point.
This algorithm results in the update rule
$$\theta_t = \theta_{t-1} - \eta\,\alpha_{t-1}, \qquad \alpha_t = \alpha_{t-1} + \eta\,\theta_{t-1}.$$
Let $R_{t-1} = \sqrt{\theta_{t-1}^2 + \alpha_{t-1}^2}$ be the distance of the iterate at iteration $t-1$ from the origin. Simple algebra shows that $\theta_t^2 + \alpha_t^2 = (1+\eta^2)(\theta_{t-1}^2 + \alpha_{t-1}^2)$; as a result, $R_t = (1+\eta^2)^{t/2} R_0$. This shows that the iterates diverge from the origin, which is the only Nash equilibrium point. We also implement alternating gradient descent-ascent, whose update rule is
$$\theta_t = \theta_{t-1} - \eta\,\alpha_{t-1}, \qquad \alpha_t = \alpha_{t-1} + \eta\,\theta_t.$$
Figure 3.1 shows the result of running these methods starting from the initial point $(0.3, 0.2)$ with step size $10^{-2}$. As seen in the figure, neither algorithm converges to the only Nash equilibrium point, $(0,0)$.

Figure 3.1: Simultaneous gradient descent-ascent (left) and alternating gradient descent-ascent (right); neither algorithm converges to the only Nash equilibrium point, (0,0).
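The divergence in Example 1 is easy to reproduce numerically. The short script below (our own illustration, not thesis code) runs simultaneous gradient descent-ascent from (0.3, 0.2) and checks that the distance from the origin grows exactly as $R_t = (1+\eta^2)^{t/2}R_0$.

    import numpy as np

    eta, T = 1e-2, 1000
    theta, alpha = 0.3, 0.2                       # initial point of Example 1
    for _ in range(T):
        # simultaneous gradient descent-ascent on f(theta, alpha) = theta * alpha
        theta, alpha = theta - eta * alpha, alpha + eta * theta
    R_T = np.hypot(theta, alpha)
    # the two printed values coincide: the iterates spiral away from (0, 0)
    print(R_T, np.hypot(0.3, 0.2) * (1 + eta**2) ** (T / 2))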
Due to these difficulties, many works focus on special cases such as convex-concave problems, where $f(\boldsymbol{\theta},\cdot)$ is concave for any given $\boldsymbol{\theta}$ and $f(\cdot,\boldsymbol{\alpha})$ is convex for any given $\boldsymbol{\alpha}$. Under this assumption, algorithms such as optimistic mirror descent [28–31], the Frank-Wolfe algorithm [32, 33], and primal-dual methods [34] have been studied. In general nonconvex settings, [103] considers the weakly convex-concave case and proposes a primal-dual based approach for finding approximate stationary solutions. More recently, the works [47, 104–106] examine the min-max problem in nonconvex-(strongly)-concave cases and propose first-order algorithms for solving them. Some of these results have been accelerated in the "Moreau envelope regime" by the recent work [107], which first studies the problem in smooth strongly convex-concave and convex-concave settings and proposes an algorithm based on a combination of Mirror-Prox [27] and Nesterov's accelerated gradient descent [108]; the algorithm is then extended to the smooth nonconvex-concave scenario. Some of the aforementioned results have been extended to zeroth-order methods for solving nonconvex-concave min-max optimization problems [109, 110]. As a first step toward solving nonconvex-nonconcave min-max problems, [47] studies a class of games in which one of the players satisfies the Polyak-Łojasiewicz (PL) condition and the other player has a general nonconvex structure. More recently, [111] studied two-sided PL min-max games and proposed a variance-reduced strategy for solving them. While almost all existing efforts focus on smooth min-max problems, in this work we study nondifferentiable, nonconvex-strongly-concave and nonconvex-concave games and propose an algorithm for computing their first-order Nash equilibria.

3.1.1 Problem Formulation

Consider the min-max zero-sum game
$$\min_{\boldsymbol{\theta}\in\Theta}\ \max_{\boldsymbol{\alpha}\in\mathcal{A}}\ \Big(f(\boldsymbol{\theta},\boldsymbol{\alpha}) \triangleq h(\boldsymbol{\theta},\boldsymbol{\alpha}) - p(\boldsymbol{\alpha}) + q(\boldsymbol{\theta})\Big), \qquad (3.4)$$
where we assume that the constraint sets and the objective function satisfy the following assumptions throughout the paper.

Assumption 1. The sets $\Theta\subseteq\mathbb{R}^{d_\theta}$ and $\mathcal{A}\subseteq\mathbb{R}^{d_\alpha}$ are convex and compact. Moreover, there exist two balls of radius $R$ that contain the feasible sets $\mathcal{A}$ and $\Theta$, respectively.

Assumption 2. The function $h(\boldsymbol{\theta},\boldsymbol{\alpha})$ is continuously differentiable; $p(\cdot)$ and $q(\cdot)$ are convex and (potentially) non-differentiable; $p(\cdot)$ is $L_p$-Lipschitz continuous and $q(\cdot)$ is continuous.

Assumption 3. The function $h(\boldsymbol{\theta},\boldsymbol{\alpha})$ is continuously differentiable in both $\boldsymbol{\theta}$ and $\boldsymbol{\alpha}$, and there exist constants $L_{11}$, $L_{22}$, and $L_{12}$ such that for every $\boldsymbol{\alpha},\boldsymbol{\alpha}_1,\boldsymbol{\alpha}_2\in\mathcal{A}$ and $\boldsymbol{\theta},\boldsymbol{\theta}_1,\boldsymbol{\theta}_2\in\Theta$, we have
$$\|\nabla_{\boldsymbol{\theta}} h(\boldsymbol{\theta}_1,\boldsymbol{\alpha}) - \nabla_{\boldsymbol{\theta}} h(\boldsymbol{\theta}_2,\boldsymbol{\alpha})\| \le L_{11}\|\boldsymbol{\theta}_1-\boldsymbol{\theta}_2\|,$$
$$\|\nabla_{\boldsymbol{\alpha}} h(\boldsymbol{\theta},\boldsymbol{\alpha}_1) - \nabla_{\boldsymbol{\alpha}} h(\boldsymbol{\theta},\boldsymbol{\alpha}_2)\| \le L_{22}\|\boldsymbol{\alpha}_1-\boldsymbol{\alpha}_2\|,$$
$$\|\nabla_{\boldsymbol{\alpha}} h(\boldsymbol{\theta}_1,\boldsymbol{\alpha}) - \nabla_{\boldsymbol{\alpha}} h(\boldsymbol{\theta}_2,\boldsymbol{\alpha})\| \le L_{12}\|\boldsymbol{\theta}_1-\boldsymbol{\theta}_2\|,$$
$$\|\nabla_{\boldsymbol{\theta}} h(\boldsymbol{\theta},\boldsymbol{\alpha}_1) - \nabla_{\boldsymbol{\theta}} h(\boldsymbol{\theta},\boldsymbol{\alpha}_2)\| \le L_{12}\|\boldsymbol{\alpha}_1-\boldsymbol{\alpha}_2\|.$$

To proceed, let us first define some preliminary concepts.

Definition 1 (Directional Derivative). Let $\psi:\mathbb{R}^n\to\mathbb{R}$ and $\bar{\mathbf{x}}\in\text{dom}(\psi)$. The directional derivative of $\psi$ at the point $\bar{\mathbf{x}}$ along the direction $\mathbf{d}$ is defined as
$$\psi'(\bar{\mathbf{x}};\mathbf{d}) = \lim_{\tau\downarrow 0} \frac{\psi(\bar{\mathbf{x}}+\tau\mathbf{d}) - \psi(\bar{\mathbf{x}})}{\tau}.$$
We say that $\psi$ is directionally differentiable at $\bar{\mathbf{x}}$ if the above limit exists for all $\mathbf{d}\in\mathbb{R}^n$. It can be shown that any convex function is directionally differentiable.

Definition 2 (FNE). A point $(\boldsymbol{\theta}^*,\boldsymbol{\alpha}^*)\in\Theta\times\mathcal{A}$ is a first-order Nash equilibrium (FNE) of Game (3.4) if
$$f'_{\boldsymbol{\theta}}(\boldsymbol{\theta}^*,\boldsymbol{\alpha}^*;\boldsymbol{\theta}-\boldsymbol{\theta}^*) \ge 0 \quad \forall\boldsymbol{\theta}\in\Theta, \qquad f'_{\boldsymbol{\alpha}}(\boldsymbol{\theta}^*,\boldsymbol{\alpha}^*;\boldsymbol{\alpha}-\boldsymbol{\alpha}^*) \le 0 \quad \forall\boldsymbol{\alpha}\in\mathcal{A};$$
or, equivalently, if
$$\langle\nabla_{\boldsymbol{\theta}} h(\boldsymbol{\theta}^*,\boldsymbol{\alpha}^*),\, \boldsymbol{\theta}-\boldsymbol{\theta}^*\rangle + q(\boldsymbol{\theta}) - q(\boldsymbol{\theta}^*) + \frac{M}{2}\|\boldsymbol{\theta}-\boldsymbol{\theta}^*\|^2 \ge 0,$$
$$\langle\nabla_{\boldsymbol{\alpha}} h(\boldsymbol{\theta}^*,\boldsymbol{\alpha}^*),\, \boldsymbol{\alpha}-\boldsymbol{\alpha}^*\rangle - p(\boldsymbol{\alpha}) + p(\boldsymbol{\alpha}^*) - \frac{M}{2}\|\boldsymbol{\alpha}-\boldsymbol{\alpha}^*\|^2 \le 0,$$
for all $\boldsymbol{\theta}\in\Theta$ and $\boldsymbol{\alpha}\in\mathcal{A}$, and all $M>0$.

This definition implies that, at a first-order Nash equilibrium point, each player satisfies the first-order necessary optimality condition of its own objective when the other player's strategy is fixed. This is equivalent to saying that we have found a solution to the corresponding variational inequality [112]. Moreover, in the unconstrained smooth case where $\Theta=\mathbb{R}^{d_\theta}$, $\mathcal{A}=\mathbb{R}^{d_\alpha}$, and $p\equiv q\equiv 0$, this definition reduces to the standard, widely used conditions $\nabla_{\boldsymbol{\alpha}} h(\boldsymbol{\theta}^*,\boldsymbol{\alpha}^*)=0$ and $\nabla_{\boldsymbol{\theta}} h(\boldsymbol{\theta}^*,\boldsymbol{\alpha}^*)=0$. In practice, we use iterative methods for solving such games, and it is natural to evaluate the performance of an algorithm based on its efficiency in finding an approximate-FNE point. To this end, let us define the concept of an approximate-FNE point.

Definition 3 (Approximate-FNE). A point $(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\alpha}})$ is said to be an $\epsilon$-first-order Nash equilibrium ($\epsilon$-FNE) of Game (3.4) if $\mathcal{X}(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\alpha}}) \le \epsilon^2$ and $\mathcal{Y}(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\alpha}}) \le \epsilon^2$, where
$$\mathcal{X}(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\alpha}}) \triangleq -2L_{11}\min_{\boldsymbol{\theta}\in\Theta}\Big[\langle\nabla_{\boldsymbol{\theta}} h(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\alpha}}),\, \boldsymbol{\theta}-\bar{\boldsymbol{\theta}}\rangle + q(\boldsymbol{\theta}) - q(\bar{\boldsymbol{\theta}}) + \frac{L_{11}}{2}\|\boldsymbol{\theta}-\bar{\boldsymbol{\theta}}\|^2\Big],$$
$$\mathcal{Y}(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\alpha}}) \triangleq 2L_{22}\max_{\boldsymbol{\alpha}\in\mathcal{A}}\Big[\langle\nabla_{\boldsymbol{\alpha}} h(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\alpha}}),\, \boldsymbol{\alpha}-\bar{\boldsymbol{\alpha}}\rangle - p(\boldsymbol{\alpha}) + p(\bar{\boldsymbol{\alpha}}) - \frac{L_{22}}{2}\|\boldsymbol{\alpha}-\bar{\boldsymbol{\alpha}}\|^2\Big].$$
In the unconstrained smooth scenario where $\Theta=\mathbb{R}^{d_\theta}$, $\mathcal{A}=\mathbb{R}^{d_\alpha}$, and $p\equiv q\equiv 0$, the above $\epsilon$-FNE definition reduces to $\|\nabla_{\boldsymbol{\alpha}} h(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\alpha}})\|\le\epsilon$ and $\|\nabla_{\boldsymbol{\theta}} h(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\alpha}})\|\le\epsilon$.

Remark 1. The above definition of $\epsilon$-FNE is stronger than the $\epsilon$-stationarity concept defined via the proximal gradient norm in the literature (see, e.g., [113]). Details of this remark are discussed in Appendix A.2.

Remark 2 (rephrased from Proposition 4.2 in [114]). For the min-max Game (3.4), under Assumptions 1, 2, and 3, an FNE always exists. Moreover, it is easy to show that $\mathcal{X}(\cdot,\cdot)$ and $\mathcal{Y}(\cdot,\cdot)$ are continuous functions of their arguments. Hence, an $\epsilon$-FNE exists for every $\epsilon\ge 0$.

In what follows, we consider two different scenarios for finding $\epsilon$-FNE points.
In the first scenario, we assume that $h(\boldsymbol{\theta},\boldsymbol{\alpha})$ is strongly concave in $\boldsymbol{\alpha}$ for every given $\boldsymbol{\theta}$ and develop a first-order algorithm for finding an $\epsilon$-FNE. Then, in the second scenario, we extend our result to the case where $h(\boldsymbol{\theta},\boldsymbol{\alpha})$ is concave (but not strongly concave) in $\boldsymbol{\alpha}$ for every given $\boldsymbol{\theta}$.

3.1.2 Nonconvex-Strongly-Concave Games

In this section, we study the zero-sum Game (3.4) in the case where the function $h(\boldsymbol{\theta},\boldsymbol{\alpha})$ is $\sigma$-strongly concave in $\boldsymbol{\alpha}$ for every given value of $\boldsymbol{\theta}$. To understand the idea behind the algorithm, define the auxiliary function
$$g(\boldsymbol{\theta}) \triangleq \max_{\boldsymbol{\alpha}\in\mathcal{A}}\ h(\boldsymbol{\theta},\boldsymbol{\alpha}) - p(\boldsymbol{\alpha}).$$
A "conceptual" algorithm for solving the min-max optimization problem (3.4) is to minimize the function $g(\boldsymbol{\theta}) + q(\boldsymbol{\theta})$ using iterative descent procedures. First, notice that, by the following lemma, the strong concavity assumption implies the differentiability of $g(\boldsymbol{\theta})$.

Lemma 3. Let $g(\boldsymbol{\theta}) = \max_{\boldsymbol{\alpha}\in\mathcal{A}} h(\boldsymbol{\theta},\boldsymbol{\alpha}) - p(\boldsymbol{\alpha})$, where $h(\boldsymbol{\theta},\boldsymbol{\alpha})$ is $\sigma$-strongly concave in $\boldsymbol{\alpha}$ for any given $\boldsymbol{\theta}$. Then, under Assumption 3, the function $g(\boldsymbol{\theta})$ is differentiable. Moreover, its gradient is $L_g$-Lipschitz continuous, i.e.,
$$\|\nabla g(\boldsymbol{\theta}_1) - \nabla g(\boldsymbol{\theta}_2)\| \le L_g\|\boldsymbol{\theta}_1-\boldsymbol{\theta}_2\|, \quad \text{where} \ L_g = L_{11} + \frac{L_{12}^2}{\sigma}.$$
The proof of this lemma can be found in Appendix A.2. The smoothness of the function $g(\boldsymbol{\theta})$ suggests the natural multi-step proximal method in Algorithm 5 for solving the min-max optimization problem (3.4). This algorithm performs two major steps in each iteration. The first major step, marked as "Accelerated Proximal Gradient Ascent," runs multiple iterations of accelerated proximal gradient ascent to estimate the solution of the inner maximization problem; in other words, this step finds a point $\boldsymbol{\alpha}_{t+1} \approx \operatorname{argmax}_{\boldsymbol{\alpha}\in\mathcal{A}} f(\boldsymbol{\theta}_t,\boldsymbol{\alpha})$. The output of this step is then used to compute the approximate proximal gradient of the function $g(\boldsymbol{\theta})$ in the second step. In summary, the algorithm treats the min-max problem as a classical minimization problem and aims to minimize $g(\boldsymbol{\theta}) + q(\boldsymbol{\theta})$ using the (proximal) gradient descent method. The gradient of $g(\boldsymbol{\theta})$ can be calculated via the classical Danskin theorem [115, 116], restated below.

Theorem 3 (rephrased from [115, 116]). Let $\mathcal{V}\subset\mathbb{R}^m$ be a compact set and let $J(\mathbf{u},\boldsymbol{\nu}):\mathbb{R}^n\times\mathcal{V}\to\mathbb{R}$ be differentiable with respect to $\mathbf{u}$. Let $\bar{J}(\mathbf{u}) = \max_{\boldsymbol{\nu}\in\mathcal{V}} J(\mathbf{u},\boldsymbol{\nu})$ and assume $\hat{\mathcal{V}}(\mathbf{u}) = \{\boldsymbol{\nu}\in\mathcal{V} \mid J(\mathbf{u},\boldsymbol{\nu}) = \bar{J}(\mathbf{u})\}$ is a singleton for any given $\mathbf{u}$. Then $\bar{J}(\mathbf{u})$ is differentiable and $\nabla_{\mathbf{u}}\bar{J}(\mathbf{u}) = \nabla_{\mathbf{u}} J(\mathbf{u},\hat{\boldsymbol{\nu}})$ with $\hat{\boldsymbol{\nu}}\in\hat{\mathcal{V}}(\mathbf{u})$.

According to the above theorem, the proximal gradient descent update rule on $g(\boldsymbol{\theta})$ is given by
$$\boldsymbol{\theta}_{t+1} = \operatorname*{argmin}_{\boldsymbol{\theta}\in\Theta}\ q(\boldsymbol{\theta}) + \langle\nabla_{\boldsymbol{\theta}} h(\boldsymbol{\theta}_t,\boldsymbol{\alpha}_{t+1}),\, \boldsymbol{\theta}-\boldsymbol{\theta}_t\rangle + \frac{L_g}{2}\|\boldsymbol{\theta}-\boldsymbol{\theta}_t\|^2.$$
The two main proximal gradient update operators used in Algorithm 5 are defined as
$$\rho_{\boldsymbol{\alpha}}(\tilde{\boldsymbol{\theta}},\tilde{\boldsymbol{\alpha}},\gamma_1) = \operatorname*{argmax}_{\boldsymbol{\alpha}\in\mathcal{A}}\ \langle\nabla_{\boldsymbol{\alpha}} h(\tilde{\boldsymbol{\theta}},\tilde{\boldsymbol{\alpha}}),\, \boldsymbol{\alpha}-\tilde{\boldsymbol{\alpha}}\rangle - \frac{\gamma_1}{2}\|\boldsymbol{\alpha}-\tilde{\boldsymbol{\alpha}}\|^2 - p(\boldsymbol{\alpha})$$
and
$$\rho_{\boldsymbol{\theta}}(\tilde{\boldsymbol{\theta}},\tilde{\boldsymbol{\alpha}},\gamma_2) = \operatorname*{argmin}_{\boldsymbol{\theta}\in\Theta}\ \langle\nabla_{\boldsymbol{\theta}} h(\tilde{\boldsymbol{\theta}},\tilde{\boldsymbol{\alpha}}),\, \boldsymbol{\theta}-\tilde{\boldsymbol{\theta}}\rangle + \frac{\gamma_2}{2}\|\boldsymbol{\theta}-\tilde{\boldsymbol{\theta}}\|^2 + q(\boldsymbol{\theta}).$$
The following theorem establishes the rate of convergence of Algorithm 5 to an $\epsilon$-FNE. A more detailed statement of the theorem (including its constants) is presented in Appendix A.2.
Theorem 4. Consider the min-max zero-sum game
$$\min_{\boldsymbol{\theta}\in\Theta}\ \max_{\boldsymbol{\alpha}\in\mathcal{A}}\ \big(f(\boldsymbol{\theta},\boldsymbol{\alpha}) = h(\boldsymbol{\theta},\boldsymbol{\alpha}) - p(\boldsymbol{\alpha}) + q(\boldsymbol{\theta})\big),$$
where the function $h(\boldsymbol{\theta},\boldsymbol{\alpha})$ is $\sigma$-strongly concave in $\boldsymbol{\alpha}$. Let $D = g(\boldsymbol{\theta}_0) + q(\boldsymbol{\theta}_0) - \min_{\boldsymbol{\theta}\in\Theta}\big(g(\boldsymbol{\theta}) + q(\boldsymbol{\theta})\big)$, where $g(\boldsymbol{\theta}) = \max_{\boldsymbol{\alpha}\in\mathcal{A}} h(\boldsymbol{\theta},\boldsymbol{\alpha}) - p(\boldsymbol{\alpha})$, and let $L_g = L_{11} + \frac{L_{12}^2}{\sigma}$ be the Lipschitz constant of the gradient of $g$. In Algorithm 5, if we set $\eta_1 = \frac{1}{L_{22}}$, $\eta_2 = \frac{1}{L_g}$, $N = \sqrt{8L_{22}/\sigma} - 1$, and choose $K$ and $T$ large enough that
$$T \ge N_T(\epsilon) \triangleq \frac{4L_g D}{\epsilon^2} \quad \text{and} \quad K \ge N_K(\epsilon) \triangleq 2\sqrt{8\kappa}\left(C + 2\log\frac{1}{\epsilon} + \frac{1}{2}\log\frac{2\Delta}{\sigma}\right),$$
where $C = \max\big\{2\log 2 + \log(L_g L_{12} R),\ \log L_{22} + \log(2L_{22}R + g_\infty + L_p + R)\big\}$ and $\kappa = \frac{L_{22}}{\sigma}$, then there exists an iteration $t\in\{0,\dots,T\}$ such that $(\boldsymbol{\theta}_t,\boldsymbol{\alpha}_{t+1})$ is an $\epsilon$-FNE of (3.4).

The above theorem guarantees that the proposed algorithm can find an $\epsilon$-FNE of the min-max problem. The order complexity of the proposed method is given by the following corollary.

Corollary 1. By Theorem 4, to find an $\epsilon$-FNE of Game (3.4), Algorithm 5 requires $\mathcal{O}(\epsilon^{-2}\log(\epsilon^{-1}))$ gradient evaluations of the objective function.

Algorithm 5 Multi-step Accelerated Proximal Gradient Descent-Ascent
1: Input: $K$, $T$, $N$, $\eta_1$, $\eta_2$, $\boldsymbol{\alpha}_0\in\mathcal{A}$, and $\boldsymbol{\theta}_0\in\Theta$.
2: for $t = 0,\dots,T-1$ do
3:   for $k = 0,\dots,\lfloor K/N\rfloor$ do   (Accelerated Proximal Gradient Ascent [108, 117])
4:     Set $\beta_1 = 1$ and $\mathbf{x}_0 = \boldsymbol{\alpha}_t$
5:     if $k = 0$ then
6:       $\mathbf{y}_1 = \mathbf{x}_0$
7:     else
8:       $\mathbf{y}_1 = \mathbf{x}_N$
9:     end if
10:     for $j = 1,2,\dots,N$ do
11:       Set $\mathbf{x}_j = \rho_{\boldsymbol{\alpha}}(\boldsymbol{\theta}_t, \mathbf{y}_j, \eta_1)$
12:       Set $\beta_{j+1} = \frac{1 + \sqrt{1 + 4\beta_j^2}}{2}$
13:       $\mathbf{y}_{j+1} = \mathbf{x}_j + \frac{\beta_j - 1}{\beta_{j+1}}(\mathbf{x}_j - \mathbf{x}_{j-1})$
14:     end for
15:   end for
16:   $\boldsymbol{\alpha}_{t+1} = \mathbf{x}_N$
17:   $\boldsymbol{\theta}_{t+1} = \rho_{\boldsymbol{\theta}}(\boldsymbol{\theta}_t, \boldsymbol{\alpha}_{t+1}, \eta_2)$
18: end for
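The sketch below illustrates Algorithm 5 in simplified form. We assume the smooth case $p\equiv q\equiv 0$ with ball-shaped feasible sets, so that both proximal maps $\rho_{\boldsymbol{\alpha}}$ and $\rho_{\boldsymbol{\theta}}$ reduce to projected gradient steps; all function and variable names are our own illustrative choices, and the periodic momentum restart stands in for the outer k-loop of the pseudocode.

    import numpy as np

    def proj_ball(v, R):
        """Euclidean projection onto the origin-centered ball of radius R."""
        n = np.linalg.norm(v)
        return v if n <= R else v * (R / n)

    def multistep_apgda(grad_theta, grad_alpha, theta, alpha,
                        eta1, eta2, R, T=200, K=50, N=10):
        """Simplified sketch of Algorithm 5 with p = q = 0 and ball constraints."""
        for _ in range(T):
            # inner loop: accelerated projected gradient ascent in alpha,
            # with the momentum restarted every N iterations
            x = alpha
            y, beta = x, 1.0
            for j in range(K):
                if j > 0 and j % N == 0:
                    y, beta = x, 1.0            # restart (outer k-loop in Algorithm 5)
                beta_next = (1 + np.sqrt(1 + 4 * beta**2)) / 2
                x_new = proj_ball(y + eta1 * grad_alpha(theta, y), R)
                y = x_new + ((beta - 1) / beta_next) * (x_new - x)
                x, beta = x_new, beta_next
            alpha = x
            # outer step: one projected gradient-descent step in theta
            theta = proj_ball(theta - eta2 * grad_theta(theta, alpha), R)
        return theta, alpha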
3.1.3 Nonconvex-Concave Games

In this section, we consider the min-max problem (3.4) under the assumption that $h(\boldsymbol{\theta},\boldsymbol{\alpha})$ is concave (but not strongly concave) in $\boldsymbol{\alpha}$ for any given value of $\boldsymbol{\theta}$. In this case, a direct extension of Algorithm 5 will not work, since the function $g(\boldsymbol{\theta})$ might be non-differentiable. To overcome this issue, we make the function $f(\boldsymbol{\theta},\boldsymbol{\alpha})$ strongly concave by adding a "negligible" regularizer. More specifically, we define
$$f_\lambda(\boldsymbol{\theta},\boldsymbol{\alpha}) = f(\boldsymbol{\theta},\boldsymbol{\alpha}) - \frac{\lambda}{2}\|\boldsymbol{\alpha}-\hat{\boldsymbol{\alpha}}\|^2 \qquad (3.5)$$
for some $\hat{\boldsymbol{\alpha}}\in\mathcal{A}$. We then apply Algorithm 5 to the modified nonconvex-strongly-concave game
$$\min_{\boldsymbol{\theta}\in\Theta}\ \max_{\boldsymbol{\alpha}\in\mathcal{A}}\ f_\lambda(\boldsymbol{\theta},\boldsymbol{\alpha}). \qquad (3.6)$$
It can be shown that by choosing $\lambda = \frac{\epsilon}{2\sqrt{2}R}$, applying Algorithm 5 to the modified Game (3.6) yields an $\epsilon$-FNE of the original problem (3.4). More specifically, with a proper choice of parameters, the following theorem establishes that the proposed method converges to an $\epsilon$-FNE point of the original problem.

Theorem 5. Consider the min-max zero-sum game
$$\min_{\boldsymbol{\theta}\in\Theta}\ \max_{\boldsymbol{\alpha}\in\mathcal{A}}\ \Big(f(\boldsymbol{\theta},\boldsymbol{\alpha}) = h(\boldsymbol{\theta},\boldsymbol{\alpha}) - p(\boldsymbol{\alpha}) + q(\boldsymbol{\theta})\Big),$$
where the function $h(\boldsymbol{\theta},\boldsymbol{\alpha})$ is concave in $\boldsymbol{\alpha}$. Define $f_\lambda(\boldsymbol{\theta},\boldsymbol{\alpha}) = f(\boldsymbol{\theta},\boldsymbol{\alpha}) - \frac{\lambda}{2}\|\boldsymbol{\alpha}-\hat{\boldsymbol{\alpha}}\|^2$ and $g_\lambda(\boldsymbol{\theta}) = \max_{\boldsymbol{\alpha}\in\mathcal{A}} h(\boldsymbol{\theta},\boldsymbol{\alpha}) - \frac{\lambda}{2}\|\boldsymbol{\alpha}-\hat{\boldsymbol{\alpha}}\|^2 - p(\boldsymbol{\alpha})$ for some $\hat{\boldsymbol{\alpha}}\in\mathcal{A}$. Let $D = g_\lambda(\boldsymbol{\theta}_0) + q(\boldsymbol{\theta}_0) - \min_{\boldsymbol{\theta}\in\Theta}\big(g_\lambda(\boldsymbol{\theta}) + q(\boldsymbol{\theta})\big)$ and let $L_{g_\lambda} = L_{11} + \frac{L_{12}^2}{\lambda}$ be the Lipschitz constant of the gradient of $g_\lambda$. In Algorithm 5, if we set $\eta_1 = \frac{1}{L_{22}+\lambda}$, $\eta_2 = \frac{1}{L_{g_\lambda}}$, $N = \sqrt{\frac{8(L_{22}+\lambda)}{\lambda}} - 1$, $\lambda = \min\{L_{22}, \frac{\epsilon}{2\sqrt{2}R}\}$, and choose $K$ and $T$ large enough that
$$T \ge N_T(\epsilon) \triangleq \frac{8L_{g_\lambda}D}{\epsilon^2} \quad \text{and} \quad K \ge N_K(\epsilon) \triangleq 2\sqrt{8\kappa}\left(C + 2\log\frac{2}{\epsilon} + \frac{1}{2}\log\frac{2\Delta}{\lambda}\right),$$
where $C = \max\big\{2\log 2 + \log(L_{g_\lambda}L_{12}R),\ \log(L_{22}+\lambda) + \log\big(2(L_{22}+\lambda)R + g_\infty^\lambda + L_p + R\big)\big\}$, $\kappa = \frac{L_{22}+\lambda}{\lambda}$, and $g_\infty^\lambda = \max_{\boldsymbol{\alpha}\in\mathcal{A}}\|\nabla_{\boldsymbol{\alpha}} h(\boldsymbol{\theta}_t,\boldsymbol{\alpha})\| + \lambda R$, then there exists $t\in\{0,\dots,T\}$ such that $(\boldsymbol{\theta}_t,\boldsymbol{\alpha}_{t+1})$ is an $\epsilon$-FNE of the original problem (3.4).

The details of the proof of the above theorem can be found in Appendix A.2.

Corollary 2. By Theorem 5, Algorithm 5 requires $\mathcal{O}(\epsilon^{-3.5}\log(\epsilon^{-1}))$ gradient evaluations to find an $\epsilon$-FNE of Game (3.4).

3.1.4 Numerical Experiments

In this section, we evaluate the performance of the proposed algorithm on the problem of attacking the LASSO estimator. In other words, our goal is to find a small perturbation of the observation matrix that worsens the performance of the LASSO estimator on the training set. This attack problem can be formulated as
$$\max_{\mathbf{A}\in\mathcal{B}(\hat{\mathbf{A}},\Delta)}\ \min_{\mathbf{x}}\ \|\mathbf{A}\mathbf{x}-\mathbf{b}\|_2^2 + \xi\|\mathbf{x}\|_1, \qquad (3.7)$$
where $\mathcal{B}(\hat{\mathbf{A}},\Delta) = \{\mathbf{A} \mid \|\mathbf{A}-\hat{\mathbf{A}}\|_F^2 \le \Delta\}$ and $\mathbf{A}\in\mathbb{R}^{m\times n}$. We set $m = 100$, $n = 500$, $\xi = 1$, and $\Delta = 10^{-1}$. In our experiments, we first generate a "ground-truth" vector $\mathbf{x}^*$ with sparsity level $s = 25$, in which the locations of the non-zero elements are chosen randomly and their values are sampled from a standard Gaussian distribution. Then we generate the elements of the matrix $\mathbf{A}$ from a standard Gaussian distribution. Finally, we set $\mathbf{b} = \mathbf{A}\mathbf{x}^* + \mathbf{e}$, where $\mathbf{e}\sim\mathcal{N}(0, 0.001\mathbf{I})$. We compare the performance of the proposed algorithm with the popular subgradient descent-ascent and proximal gradient descent-ascent algorithms. In the subgradient descent-ascent algorithm, at each iteration we take one subgradient descent step with respect to $\mathbf{x}$ followed by one subgradient ascent step with respect to $\mathbf{A}$. Similarly, each iteration of the proximal gradient descent-ascent algorithm consists of one proximal gradient descent step with respect to $\mathbf{x}$ and one proximal gradient ascent step with respect to $\mathbf{A}$. For a fair comparison, all of the studied algorithms are initialized at the same random points in Figure 3.2.

Figure 3.2: (left) Convergence behavior of different algorithms in terms of the objective value, where the objective value at iteration $t$ is defined as $g(\mathbf{A}_t) \triangleq \min_{\mathbf{x}}\|\mathbf{A}_t\mathbf{x}-\mathbf{b}\|_2^2 + \xi\|\mathbf{x}\|_1$; (right) convergence behavior in terms of the stationarity measures $\mathcal{X}(\mathbf{A}_t,\mathbf{x}_{t+1})$ and $\mathcal{Y}(\mathbf{A}_t,\mathbf{x}_{t+1})$ (logarithmic scale). Algorithms compared: Proposed Algorithm (PA), Subgradient Descent-Ascent (SDA), and Proximal Descent-Ascent (PDA).

The above figure might not give a fair comparison, since each step of the proposed algorithm is computationally more expensive than those of the two benchmark methods. For a better comparison, we evaluate the performance of the algorithms in terms of the time required for convergence. Table 3.1 summarizes the average time required by the different algorithms to find a point $(\bar{\mathbf{A}},\bar{\mathbf{x}})$ satisfying $\mathcal{X}(\bar{\mathbf{A}},\bar{\mathbf{x}})\le 0.1$ and $\mathcal{Y}(\bar{\mathbf{A}},\bar{\mathbf{x}})\le 0.1$, averaged over 100 different experiments. As can be seen in the table, the proposed method on average converges an order of magnitude faster than the other two algorithms.

Table 3.1: Average computational time of different algorithms for the adversarial attack against the LASSO estimator.
Algorithm | PA | SDA | PDA
Average time (seconds) | 0.0268 | 3.5016 | 0.5603
Standard deviation (seconds) | 0.0538 | 7.0137 | 1.1339

3.2 Nonconvex Stochastic Games under Minty VI Condition

Stochastic first-order methods are of core practical importance for solving numerous optimization problems, including training deep learning networks (DLNs).
Standard stochastic gradient descent (SGD) has become a widely used technique for the latter task. However, its convergence crucially depends on tuning and updating the learning rate over the iterations in order to control the variance of the stochastic search directions, especially for nonconvex functions [118]. To alleviate these issues, several improved variants of SGD that automatically update the search directions and learning rates using a metric constructed from the history of iterates have been proposed, including adaptive methods [119–122] and adaptive momentum methods [35, 36]. In particular, Adam, which belongs to the second category, enjoys the dual advantages of variance adaptation and momentum direction [123, 124] and hence is a popular algorithm for training DLNs. There is a large body of literature on the theoretical and empirical benefits of adaptive momentum optimization algorithms in convex [35, 36], smooth nonconvex [37–39], and non-smooth nonconvex settings [40]. [43] gives an analysis of an optimistic adaptive method that uses Adagrad [121, 122] for nonconvex min-max optimization. However, Adagrad-type methods are suited to sparse convex settings, and their performance deteriorates in (dense) nonconvex optimization problems [38]. These empirical findings necessitate the use of adaptive momentum methods that incorporate knowledge of past iterations. It is important to note that all of these methods are designed for classical minimization problems, whereas training DLNs such as Generative Adversarial Networks (GANs) requires solving a general class of min-max optimization problems [6, 125].

The goal of this section is to generalize adaptive momentum methods to a general class of nonconvex-nonconcave min-max problems. We develop an adaptive algorithm for solving min-max saddle point games and theoretically analyze its convergence rate; the performance of the developed algorithm is assessed on training GANs. In the previous section, we analyzed and studied deterministic min-max games. However, many practical problems, such as training GANs [6] and defending against adversarial attacks on neural networks [126], require solving stochastic min-max problems. In this section, we study such stochastic problems in more detail. Let us first define the problem.

3.2.1 Problem Formulation

Consider the stochastic min-max saddle point problem
$$\min_{\boldsymbol{\theta}}\ \max_{\boldsymbol{\alpha}}\ F(\boldsymbol{\theta},\boldsymbol{\alpha}) = \mathbb{E}_{\boldsymbol{\xi}\sim\mathcal{D}}\big[f(\boldsymbol{\theta},\boldsymbol{\alpha};\boldsymbol{\xi})\big], \qquad (3.8)$$
where $\boldsymbol{\theta}\in\mathbb{R}^{p_1}$, $\boldsymbol{\alpha}\in\mathbb{R}^{p_2}$, $\boldsymbol{\xi}$ is a random variable drawn from an unknown distribution $\mathcal{D}$, and $F(\boldsymbol{\theta},\boldsymbol{\alpha})$ is a nonconvex-nonconcave function, i.e., it is nonconvex in $\boldsymbol{\theta}$ for any given $\boldsymbol{\alpha}$ and nonconcave in $\boldsymbol{\alpha}$ for any given $\boldsymbol{\theta}$. Next, we introduce the necessary notation and definitions. Throughout, $\mathbf{y} := (\boldsymbol{\theta},\boldsymbol{\alpha})\in\mathbb{R}^{p_1}\times\mathbb{R}^{p_2}$, and we denote the objective function of Game (3.8) and its random realization by $F(\mathbf{y})$ and $f(\mathbf{y};\boldsymbol{\xi})$, respectively. Furthermore, we define $\nabla F(\mathbf{y}) = [\nabla_{\boldsymbol{\theta}} F(\boldsymbol{\theta},\boldsymbol{\alpha}),\, -\nabla_{\boldsymbol{\alpha}} F(\boldsymbol{\theta},\boldsymbol{\alpha})]$ and $\nabla f(\mathbf{y};\boldsymbol{\xi}) = [\nabla_{\boldsymbol{\theta}} f(\boldsymbol{\theta},\boldsymbol{\alpha};\boldsymbol{\xi}),\, -\nabla_{\boldsymbol{\alpha}} f(\boldsymbol{\theta},\boldsymbol{\alpha};\boldsymbol{\xi})]$ to represent the corresponding gradient and stochastic gradient of the objective function.

Definition 1 (Nash Equilibrium). A point $(\boldsymbol{\theta}^*,\boldsymbol{\alpha}^*)\in\mathbb{R}^{p_1}\times\mathbb{R}^{p_2}$ is a Nash equilibrium of Game (3.8) if
$$F(\boldsymbol{\theta}^*,\boldsymbol{\alpha}) \le F(\boldsymbol{\theta}^*,\boldsymbol{\alpha}^*) \le F(\boldsymbol{\theta},\boldsymbol{\alpha}^*), \quad \forall(\boldsymbol{\theta},\boldsymbol{\alpha})\in\mathbb{R}^{p_1}\times\mathbb{R}^{p_2}.$$
This definition implies that $\boldsymbol{\theta}^*$ is a global minimum of $F(\cdot,\boldsymbol{\alpha}^*)$ and $\boldsymbol{\alpha}^*$ is a global maximum of $F(\boldsymbol{\theta}^*,\cdot)$. In the convex-concave regime, where $F(\boldsymbol{\theta},\boldsymbol{\alpha})$ is convex in $\boldsymbol{\theta}$ for any given $\boldsymbol{\alpha}$ and concave in $\boldsymbol{\alpha}$ for any given $\boldsymbol{\theta}$, a Nash equilibrium always exists [127], and there are several algorithms for identifying it [34, 128]. However, computing a Nash equilibrium point is NP-hard in general [127, 129], and it may not even exist [19]. As a result, since we consider the general nonconvex-nonconcave regime, we settle for computing a first-order Nash equilibrium point [47, 130], defined next.

Definition 2 (First-Order Nash Equilibrium (FNE)). A point $\mathbf{y}^*\in\mathbb{R}^{p_1}\times\mathbb{R}^{p_2}$ is a first-order Nash equilibrium point of Game (3.8) if $\nabla F(\mathbf{y}^*) = 0$.

Note that at an FNE point, each player satisfies the first-order optimality condition of its own objective function when the strategy of the other player is fixed [114, 131]. In practice, iterative algorithms are used for computing an FNE of a stochastic problem. As a result, the performance of different iterative algorithms is evaluated based on the following approximate stochastic FNE definition.

Definition 3 ($\epsilon$-Stochastic First-Order Nash Equilibrium (SFNE)). A random variable $\mathbf{y}^*$ is an approximate SFNE ($\epsilon$-SFNE) point of Game (3.8) if $\mathbb{E}\big[\|\nabla F(\mathbf{y}^*)\|^2\big] \le \epsilon^2$, where the expectation is taken over the distribution of the random variable $\mathbf{y}^*$.

The randomness of the variable $\mathbf{y}^*$ in Definition 3 comes from the use of iterative algorithms that have access to stochastic gradients of the objective function (see, e.g., Algorithm Adam$^3$ below). The objective of this work is to find an $\epsilon$-SFNE point of Game (3.8) using an iterative method based on adaptive momentum.

3.2.2 ADAM: from Minimization to Min-Max Problems

The proposed ADAptive Momentum Min-Max (Adam$^3$) algorithm comes with convergence guarantees for solving the general class of nonconvex-nonconcave saddle point games defined in (3.8). It is obtained by integrating AMSGrad [36], a modified version of Adam [35], with a stochastic extra-gradient method [45].

Algorithm 6 ADAptive Momentum Min-Max (Adam$^3$)
1: Input: $\{\beta_{1,k}\}_{k=1}^N$, $\beta_2, \beta_3\in[0,1)$, $m\in\mathbb{N}$, and $\eta\in\mathbb{R}_+$
2: Initialize: $\mathbf{z}_0 = \mathbf{x}_0 = \mathbf{m}_0 = \mathbf{v}_0 = \mathbf{d}_0 = \mathbf{0}_d$.
3: for $k = 1 : N$ do
4:   $\mathbf{z}_k = \mathbf{x}_{k-1} - \eta\,\mathbf{d}_{k-1}$;
5:   Draw $\boldsymbol{\xi}_k = (\boldsymbol{\xi}_k^1,\dots,\boldsymbol{\xi}_k^m)$ from $\mathcal{D}$, and set $\hat{\mathbf{g}}_k = \frac{1}{m}\sum_{i=1}^m \nabla f(\mathbf{z}_k;\boldsymbol{\xi}_k^i)$;
6:   $\mathbf{m}_k = \beta_{1,k}\,\mathbf{m}_{k-1} + (1-\beta_{1,k})\,\hat{\mathbf{g}}_k$;
7:   $\mathbf{v}_k = \beta_2\,\mathbf{v}_{k-1} + (1-\beta_2)\,\hat{\mathbf{g}}_k\odot\hat{\mathbf{g}}_k$;
8:   $\tilde{\mathbf{v}}_k = \beta_3\,\tilde{\mathbf{v}}_{k-1} + (1-\beta_3)\max(\tilde{\mathbf{v}}_{k-1}, \mathbf{v}_k)$;
9:   $\mathbf{d}_k = \tilde{\mathbf{v}}_k^{-1/2}\odot\mathbf{m}_k$;
10:  $\mathbf{x}_k = \mathbf{x}_{k-1} - \eta\,\mathbf{d}_k$;
11: end for
($\odot$: element-wise vector multiplication)

As seen in Algorithm 6, Adam$^3$ generates two sequences $\mathbf{x}_k$ and $\mathbf{z}_k$, where $\mathbf{x}_k$ is an ancillary sequence and the stochastic gradient is computed only along the sequence of $\mathbf{z}_k$'s, using a mini-batch of size $m$, i.e., $\hat{\mathbf{g}}_k = \frac{1}{m}\sum_{i=1}^m\nabla f(\mathbf{z}_k;\boldsymbol{\xi}_k^i)$. Using a mini-batch to estimate the gradient is a commonly used approach; more details are available in [113] and the references therein. After estimating the gradient, the algorithm computes the momentum direction $\mathbf{m}_k$ as an exponential moving average of the past gradients. Then $\mathbf{m}_k$ is adaptively scaled by the square root of the exponential moving average of squared past gradients, $\tilde{\mathbf{v}}_k$.
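A compact NumPy sketch of Algorithm 6 follows. The callable `grad` and the small constant added before the square root are our own illustrative choices; the constant implements the division-by-zero safeguard mentioned in the remarks below.

    import numpy as np

    def adam3(grad, x0, eta=1e-3, beta1=0.5, beta2=0.9, beta3=0.5,
              N=1000, eps=1e-12):
        """Sketch of Algorithm 6 (Adam^3). grad(z) returns a mini-batch
        estimate of the gradient field [grad_theta F, -grad_alpha F] at z."""
        x = x0.astype(float).copy()
        m = np.zeros_like(x); v = np.zeros_like(x)
        v_tilde = np.zeros_like(x); d = np.zeros_like(x)
        for _ in range(N):
            z = x - eta * d                          # extra-gradient (look-ahead) point
            g = grad(z)
            m = beta1 * m + (1 - beta1) * g          # momentum direction
            v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
            v_tilde = beta3 * v_tilde + (1 - beta3) * np.maximum(v_tilde, v)
            d = m / np.sqrt(v_tilde + eps)           # eps guards against division by zero
            x = x - eta * d
        return x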
The following remarks about Adam$^3$ are in order:
(1) The square and maximum operators are applied element-wise. In some applications, to prevent division by zero, we may add a small positive constant to $\mathbf{v}_k$ [39]. Further, a mini-batch of size $m$ is used in each iteration to estimate the gradient's value.
(2) Adam$^3$ computes adaptive learning rates from estimates of the second moments of the gradients, similar to [39]. In particular, it uses a larger learning rate than AMSGrad, yet incorporates the intuition of slowly decaying the effect of previous gradients on the learning rate. The decay parameter $\beta_3$ is an important component of Adam$^3$: it enables establishing convergence properties similar to AMSGrad ($\beta_3 = 0$) while maintaining the efficiency of Adam.

3.2.3 Convergence Analysis

In this section, we provide non-asymptotic convergence rates for the Adam$^3$ algorithm. To do so, we make the following assumptions.

Assumption 4 (Lipschitz Gradient). The function $F$ is $L$-smooth, i.e.,
$$\|\nabla F(\mathbf{x}) - \nabla F(\mathbf{y})\| \le L\|\mathbf{x}-\mathbf{y}\|, \quad \forall\,\mathbf{x},\mathbf{y}\in\mathbb{R}^d.$$

Assumption 5 (Unbiased Gradients). For any $\boldsymbol{\xi}\in\mathcal{D}$, we have $\mathbb{E}_{\boldsymbol{\xi}\sim\mathcal{D}}[\nabla f(\mathbf{x},\boldsymbol{\xi})] = \nabla F(\mathbf{x})$.

Assumption 6 (Bounded Variance). The function $F$ has $\sigma$-bounded (local) variance, i.e., $\mathbb{E}_{\boldsymbol{\xi}\sim\mathcal{D}}\big[\|[\nabla f(\mathbf{x},\boldsymbol{\xi})]_j - [\nabla F(\mathbf{x})]_j\|^2\big] = \sigma_j^2$ for all $\mathbf{x}\in\mathbb{R}^d$ and $j\in[d]$, with $\sigma^2 = \sum_{j=1}^d \sigma_j^2$.

Assumptions 4–6 are fairly standard in the nonconvex optimization literature [118].

Assumption 7 (Bounded Gradients). The function $f(\mathbf{x},\boldsymbol{\xi})$ has a $G_\infty$-bounded gradient, i.e., for any $\mathbf{x}\in\mathbb{R}^d$ and $\boldsymbol{\xi}\in\mathcal{D}$, we have $\|\nabla f(\mathbf{x},\boldsymbol{\xi})\|_\infty \le G_\infty$.

The above assumption is slightly stronger than the assumption $\mathbb{E}\|\nabla f(\mathbf{x},\boldsymbol{\xi})\| \le G_2$ used in the analysis of SGD. Indeed, since $\|\cdot\|_\infty \le \|\cdot\| \le \sqrt{d}\,\|\cdot\|_\infty$, the latter assumption implies the former with $G_\infty = G_2$. However, $G_\infty$ can be tighter than $G_2$ by a factor of $\sqrt{d}$ when the coordinates of $\nabla f(\mathbf{x},\boldsymbol{\xi})$ are almost equal to each other. The assumption $\|\nabla f(\mathbf{x},\boldsymbol{\xi})\|_\infty \le G_\infty$ is crucial for the convergence analysis of adaptive subgradient methods in the nonconvex setting and has been widely considered in the literature [38–40, 43, 132]. We also use the following assumption, which is commonly used for analyzing stochastic nonconvex optimization algorithms [45].

Assumption 8 (Minty VI condition). There exists $\mathbf{x}^*\in\mathbb{R}^d$ such that for any $\mathbf{x}\in\mathbb{R}^d$, $\langle\mathbf{x}-\mathbf{x}^*,\, \nabla F(\mathbf{x})\rangle \ge 0$.

Assumption 8 is commonly used for solving non-monotone variational inequalities [43, 45, 133, 134]. In addition, this assumption holds in some practical nonconvex minimization problems such as learning neural networks [135].

Assumption 9. We assume that $\|\mathbf{x}^* - \mathbf{x}_k\| \le D$ for all iterates $k$ generated by Algorithm 6.

Based on the above assumptions, we have the following main theorem.

Theorem 6. Let Assumptions 4–9 hold. Then we have
$$\frac{C_0}{N}\sum_{k=1}^{N}\mathbb{E}\big\|\nabla F(\mathbf{z}_k)\big\|^2 \le \frac{C_1}{N} + \frac{C_2\,\sigma^2}{m},$$
where $C_0, C_1, C_2$ are constants defined in (41).

This theorem establishes the theoretical convergence of Adam$^3$ for smooth min-max problems. For the formal statement of this theorem, see Appendix A.2.

3.2.4 Numerical Studies

(I) A synthetic data experiment: Simultaneous Adam (S-Adam) is one of the most commonly used approaches for solving min-max problems that are formulated using deep neural networks, such as training GANs [6]. In this method, the minimization and maximization parameters are updated simultaneously using the Adam algorithm [35]. However, this method can fail drastically even on simple min-max problems. To better understand this issue, consider solving the following simple stochastic min-max problem:
$$f(\theta,\alpha) = \begin{cases} c(\theta-\alpha) + (\theta^2-\alpha^2) + k\theta\alpha, & \text{w.p. } \tfrac{1}{3},\\[1mm] (\theta-\alpha) + (\theta^2-\alpha^2) + k\theta\alpha, & \text{w.p. } \tfrac{2}{3}, \end{cases} \qquad (3.9)$$
where $c > 1$ and $k \ge 0$. A short calculation gives
$$F(\theta,\alpha) = \frac{c+2}{3}(\theta-\alpha) + (\theta^2-\alpha^2) + k\theta\alpha.$$
This problem has the unique FNE
$$(\theta^*,\alpha^*) = -\frac{c+2}{3k^2+12}\,(2-k,\ 2+k).$$
Since $\nabla^2_\theta F(\theta,\alpha) = 2\mathbf{I}\succ 0$ and $\nabla^2_\alpha F(\theta,\alpha) = -2\mathbf{I}\prec 0$, this function is strongly-convex-strongly-concave, and many available algorithms [106, 107] can compute its FNE due to this special structure.
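As a quick sanity check of the closed-form equilibrium, the following snippet (ours, not from the thesis) evaluates the gradient field of $F$ at $(\theta^*,\alpha^*)$ and confirms that it vanishes.

    import numpy as np

    c, k = 1010.0, 0.01
    a = (c + 2) / 3
    theta_s, alpha_s = -(c + 2) / (3 * k**2 + 12) * np.array([2 - k, 2 + k])
    grad = np.array([a + 2 * theta_s + k * alpha_s,     # dF/dtheta
                     -a - 2 * alpha_s + k * theta_s])   # dF/dalpha
    print(grad)    # ~ [0, 0]: (theta*, alpha*) is the unique FNE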
In this case study, we show that despite the simplicity of the problem, S-Adam is unable to recover the unique FNE point of this function, and we compare the performance of S-Adam with our proposed algorithm. For the comparison, we define $e_k = \frac{\|\mathbf{z}_k-\mathbf{z}^*\|}{\|\mathbf{z}^*\|}$, where $\mathbf{z}_k = (\theta_k,\alpha_k)$ and $\mathbf{z}^* = (\theta^*,\alpha^*)$, and $\mathcal{R}_k = \frac{1}{k}\sum_{i=1}^k \|\nabla F(\mathbf{z}_i)\|^2$ to measure the performance of the different methods. We set the parameters to $c = 1010$, $k = 0.01$, $N = 10^7$, $\eta = 10^{-2}$, $\beta_1 = 0$, $\beta_2 = 1/(1+c^2)$, and $\beta_3 = 0.1$; all other parameters are initialized at zero. Figure 3.3 shows the result of the experiment. The left axis depicts the error rate $e_k$ and the right one the average norm of the gradient $\mathcal{R}_k$. Adam$^3$ converges to the unique FNE point, while S-Adam is unable to locate it. This shows that S-Adam is unreliable even for a simple strongly-convex-strongly-concave problem.

Figure 3.3: Left/right y-axis: error rate $e_k$ / average norm of gradient $\mathcal{R}_k$. S-Adam misses the unique FNE point.

(II) Training GANs with Adam$^3$: Algorithm 6 is used to train GANs on the publicly available CIFAR-10 data set, containing 60000 color images of size 32×32 in 10 different classes (see https://www.cs.toronto.edu/~kriz/cifar.html).

Models and tasks: The generator's network consists of an input layer, 2 hidden layers, and an output layer. Each of the input and hidden layers consists of a transposed convolution layer followed by batch normalization and a ReLU activation function. The output layer is a transposed convolution layer with a hyperbolic tangent activation function. The discriminator's network also has an input layer, 2 hidden layers, and an output layer. Both the input and hidden layers are convolutional layers followed by instance normalization and a Leaky ReLU activation function with slope 0.2; the output layer consists only of a convolutional layer. The scripts containing the detailed design of the networks, together with the implementations of Adam$^3$ and its competitor Optimistic AdaGrad (OAdagrad) [43] in PyTorch, will be available at https://github.com/babakbarazandeh. The parameters are set to $\eta = 0.5\times 10^{-3}$, $\beta_1 = 0.5$, $\beta_2 = 0.9$, and $\beta_3 = 0.5$, and the batch size to 64. The experiment runs for a total of 40,000 iterations. Figure 3.4 depicts the inception score of the generated images, a metric that evaluates their quality [136]. It can be seen that Adam$^3$ exhibits better performance than OAdagrad at all iteration stages.

Figure 3.4: Inception score for generated CIFAR-10 images using Adam$^3$ and OAdagrad.

3.3 Min-Max Problems for Small Adversary Region

In the previous sections, we studied new classes of min-max saddle point problems beyond the well-studied convex-concave games. More specifically, we studied deterministic problems that are nonconvex-(strongly)-concave, and stochastic min-max games in which the objective function satisfies the Minty variational inequality. Despite the wide range of problems that fall into these classes, they are still far from general nonconvex-nonconcave games. Solving such problems without further assumptions appears to be out of reach and a difficult task [46].
However, there are recent applications that, although they require optimizing nonconvex-nonconcave objective functions, have a specific constraint structure in their feasible sets that makes them tractable. In particular, recent efforts to develop machine learning models, specifically deep neural networks, that are robust against adversarial attacks fall into this category. In these problems, the adversary aims to add a small perturbation to the data in order to increase the training error, while the training process minimizes the overall training error [24]. It is important to note that the adversary adds only a small perturbation, so that the attacked data remain very close to the original data and are not easily detectable [24]. As a result, training against adversarial attacks can be modeled as a min-max saddle point game in which the feasible set of the inner maximization problem is a small region. In the upcoming sections, we formulate this problem rigorously and propose an algorithm with theoretical convergence guarantees for solving min-max games in this regime.

3.3.1 Problem Formulation

Consider the min-max saddle point game
$$\min_{\boldsymbol{\theta}\in\Theta}\ \max_{\boldsymbol{\alpha}\in\mathcal{A}}\ f(\boldsymbol{\theta},\boldsymbol{\alpha}), \qquad (3.10)$$
where we assume that the feasible sets and the objective function satisfy the following assumptions.

Assumption 10. The sets $\Theta\subseteq\mathbb{R}^{d_\theta}$ and $\mathcal{A}\subseteq\mathbb{R}^{d_\alpha}$ are convex and compact. Moreover, there exist balls of radius $D_\theta$ and $D_\alpha$ that contain the feasible sets $\Theta$ and $\mathcal{A}$, respectively.

Assumption 11. The function $f(\boldsymbol{\theta},\boldsymbol{\alpha})$ is continuously differentiable in both $\boldsymbol{\theta}$ and $\boldsymbol{\alpha}$, and there exist constants $L_{11}$, $L_{22}$, and $L_{12}$ such that for every $\boldsymbol{\alpha},\boldsymbol{\alpha}_1,\boldsymbol{\alpha}_2\in\mathcal{A}$ and $\boldsymbol{\theta},\boldsymbol{\theta}_1,\boldsymbol{\theta}_2\in\Theta$, we have
$$\|\nabla_{\boldsymbol{\theta}} f(\boldsymbol{\theta}_1,\boldsymbol{\alpha}) - \nabla_{\boldsymbol{\theta}} f(\boldsymbol{\theta}_2,\boldsymbol{\alpha})\| \le L_{11}\|\boldsymbol{\theta}_1-\boldsymbol{\theta}_2\|,$$
$$\|\nabla_{\boldsymbol{\alpha}} f(\boldsymbol{\theta},\boldsymbol{\alpha}_1) - \nabla_{\boldsymbol{\alpha}} f(\boldsymbol{\theta},\boldsymbol{\alpha}_2)\| \le L_{22}\|\boldsymbol{\alpha}_1-\boldsymbol{\alpha}_2\|,$$
$$\|\nabla_{\boldsymbol{\alpha}} f(\boldsymbol{\theta}_1,\boldsymbol{\alpha}) - \nabla_{\boldsymbol{\alpha}} f(\boldsymbol{\theta}_2,\boldsymbol{\alpha})\| \le L_{12}\|\boldsymbol{\theta}_1-\boldsymbol{\theta}_2\|,$$
$$\|\nabla_{\boldsymbol{\theta}} f(\boldsymbol{\theta},\boldsymbol{\alpha}_1) - \nabla_{\boldsymbol{\theta}} f(\boldsymbol{\theta},\boldsymbol{\alpha}_2)\| \le L_{12}\|\boldsymbol{\alpha}_1-\boldsymbol{\alpha}_2\|.$$

Before proceeding, we generalize the definitions introduced in Section 3.1.1 to cover the problems in this section.

Definition 4 (Approximate-FNE). A point $(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\alpha}})$ is said to be an $(\epsilon_\theta,\epsilon_\alpha)$-first-order Nash equilibrium of Game (3.10) if $\tilde{\mathcal{X}}\big(\bar{\boldsymbol{\theta}}, \nabla_{\boldsymbol{\theta}} f(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\alpha}}), L_{11}\big) \le \epsilon_\theta^2$ and $\tilde{\mathcal{Y}}\big(\bar{\boldsymbol{\alpha}}, -\nabla_{\boldsymbol{\alpha}} f(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\alpha}}), L_{22}\big) \le \epsilon_\alpha^2$, where
$$\tilde{\mathcal{X}}\big(\bar{\boldsymbol{\theta}}, \nabla_{\boldsymbol{\theta}} f(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\alpha}}), L_{11}\big) \triangleq -2L_{11}\min_{\boldsymbol{\theta}\in\Theta}\Big[\langle\nabla_{\boldsymbol{\theta}} f(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\alpha}}),\, \boldsymbol{\theta}-\bar{\boldsymbol{\theta}}\rangle + \frac{L_{11}}{2}\|\boldsymbol{\theta}-\bar{\boldsymbol{\theta}}\|^2\Big],$$
$$\tilde{\mathcal{Y}}\big(\bar{\boldsymbol{\alpha}}, -\nabla_{\boldsymbol{\alpha}} f(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\alpha}}), L_{22}\big) \triangleq 2L_{22}\max_{\boldsymbol{\alpha}\in\mathcal{A}}\Big[\langle\nabla_{\boldsymbol{\alpha}} f(\bar{\boldsymbol{\theta}},\bar{\boldsymbol{\alpha}}),\, \boldsymbol{\alpha}-\bar{\boldsymbol{\alpha}}\rangle - \frac{L_{22}}{2}\|\boldsymbol{\alpha}-\bar{\boldsymbol{\alpha}}\|^2\Big].$$

3.3.2 Proposed Algorithm

The main idea of our algorithm is to solve a linear approximation of the objective function and then relate the solution of the approximate problem to the solution of the original problem, using the small size of the feasible set of the inner maximization problem. More specifically, let $\boldsymbol{\alpha}_0\in\mathcal{A}$ be a chosen center point and define
$$f_{\text{lin}}(\boldsymbol{\theta},\boldsymbol{\alpha}) = f(\boldsymbol{\theta},\boldsymbol{\alpha}_0) + \langle\nabla_{\boldsymbol{\alpha}} f(\boldsymbol{\theta},\boldsymbol{\alpha}_0),\, \boldsymbol{\alpha}-\boldsymbol{\alpha}_0\rangle \qquad (3.11)$$
to be the linear approximation of the objective with respect to the maximization variable around the point $\boldsymbol{\alpha}_0$. Algorithm 7 summarizes the steps of the proposed approach.
In the next section, we first show that, under the assumption that the feasible set of the inner maximization has a small radius, any solution of the linearized problem based on (3.11) is a solution of (3.10). Then we show that Algorithm 7 finds an $(\epsilon_\theta,\epsilon_\alpha)$-FNE of the original problem (3.10).

3.3.3 Convergence Analysis

We have the following initial observation.

Lemma 4. Assume $f(\cdot,\cdot)$ satisfies Assumptions 10–11 and
$$D_\alpha \le \min\left\{\frac{\epsilon_\alpha}{2L_{22}},\ \frac{\epsilon_\theta}{2L_{12}}\right\}. \qquad (3.12)$$
Then any $(\epsilon_\theta,\epsilon_\alpha)$-FNE of $f_{\text{lin}}$ is a $(2\epsilon_\theta, 2\epsilon_\alpha)$-FNE of $f$ (in the sense of Definition 4).

The above lemma, proved in Appendix A.2, shows that if $D_\alpha \sim \mathcal{O}(\min\{\epsilon_\theta,\epsilon_\alpha\})$, we can solve the linear approximation of the objective function and obtain a solution of the original problem (3.10). Using the above lemma, we obtain the following main theorem.

Theorem 7. Consider the min-max zero-sum game
$$\min_{\boldsymbol{\theta}\in\Theta}\ \max_{\boldsymbol{\alpha}\in\mathcal{A}}\ f(\boldsymbol{\theta},\boldsymbol{\alpha}). \qquad (3.13)$$
Define $f_{\lambda,\text{lin}}(\boldsymbol{\theta},\boldsymbol{\alpha}) = f_{\text{lin}}(\boldsymbol{\theta},\boldsymbol{\alpha}) - \frac{\lambda}{2}\|\boldsymbol{\alpha}-\boldsymbol{\alpha}_0\|^2$ and $g_\lambda(\boldsymbol{\theta}) = \max_{\boldsymbol{\alpha}\in\mathcal{A}} f_{\lambda,\text{lin}}(\boldsymbol{\theta},\boldsymbol{\alpha})$. Let $D = g_\lambda(\boldsymbol{\theta}_0) - \min_{\boldsymbol{\theta}\in\Theta} g_\lambda(\boldsymbol{\theta})$, let $L_{g_\lambda} = L_{11} + \frac{L_{12}^2}{\lambda}$ be the Lipschitz constant of the gradient of $g_\lambda$, and let $D_\alpha$ satisfy Equation (3.12). In Algorithm 7, if we set $\eta = \frac{1}{L_{g_\lambda}}$, $\lambda = \min\{L_{22}, \frac{\epsilon_\alpha}{2D_\alpha}\} = \mathcal{O}(1)$, and choose $T$ large enough that
$$T \ge N_T(\epsilon_\theta) \triangleq \frac{16\,L_{g_\lambda}D}{\epsilon_\theta^2} = \mathcal{O}\big(\epsilon_\theta^{-2}\epsilon_\alpha^{-1}\big),$$
then there exists $t\in\{0,\dots,T\}$ such that $(\boldsymbol{\theta}_t,\boldsymbol{\alpha}_{t+1})$ is a $(2\epsilon_\theta, 2\epsilon_\alpha)$-FNE of problem (3.13).

The proofs of the above lemma and theorem can be found in Appendix A.2.

Algorithm 7 Single Loop Projected Gradient Descent-Ascent
1: Input: $T$, $\eta$, $\lambda$, $\boldsymbol{\alpha}_0\in\mathcal{A}$, and $\boldsymbol{\theta}_0\in\mathbb{R}^{d_\theta}$.
2: for $t = 0,\dots,T-1$ do
3:   $\boldsymbol{\alpha}_{t+1} = \Pi_{\mathcal{A}}\big(\boldsymbol{\alpha}_0 + \frac{1}{\lambda}\nabla_{\boldsymbol{\alpha}} f_{\text{lin}}(\boldsymbol{\theta}_t, \boldsymbol{\alpha}_t)\big)$
4:   $\boldsymbol{\theta}_{t+1} = \Pi_{\Theta}\big(\boldsymbol{\theta}_t - \eta\,\nabla_{\boldsymbol{\theta}} f_{\text{lin}}(\boldsymbol{\theta}_t, \boldsymbol{\alpha}_{t+1})\big)$
5: end for
($\Pi_S(\cdot)$: projection operator onto the set $S$)
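A sketch of Algorithm 7 is given below. Since $f_{\text{lin}}$ in (3.11) is linear in $\boldsymbol{\alpha}$, its $\boldsymbol{\alpha}$-gradient equals $\nabla_{\boldsymbol{\alpha}} f(\boldsymbol{\theta},\boldsymbol{\alpha}_0)$; its $\boldsymbol{\theta}$-gradient involves the mixed Jacobian of $f$, which we assume is supplied by the callable `cross_jac`. All names and the callable interfaces are our own illustrative assumptions.

    import numpy as np

    def single_loop_pgda(grad_theta_f, grad_alpha_f, cross_jac,
                         theta0, alpha0, proj_Theta, proj_A,
                         eta, lam, T=500):
        """Sketch of Algorithm 7. cross_jac(theta) is assumed to return the
        (d_alpha, d_theta) Jacobian of grad_alpha f(theta, alpha0) in theta."""
        theta = theta0.copy()
        alpha = alpha0.copy()
        for _ in range(T):
            g_alpha = grad_alpha_f(theta, alpha0)        # grad of f_lin in alpha
            alpha = proj_A(alpha0 + g_alpha / lam)       # step 3 of Algorithm 7
            # grad of f_lin in theta: base gradient plus the mixed-Jacobian term
            g_theta = grad_theta_f(theta, alpha0) \
                      + cross_jac(theta).T @ (alpha - alpha0)
            theta = proj_Theta(theta - eta * g_theta)    # step 4 of Algorithm 7
        return theta, alpha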
3.3.4 Numerical Experiments

Neural networks are widely used in various applications such as image classification. Despite their wide applicability, multiple studies [137–139] have shown that trained networks are vulnerable to adversarial attacks: even a small perturbation of the input data can significantly change the output of a neural network. As discussed in [24, 101], one approach to overcoming this issue and training a neural network that is robust against adversarial attacks is to reformulate the training procedure as a robust min-max optimization problem, i.e.,
$$\min_{\boldsymbol{\theta}\in\Theta}\ \max_{\substack{\boldsymbol{\alpha}=[\boldsymbol{\alpha}_1,\dots,\boldsymbol{\alpha}_N]\\ \text{s.t. } \boldsymbol{\alpha}_i\in\mathcal{A}\ \forall i}}\ f(\boldsymbol{\theta},\boldsymbol{\alpha}) := \sum_{i=1}^N \ell\big(g(\mathbf{x}_i+\boldsymbol{\alpha}_i;\boldsymbol{\theta}),\, y_i\big), \qquad (3.14)$$
where $N$ is the total number of data points, the pair $(\mathbf{x}_i, y_i)$ is the $i$-th data sample, $g(\cdot\,;\boldsymbol{\theta})$ is the output of a neural network with parameters $\boldsymbol{\theta}$, $\ell(\cdot,\cdot)$ is a function that measures the error between the true output and the value estimated by the model, and $\boldsymbol{\alpha} = [\boldsymbol{\alpha}_1,\dots,\boldsymbol{\alpha}_N]$ is the perturbation added to the samples. This formulation follows the structure of problem (3.10) studied in this section; as a result, we use our proposed algorithm to train robust neural networks.

Structure of the adversarial attack: We evaluate the performance of our trained networks against two benchmark attacks, the Fast Gradient Sign Method (FGSM) [140] and Projected Gradient Descent (PGD) [101].

Structure of the network: We test the performance of the proposed Algorithm 7 on a classification task for the MNIST dataset. The network is a Convolutional Neural Network (CNN) with the architecture detailed in Table 3.2. The other tuning parameter is set to $\epsilon = 2$.

Table 3.2: Model architecture for the MNIST dataset.
Layer type | Shape
Convolution + tanh | 3 × 3 × 5
Max pooling | 2 × 2
Convolution + tanh | 3 × 3 × 10
Max pooling | 2 × 2
Fully connected + tanh | 250
Fully connected + tanh | 100
Softmax | 3

Table 3.3 summarizes the results of the numerical experiments. The first row is the value of $\frac{1}{\lambda}$ required by Algorithm 7. The numbers in the table show the classification accuracy of the trained model when the proposed linear approximation is used, compared to the case where the original function is used; each cell reads (proposed | original). We report the worst performance of the model under both the FGSM and the PGD attack. As seen in the table, the proposed algorithm trains a neural network that is more robust against adversarial attacks than the one trained with the original function.

Table 3.3: Classification accuracy after adversarial attacks on the neural network (proposed | original). The proposed algorithm trains a network that is more robust against attacks than the benchmark.
1/λ | 500 | 2500 | 5000 | 25000
Model accuracy | 68.95% | 57.63% — 74.92% | 59.23% — 75.75% | 61.29% — 76.47% | 63.06%

3.4 Do We Always Need to Solve Min-Max Games for Training GANs?

In Section 3.1, we studied the class of deterministic nonconvex-concave min-max games, and in Section 3.2, we analyzed a class of stochastic min-max games that satisfy the Minty VI condition. Each of these studies concerns a special class of saddle point games, and the underlying assumptions of their algorithms may fail for training GANs with complex architectures, which requires solving a general class of nonconvex-nonconcave min-max games. In this section, we question the necessity of solving min-max problems in training GANs. We show that the "Generator-Discriminator" structure can be used without fully optimizing the discriminator. In particular, we show that even a random discriminator can be helpful in training the generator up to a certain accuracy.

3.4.1 Training the Generator via a Random Discriminator

Generative Adversarial Networks (GANs) [6] have been relatively successful in learning the underlying distribution of data, especially in applications such as image generation. GANs aim to find a mapping that matches a known distribution to the underlying distribution of the data. They perform this task by projecting the input to a higher dimension using neural networks [98] and then minimizing the distance between the mapped distribution and the unknown distribution in the projected space. To find the optimal network, [6] proposed using the Jensen-Shannon divergence [141] to measure the distance between the projected distribution and the data distribution. Later, [142] generalized the idea by using the f-divergence as the measure; [143] and [144] proposed using least squares and absolute deviation as the measure; and the most recent works proposed the Wasserstein distance and Maximum Mean Discrepancy (MMD) as the distance measure [97, 125, 145]. Unlike the Jensen-Shannon divergence, the recent measures are continuous and almost everywhere differentiable. The common thread among all these approaches is that the problem is formulated as a game between two agents, i.e., a generator and a discriminator. The generator's role is to generate samples as close as possible to real data, and the discriminator is responsible for distinguishing between real data and generated samples. The result is a nonconvex min-max game that is difficult to solve.
The learning process, which must solve the resulting nonconvex min-max game, is hard to tackle due to many factors, such as the use of a discontinuous [125] or non-smooth [98] measure. In addition to these factors, the fact that all of these models learn the mapping transformation adversarially makes training unstable. Adding regularization or starting from a good initial point is one way to overcome these problems [98]; however, for most problems, finding a good initial point may be as hard as solving the problem itself. Randomization has shown promising improvements in machine learning algorithms [2, 146]. As a result, to avoid the aforementioned issues, we propose learning the underlying distribution of the data not through an adversarial player but through a random projection. This random projection not only decreases the computation time by removing the optimization steps needed for most of the discriminator's role, but also leads to a more stable optimization problem. The proposed method achieves state-of-the-art performance on simple datasets such as MNIST and Fashion-MNIST.

3.4.2 Problem Formulation

Let $\mathbf{x}\in\mathbb{R}^d$ be a random variable with distribution $P_x$ representing the real data, and let $\mathbf{z}$ be a random variable with a known distribution such as the standard Gaussian. Our goal is to find a function or neural network $G(\cdot)$ such that $G(\mathbf{z})$ has a distribution similar to the real data distribution $P_x$. Therefore, our objective is to solve the optimization problem
$$\min_G\ \text{dist}\big(P_{G(\mathbf{z})},\, P_x\big), \qquad (3.15)$$
where $P_{G(\mathbf{z})}$ is the distribution of $G(\mathbf{z})$ and $\text{dist}(\cdot,\cdot)$ is a distance measure between two distributions. A natural question is which distance metric to use. The original paper of Goodfellow et al. [6] suggests the Jensen-Shannon divergence; however, as mentioned in [125], this divergence is not continuous. Therefore, [98, 125] suggest using the optimal transport distance. In what follows, we first review this distance and then discuss our methodology for solving (3.15).

3.4.3 Optimal Transport Distance

Let $p$ and $q$ be two discrete distributions taking $m$ different values/states, so that $p$ and $q$ can be represented by the $m$-dimensional vectors $(p_1,\dots,p_m)$ and $(q_1,\dots,q_m)$. The optimal transport distance is defined as the minimum amount of work needed to transport distribution $p$ to $q$ (and vice versa). Let $\pi_{ij}$ be the amount of mass moved from state $i$ to state $j$, and let $c_{ij}$ be the per-unit cost of this move. Then the optimal transport distance between the two distributions $p$ and $q$ is defined as [147]
$$\text{dist}(p,q) = \min_{\pi\ge 0}\ \sum_{i=1}^m\sum_{j=1}^m c_{ij}\,\pi_{ij} \quad \text{s.t.} \quad \sum_{j=1}^m \pi_{ij} = p_i, \ \forall i = 1,\dots,m; \qquad \sum_{i=1}^m \pi_{ij} = q_j, \ \forall j = 1,\dots,m, \qquad (3.16)$$
where the constraints guarantee that the mapping $\pi$ is a valid transport plan. In practice, a popular approach is to solve the dual problem. It is not hard to see that the dual of the optimization problem (3.16) can be written as
$$\text{dist}(p,q) = \max_{\lambda,\gamma}\ \sum_{i=1}^m \gamma_i p_i + \sum_{j=1}^m \lambda_j q_j \quad \text{s.t.} \quad \lambda_j + \gamma_i \le c_{ij}, \ \forall i,j = 1,\dots,m. \qquad (3.17)$$
When $c$ is a proper distance, the dual variables satisfy $\lambda = -\gamma$ [147].
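For small m, the primal LP (3.16) can be solved directly; the following sketch uses SciPy's linprog (HiGHS backend) with the row- and column-sum constraints written explicitly. The helper name and the toy example are ours.

    import numpy as np
    from scipy.optimize import linprog

    def ot_distance(p, q, C):
        """Solve the optimal transport LP (3.16) directly for small m.
        p, q: length-m distributions; C: (m, m) cost matrix c_ij."""
        m = len(p)
        A_eq = np.zeros((2 * m, m * m))      # pi is flattened row-major
        for i in range(m):
            A_eq[i, i * m:(i + 1) * m] = 1   # sum_j pi_ij = p_i
            A_eq[m + i, i::m] = 1            # sum_i pi_ij = q_j (here j = i)
        b_eq = np.concatenate([p, q])
        res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                      bounds=(0, None), method="highs")
        return res.fun

    p = np.array([0.5, 0.5]); q = np.array([0.25, 0.75])
    C = np.array([[0.0, 1.0], [1.0, 0.0]])
    print(ot_distance(p, q, C))              # 0.25: move mass 0.25 one unit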
Unlike special cases such as the convex-concave setup [148], no algorithm in the literature to date can find even an ε-stationary point in the general nonconvex setting; see [47] and the references therein. Therefore, training generative adversarial networks (GANs) can be notoriously difficult in practice and may require significant tuning of the training parameters. A natural alternative is to not parameterize the dual function and instead solve (3.16) or (3.17) directly, which leads to a convex reformulation. However, as mentioned earlier, since the dimension m is large, approximating p and q is statistically not possible. Moreover, the distance in the original feature domain may not reflect the actual distance between the distributions. Thus, we suggest an alternative formulation in the next section.

3.4.4 Training in a different feature domain

In many applications, the closeness of samples in the original feature domain does not reflect the actual similarity between the samples. For example, two images of the same object may differ substantially when the distance is computed in the pixel domain. Therefore, other mappings of the features, such as features obtained by a Convolutional Neural Network (CNN), may be used to extract meaningful features from samples [149]. Let D = {D_1, D_2, …, D_K} be a collection of meaningful features we are interested in. In other words, each function D ∈ D is a mapping from the original feature domain to the domain of interest, i.e., D_k(·) : R^d → R^{d'}, ∀k = 1, …, K. Then, instead of solving (3.15), one may solve the optimization problem

\min_{G} \sum_{k=1}^{K} w_k\, \mathrm{dist}\big(P_{D_k(G(z))}, P_{D_k(x)}\big),    (3.18)

where P_{D_k(G(z))} is the distribution of the random variable D_k(G(z)), P_{D_k(x)} is the distribution of D_k(x), and w_k is a weight coefficient indicating the importance of the k-th feature D_k. In the general setting, we may have an uncountable number of mappings D_k. Thus, by defining a measure on the set D, we can generalize (3.18) to the optimization problem

\min_{G}\; \mathbb{E}_{D}\Big[\mathrm{dist}\big(P_{D(G(z))}, P_{D(x)}\big)\Big].    (3.19)

Remark 1. We use the notation D since the function D plays the role of a discriminator in the generative adversarial learning context.

Plugging (3.16) into (3.19) leads to the optimization problem

\min_{G}\; \mathbb{E}_{D}\Big[\max_{(\lambda,\gamma)\in\mathcal{C}} \sum_{i=1}^{m}\gamma_i P^{i}_{D(G(z))} + \sum_{j=1}^{m}\lambda_j P^{j}_{D(x)}\Big],    (3.20)

where C = {(λ, γ) | λ_j + γ_i ≤ c_{ij}, ∀i, j}. Unfortunately, in practice we do not have access to the actual values of the distributions P_{D(x)} and P_{D(G(z))}. However, we can estimate them using a batch of generated and real samples. The following simple lemma, which is proved in Appendix A.2, motivates the use of a natural surrogate function.

Lemma 5. Let p and q be two discrete distributions with p = (p_1, …, p_m) and q = (q_1, …, q_m). Let x ∈ R^m and y ∈ R^m be the corresponding one-hot encoded random variables, i.e., P(x = e_i) = p_i and P(y = e_i) = q_i for all i = 1, …, m, where e_i is the i-th standard basis vector. Assume further that dist(p, q) is the optimal transport distance between p and q defined in (3.16). Let p̂_n and q̂_n be the natural unbiased estimators of p and q based on n i.i.d. samples; in other words, p̂_n = (1/n) Σ_{ℓ=1}^{n} x_ℓ and q̂_n = (1/n) Σ_{ℓ=1}^{n} y_ℓ, where x_ℓ and y_ℓ, ℓ = 1, …, n, are i.i.d. samples obtained from distributions p and q, respectively. Then,

\mathbb{E}\big[\mathrm{dist}(\hat{p}_{n+1}, \hat{q}_{n+1})\big] \le \mathbb{E}\big[\mathrm{dist}(\hat{p}_{n}, \hat{q}_{n})\big].
Moreover, lim_{n→∞} dist(p̂_n, q̂_n) = dist(p, q) almost surely.

The above lemma suggests a natural upper bound for the objective function in (3.20). More precisely, instead of solving (3.20), we can solve

\min_{G}\; \mathbb{E}\Big[\max_{(\lambda,\gamma)\in\mathcal{C}} \sum_{i=1}^{m}\gamma_i \hat{P}^{i}_{D(G(z))} + \sum_{j=1}^{m}\lambda_j \hat{P}^{j}_{D(x)}\Big],    (3.21)

where P̂_{D(G(z))} and P̂_{D(x)} are the unbiased estimators of P_{D(G(z))} and P_{D(x)} based on our i.i.d. samples. Moreover, the expectation is taken with respect to both the function D and the batch of samples drawn for estimating the distributions. As we will see later, in practice it is easier to use the primal form of the inner problem in (3.21), i.e.,

\min_{G}\; \mathbb{E}\Big[\min_{\pi\ge 0} \sum_{i=1}^{m}\sum_{j=1}^{m} c_{ij}\pi_{ij}
\;\;\text{s.t.}\;\; \sum_{j=1}^{m}\pi_{ij} = \hat{P}^{i}_{D(G(z))},\ \sum_{i=1}^{m}\pi_{ij} = \hat{P}^{j}_{D(x)},\ \forall i,j\Big].

To show the dependence of c_{ij} on G, assume that our generator G produces the output h(w, z) from the input z, where w represents the weights of the network to be learned. Moreover, in practice, the value of P̂^i_{D(G(z))} is estimated by averaging over the batch of data. Hence, by duplicating variables if necessary, we can rewrite the above optimization problem as

\min_{w}\; \mathbb{E}_{z,x,D}\Big[\min_{\pi\ge 0} \sum_{i=1}^{n}\sum_{j=1}^{n} \big\|D(h(w,z_i)) - D(x_j)\big\|\,\pi_{ij}
\;\;\text{s.t.}\;\; \pi\mathbf{1} = \tfrac{1}{n}\mathbf{1},\ \pi^{T}\mathbf{1} = \tfrac{1}{n}\mathbf{1}\Big].    (3.22)

Here, n is the batch size, and we ignore the entries of P̂_{D(G(z))} and P̂_{D(x)} that are zero. Notice that to obtain an algorithm with a convergence guarantee for solving this optimization problem, one can properly regularize the inner optimization problem to obtain unbiased estimates of the gradient of the objective function [47, 98]. However, in this work, due to practical considerations, we suggest solving the inner problem approximately and using the approximate solution when solving (3.22).

Solving the inner problem approximately. In order to solve the inner problem in (3.22), we need to solve

\min_{\pi\ge 0} \sum_{i=1}^{n}\sum_{j=1}^{n} \big\|D(h(w,z_i)) - D(x_j)\big\|\,\pi_{ij}
\;\;\text{s.t.}\;\; \pi\mathbf{1} = \tfrac{1}{n}\mathbf{1},\ \pi^{T}\mathbf{1} = \tfrac{1}{n}\mathbf{1}.    (3.23)

Notice that this is the classical optimal assignment problem, which can be solved using the Hungarian method [150], the auction algorithm [151], or many other methods proposed in the literature. Based on our observations, even the greedy method of assigning each column to the lowest unassigned row worked in our numerical experiments. The benefit of the greedy method is that it can be performed almost linearly in the batch size with a proper hash function. Algorithm 8 summarizes our proposed Generative Networks using Random Discriminator (GN-RD) algorithm for solving (3.22); a sketch of its assignment step follows the algorithm.

Algorithm 8 Generative Networks using Random Discriminator (GN-RD)
1: Input: w_0: initialization of the generator's parameters; α: learning rate; n: batch size; N_Itr: maximum number of iterations.
2: for t = 1 : N_Itr do
3:   Sample an i.i.d. batch of real data (x_1, …, x_n)
4:   Sample an i.i.d. batch of noise (z_1, …, z_n)
5:   Create a random discriminator neural network D with random weights
6:   Solve (3.23) via the optimal assignment between real and generated samples
7:   Update the generator's parameters: w_{t+1} = w_t − α ∇_w G(w_t)
8: end for

Remark 2. The training approach in Algorithm 8 relies on two neural networks: the generator and the discriminator. Hence, Algorithm 8 can be viewed as a GAN training approach in which a freshly drawn random discriminator is used at each update of the generator.
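For uniform marginals 1/n, step 6 of Algorithm 8 reduces to an optimal assignment. The following is a minimal sketch of that step, using SciPy's implementation of the Hungarian method [150] together with the greedy alternative mentioned above; the function names and the use of SciPy are our assumptions, not the thesis' actual implementation.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_batches(fake_feats, real_feats):
        """Solve the inner assignment problem (3.23) for one batch.
        fake_feats, real_feats: (n, d') arrays holding D(h(w, z_i)) and D(x_j).
        Returns the matched index pairs and the resulting transport cost."""
        # Pairwise cost c_ij = ||D(h(w, z_i)) - D(x_j)||.
        cost = np.linalg.norm(
            fake_feats[:, None, :] - real_feats[None, :, :], axis=2)
        rows, cols = linear_sum_assignment(cost)   # Hungarian method
        return rows, cols, cost[rows, cols].mean()

    def greedy_matching(cost):
        """Greedy alternative: assign each column to the cheapest
        still-unassigned row. Faster, but only approximately optimal."""
        n = cost.shape[0]
        free_rows, assignment = set(range(n)), np.empty(n, dtype=int)
        for j in range(n):
            i = min(free_rows, key=lambda r: cost[r, j])
            assignment[j] = i
            free_rows.remove(i)
        return assignment

Once the pairs are matched, the mean matched cost is the batch loss whose gradient with respect to w is backpropagated in step 7.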
Remark 3. The recent works [152, 153] are similar in that they learn generative models through a min-min formulation instead of a min-max formulation. However, unlike their method, 1) our algorithm is based on mapping images via randomly generated discriminators; 2) our analysis establishes that this formulation leads to an upper bound on the distance measure; and 3) our algorithm is based on optimal assignment, while the works [152, 153] suggest a greedy matching, which is more difficult to understand and analyze.

3.4.5 Numerical Experiments

In this section, we evaluate the performance of the proposed GN-RD algorithm for learning generative networks that create samples from the MNIST [154] and Fashion-MNIST [155] datasets. As mentioned previously, the proposed algorithm does not require any optimization of the discriminator network; it only needs a randomly generated discriminator to learn the underlying distribution of the data.*

* All experiments were run on a machine with a single GeForce GTX 1050 Ti GPU.

Architecture of the neural networks: The generator consists of two fully connected layers with 1024 and 6272 neurons. The output of the second fully connected layer is followed by two deconvolutional layers that generate the final 28 × 28 image. The discriminator has two convolutional layers, each followed by a max pool; both convolutional layers have 64 filters. The last layer is flattened to create the output. The design of both networks is summarized below (a hedged implementation sketch of the generator is given at the end of this section):

• Generator: [FC(100, 1024), Leaky ReLU(α = 0.2), FC(1024, 6272), Leaky ReLU(α = 0.2), DECONV(64, kernel size = 4, stride = 2), Leaky ReLU(α = 0.2), DECONV(1, kernel size = 4, stride = 2), Sigmoid].

• Discriminator: [CONV(64, filter size = 5, stride = 1), Leaky ReLU(α = 0.2), Max Pool(kernel size = 2, stride = 2), CONV(64, filter size = 5, stride = 1), Max Pool(kernel size = 2, stride = 2), Flatten].

We compare against the originally proposed adversarial discriminators of Wasserstein GAN (WGAN) [125], Wasserstein GAN with gradient penalty (WGAN-GP) [97]†, and Cramér GAN [156]‡. As stated in Algorithm 8, and unlike the benchmark methods, the proposed method optimizes only the generator's parameters; at each iteration, the weights in the convolutional layers of the discriminator are drawn at random from a normal distribution.

† For the WGAN and WGAN-GP implementations, visit https://github.com/igul222/improved_wgan_training
‡ For the Cramér GAN implementation, visit https://github.com/jiamings/cramer-gan

Hyperparameters: We used Adam with step size 0.001, β_1 = 0.5, and β_2 = 0.9 as the optimizer for the generator. The batch size is set to 100.

Figure 3.5 shows the generated digits and the corresponding inception scores [157] for the different methods.

[Figure 3.5: Generating hand-written digits using the MNIST dataset. Panels: (a) WGAN; (b) WGAN-GP; (c) Cramér GAN; (d) GN-RD; (e) inception score over time (seconds).]

As seen from the figure, the proposed GN-RD quickly learns the underlying distribution of the data and generates promising samples.

Figure 3.6 shows the result of using the proposed method to generate samples from the Fashion-MNIST dataset.

[Figure 3.6: Generating fashion products using the Fashion-MNIST dataset. Panels: (a) original data; (b) GN-RD.]

The samples are generated after only 600 iterations (about 10 minutes) of the proposed method, which shows that GN-RD converges quickly and generates promising samples.
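For concreteness, the generator listed above can be sketched in PyTorch as follows. The reshape to 128 × 7 × 7 (128 · 7 · 7 = 6272) and the deconvolution padding are our assumptions, chosen so that the output is exactly 28 × 28; the text does not specify them.

    import torch
    import torch.nn as nn

    class Generator(nn.Module):
        """Sketch of the generator architecture listed above."""
        def __init__(self):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(100, 1024), nn.LeakyReLU(0.2),
                nn.Linear(1024, 6272), nn.LeakyReLU(0.2),
            )
            self.deconv = nn.Sequential(
                nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
                nn.LeakyReLU(0.2),
                nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1),
                nn.Sigmoid(),  # pixel intensities in [0, 1]
            )

        def forward(self, z):               # z: (batch, 100)
            h = self.fc(z).view(-1, 128, 7, 7)
            return self.deconv(h)           # (batch, 1, 28, 28)

    # Quick shape check with a batch of noise vectors.
    g = Generator()
    print(g(torch.randn(5, 100)).shape)     # torch.Size([5, 1, 28, 28])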
Chapter 4
Conclusion and Future Works

In this thesis, we proposed multiple algorithms and carried out landscape analyses for mixture models and the GAN training problem. These studies revealed the need to solve a new class of min-max optimization problems beyond the standard regime in which the objective function is convex-concave. Although our studies cover a wide range of applications and problems, several directions need further investigation. In the following sections, we summarize future directions and open questions that fall under the umbrella of this dissertation.

Regarding mixture models, we will extend our previous analysis to the MLE problem in which we optimize over the entire parameter space, i.e., the centers of the distributions, the covariance matrices, the mixing weights, and the number of components; we will then carry out a landscape analysis for the mixed linear regression problem. For adversarial learning, we will focus on making our proposed min-max algorithm adaptive, on analyzing and developing algorithms that work in a distributed fashion, and on proposing zeroth-order methods with theoretical guarantees for solving a broader class of nonconvex min-max games.

4.1 Landscape Analysis for Maximum Likelihood Estimation

In this section, we explain in more detail the questions that remain open regarding the landscape analysis of the MLE problem. Briefly, as future work, we will extend our previous analysis to the MLE problem in which we optimize over the entire parameter space, i.e., the centers of the distributions, the covariance matrices, the mixing weights, and the number of components. Additionally, we will carry out a landscape analysis of the MLE problem arising from mixed linear regression.

4.1.1 Extensions to the General Maximum Likelihood Estimation Problem

As mentioned before, MLE is one of the most commonly used approaches for estimating the parameters of a finite mixture model. In the previous sections, our focus was mainly on the scenario in which only the centers of the mixture components need to be estimated, i.e., all other parameters, such as the covariance matrices and the mixing weights of each component, are known. However, our initial analysis and numerical experiments show that even when some parameters of the model are perfectly known a priori, treating them as unknown variables may lead to computational advantages. To better understand this issue, let f(x; θ, Σ) denote a multivariate probability density function centered at θ with covariance matrix Σ. For example, for the Gaussian distribution,

f(x;\theta,\Sigma) = (2\pi)^{-k/2}\det(\Sigma)^{-1/2}\, e^{-\frac{1}{2}(x-\theta)^{T}\Sigma^{-1}(x-\theta)}.

The resulting optimization problem over the entire parameter space, using MLE with infinite sample size, is

\max_{\theta,\Sigma,w,K}\; L(\theta,\Sigma,w,K) = \mathbb{E}\Big[\log \sum_{k=1}^{K} w_k f(x;\theta_k,\Sigma_k)\Big],

where θ = (θ_1, …, θ_K), Σ = (Σ_1, …, Σ_K), and w = (w_1, …, w_K).

Optimization problem                                                  local optima = global optima
max_θ L(θ, w*_1 = 1/2, w*_2 = 1/2, Σ*_1 = Σ*_2 = I, K* = 2)           Yes (known [2, 70])
max_θ L(θ, w*_1 ≠ w*_2, Σ*_1 = Σ*_2 = I, K* = 2)                      No (numerical example)
max_θ L(θ, w*_1 = w*_2, Σ*_1 ≠ Σ*_2 = I, K* = 2)                      No (numerical example)
max_{θ,w} L(θ, w, Σ* = I, K* = 2)                                     Yes (conjecture)
max_{θ,w,Σ} L(θ, w, Σ, K* = 2)                                        Yes (conjecture)

Table 4.1: Conjectures/known results for the 2-component Gaussian/Laplacian mixture model, obtained through extensive numerical experiments. As can be seen in the table, expanding the search space could lead to a computational advantage.
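As a toy illustration of the kind of experiment behind the second row of Table 4.1, the sketch below runs EM over the means only (weights fixed, known, and unequal) from many random initializations and prints the distinct fixed points with their log-likelihoods (up to an additive constant). This is only a hypothetical probe of the landscape, not the experiments that produced the table; whether spurious fixed points appear depends on the weights, the separation, and the initialization.

    import numpy as np

    rng = np.random.default_rng(0)
    w = np.array([0.3, 0.7])                  # known, unequal mixing weights
    true_mu = np.array([-2.0, 2.0])
    z = rng.random(20000) < w[1]              # component labels
    x = rng.normal(np.where(z, true_mu[1], true_mu[0]), 1.0)

    def log_lik(mu):
        # Mixture log-likelihood with unit variances, up to a constant.
        comp = -0.5 * (x[:, None] - mu[None, :]) ** 2
        return np.mean(np.log((w * np.exp(comp)).sum(axis=1)))

    def em_means(mu, iters=300):
        for _ in range(iters):                # EM updates over the means only
            resp = w * np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
            resp /= resp.sum(axis=1, keepdims=True)
            mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
        return mu

    finals = {tuple(np.round(em_means(rng.normal(0, 4, size=2)), 2))
              for _ in range(20)}
    for mu in sorted(finals):                 # distinct fixed points found
        print(mu, round(log_lik(np.array(mu)), 4))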
As future work, we will start by analyzing this general MLE problem and will seek the conditions required for all of its local optima to be globally optimal. Table 4.1 shows our initial observations for the 2-component Gaussian/Laplacian mixture model, based on our numerical experiments.

4.1.2 Landscape Analysis and Algorithm Development for the Mixed Linear Regression Problem

In Section 2.2, we studied an equally weighted K-component MLR problem with additive Laplacian/Gaussian noise and showed that the EM algorithm leads to computationally intractable sub-programs when the additive noise is Laplacian. Additionally, we proposed an algorithm that overcomes this issue by producing sub-problems with closed-form solutions. Our first step will be to analyze the proposed algorithm locally. Beyond that, there are several other potential directions for this problem. To describe them, let us recall the definition of the MLR problem studied in the previous section. From Equation (2.8) we had

\hat{\beta} = \arg\max_{\beta} \log P(y_1,\dots,y_N \mid X,\beta)
            = \arg\max_{\beta} \sum_{i=1}^{N} \log P(y_i \mid x_i,\beta)
            = \arg\max_{\beta} \sum_{i=1}^{N} \log\Big(\sum_{k=1}^{K} p_k\, f\big(y_i - \langle x_i, \beta_k\rangle\big)\Big).

In the above optimization problem, we only estimate the regressors and assume that all other information, such as the noise variance, the number of components, and the weights, is known a priori. As with the MLE problem for mixture models, future work will relax this assumption and aim at analyzing a general MLR problem. In other words, we will analyze the following problem:

\max_{\beta,\sigma^2,p,K}\; L(\beta,\sigma^2,p,K) = \sum_{i=1}^{N} \log\Big(\sum_{k=1}^{K} p_k\, f_{\sigma^2}\big(y_i - \langle x_i, \beta_k\rangle\big)\Big).

Our initial conjecture is that, as in the mixture model problem, ignoring the available parameters and optimizing the MLE problem over the whole parameter space may lead to computational advantages. As future work, we will study this conjecture in more detail.

4.2 Adversarial Learning

In the following sections, we discuss future directions regarding the adversarial learning problems studied in Chapter 3. In summary, we will focus on making our proposed min-max algorithm adaptive, on analyzing and developing algorithms that work in a distributed fashion, and on proposing zeroth-order methods with theoretical guarantees for solving a broader class of nonconvex min-max games.

4.2.1 Adaptive Min-Max Optimizer

Almost all algorithms proposed for solving nonconvex min-max optimization problems require hyper-parameters that depend on the structure of the problem. In practice, these hyper-parameters are tuned by cross-validation. Besides its computational burden, this approach may make the algorithm slow. To better understand this issue, we consider the vanilla gradient descent approach; first, we review a preliminary definition and lemma.

Definition 5. (L-smooth function) A function f : R^n → R is said to be L-smooth if ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y.

Lemma 6. (Descent lemma) For any L-smooth function f : R^n → R, we have

f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \tfrac{L}{2}\|y - x\|^2, \quad \forall x, y.

Now consider minimizing an L-smooth, unconstrained, convex function f(·) by gradient descent. At each iteration k, gradient descent requires a step size η_k. Based on the descent lemma, the step x_{k+1} = x_k − η_k ∇f(x_k) satisfies

f(x_{k+1}) \le f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k\rangle + \tfrac{L}{2}\|x_{k+1} - x_k\|^2
           = f(x_k) - \|\nabla f(x_k)\|^2\Big(\eta_k - \tfrac{L\eta_k^2}{2}\Big).

Setting η*_k = 1/L = argmax_{η_k} (η_k − Lη_k²/2) guarantees the largest reduction in function value that this bound can certify globally. However, in some regions the function may be smoother, i.e., there may exist A ⊂ R^n and L_1 < L such that ‖∇f(x) − ∇f(y)‖ ≤ L_1‖x − y‖ for all x, y ∈ A. If gradient descent is set with η_k = 1/L, the function value still decreases in that region, but by less than it would with η_k = 1/L_1. As a result, a fixed step size derived through cross-validation not only carries a computational burden but also ignores local information about the function, which may slow the convergence of the algorithm. There have been many efforts to overcome this issue. Armijo's rule is one example: at each iteration, we choose the largest η_k satisfying

f(x_{k+1}) \le f(x_k) - \sigma\,\eta_k\,\|\nabla f(x_k)\|^2

for some σ > 0.
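A minimal backtracking implementation of Armijo's rule looks as follows; the values of σ, the shrink factor, and the test function are illustrative choices. Note that no smoothness constant is ever supplied.

    import numpy as np

    def armijo_step(f, grad_f, x, eta0=1.0, sigma=1e-4, shrink=0.5):
        """Backtracking search for Armijo's rule: start from a large trial
        step eta0 and halve it until
            f(x - eta * grad) <= f(x) - sigma * eta * ||grad||^2 ."""
        g = grad_f(x)
        eta, fx, g2 = eta0, f(x), g @ g
        while f(x - eta * g) > fx - sigma * eta * g2:
            eta *= shrink
        return x - eta * g

    # Example on a poorly scaled quadratic f(x) = 0.5 * x' A x.
    A = np.diag([1.0, 100.0])
    f = lambda x: 0.5 * x @ A @ x
    grad_f = lambda x: A @ x
    x = np.array([1.0, 1.0])
    for _ in range(50):
        x = armijo_step(f, grad_f, x)
    print(f(x))   # near 0, without ever knowing the constant L = 100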
A similar issue arises when solving min-max saddle point problems. In the algorithm we proposed in Chapter 3, and in almost all algorithms in the literature, min-max games are solved iteratively: the algorithms stop after reaching a maximum iteration number, and they need information about the smoothness of the function to set the gradient descent/ascent steps. Besides the fact that estimating this smoothness constant is often impossible, these approaches ignore local information about the function around the current iterate. As future work, we will focus on developing both general rules that output a proper step size at each iteration and a checkable stopping criterion for algorithms solving min-max saddle point games. The following pseudo-code shows the big picture of the required modifications (indicated by the arrows) to our previously proposed algorithm. We aim at 1) replacing the loops that require a maximum iteration number with while loops, and 2) proposing rules that choose the step size adaptively.

Algorithm 9 Adaptive Multi-step Proximal Gradient Descent-Ascent
1: Input: initial points α_0 ∈ A and θ_0 ∈ Θ; the desired accuracy ε.
2: for t = 0, …, T − 1 do → while loop
3:   Set x_0 = α_t
4:   for k = 0, …, K − 1 do → while loop
5:     Set x_{k+1} = Proj_A [ x_k + (1/L_22) ∇_α f(θ_t, x_k) ] → Armijo/backtracking-type step
6:   end for
7:   α_{t+1} = x_K
8:   θ_{t+1} = θ_t − (1/L_11) ∇_θ f(θ_t, α_{t+1}) → Armijo/backtracking-type step
9: end for

4.2.2 Zeroth-order Methods for Min-Max Problems

Zeroth-order, or derivative-free, methods aim to solve the optimization problem min_x f(x) by querying only function values instead of gradients, which is useful when computing the gradient is infeasible or computationally expensive. These methods have been widely studied in the optimization literature [158] and are mainly based on estimating the gradient of the function from its values. Some recent machine learning problems, such as black-box attacks on deep neural networks [159], require solving min-max saddle point games. However, in these applications the gradient information is not directly available; the only accessible information is the function value. As a result, there have been many recent studies developing and analyzing zeroth-order algorithms for solving min-max games. Recent works such as [109, 110] have proposed algorithms with theoretical guarantees that solve min-max saddle point games using only function-value queries.
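A minimal sketch of the two-point random-direction gradient estimator that such zeroth-order methods typically build on is given below; the smoothing radius and the number of directions are illustrative choices.

    import numpy as np

    def zo_gradient(f, x, mu=1e-4, num_dirs=20, rng=None):
        """Two-point zeroth-order gradient estimate of f at x:
        average of d * (f(x + mu u) - f(x - mu u)) / (2 mu) * u over
        random unit directions u. Uses only function evaluations."""
        rng = rng or np.random.default_rng()
        d, g = x.size, np.zeros_like(x)
        for _ in range(num_dirs):
            u = rng.normal(size=d)
            u /= np.linalg.norm(u)
            g += d * (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
        return g / num_dirs

    # Sanity check against the true gradient of f(x) = ||x||^2 (grad = 2x).
    f = lambda x: x @ x
    x = np.array([1.0, -2.0, 0.5])
    print(zo_gradient(f, x, num_dirs=2000, rng=np.random.default_rng(0)))
    print(2 * x)   # the estimate above should be close to this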
Like classical zeroth-order minimization algorithms, these methods are based on estimating the gradient from zeroth-order information. However, they assume that the game has a nonconvex-strongly-concave structure. As future work, we will focus on relaxing this assumption to the broader classes of nonconvex-concave and nonconvex-PL games, defined below.

Definition 6. (Polyak-Łojasiewicz (PL) condition) A differentiable function h(x) with minimum value h* = min_x h(x) is said to be μ-Polyak-Łojasiewicz (μ-PL) if ‖∇h(x)‖² ≥ 2μ(h(x) − h*).

Based on the above definition, we introduce the following game.

Definition 7. (PL game) We say that the min-max game min_{θ∈Θ} max_{α∈A} f(θ, α) is a PL game if the max player is unconstrained, i.e., A = R^n, and there exists a constant μ > 0 such that the function h_θ(α) := −f(θ, α) is μ-PL for every fixed value of θ ∈ Θ.

4.2.3 Solving Min-Max Saddle Point Games for Small-Region Problems using Quadratic Approximation

In Section 3.3, we studied a general class of nonconvex-nonconcave min-max games optimized over a small region, i.e., the feasible region of one of the players is small. We showed that, under this assumption, a point that is an ε-FNE of the linear approximation of the objective function is also an ε-FNE of the original problem. More specifically, we first showed that a solution of

\min_{\theta\in\Theta}\max_{\alpha\in\mathcal{A}} f_{\mathrm{lin}}(\theta,\alpha) = f(\theta,\alpha_0) + \langle \nabla_{\alpha} f(\theta,\alpha_0), \alpha - \alpha_0\rangle

is also a solution of the original problem min_{θ∈Θ} max_{α∈A} f(θ, α), and we then proposed an algorithm capable of obtaining such a solution. However, this setup requires the radius of the region to be of order ε. We conjecture that optimizing the quadratic approximation instead of the linear one will relax this requirement and allow a larger radius. In other words, we aim to solve

\min_{\theta\in\Theta}\max_{\alpha\in\mathcal{A}} f_{\mathrm{quad}}(\theta,\alpha) = f_{\mathrm{lin}}(\theta,\alpha) + \tfrac{1}{2}\langle \alpha - \alpha_0,\ \nabla^2_{\alpha\alpha} f(\theta,\alpha_0)\,(\alpha - \alpha_0)\rangle.

As future work, we will study the relation between an ε-FNE of the quadratic approximation and that of the original function in the small-region setup, and we will propose an algorithm capable of finding an ε-FNE of the original function by solving its quadratic approximation.

4.2.4 Solving Min-Max Saddle Point Games using Gradient Unrolling

Consider the problem of finding an ε-FNE of the following unconstrained nonconvex-nonconcave saddle point game:

\min_{\theta\in\mathbb{R}^d}\max_{\alpha\in\mathbb{R}^d} f(\theta,\alpha).

Given the generality of the problem, one approach to finding such a point is exhaustive search, whose iteration complexity is exponential in d. As future work, we will try to answer the following question: can we develop an algorithm for finding an ε-FNE point whose iteration complexity is polynomial in d? As a first step, we propose the following method. Let α_0 be some given point (for example, α_0 = 0). Define

g_0(\theta) = f(\theta,\alpha_0)
g_1(\theta) = f(\theta,\alpha_1(\theta)), \quad \text{where } \alpha_1(\theta) = \alpha_0 + \eta_0 \nabla_{\alpha} f(\theta,\alpha_0)
g_2(\theta) = f(\theta,\alpha_2(\theta)), \quad \text{where } \alpha_2(\theta) = \alpha_1(\theta) + \eta_1 \nabla_{\alpha} f(\theta,\alpha_1(\theta))
\vdots
g_k(\theta) = f(\theta,\alpha_k(\theta)), \quad \text{where } \alpha_k(\theta) = \alpha_{k-1}(\theta) + \eta_{k-1} \nabla_{\alpha} f(\theta,\alpha_{k-1}(\theta)).

The idea is to choose k large enough and minimize g_k(θ); minimizing g_k(θ) for sufficiently large k should yield an ε-FNE. However, obtaining the gradient of g_k(θ) requires computing higher-order derivatives of f(·,·), which may be impractical. Therefore, we can rely on zeroth-order methods to minimize g_k(θ). As a future direction, we will pursue this approach to find an ε-FNE of f(·,·) by minimizing g_k(θ) with a zeroth-order method; a small sketch of the idea follows.
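The sketch below illustrates the unrolling idea on a toy game that is strongly concave in the max variable: g_k is evaluated by running k ascent steps, and the min variable is then updated with a two-point zeroth-order estimate of ∇g_k, so no derivatives of f are ever formed analytically. The objective, step sizes, and k are toy assumptions.

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy smooth game f(x, y); its saddle point is at the origin.
    f = lambda x, y: x @ y - 0.5 * (y @ y)

    def grad_y(x, y, eps=1e-6):
        """Coordinate-wise finite-difference gradient of f in y."""
        g = np.zeros_like(y)
        for i in range(y.size):
            e = np.zeros_like(y); e[i] = eps
            g[i] = (f(x, y + e) - f(x, y - e)) / (2 * eps)
        return g

    def g_k(x, y0, k=10, eta=0.5):
        """Unrolled objective: k ascent steps on y from y0, then evaluate f."""
        y = y0.copy()
        for _ in range(k):
            y = y + eta * grad_y(x, y)
        return f(x, y)

    # Zeroth-order descent on g_k over x (two-point random-direction estimate).
    x, y0, mu = np.array([1.0, -1.0]), np.zeros(2), 1e-3
    for _ in range(400):
        u = rng.normal(size=2); u /= np.linalg.norm(u)
        ghat = 2 * (g_k(x + mu * u, y0) - g_k(x - mu * u, y0)) / (2 * mu) * u
        x -= 0.1 * ghat
    print(x)   # approaches the saddle point at the origin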
4.2.5 Decentralized Algorithms for Solving Min-Max Games

Imagine n agents in a connected network trying to solve the following consensus optimization problem using the vanilla gradient descent algorithm:

\min_{x} f(x) = \sum_{i=1}^{n} f_i(x).

Assume further that each f_i(x) is available only to agent i. The most intuitive approach is to designate one agent as a fusion center that performs the gradient descent update, sends the current point to all other agents, and gathers their gradient information. In other words, at iteration k with the current point x_k, the fusion center sends x_k to all other agents, gathers their gradients ∇f_i(x_k), and performs the update

x_{k+1} = x_k - \eta \sum_{i=1}^{n} \nabla f_i(x_k).

This approach requires accurate synchronization and a large amount of communication among the nodes. Decentralization is the commonly used approach to overcome this issue [160]: each agent communicates only with its neighbors and performs the update internally. To describe this approach, suppose agent i is connected to agent j with weight 0 ≤ w_{ij} ≤ 1. At iteration k, agent i updates its solution x_k^i via

x_{k+1}^{i} = \Big(\sum_{j=1}^{n} w_{ij}\, x_{k}^{j}\Big) - \eta\, \nabla f_i(x_{k}^{i}).

Classical decentralized problems have been widely studied in the literature [161, 162]. However, extending these results to min-max saddle point games is challenging. As another future direction, we will focus on solving the following consensus min-max saddle point game in a distributed fashion:

\min_{\theta_1,\dots,\theta_N}\ \max_{\alpha_1,\dots,\alpha_N}\ \sum_{i=1}^{N} f_i(\theta_i,\alpha_i)    (4.1)
\text{s.t.}\quad \theta_1 = \dots = \theta_N,\ \alpha_1 = \dots = \alpha_N.

Define z_i = [θ_i, α_i] and the operator B_i(z) = [∇_θ f_i(z), −∇_α f_i(z)]. As shown in [44], finding the first-order Nash equilibria of problem (4.1) is equivalent to finding the roots of the following system:

\text{find } z_1,\dots,z_N \quad \text{s.t.}\quad \sum_{i=1}^{N} B_i(z_i) = 0,\qquad z_1 = \dots = z_N.    (4.2)

This reformulation opens the door to developing new algorithms for solving min-max games in a decentralized fashion. However, in the benchmark studies, the structure of the problem is assumed to be ρ-weakly convex-weakly concave. As future work, we will focus on solving (4.2) for general nonconvex-concave and nonconvex-PL games; a sketch of the decentralized update appears below.
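The following is a minimal sketch of the decentralized update above on a ring network with quadratic local objectives; the mixing matrix and objectives are toy assumptions. With a fixed step size the iterates reach only a small neighborhood of the consensus minimizer; exact convergence would require a decaying step size or corrected methods such as [161].

    import numpy as np

    n, d = 5, 2
    rng = np.random.default_rng(2)
    targets = rng.normal(size=(n, d))        # f_i(x) = 0.5 ||x - targets[i]||^2

    # Doubly stochastic mixing matrix W for a ring: each agent averages
    # itself and its two neighbors with weight 1/3.
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] = 1.0 / 3.0

    grad = lambda i, x: x - targets[i]       # local gradients
    X = rng.normal(size=(n, d))              # one iterate per agent
    eta = 0.01
    for _ in range(3000):
        X = W @ X - eta * np.vstack([grad(i, X[i]) for i in range(n)])

    print(X)                 # all rows end up close to the consensus minimizer
    print(targets.mean(0))   # minimizer of sum_i f_i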
References

1. Reddy, A., Ordway-West, M., Lee, M., Dugan, M., Whitney, J., Kahana, R., et al. Using gaussian mixture models to detect outliers in seasonal univariate network traffic in 2017 IEEE Security and Privacy Workshops (SPW) (2017), 229–234.
2. Barazandeh, B. & Razaviyayn, M. On the behavior of the expectation-maximization algorithm for mixture models in 2018 IEEE Global Conference on Signal and Information Processing (GlobalSIP) (2018), 61–65.
3. Jordan, M. I. & Jacobs, R. A. Hierarchical mixtures of experts and the EM algorithm. Neural computation 6, 181–214 (1994).
4. Blunsom, P. Hidden markov models. Lecture notes, August 15, 48 (2004).
5. Doersch, C. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908 (2016).
6. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. Generative adversarial nets in Advances in neural information processing systems (2014), 2672–2680.
7. Tevet, G., Habib, G., Shwartz, V. & Berant, J. Evaluating text gans as language models. arXiv preprint arXiv:1810.12686 (2018).
8. Jones, K. & Bonafilia, D. Gangogh: creating art with GANs. Towards Data Science (2017).
9. Pearson, K. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A 185, 71–110 (1894).
10. Moitra, A. & Valiant, G. Settling the polynomial learnability of mixtures of gaussians in 2010 IEEE 51st Annual Symposium on Foundations of Computer Science (2010), 93–102.
11. Nguyen, H. D. & McLachlan, G. On approximations via convolution-defined mixture models. Communications in Statistics-Theory and Methods 48, 3945–3955 (2019).
12. McLachlan, G. J. & Peel, D. Finite mixture models (John Wiley & Sons, 2004).
13. Titterington, D. M., Smith, A. F. & Makov, U. E. Statistical analysis of finite mixture distributions (Wiley, 1985).
14. Rossi, P. Bayesian non-and semi-parametric methods and applications (Princeton University Press, 2014).
15. Jin, C., Zhang, Y., Balakrishnan, S., Wainwright, M. J. & Jordan, M. I. Local maxima in the likelihood of Gaussian mixture models: Structural results and algorithmic consequences in Advances in Neural Information Processing Systems (2016), 4116–4124.
16. Richardson, E. & Weiss, Y. On gans and gmms in Advances in Neural Information Processing Systems (2018), 5847–5858.
17. Nash, J. Equilibrium points in n-person games in Proceedings of the national academy of sciences 36 (USA, 1950), 48–49.
18. Murty, K. & Kabadi, S. Some NP-complete problems in quadratic and nonlinear programming in Mathematical programming 39 (Springer, 1987), 117–129.
19. Farnia, F. & Ozdaglar, A. GANs May Have No Nash Equilibria. arXiv preprint arXiv:2002.09124 (2020).
20. Louppe, G., Kagan, M. & Cranmer, K. Learning to pivot with adversarial networks. Advances in neural information processing systems 30, 981–990 (2017).
21. Adel, T., Valera, I., Ghahramani, Z. & Weller, A. One-network adversarial fairness in Proceedings of the AAAI Conference on Artificial Intelligence 33 (2019), 2412–2420.
22. Madras, D., Creager, E., Pitassi, T. & Zemel, R. Learning Adversarially Fair and Transferable Representations in International Conference on Machine Learning (2018), 3384–3393.
23. Wu, S., Ren, X., Hong, Y. & Shi, L. Max-min fair sensor scheduling: game-theoretic perspective and algorithmic solution. IEEE Transactions on Automatic Control (2020).
24. Razaviyayn, M., Huang, T., Lu, S., Nouiehed, M., Sanjabi, M. & Hong, M. Nonconvex min-max optimization: Applications, challenges, and recent theoretical advances. IEEE Signal Processing Magazine 37, 55–66 (2020).
25. Liu, Y.-F., Dai, Y.-H. & Luo, Z.-Q. Max-min fairness linear transceiver design for a multi-user MIMO interference channel. IEEE Transactions on Signal Processing 61, 2413–2423 (2013).
26. Xie, H., Xu, J. & Liu, Y.-F. Max-Min Fairness in IRS-Aided Multi-Cell MISO Systems with Joint Transmit and Reflective Beamforming. arXiv preprint arXiv:2003.00906 (2020).
27. Juditsky, A., Nemirovski, A. & Tauvel, C. Solving variational inequalities with stochastic mirror-prox algorithm in Stochastic Systems 1 (2011), 17–58.
28. Rakhlin, S. & Sridharan, K. Optimization, learning, and games with predictable sequences in Advances in Neural Information Processing Systems (2013), 3066–3074.
29. Mertikopoulos, P., Zenati, H., Lecouat, B., Foo, C., Chandrasekhar, V. & Piliouras, G. Optimistic Mirror Descent in Saddle-Point Problems: Going the Extra (Gradient) Mile in ICLR'19-International Conference on Learning Representations (2019).
30. Daskalakis, C. & Panageas, I. Last-iterate convergence: Zero-sum games and constrained min-max optimization. Innovations in Theoretical Computer Science (2019).
31. Mokhtari, A., Ozdaglar, A. & Pattathil, S. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach in International Conference on Artificial Intelligence and Statistics (2020), 1497–1507.
32. Gidel, G., Jebara, T. & Lacoste-Julien, S. Frank-Wolfe algorithms for saddle point problems in Artificial Intelligence and Statistics (2017), 362–371.
33. Abernethy, J. & Wang, J. On Frank-Wolfe and equilibrium computation in Advances in Neural Information Processing Systems (2017), 6584–6593.
34. Hamedani, E. Y., Jalilzadeh, A., Aybat, N. & Shanbhag, U. Iteration complexity of randomized primal-dual methods for convex-concave saddle point problems. arXiv preprint arXiv:1806.04118 (2018).
35. Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
36. Reddi, S. J., Kale, S. & Kumar, S. On the convergence of adam and beyond in International Conference on Learning Representations (2018).
37. Chen, X., Liu, S., Sun, R. & Hong, M. On the convergence of a class of adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941 (2018).
38. Zaheer, M., Reddi, S., Sachan, D., Kale, S. & Kumar, S. Adaptive methods for nonconvex optimization in Advances in Neural Information Processing Systems (2018), 9815–9825.
39. Nazari, P., Tarzanagh, D. A. & Michailidis, G. Dadam: A consensus-based distributed adaptive gradient method for online optimization. arXiv preprint arXiv:1901.09109 (2019).
40. Nazari, P., Tarzanagh, D. A. & Michailidis, G. Adaptive First-and Zeroth-order Methods for Weakly Convex Stochastic Optimization Problems. arXiv preprint arXiv:2005.09261 (2020).
41. Dang, C. D. & Lan, G. On the convergence properties of non-euclidean extragradient methods for variational inequalities with generalized monotone operators. Computational Optimization and Applications 60, 277–310 (2015).
42. Lin, Q., Liu, M., Rafique, H. & Yang, T. Solving weakly-convex-weakly-concave saddle-point problems as weakly-monotone variational inequality. arXiv preprint arXiv:1810.10207 (2018).
43. Liu, M., Mroueh, Y., Ross, J., Zhang, W., Cui, X., Das, P., et al. Towards better understanding of adaptive gradient algorithms in Generative Adversarial Nets in International Conference on Learning Representations (2019).
44. Liu, W., Mokhtari, A., Ozdaglar, A., Pattathil, S., Shen, Z. & Zheng, N. A Decentralized Proximal Point-type Method for Saddle Point Problems. arXiv preprint arXiv:1910.14380 (2019).
45. Iusem, A. N., Jofré, A., Oliveira, R. I. & Thompson, P. Extragradient method with variance reduction for stochastic variational inequalities. SIAM Journal on Optimization 27, 686–724 (2017).
46. Daskalakis, C., Skoulakis, S. & Zampetakis, M. The complexity of constrained min-max optimization. arXiv preprint arXiv:2009.09623 (2020).
47. Nouiehed, M., Sanjabi, M., Huang, T., Lee, J. & Razaviyayn, M. Solving a class of non-convex min-max games using iterative first order methods in Advances in Neural Information Processing Systems (2019), 14905–14916.
48. Donoho, D. L. & Liu, R. C.
The "automatic" robustness of minimum distance functionals. The Annals of Statistics, 552–586 (1988).
49. Mei, S., Bai, Y., Montanari, A., et al. The landscape of empirical risk for nonconvex losses. The Annals of Statistics 46, 2747–2774 (2018).
50. Deb, P. & Holmes, A. M. Estimates of use and costs of behavioural health care: a comparison of standard and finite mixture models. Health economics 9, 475–489 (2000).
51. Gaffney, S. & Smyth, P. Trajectory clustering with mixtures of regression models in Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining (1999), 63–72.
52. Cong-Hua, X., Jin-Yi, C. & Wen-Bin, X. Medical image denoising by generalised Gaussian mixture modelling with edge information. IET Image Processing 8, 464–476 (2014).
53. Amin, T., Zeytinoglu, M. & Guan, L. Application of Laplacian mixture model to image and video retrieval. IEEE Transactions on Multimedia 9, 1416–1429 (2007).
54. Rabbani, H., Nezafat, R. & Gazor, S. Wavelet-domain medical image denoising using bivariate laplacian mixture model. IEEE transactions on biomedical engineering 56, 2826–2837 (2009).
55. Klein, B., Lev, G., Sadeh, G. & Wolf, L. Fisher vectors derived from hybrid Gaussian-Laplacian mixture models for image annotation. arXiv preprint arXiv:1411.7399 (2014).
56. Melnykov, V. & Maitra, R. Finite mixture models and model-based clustering. Statistics Surveys 4, 80–116 (2010).
57. Zhang, H. & Huang, Y. Finite mixture models and their applications: A review. Austin Biometrics and Biostatistics 2, 1–6 (2015).
58. Teicher, H. Identifiability of finite mixtures. The annals of Mathematical statistics, 1265–1269 (1963).
59. Allman, E. S., Matias, C. & Rhodes, J. A. Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, 3099–3132 (2009).
60. Day, N. E. Estimating the components of a mixture of normal distributions. Biometrika 56, 463–474 (1969).
61. Wolfe, J. H. Pattern clustering by multivariate mixture analysis. Multivariate Behavioral Research 5, 329–350 (1970).
62. Dasgupta, S. Learning mixtures of Gaussians in Foundations of computer science, 1999. 40th annual symposium on (1999), 634–644.
63. Vempala, S. & Wang, G. A spectral algorithm for learning mixture models. Journal of Computer and System Sciences 68, 841–860 (2004).
64. Arora, S. & Kannan, R. Learning mixtures of separated nonspherical Gaussians. The Annals of Applied Probability 15, 69–92 (2005).
65. Chaudhuri, K. & Rao, S. Learning Mixtures of Product Distributions Using Correlations and Independence in COLT 4 (2008), 9–20.
66. Dasgupta, S. & Schulman, L. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. Journal of Machine Learning Research 8, 203–226 (2007).
67. Hsu, D. & Kakade, S. M. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions in Proceedings of the 4th conference on Innovations in Theoretical Computer Science (2013), 11–20.
68. Belkin, M. & Sinha, K. Polynomial learning of distribution families in Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on (2010), 103–112.
69. Balakrishnan, S., Wainwright, M. J. & Yu, B. Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics 45, 77–120 (2017).
70. Xu, J., Hsu, D. J. & Maleki, A. Global analysis of expectation maximization for mixtures of two Gaussians in Advances in Neural Information Processing Systems (2016), 2676–2684.
71.
Daskalakis, C., Tzamos, C. & Zampetakis, M. Ten steps of EM suffice for mixtures of two Gaussians. arXiv preprint arXiv:1609.00368 (2016).
72. Bhowmick, D., Davison, A., Goldstein, D. R. & Ruffieux, Y. A Laplace mixture model for identification of differential expression in microarray experiments. Biostatistics 7, 630–641 (2006).
73. Mitianoudis, N. & Stathaki, T. Overcomplete source separation using Laplacian mixture models. IEEE Signal Processing Letters 12, 277–280 (2005).
74. Srebro, N. Are there local maxima in the infinite-sample likelihood of Gaussian mixture estimation? Lecture Notes in Computer Science 4539, 628 (2007).
75. Razaviyayn, M., Hong, M., Luo, Z.-Q. & Pang, J.-S. Parallel successive convex approximation for nonsmooth nonconvex optimization in Advances in Neural Information Processing Systems (2014), 1440–1448.
76. Razaviyayn, M., Hong, M. & Luo, Z.-Q. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization 23, 1126–1153 (2013).
77. Hong, M., Razaviyayn, M., Luo, Z.-Q. & Pang, J.-S. A unified algorithmic framework for block-structured optimization involving big data: With applications in machine learning and signal processing. IEEE Signal Processing Magazine 33, 57–77 (2016).
78. Chaganty, A. T. & Liang, P. Spectral experts for estimating mixtures of linear regressions in International Conference on Machine Learning (2013), 1040–1048.
79. Yi, X., Caramanis, C. & Sanghavi, S. Alternating minimization for mixed linear regression in International Conference on Machine Learning (2014), 613–621.
80. Shen, Y. & Sanghavi, S. Iterative Least Trimmed Squares for Mixed Linear Regression. arXiv preprint arXiv:1902.03653 (2019).
81. Wang, T. & Paschalidis, I. C. Convergence of Parameter Estimates for Regularized Mixed Linear Regression Models. arXiv preprint arXiv:1903.09235 (2019).
82. Carvalho, A. X. & Tanner, M. A. Mixtures-of-experts of autoregressive time series: asymptotic normality and model specification. IEEE Transactions on Neural Networks 16, 39–56 (2005).
83. Zhong, K., Jain, P. & Dhillon, I. S. Mixed linear regression with multiple components in Advances in neural information processing systems (2016), 2190–2198.
84. Kwon, J., Qian, W., Caramanis, C., Chen, Y. & Davis, D. Global convergence of the em algorithm for mixtures of two component linear regression in Conference on Learning Theory (2019), 2055–2110.
85. Dempster, A. P., Laird, N. M. & Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39, 1–22 (1977).
86. Qian, W., Zhang, Y. & Chen, Y. Global Convergence of Least Squares EM for Demixing Two Log-Concave Densities in Advances in Neural Information Processing Systems (2019), 4794–4802.
87. Moon, T. K. The expectation-maximization algorithm. IEEE Signal processing magazine 13, 47–60 (1996).
88. Kozick, R., Blum, R. & Sadler, B. Signal processing in non-Gaussian noise using mixture distributions and the EM algorithm in Conference Record of the Thirty-First Asilomar Conference on Signals, Systems and Computers (Cat. No. 97CB36136) 1 (1997), 438–442.
89. Nemirovsky, A. & Yudin, D. B. Problem complexity and method efficiency in optimization (1983).
90. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning 3, 1–122 (2011).
91.
Hong, M., Luo, Z.-Q. & Razaviyayn, M. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization 26, 337–364 (2016).
92. Hong, M. & Luo, Z.-Q. On the linear convergence of the alternating direction method of multipliers. Mathematical Programming 162, 165–199 (2017).
93. Hong, M., Chang, T.-H., Wang, X., Razaviyayn, M., Ma, S. & Luo, Z.-Q. A Block Successive Upper-Bound Minimization Method of Multipliers for Linearly Constrained Convex Optimization. Mathematics of Operations Research (2020).
94. Razaviyayn, M. Successive convex approximation: Analysis and applications (2014).
95. Saeed, A., Ilić, S. & Zangerle, E. Creative GANs for generating poems, lyrics, and metaphors. arXiv preprint arXiv:1909.09534 (2019).
96. Xu, D., Yuan, S., Zhang, L. & Wu, X. Fairgan: Fairness-aware generative adversarial networks in 2018 IEEE International Conference on Big Data (Big Data) (2018), 570–575.
97. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. & Courville, A. Improved training of Wasserstein Gans in Advances in neural information processing systems (2017), 5767–5777.
98. Sanjabi, M., Ba, J., Razaviyayn, M. & Lee, J. On the convergence and robustness of training GANs with regularized optimal transport in Advances in Neural Information Processing Systems (2018), 7091–7101.
99. Barazandeh, B., Razaviyayn, M. & Sanjabi, M. Training generative networks using random discriminators in 2019 IEEE Data Science Workshop, DSW 2019 (2019), 327–332.
100. Baharlouei, S., Nouiehed, M., Beirami, A. & Razaviyayn, M. Rényi Fair Inference in International Conference on Learning Representation (2020).
101. Madry, A., Makelov, A., Schmidt, L., Tsipras, D. & Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks in International Conference on Learning Representations (2018).
102. Berger, J. Statistical decision theory and Bayesian analysis in Springer Science & Business Media (2013).
103. Rafique, H., Liu, M., Lin, Q. & Yang, T. Non-convex min-max optimization: Provable algorithms and applications in machine learning. arXiv preprint arXiv:1810.02060 (2018).
104. Lu, S., Tsaknakis, I. & Hong, M. Block alternating optimization for non-convex min-max problems: algorithms and applications in signal processing and communications in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2019), 4754–4758.
105. Lu, S., Tsaknakis, I., Hong, M. & Chen, Y. Hybrid block successive approximation for one-sided non-convex min-max problems: algorithms and applications. IEEE Transactions on Signal Processing (2020).
106. Ostrovskii, D., Lowy, A. & Razaviyayn, M. Efficient Search of First-Order Nash Equilibria in Nonconvex-Concave Smooth Min-Max Problems. arXiv preprint arXiv:2002.07919 (2020).
107. Thekumparampil, K., Jain, P., Netrapalli, P. & Oh, S. Efficient Algorithms for Smooth Minimax Optimization in Advances in Neural Information Processing Systems (2019), 12659–12670.
108. Nesterov, Y. Introductory lectures on convex programming volume i: Basic course in Lecture notes 3 (1998), 5.
109. Liu, S., Lu, S., Chen, X., Feng, Y., Xu, K., Al-Dujaili, A., et al. Min-max optimization without gradients: Convergence and applications to adversarial ML. arXiv preprint arXiv:1909.13806 (2019).
110. Wang, Z., Balasubramanian, K., Ma, S. & Razaviyayn, M. Zeroth-order algorithms for nonconvex minimax problems with improved complexities. arXiv preprint arXiv:2001.07819 (2020).
111. Yang, J., Kiyavash, N. & He, N.
Global Convergence and Variance-Reduced Optimization for a Class of Nonconvex-Nonconcave Minimax Problems. arXiv preprint arXiv:2002.09621 (2020).
112. Harker, P. & Pang, J.-S. Finite-dimensional variational inequality and nonlinear complementarity problems: a survey of theory, algorithms and applications in Mathematical programming 48 (1990), 161–220.
113. Lin, T., Jin, C. & Jordan, M. On gradient descent ascent for nonconvex-concave minimax problems. arXiv preprint arXiv:1906.00331 (2019).
114. Pang, J.-S. & Razaviyayn, M. A unified distributed algorithm for non-cooperative games. 2016.
115. Bernhard, P. & Rapaport, A. On a theorem of Danskin with an application to a theorem of Von Neumann-Sion in Nonlinear analysis 24 (1995), 1163–1182.
116. Danskin, J. The Theory of Max-Min. Econometrics and Operations Research 5 in Springer Verlag (1967).
117. Beck, A. & Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems in SIAM journal on imaging sciences 2 (2009), 183–202.
118. Bottou, L., Curtis, F. E. & Nocedal, J. Optimization methods for large-scale machine learning. Siam Review 60, 223–311 (2018).
119. Jacobs, R. A. Increased rates of convergence through learning rate adaptation. Neural networks 1, 295–307 (1988).
120. Becker, S., Le Cun, Y., et al. Improving the convergence of back-propagation learning with second order methods in (1988).
121. Duchi, J., Hazan, E. & Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research 12 (2011).
122. McMahan, H. B. & Streeter, M. Adaptive bound optimization for online convex optimization. arXiv preprint arXiv:1002.4908 (2010).
123. Nesterov, Y. E. A method for solving the convex programming problem with convergence rate O(1/k²) in Dokl. akad. nauk Sssr 269 (1983), 543–547.
124. Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4, 1–17 (1964).
125. Arjovsky, M., Chintala, S. & Bottou, L. Wasserstein gan. arXiv preprint arXiv:1701.07875 (2017).
126. Yao, Z., Gholami, A., Xu, P., Keutzer, K. & Mahoney, M. W. Trust region based adversarial attack on neural networks in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019), 11350–11359.
127. Jin, C., Netrapalli, P. & Jordan, M. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618 (2019).
128. Gidel, G., Jebara, T. & Lacoste-Julien, S. Frank-Wolfe algorithms for saddle point problems in Artificial Intelligence and Statistics (2017), 362–371.
129. Daskalakis, C., Ilyas, A., Syrgkanis, V. & Zeng, H. Training GANs with optimism. arXiv preprint arXiv:1711.00141 (2017).
130. Barazandeh, B. & Razaviyayn, M. Solving Non-Convex Non-Differentiable Min-Max Games Using Proximal Gradient Method in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2020), 3162–3166.
131. Pang, J.-S. & Scutari, G. Nonconvex games with side constraints. SIAM Journal on Optimization 21, 1491–1522 (2011).
132. Zhou, D., Tang, Y., Yang, Z., Cao, Y. & Gu, Q. On the convergence of adaptive gradient methods for nonconvex optimization. arXiv preprint arXiv:1808.05671 (2018).
133. Facchinei, F. & Pang, J.-S. Finite-dimensional variational inequalities and complementarity problems (Springer Science & Business Media, 2007).
134. Scutari, G., Palomar, D. P., Facchinei, F. & Pang, J.-S.
Convex optimization, game theory, and variational inequality theory. IEEE Signal Processing Magazine 27, 35–49 (2010).
135. Kleinberg, R., Li, Y. & Yuan, Y. An alternative view: When does SGD escape local minima? arXiv preprint arXiv:1802.06175 (2018).
136. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A. & Chen, X. Improved techniques for training gans in Advances in neural information processing systems (2016), 2234–2242.
137. Goodfellow, I. J., Shlens, J. & Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014).
138. Carlini, N. & Wagner, D. Towards evaluating the robustness of neural networks in 2017 ieee symposium on security and privacy (sp) (2017), 39–57.
139. Nguyen, A., Yosinski, J. & Clune, J. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images in Proceedings of the IEEE conference on computer vision and pattern recognition (2015), 427–436.
140. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., et al. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013).
141. Lin, J. Divergence measures based on the Shannon entropy. IEEE Transactions on Information theory 37, 145–151 (1991).
142. Nowozin, S., Cseke, B. & Tomioka, R. f-gan: Training generative neural samplers using variational divergence minimization in Advances in neural information processing systems (2016), 271–279.
143. Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z. & Paul Smolley, S. Least squares generative adversarial networks in Proceedings of the IEEE International Conference on Computer Vision (2017), 2794–2802.
144. Zhao, J., Mathieu, M. & LeCun, Y. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126 (2016).
145. Bińkowski, M., Sutherland, D. J., Arbel, M. & Gretton, A. Demystifying MMD Gans. arXiv preprint arXiv:1801.01401 (2018).
146. Sun, Y., Gilbert, A. & Tewari, A. Random ReLU Features: Universality, Approximation, and Composition. arXiv preprint arXiv:1810.04374 (2018).
147. Villani, C. Optimal Transport: Old and New, volume 338 of A Series of Comprehensive Studies in Mathematics (2009).
148. Juditsky, A. & Nemirovski, A. Solving variational inequalities with monotone operators on domains given by linear minimization oracles. Mathematical Programming 156, 221–256 (2016).
149. O'Shea, K. & Nash, R. An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458 (2015).
150. Kuhn, H. W. The Hungarian method for the assignment problem. Naval research logistics quarterly 2, 83–97 (1955).
151. Bertsekas, D. P. The auction algorithm: A distributed relaxation method for the assignment problem. Annals of operations research 14, 105–123 (1988).
152. Li, K. & Malik, J. On the Implicit Assumptions of GANs. arXiv preprint arXiv:1811.12402 (2018).
153. Li, K. & Malik, J. Implicit maximum likelihood estimation. arXiv preprint arXiv:1809.09087 (2018).
154. LeCun, Y., Cortes, C. & Burges, C. MNIST handwritten digit database, 1998. URL http://www.research.att.com/~yann/ocr/mnist (1998).
155. Xiao, H., Rasul, K. & Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
156. Bellemare, M. G., Danihelka, I., Dabney, W., Mohamed, S., Lakshminarayanan, B., Hoyer, S., et al. The cramer distance as a solution to biased wasserstein gradients. arXiv preprint arXiv:1705.10743 (2017).
157. Barratt, S. & Sharma, R. A note on the inception score. arXiv preprint arXiv:1801.01973 (2018).
158. Spall, J. C. Introduction to stochastic search and optimization: estimation, simulation, and control (John Wiley & Sons, 2005).
159. Chen, P.-Y., Zhang, H., Sharma, Y., Yi, J. & Hsieh, C.-J. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (2017), 15–26.
160. Yuan, K., Ling, Q. & Yin, W. On the convergence of decentralized gradient descent. SIAM Journal on Optimization 26, 1835–1854 (2016).
161. Shi, W., Ling, Q., Wu, G. & Yin, W. Extra: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization 25, 944–966 (2015).
162. Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W. & Liu, J. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent in Advances in Neural Information Processing Systems (2017), 5330–5340.
163. Karimi, H., Nutini, J. & Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition in Joint European Conference on Machine Learning and Knowledge Discovery in Databases (2016), 795–811.
164. Bubeck, S. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning 8, 231–357 (2015).
165. Nesterov, Y. Gradient methods for minimizing composite functions. Mathematical Programming 140, 125–161 (2013).
166. Tran, P. T. et al. On the convergence proof of amsgrad and a new version. IEEE Access 7, 61706–61716 (2019).
167. Shapiro, A., Dentcheva, D. & Ruszczyński, A. Lectures on stochastic programming: modeling and theory (SIAM, 2009).

Appendix

A.1 Proofs of Chapter 2

Lemma 1. The derivative of the mapping M(·) with respect to λ is positive, i.e., 0 < ∂/∂λ M(λ, μ).

Proof: First notice that ∂/∂λ M(λ, μ) equals

\mathbb{E}_{x\sim \mathcal{L}(x;\mu)}\left[\frac{2x\,\big(\mathrm{sign}(x-\lambda)+\mathrm{sign}(x+\lambda)\big)\, e^{-|x-\lambda|-|x+\lambda|}}{\big(e^{-|x-\lambda|}+e^{-|x+\lambda|}\big)^2}\right].

We prove the lemma for the following two cases separately.

Case 1) μ < λ:

\frac{\partial M}{\partial \lambda} = \frac{2}{(e^{\lambda}+e^{-\lambda})^2}\Big(-e^{-\mu}\int_{-\infty}^{-\lambda} x e^{x}\,dx + e^{\mu}\int_{\lambda}^{\infty} x e^{-x}\,dx\Big)
= \frac{2(\lambda+1)e^{-\lambda}(e^{-\mu}+e^{\mu})}{(e^{-\lambda}+e^{\lambda})^2}
= \frac{(\lambda+1)e^{-\lambda}\cosh(\mu)}{\cosh(\lambda)^2} > 0.

Case 2) μ > λ:

\frac{\partial M}{\partial \lambda} = \frac{2e^{-\mu}}{(e^{\lambda}+e^{-\lambda})^2}\Big(-\int_{-\infty}^{-\lambda} x e^{x}\,dx + \int_{\lambda}^{\mu} x e^{x}\,dx + e^{2\mu}\int_{\mu}^{\infty} x e^{-x}\,dx\Big)
= \frac{2}{(e^{\lambda}+e^{-\lambda})^2}\Big(e^{-\mu}\big[(\lambda+1)e^{-\lambda} - (\lambda-1)e^{\lambda}\big] + 2\mu\Big)
\ge \frac{2}{(e^{\lambda}+e^{-\lambda})^2}\Big(e^{-\mu}\big[(\mu+1)e^{-\mu} - (\mu-1)e^{\mu}\big] + 2\mu\Big) > 0,

where the last inequality holds because e^{-μ}[(μ+1)e^{-μ} − (μ−1)e^{μ}] + 2μ = (μ+1)e^{-2μ} + (μ+1) > 0.

Lemma 2. For 0 < λ < η, we have

\frac{\partial}{\partial \eta} M(\lambda,\eta) = 1 - \frac{2(\lambda e^{-\eta} + e^{-\lambda})}{e^{\lambda}+e^{-\lambda}} > 1 - \frac{2(\lambda e^{-\lambda} + e^{-\lambda})}{e^{\lambda}+e^{-\lambda}} > 0.

Proof: When η > λ, it is not hard to show that

M(\lambda,\eta) = \tfrac{1}{2} e^{-\eta}\Big[\tanh(\lambda)(\lambda+1)e^{-\lambda} + (\lambda-1)e^{\lambda} + (\lambda+1)e^{-\lambda} - (\lambda-1)e^{\lambda}\tanh(\lambda)\Big] + \tanh(\lambda)\,\eta.

Hence,

\frac{\partial M}{\partial \eta} = -e^{-\eta}\Big(\frac{\lambda+1}{e^{\lambda}+e^{-\lambda}} + \frac{\lambda-1}{e^{\lambda}+e^{-\lambda}}\Big) + \tanh(\lambda)
= \frac{e^{\lambda} - e^{-\lambda} - 2\lambda e^{-\eta}}{e^{\lambda}+e^{-\lambda}}
= 1 - \frac{2(\lambda e^{-\eta}+e^{-\lambda})}{e^{\lambda}+e^{-\lambda}}
> 1 - \frac{2(\lambda e^{-\lambda}+e^{-\lambda})}{e^{\lambda}+e^{-\lambda}} > 0,

where the last two inequalities are due to the facts that λ < η and that (λe^{-λ} + e^{-λ})/(e^{λ} + e^{-λ}) < 1/2.

A.1.1 Proof of Theorem 1

Theorem 1. Without loss of generality, assume that λ_0, μ* > 0. Then the EM iterate defined in (2.2) is a contraction, i.e.,

\frac{\lambda_{t+1} - \mu^*}{\lambda_t - \mu^*} < \kappa < 1, \quad \forall t,

where κ = max{κ_1, κ_2}, κ_1 = (μ* + 1)e^{-μ*}/cosh(μ*), and κ_2 = 2(λ_0 e^{-λ_0} + e^{-λ_0})/(e^{λ_0} + e^{-λ_0}).

Proof: First, according to the Mean Value Theorem, there exists ξ between λ_t and μ* such that

\frac{\lambda_{t+1} - \mu^*}{\lambda_t - \mu^*} = \frac{M(\lambda_t,\mu^*) - M(\mu^*,\mu^*)}{\lambda_t - \mu^*} = \frac{\partial}{\partial \lambda} M(\lambda,\mu^*)\Big|_{\lambda=\xi} > 0,

where the inequality is due to Lemma 1. Thus, λ_t does not change sign during the algorithm. Consider two different regions: μ* > λ and μ* < λ.
When μ* < λ, Case 1 in the proof of Lemma 1 implies that

\frac{\partial M}{\partial \lambda}\Big|_{\lambda=\xi} = \frac{(\xi+1)e^{-\xi}\cosh(\mu^*)}{\cosh(\xi)^2} \le \frac{(\mu^*+1)e^{-\mu^*}}{\cosh(\mu^*)} = \kappa_1 < 1.

The last two inequalities are due to the fact that μ* < ξ < λ, and the fact that f(ξ) = (ξ+1)e^{-ξ}/cosh(ξ)² is a positive and decreasing function on R_+ with f(0) = 1. On the other hand, when μ* > λ, the Mean Value Theorem implies that

\frac{\lambda_{t+1} - \mu^*}{\lambda_t - \mu^*} = \frac{\lambda_{t+1} - \lambda_t}{\lambda_t - \mu^*} + 1 = \frac{M(\lambda_t,\mu^*) - M(\lambda_t,\lambda_t)}{\lambda_t - \mu^*} + 1 = 1 - \frac{\partial}{\partial \mu^*} M(\lambda_t,\mu^*)\Big|_{\mu^*=\eta}
\le \frac{2(\lambda_t e^{-\lambda_t}+e^{-\lambda_t})}{e^{\lambda_t}+e^{-\lambda_t}} \le \frac{2(\lambda_0 e^{-\lambda_0}+e^{-\lambda_0})}{e^{\lambda_0}+e^{-\lambda_0}} = \kappa_2 < 1,

where the last two inequalities are due to Lemma 2 and the facts that 1) λ_t does not change sign (and, in this region, increases toward μ*, so λ_t ≥ λ_0), and 2) the function f(λ) = 2(λe^{-λ} + e^{-λ})/(e^{λ} + e^{-λ}) is positive and decreasing on R_+ with f(0) = 1. Hence, (λ_{t+1} − μ*)/(λ_t − μ*) < κ_2 < 1. Combining the above two cases completes the proof.

A.1.2 Proof of Theorem 2

To proceed with the proof of Theorem 2, we first briefly review the Successive Upper-bound Minimization (SUM) approach [76]. Consider solving the optimization problem min_{x∈X} f(x) by the iterative procedure described in Algorithm 10.

Algorithm 10 Successive Upper-bound Minimization (SUM)
1: Input: x_0: initial point in X; N_Itr: maximum number of iterations; u(·,·) : X × X → R: an approximation function.
2: for r = 1 : N_Itr do
3:   x_r = argmin_{x∈X} u(x, x_{r−1})
4: end for

In the above algorithm, the function u(·,·) is an approximation of f(·) and is usually easier to optimize than the original function. From [76], we have the following result for any limit point generated by Algorithm 10.

Theorem 8 (rephrased from [76]). Assume f(x) : X → R and u(x, y) : X × X → R are continuously differentiable functions satisfying the conditions

u(y,y) = f(y), \qquad u(x,y) \ge f(x), \quad \forall x, y \in \mathcal{X}.

Then every limit point of the iterates generated by Algorithm 10 is a stationary point of the problem min_{x∈X} f(x).

Now we can proceed with the proof of Theorem 2.

Theorem 2. Consider the following optimization problem:

\max_{\mu}\; \mathbb{E}_{\mu^*}\Big[\log \sum_{k=1}^{K} \tfrac{1}{K} f(x;\mu_k)\Big] - \tfrac{M}{2}\Big\|\sum_{k=1}^{K}\mu_k\Big\|^2.    (3)

Any limit point of the following iterative procedure is a stationary point of the above optimization problem:

w_k^t(x) := \frac{f(x;\mu_k^t)}{\sum_{j=1}^{K} f(x;\mu_j^t)}, \quad \forall k,    (4)

\mu_k^{t+1} = \frac{\mathbb{E}_{\mu^*}[x\, w_k^t(x)] + MK\mu_k^t - M\sum_{j=1}^{K}\mu_j^t}{MK + \mathbb{E}_{\mu^*}[w_k^t(x)]}, \quad \forall k.    (5)

Proof: Let g(μ) = E_{μ*}[log Σ_k (1/K) f(x; μ_k)] − (M/2)‖Σ_k μ_k‖² be the objective function of (3), and define

\hat{g}(\mu,\mu^t) := \mathbb{E}_{\mu^*}\Big[\sum_k w_k^t(x)\log\frac{f(x;\mu_k)}{f(x;\mu_k^t)}\Big] - \tfrac{MK}{2}\sum_k \|\mu_k - \mu_k^t\|^2 - M\Big\langle \sum_k(\mu_k - \mu_k^t),\ \sum_k \mu_k^t\Big\rangle + g(\mu^t).

We have

\mathbb{E}_{\mu^*}\Big[\sum_k w_k^t(x)\log\frac{f(x;\mu_k)}{f(x;\mu_k^t)}\Big] + \mathbb{E}_{\mu^*}\Big[\log\sum_k \tfrac{1}{K} f(x;\mu_k^t)\Big]
\le \mathbb{E}_{\mu^*}\Big[\log\sum_k w_k^t(x)\frac{f(x;\mu_k)}{f(x;\mu_k^t)}\Big] + \mathbb{E}_{\mu^*}\Big[\log\sum_k \tfrac{1}{K} f(x;\mu_k^t)\Big]
= \mathbb{E}_{\mu^*}\Big[\log\Big(\sum_k \frac{f(x;\mu_k^t)}{\sum_{j=1}^{K} f(x;\mu_j^t)}\cdot\frac{f(x;\mu_k)}{f(x;\mu_k^t)}\cdot \sum_k \tfrac{1}{K} f(x;\mu_k^t)\Big)\Big]
= \mathbb{E}_{\mu^*}\Big[\log\sum_k \tfrac{1}{K} f(x;\mu_k)\Big],    (6)

where the first inequality follows from Jensen's inequality. Moreover,

\tfrac{M}{2}\Big\|\sum_k \mu_k\Big\|^2 = \tfrac{M}{2}\Big\|\sum_k(\mu_k - \mu_k^t) + \sum_k \mu_k^t\Big\|^2
= \tfrac{M}{2}\Big\|\sum_k(\mu_k - \mu_k^t)\Big\|^2 + \tfrac{M}{2}\Big\|\sum_k \mu_k^t\Big\|^2 + M\Big\langle \sum_k(\mu_k - \mu_k^t),\ \sum_k \mu_k^t\Big\rangle
\le \tfrac{MK}{2}\sum_k \|\mu_k - \mu_k^t\|^2 + \tfrac{M}{2}\Big\|\sum_k \mu_k^t\Big\|^2 + M\Big\langle \sum_k(\mu_k - \mu_k^t),\ \sum_k \mu_k^t\Big\rangle,    (7)

where the last inequality follows from the Cauchy-Schwarz inequality.
Now, by combining (6) and (7) we have g(μ μ μ)≥b g(μ μ μ,μ μ μ t ). Additionally, it is easy to verify that g(μ μ μ t ) =b g(μ μ μ t ,μ μ μ t ). Thus the assumptions of Theorem 8 are satisfied. Furthermore, notice that the iterate (4) is obtained based on the update rule μ μ μ t+1 = argmin μ μ μ −b g(μ μ μ,μ μ μ t ). Therefore, Theorem 8 implies that every limit point of the algorithm is an stationary point. 99 A.2 Proofs of Chapter 3 A.2.1 Discussions on Remark 1 Consider the optimization problem min z∈Z F (z), (8) in which the setZ is bounded and convex; and F (·) :R n 7→R is `-smooth, i.e., k∇F (z 1 )−∇F (z 2 )k≤`kz 1 −z 2 k. One of the commonly used definitions of -stationary point for the optimization problem (8) is as follows. Definition 8. (-stationary point of the first type) A point ¯ z is said to be an -stationary point of the first type if P Z ¯ z− 1 ` ∇F (¯ z) − ¯ z ≤ ` , (9) whereP Z (·) represents the projection operator to the feasible setZ. Another notion of stationarity, which is used in this paper (as well as other works includ- ing [163]), is defined as follows. Definition 9. (-stationary point of the second type) A point ¯ z is said to be an -stationary point of the second type for the optimization problem (8) if D(¯ z)≤ 2 , (10) 100 where D(¯ z),−2`min z∈Z h h∇F (¯ z),z− ¯ zi + ` 2 kz− ¯ zk 2 i . The following theorem shows that the stationarity definition in (10) is strictly stronger than the stationarity definition in (9). Theorem 9. The -stationary concept of the second type is stronger than the -stationary concept of the first type. In particular, if a point ¯ z satisfies (10), then it must also satisfy (9). Moreover, there exist an optimization problem with a given feasible point ¯ z such that ¯ z is -stationary point of the first type, but it is not 0 -stationary point of the second type for any 0 < √ 2 + 2 . Proof: We first show that (10) implies (9), i.e., if D(¯ z)≤ 2 thenkP Z (¯ z + (1/`)∇F (¯ z))k≤ /`. From definition of D(¯ z), we have D(¯ z),− 2`min z∈Z h∇F (¯ z),z− ¯ zi + ` 2 ||z− ¯ z|| 2 =−` 2 min z∈Z 2 ` h∇F (¯ z),z− ¯ zi +||z− ¯ z|| 2 =−` 2 min z∈Z kz− ¯ z + 1 ` ∇F (¯ z)k 2 − 1 ` 2 k∇F (¯ z)k 2 =−` 2 kP Z (¯ z− 1 ` ∇F (¯ z))− (¯ z− 1 ` ∇F (¯ z))k 2 +k∇F (¯ z)k 2 . Defining ˆ z = ¯ z− 1 ` ∇F (¯ z), we get D(¯ z) =−` 2 kP Z (ˆ z)− ˆ zk 2 +k∇F (¯ z)k 2 . (11) On the other hand, as shown in Figure 1, the direct application of cosine equality implies that k 1 ` ∇F (¯ z)k 2 =kP Z (ˆ z)− ¯ zk 2 +kP Z (ˆ z)− ˆ zk 2 − 2(kP Z (ˆ z)− ¯ zk)(kP Z (ˆ z)− ˆ zk)cosγ, (12) 101 Figure 1: Relation between different notions of stationarity Figure 2: (left): γ =π, two measures have the largest deviation, (right): γ = π 2 , two measures coincide where γ is the angle between the two vectors ¯ z−P Z (ˆ z) and ˆ z−P Z (ˆ z). Moreover, from [164, Lemma 3.1] we know that cosγ≤ 0. As a result, ` 2 kP Z (ˆ z)− ¯ zk 2 ≤−` 2 kP Z (ˆ z)− ˆ zk 2 +k∇F (¯ z)k 2 =D(¯ z), where the last equality is due to (11). Furthermore, since ¯ z is an -stationery point, i.e., D(¯ z)≤ 2 , we conclude thatkP Z (ˆ z)− ¯ zk≤/`. In other words, ¯ z is an -stationary point of the first type. Next we show that the stationarity concept in (10) is strictly stronger than the stationarity concept in (9). To understand this, let us take an additional look at Figure 1 and equation (12) used in the proof above. Clearly, the two stationarity measures could coincide when cosγ = 0. Moreover, the two notions have the largest gap when cosγ =−1. Figure 2 shows both of these scenarios. 
According to Figure 2, in order to create an example with 102 largest gap between the two stationarity notions, we need to construct an example with the smallest possible value of cosγ. In particular, consider the optimization problem min z 1 2 z 2 s.t. z≥ 1. It is easy to check that the point ¯ z = 1+ is an-stationary point of the first type, while it is not an 0 -stationary point of the second type for any 0 < √ 2 + 2 . Next, we re-state the lemmas used in the main body of the paper and present detailed proof of them. A.2.2 Proof of Lemma 3 Lemma 3. Letg(θ θ θ) = max α α α∈A h(θ θ θ,α α α)−p(α α α) in which the functionh(θ θ θ,α α α) isσ-strongly concave inα α α for any givenθ θ θ. Then, under Assumption 3, the functiong(θ θ θ) is differentiable. Moreover, its gradient is L g -Lipschitz continuous, i.e., k∇g(θ θ θ 1 )−∇g(θ θ θ 2 )k≤L g kθ θ θ 1 −θ θ θ 2 k, where L g =L 11 + L 2 12 σ . Proof: The differentiability of the function g(θ θ θ) is obvious from Danskin’s Theorem 3. In order to find the gradient’s Lipschitz constant, define l(θ θ θ,α α α) =−h(θ θ θ,α α α) +p(α α α). Let α α α ∗ 1 = argmin α α α∈A l(θ θ θ 1 ,α α α) and α α α ∗ 2 = argmin α α α∈A l(θ θ θ 2 ,α α α). Due to σ–strong convexity of l(θ θ θ,α α α) in α α α for any given θ θ θ, we have l(θ θ θ 2 ,α α α ∗ 2 ) ≥l(θ θ θ 2 ,α α α ∗ 1 ) +l 0 (θ θ θ 2 ,α α α ∗ 1 ;α α α ∗ 2 −α α α ∗ 1 ) + σ 2 kα α α ∗ 2 −α α α ∗ 1 k 2 , 103 and l(θ θ θ 2 ,α α α ∗ 1 ) ≥l(θ θ θ 2 ,α α α ∗ 2 ) +l 0 (θ θ θ 2 ,α α α ∗ 2 ;α α α ∗ 1 −α α α ∗ 2 ) + σ 2 kα α α ∗ 2 −α α α ∗ 1 k 2 . Furthermore, due to optimality of α α α ∗ 2 , l 0 (θ θ θ 2 ,α α α ∗ 2 ;α α α ∗ 1 −α α α ∗ 2 )≥ 0. As a result, by adding the above two inequalities we have −l 0 (θ θ θ 2 ,α α α ∗ 1 ;α α α ∗ 2 −α α α ∗ 1 )≥σkα α α ∗ 2 −α α α ∗ 1 k 2 . (13) On the other hand, from optimality of α α α ∗ 1 , we have l 0 (θ θ θ 1 ,α α α ∗ 1 ;α α α ∗ 2 −α α α ∗ 1 )≥ 0. (14) Now, by adding (13) and (14) we get σkα α α ∗ 2 −α α α ∗ 1 k 2 ≤−l 0 (θ θ θ 2 ,α α α ∗ 1 ;α α α ∗ 2 −α α α ∗ 1 ) +l 0 (θ θ θ 1 ,α α α ∗ 1 ;α α α ∗ 2 −α α α ∗ 1 ) =h∇ α α α h(θ θ θ 2 ,α α α ∗ 1 ),α α α ∗ 2 −α α α ∗ 1 i−p 0 (α α α ∗ 1 ;α α α ∗ 2 −α α α ∗ 1 )−h∇ α α α h(θ θ θ 1 ,α α α ∗ 1 ),α α α ∗ 2 −α α α ∗ 1 i +p 0 (α α α ∗ 1 ;α α α ∗ 2 −α α α ∗ 1 ) =h∇ α α α h(θ θ θ 2 ,α α α ∗ 1 )−∇ α α α h(θ θ θ 1 ,α α α ∗ 1 ),α α α ∗ 2 −α α α ∗ 1 i ≤L 12 kθ θ θ 1 −θ θ θ 2 kkα α α ∗ 2 −α α α ∗ 1 k, where the last inequality holds by Cauchy-Schwartz and the Lipschitzness from Assumption 3. As a result, we get kα α α ∗ 2 −α α α ∗ 1 k≤ L 12 σ kθ θ θ 1 −θ θ θ 2 k. (15) 104 Now, Theorem 3 implies that k∇ θ θ θ g(θ θ θ 1 )−∇ θ θ θ g(θ θ θ 2 )k =k∇ θ θ θ h(θ θ θ 1 ,α α α ∗ 1 )−∇ θ θ θ h(θ θ θ 2 ,α α α ∗ 2 )k =k∇ θ θ θ h(θ θ θ 1 ,α α α ∗ 1 )−∇ θ θ θ h(θ θ θ 2 ,α α α ∗ 1 ) +∇ θ θ θ h(θ θ θ 2 ,α α α ∗ 1 )−∇ θ θ θ h(θ θ θ 2 ,α α α ∗ 2 )k ≤L 11 kθ θ θ 1 −θ θ θ 2 k +L 12 kα α α ∗ 1 −α α α ∗ 2 k ≤ L 11 + L 2 12 σ kθ θ θ 1 −θ θ θ 2 k, where the last inequality is due to (15). Lemma 7. (Rephrased from [117, 165]) Assume F(x) =m(x) +n(x), where m(x) is σ- strongly convex and L-smooth, n(x) is convex and possibly non-smooth (and possibly extended real-valued). Then, by applying accelerated proximal gradient descent algorithm with restart parameter N, q 8L/σ− 1 for K iterations, with K being a constant multiple of N, we get F (x K )−F (x ∗ )≤ 1 2 K/N (F (x 0 )−F (x ∗ )), (16) where x K is the iterate obtained at iteration K and x ∗ , argmin x F (x). Lemma 8. 
Let α α α t+1 to be the output of the accelerated proximal gradient descent in Al- gorithm 5 at iteration t. Assume κ = L 22 σ ≥ 1, and g(θ θ θ t )− (h(θ θ θ t ,α α α 0 (θ θ θ t ))−p(α α α 0 (θ θ θ t )))< Δ. Then for any prescribed ∈ (0,1), choose K large enough such that K≥ 2 √ 8κ logL 22 + log (2L 22 R +g max +L p +R) + 2log 1 + 1 2 log 2Δ σ !! , where g max = max α α α∈A k∇ α α α h(θ θ θ t ,α α α)k. Then the error e t ,∇ θ θ θ h(θ θ θ t ,α α α t+1 )−∇g(θ θ θ t ) has a norm ke t k≤δ, L 12 L 22 (2L 22 R +g max +L p +R) 2 . 105 and 2 ≥Y(θ θ θ t ,α α α t+1 ),L 22 max α α α∈A h∇ α α α f(θ θ θ t ,α α α t+1 ),α α α−α α α t+1 i−p(α α α) +p(α α α t+1 )− L 22 2 ||α α α−α α α t+1 || 2 Proof: From Lemma 7 we have, g(θ θ θ t )− h(θ θ θ t ,α α α t+1 )−p(α α α t+1 ) ≤ 1 2 K √ 8κ Δ. (17) Let α α α ∗ (θ θ θ t ), argmax α α α∈A h(θ θ θ t ,α α α)−p(α α α). By combining (17) and strong concavity of h(θ θ θ t ,α α α)− p(α α α) in α α α, we get σ 2 kα α α t+1 −α α α ∗ (θ θ θ t )k 2 ≤g(θ θ θ t )− h(θ θ θ t ,α α α t+1 )−p(α α α t+1 ) ≤ 1 2 K √ 8κ Δ. Combining this inequality with Assumption 3 implies that ke t k =k∇ θ θ θ h(θ θ θ t ,α α α t+1 )−∇g(θ θ θ t )k =k∇ θ θ θ h(θ θ θ t ,α α α t+1 )−∇ θ θ θ f(θ θ θ t ,α α α t+1 )k ≤L 12 kα α α t+1 −α α α ∗ (θ θ θ t )k≤ L 12 2 K/2 √ 8κ s 2Δ σ ≤ L 12 L 22 (2L 22 R +g max +L p +R) 2 , where the last inequality comes from our choice of K. Next, let us prove the second part of the lemma. First notice that by some algebraic manipulations, we can write 1 2L 22 Y(θ θ θ t ,α α α t+1 ) =max α α α∈A h∇ α α α f(θ θ θ t ,α α α t+1 ),α α α−α α α t+1 i−p(α α α) +p(α α α t+1 )− L 22 2 ||α α α−α α α t+1 || 2 =max α α α∈A h∇ α α α f(θ θ θ t ,α α α t+1 ),α α α−α α α t+1 i−p(α α α) +p(α α α t+1 )− L 22 2 ||α α α−α α α ∗ (θ θ θ t ) +α α α ∗ (θ θ θ t )−α α α t+1 || 2 =max α α α∈A h∇ α α α f(θ θ θ t ,α α α t+1 ),α α α−α α α t+1 i−p(α α α) +p(α α α t+1 ) − L 22 2 kα α α−α α α ∗ (θ θ θ t )k 2 − L 22 2 kα α α ∗ (θ θ θ t )−α α α t+1 k 2 −L 22 hα α α−α α α ∗ (θ θ θ t ),α α α ∗ (θ θ θ t )−α α α t+1 i 106 =max α α α∈A h∇ α α α f(θ θ θ t ,α α α t+1 ),α α α−α α α t+1 i−p(α α α) +p(α α α t+1 )− L 22 2 kα α α−α α α ∗ (θ θ θ t )k 2 − L 22 2 kα α α ∗ (θ θ θ t )−α α α t+1 k 2 −L 22 hα α α−α α α ∗ (θ θ θ t ),α α α ∗ (θ θ θ t )−α α α t+1 i−p(α α α ∗ (θ θ θ t )) +p(α α α ∗ (θ θ θ t )) . Thus, we obtain 1 2L 22 Y(θ θ θ t ,α α α t+1 ) ≤max α α α∈A h∇ α α α f(θ θ θ t ,α α α t+1 ),α α α−α α α t+1 i−p(α α α) +p(α α α t+1 ) − L 22 2 kα α α−α α α ∗ (θ θ θ t )k 2 −L 22 hα α α−α α α ∗ (θ θ θ t ),α α α ∗ (θ θ θ t )−α α α t+1 i−p(α α α ∗ (θ θ θ t )) +p(α α α ∗ (θ θ θ t )) =max α α α∈A h∇ α α α f(θ θ θ t ,α α α t+1 )−∇ α α α f(θ θ θ t ,α α α ∗ (θ θ θ t )),α α α−α α α t+1 i +h∇ α α α f(θ θ θ t ,α α α ∗ (θ θ θ t )),α α α−α α α t+1 i−p(α α α) +p(α α α t+1 ) − L 22 2 kα α α−α α α ∗ (θ θ θ t )k 2 −L 22 hα α α−α α α ∗ (θ θ θ t ),α α α ∗ (θ θ θ t )−α α α t+1 i−p(α α α ∗ (θ θ θ t )) +p(α α α ∗ (θ θ θ t )) ≤max α α α∈A h∇ α α α f(θ θ θ t ,α α α t+1 )−∇ α α α f(θ θ θ t ,α α α ∗ (θ θ θ t )),α α α−α α α t+1 i +h∇ α α α f(θ θ θ t ,α α α ∗ (θ θ θ t )),α α α ∗ (θ θ θ t )−α α α t+1 i +p(α α α t+1 )−p(α α α ∗ (θ θ θ t ))−L 22 hα α α−α α α ∗ (θ θ θ t ),α α α ∗ (θ θ θ t )−α α α t+1 i + max α α α∈A h∇ α α α f(θ θ θ t ,α α α ∗ (θ θ θ t )),α α α−α α α ∗ (θ θ θ t )i−p(α α α) +p(α α α ∗ (θ θ θ t ))− L 22 2 kα α α−α α α ∗ (θ θ θ t )k 2 | {z } =0 . 
Thus, 1 2L 22 Y(θ θ θ t ,α α α t+1 ) ≤max α α α∈A h∇ α α α f(θ θ θ t ,α α α t+1 )−∇ α α α f(θ θ θ t ,α α α ∗ (θ θ θ t )),α α α−α α α t+1 i +h∇ α α α f(θ θ θ t ,α α α ∗ (θ θ θ t )),α α α ∗ (θ θ θ t )−α α α t+1 i +p(α α α t+1 )−p(α α α ∗ (θ θ θ t ))−L 22 hα α α−α α α ∗ (θ θ θ t ),α α α ∗ (θ θ θ t )−α α α t+1 i ≤L 22 kα α α t+1 −α α α ∗ (θ θ θ t )kR +g max kα α α t+1 −α α α ∗ (θ θ θ t )k +L p kα α α t+1 −α α α ∗ (θ θ θ t )k +RL 22 kα α α t+1 −α α α ∗ (θ θ θ t )k ≤(L 22 R +g max +L p +RL 22 )kα α α t+1 −α α α ∗ (θ θ θ t )k. 107 As a result Y(θ θ θ t ,α α α t+1 )≤L 22 (2L 22 R +g max +L p )kα α α t+1 −α α α ∗ (θ θ θ t )k ≤L 22 (2L 22 R +g max +L p ) 1 2 K/2 √ 8κ s 2Δ σ ≤ 2 , where the last inequality follows from the choice of K. A.2.3 Proof of Theorem 4 Theorem 4. Consider the min-max zero sum game min θ θ θ∈Θ max α α α∈A (f(θ θ θ,α α α) =h(θ θ θ,α α α)−p(α α α) +q(θ θ θ)), where the function h(θ θ θ,α α α) is σ−strongly concave. Let D =g(θ θ θ 0 )+q(θ θ θ 0 )− min θ θ θ∈Θ (g(θ θ θ) +q(θ θ θ)) where g(θ θ θ) = max α α α∈A h(θ θ θ,α α α)−p(α α α), and L g =L 11 + L 2 12 σ be the Lipschitz constant of the gradient of g. In Algorithm 5, if we set η 1 = 1 L 22 ,η 2 = 1 Lg , N = q 8L 22 /σ− 1 and choose K and T large enough such that T≥N T (), 4L g D 2 and K≥N K (), 2 √ 8κ C + 2log 1 + 1 2 log 2Δ σ ! , where C = max 2log2 + log (L g L 12 R),logL 22 + log (2L 22 R +g max +L p +R) and κ = L 22 σ , then there exists an iterationt∈{0,···,T} such that (θ θ θ t ,α α α t+1 ) is an–FNE of (3.4). Proof: First, by descent lemma we have g(θ θ θ t+1 ) +q(θ θ θ t+1 ) ≤g(θ θ θ t ) +h∇ θ θ θ g(θ θ θ t ),θ θ θ t+1 −θ θ θ t i + L g 2 kθ θ θ t+1 −θ θ θ t k 2 +q(θ θ θ t+1 ) 108 =g(θ θ θ t ) +h∇ θ θ θ h(θ θ θ t ,α α α ∗ (θ θ θ t )),θ θ θ t+1 −θ θ θ t i + L g 2 kθ θ θ t+1 −θ θ θ t k 2 +q(θ θ θ t+1 ) =g(θ θ θ t ) +h∇ θ θ θ h(θ θ θ t ,α α α t+1 ),θ θ θ t+1 −θ θ θ t i + L g 2 kθ θ θ t+1 −θ θ θ t k 2 +q(θ θ θ t+1 ) −h∇ θ θ θ h(θ θ θ t ,α α α t+1 )−∇ θ θ θ f(θ θ θ t ,α α α t+1 ),θ θ θ t+1 −θ θ θ t i =g(θ θ θ t ) +q(θ θ θ t ) + min θ θ θ∈Θ h∇ θ θ θ h(θ θ θ t ,α α α t+1 ),θ θ θ−θ θ θ t i + L g 2 kθ θ θ−θ θ θ t k 2 +q(θ θ θ)−q(θ θ θ t ) −h∇ θ θ θ h(θ θ θ t ,α α α t+1 )−∇ θ θ θ f(θ θ θ t ,α α α t+1 ),θ θ θ t+1 −θ θ θ t i, where the last equality follows the definition of θ θ θ t+1 . Thus we get, g(θ θ θ t+1 ) +q(θ θ θ t+1 ) ≤g(θ θ θ t ) +q(θ θ θ t ) + min θ θ θ∈Θ h∇ θ θ θ h(θ θ θ t ,α α α t+1 ),θ θ θ−θ θ θ t i + L g 2 kθ θ θ−θ θ θ t k 2 +q(θ θ θ)−q(θ θ θ t ) −h∇ θ θ θ h(θ θ θ t ,α α α t+1 )−∇ θ θ θ f(θ θ θ t ,α α α t+1 ),θ θ θ t+1 −θ θ θ t i =g(θ θ θ t ) +q(θ θ θ t ) + 1 2L g 2L g min θ θ θ∈Θ h∇ θ θ θ h(θ θ θ t ,α α α t+1 ),θ θ θ−θ θ θ t i + L g 2 kθ θ θ−θ θ θ t k 2 +q(θ θ θ)−q(θ θ θ t ) −h∇ θ θ θ h(θ θ θ t ,α α α t+1 )−∇ θ θ θ f(θ θ θ t ,α α α t+1 ),θ θ θ t+1 −θ θ θ t i 1 ≤g(θ θ θ t ) +q(θ θ θ t ) + 1 2L g 2L 11 min θ θ θ∈Θ h∇ θ θ θ h(θ θ θ t ,α α α t+1 ),θ θ θ−θ θ θ t i + L 11 2 kθ θ θ−θ θ θ t k 2 +q(θ θ θ)−q(θ θ θ t ) −h∇ θ θ θ h(θ θ θ t ,α α α t+1 )−∇ θ θ θ f(θ θ θ t ,α α α t+1 ),θ θ θ t+1 −θ θ θ t i ≤g(θ θ θ t ) +q(θ θ θ t )− 1 2L g X (θ θ θ t ,α α α t+1 ) +L 12 kα α α K (θ θ θ t )−α α α ∗ (θ θ θ t )kR, (18) where 1 is due to [163, Lemma 1]. Now if we choose K 1 ≥ 2 √ 8κ 2log2 + log (L g L 12 R) + 2log 1 + 1 2 log 2Δ σ ! , we have L 12 Rkα α α K (θ θ θ t )−α α α ∗ (θ θ θ t )k≤ 2 4L g , 109 due to Lemma 7. Combining this inequality with (18) and summing up both sides of the inequality (18), we obtain T−1 X t=0 1 2L g X (θ θ θ t ,α α α t+1 )− 2 4L g ! ≤g(θ θ θ 0 ) +q(θ θ θ 0 )− (g(θ θ θ T ) +q(θ θ θ T ))≤D. 
As a result, by picking T≥ 4LgD 2 , at least for one of the iterates t∈{1,···,T} we have X (θ θ θ t ,α α α k (θ θ θ t ))≤ 2 . On the other hand, for that point t from Lemma 8, if we choose K 2 ≥ 2 √ 8κ logL 22 + log (2L 22 R +g max +L p +R) + 2log 1 + 1 2 log 2Δ σ ! we haveY(θ θ θ t ,α α α t+1 )≤ 2 . Finally setting K = max{K 1 ,K 2 } will result inY(θ θ θ t ,α α α t+1 )≤ 2 andX (θ θ θ t ,α α α t+1 )≤ 2 . This completes the proof. A.2.4 Proof of Theorem 5 Theorem 5. Consider the min-max zero sum game min θ θ θ∈Θ max α α α∈A f(θ θ θ,α α α) =h(θ θ θ,α α α)−p(α α α) +q(θ θ θ) ! , where the functionh(θ θ θ,α α α) is concave. Definef λ (θ θ θ,α α α) =f(θ θ θ,α α α)− λ 2 kα α α− ˆ α α αk 2 andg λ (θ θ θ) = max α α α∈A h(θ θ θ,α α α)− λ 2 kα α α− ˆ α α αk 2 −p(α α α) for some ˆ α α α∈A . Let D =g λ (θ θ θ 0 )+q(θ θ θ 0 )−min θ θ θ∈Θ (g λ (θ θ θ) +q(θ θ θ)) and L g λ =L 11 + L 2 12 λ be the Lipschitz constant of the gradient of g λ . In Algorithm 5 if we set η 1 = 1 L 22 +λ , η 2 = 1 Lg λ , N = q 8(L 22 +λ) λ − 1, λ = min{L 22 , 2 √ 2R } and choose K and T large enough such that, T≥N T (), 8L g λ D 2 , 110 and K≥N K (), 2 √ 8κ C + 2log 2 + 1 2 log 2Δ λ ! , where C = max 2log2 + log (L g λ L 12 R),log(L 22 +λ) + log 2(L 22 +λ)R +g λ max +L p +R , κ = L 22 +λ λ and g λ max = max α α α∈A k∇ α α α h(θ θ θ t ,α α α)k +λR, there exists t∈{0,...,T} such that (θ θ θ t ,α α α t+1 ) is an -FNE of the original problem (3.4). Proof: We only need to show that when the regularized function converges to -FNE, by proper choice ofλ, the converged point is also an-FNE of the original game. It is important to notice that in the regularized function the smooth term is h λ (θ θ θ,α α α) =h(θ θ θ,α α α)− λ 2 kα α α− ˆ α α αk 2 . As a result, from Assumption 3 we have k∇ α α α h λ (θ θ θ,α α α 1 )−∇ α α α h λ (θ θ θ,α α α 2 )k =k∇ α α α h(θ θ θ,α α α 1 )−∇ α α α h(θ θ θ,α α α 2 )−λ(α α α 1 −α α α 2 )k≤ (L 22 +λ)kα α α 1 −α α α 2 k, where the last inequality is obtained by combing triangular inequality and Lipshitz smoothness of the function h(.,.). Additionally,∇ θ θ θ h λ ( ¯ θ θ θ, ¯ α α α) =∇ θ θ θ h( ¯ θ θ θ, ¯ α α α). Now, based on Definition 3, a point ( ¯ θ θ θ, ¯ α α α) is said to be–FNE of the regularized function ifX λ ( ¯ θ θ θ, ¯ α α α)≤ 2 andY λ ( ¯ θ θ θ, ¯ α α α)≤ 2 where X λ ( ¯ θ θ θ, ¯ α α α),−2L 11 min θ θ θ∈Θ h∇ θ θ θ h( ¯ θ θ θ, ¯ α α α),θ θ θ− ¯ θ θ θi +q(θ θ θ)−q( ¯ θ θ θ) + L 11 2 ||θ θ θ− ¯ θ θ θ|| 2 , and Y λ ( ¯ θ θ θ, ¯ α α α), 2(L 22 +λ)max α α α∈A h∇ α α α h( ¯ θ θ θ, ¯ α α α)−λ(¯ α α α− ˆ α α α),α α α− ¯ α α αi−p(α α α) +p(¯ α α α)− (L 22 +λ) 2 ||α α α− ¯ α α α|| 2 . 111 For simplicity, letX 0 (·,·) andY 0 (·,·) represent the above definitions for the original function. In the following we show that by proper choice of λ the proposed algorithm will result in a point thatX 0 (·,·)≤ 2 andY 0 (·,·)≤ 2 . To show this, we first bound theY 0 (·,·) byY λ (·,·): Y 0 (θ θ θ t ,α α α t+1 ) =2L 22 max α α α∈A h∇ α α α f(θ θ θ t ,α α α t+1 ),α α α−α α α t+1 i−p(α α α) +p(α α α t+1 )− L 22 2 ||α α α−α α α t+1 || 2 1 ≤ 2(2L 22 +λ)max α α α∈A h∇ α α α f(θ θ θ t ,α α α t+1 ),α α α−α α α t+1 i−p(α α α) +p(α α α t+1 )− (2L 22 +λ) 2 ||α α α−α α α t+1 || 2 =2(2L 22 +λ)max α α α∈A h∇ α α α f(θ θ θ t ,α α α t+1 )−λ(α α α t+1 − ˆ α α α) +λ(α α α t+1 − ˆ α α α),α α α−α α α t+1 i−p(α α α) +p(α α α t+1 ) − 2L 22 +λ 2 ||α α α−α α α t+1 || 2 , where 1 is based on [163, Lemma 1]. 
Hence, Y 0 (θ θ θ t ,α α α t+1 ) ≤2 2L 22 +λ L 22 +λ (L 22 +λ)max α α α∈A h∇ α α α f(θ θ θ t ,α α α t+1 )−λ(α α α t+1 − ˆ α α α) +λ(α α α t+1 − ˆ α α α),α α α−α α α t+1 i−p(α α α) +p(α α α t+1 )− 2L 22 +λ 2 ||α α α−α α α t+1 || 2 ≤4(L 22 +λ)max α α α∈A h∇ α α α f(θ θ θ t ,α α α t+1 )−λ(α α α t+1 − ˆ α α α) +λ(α α α t+1 − ˆ α α α),α α α−α α α t+1 i−p(α α α) +p(α α α t+1 ) − 2L 22 +λ 2 ||α α α−α α α t+1 || 2 =4(L 22 +λ)max α α α∈A h∇ α α α f(θ θ θ t ,α α α t+1 )−λ(α α α t+1 − ˆ α α α),α α α−α α α t+1 i−p(α α α) +p(α α α t+1 ) − L 22 +λ 2 ||α α α−α α α t+1 || 2 − L 22 2 ||α α α−α α α t+1 || 2 +hλ(α α α t+1 − ˆ α α α),α α α−α α α t+1 i ≤4(L 22 +λ)max α α α∈A h∇ α α α f(θ θ θ t ,α α α t+1 )−λ(α α α t+1 − ˆ α α α),α α α−α α α t+1 i−p(α α α) +p(α α α t+1 ) − L 22 +λ 2 ||α α α−α α α t+1 || 2 + 4(L 22 +λ)max α α α∈A − L 22 2 ||α α α−α α α t+1 || 2 +hλ(α α α t+1 − ˆ α α α),α α α−α α α t+1 i ≤2Y λ (θ θ θ t ,α α α t+1 ) + 2 L 22 +λ L 22 λ 2 R 2 , 112 where 1 is based on [163, Lemma 1] and the last inequality follows the definition and optimizing the quadratic term. As a result, by choosing λ≤ min{L 22 , 2 √ 2R },O() we have, Y 0 (θ θ θ t ,α α α t+1 )≤ 2Y λ (θ θ θ t ,α α α t+1 ) + 2 L 22 +λ L 22 λ 2 R 2 ≤ 2 2 + 2 2 = 2 , where the last inequality comes from the fact that by running Algorithm 5 with the given inputs, the regularized function has resulted in a 2 –FNE point. Now, sinceX (θ θ θ t ,α α α t+1 ) is same for both original and regularized function, by pickingT≥N T (), 4Lg λ D 2 = 4D 2 L 11 + L 2 12 λ , O( −3 ) , we concludeX 0 (θ θ θ t ,α α α t+1 )≤ 2 . This completes the proof. In order to prove Theorem 6, we need some auxiliary lemmas that we present them before proceeding to the main proof. Lemma 9. Let A and B be H×M matrices with entries in R andk·k ∞ be the maximum norm that returns the maximum absolute value among all elements in a given matrix. Then, 1.k M P i=1 a i k 2 ≤M M P i=1 ka i k 2 ; 2.kA◦Bk F ≤kAk ∞ kBk 1,1 ; 3.kA◦Bk F ≤kAk ∞ kBk F . 4.kA◦Bk 2 ≤kAk 2 kBk 2 ; 5.kA◦Bk F ≤kAk F kBk F ; 6. 1 M (A◦B)1 1 1 M 2 ≤ 1 M kAk 2 ∞ kBk 2 F ; 7. 1 M (A◦B)1 1 1 M ≤ 1 M kAk ∞ M P i=1 kb i k 2 ; Proof: 1. Let y = M P i=1 a i . The proof follows from an application of Jensen’s inequality on the convex function φ(y) =kyk 2 . 2. From the definition of Hadamard product, we have kA◦Bk F = v u u u t H X i=1 M X j=1 (a i,j b i,j ) 2 ≤kAk ∞ v u u u t H X i=1 M X j=1 b 2 i,j ≤kAk ∞ H X i=1 M X j=1 |b i,j | =kAk ∞ kBk 1,1 . 113 3. It follows from the above inequality that kA◦Bk F ≤kAk ∞ v u u u t H X i=1 M X j=1 b 2 i,j ≤kAk ∞ kBk F . 4. Note thatkA⊗Bk 2 =kAk 2 kBk 2 . This, together with the fact that A◦B is a principle submatrix of A⊗B implies kA◦Bk 2 ≤kA⊗Bk 2 =kAk 2 kBk 2 . 5. Form the definition of Hadamard product, we have kA◦Bk 2 F = H X i=1 M X j=1 (a i,j b i,j ) 2 ≤ H X i=1 M X j=1 a 2 i,j H X i=1 M X j=1 b 2 i,j =kAk 2 F kBk 2 F , where the inequality follows from Cauchy-Schwarz. 6. Observe that 1 M (A◦B)1 1 1 M 2 = 1 M M X i=1 a i ◦b i 2 ≤ 1 M M X i=1 ka i ◦b i k 2 ≤ 1 M kAk 2 ∞ M X i=1 kb i k 2 , where the first and the second inequalities follow from 1 and 3, respectively. 7. Similar to 6, we obtain 1 M (A◦B)1 1 1 M ≤ 1 M M X i=1 ka i ◦b i k≤ 1 M kAk ∞ M X i=1 kb i k. The following lemma provide upper bounds on the norm of momentum vectors m i,k and ˜ v i,k defined in Algorithm 6. 114 Lemma 10. [[166], Lemma 4.2] For each k∈{1···N}, ifk b g k k ∞ ≤G ∞ , then we have km k k ∞ ≤G ∞ andk˜ v k k ∞ ≤G 2 ∞ . Lemma 11. 
Assume γ :=β 1,1 /β 2 ≤ 1 and let ˜ v − 1 2 r,k , and m r,k represent the values of the r th coordinate of vectors ˜ v − 1 2 k andm k , respectively. Then, for eachk∈{1···N}, andr∈{1,···,d}, we have |˜ v − 1 2 r,k m r,k−1 |≤ 1 √ u c , where u c := (1−β 3 )(1−β 1,1 )(1−β 2 )(1−γ). Proof: From the update rule of Algorithm 6, we have ˜ v r,k =β 3 ˜ v r,k−1 + (1−β 3 )max(˜ v r,k−1 ,v r,k ), which implies that ˜ v r,k ≥ (1−β 3 )v r,k . It can be easily seen from the update rule of m k and v i,k in Algorithm 6 that m r,k = k X s=1 k Y l=s+1 β 1,l (1−β 1,s )ˆ g r,s , and v r,k = (1−β 2 ) k X s=1 β k−s 2 ˆ g 2 r,s . Thus, |v − 1 2 r,k m r,k−1 | 2 ≤|v − 1 2 r,k−1 m r,k−1 | 2 ≤ k−1 P s=1 k−1 Q l=s+1 β 1,l ! (1−β 1,s )ˆ g r,s ! 2 (1−β 2 ) k−1 P s=1 β k−s−1 2 ˆ g 2 r,s ≤ k−1 P s=1 k−1 Q l=s+1 β 1,l ! ˆ g r,s ! 2 (1−β 2 ) k−1 P s=1 β k−s−1 2 ˆ g 2 r,s , (19) 115 where the first inequality follows since v − 1 2 r,k ≤v − 1 2 r,k−1 for allr∈ [d] and the last inequality uses our assumption that β 1,s ≤ 1 for all s≥ 1. Now, let π s = k−1 Q l=s+1 β 1,l . Since β 1,l is decreasing, we get π s ≤β k−s−1 1,1 . This, together with ( P i a i b i ) 2 ≤ ( P i a 2 i )( P i b 2 i ) implies that k−1 P s=1 π s ˆ g r,s ! 2 (1−β 2 ) k−1 P s=1 β k−s−1 2 ˆ g 2 r,s ≤ ( k−1 P s=1 π s )( k−1 P s=1 π s ˆ g 2 r,s ) (1−β 2 ) k−1 P s=1 β k−s−1 2 ˆ g 2 r,s ≤ 1 1−β 2 ( k−1 X s=1 π s ) k−1 X s=1 π s ˆ g 2 r,s β k−s−1 2 ˆ g 2 r,s ≤ 1 1−β 2 ( k−1 X s=1 π s ) k−1 X s=1 π s β k−s−1 2 ≤ 1 1−β 2 1 1−β 1,1 1 1−γ . where the last inequality follows from our assumption γ = β 1,1 β 2 ≤ 1. Finally, substituting the above inequality into (19) yields the desired result. Lemma 12. Let G 2 0 ≤k˜ v 0 k ∞ ≤G 2 ∞ . Then, for p> 0 : 1. N P k=1 k˜ v p k − ˜ v p k−1 k 1 ≤dG 2p ∞ ; 2. N P k=1 k˜ v p k − ˜ v p k−1 k 2 1 ≤dG 4p ∞ , and for q< 0 : 3. N P k=1 k˜ v q k − ˜ v q k−1 k 1 ≤dG 2q 0 ; 4. N P k=1 k˜ v q k − ˜ v q k−1 k 2 1 ≤dG 4q 0 , where the vector powers are considered to be element-wise. 116 Proof: 1. From the definition we have, N X k=1 ˜ v p k − ˜ v p k−1 1 = N X k=1 d X r=1 (˜ v p r,k − ˜ v p r,k−1 ) = d X r=1 N X k=1 (˜ v p r,k − ˜ v p r,k−1 ) ≤k˜ v p 0 k 1 =dG 2p ∞ , where the first equality is due to the fact that each element of ˜ v p k ,p> 0, is increasing in k and the last inequality uses the telescoping sum. 2. Similarly we have, N X k=1 ˜ v p k − ˜ v p k−1 2 1 ≤ N X k=1 d X r=1 (˜ v p r,k − ˜ v p r,k−1 )(˜ v p r,k ) ≤ N X k=1 d X r=1 (˜ v p r,k − ˜ v p r,k−1 )G 2p ∞ =dG 4p ∞ . 3. By using the definition we get, N X k=1 ˜ v q k − ˜ v q k−1 1 = N X k=1 d X r=1 (−˜ v q r,k + ˜ v q r,k−1 ) = d X r=1 N X k=1 (−˜ v q r,k + ˜ v q r,k−1 ) ≤k˜ v q 0 k 1 =dG 2q 0 . Here, the first equality is due to the fact that each element of ˜ v q k ,q< 0, is decreasing in k. 4. With similar approach as in 3 we get, N X k=1 ˜ v q k − ˜ v q k−1 2 1 ≤ N X k=1 d X r=1 (−˜ v q r,k + ˜ v q r,k−1 )(˜ v q r,k−1 ) 117 ≤ N X k=1 d X r=1 (−˜ v q r,k + ˜ v q r,k−1 )G 2q 0 =dG 4q 0 . Before proceeding, we introduce the following lemma that will be used in the proof of Theorem 6. Lemma 13. Let Assumptions 9 and 8 Hold. Then for ˜ v k ,k∈ [N] defined in Algorithm 6 we have, N X k=1 ˜ v 1 4 k−1 ◦ (x k−1 −x ∗ ) 2 − ˜ v 1 4 k−1 ◦ (x k −x ∗ ) 2 ! ≤ 3D 2 dG ∞ . Proof: We have N X k=1 ˜ v 1 4 k−1 ◦ (x k−1 −x ∗ ) 2 − ˜ v 1 4 k−1 ◦ (x k −x ∗ ) 2 ! = ˜ v 1 4 0 ◦ (x 0 −x ∗ ) 2 + − ˜ v 1 4 0 ◦ (x 1 −x ∗ ) 2 + ˜ v 1 4 1 ◦ (x 1 −x ∗ ) 2 ! + − ˜ v 1 4 1 ◦ (x 2 −x ∗ ) 2 + ˜ v 1 4 2 ◦ (x 2 −x ∗ ) 2 ! . . . + − ˜ v 1 4 N−2 ◦ (x N−1 −x ∗ ) 2 + ˜ v 1 4 N−1 ◦ (x N−1 −x ∗ ) 2 ! 
− ˜ v 1 4 N−1 ◦ (x N −x ∗ ) 2 . (20) For the arbitrary s th pairs in (20), we have − ˜ v 1 4 s−1 ◦ (x s −x ∗ ) 2 + ˜ v 1 4 s ◦ (x s −x ∗ ) 2 = − (˜ v 1 4 s−1 − ˜ v 1 4 s + ˜ v 1 4 s )◦ (x s −x ∗ ) + ˜ v 1 4 s ◦ (x s −x ∗ ) · ˜ v 1 4 s−1 ◦ (x s −x ∗ ) + ˜ v 1 4 s ◦ (x s −x ∗ ) 118 ≤ (˜ v 1 4 s−1 − ˜ v 1 4 s )◦ (x s −x ∗ ) 2 q G ∞ D ≤ 2D 2 q G ∞ k˜ v 1 4 s−1 − ˜ v 1 4 s k 1 . (21) where the first inequality follows fromkak−kbk≤ka−bk, Assumption 9 and Lemma 10. As a result, N X k=1 R 2,0,k ≤ 2D 2 dG ∞ + ˜ v 1 4 0 ◦ (x 0 −x ∗ ) 2 − ˜ v 1 4 N−1 ◦ (x N −x ∗ ) 2 ≤ 3D 2 dG ∞ , where the first inequality follows from Lemma 12 and last inequality uses the same lemma, Assumption 9 and the fact that d≥ 1. A.2.5 Proof of Theorem 6 Theorem 6. Let Assumptions 4–9 hold, and L, G ∞ , σ j , ν j be defined therein. Then we have, C 0 N X k=1 Ek∇F (z k )k 2 ≤ C 1 N + C 2 σ 2 m , where C 1 ,C 2 ,C 3 are constants defined in (41). Proof: We divide the proof into four steps. In Step 1, we show that the gradient norm is bounded by the norm of search direction and auxiliary variables x k and z k . Then in Steps 2 and 3, we give upper bounds for these terms. Finally, in Step 4, we provide the convergence analysis. Step 1 In this step, we show that under Assumption 7, we have 1 N N X k=1 kg k k 2 ≤ 3 Nη 2 (1−β 1,1 ) 2 G −2 ∞ N X k=1 η 2 R 1,k +R 2,k , (22) 119 where R 1,k := −d k + (1−β 1,k )˜ v − 1 2 k ◦g k 2 R 2,k :=kz k −x k k 2 +kz k −x k−1 k 2 . (23) It follows from the update rule of x k in Algorithm 6 that η(1−β 1,k ) ˜ v − 1 2 k ◦g k = z k −x k + (x k−1 −ηd k ) −z k +η(1−β 1,k ) ˜ v − 1 2 k ◦g k . Now, using Lemma 91, we get η 2 (1−β 1,k ) 2 ˜ v − 1 2 k ◦g k 2 ≤ 3η 2 −d k + (1−β 1,k )˜ v − 1 2 k ◦g k 2 + 3 kz k −x k k 2 +kz k −x k−1 k 2 . (24) From Lemma 10, we havek˜ v − 1 2 k k ∞ ≥G −1 ∞ which implies that ˜ v − 1 2 k ◦g k 2 ≥G −2 ∞ kg k k 2 . Now, it follows from the above inequality and (23) that η 2 (1−β 1,1 ) 2 G −2 ∞ kg k k 2 ≤η 2 (1−β 1,k ) 2 ˜ v − 1 2 k ◦g k 2 ≤ 3η 2 R 1,k + 3R 2,k , which gives (22). Step 2 In this step, we aim to develop an upper bound for R 1,k defined in (23). More specifically, we show that 1 N N X k=1 R 1,k = 2dβ 2 1,1 Nu c (1−κ 2 ) + 2 NG 2 0 N X k=1 k k k 2 , (25) 120 where u c is defined in Lemma 11 and k = b g k −g k . From the definition of d k in Algorithm 6, we have d k =β 1,k ˜ v − 1 2 k ◦m k−1 + (1−β 1,k )˜ v − 1 2 k ◦ b g k . (26) Hence, −d k + (1−β 1,k )˜ v − 1 2 k ◦g k =−β 1,k ˜ v − 1 2 k ◦m k−1 + (1−β 1,k )˜ v − 1 2 k ◦ (g k − b g k ), which implies that R 1,k = −β 1,k ˜ v − 1 2 k ◦m k−1 + (1−β 1,k )˜ v − 1 2 k ◦ (g k − (g k + k )) 2 ≤ 2β 2 1,k ˜ v − 1 2 k ◦m k−1 2 + 2(1−β 1,k ) 2 ˜ v − 1 2 k ◦ k 2 . (27) Here, the equality is obtained since k = b g k − g k , and the inequality follows from Lemma 91. For the first term on the R.H.S. of (27), it follows from Lemma 11 that ˜ v − 1 2 k ◦m k−1 2 ≤ d u c . (28a) Further, for the second term on the R.H.S. of (27), we have ˜ v − 1 2 k ◦ k 2 ≤k˜ v − 1 2 k k 2 ∞ k k k 2 ≤k˜ v − 1 2 0 k 2 ∞ k k k 2 ≤ 1 G 2 0 k k k 2 , (28b) where the inequality uses Lemma 96, the fact that each element of ˜ v − 1 2 k is decreasing in k and our assumption thatk˜ v − 1 2 0 k ∞ ≤ 1/G 0 . 121 Substituting (28a)–(28b) into (27), we obtain R 1,k ≤ 2dβ 2 1,k u c + 2 G 2 0 k k k 2 . (29) Summing overk = 1,···,N and using the fact that P N k=1 β 2 1,k ≤β 2 1,1 /(1−κ 2 ), we obtain the desired result. Step 3 In this step, we provide an upper bound for R 2,k defined in (23). 
In particular, we show that for η≤ q G 3 0 /(56L 2 G ∞ ), the following holds 1 N N X k=1 R 2,k ≤ 6D 2 dG ∞ NG 0 + 4ηD NG 0 β 1,1 G ∞ 1−κ s d u c + G 2 ∞ d G 0 ! + 56η 2 dβ 2 1,1 G ∞ Nu c (1−κ 2 )G 0 + 28η 2 dG 3 ∞ NG 3 0 + 28η 2 G ∞ NG 3 0 N X k=1 (β 1,k −β 1,k−1 ) 2 kg k k 2 + 56η 2 G ∞ G 3 0 N N X k=1 k k k 2 . (30) Let ˜ v 1 4 k−1 := ˆ v 1 4 1,k−1 , ˆ v 1 4 2,k−1 ,···, ˆ v 1 4 d,k−1 | . The update rule of x k in Algorithm 6 implies that ˜ v 1 4 k−1 ◦ (x k −x) 2 = ˜ v 1 4 k−1 ◦ (x k−1 −ηd k −x) 2 = ˜ v 1 4 k−1 ◦ (x k−1 −x)−η˜ v 1 4 k−1 ◦d k 2 − ˜ v 1 4 k−1 ◦ (x k−1 −x k )−η˜ v 1 4 k−1 ◦d k 2 = ˜ v 1 4 k−1 ◦ (x k−1 −x) 2 − ˜ v 1 4 k−1 ◦ (x k−1 −x k ) 2 − 2 ˜ v 1 4 k−1 ◦ (x k−1 −x),η˜ v 1 4 k−1 ◦d k + 2 ˜ v 1 4 k−1 ◦ (x k−1 −x k ),η˜ v 1 4 k−1 ◦d k − 2 ˜ v 1 4 k−1 ◦z k ,η˜ v 1 4 k−1 ◦d k + 2 ˜ v 1 4 k−1 ◦z k ,η˜ v 1 4 k−1 ◦d k = ˜ v 1 4 k−1 ◦ (x k−1 −x) 2 − ˜ v 1 4 k−1 ◦ (x k−1 −z k +z k −x k ) 2 − 2 ˜ v 1 4 k−1 ◦ (z k −x),η˜ v 1 4 k−1 ◦d k + 2 ˜ v 1 4 k−1 ◦ (z k −x k ),η˜ v 1 4 k−1 ◦d k 122 = ˜ v 1 4 k−1 ◦ (x k−1 −x) 2 − ˜ v 1 4 k−1 ◦ (x k−1 −z k ) 2 − ˜ v 1 4 k−1 ◦ (z k −x k ) 2 + 2 ˜ v 1 4 k−1 ◦ (x−z k ),η˜ v 1 4 k−1 ◦d k + 2 ˜ v 1 4 k−1 ◦ (x k −z k ), ˜ v 1 4 k−1 ◦ (x k−1 −z k ) + 2 ˜ v 1 4 k−1 ◦ (x k −z k ),−η˜ v 1 4 k−1 ◦d k , where the second equality follows since x k−1 −x k −ηd k = 0. Now, substituting x = x ∗ into the above equality and rearranging the terms, we get ˜ v 1 4 k−1 ◦ (z k −x k ) 2 + ˜ v 1 4 k−1 ◦ (x k−1 −z k ) 2 = ˜ v 1 4 k−1 ◦ (x k−1 −x ∗ ) 2 − ˜ v 1 4 k−1 ◦ (x k −x ∗ ) 2 | {z } R 2,0,k + 2η ˜ v 1 4 k−1 ◦ (x ∗ −z k ), ˜ v 1 4 k−1 ◦d k | {z } R 2,1,k + 2η ˜ v 1 4 k−1 ◦ (x k −z k ), ˜ v 1 4 k−1 ◦ (d k−1 −d k ) | {z } R 2,2,k . (31) Since by our assumptionk˜ v 1 2 0 k ∞ ≥G 0 , we have G 0 kz k −x k k 2 ≤ ˜ v 1 4 k−1 ◦ (z k −x k ) 2 , and G 0 kx k−1 −z k k 2 ≤ ˜ v 1 4 k−1 ◦ (x k−1 −z k ) 2 . We substitute the above lower bounds into (31) to get kz k −x k k 2 +kx k−1 −z k k 2 ≤ R 2,0,k G 0 + 2η G 0 R 2,1,k +R 2,2,k . (32) Next, we provide upper bounds for the terms R 2,1,k and R 2,2,k . 123 Bounding R 2,1,k . It follows from the update rule of d k in (26) that d k = d k − (1−β 1,k )˜ v − 1 2 k−1 ◦ b g k + (1−β 1,k )˜ v − 1 2 k−1 ◦ b g k =β 1,k ˜ v − 1 2 k ◦m k−1 + (1−β 1,k )(˜ v − 1 2 k − ˜ v − 1 2 k−1 )◦ b g k + (1−β 1,k )˜ v − 1 2 k−1 ◦g k + (1−β 1,k )˜ v − 1 2 k−1 ◦ ( b g k −g k ). (33) To find an upper bound forR 2,1,k , we first multiply each term in (33) by ˜ v 1 4 k−1 and then provide an upper bound for its inner product with ˜ v 1 4 k−1 ◦(x ∗ −z k ). From Lemmas 11, 10 and Assumption 9, we get ˜ v 1 4 k−1 ◦ (x ∗ −z k ), ˜ v 1 4 k−1 ◦ ˜ v − 1 2 k ◦m k−1 ≤DG ∞ ˜ v − 1 2 k ◦m k−1 ≤DG ∞ q du −1 c . (34a) Further, ˜ v 1 4 k−1 ◦ (x ∗ −z k ), ˜ v 1 4 k−1 ◦ (˜ v − 1 2 k − ˜ v − 1 2 k−1 )◦ b g k ≤DG ∞ k(˜ v − 1 2 k − ˜ v − 1 2 k−1 )◦ b g k k ≤DG ∞ k b g k k ∞ k˜ v − 1 2 k − ˜ v − 1 2 k−1 k 1 ≤DG 2 ∞ k˜ v − 1 2 k − ˜ v − 1 2 k−1 k 1 , (34b) where the second inequality is obtained from Lemma 92 and the last inequality is due to Assumption 7. From Assumption 8, we have ˜ v 1 4 k−1 ◦ (x ∗ −z k ), ˜ v − 1 4 k−1 ◦g k =hx ∗ −z k ,g k i≤ 0. (34c) Further, ˜ v 1 4 k−1 ◦ (x ∗ −z k ), ˜ v − 1 4 k−1 ◦ ( b g k −g k ) =hx ∗ −z k , b g k −g k i =: Θ k . (34d) 124 Now, using (34d)–(34b), we obtain R 2,1,k ≤β 1,k DG ∞ q du −1 c +DG 2 ∞ k˜ v − 1 2 k − ˜ v − 1 2 k−1 k 1 + Θ k . (35) Bounding R 2,2,k . 
From the update rule of d k in (26), we get d k −d k−1 =β 1,k ˜ v − 1 2 k ◦m k−1 + (1−β 1,k )˜ v − 1 2 k ◦ b g k −β 1,k−1 ˜ v − 1 2 k−1 ◦m k−2 − (1−β 1,k−1 )˜ v − 1 2 k−1 ◦ b g k−1 =β 1,k ˜ v − 1 2 k ◦m k−1 −β 1,k−1 ˜ v − 1 2 k−1 ◦m k−2 + (1−β 1,k )(˜ v − 1 2 k − ˜ v − 1 2 k−1 + ˜ v − 1 2 k−1 )◦ b g k − (1−β 1,k−1 )˜ v − 1 2 k−1 ◦ b g k−1 =β 1,k ˜ v − 1 2 k ◦m k−1 −β 1,k−1 ˜ v − 1 2 k−1 ◦m k−2 + (1−β 1,k )(˜ v − 1 2 k − ˜ v − 1 2 k−1 )◦ b g k + (1−β 1,k )˜ v − 1 2 k−1 ◦ (g k + k )− (1−β 1,k−1 )˜ v − 1 2 k−1 ◦ (g k−1 + k−1 ) =β 1,k ˜ v − 1 2 k ◦m k−1 −β 1,k−1 ˜ v − 1 2 k−1 ◦m k−2 + (1−β 1,k )(˜ v − 1 2 k − ˜ v − 1 2 k−1 )◦ b g k + (1−β 1,k )˜ v − 1 2 k−1 ◦ k − (1−β 1,k−1 )˜ v − 1 2 k−1 ◦ k−1 + (1−β 1,k−1 )˜ v − 1 2 k−1 ◦ (g k −g k−1 ) + (β 1,k−1 −β 1,k )˜ v − 1 2 k−1 ◦g k . (36) Next, we focus on providing upper bounds for R 2,2,k =ηk˜ v 1 4 k−1 ◦ (d k −d k−1 )k 2 ≤ηG ∞ kd k −d k−1 k 2 . Observe that β 1,k ˜ v − 1 2 k ◦m k−1 2 + −β 1,k−1 ˜ v − 1 2 k−1 ◦m k−2 2 ≤ 2max β 1,k ˜ v − 1 2 k ◦m k−1 2 , −β 1,k−1 ˜ v − 1 2 k−1 ◦m k−2 2 ! ≤ 2dβ 2 1,k−1 u c , (37a) 125 where the inequality follows from Lemma 11. Using Lemma 92, we get (1−β 1,k )(˜ v − 1 2 k − ˜ v − 1 2 k−1 )◦ b g k 2 ≤k b g k k 2 ∞ k˜ v − 1 2 k − ˜ v − 1 2 k−1 k 2 1 ≤G 2 ∞ k˜ v − 1 2 k − ˜ v − 1 2 k−1 k 2 1 , (37b) where the last inequality uses Assumption 7. Similarly, (1−β 1,k−1 )˜ v − 1 2 k−1 ◦ (g k−1 −g k ) 2 ≤k˜ v − 1 2 k−1 k 2 ∞ kg k−1 −g k k 2 ≤ L 2 G 2 0 kz k−1 −z k k 2 , (37c) k(β 1,k −β 1,k−1 )˜ v − 1 2 k−1 ◦g k k 2 ≤ (β 1,k −β 1,k−1 ) 2 k˜ v − 1 2 k−1 k 2 ∞ kg k k 2 ≤ (β 1,k −β 1,k−1 ) 2 G 2 0 kg k k 2 . (37d) By taking the norm of (36), using 91 and (37a)–(37d), we get R 2,2,k G ∞ ≤ηkd k −d k−1 k 2 ≤ 14ηdβ 2 1,k−1 u −1 c + 7ηG 2 ∞ k˜ v − 1 2 k − ˜ v − 1 2 k−1 k 2 1 + 7ηL 2 G 2 0 kz k−1 −z k k 2 + 7η G 2 0 (β 1,k −β 1,k−1 ) 2 kg k k 2 + 7η G 2 0 k k k 2 +k k−1 k 2 . (38) By substituting (35) and (38) into (32), we obtain kz k −x k k 2 +kx k−1 −z k k 2 ≤ R 2,0,k G 0 + 2η G 0 β 1,k DG ∞ q du −1 c +DG 2 ∞ k˜ v − 1 2 k − ˜ v − 1 2 k−1 k 1 + Θ k + 14η 2 G ∞ G 0 2dβ 2 1,k−1 u −1 c +G 2 ∞ k˜ v − 1 2 k − ˜ v − 1 2 k−1 k 2 1 + 14η 2 G ∞ G 3 0 L 2 kz k −z k−1 k 2 + (β 1,k −β 1,k−1 ) 2 kg k k 2 + 14η 2 G ∞ G 3 0 k k k 2 +k k−1 k 2 . 126 Now, summing the above inequality over k = 1,···,N, we obtain 1− 28η 2 L 2 G ∞ G 3 0 ! N X k=1 kx k−1 −z k k 2 + 1− 28η 2 L 2 G ∞ G 3 0 ! N X k=1 kz k −x k k 2 ≤ 3D 2 dG ∞ G 0 + 2η G 0 Dβ 1,1 G ∞ 1−κ s d u c + DG 2 ∞ d G 0 + N X k=1 Θ k ! + 28η 2 dβ 2 1,1 G ∞ u c (1−κ 2 )G 0 + 14η 2 dG 3 ∞ G 3 0 + 14η 2 G ∞ G 3 0 N X k=1 (β 1,k −β 1,k−1 ) 2 kg k k 2 + 28η 2 G ∞ G 3 0 N X k=1 k k k 2 =: R.H.S.. (39) Here, we used Lemma 12 and the fact that N X k=1 kz k −z k−1 k 2 ≤ 2 N X k=1 kz k −x k−1 k 2 + 2 N X k=1 kx k−1 −z k−1 k 2 = 2 N X k=1 kz k −x k−1 k 2 + 2 N X k=1 kx k −z k k 2 , (40) where the inequality follows from Lemma 91 and the equality uses our assumption x 0 = z 0 = 0. Now, by our choice of step size η in the beginning of Step 3, we have 1− (28η 2 L 2 G ∞ )/G 3 0 ≥ 1/2. Thus, (23) together with (39) implies that 1 N N X k=1 R 2,k = N X k=1 kx k−1 −z k k 2 +kz k −x k k 2 ≤ 2 N R.H.S., which gives (30). Step 4 (Convergence Analysis) In this step, we combine all results in previous steps to establish an error bound for N −1 P N k=1 Ekg k k 2 . To do so, by substituting (30) and (25) into (22) and simplifying the terms, we obtain η 2 (1−β 1,1 ) 2 G −2 ∞ 1 N N X k=1 Ekg k k 2 ≤ 6η 2 dβ 2 1,1 Nu c (1−κ 2 ) + 6η 2 σ 2 mG 2 0 + 18D 2 dG ∞ NG 0 + 12ηD NG 0 β 1,1 G ∞ 1−κ s d u c + G 2 ∞ d G 0 ! 
127 + 168η 2 dβ 2 1,1 G ∞ Nu c (1−κ 2 )G 0 + 84η 2 dG 3 ∞ NG 3 0 + 84(1−κ)β 2 1,1 η 2 G ∞ NG 3 0 κ 2 (1 +κ) Ekg k k 2 + 168η 2 G ∞ σ 2 G 3 0 m . Here, we used the fact that E N X k=1 Θ k = 0, and E N X k=1 k k k 2 = σ 2 m , by Assumptions 5 and 6, respectively. Rearranging the above inequality yields C 0 N N X k=1 Ekg k k 2 ≤ C 1 N + C 2 σ 2 m , where C 0 :=η 2 (1−β 1,1 ) 2 G −2 ∞ − 84(1−κ)β 2 1,1 η 2 G ∞ G 3 0 κ 2 (1 +κ) , C 1 := 6η 2 dβ 2 1,1 u c (1−κ 2 ) + 18D 2 dG ∞ G 0 + 168η 2 dβ 2 1,1 G ∞ u c (1−κ 2 )G 0 + 84η 2 dG 3 ∞ G 3 0 + 12ηD G 0 β 1,1 G ∞ 1−κ s d u c + G 2 ∞ d G 0 ! , C 2 := 6η 2 G 2 0 + 168η 2 G ∞ G 3 0 . (41) Next, dividing both sides by η 2 (1−β 1,1 ) 2 G −2 ∞ , we obtain B 0 := 1− G 3 ∞ G 3 0 84(1−κ)β 2 1,1 (1 +κ)κ 2 (1−β 1,1 ) 2 , B n := C n η 2 (1−β 1,1 ) 2 G −2 ∞ , n = 1,2. Now define C = (1+κ)κ 2 G 3 0 168(1−κ)G 3 ∞ . Picking β 1,1 ≤ √ C √ C+1 completes the proof. 128 A.2.6 Proof of Lemma 4 Lemma 4. Assume f(·,·) satisfies Assumptions 10–11 and D α ≤ min α 2L 22 , θ 2L 12 . (42) Then, any ( θ , α )-FNE for f lin is a (2 α ,2 θ )-FNE for f (in the sense of Definition 4). Proof: Let ( ¯ θ θ θ, ¯ α α α) be ( θ , α )-FNE for f lin . By Definition 4, this is equivalent to requiring that 1 2L 11 ˜ X ( ¯ θ θ θ,∇ θ θ θ f lin ( ¯ θ θ θ, ¯ α α α),L 11 )≤ 2 θ 2L 11 and 1 2L 22 ˜ Y(¯ α α α,−∇ α α α f lin ( ¯ θ θ θ, ¯ α α α),L 22 )≤ 2 α 2L 22 . Now observe that 1 2L 22 ˜ Y(¯ α α α,−∇ α α α f( ¯ θ θ θ, ¯ α α α),L 22 ) = max α α α∈A h∇ α α α f( ¯ θ θ θ, ¯ α α α),α α α− ¯ α α αi− L 22 2 kα α α− ¯ α α αk 2 ≤ 1 2L 22 ˜ Y(¯ α α α,−∇ α α α f lin ( ¯ θ θ θ, ¯ α α α),L 22 ) + max α α α∈A n h∇ α α α f( ¯ θ θ θ, ¯ α α α)−∇ α α α f lin ( ¯ θ θ θ, ¯ α α α),α α α− ¯ α α αi o = 1 2L 22 ˜ Y(¯ α α α,−∇ α α α f lin ( ¯ θ θ θ, ¯ α α α),L 22 ) + max α α α∈A n h∇ α α α f( ¯ θ θ θ, ¯ α α α)−∇ α α α f lin ( ¯ θ θ θ,α α α 0 ),α α α− ¯ α α αi o ≤ 1 2L 22 ˜ Y(¯ α α α,−∇ α α α f lin ( ¯ θ θ θ, ¯ α α α),L 22 ) +L 22 k¯ α α α−α α α 0 kmax α α α∈A kα α α− ¯ α α αk ≤ 1 2L 22 ˜ Y(¯ α α α,−∇ α α α f lin ( ¯ θ θ θ, ¯ α α α),L 22 ) +L 22 D 2 α ≤ 2 α L 22 , 129 where we plugged in the expression forf lin , then used Assumption 11, and finally used the first bound in (3.12) and the assumption about ( ¯ θ θ θ, ¯ α α α). On the other hand, using Assumption 11 we arrive at ˜ X ( ¯ θ θ θ,∇ θ θ θ f( ¯ θ θ θ, ¯ α α α),L 11 ) =−2L 11 min θ θ θ∈Θ h∇ θ θ θ f( ¯ θ θ θ, ¯ α α α),θ θ θ− ¯ θ θ θi + L 11 2 kθ θ θ− ¯ θ θ θk 2 ≤−4L 11 min θ θ θ∈Θ n h∇ θ θ θ f( ¯ θ θ θ, ¯ α α α),θ θ θ− ¯ θ θ θi +L 11 kθ θ θ− ¯ θ θ θk 2 o =−4L 11 min θ θ θ∈Θ ( h∇ θ θ θ f( ¯ θ θ θ, ¯ α α α)−∇ θ θ θ f lin ( ¯ θ θ θ, ¯ α α α) +∇ θ θ θ f lin ( ¯ θ θ θ, ¯ α α α),θ θ θ− ¯ θ θ θi + L 11 2 kθ θ θ− ¯ θ θ θk 2 + L 11 2 kθ θ θ− ¯ θ θ θk 2 ) ≤ 2 ˜ X ( ¯ θ θ θ,∇ θ θ θ f lin ( ¯ θ θ θ, ¯ α α α),L 11 ) − 4L 11 min θ θ θ∈Θ h∇ θ θ θ f( ¯ θ θ θ, ¯ α α α)−∇ θ θ θ f lin ( ¯ θ θ θ, ¯ α α α),θ θ θ− ¯ θ θ θi + L 11 2 kθ θ θ− ¯ θ θ θk 2 ≤ 2 ˜ X (¯ α α α,∇ θ θ θ f lin ( ¯ θ θ θ, ¯ α α α),L 11 ) + 2k∇ θ θ θ f( ¯ θ θ θ, ¯ α α α)−∇ θ θ θ f lin ( ¯ θ θ θ, ¯ α α α)k 2 = 2 ˜ X (¯ α α α,∇ θ θ θ f lin ( ¯ θ θ θ, ¯ α α α),L 11 ) + 2k∇ θ θ θ f( ¯ θ θ θ,α α α 0 ) +∇ 2 θ θ θα α α f( ¯ θ θ θ,α α α 0 )(¯ α α α−α α α 0 )−∇ θ θ θ f( ¯ θ θ θ, ¯ α α α)k 2 ≤ 2 ˜ X (¯ α α α,∇ θ θ θ f lin ( ¯ θ θ θ, ¯ α α α),L 11 ) + 8L 2 12 D 2 α ≤ 2 2 θ + 8L 2 12 D 2 α ≤ 4 2 θ , where the firs inequality is based on[ [163], Lemma 1] and the third inequality is obtained from optimizing the unconstrained quadratic term. This completes the proof. A.2.7 Proof of Theorem 7 Theorem 7. 
Consider the min-max zero sum game min θ θ θ∈Θ max α α α∈A f(θ θ θ,α α α). (43) 130 Define f λ,lin (θ θ θ,α α α) =f lin (θ θ θ,α α α)− λ 2 kα α α−α α α 0 k 2 and g λ (θ θ θ) = max α α α∈A f λ,lin (θ θ θ,α α α). Let D =g λ (θ θ θ 0 )− min θ θ θ∈Θ (g λ (θ θ θ)), L g λ =L 11 + L 2 12 λ be the Lipschitz constant of the gradient of g λ and D α satisfies Equation (42). In Algorithm 7, if set η = 1 Lg λ ,λ = min{L 22 , α 2Dα } =O(1) and choose T large enough such that T≥N T ( θ ), 16L g λ D 2 θ =O( −2 θ −1 α ), then there exists t∈{0,...,T} such that (θ θ θ t ,α α α t+1 ) is an (2 θ ,2 α )-FNE of the problem (43). Proof: First, by descent lemma we have g λ (θ θ θ t+1 )≤g λ (θ θ θ t ) +h∇ θ θ θ g λ (θ θ θ t ),θ θ θ t+1 −θ θ θ t i + L g λ 2 kθ θ θ t+1 −θ θ θ t k 2 =g λ (θ θ θ t ) +h∇ θ θ θ f λ,lin (θ θ θ t ,α α α t+1 ),θ θ θ t+1 −θ θ θ t i + L g λ 2 kθ θ θ t+1 −θ θ θ t k 2 1 = g λ (x t ) + min θ θ θ∈Θ hf λ,lin (θ θ θ t ,α α α t+1 ),θ θ θ−θ θ θ t i + L g λ 2 kθ θ θ−θ θ θ t k 2 =g λ (θ θ θ t ) + 1 2L g λ 2L g λ min θ θ θ∈Θ hf λ,lin (θ θ θ t ,α α α t+1 ),θ θ θ−θ θ θ t i + L g λ 2 kθ θ θ−θ θ θ t k 2 2 ≤g λ (θ θ θ t ) + 1 2L g λ 2L 11 min θ θ θ∈X hf λ,lin (θ θ θ t ,α α α t+1 ),θ θ θ−θ θ θ t i + L 11 2 kθ θ θ−θ θ θ t k 2 =g λ (θ θ θ t )− 1 2L g λ ˜ X (θ θ θ t ,∇ θ θ θ f λ,lin (θ θ θ t ,α α α t+1 ),L 11 ) (44) where 1 follows the definition of θ θ θ t+1 and 2 is due to [163, Lemma 1]. Summing up both sides of the inequality (44), we obtain T−1 X t=0 1 2L g λ ˜ X (θ θ θ t ,∇ θ θ θ f λ,lin (θ θ θ t ,α α α t+1 ),L 11 ) ! ≤g λ (α α α 0 )−g λ (α α α T )≤D. As a result, by picking T≥ 2Lg λ D 2 θ , at least for one of the iterates t∈{1,···,T} we have ˜ X (θ θ θ t ,∇ θ θ θ f lin (θ θ θ t ,α α α t+1 ),L 11 )≤ 2 θ , (45) which is based on the fact that the regularized linear function has the same gradient as the non-regularized linear function with respect to the minimization parameter. 131 Besides, from the definition we have ˜ Y(α α α t+1 ,−∇ α α α f lin (θ θ θ t ,α α α t+1 ),L 22 ) = 2L 22 max α α α∈A h∇ α α α f lin (θ θ θ t ,α α α t+1 ),α α α−α α α t+1 i− L 22 2 ||α α α−α α α t+1 || 2 1 ≤ 2(2L 22 +λ)max α α α∈A h∇ α α α f lin (θ θ θ t ,α α α t+1 ),α α α−α α α t+1 i− (2L 22 +λ) 2 ||α α α−α α α t+1 || 2 = 2(2L 22 +λ)max α α α∈A h∇ α α α f lin (θ θ θ t ,α α α t+1 )−λ(α α α t+1 −α α α 0 ) +λ(α α α t+1 −α α α 0 ),α α α−α α α t+1 i − 2L 22 +λ 2 ||α α α−α α α t+1 || 2 = 2 2L 22 +λ L 22 +λ (L 22 +λ)max α α α∈A h∇ α α α f lin (θ θ θ t ,α α α t+1 )−λ(α α α t+1 −α α α 0 ) +λ(α α α t+1 −α α α 0 ),α α α−α α α t+1 i − 2L 22 +λ 2 ||α α α−α α α t+1 || 2 ≤ 4(L 22 +λ)max α α α∈A h∇ α α α f lin (θ θ θ t ,α α α t+1 )−λ(α α α t+1 −α α α 0 ) +λ(α α α t+1 −α α α 0 ),α α α−α α α t+1 i − 2L 22 +λ 2 ||α α α−α α α t+1 || 2 = 4(L 22 +λ)max α α α∈A h∇ α α α f lin (θ θ θ t ,α α α t+1 )−λ(α α α t+1 −α α α 0 ),α α α−α α α t+1 i− L 22 +λ 2 ||α α α−α α α t+1 || 2 − L 22 2 ||α α α−α α α t+1 || 2 +λ(α α α t+1 −α α α 0 ),α α α−α α α t+1 i ≤ 4(L 22 +λ)max α α α∈A h∇ α α α f lin (θ θ θ t ,α α α t+1 )−λ(α α α t+1 −α α α 0 ),α α α−α α α t+1 i− L 22 +λ 2 ||α α α−α α α t+1 || 2 + 4(L 22 +λ)max α α α∈A − L 22 2 ||α α α−α α α t+1 || 2 +hλ(α α α t+1 −α α α 0 ),α α α−α α α t+1 i 2 ≤ 4(L 22 +λ)max α α α∈A − L 22 2 ||α α α−α α α t+1 || 2 +hλ(α α α t+1 −α α α 0 ),α α α−α α α t+1 i ≤ 2 L 22 +λ L 22 λ 2 D 2 α , where 1 is based on [163, Lemma 1] and the 2 results from the fact that step 3 in Algorithm 7 solves the regularized linear function to the optimality. Besides, the last term is obtained from optimizing the quadratic term. 
As a result, by choosing λ≤ min{L 22 , α 2Dα } we have, ˜ Y(α α α t+1 ,−∇ α α α f lin (θ θ θ t ,α α α t+1 ),L 22 )≤ 2 α . (46) 132 Finally, combining (46), (45) with Lemma 4 gives us the desired result. A.2.8 Proof of Lemma 5 Lemma 5. Letp andq be two discrete distributions withp = (p 1 ,...,p m ) andq = (q 1 ,...,q m ). Let x∈R m and y∈R m be the corresponding one-hot encoded random variables, i.e., P (x = e i ) =p i ,∀i = 1,...,m and P(y =e i ) =q i ,∀i = 1,...,m, where e i is the i-th standard basis. Assume further that dist(p,q) is the optimal transport distance between p and q defined in (3.16). Let ˆ p n and ˆ q n be the natural unbiased estimator of p and q based on n i.i.d. samples. In other words, ˆ p n = 1 n P n `=1 x ` and ˆ q n = 1 n P n `=1 y ` , where x ` and y ` ,` = 1,...,n, are i.i.d samples obtained from distributions p and q, respectively. Then, E h dist(ˆ p n+1 , ˆ q n+1 ) i ≤E[dist(ˆ p n , ˆ q n )]. Moreover, lim n→∞ dist(ˆ p n , ˆ q n ) = dist(p,q), almostsurely. 133 Proof: The proof is similar to the standard proof in sample average approximation method; see [167, Proposition 5.6]. Notice that, E h dist(ˆ p n+1 , ˆ q n+1 ) i =E max (λ,γ)∈C m X i=1 γ i ˆ p n+1 i + m X j=1 λ j ˆ q n+1 j =E " max (λ,γ)∈C hˆ p n+1 ,γi +hˆ q n+1 ,λi # = 1 n + 1 E max (λ,γ)∈C hγ, X ` x ` i +hλ, X ` y ` i = 1 n(n + 1) E max (λ,γ)∈C n+1 X t=1 hγ, X `6=t x ` i + n+1 X t=1 hλ, X `6=t y ` i ≤ 1 n(n + 1) E n+1 X t=1 max (λ,γ)∈C hγ, X `6=t x ` i +hλ, X `6=t y ` i = 1 (n + 1) n+1 X t=1 E 1 n max (λ,γ)∈C hγ, X `6=t x ` i +hλ, X `6=t y ` i =E[dist(ˆ p n , ˆ q n )]. The proof of the almost sure convergence follows directly from the facts that lim n→∞ ˆ p n = p, lim n→∞ ˆ q n =q, and the continuity of the distance function. 134
Abstract (if available)
Abstract
Generative models are among the most popular statistical tools that are used not only to learn the underlying distribution of data and generate more samples from it, but also to help in classifying normal data from anomalies. Despite their popularity, from optimization perspective, they result in training problems that are challenging to solve. This is mainly due to the nonconvexity and nonsmoothnes of the landscape of these problems. In this dissertation, we study the landscape and design algorithms for learning via Mixture Models and Generative Adversarial Networks (GANs) as two of the most popular generative models. Regarding the mixture models, we show that when the number of components is two and they are equally weighted, despite nonconvexity, all local optima of the maximum likelihood estimation problem are globally optimum. In addition, in the general case of mixture models, we propose a multi-objective algorithm which has a higher chance of escaping spurious local optima compared to the benchmark methods. Next, we study the training problem of GANs which requires solving min-max saddle point games. We first study a special class of deterministic nonconvex-concave min-max games and define a notion of stationarity for these problems (motivated by the necessary first-order optimality condition) and propose an algorithm that is capable of finding these stationary points. We then study stochastic version of min-max saddle point games that satisfy Minty variational inequality condition and propose an algorithm and obtain its convergence rate for finding stochastic first-order Nash equilibria. We generalize our algorithms and analysis to nonconvex-nonconcave games in which the feasible set for one of the players is small. Our analysis shows that under this assumption, solving a linear approximation of the objective function can find a stationary point of the original problem. Finally, due to the difficulties of solving a general class of nonconvex-nonconcave games that appear from GANs problem, we revisit its formulation and propose a new generative model that does not require solving min-max games and instead uses random discriminator. Our new model results in a more stable optimization mechanisms.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Landscape analysis and algorithms for large scale non-convex optimization
PDF
Difference-of-convex learning: optimization with non-convex sparsity functions
PDF
Mixed-integer nonlinear programming with binary variables
PDF
Topics in algorithms for new classes of non-cooperative games
PDF
Modern nonconvex optimization: parametric extensions and stochastic programming
PDF
Stochastic games with expected-value constraints
PDF
Differentially private and fair optimization for machine learning: tight error bounds and efficient algorithms
PDF
Train scheduling and routing under dynamic headway control
PDF
Provable reinforcement learning for constrained and multi-agent control systems
PDF
Scalable optimization for trustworthy AI: robust and fair machine learning
PDF
Elements of robustness and optimal control for infrastructure networks
PDF
Machine learning in interacting multi-agent systems
PDF
Robustness of gradient methods for data-driven decision making
PDF
Performance trade-offs of accelerated first-order optimization algorithms
PDF
Thermal analysis and multiobjective optimization for three dimensional integrated circuits
PDF
New Lagrangian methods for constrained convex programs and their applications
PDF
Models and algorithms for the freight movement problem in drayage operations
PDF
Traffic assignment models for a ridesharing transportation market
PDF
Thwarting adversaries with unpredictability: massive-scale game-theoretic algorithms for real-world security deployments
PDF
Algorithmic aspects of energy efficient transmission in multihop cooperative wireless networks
Asset Metadata
Creator
Barazandeh, Babak
(author)
Core Title
Algorithms and landscape analysis for generative and adversarial learning
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Industrial and Systems Engineering
Degree Conferral Date
2021-05
Publication Date
09/09/2021
Defense Date
02/25/2021
Publisher
Unerivsity of Southern California. Libraries
(digital)
Tag
generative adversarial networks,min-max saddle point games,non-convex optimization,OAI-PMH Harvest
Format
theses
(aat)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Nakano, Aiichiro (
committee member
), Pang, Jong-Shi (
committee member
), Razaviyayn, Meisam (
dissertation committee chair
)
Creator Email
babakbarazandeh@gmail.com,barazand@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-426153
Unique identifier
UC11666735
Identifier
etd-Barazandeh-9312.pdf (filename),usctheses-c89-426153 (legacy record id)
Legacy Identifier
etd-Barazandeh-9312
Dmrecord
426153
Document Type
Dissertation
Format
theses (aat)
Rights
Barazandeh, Babak
Internet Media Type
application/pdf
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu
Tags
generative adversarial networks
min-max saddle point games
non-convex optimization