ROBUST ESTIMATION OF HIGH DIMENSIONAL PARAMETERS

by

Lang Wang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(APPLIED MATHEMATICS)

December 2021

Copyright 2021 Lang Wang

Acknowledgements

First, I would like to thank my advisor, Stanislav Minsker. Our discussions were often a source of inspiration for me. Thank you very much for helping me grow. When I was facing research challenges and making slow progress, my advisor was always willing to discuss any problem I was working on and consistently helped with advice, which was instrumental to my success. I am grateful that over the past four years my advisor helped me find interesting topics to work on and provided both theoretical and moral support for my work.

Next, I owe special thanks to Larry Goldstein and Wenguang Sun for graciously agreeing to sit on my committee and to read my dissertation. Their insightful feedback has substantially helped me to improve my work.

Last but not least, I want to thank my parents, Zaiguo Wang and Ailing Zhang, for their love and support throughout my academic life.

Table of Contents

Acknowledgements
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Robust estimation of covariance matrices
  2.1 Introduction
  2.2 Preliminaries
    2.2.1 Matrix algebra
    2.2.2 Sub-Gaussian distributions
  2.3 Problem formulation and main results
    2.3.1 Problem formulation
    2.3.2 Main results
  2.4 Application to heavy-tailed data
    2.4.1 Bound in the operator norm
    2.4.2 Bound in the Frobenius norm
  2.5 Numerical experiments
    2.5.1 Numerical algorithm
    2.5.2 Rank-one update of the spectral decomposition
    2.5.3 Numerical results
  2.6 Conclusion and discussion
Chapter 3: Pivotal estimation in sparse linear regression under adversarial contamination
  3.1 Introduction
  3.2 Problem formulation and main results
    3.2.1 Problem formulation
    3.2.2 Notation
    3.2.3 Main results
  3.3 The case of Gaussian design
  3.4 Conclusion and future work
References
Appendices
  A Appendix to Chapter 2
    A.1 Proofs omitted in Section 2.3.2
      A.1.1 Technical tools
      A.1.2 Proof of Theorem 2.3.1
      A.1.3 Proof of Theorem 2.3.2
      A.1.4 Proof of Theorem 2.3.3
    A.2 Proof of Lemma 2.4.1 and Theorem 2.4.1
      A.2.1 Proof of Lemma 2.4.1
      A.2.2 Proof of Theorem 2.4.1
    A.3 Proofs omitted in Section 2.4.2
      A.3.1 Proof of Lemma 2.4.2
      A.3.2 Proof of Theorem 2.4.2
    A.4 Proofs omitted in Section 2.5
      A.4.1 Convergence analysis of the proximal gradient method (Theorem 2.5.1)
      A.4.2 Numerical method of updating eigenvalues (solving equation (2.5.4))
    A.5 Auxiliary technical results
      A.5.1 Detailed derivation of the claim of Remark 2.3.1
      A.5.2 Proof of Lemma A.7
  B Appendix to Chapter 3
    B.1 Proofs omitted in Section 3.2.3
      B.1.1 Proof of Theorem 3.2.1
      B.1.2 Proof of Proposition 3.2.1

List of Figures

2.1 Distribution of the relative error of S in the Frobenius norm
2.2 Distribution of the relative error of $\widetilde{\Sigma}_s$ in the Frobenius norm
2.3 Distribution of the relative error of S in the operator norm
2.4 Distribution of the relative error of $\widetilde{\Sigma}_s$ in the operator norm
2.5 Distribution of the relative error of S in the Frobenius norm
2.6 Distribution of the relative error of $\widetilde{\Sigma}_s$ in the Frobenius norm
2.7 Distribution of the relative error of S in the operator norm
2.8 Distribution of the relative error of $\widetilde{\Sigma}_s$ in the operator norm
2.9 Distribution of the relative error of S in the Frobenius norm
2.10 Distribution of the relative error of $\widetilde{\Sigma}_s$ in the Frobenius norm
2.11 Distribution of the relative error of S in the operator norm
2.12 Distribution of the relative error of $\widetilde{\Sigma}_s$ in the operator norm
2.13 Distribution of the relative error of S in the Frobenius norm
2.14 Distribution of the relative error of $\widetilde{\Sigma}_s$ in the Frobenius norm
2.15 Distribution of the relative error of S in the operator norm
2.16 Distribution of the relative error of $\widetilde{\Sigma}_s$ in the operator norm

List of Algorithms

1 Stochastic proximal gradient descent (SPGD)
2 Rank-one update of the spectral decomposition of $B + \rho b b^T$
Abstract

This dissertation investigates learning scenarios in which a high-dimensional parameter needs to be estimated from grossly corrupted data. The motivation for the topics we study comes from real-world applications, where the data are often obtained from expensive experiments, are subject to possible contamination, and have to be analyzed robustly and efficiently. For example, in the sensor network model (Haupt et al. [25]), one employs a large network of sensors to collect independent measurements of a signal and send them to a central hub for analysis. However, a small portion of the sensors might fail to send the measurements correctly, or might even transmit irrelevant measurements. Therefore, it is important that the estimation procedure is robust to such irregular measurements.

In the first part of the dissertation, we consider a problem of covariance estimation under adversarial contamination. Specifically, we are given a set of independent observations from some distribution with unknown mean and covariance, and we assume that a small fraction of them are adversarially corrupted, namely, they are missing or replaced by arbitrary values. Our goal is to construct an estimator that reliably recovers the covariance in this framework. We will show that the proposed estimator admits a theoretically optimal error bound in some regimes and can be computed efficiently.

In the second part of the dissertation, we consider a problem of robustly estimating the regression coefficients in a linear model under adversarial contamination. Specifically, we are given $n$ pairs of predictor-response values, which are assumed to satisfy a linear model. Moreover, we assume that a small fraction of the response values are adversarially corrupted, and we aim to construct an estimator that reliably recovers the regression coefficients. We will show that the proposed estimator admits a nearly minimax optimal error bound under appropriate conditions and is computationally efficient. Moreover, our method is pivotal in the sense that it does not require any prior knowledge of the standard deviation of the noise term.

Chapter 1

Introduction

Robust estimation, broadly speaking, is an arsenal of techniques which are resistant to outliers and small departures from the model assumptions. As attested by early references such as Tukey [53] and Huber [27], the study of robust estimation has a long history. During the past two decades, a growing number of practical applications created a high demand for tools to recover high-dimensional parameters of interest from grossly corrupted measurements. As a concrete example, consider the face recognition problem in Wright et al. [59]. In this problem, we aim to classify a new test sample $y$, which is assumed to be represented as $y = X\beta$, with $X$ being the matrix of training faces and $\beta$ being the coefficient vector to be estimated. In practice, $\beta$ is a sparse vector lying in a high-dimensional space, and the test face sample $y$ is possibly occluded by unwanted objects such as hats, sunglasses, face masks, etc., which can be viewed as corruption. Wright et al. [59] proposed the mathematical model $y = X\beta + e + \xi$ and applied $\ell_1$-minimization to deal with such a problem. The term $e$ in the model represents the occasional corruption described before, whose magnitude can be arbitrarily large.

This dissertation deals with two robust estimation problems in the high-dimensional setting.
The first problem is related to the estimation of covariance matrices under various types of contamination, which has many applications including sparse graphical model estimation, e.g. Meinshausen and Bühlmann [35]. The second problem can be viewed as a generalization of the classical Lasso estimator, where we aim to construct a robust, pivotal estimator for the regression coefficients. We suggest algorithms for solving these problems that often admit nearly optimal performance bounds as well as computational efficiency. Since the required background for the two problems is not closely related, we provide detailed background for the topics of interest separately in each chapter.

Chapter 2

Robust estimation of covariance matrices

2.1 Introduction

In this chapter, we consider the problem of estimating covariance matrices from $\varepsilon$-corrupted samples: we are given $n$ independent observations from some unknown distribution $\mathcal{D}$ over $\mathbb{R}^d$ with mean $\mu$ and covariance matrix $\Sigma$, where an $\varepsilon$-fraction of them are adversarially corrupted. More specifically, we assume that our observations $Y_1, \dots, Y_n$ satisfy the equation
$$Y_j = Z_j + V_j, \quad j = 1, \dots, n,$$
where the $Z_j$'s come from the target distribution $\mathcal{D}$ and the $V_j$'s are arbitrary (possibly random) vectors such that only an $\varepsilon$-fraction of them are nonzero. Our goal is to construct a robust estimator for the covariance matrix $\Sigma$ in this framework.

This type of problem has been extensively studied in robust statistics; see, for example, Huber [27, 28] and Maronna et al. [34]. However, although many estimators coming out of these papers admit theoretically optimal error bounds, they are hard to compute in general when the dimension is high because the running time is exponential in the dimension (Bernholt [5]).

Recent work by Lai et al. [32] and Diakonikolas et al. [18] introduced the first robust estimators for the covariance matrix $\Sigma$ that are computationally efficient in the high-dimensional case, i.e. the running time is only polynomial in the dimension, assuming that the distribution $\mathcal{D}$ is Gaussian or an affine transformation of a product distribution with a bounded fourth moment. Since the publication of these initial papers, a growing body of subsequent work has appeared. For instance, Cheng et al. [13] developed fast algorithms that nearly match the best-known running time for computing the empirical covariance matrix, assuming that the distribution of $Z$ is Gaussian with zero mean. Chen et al. [12] developed efficient algorithms under significantly weaker conditions on the unknown distribution $\mathcal{D}$, i.e. $\mathcal{D}$ does not have to be an affine transformation of a product distribution. However, these algorithms only achieve a theoretically suboptimal error bound in the Frobenius norm.

The present chapter continues this line of research. We design a double-penalized estimator for the covariance matrix $\Sigma$, which will be shown to admit theoretically optimal error bounds when the "effective rank" of $\Sigma$ (to be defined later) is small, and which can be efficiently computed using traditional numerical methods.

The rest of this chapter is organized as follows. Section 2.2 explains the main notation and background material. Section 2.3 introduces the main results. Section 2.4 presents applications of the main results to heavy-tailed data. Section 2.5 presents numerical experiments. Finally, the proofs of the main results are contained in Appendix A.
2.2 Preliminaries In this section, we introduce the main notations and recall some useful facts that we rely on in the subsequent exposition. Given two real numbers a;b2R, we define a_ b := maxfa;bg, a^ b := minfa;bg. Also, given x2R, we will denotebxc := maxfn2Z : n xg to be the largest integer less than or equal to x. We will separately introduce important results of matrix algebra and sub- Gaussian distributions in the following two subsections. 4 2.2.1 Matrix algebra Assume that A2R d 1 d 2 is a d 1 d 2 matrix with real-valued entries. Let A T denote the transpose of A. A square matrix A2R dd is called an orthogonal matrix if AA T = A T A= I d , where I d is the identity matrix inR dd . The following theorem is well known: Theorem 2.2.1 (Singular Value Decomposition). Let A2R d 1 d 2 be a d 1 d 2 real matrix, then the singular value decomposition of A always exists, and is of the form A= USV T ; where U is a d 1 d 1 orthogonal matrix, whose columns are the eigenvectors of AA T . V is a d 2 d 2 orthogonal matrix, whose columns are the eigenvectors of A T A. S is a d 1 d 2 matrix withS 1;1 =s 1 (A),. . . ,S r;r =s r (A),S k;k = 0 for all k> r andS i; j = 08i6= j, where r= rank(A), ands 1 (A)s 2 (A)s r (A) 0 are the square roots of positive eigenvalues of A T A. They are called the singular values of A. Next, given a square matrix A2R dd , we define the trace of A to be the sum of elements on the main diagonal, namely, tr(A) := d å i=1 A i;i ; where A i;i represents the element on the i th row and i th column of A. We introduce three types of matrix norms and the Frobenius inner product as follows: Definition 2.2.1 (Matrix norms). Given A2R d 1 d 2 , we define the following three types of matrix norms. (i) Operator norm: kAk :=s 1 (A)= q l max (A T A); 5 wherel max (A T A) stands for the largest eigenvalue of A T A. (ii) Frobenius norm: kAk F := v u u t rank(A) å i=1 s 2 i (A)= q tr(A T A): (iii) Nuclear norm: kAk 1 := rank(A) å i=1 s i (A)= tr( p A T A); where p A T A is a nonnegative definite matrix such that( p A T A) 2 = A T A. Definition 2.2.2. Given A;B2R d 1 d 2 , we define the Frobenius inner product as hA;Bi :=hA;Bi F = tr(A T B)= tr(AB T ): It is clear thatkAk 2 F =hA;Ai. We will now introduce matrix functions. Denote S d (R) := n A2R dd : A T = A o to be the set of all symmetric matrices, and recall the spectral theorem on S d (R): Theorem 2.2.2 (Spectral theorem for real symmetric matrices). Any matrix A2S d (R) has d or- thonormal eigenvectors with corresponding real eigenvalues. Consequently, there exists an or- thogonal matrix U and a diagonal matrixL such that A= ULU T . The decomposition is called the spectral decomposition of A. The diagonal elements ofL in the spectral decomposition, denoted asl 1 ;:::;l d , are the eigen- values of A, all of which are real numbers. We say that A2 S d (R) is nonnegative definite (or 6 positive definite) ifl j 0 (orl j > 0),8 j= 1;:::;d. Moreover, we have the following relationship between the spectral decomposition and the singular value decomposition: Proposition 2.2.1. Let S2 S d (R) be a symmetric matrix with spectral decomposition S=å d j=1 l j u j u T j , where the eigenvalues satisfyjl 1 jjl d j 0. Then s j =jl j j; j= 1;:::;d are the singular values of S, and by setting v j = sign(l j )u j ; j= 1;:::;d, we get the singular value decomposition S=å d j=1 s j u j v T j . In particular, if S is a nonnegative definite matrix, then the spectral decomposi- tion of S coincides with the singular value decomposition of S. 
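For illustration, the matrix norms and the effective rank introduced above can be computed with a few lines of NumPy. The sketch below is illustrative only (the dimension d is an arbitrary choice) and checks the example from the previous paragraph, where the effective rank stays close to 2 while the ordinary rank equals d.

```python
import numpy as np

def operator_norm(A):
    # largest singular value sigma_1(A)
    return np.linalg.norm(A, 2)

def frobenius_norm(A):
    # square root of the sum of squared singular values
    return np.linalg.norm(A, "fro")

def nuclear_norm(A):
    # sum of the singular values
    return np.linalg.svd(A, compute_uv=False).sum()

def effective_rank(A):
    # rk(A) = tr(A) / ||A||, for a nonzero symmetric matrix A
    return np.trace(A) / operator_norm(A)

d = 1000
# eigenvalues 1, 1/d, ..., 1/d, as in the example above
A = np.diag([1.0] + [1.0 / d] * (d - 1))

print(np.linalg.matrix_rank(A))   # rank(A) = d
print(effective_rank(A))          # tr(A)/||A|| = 1 + (d-1)/d, close to 2
```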
Next, we define functions on S d (R) as follows: Definition 2.2.3. Given a real-valued function f defined on an intervalTR and a real symmetric matrix A2 S d (R) with the spectral decomposition A= ULU T such thatl j (A)2T; j= 1;:::;d, define f(A) as f(A)= U f(L)U T , where f(L)= f 0 B B B B @ 0 B B B B @ l 1 . . . l d 1 C C C C A 1 C C C C A = 0 B B B B @ f(l 1 ) . . . f(l d ) 1 C C C C A : Finally, we define the effective rank of a matrix A2 S d (R)nf0g as rk(A) := tr(A) kAk : Note that 1 rk(A) rank(A) is always true, and it is possible that rk(A) rank(A) for approxi- mately low-rank matrices A. For instance, consider A2 S d (R) with eigenvaluesl 1 = 1;l 2 == l d = 1=d, whence we have rk(A)= 2d2 d d= rank(A). 7 2.2.2 Sub-Gaussian distributions Given a random variable X on a probability space(W;A;P), and a convex nondecreasing function y :R + 7!R + withy(0)= 0, we define they-norm of X as (Vershynin [57, Section 2.7.1]) kXk y := inf C> 0 :E y jXj C 1 : In particular, in what follows we will mainly consider y 1 (u) := e u 1;u 0 and y 2 (u) := e u 2 1;u 0, which correspond to the sub-exponential norm and sub-Gaussian norm respectively. We will say that a random variable X is sub-Gaussian (or sub-exponential) ifkXk y 2 <¥ (orkXk y 1 < ¥). Also, letkXk L 2 := E jXj 2 1=2 be the L 2 norm of a random variable X. The sub-Gaussian (or sub-exponential) random vector is defined as follows: Definition 2.2.4. A random vector Z inR d with mean m =E[Z] is called L-sub-Gaussian if for every v2R d , there exists an absolute constant L> 0 such that khZm;vik y 2 LkhZm;vik L 2 : (2.2.1) Moreover, Z is called L-sub-exponential ify 2 -norm in (2.2.1) is replaced byy 1 -norm. It is clear that if Z is L-sub-Gaussian, then (Z) is also L-sub-Gaussian. We introduce some important results for sub-Gaussian distributions. Proposition 2.2.2. (Vershynin [57, pp.24]) A mean zero random variable Z is L-sub-Gaussian if and only if there exists an absolute constant K(L)> 0 depending only on L such that P(jZj t) 2expft 2 =K(L) 2 g; for all t 0: 8 Theorem 2.2.3. (Vershynin [57, Theorem 2.6.3]) Let Z 1 ;:::;Z n be i.i.d L-sub-Gaussian random variables with mean zero, and a=(a 1 ;:::;a n )2R n . Then for any t 0, there exists a constant K(L)> 0 depending only on L such that P n å i=1 a i Z i t ! 2exp ( t 2 K(L) 2 kak 2 2 ) ; wherekak 2 2 = a 2 1 ++ a 2 n . Corollary 2.2.1. Let Z 1 ;:::;Z n be i.i.d L-sub-Gaussian random variables with common mean E[Z 1 ]=m and sub-Gaussian normkZ 1 k y 2 = K. Let a2 n (a 1 ;:::;a n )2R d :kak 2 1 o . Then (i) Y :=å n i=1 a i (Z i m) is still L-sub-Gaussian. (ii) There exists an absolute constant c> 0, such thatkYk y 2 cK. Proof. This corollary immediately follows by a combination of Theorem 2.2.3 and Proposition 2.2.2. 2.3 Problem formulation and main results 2.3.1 Problem formulation Let Z 1 ;:::;Z n 2R d be i.i.d. copies of an L-sub-Gaussian random vector Z such thatEZ=m and E(Zm)(Zm) T =S. Assume that we observe a sequence Y j = Z j +V j ; j= 1;:::;n; (2.3.1) where V j ’s are arbitrary (possibly random) vectors such that only a small portion of them are not equal to zero. Namely, we assume that there exists a set of indices Jf1;:::;ng such thatjJj n 9 and V j = 0 for j = 2 J. In what follows, the sample points with j2 J will be called outliers and e := jJj n will denote the proportion of such points. In this case, Y j Y T j = Z j Z T j +V j V T j +V j Z T j + Z j V T j | {z } := p nU j := X j + p nU j ; where rank(U j ) 2 and the p n factor is added for technical convenience. 
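As a quick numerical sanity check of the claim that rank(U_j) is at most 2, the following short sketch (illustrative only; the dimensions and the scale of the outlier are arbitrary choices) forms the matrix collecting every term of Y_j Y_j^T that involves the outlier V_j, and verifies its rank.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 50, 100

Z = rng.standard_normal(d)            # a "good" sample Z_j
V = 1000.0 * rng.standard_normal(d)   # an outlier vector V_j

# U_j = (V_j V_j^T + V_j Z_j^T + Z_j V_j^T) / sqrt(n): the part of Y_j Y_j^T due to V_j
U = (np.outer(V, V) + np.outer(V, Z) + np.outer(Z, V)) / np.sqrt(n)

# every column of U_j lies in span{V_j, Z_j}, hence the rank is at most 2
print(np.linalg.matrix_rank(U))       # 2 in this generic case
```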
Our main goal is to construct an estimator for the covariance matrixS in the presence of outliers V j . In practice, we usually do not know the true meanm of Z. To address this problem, we first recall the definition of U-statistics. Definition 2.3.1 (Hoeffding [26]). Let Y 1 ;:::;Y n (n 2) be a sequence of random variables taking values in a measurable space (S;B). Assume that H :S m 7!S d (R) (2 m n) is anS m - measurable permutation-symmetric kernel, meaning that H(y 1 ;:::;y m )= H(y p 1 ;:::;y p m ) for any (y 1 ;:::;y m )2S m and any permutationp. The U-statistic with kernel H is defined as U n := (n m)! n! å (i 1 ;:::;i m )2I m n H(Y i 1 ;:::;Y i m ); where I m n := (i 1 ;:::;i m ) : 1 i j n;i j 6= i k if j6= k . A particular example of U-statistics is the sample covariance matrix defined as e S s := 1 n 1 n å j=1 (Y j ¯ Y)(Y j ¯ Y) T ; (2.3.2) where ¯ Y := 1 n å n j=1 Y j . Indeed, it is easy to verify that e S s = 1 n(n 1) å (i; j)2I 2 n (Y i Y j )(Y i Y j ) T 2 ; (2.3.3) hence the sample covariance matrix is a U-statistic with kernel H(x;y) := (x y)(x y) T 2 for any x;y2R d : (2.3.4) 10 Note thatE h Y i Y j p 2 i = 0 andE (Y i Y j ) p 2 (Y i Y j ) p 2 T =S for all (i; j)2 I 2 n . Namely, by expressing the sample covariance matrix as a U-statistic in (2.3.3), the explicit estimation of the unknown mean m can be avoided. Therefore, we consider the following settings: e Y i; j := Y i Y j p 2 ; e Z i; j := Z i Z j p 2 ; e V i; j := V i V j p 2 ; 8(i; j)2 I 2 n : (2.3.5) Then e Y i; j e Y T i; j = e Z i; j e Z T i; j + e V i; j e V T i; j + e V i; j e Z T i; j + e Z i; j e V T i; j | {z } := p n(n1) e U i; j := e X i; j + p n(n 1) e U i; j ; where the n(n 1)=jI 2 n j factor equals the total number of e Y i; j ’s, and is added for technical conve- nience. It is easy to check that the following claims hold: (i) e Y i; j = e Z i; j + e V i; j , withE h e Z i; j i = 0 andE h e Z i; j e Z T i; j i =S, for any(i; j)2 I 2 n . Moreover, Corollary 2.2.1 shows that e Z i; j ;(i; j)2 I 2 n has sub-Gaussian distribution. (ii) e Z i; j ’s are identically distributed, but not independent. (iii) Denote e J= n (i; j)2 I 2 n : e V i; j 6= 0 o to be the set of indices such that e V i; j = 0,8(i; j)= 2 e J. Then j e Jj represents the number of outliers in n e Y i; j :(i; j)2 I 2 n o , and we have that j e Jj= 2jJj(njJj)+jJj(jJj 1)=jJj(2njJj 1): (2.3.6) (iv) Rank( e U i; j ) 2. This follows from the fact that for any vector v2R d , e U i; j v2 span n e V i; j ; e Z i; j o . In what follows, we will use the notation U U U I 2 n :=(U 1;2 ;:::;U n;n1 ) to represent the n(n 1)-dimensional sequence with subscripts taking from I 2 n . Similarly, the notation(S;U U U I 2 n ) will represent the(n 2 n+ 1)-dimensional sequence(S;U 1;2 ;:::;U n;n1 ). 11 Now we are ready to define our estimator. Givenl 1 ;l 2 > 0, set ( b S l ; b U U U I 2 n )= argmin S;U 1;2 ;:::;U n;n1 " 1 n(n 1) å i6= j e Y i; j e Y T i; j S p n(n 1)U i; j 2 F +l 1 kSk 1 +l 2å i6= j U i; j 1 # ; (2.3.7) where the minimization is over S;U i; j 2 S d (R),8(i; j)2 I 2 n . Remark 2.3.1. The double penalized least-squares estimator in (2.3.7) is in fact a penalized Hu- ber’s estimator (Donoho and Montanari [19, Section 6]). To see this, we express (2.3.7) as ( b S l ; b U U U I 2 n )= argmin S min U I 2 n " 1 n(n 1) å i6= j e Y i; j e Y T i; j S p n(n 1)U i; j 2 F +l 1 kSk 1 +l 2å i6= j U i; j 1 # ; (2.3.8) and observe that the minimization with respect to U I 2 n in (2.3.8) can be done explicitly. 
It yields that b S l = argmin S ( 2 n(n 1) tr " å i6= j r p n(n1)l 2 2 ( e Y i; j e Y T i; j S) # +l 1 kSk 1 ) ; (2.3.9) where r l (u) := 8 > < > : u 2 2 ; jujl ljuj l 2 2 ; juj>l 8u2R;l2R + is the Huber’s loss function. Details of the derivation are presented in Appendix A.5.1. 2.3.2 Main results In this subsection we state the main results that provide an error bound for the estimator in (2.3.7). We will compare performance of our estimator to that of the sample covariance matrix e S s defined in (2.3.2). When there are no outliers, it is well-known that e S s is a consistent estimator of S with expected error at mostO(d= p n) in the Frobenius norm, namely, e S s S F Cd= p n with 12 probability 99% for some absolute constant C> 0 (see for example, Cai et al. [9]). However, in the presence of outliers, the error for e S s can be large (see Section 2.5 for some examples). On the other hand, recall that e X i; j = e Z i; j e Z T i; j , and our estimator in (2.3.7) admits the following bound. Theorem 2.3.1. Let d > 0 be an absolute constant. Assume that n 2 andjJj c 1 (d)n, where c 1 (d) is a constant depending only ond. Then on the event E = ( l 1 140kSk p n(n 1) p rk(S)+ 4 1 n(n 1) å (i; j)2I 2 n e X i; j S ; l 2 140kSk n(n 1) p rk(S)+ 4 p n(n 1) max (i; j)2I 2 n e X i; j S ) ; the following inequality holds: b S l S 2 F inf S:rank(S) c 2 n 2 l 2 2 l 2 1 ( (1+d)kSSk 2 F + c(d) l 2 1 rank(S)+l 2 2 jJj 2 ) ; where c 2 is an absolute constant and c(d) is a constant depending only ond. The proof of Theorem 2.3.1 is presented in Appendix A.1.2. Remark 2.3.2. The bound in Theorem 2.3.1 contains two terms: (i) The first term,(1+d)kSSk 2 F +c(d)l 2 1 rank(S), does not depend on the number of outliers. When there are no outliers, i.e.jJj= 0, the bound will only contain this part, and in such a scenario Lounici [33] proved that the theoretically optimal bound is b S l S 2 F inf S kS Sk 2 F +CkSk 2 (rk(S)+t) n rank(S) with probability at least 1 e t . By making the smallest choice ofl 1 as specified in (2.3.10), one sees that the first term of our bound coincides with the theoretically optimal bound. (ii) The second term, c(d)l 2 2 jJj 2 , controls the worst possible effect due to the presence of outliers. When more conditions on the outliers are imposed (for example, independence), this bound 13 can be improved. Moreover, Diakonikolas et al. [17] proved that when Z is Gaussian with zero mean, there exists an estimator b S achieving theoretically optimal bound b SS F O(e)kSk, which is independent of the dimension d. In our case, by making the small- est choice of l 2 as specified in (2.3.11), we can show that the error bound scales like C(log(n)+ rk(S))ekSk, where C> 0 is an absolute constant. The additional factor(log(n)+ rk(S)) shows that our bound is sub-optimal in general. However, if rk(S) is small, our bound is essentially optimal up to a logarithmic factor. Note that in Theorem 2.3.1 the regularization parameters l 1 ;l 2 should be chosen sufficiently large such that the eventE happens with high probability. Under the assumption that Z j ; j = 1;:::;n are independent, identically distributed L-sub-Gaussian vectors, we can prove the follow- ing result which gives an explicit lower bound on the choice ofl 1 . Theorem 2.3.2. Assume that Z is L-sub-Gaussian with mean m and covariance matrix S. Let Z 1 ;:::;Z n be independent copies of Z, and define e Z i; j := (Z i Z j) p 2 ;(i; j)2 I 2 n . Then e Z i; j ;(i; j)2 I 2 n are mean zero L-sub-Gaussian random vectors with the same covariance matrixS. 
Moreover, for any t 1, there exists c(L)> 0 depending only on L such that 1 n(n 1) å i6= j e Z i; j e Z T i; j S c(L)kSk r rk(S)+t n + rk(S)+t n ! with probability at least 1 2e t . Theorem 2.3.2 along with the definition of eventE indicates that it suffices to choosel 1 satis- fying l 1 c(L)kSk r rk(S)+t n ; (2.3.10) given that n rk(S)+t. The next theorem provides a lower bound for the choice ofl 2 : 14 Theorem 2.3.3. Assume that Z is L-sub-Gaussian with mean zero and Z 1 ;:::;Z n are samples of Z (not necessarily independent), then there exists c(L)> 0 depending only on L, such that for any t 1, max j=1;:::;n kZ j Z T j Sk c(L)kSk(rk(S)+ log(n)+t) with probability at least 1 e t . Note that Theorem 2.3.3 does not require independence of samples, so it can be applied to the mean zero, L-sub-Gaussian vectors e Z i; j ;(i; j)2 I 2 n to deduce that max i6= j e Z i; j e Z T i; j S c(L)kSk[rk(S)+ log(n(n 1))+t] with probability at least 1e t . Combining this bound with the definition of eventE , we conclude that it suffices to choosel 2 satisfying l 2 c(L)kSk (rk(S)+ log(n)+t) n : (2.3.11) By choosing the smallest possiblel 1 ;l 2 as indicated in (2.3.10)(2.3.11), we deduce the following corollary: Corollary 2.3.1. Let d > 0 be an absolute constant. Assume that n rk(S)+ log(n) andjJj c 1 (d)n, where c 1 (d) is a constant depending only ond. Then we have that b S l S 2 F inf S:rank(S)c 0 2 n rk(S)+log(n) ( (1+d)kSSk 2 F + c(L;d)kSk 2 h rk(S)+ log(n) n rank(S)+ (rk(S)+ log(n)) 2 n 2 jJj 2 i ) (2.3.12) with probability at least 1 3 n , where c 0 2 is an absolute constant and c(L;d) is a constant depending only on L andd. 15 Note that the last term in (2.3.12) can be equivalently written in terms of e, the proportion of outliers, as c(L;d)kSk 2 (rk(S)+ log(n)) 2 e 2 : (2.3.13) 2.4 Application to heavy-tailed data In this section, we consider the application as well as possible improvements of the previously discussed results to heavy-tailed data. Let Y2R d be a random vector with meanE[Y]=m, covari- ance matrixS=E (Ym)(Ym) T , and such thatE h kYmk 4 2 i <¥. Assume that Y 1 ;:::;Y n are i.i.d copies of Y , and as before our goal is to estimateS. Sincem is unknown and the estimation of m is non-trivial for the heavy tailed-data, we consider the setting e Y i; j =(Y i Y j )= p 2 and denote, for brevity, H i; j := e Y i; j e Y T i; j : We have previously shown thatE h e Y i; j i = 0 andE H i; j =S, so the mean estimation is no longer needed for e Y i; j . Givenl 1 ;l 2 > 0, we propose the following estimator forS: b S l = argmin S ( 1 n(n 1) tr " å i6= j r p n(n1)l 2 2 ( e Y i; j e Y T i; j S) # + l 1 2 kSk 1 ) ; (2.4.1) which is the minimizer of the penalized Huber’s loss function: L(S)= 1 n(n 1) tr " å i6= j r p n(n1)l 2 2 ( e Y i; j e Y T i; j S) # + l 1 2 kSk 1 : (2.4.2) Recall that in Section 2.3.1 we showed that the estimator b S l in (2.4.1) is equivalent to the double- penalized least-squares estimator in (2.3.7). The key idea to derive an error bound for b S l is mo- tivated by Prasad et al. [45], which suggests that it is possible to decompose any heavy-tailed 16 distribution as a mixture of a “well-behaved” and a contamination components. The decomposi- tion bridges the gap between the heavy-tailed model and the outlier model (2.3.1), allowing us to follow an argument similar to that in Section 2.3. 
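Before making this decomposition precise, we note that the objective in (2.4.1)-(2.4.2) is straightforward to evaluate numerically: by Definition 2.2.3, applying the Huber function to a symmetric matrix amounts to applying it to the eigenvalues, so that the trace of the Huber-transformed matrix is the sum of the Huber losses of the eigenvalues. The following sketch is an illustration only (the helper names are ours, and no attempt at efficiency is made); it evaluates the penalized Huber loss at a candidate symmetric matrix S given the observations Y.

```python
import numpy as np
from itertools import permutations

def huber(u, lam):
    # scalar Huber loss rho_lambda(u), applied elementwise
    return np.where(np.abs(u) <= lam, u ** 2 / 2, lam * np.abs(u) - lam ** 2 / 2)

def penalized_huber_loss(S, Y, lam1, lam2):
    # evaluate the objective in (2.4.2) at a candidate symmetric matrix S
    n = Y.shape[0]
    N = n * (n - 1)
    thresh = np.sqrt(N) * lam2 / 2          # truncation level of the Huber function
    total = 0.0
    for i, j in permutations(range(n), 2):  # all ordered pairs with i != j
        y_tilde = (Y[i] - Y[j]) / np.sqrt(2)
        M = np.outer(y_tilde, y_tilde) - S
        # tr[rho(M)] equals the sum of rho evaluated at the eigenvalues of M
        total += huber(np.linalg.eigvalsh(M), thresh).sum()
    # nuclear norm of the symmetric matrix S: sum of absolute eigenvalues
    nuclear = np.abs(np.linalg.eigvalsh(S)).sum()
    return total / N + lam1 / 2 * nuclear
```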
To be precise, we consider the decomposition e Y i; j = e Y i; j 1 n e Y i; j 2 R o | {z } := e Z i; j + e Y i; j 1 n e Y i; j 2 > R o | {z } := e V i; j ; (2.4.3) where R> 0 is the truncation level that will be specified later. In the following two subsections, we will separately show that the estimator b S l in (2.4.1) is close toS in both the operator norm and the Frobenius norm. 2.4.1 Bound in the operator norm In this subsection we show that b S l is close toS in the operator norm with high probability. We will be interested in the effective rank of the matrixE (H 1;2 S) 2 , and denote it as r H := rk(E (H 1;2 S) 2 )= tr(E (H 1;2 S) 2 ) E[(H 1;2 S) 2 ] : Minsker and Wei [39, Lemma 4.1] suggest that under the bounded kurtosis assumption (to be specified later, see (2.4.4)), we can upper bound r H by the effective rank of S, namely, r H C rk(S) with some absolute constant C. We first present a lemma which shows that if the tuning parameterl 1 is too large, the estimator b S l will be a zero matrix with high probability. Lemma 2.4.1. Assume that t 0,s E (H 1;2 S) 2 1 2 and n max ( 64a 2 t r H ; 4b 2 t 2 kSk 2 s 2 ) ; where a,b are sufficiently large constants. Then for anyl 1 > s 4 p n t ; we have that argmin S L(S)= 0 with probability at least 1 e t . 17 Lemma 2.4.1 immediately implies that for the choice ofl 1 > s 4 p n t , b S l S =kSk with high probability, which is bounded by the largest singular value ofS. The following theorem provides a bound for the choice ofl 1 s 4 p n t . Theorem 2.4.1. Assume that t 1 is such that r H t n c 3 for some sufficiently small constant c 3 ,s E (H 1;2 S) 2 1 2 , and n max n 64a 2 t r H ; 4b 2 t 2 kSk 2 s 2 o for some sufficiently large constants a, b. Then forl 1 s 4 p n t andl 2 s q 1 (n1)t , we have that b S l S 20 39 l 1 + 80 39 s r t n + 40 39 l 2 t with probability at least 1 8 3 r H + 1 e t . The proofs of Lemma 2.4.1 and Theorem 2.4.1 are presented in Appendix A.2. Remark 2.4.1. The bound in Theorem 2.4.1 is close to that in Minsker and Wei [39, Corrolary 4.1], with an additional term 20 39 l 1 . This term comes from the penalization factor l 1 2 kSk 1 , showing that by “shrinking” our estimator to a low rank matrix, we introduce a bias term bounded by a multiple of the tuning parameterl 1 . Remark 2.4.2. According to Minsker and Wei [39, Lemma 4.1], the “matrix variance” parameter s 2 appearing in the statement of Theorem 2.4.1 can be bounded bykSktr(S)= rk(S)kSk 2 under the bounded kurtosis assumption. More precisely, if we assume that the kurtoses of the linear forms hY;vi are uniformly bounded by K, meaning that sup v:kvk 2 =1 E[hYE[Y];vi] 4 h E[hYE[Y];vi] 2 i 2 K (2.4.4) 18 for any v2R d . Then we have that E (H 1;2 S) 2 K rk(S)kSk 2 ands can be chosen as C p K rk(S)kSk with some absolute constant C. Moreover, in this case the assumptions on n and t in Lemma 2.4.1 and Theorem 2.4.1 can be reduced to a single assumption that r H t n c 0 3 for some sufficiently small constant c 0 3 . We will formally state condition (2.4.4) in the next subsection and derive additional results based on it. 2.4.2 Bound in the Frobenius norm In this subsection we show that b S l is close to the covariance matrix of Y in the Frobenius norm with high probability, under a slightly stronger assumption on the fourth moment of Y . Definition 2.4.1. 
A random vector Y2R d is said to satisfy an L 4 L 2 norm equivalence with constant K (also referred to as the bounded kurtosis assumption), if there exists a constant K 1 such that E h hYm;vi 4 i 1 4 K E h hYm;vi 2 i 1 2 (2.4.5) for any v2R d , wherem =E[Y]. As previously discussed in Remark 2.4.2, condition (2.4.5) allows us to connect the matrix variance parameters 2 with rk(S Y ), the effective rank of the covariance matrixS Y . We will assume that Y satisfies (2.4.5) with a constant K throughout this subsection. Recall the decomposition e Y i; j = e Y i; j 1 n e Y i; j 2 R o | {z } := e Z i; j + e Y i; j 1 n e Y i; j 2 > R o | {z } := e V i; j ; (2.4.6) where R> 0 is the truncation level that will be specified later. DenoteS Y :=E h e Y 1;2 e Y T 1;2 i , S Z := E h e Z 1;2 e Z T 1;2 i and recall that our goal is to estimate S Y . Note that e Z i; j 2 R almost surely, so 19 equation (2.4.6) represents e Y i; j as a sum of a bounded vector e Z i; j and a “contamination” component e V i; j , which is similar to (2.3.1). On the other hand, we note that the truncation level R should be chosen to be neither too large (to get a better truncated distribtuion) nor too small (to reduce the bias introduced by the truncation). Mendelson and Zhivotovskiy [37] suggest that a reasonable choice is given as follows: R= tr(S Y )kS Y kn log rk(S Y ) + log(n) ! 1 4 : (2.4.7) Denote e J= n (i; j)2 I 2 n : e Y i; j 2 > R o to be the set of indices corresponding to the nonzero outliers (i.e. e V i; j 6= 0), ande := j e Jj n(n1) to be the proportion of outliers. Under the above setup, we can derive the following lemma which provides an upper bound one with high probability: Lemma 2.4.2. Assume that Y satisfies the L 4 L 2 norm equivalence with constant K, and R is chosen as in (2.4.7) . Then e c(K) rk(S Y ) log rk(S Y ) + log(n) n (2.4.8) with probability at least 1 1 n , where c(K) is a constant only depending on K. The proof of Lemma 2.4.2 is presented in Appendix A.3.1. It is worth noting that the proportion of “outliers” (in a sense of the definition above) in the heavy-tailed model can be pretty small when the sample size n is large. Consequently, we can derive the following bound: Theorem 2.4.2. Given A 1, assume that Y2R d is a random vector with meanE[Y]=m, covari- ance matrixS Y =E (Ym)(Ym) T , and satisfying an L 4 L 2 norm equivalence with constant K. Let Y 1 ;:::;Y n be i.i.d samples of Y , and let e Z i; j be defined as in (2.4.6). Assume that r H log(n) n c 3 for some sufficiently small constant c 3 , n c 4 (K)rk(S Y )log(n rk(S Y )) for some constant c 4 (K) depending only on K, and rank(S Y ) c 2 (K) n log(n) log nrk(S Y ) for some constant c 2 (K) depending 20 only on K. Then for l 1 = c(K)kS Y k r rk(S Y ) h log nrk(S Y )) i n and l 2 = c(K)kS Y k q rk(S Y )log(n) An , we have that b S l S Y 2 F c(K)kS Y k 2 " rk(S Y )log n rk(S Y ) n rank(S Y )+ A rk(S Y ) 2 log(n) 3 n # with probability at least 1 ( 8 3 r H +1) n A 4 n , where c(K) is a constant depending only on K. The proof of Theorem 2.4.2 is presented in Appendix A.3.2. Remark 2.4.3. Let us compare the result in Theorem 2.4.2 to the bound of Corollary 2.3.1. (i) The first part of the bound, c(K)kS Y k 2 rk(S Y )log n rk(S Y ) n rank(S Y ); has the same order as in Corollary 2.3.1 (up to a logarithmic factor), under the assumption thatS Y has low rank. This part of the bound is theoretically optimal according to Remark 2.3.2. 
(ii) The second part of the bound, c(K)kS Y k 2 rk(S Y ) 2 log(n) 3 n ; (2.4.9) controls the error introduced by the outliers. It is much smaller than the corresponding quantity in Corollary 2.3.1 when e, the proportion of the outliers, is only assumed to be a constant. The improvement is mainly due to the special structure of the heavy-tailed data, namely, the “outlier” term e V i; j is nonzero if and only if the “well-behaved” term e Z i; j equals zero. Without this special structure, one can only apply Theorem 2.3.1 directly and derive a sub-optimal bound of order c(K)kS Y k 2 rk(S Y ) 3 log(n) 3 n : 21 2.5 Numerical experiments In this section we present the results of our numerical experiments. Recall that our loss function is e L(S;U U U I 2 n )= 1 n(n 1) å i6= j e Y i; j e Y T i; j S p n(n 1)U i; j 2 F +l 1 kSk 1 +l 2å i6= j U i; j 1 : (2.5.1) We are aiming to find( b S l ; b U U U I 2 n ), the minimizer of (2.5.1), numerically. Since we are only interested in b S l , equation (2.3.9) suggests that it suffices to minimize the following function: L(S) := 1 n(n 1) tr å i6= j r p n(n1)l 2 2 ( e Y i; j e Y T i; j S)+ l 1 2 kSk 1 ; (2.5.2) where r l (u) := 8 > < > : u 2 2 ; jujl ljuj l 2 2 ; juj>l 8u2R;l2R + is the Huber’s loss function. We will introduce the proximal gradient descent method in the next subsection, which helps us find the minimizer of (2.5.2) numerically. 2.5.1 Numerical algorithm In this section, we will state our algorithm for minimizing L(S). We start with an introduction to the proximal gradient method (see for example, Combettes and Wajs [14]). Suppose that we want to minimize the function f(x)= g(x)+ h(x), where g is convex, differentiable h is convex, not necessarily differentiable We define the proximal mapping and the proximal gradient descent method as follows: 22 Definition 2.5.1. The proximal mapping of a convex function h at the point x is defined as: prox h (x)= argmin u h(u)+ 1 2 ku xk 2 2 : Definition 2.5.2 (Proximal gradient descent(PGD) method). The proximal gradient descent method for solving the problem argmin x f(x)= argmin x g(x)+ h(x) starts from an initial point x (0) , and updates as: x (k) = prox a k h x (k1) a k Ñg(x (k1) ) ; wherea k > 0 is the step size. We have the following convergence result: Theorem 2.5.1. Assume thatÑg is Lipschitz continuous with constant L> 0: kÑg(x)Ñg(y)k Lkx yk and the optimal value f is finite and achieved at the point x . Then the proximal gradient algo- rithm with constant step sizea k =a L will yield anO( 1 k ) convergence rate, i.e. f(x (k) ) f C k ; 8k2f1;2;:::g: Theorem 2.5.1 is well known (see for example, Beck [2, Chapter 10]), but a detailed proof in our case is given in Appendix A.4.1 for the convenience of the reader. Moreover, when g(x)= 1 n å n i=1 g i (x), where g 1 ;:::;g n are convex functions andÑg 1 ;:::;Ñg n are Lipschitz continuous with a common constant L> 0, the update step of PGD will require the evaluation of n gradients, which is expensive when n is large. A natural improvement is to consider the stochastic proximal gradient descent method (SPGD), where at each iteration k= 1;2;:::, we pick an index i k randomly from f1;2;:::;ng, and take the following update: x (k) = prox a k h x (k1) a k Ñg i k (x (k1) ) : 23 The advantage of SPGD over PGD is that the computational cost of SPGD per iteration is 1=n that of the PGD. On the other hand, since the random sampling in SPGD introduces additional variance, we need to choose a diminishing step size a k =O( 1 k ). 
As a result, the SPGD only converges at a sub-linear rate (see Nitanda [44]). To this end, we will consider the “mini-batch” PGD, which has been previously explored and widely used in large-scale learning problems (see, e.g., Shalev- Shwartz et al. [47], Gimpel et al. [22], Dekel et al. [16], Khirirat et al. [29]). This method picks a small batch of indices rather than one at each iteration to calculate the gradient, and in such a way we are able balance the computational cost of PGD and the additional variance of SPGD. The algorithm is summarized in Algorithm 1. Algorithm 1 Stochastic proximal gradient descent (SPGD) Input: number of iterations T , step sizeh t , batch size b, tuning parametersl 1 andl 2 , initial estimation S 0 , sample size n, dimension d. 1: for t= 1;2;:::;T do 2: (1) Randomly pick i t ; j t 2f1;2;:::;ng one by one without replacement. 3: (2) Compute G t =Ñg i; j (S t )=r 0 p n(n1)l 2 2 ( e Y i; j e Y T i; j S t ). 4: (3) If b> 1, then repeat (1)(2) for b times and save the average gradient in G t . 5: (4) (gradient update) T t+1 = S t G t : 6: (5) (proximal update) S t+1 = argmin S n 1 2 S T t+1 2 F + l 1 2 kSk 1 o =g l 1 2 (T t+1 ); whereg l (u)= sign(u)(jujl) + . 7: end for Output: S T+1 2.5.2 Rank-one update of the spectral decomposition Note that at each iteration of our algorithm, we need to compute the spectral decomposition of ( e Y i; j e Y T i; j S t ), which is computationally expensive. However, since e Y i; j e Y T i; j is a rank-one matrix and 24 S t was already saved in the spectral decomposition form after previous iteration, the problem of computing the spectral decomposition of ( e Y i; j e Y T i; j S t ) can be viewed as a rank-one update of the spectral decomposition, which has been extensively studied (see for example, Bunch et al. [8], and Stange [48]). In this subsection we will show how to use this idea to improve our algorithm. Consider e B= B+rbb T , where the spectral decomposition B= QDQ T is known, r2R and b2R d . Our target is to compute the spectral decomposition of e B. Note that e B= B+rbb T = Q(D+rzz T )Q T ; (2.5.3) where b= Qz, so it suffices to compute the spectral decomposition of D+rzz T . We denote z= (z 1 ;:::;z d ) T , and without loss of generality, we can assume thatkzk 2 = 1. The following theorem is fundamental for our algorithm. Theorem 2.5.2. (Bunch et al. [8, Theorem 1]) Let C= D+rzz T , where D is diagonal,kzk 2 = 1. Let d 1 d 2 ::: d d be the eigenvalues of D, and let e d 1 e d 2 ::: e d d be the eigenvalues of C. Then e d i = d i +rm i , 1 i d whereå n i=1 m i = 1 and 0m i 1. Moreover, d 1 e d 1 d 2 e d 2 ::: d d e d d ifr> 0 and e d 1 d 1 e d 2 d 2 ::: e d d d d ifr< 0. Finally, if d i ’s are distinct and all the elements of z are nonzero, then the eigenvalues of C strictly separate those of D. There are several cases where we can deflate the problem (i.e. reduce the size of the problem): (1) Ifz i = 0 for some i, then e d i = d i and the corresponding eigenvector remains unchanged. This is because(D+rzz T )e i = d i e i asz i = 0. (2) Ifjz i j= 1 for some i, then e d i = d i +r and the corresponding eigenvector remains unchanged. Moreover, in this casez j = 0 for all j6= i, so e d j = d j and their eigenvectors are the same, so the problem is done. (3) If d i has a multiplicity r 2, we can reduce the size of the problem via the following steps: (i) Let Q 1 =[q i 1 ;:::;q i r ]2R dr , wherefq i 1 ;:::;q i r g are the eigenvectors corresponding to d i . Also, set z 1 = Q T 1 z, i.e. 
z 1 contains rows corresponding to d i . 25 (ii) Construct an Householder transformation H2R rr such that Hz 1 =kz 1 k 2 e 1 , and de- fine ¯ Q 1 = Q 1 H T . (iii) Replace q i 1 ;:::;q i r by the columns of ¯ Q 1 , and z 1 by ¯ Q T 1 z 1 =kz 1 k 2 e 1 . This introduces (r 1) more zero entries to z and possibly an entry with absolute value equals one. An application of (1)(2) gives us(r 1) (or r) more eigen-pairs of D+rzz T . After the deflation step, it remains to work with a k k problem (k d), in which the eigenvalues d i are distinct andz i 6= 0 for all i. We will compute the eigenvalues and eigenvectors separately. First, Golub [24] showed that the eigenvalues of C= D+rzz T are the zeros ofw(l), where w(l)= 1+r k å j=1 z 2 j d j l : Alternatively, since e d 1 <:::< e d k and e d i = d i +rm i , for each i= 1;:::;k we can compute m i by solvingw i (m i )= 0, where w i (m)= 1+ k å j=1 z 2 j d j m : (2.5.4) andd j = d j d i r . Bunch et al. [8] proved that we can solvew i (m)= 0 with a numerical method that converges quadratically. Details of the numerical method are presented in Appendix A.4.2. Second, after computing the eigenvalues e d 1 ;:::; e d k , we can calculate the corresponding eigen- vectors of C= D+rzz T by solving Ce q i = e d i e q i , i= 1;:::;k. Theorem 5 in Bunch et al. [8] shows thate q i can be computed via e q i = D 1 i z D 1 i z 2 ; (2.5.5) where D i := D e d i I. Finally, once we obtained the spectral decomposition of D+rzz T = ¯ Q e D ¯ Q T , we can easily get the decomposition of B+rbb T =(Q ¯ Q) e D(Q ¯ Q) T . Note that computing k eigen- vectors via (2.5.5) costsO(k 3 ) and the matrix multiplication Q ¯ Q in the last step costsO(d 3 ), so the overall complexity of the algorithm is stillO(d 3 ). This can be further improved by exploiting 26 the special structure of ¯ Q, which is given by the product (see for example, Stange [48] and Gandhi and Rajgor [21]): ¯ Q= 2 6 6 6 6 4 z 1 . . . z d 3 7 7 7 7 5 | {z } :=A 2 6 6 6 6 4 1 d 1 e d 1 1 d 1 e d d . . . . . . 1 d d e d 1 1 d d e d d 3 7 7 7 7 5 | {z } :=C 2 6 6 6 6 4 k ¯ c 1 k 2 . . . k ¯ c d k 2 3 7 7 7 7 5 1 (2.5.6) where ¯ C := AC=[ ¯ c 1 ;:::; ¯ c d ], ¯ c i represents the i th column of ¯ C, andk ¯ c i k 2 is the Euclidean norm of ¯ c i . Using (2.5.6), we can evaluate the matrix multiplication Q ¯ Q through the following steps: (1) Compute QA := U =[u 1 ;:::;u d ], where u i =z i q i and q i is the i th column of Q. This step is straightforward and requiresO(d 2 ) computational time. (2) Let u i be the i th row of U. Define e U = UC= 2 6 6 6 6 4 u 1 C . . . u d C 3 7 7 7 7 5 ; which requires to evaluate the product of a vector u i and a Cauchy matrix C d times. The prob- lem of multiplying a Cauchy matrix with a vector is called Trummer’s problem, and Gandhi and Rajgor [21] provide an algorithm which efficiently computes such matrix-vector product inO(d log 2 d) time. Consequently, the complexity of this step isO(d 2 log 2 d). (3) Compute the matrix product e U 2 6 6 6 6 4 k ¯ c 1 k 2 . . . k ¯ c d k 2 3 7 7 7 7 5 1 : This step is again straightforward and can be done inO(d 2 ) time. 27 The overall complexity for the computation of Q ¯ Q is now reduced toO(d 2 log 2 d), which is much smaller thanO(d 3 ) when d is large. We summarized our rank-one update algorithm in Algorithm 2. Algorithm 2 Rank-one update of the spectral decomposition of B+rbb T Input: orthogonal matrix Q and vector d such that B= Qdiag(d)Q T , constantr, vector b 1: Set Qb= z. Ifkzk 2 6= 1, then further setr =rkzk 2 2 and z= z=kzk 2 . 
2: Handle deflation cases, and record indices that have not done as a vector d sub . 3: Compute eigenvalues of the d sub d sub sub-problem by solving (2.5.4) numerically. 4: Compute eigenvectors of the d sub d sub sub-problem with (2.5.5). 5: Combine the resulting eigenvalues in e d and eigenvectors in ¯ Q. 6: Compute e Q= Q ¯ Q. Output: orthogonal matrix e Q and vector e d. 2.5.3 Numerical results In this section we present some numerical results with different parameter settings. First, note that if we start with S 0 = 0 dd , we can easily compute the gradient in the first step of proximal gradient descent via G=r 0 p Nl 2 2 ( e Y i; j e Y T i; j )= r 0 p Nl 2 2 ( e Y i; j 2 2 ) e Y i; j 2 2 e Y i; j e Y T i; j : (2.5.7) Here, no explicit spectral decomposition was required. Minsker and Wei [39, Remark 4.1] provide details supporting the claim that the full gradient update at the first step helps to improve the initial guess of the estimator. Therefore, we will start with S 0 = 0 dd , run one step of PGD with the full data set, and use the output as the initial estimate of the solution. Now consider the following parameter settings: d = 100, n= 100,jJj= 3, m =(0;:::;0) T , S= diag(10;1; 1 100 ;:::; 1 100 ). The samples are generated as follows: generate n= 100 indepen- dent samples Z j from the Gaussian distributionN (m;S), and then replacejJj of them (randomly chosen) with Z j +V j , where V j ; j2 J are “outliers” to be specified later. The final results after 28 replacement, denoted as Y j ’s, are the samples we observe and that will be used as inputs for the SPGD algorithm. Next, we calculate e Y i; j = (Y i Y j ) p 2 ;i6= j and perform our algorithm with K= 500 steps and the diminishing step sizea k = 1=k. The initial value S 0 is determined by a one-step full gradient update (2.5.7). To analyze the performance of estimators, we define relative error of the estimator S in the Frobenius norm := kSSk F kSk and relative error of the estimator S in the operator norm := kSSk kSk ; where S is an arbitrary estimator. We will compare the performance of the estimator S produced by our algorithm with the performance of the sample covariance matrix e S s introduced in (2.3.2). Here are some results corresponding to different types of outliers: (1) Constant outliers. Consider the outliers V j =(100;:::;100) T ; j2 J. We performed 200 repetitions of the ex- periment with l 1 = 3, l 2 = 1, and recorded S , e S s for each run. Histograms illustrating the distributions of relative errors are shown in Figure 2.1 and 2.2. Figure 2.1: Distribution of the relative error of S in the Frobenius norm 29 Figure 2.2: Distribution of the relative error of e S s in the Frobenius norm The histograms show that e S s always produces a relative error in the Frobenius norm around 29.3, while S always produces a relative error in the Frobenius norm around 0.3. The average and maximum (over 200 repetitions) relative errors of S were 0:3250 and 0:5867 respectively, with the standard deviation of 0:0995. The corresponding values for e S s were 29:2512, 29:5641 and 0:1125. It is clear that in the considered scenario, estimator S performed noticeably better than the sample covariance e S s . In the meanwhile, the following histograms (Figure 2.3 and 2.4) show that S produces smaller relative errors in the operator norm as well. The average and maximum relative errors of S in the operator norm were 0:2734 and 0:5699 respectively, with the standard deviation of 0:0998. 
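For concreteness, the data-generating step just described can be sketched as follows. The code is illustrative only: the random seed is arbitrary, and the particular outliers shown here correspond to the constant-outlier setting (1) below.

```python
import numpy as np

rng = np.random.default_rng(42)

d, n, n_out = 100, 100, 3                       # dimension, sample size, |J|
mu = np.zeros(d)
Sigma = np.diag([10.0, 1.0] + [0.01] * (d - 2))

# n independent samples from N(mu, Sigma)
Z = rng.multivariate_normal(mu, Sigma, size=n)

# replace |J| randomly chosen samples Z_j by Z_j + V_j
J = rng.choice(n, size=n_out, replace=False)
V = np.zeros((n, d))
V[J] = 100.0            # constant outliers V_j = (100, ..., 100)^T, setting (1) below
Y = Z + V               # observed samples, the input of the SPGD algorithm

# pairwise differences (Y_i - Y_j) / sqrt(2) over all ordered pairs with i != j
idx_i, idx_j = np.where(~np.eye(n, dtype=bool))
Y_tilde = (Y[idx_i] - Y[idx_j]) / np.sqrt(2)

# sample covariance, used as the baseline estimator in the comparisons below
S_sample = np.cov(Y, rowvar=False)
```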
The corresponding values for e S s were 29:3981, 29:7125 and 0:1131. 30 Figure 2.3: Distribution of the relative error of S in the operator norm Figure 2.4: Distribution of the relative error of e S s in the operator norm (2) Spherical Gaussian outliers. Consider the case that the outliers V j are drawn independently from a spherical Gaussian distri- butionN (m V ;S V ), wherem V =(0;:::;0) T ,S V = diag(100;:::;100). In this case, the outliers affect Z j uniformly in all directions. We performed 200 repetitions of the experiment with l 1 = 3, l 2 = 1, and recorded S , e S s for each run. Histograms illustrating the distributions of relative errors are shown in Figure 2.5 and 2.6: Figure 2.5: Distribution of the relative error of S in the Frobenius norm 31 Figure 2.6: Distribution of the relative error of e S s in the Frobenius norm The histograms show that e S s always produces a relative error in the Frobenius norm around 17, while S always produces a relative error in the Frobenius norm around 0.3. The average and maximum (over 200 repetitions) relative errors of S were 0:2865 and 0:6048 respectively, with the standard deviation of 0:1004. The corresponding values for e S s were 17:4683, 21:1723 and 1:4154. It is clear that in the considered scenario, estimator S performed noticeably better than the sample covariance e S s . In the meanwhile, the following histograms (Figure 2.7 and 2.8) show that S produces smaller relative errors in the operator norm as well. The average and maximum relative errors of S in the operator norm were 0:2690 and 0:6009 respectively, with the standard deviation of 0:1060. The corresponding values for e S s were 11:9925, 15:2436 and 1:1811. 32 Figure 2.7: Distribution of the relative error of S in the operator norm Figure 2.8: Distribution of the relative error of e S s in the operator norm (3) Outliers that “erase” some observations. Consider the case that the outliers are given as V j =bZ j for j2 J;b2R. In this case, the outliers erase (when b =1), amplify (when b > 0) or negatively amplify (when b <1) some sample points Z j . We performed 200 repetitions of the experiment withl 1 =l 2 = 0:4, b =50 and recorded S , e S s for each run. Histograms illustrating the distributions of relative errors are shown in Figure 2.9 and 2.10: Figure 2.9: Distribution of the relative error of S in the Frobenius norm 33 Figure 2.10: Distribution of the relative error of e S s in the Frobenius norm The histograms show that S always produces a relative error in the Frobenius norm around 0.13, while e S s produces a relative error in the Frobenius norm around 50. Note that unlike pre- vious examples, the performance of e S s is unstable in the current settings, with relative errors raising to 400 occasionally. The average and maximum (over 200 repetitions) relative errors of S were 0:1719 and 0:3692 respectively, with the standard deviation of 0:0680. The corre- sponding values for e S s were 80:1270, 366:5612 and 61:6144. It is clear that in the considered scenario, estimator S performed noticeably better than the sample covariance e S s . In the meanwhile, the following histograms (Figure 2.11 and 2.12) show that S produces smaller and more stable relative errors in the operator norm as well. The average and maxi- mum relative errors of S in the operator norm were 0:1661 and 0:3689 respectively, with the standard deviation of 0:0721. The corresponding values for e S s were 79:6248, 368:3944 and 62:3943. 
34 Figure 2.11: Distribution of the relative error of S in the operator norm Figure 2.12: Distribution of the relative error of e S s in the operator norm (4) Outliers in a particular direction. Finally we consider the case that the outliers are all orthogonal (or parallel) to the subspace spanned by the first M principal components of Z, where Z is an n d matrix with Z T j on each row. We performed 200 repetitions of the experiment withl 1 = 3,l 2 = 1, M= 1 (orthogonal case) and recorded S , e S s for each run. Histograms illustrating the distributions of relative errors are shown in Figure 2.13 and 2.14: Figure 2.13: Distribution of the relative error of S in the Frobenius norm 35 Figure 2.14: Distribution of the relative error of e S s in the Frobenius norm The histograms show that e S s mainly produces a relative error in the Frobenius norm around 8, while S always produces a relative error in the Frobenius norm around 0.3. The average and maximum (over 200 repetitions) relative errors of S were 0:3083 and 0:5956 respectively, with the standard deviation of 0:0938. The corresponding values for e S s were 10:4196, 34:7976 and 6:0676. Note that the smallest error produced by e S was 0:2930, which is comparable to the error produced by S . However, the histograms show that the small error produced by e S s only occurs occasionally, while S was producing small errors consistently. Therefore, in the considered scenario, we can still conclude that estimator S performed better than the sample covariance e S s . In the meanwhile, the following histograms (Figure 2.15 and 2.16) show that S produces smaller relative errors in the operator norm as well. The average and maximum relative errors of S in the operator norm were 0:2648 and 0:5209 respectively, with the standard deviation of 0:0998. The corresponding values for e S s were 10:4703, 34:9707 and 6:0993. 36 Figure 2.15: Distribution of the relative error of S in the operator norm Figure 2.16: Distribution of the relative error of e S s in the operator norm 2.6 Conclusion and discussion In this chapter we proposed a double penalized least squares estimator in (2.3.7) aiming to robustly estimate the covariance matrix S in the presence of outliers. We showed in Theorem 2.3.1 and Corollary 2.3.1 that our estimator admits theoretically optimal error bound if the effective rank ofS is small. Moreover, we analyzed the performance of our estimator for the heavy-tailed data and concluded that the error bound can be improved in this scenario. Numerical experiments were presented, which demonstrated that our estimator performed well in different settings, and was computationally efficient. There are some questions that were not answered in this work and they are listed here as potential future research directions. First, the choices ofl 1 ,l 2 in our method relies on the variance parameters, which is usually not known in practice. Moreover, since we are mainly considering the high-dimensional setting, the estimation of s is non-trivial, making it complicated to choose the optimal tuning parameters. In our numerical experiments, we used cross-validation method to train the best possible values of tuning parameters. However, it is always meaningful to construct 37 a pivotal version of the estimator which does not rely on prior knowledge ofs. 
A possible idea to proceed is suggested in Chapter 3, and consists in replacing the term
$$\frac{1}{n(n-1)}\sum_{i\neq j}\left\|\widetilde{Y}_{i,j}\widetilde{Y}_{i,j}^T - \Sigma - \sqrt{n(n-1)}\,U_{i,j}\right\|_F^2$$
in (2.3.7) by its square root
$$\sqrt{\frac{1}{n(n-1)}\sum_{i\neq j}\left\|\widetilde{Y}_{i,j}\widetilde{Y}_{i,j}^T - \Sigma - \sqrt{n(n-1)}\,U_{i,j}\right\|_F^2},$$
and performing a similar analysis for this square-root type estimator. Moreover, since the modified problem is still convex, it retains the computational efficiency. It is worth noting that this idea is indeed a modification of the square-root Lasso introduced by Belloni et al. [4].

Second, in Section 2.4 we showed that the error bound for our estimator can be improved in the heavy-tailed setting. The improvement is mainly due to the special structure of the heavy-tailed data, without which one can only derive an error bound as in Theorem 2.3.1. Are we able to find other examples which allow us to obtain improved error bounds? This question was not answered in this work but constitutes an interesting topic for future research.

Chapter 3

Pivotal estimation in sparse linear regression under adversarial contamination

3.1 Introduction

One of the most important problems in statistics is the problem of linear regression, in which the goal is to estimate the regression coefficients $\beta^*$ in the model
$$y_i = X_i^T\beta^* + \xi_i, \quad i = 1,\dots,n, \qquad (3.1.1)$$
where $(X_1, y_1),\dots,(X_n, y_n) \in \mathbb{R}^p \times \mathbb{R}$ are pairs of predictor-response variables, $\beta^* \in \mathbb{R}^p$ is the unknown vector of regression coefficients to be estimated, and $\xi_1,\dots,\xi_n \in \mathbb{R}$ are independent and identically distributed (i.i.d.) noise variables. A standard method of solving such a problem is the least squares algorithm, which suggests the minimizer over $\beta$ of the squared loss function $\sum_{i=1}^n\left(y_i - X_i^T\beta\right)^2$ as an estimator for $\beta^*$.

In many recent statistical applications, the case where the number of variables $p$ is larger than the number of observations $n$ has become prominent. In such circumstances, it is clear that the traditional least squares method fails to recover $\beta^*$ consistently, as the Gram matrix $G_n := \frac{1}{n}\sum_{i=1}^n X_iX_i^T$ is singular. Accordingly, there have been various lines of work in modern statistics aiming to resolve this problem by imposing different types of conditions on the model, such as Tibshirani [51], Candes and Tao [10], Yuan and Lin [60], Meinshausen and Yu [36], Bickel et al. [7], Zhao et al. [62], Zhao and Yu [61]. The most popular model among them is the least absolute shrinkage and selection operator (the Lasso, see Tibshirani [51]), which assumes that $\beta^*$ is sparse, meaning that most entries of $\beta^*$ are zero. In this case, the squared loss can be replaced by its penalized version
$$\frac{1}{2n}\sum_{i=1}^n\left(y_i - X_i^T\beta\right)^2 + \lambda\|\beta\|_1,$$
where $\lambda > 0$ is a regularization parameter, and the minimizer over $\beta$ of this penalized loss function is called the Lasso estimator for $\beta^*$. The performance of the Lasso estimator has been extensively studied, and the following error bound, which holds under certain assumptions on the design matrix $X := [X_1,\dots,X_n]^T$, is well known (Bickel et al. [7]):
$$\left\|\widehat{\beta} - \beta^*\right\|_2 \le C\sigma\sqrt{\frac{s\log(p)}{n}}$$
with high probability, where $s$ represents the number of nonzero entries of $\beta^*$.

On the other hand, there are some limitations of the Lasso estimator. First, the optimal choice of $\lambda$ for the classical Lasso estimator depends on the standard deviation $\sigma$ of the noise, which is usually unknown. Unfortunately, when the dimension $p$ is large, in particular when $p \gg n$, estimation of $\sigma$ is non-trivial.
Belloni et al. [4] proposed a modification of the Lasso, called the square-root Lasso, which achieves the same error rate under appropriate assumptions on the design and the noise, and is pivotal in the sense that it does not require prior knowledge of $\sigma$. Since the publication of this paper, a growing body of subsequent works has appeared. For instance, Molstad [40] extended the square-root Lasso framework to the multivariate response linear regression setting and proposed fast algorithms to compute the estimator. Tian et al. [50] applied the square-root Lasso to perform inference for coefficients after variable selection and to provide estimates for the model-specific variance, which was shown to be significantly more applicable in situations where the noise level is unknown.

Second, if we can only observe an adversarially contaminated version of the response values $y_i$, namely, $y_i^{\mathrm{obs}} = y_i + \theta_i$ for $i \in \{1,\dots,n\}$, where a small fraction of the $\theta_i$ can take any (possibly random) values, the Lasso estimator is not robust in this scenario. Nguyen and Tran [43] suggest a remedy for this problem, which considers the modified model
$$y_i = X_i^T\beta^* + \sqrt{n}\,\theta_i + \xi_i, \quad i = 1,\dots,n, \qquad (3.1.2)$$
and the corresponding double penalized estimator
$$(\widehat{\beta}, \widehat{\theta}) = \operatorname*{argmin}_{\beta\in\mathbb{R}^p,\,\theta\in\mathbb{R}^n}\ \frac{1}{2n}\sum_{j=1}^n\left(y_j - X_j^T\beta - \sqrt{n}\,\theta_j\right)^2 + \lambda_s\|\beta\|_1 + \lambda_o\|\theta\|_1.$$
The paper shows that when the design matrix $X$ has i.i.d. Gaussian rows, $\|\widehat{\beta} - \beta^*\|_2$ achieves the bound
$$\left\|\widehat{\beta} - \beta^*\right\|_2 \le C\sigma\left(\sqrt{\frac{s\log(p)}{n}} + \sqrt{\frac{o}{n}}\right)$$
with high probability, where $o$ is the number of non-zero entries of the contamination vector $\theta = (\theta_1,\dots,\theta_n)^T$. Dalalyan and Thompson [15] and Sasai and Fujisawa [46] improved the error bound to
$$\left\|\widehat{\beta} - \beta^*\right\|_2 \le C\sigma\left(\sqrt{\frac{s\log(p)}{n}} + \frac{o}{n}\right), \qquad (3.1.3)$$
under the same assumptions on the design matrix $X$, which is minimax optimal as proven in Chen et al. [11]. Note that the optimal $\lambda$ in these works still depends on $\sigma$.

The present chapter continues this line of research. We consider the contaminated model (3.1.2) and propose a square-root type double penalized estimator. We will show that our estimator achieves the minimax-optimal error rate (3.1.3) under appropriate assumptions on the design matrix $X$, and requires no prior knowledge of the unknown parameters.

The rest of this chapter is organized as follows. Section 3.2.1 describes the model formulation and defines the proposed estimator. Section 3.2.2 introduces the main notation and key definitions. Section 3.2.3 presents the main results. Section 3.3 provides a concrete example of a design matrix $X$ which satisfies all required conditions in order to achieve the optimal error rate. Finally, Section 3.4 summarizes our work and discusses potential directions for future research. The proofs of the main results are contained in Appendix B.

3.2 Problem formulation and main results

3.2.1 Problem formulation

Assume that the observed pairs of predictor-response values $(X_1, y_1),\dots,(X_n, y_n) \in \mathbb{R}^p \times \mathbb{R}$ are independent and satisfy the model
$$y_i = X_i^T\beta^* + \sqrt{n}\,\theta_i^* + \xi_i, \quad i = 1,\dots,n,$$
where $\beta^* \in \mathbb{R}^p$ is the vector of true regression coefficients, with support $S := \{j : \beta^*_j \neq 0\}$; $\theta^* := (\theta^*_1,\dots,\theta^*_n)^T \in \mathbb{R}^n$ is an arbitrary contamination vector (also referred to as the "vector of outliers"), with support $O := \{j : \theta^*_j \neq 0\}$; the $\xi_i$'s are i.i.d. Gaussian variables with $\mathbb{E}[\xi_i] = 0$ and $\mathrm{Var}(\xi_i) = \sigma^2$; and $\xi_i$ is independent of $X_j$ for any $i, j = 1,\dots,n$.
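To make the data-generating process concrete, the following minimal simulation sketch draws a data set from the model above with an $s$-sparse $\beta^*$, an $o$-sparse contamination vector $\theta^*$ and Gaussian noise. The language (Python/NumPy) and all numerical values are illustrative assumptions and are not part of the dissertation; the Gaussian design anticipates the example treated in Section 3.3.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes: high-dimensional regime p > n, with s-sparse beta*
# and o-sparse contamination theta* (all values are placeholders).
n, p, s, o, sigma = 200, 500, 5, 10, 1.0

# Design matrix with i.i.d. standard Gaussian rows (cf. Section 3.3).
X = rng.standard_normal((n, p))

# s-sparse vector of true regression coefficients beta*.
beta_star = np.zeros(p)
beta_star[rng.choice(p, size=s, replace=False)] = rng.uniform(1.0, 2.0, size=s)

# o-sparse contamination vector theta*; the outliers enter as sqrt(n) * theta*.
theta_star = np.zeros(n)
theta_star[rng.choice(n, size=o, replace=False)] = rng.uniform(5.0, 10.0, size=o)

# Gaussian noise, independent of X.
xi = sigma * rng.standard_normal(n)

# Observed responses: y_i = X_i^T beta* + sqrt(n) * theta*_i + xi_i.
y = X @ beta_star + np.sqrt(n) * theta_star + xi
```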
Using matrix-vector notation, the model can be stated as
$$Y = X\beta^* + \sqrt{n}\,\theta^* + \xi, \qquad (3.2.1)$$
where $X = [X_1,\dots,X_n]^T$ is the $n \times p$ design matrix, $Y = (y_1,\dots,y_n)^T$ is the response vector and $\xi = (\xi_1,\dots,\xi_n)^T$ is the noise vector.

Our target is to estimate $\beta^*$ in the presence of outliers. We are mainly interested in the high-dimensional setting where $p > n$. We will assume that $\beta^*$ is $s$-sparse, namely, $\|\beta^*\|_0 := \mathrm{Card}(S) = s$ for some small $s \ll p$. Define
$$Q(\beta,\theta) := \frac{1}{2n}\sum_{j=1}^n\left(y_j - X_j^T\beta - \sqrt{n}\,\theta_j\right)^2 = \frac{1}{2n}\left\|Y - X\beta - \sqrt{n}\,\theta\right\|_2^2$$
and the loss function
$$L(\beta,\theta) = Q(\beta,\theta)^{\frac{1}{2}} + \lambda_s\|\beta\|_1 + \lambda_o\|\theta\|_1. \qquad (3.2.2)$$
We propose the estimator
$$(\widehat{\beta}, \widehat{\theta}) = \operatorname*{argmin}_{\beta\in\mathbb{R}^p,\,\theta\in\mathbb{R}^n} L(\beta,\theta). \qquad (3.2.3)$$

The estimator in (3.2.3) can be seen as a generalization of the square-root Lasso estimator (see Belloni et al. [4]), where the positive regularization parameters $\lambda_s$ and $\lambda_o$ encourage sparsity of the reconstructed vectors $\widehat{\beta}$ and $\widehat{\theta}$. The square root of the quadratic term $Q(\beta,\theta)$ is mainly designed to remove the dependence on the unknown $\sigma$, while retaining the convexity of the loss function $L(\beta,\theta)$. Note that the estimator (3.2.3) is equivalent to the minimizer over $\beta \in \mathbb{R}^p$, $\theta \in \mathbb{R}^n$, $\sigma > 0$ of the loss function
$$\widetilde{L}(\beta,\theta,\sigma) = \frac{Q(\beta,\theta)}{\sigma} + \sigma + \lambda_s\|\beta\|_1 + \lambda_o\|\theta\|_1, \qquad (3.2.4)$$
provided that the minimum is attained at a positive value of $\sigma$ (Van de Geer [55, Chapter 3]). Indeed, the term $Q(\beta,\theta)^{\frac{1}{2}}$ in (3.2.2) appears when one performs the minimization of $\widetilde{L}(\beta,\theta,\sigma)$ with respect to $\sigma > 0$ first. In the following subsections, we will show that the estimator in (3.2.3) achieves the optimal error bound under suitable design conditions.

3.2.2 Notation

In this subsection we summarize the notation employed throughout this chapter. We reserve $S$ and $O$ for the supports of $\beta^*$ and $\theta^*$, respectively, and denote $s = |S| := \mathrm{Card}(S)$ and $o = |O| := \mathrm{Card}(O)$. Given a vector $v \in \mathbb{R}^p$, we denote the $\ell_1$- and $\ell_2$-norms of $v$ by $\|v\|_1 = \sum_{i=1}^p |v_i|$ and $\|v\|_2 = \sqrt{\sum_{i=1}^p |v_i|^2}$ respectively. Given two vectors $u \in \mathbb{R}^p$ and $v \in \mathbb{R}^n$, we denote by $[u,v] \in \mathbb{R}^p\times\mathbb{R}^n$ the $(p+n)$-dimensional vector created by vertical concatenation of $u$ and $v$. Given $c_0 \ge 1$ and $T \subseteq \{1,\dots,p\}$, we will be interested in the cone $\mathcal{C}_T(c_0)$ defined as
$$\mathcal{C}_T(c_0) = \left\{u \in \mathbb{R}^p : \|u_{T^C}\|_1 \le c_0\|u_T\|_1\right\}, \qquad (3.2.5)$$
where $u_T$ is the vector in $\mathbb{R}^p$ that has the same coordinates as $u$ on $T$ and zero coordinates on the complement $T^C$ of $T$. This is a cone of vectors whose "dominating coordinates" are concentrated on a set indexed by $T$. Similarly, the corresponding cone in the augmented space $\mathbb{R}^p\times\mathbb{R}^n$ is defined as
$$\mathcal{C}_{T,Q}(c_0,\gamma) = \left\{(u,v) \in \mathbb{R}^p\times\mathbb{R}^n : \gamma\|u_{T^C}\|_1 + \|v_{Q^C}\|_1 \le c_0\left(\gamma\|u_T\|_1 + \|v_Q\|_1\right)\right\}, \qquad (3.2.6)$$
provided that $T \subseteq \{1,\dots,p\}$, $Q \subseteq \{1,\dots,n\}$, $c_0 \ge 1$ and $\gamma > 0$. Note that the vectors in the set $\mathcal{C}_T(c_0)$ or $\mathcal{C}_{T,Q}(c_0,\gamma)$ are nearly "sparse" in the sense that their values are concentrated on a subset of coordinates indexed by $T$. Finally, we denote $a \vee b := \max\{a,b\}$ and $a \wedge b := \min\{a,b\}$.

3.2.3 Main results

In this subsection, we present the main results, which provide an error bound for the estimator $\widehat{\beta}$ defined in (3.2.3). First, it is easily seen that the following conditions on the design matrix $X$ are necessary in order to achieve an optimal error bound:

The design matrix $X$ needs to satisfy some type of invertibility condition. Indeed, even in the simplest case when $o = 0$, i.e. there is no contamination, Belloni et al. [4] suggest that $X$ has to satisfy a "restricted invertibility" condition (such as the restricted isometry, restricted eigenvalue or compatibility condition; see, e.g.,
Van de Geer [54], Van De Geer and Bühlmann [56]) in order to guarantee the optimality of the estimator in (3.2.3).

The columns of the design matrix $X$ need to be "almost uncorrelated" with the columns of the identity matrix $I_n$. This will be called the incoherence condition throughout this chapter. To see this, consider the particular case where $n = p$ and $X = \sqrt{n}\,I_n$. In such a case, model (3.2.1) becomes
$$Y = \sqrt{n}\,(\beta^* + \theta^*) + \xi,$$
and the only identifiable vector is $\beta^* + \theta^*$, which makes it impossible to consistently estimate $\beta^*$ alone.

The following two definitions, introduced by Dalalyan and Thompson [15], formalize the control on the invertibility and incoherence of the design matrix $X$, and will be employed in the subsequent analysis.

Definition 3.2.1 (Definition 1(iii) in Dalalyan and Thompson [15]). We say that the design matrix $X$ satisfies the augmented transfer principle $\mathrm{ATP}(c_1,c_2,c_3)$ with some positive numbers $c_1, c_2$ and $c_3$ if, for all $[u,v] \in \mathbb{R}^{p+n}$,
$$\frac{1}{\sqrt{n}}\left\|Xu + \sqrt{n}\,v\right\|_2 \ge c_1\|[u,v]\|_2 - c_2\|u\|_1 - c_3\|v\|_1.$$

Definition 3.2.2 (Definition 1(ii) in Dalalyan and Thompson [15]). We say that the design matrix $X$ satisfies an incoherence property $\mathrm{IP}(b_1,b_2,b_3)$ with respect to the identity matrix $I_n$ if, for all $[u,v] \in \mathbb{R}^{p+n}$,
$$\frac{1}{\sqrt{n}}\left|v^T X u\right| \le b_1\|v\|_2\|u\|_2 + b_2\|u\|_1\|v\|_2 + b_3\|u\|_2\|v\|_1.$$

Remark 3.2.1. (i) Definition 3.2.1 controls the invertibility of the augmented matrix $\frac{1}{\sqrt{n}}[X, \sqrt{n}\,I_n]$ on certain subsets of interest. It is clear that any matrix $X$ satisfies $\mathrm{ATP}(c_1,c_2,c_3)$ with a sufficiently small value of $c_1$ and large values of $c_2$ and $c_3$. However, only matrices satisfying $\mathrm{ATP}(c_1,c_2,c_3)$ with reasonably small values of $c_2$ and $c_3$ are interesting in our case, as they are invertible on a certain subset of vectors. On the other hand, Lemma B.2 shows that $\mathrm{ATP}(c_1,c_2,c_3)$ induces the restricted eigenvalue condition for $\frac{1}{\sqrt{n}}[X, \sqrt{n}\,I_n]$, which is a well-known condition for the success of the Lasso and square-root Lasso estimators. Finally, it is worth noticing that Definition 3.2.1 is valid for all $[u,v] \in \mathbb{R}^{p+n}$, which allows us to deduce the following property for $X$ alone:
$$\frac{1}{\sqrt{n}}\|Xu\|_2 \ge c_1\|u\|_2 - c_2\|u\|_1 \quad \text{for all } u \in \mathbb{R}^p.$$
(ii) Definition 3.2.2 controls the incoherence of the design matrix $X$. Here the term "incoherence" means that the columns of $\frac{1}{\sqrt{n}}X$ do not align with the columns of $I_n$, and one should distinguish it from the traditional incoherence concept (see, for example, Foucart and Rauhut [20, Chapter 5]). It is clear that any matrix $X$ satisfies $\mathrm{IP}(b_1,b_2,b_3)$ with sufficiently large constants, but again only matrices satisfying $\mathrm{IP}(b_1,b_2,b_3)$ with reasonably small constants are interesting for our analysis. Moreover, we note that if $X$ satisfies $\mathrm{IP}(b_1,b_2,b_3)$, then it also satisfies $\mathrm{IP}(0,b_2,b_1+b_3)$. This follows directly from the fact that $\|v\|_2 \le \|v\|_1$ for any $v \in \mathbb{R}^n$.

Now we are ready to present the following theorem, which provides a bound for $\|\widehat{\beta} - \beta^*\|_2$ under "favorable" conditions.

Theorem 3.2.1. Let $c_1, c_2, c_3, b_2, b_3$ and $\lambda_s, \lambda_o$ be positive constants such that
$$\lambda_s^2 s + \lambda_o^2 o \le \frac{1}{2}\min\left\{\frac{c_1\delta}{8\sqrt{2}},\ \frac{c_1}{8\left(\frac{c_2}{\lambda_s}\vee\frac{c_3}{\lambda_o}\right)},\ \frac{(1-\delta)c_1^2\lambda_s}{52\sqrt{2}\,b_2}\right\}$$
for some absolute constant $\delta \le \frac{4}{43}$. Assume that the design matrix $X \in \mathbb{R}^{n\times p}$ satisfies $\mathrm{ATP}(c_1,c_2,c_3)$ and $\mathrm{IP}(0,b_2,b_3)$ as defined in Definitions 3.2.1 and 3.2.2. Then on the event
$$\mathcal{E} = \left\{\lambda_s\|\xi\|_2 \ge \sqrt{\frac{2}{n}}\left\|X^T\xi\right\|_\infty,\ \ \lambda_o\|\xi\|_2 \ge \sqrt{2}\,\|\xi\|_\infty\right\},$$
the following bound holds:
$$\left\|\widehat{\beta} - \beta^*\right\|_2 \le \left[\frac{128\,\lambda_o}{(1-\delta)}\left(\frac{c_2}{\gamma c_1}\vee\frac{b_3}{c_1^2}\right)\frac{\gamma^2 s + o}{c_1^2} + \frac{45\,\lambda_s\sqrt{s}}{13\sqrt{2}\,c_1^2}\right]\frac{\|\xi\|_2}{\sqrt{n}}, \qquad (3.2.7)$$
where $\gamma := \frac{\lambda_s}{\lambda_o}$.
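Since the objective in (3.2.2) is jointly convex in $(\beta,\theta)$ (a norm of an affine map plus $\ell_1$ penalties), the estimator (3.2.3) can be computed with off-the-shelf convex optimization software. The sketch below uses the CVXPY modeling package, which is an external tool not referenced in this dissertation; the function name and the direct encoding of the objective are our own illustrative choices, not the author's implementation.

```python
import cvxpy as cp
import numpy as np

def sqrt_lasso_with_outliers(X, y, lam_s, lam_o):
    """Solve (3.2.3): minimize sqrt(Q(beta, theta)) + lam_s*||beta||_1 + lam_o*||theta||_1,
    where Q(beta, theta) = ||y - X beta - sqrt(n) theta||_2^2 / (2 n)."""
    n, p = X.shape
    beta = cp.Variable(p)
    theta = cp.Variable(n)
    residual = y - X @ beta - np.sqrt(n) * theta
    # sqrt(Q) = ||residual||_2 / sqrt(2 n), which keeps the problem convex.
    objective = cp.norm(residual, 2) / np.sqrt(2 * n) \
        + lam_s * cp.norm(beta, 1) + lam_o * cp.norm(theta, 1)
    problem = cp.Problem(cp.Minimize(objective))
    problem.solve()
    return beta.value, theta.value
```

Because $\sqrt{Q(\beta,\theta)} = \|Y - X\beta - \sqrt{n}\,\theta\|_2 / \sqrt{2n}$, the resulting problem is a second-order cone program, in line with the conic-programming formulation of the square-root Lasso in Belloni et al. [4].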
The proof of Theorem 3.2.1 is presented in Appendix B.1.1.

Remark 3.2.2. The bound in (3.2.7) mainly depends on two quantities:
(i) the order of the constants $c_1, c_2, c_3, b_2$ and $b_3$, which are related to the design matrix $X$ only;
(ii) the distribution of the noise $\xi$, which affects both the $\frac{\|\xi\|_2}{\sqrt{n}}$ term in (3.2.7) and the smallest choices of $\lambda_s$ and $\lambda_o$ (the smaller, the better).
In the next proposition, we will show that when $\xi$ has a Gaussian distribution, the term $\frac{\|\xi\|_2}{\sqrt{n}}$ is of constant order, while the smallest choices of $\lambda_s$ and $\lambda_o$ are both of order $n^{-1/2}$. Consequently, the bound in (3.2.7) will be of order $O\left(\sqrt{\frac{s}{n}} + \frac{o}{n}\right)$ up to logarithmic factors if we assume appropriate orders for the constants $c_1, c_2, c_3, b_2$ and $b_3$, which achieves the theoretically optimal bound specified in (3.1.3) up to logarithmic factors.

Proposition 3.2.1. Assume that $\xi \sim \mathcal{N}(0, \sigma^2 I_n)$. Let $X \in \mathbb{R}^{n\times p}$ be a matrix with
$$\rho := \max_{j=1,\dots,p}\frac{1}{\sqrt{n}}\|X_j\|_2, \qquad (3.2.8)$$
where $X_j$ represents the $j$-th column of $X$. Then for $L_1, L_2 > 0$, with probability at least $1 - e^{-L_1} - e^{-L_2} - (1+e^2)e^{-n/24}$, we have the following results (Giraud [23, p.112]):
(i) $\frac{1}{\sqrt{n}}\left\|X^T\xi\right\|_\infty \le \sqrt{2}\,\sigma\rho\sqrt{\log(p) + L_1}$;
(ii) $\|\xi\|_\infty \le \sqrt{2}\,\sigma\sqrt{\log(n) + L_2}$;
(iii) $\frac{\|\xi\|_2}{\sqrt{n}} \ge \left(1 - \frac{1}{\sqrt{2}}\right)\sigma$.

The proofs of (i) and (iii) in Proposition 3.2.1 can be found in Giraud [23, p.112], and the proof of (ii) is presented in Appendix B.1.2.

Remark 3.2.3. A natural choice of $L_1, L_2$ would be $L_1 = A\log(p)$ and $L_2 = B\log(n)$ for some absolute constants $A > 0$ and $B > 0$, respectively. With this choice, we see that the smallest choices of $\lambda_s$ and $\lambda_o$ are
$$\lambda_s = 4\rho\sqrt{\frac{A\log(p)}{n}} \qquad (3.2.9)$$
and
$$\lambda_o = 4\sqrt{\frac{B\log(n)}{n}} \qquad (3.2.10)$$
respectively. It is worth noticing that both of them are independent of the unknown standard deviation $\sigma$.

By choosing the smallest possible $\lambda_s$ and $\lambda_o$ as indicated in (3.2.9)-(3.2.10), and assuming that the constants in ATP and IP are of the appropriate order, we conclude that the estimator $\widehat{\beta}$ defined in (3.2.3) admits the following error bound with high probability.

Corollary 3.2.1. Assume that $\xi \sim \mathcal{N}(0,\sigma^2 I_n)$ and that the design matrix $X \in \mathbb{R}^{n\times p}$ satisfies $\mathrm{ATP}(c_1,c_2,c_3)$ and $\mathrm{IP}(0,b_2,b_3)$ with $c_1$ of order $1$ and $c_2, c_3, b_2, b_3$ of order $1/\sqrt{n}$. Denote $\rho := \max_{j=1,\dots,p}\frac{1}{\sqrt{n}}\|X_j\|_2$ and let $\lambda_s, \lambda_o$ be chosen as in (3.2.9) and (3.2.10) with constants $A > 0$ and $B > 0$ respectively. Furthermore, assume that the sample size $n$ is large enough so that $\frac{s+o}{n} \le d_0$ for some sufficiently small constant $d_0$.¹ Then with probability at least $1 - \frac{1}{p^A} - \frac{1}{n^B} - (1+e^2)e^{-n/24}$, we have the following bound:
$$\left\|\widehat{\beta} - \beta^*\right\|_2 \le C\sigma\left(\frac{s+o}{n} + \sqrt{\frac{s}{n}}\right)\sqrt{\log(pn)}, \qquad (3.2.11)$$
where $C > 0$ is an absolute constant depending only on $\rho$, $A$ and $B$.

¹The explicit expression is $d_0 = \frac{1}{4\sqrt{\rho^2 A\log(p) + B\log(n)}}\min\left\{\frac{c_1\delta}{8\sqrt{2}},\ \frac{c_1\rho\sqrt{AB\log(pn)}}{2(c_2\vee c_3)\sqrt{n}},\ \frac{(1-\delta)c_1^2\rho\sqrt{A\log(p)}}{13\sqrt{2}\,b_2\sqrt{n}}\right\}$, with $\delta \le 4/43$.

Remark 3.2.4. The order of the bound (3.2.11) is theoretically optimal (up to logarithmic factors), as proven in Dalalyan and Thompson [15], with the choices of $\lambda_s$ and $\lambda_o$ independent of the unknown variance parameter $\sigma$. The optimal bound is achieved under the assumption that the design matrix $X$ satisfies $\mathrm{ATP}(c_1,c_2,c_3)$ and $\mathrm{IP}(0,b_2,b_3)$ with appropriate constants $c_j$ and $b_j$. In general, the estimation error depends on the constants $c_j$, $j = 1,2,3$, and $b_j$, $j = 2,3$, as follows:
$$\left\|\widehat{\beta} - \beta^*\right\|_2 \le C\sigma\sqrt{\log(pn)}\,\min\left\{\left(\frac{c_2}{c_1^3}\vee\frac{b_3}{c_1^4}\right)\frac{s+o}{\sqrt{n}} + \frac{1}{c_1^2}\sqrt{\frac{s}{n}},\ \sqrt{\frac{s+o}{n}}\right\},$$
which can be of order $O\left(\sqrt{\frac{s+o}{n}}\right)$ or even worse.
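The choices (3.2.9)-(3.2.10) depend only on the design matrix and the user-chosen constants $A$ and $B$, so they can be computed directly from the data without any estimate of $\sigma$. A minimal helper illustrating this (Python/NumPy; the function name and default constants are our own and purely for illustration):

```python
import numpy as np

def pivotal_tuning(X, A=1.0, B=1.0):
    """Pivotal choices (3.2.9)-(3.2.10): lam_s = 4*rho*sqrt(A*log(p)/n) and
    lam_o = 4*sqrt(B*log(n)/n), where rho = max_j ||X_j||_2 / sqrt(n) is the
    largest normalized column norm of X. No knowledge of sigma is required."""
    n, p = X.shape
    rho = np.max(np.linalg.norm(X, axis=0)) / np.sqrt(n)
    lam_s = 4.0 * rho * np.sqrt(A * np.log(p) / n)
    lam_o = 4.0 * np.sqrt(B * np.log(n) / n)
    return lam_s, lam_o
```

Combined with the solver sketch after Theorem 3.2.1, a full pipeline would be `lam_s, lam_o = pivotal_tuning(X)` followed by `sqrt_lasso_with_outliers(X, y, lam_s, lam_o)`.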
In the next section, we will present a specific type of the design matrix X which satisfies the required conditions with high probability and optimal optimal values of the constants c j and b j , hence yielding the optimal error bound. 3.3 The case of Gaussian design As discussed in Remark 3.2.4, the conditions on the design matrix X need to be verified in order to achieve the error bound in (3.2.11). As a concrete example, we consider the design matrix X whose rows are independently drawn from a Gaussian distribution with mean 0 and covariance matrix I p . The following proposition shows that such a design matrix satisfies the conditions in Corollary 3.2.1 with high probability: Proposition 3.3.1. (Dalalyan and Thompson [15, Theorem 2]) Assume that the rows of the de- sign matrix X are i.i.d. Gaussian vectors with mean 0 and covariance matrix I p . Then for any d2(0;1=7), with probability at least 1 2d, X satisfies AT P(c 1 ;c 2 ;c 3 ) and IP(b 1 ;b 2 ;b 3 ) with constants c 1 = 3 4 17:5+ 9:6 p 2log(2=d) n ; c 2 = 3:6 r 2log(p) n ; c 3 = 2:4 r 2log(n) n 49 and b 1 = 4:8 p 2+ p 2log(81=d) p n ; b 2 = 1:2 r 2log p n ; b 3 = 1:2 r 2logn n : The following theorem is a direct consequence of Proposition 3.3.1 and Corollary 3.2.1: Theorem 3.3.1. Assume that the rows of the design matrix X are i.i.d. normal vectors with mean 0 and covariance matrix I, and xN (0;s 2 I n ). Let l s , l o be chosen as in (3.2.9) and (3.2.10) respectively, with constants A> 0 and B> 0. Furthermore, we assume that the sample size n is large enough such that s+ o n d 0 for a sufficiently small constant d 0 . Then with probability at least 1 1 p A 3 n B (1+e 2 )e n=24 , the following inequality holds: b bb 2 Cs (s+ o) n + r s n p log(pn); (3.3.1) where C is an absolute constant depending only on A and B. Remark 3.3.1. Thompson [49] shows that the ATP and IP conditions hold for the sub-Gaussian design matrices as well, with constants of the same order as in the Gaussian case. Therefore, our main results remain true for sub-Gaussian design as well. 3.4 Conclusion and future work In this chapter we proposed a double penalized estimator for regression coefficients, which was shown to achieve the theoretically optimal error bound under certain assumptions on the design matrix X. Importantly, the choices of tuning parametersl s ,l o are independent of the variance of the noise vector, which is unknown in practice. Moreover, since the loss function (3.2.2) retains global convexity, our estimator is still computationally attractive. On the other hand, there are some limitations of our model, which were not addressed in this work. We list them in this section as possible avenues for future research. 50 (i) Heavier-tailed noise distributions. For simplicity, in Theorem 3.2.1 we assumed that the noise vector x follows a Gaussian distribution with mean 0 and covariance matrix s 2 I n . The Gaussian noise assumption can be easily extended to the sub-Gaussian noise, see for example, Bellec et al. [3, Section 9]. Indeed, by Theorem 3.2.1, it is clear that we only need to control the tail probabilities of 1 p n X T x ¥ ,kxk ¥ andkxk 2 , which can be done using standard tools under the sub-Gaussian assumption. We believe that the assumption on x can be further relaxed to heavier tailed distributions by employing the decomposition idea as discussed in Section 2.4.2. 
To be precise, we can always set x noise i :=x i 1fjx i j Rg and x out i :=x i 1fjx i j 2 Rg for i= 1;:::;n with the truncation level R> 0, and decomposex i as x i =x out i +x noise i . Note that x noise i is a sub-Gaussian random variable. On the other hand, by choosing an appropriate truncation level,x out i is guaranteed to take the value 0 with high probability, which can be merged withq i and considered as outliers. Consequently, the model with a heavy-tailed noise term is transformed into the standard model in (3.2.1), where we can hopefully deduce similar estimation results. (ii) Conditions on the design matrix. The result in Theorem 3.2.1 relies on the assumption that the design matrix X satisfies ATP(c 1 ;c 2 ;c 3 ) and IP(0;b 2 ;b 3 ) conditions, see Definition 3.2.1 and 3.2.2. In Section 3.3 we mentioned that the sub-Gaussian design matrices satisfy the desired assumptions. Are there any other classes of design matrices satisfying these assump- tions? In particular, given a fixed design matrix X, how can we validate these assumptions? These questions were not answered in our work and are left for future research. (iii) Numerical experiments. As we mentioned in Section 3.2.1, the problem of finding the square-root Lasso estimator is a convex problem, and hence we can apply standard numerical methods to compute the estimator efficiently. In particular, if we setl s =l o =l and denote A :=[X; p nI n ], w :=[b;q], we can equivalently express (3.2.3) as b w= argmin w2R p R n 1 p 2n kY Awk 2 +lkwk 1 ; 51 which can be viewed as a standard square-root Lasso estimator. Belloni et al. [4] suggest that such a problem is equivalent to a conic programming problem with second-order constraints, which admits computationally efficient algorithms. We will perform numerical experiments and compare the performance of our estimator to other robust estimators in the future re- search. 52 Bibliography [1] Alexei B Aleksandrov and Vladimir V Peller. Operator Lipschitz functions. Russian Mathe- matical Surveys, 71(4):605, 2016. [2] Amir Beck. First-order methods in optimization. SIAM, 2017. [3] Pierre C Bellec, Guillaume Lecu´ e, and Alexandre B Tsybakov. Slope meets Lasso: improved oracle bounds and optimality. Annals of Statistics, 46(6B):3603–3642, 2018. [4] Alexandre Belloni, Victor Chernozhukov, and Lie Wang. Square-root Lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011. [5] Thorsten Bernholt. Robust estimators are hard to compute. Technical report, Technical report, 2006. [6] Rajendra Bhatia. Matrix analysis, volume 169. Springer Science & Business Media, 2013. [7] Peter J Bickel, Ya’acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of statistics, 37(4):1705–1732, 2009. [8] James R Bunch, Christopher P Nielsen, and Danny C Sorensen. Rank-one modification of the symmetric eigenproblem. Numerische Mathematik, 31(1):31–48, 1978. [9] T Tony Cai, Cun-Hui Zhang, and Harrison H Zhou. Optimal rates of convergence for covari- ance matrix estimation. The Annals of Statistics, 38(4):2118–2144, 2010. [10] Emmanuel Candes and Terence Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Annals of statistics, 35(6):2313–2351, 2007. [11] Mengjie Chen, Chao Gao, and Zhao Ren. A general decision theory for Huber’s e- contamination model. Electronic Journal of Statistics, 10(2):3752–3774, 2016. [12] Mengjie Chen, Chao Gao, and Zhao Ren. 
Robust covariance and scatter matrix estimation under Huber’s contamination model. Annals of Statistics, 46(5):1932–1960, 2018. [13] Yu Cheng, Ilias Diakonikolas, Rong Ge, and David P Woodruff. Faster algorithms for high- dimensional robust covariance estimation. In Conference on Learning Theory, pages 727– 757. PMLR, 2019. [14] Patrick L Combettes and Val´ erie R Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4(4):1168–1200, 2005. 53 [15] Arnak S Dalalyan and Philip Thompson. Outlier-robust estimation of a sparse linear model using` 1 -penalized Huber’s M-estimator. arXiv preprint arXiv:1904.06288, 2019. [16] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(1), 2012. [17] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In International Conference on Machine Learning, pages 999–1008. PMLR, 2017. [18] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stew- art. Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742–864, 2019. [19] David Donoho and Andrea Montanari. High dimensional robust m-estimation: Asymptotic variance via approximate message passing. Probability Theory and Related Fields, 166(3): 935–969, 2016. [20] Simon Foucart and Holger Rauhut. A mathematical introduction to compressive sensing. Bull. Am. Math, 54(2017):151–165, 2017. [21] Ratnik Gandhi and Amoli Rajgor. Updating singular value decomposition for rank one matrix perturbation. arXiv preprint arXiv:1707.08369, 2017. [22] Kevin Gimpel, Dipanjan Das, and Noah A Smith. Distributed asynchronous online learning for natural language processing. In Proceedings of the Fourteenth Conference on Computa- tional Natural Language Learning, pages 213–222, 2010. [23] Christophe Giraud. Introduction to high-dimensional statistics. Monographs on Statistics and Applied Probability, 139:139. [24] Gene H Golub. Some modified matrix eigenvalue problems. Siam Review, 15(2):318–334, 1973. [25] Jarvis Haupt, Waheed U Bajwa, Michael Rabbat, and Robert Nowak. Compressed sensing for networked data. IEEE Signal Processing Magazine, 25(2):92–101, 2008. [26] Wassily Hoeffding. A class of statistics with asymptotically normal distribution. In Break- throughs in statistics, pages 308–334. Springer, 1992. [27] Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in statistics, pages 492–518. Springer, 1992. [28] Peter J Huber. Robust statistics (pp. 1248-1251). Springer Berlin Heidelberg. Rehabilitation Psychology, 46:382–399, 2011. [29] Sarit Khirirat, Hamid Reza Feyzmahdavian, and Mikael Johansson. Mini-batch gradient descent: Faster convergence under data sparsity. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 2880–2887. IEEE, 2017. 54 [30] N. Kishore Kumar and Jan Schneider. Literature survey on low rank approximation of matri- ces. Linear and Multilinear Algebra, 65(11):2212–2244, 2017. [31] Vladimir Koltchinskii and Karim Lounici. Concentration inequalities and moment bounds for sample covariance operators. Bernoulli, 23(1):110–133, 2017. [32] Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covari- ance. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665–674. IEEE, 2016. 
[33] Karim Lounici. High-dimensional covariance matrix estimation with missing observations. Bernoulli, 20(3):1029–1058, 2014. [34] Ricardo A Maronna, R Douglas Martin, Victor J Yohai, and Mat´ ıas Salibi´ an-Barrera. Robust statistics: theory and methods (with R). John Wiley & Sons, 2019. [35] Nicolai Meinshausen and Peter B¨ uhlmann. High-dimensional graphs and variable selection with the Lasso. Annals of statistics, 34(3):1436–1462, 2006. [36] Nicolai Meinshausen and Bin Yu. Lasso-type recovery of sparse representations for high- dimensional data. The annals of statistics, 37(1):246–270, 2009. [37] Shahar Mendelson and Nikita Zhivotovskiy. Robust covariance estimation under L 4 -L 2 norm equivalence. Annals of Statistics, 48(3):1648–1664, 2020. [38] Stanislav Minsker. On some extensions of Bernstein’s inequality for self-adjoint operators. Statistics & Probability Letters, 127:111–119, 2017. [39] Stanislav Minsker and Xiaohan Wei. Robust modifications of U-statistics and applications to covariance estimation problems. Bernoulli, 26(1):694–727, 2020. [40] Aaron J Molstad. Insights and algorithms for the multivariate square-root Lasso. arXiv e- prints, pages arXiv–1909, 2019. [41] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003. [42] Yurii E Nesterov. A method for solving the convex programming problem with convergence rate O (1/kˆ 2). In Dokl. akad. nauk Sssr, volume 269, pages 543–547, 1983. [43] Nam H Nguyen and Trac D Tran. Robust lasso with missing and grossly corrupted observa- tions. IEEE transactions on information theory, 59(4):2036–2058, 2012. [44] Atsushi Nitanda. Stochastic proximal gradient descent with acceleration techniques. Ad- vances in Neural Information Processing Systems, 27:1574–1582, 2014. [45] Adarsh Prasad, Sivaraman Balakrishnan, and Pradeep Ravikumar. A unified approach to robust mean estimation. arXiv preprint arXiv:1907.00927, 2019. 55 [46] Takeyuki Sasai and Hironori Fujisawa. Robust estimation with Lasso when outputs are ad- versarially contaminated. arXiv preprint arXiv:2004.05990, 2020. [47] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for svm. Mathematical programming, 127(1):3–30, 2011. [48] Peter Stange. On the efficient update of the singular value decomposition. In PAMM: Pro- ceedings in Applied Mathematics and Mechanics, volume 8, pages 10827–10828. Wiley On- line Library, 2008. [49] Philip Thompson. Outlier-robust sparse/low-rank least-squares regression and robust matrix completion. arXiv preprint arXiv:2012.06750, 2020. [50] Xiaoying Tian, Joshua R Loftus, and Jonathan E Taylor. Selective inference with unknown variance via the square-root Lasso. Biometrika, 105(4):755–768, 2018. [51] Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996. [52] Paul Tseng. On accelerated proximal gradient methods for convex-concave optimization. submitted to SIAM Journal on Optimization, 1, 2008. [53] John W Tukey. A survey of sampling from contaminated distributions. Contributions to probability and statistics, pages 448–485, 1960. [54] Sara Van de Geer. The deterministic Lasso. Seminar f¨ ur Statistik, Eidgen¨ ossische Technische Hochschule (ETH) Z¨ urich, 2007. [55] Sara A Van de Geer. Estimation and testing under sparsity. Springer, 2016. [56] Sara A Van De Geer and Peter B¨ uhlmann. 
On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3:1360–1392, 2009. [57] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. [58] G Alistair Watson. Characterization of the subdifferential of some matrix norms. Linear algebra and its applications, 170:33–45, 1992. [59] John Wright, Allen Y Yang, Arvind Ganesh, S Shankar Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE transactions on pattern analysis and machine intelligence, 31(2):210–227, 2008. [60] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006. [61] Peng Zhao and Bin Yu. On model selection consistency of Lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006. 56 [62] Peng Zhao, Guilherme Rocha, and Bin Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497, 2009. 57 Appendices A Appendix to Chapter 2 A.1 Proofs ommited in Section 2.3.2 In this section, we present the proofs that were omitted from the main exposition in Section 2.3.2. We start by introducing some technical tools that will be useful for the proof. A.1.1 Technical tools First, we have the following useful trace duality inequalities for the Frobenius inner product. Proposition A.1. For any A;B2R d 1 d 2 , jhA;BijkAk F kBk F ; jhA;BijkAk 1 kBk: Next, let L be a linear subspace ofR d and L ? be its orthogonal complement, namely, L ? = v2R d :hv;ui= 0;8u2 L . In what follows, P L will stand for the orthogonal projection onto L, meaning that P L 2R dd is such that P 2 L = P L = P T L and Im(P L ) L, where Im(P L ) represents the image of P L . Given the spectral decomposition of a real symmetric matrix, we have the following proposition: 58 Proposition A.2. Let S2 S d (R) be a real symmetric matrix with spectral decomposition S = å d j=1 l j u j u T j , where the eigenvalues satisfyjl 1 jjl d j 0. Denote L= Im(S)= span u j :l j 6= 0 . Then P L =å j:l j 6=0 u j u T j and P L ? =å j:l j =0 u j u T j . Moreover, we will be interested in a linear operatorP L :R dd 7!R dd defined as P L (A) := A P L ?AP L ?: (A.1) The following lemma provides some results onP L () that will be useful in our proof. Lemma A.1. Let L be a linear subspace of R d and A2 S d (R) be an arbitrary real symmetric matrix, then (i)kP L (A)kkAk. (ii) rank(P L (A)) 2dim(L). Proof. The proof of this lemma follows straightforward from the definition ofP L and hence omit- ted here. The following proposition characterizes the subdifferential of the convex function A7!kAk 1 . Proposition A.3 (Watson [58]). Let A2 S d (R) be a symmetric matrix and A=å rank(A) j=1 s j u j v T j be the singular value decomposition defined in Proposition 2.2.1. Denote L= spanfu 1 ;:::;u r g, then ¶kAk 1 = ( å j:s j >0 u j v T j + P L ?WP L ? :kWk 1 ) ; where P L ? represents the orthogonal projection onto L ? . Next, we state some results for the best rank-k approximation. We say that the functionjjjjjj : R d 1 d 2 7!R is a matrix norm if for any scalara2R and any matrices A;B2R d 1 d 2 , the following properties are satisfied: jjjaAjjj=jajjjjAjjj; 59 jjjA+ BjjjjjjAjjj+jjjBjjj; jjjAjjj 0, andjjjAjjj= 0 if and only if A= 0 d 1 d 2 . 
The operator normkk, the Frobenius normkk F and the nuclear normkk 1 introduced in Definition 2.2.1 are concrete examples of matrix norms. Given a nonnegative definite matrixS, we say that S(k) is the best rank-k approximation ofS with respect to the matrix normjjjjjj, if S(k)= argmin S:rank(S)k jjjSSjjj: The following theorem characterized the best rank-k approximation. Theorem A.1 (Kishore Kumar and Schneider [30]). LetS2R dd be a nonnegative definite matrix with spectral decomposition S=å d j=1 l j u j u T j , where the eigenvalues satisfy l 1 l d 0. Then the matrix A :=å k j=1 l j u j u T j is the best rank-k approximation ofS in both Frobenius norm and operator norm. Consequently, we have that min S:rank(S)k kSSk=l k+1 and min S:rank(S)k kSSk F = v u u t d å j=k+1 l 2 j : The following two corollaries will be used in our proof. Corollary A.1. Let S(k) be the best rank-k approximation of S in the operator norm. Then kS(k)SkkSk rk(S) k ^ q rk(S) k andkS(k)Sk F tr(S) 2 k . Proof. Letl j (A) be the j-th largest eigenvalue of a nonnegative definite matrix, then by Theorem A.1,kS(k)Sk=l k+1 (S). Moreover, we have that l k+1 (S) å k+1 j=1 l j (S) k+ 1 tr(S) k+ 1 tr(S) k =kSk rk(S) k : (A.2) 60 Note thatl k+1 (S) p kSk p l k+1 (S). Combining this with the previous display, we get another inequality l k+1 (S) r kSktr(S) k =kSk r rk(S) k : So we have thatkS(k)SkkSk rk(S) k ^ q rk(S) k . To obtain the bound in Frobenius norm, we note that kS(k)Sk 2 F = å jk+1 l j (S) 2 (tr(S)) 2 å jk+1 j 2 tr(S) 2 k ; where the first inequality follows from (A.2) and the second inequality follows fromå jk+1 j 2 = å jk+1 1 j( j+1) = 1 k+1 : Remark A.1. It is easy to see that rk(S) k q rk(S) k if and only if rk(S) k, so when S has low effective rank, i.e. rk(S) k, the upper bound becomes kS(k)Sk kSkrk(S) k : Corollary A.2. Let S(k) be the best rank-k approximation of S in the operator norm defined in Theorem A.1, and L(k) := Im(S(k)). ThenP L(k) (S)=S(k). Proof. By Theorem A.1,S(k) has spectral decompositionS(k)=å d j=1 l j u j u T j withl k+1 == l d = 0. Then Proposition A.2 implies that P L(k) ? =å d j=k+1 u j u T j . Therefore, P L(k) (S)=S P L(k) ?SP L(k) ? = d å j=1 l j u j u T j d å j=k+1 l j u j u T j = k å j=1 l j u j u T j =S(k): A.1.2 Proof of Theorem 2.3.1 To simplify the notations, we denote N := n(n 1). Let F(S;U U U I 2 n )= 1 N å i6= j e Y i; j e Y T i; j S p NU i; j 2 F +l 1 kSk 1 +l 2å i6= j U i; j 1 : (A.3) 61 The function F is convex, and we have that ¶F(S;U U U I 2 n )= 0 B B B B B B B @ 2 N å i6= j ( e Y i; j e Y T i; j S p NU i; j ) 2 N p N( e Y 1;2 e Y T 1;2 S p NU 1;2 ) . . . 2 N p N( e Y n;n1 e Y T n;n1 S p NU n;n1 ) 1 C C C C C C C A + 0 B B B B B B B @ l 1 ¶kSk 1 l 2 ¶ U 1;2 1 . . . l 2 ¶ U n;n1 1 1 C C C C C C C A (A.4) where¶kAk 1 represents the subdifferential ofkk 1 at A. Note that for any symmetric matrices S;U 1;2 ;:::;U n;n1 , the directional derivative of F at the point b S l ; b U 1;2 ;:::; b U n;n1 in the direction S b S l ;U 1;2 b U 1;2 ;:::;U n;n1 b U n;n1 is nonnegative. In particular, we consider an arbitrary S and U 1;2 := e U 1;2 ;:::;U n;n1 := e U n;n1 . 
By the necessary condition of the minima, there exist b V2¶k b S l k 1 ; b W 1;2 2¶k b U 1;2 k 1 ;:::; b W n;n1 2¶k b U n;n1 k 1 such that D ¶F( b S; b U U U I 2 n );(S b S l ; e U U U I 2 n b U U U I 2 n ) E = 2 N å i6= j D e Y i; j e Y T i; j b S l p N b U i; j ;S b S l + p N( e U i; j b U i; j ) E +l 1 D b V;S b S l E +l 2å i6= j D b W i; j ; e U i; j b U i; j E 0: For any choice of V2¶kSk 1 ; W 1;2 2¶ e U 1;2 1 ;:::;W n;n1 2¶ e U n;n1 1 , by the monotonicity of subgradients we deduce that D V b V;S b S l E 0; D W i; j b W i; j ; e U i; j b U i; j E 0; (i; j)2 I 2 n : Hence the previous display implies that 2 N å i6= j D e Y i; j e Y T i; j b S l p N b U i; j ;S b S l + p N( e U i; j b U i; j ) E l 1 D V;S b S l E +l 2å i6= j D W i; j ; e U i; j b U i; j E ; 62 which is equivalent to 2 N å i6= j D S b S l + p N( e U i; j b U i; j );S b S l + p N( e U i; j b U i; j ) E l 1 D V; b S l S E l 2å i6= j D W i; j ; b U i; j e U i; j E 2 * 1 N å i6= j e X i; j S;S b S l + 2 p N å i6= j D e X i; j S; e U i; j b U i; j E ; (A.5) where e X i; j = e Z i; j b Z T i; j . We will bound (A.5) in two cases. Case 1: Assume that 2 N å i6= j D S b S l + p N( e U i; j b U i; j );S b S l + p N( e U i; j b U i; j ) E 0: Applying the law of cosines, 2hA;Bi=kAk 2 F +kBk 2 F kA Bk 2 F ,8A;B2R dd , to the left hand side of (A.5), we get that 1 N å i6= j S b S l + p N( e U i; j b U i; j ) 2 F + 1 N å i6= j S b S l + p N( e U i; j b U i; j ) 2 F kS Sk 2 F + 2 * 1 N å i6= j e X i; j S; b S l S + + 2 p N å i6= j D e X i; j S; b U i; j e U i; j E l 1 D V; b S l S E l 2å i6= j D W i; j ; b U i; j e U i; j E : (A.6) We will now analyze the terms on the right-hand side of equation (A.6) one by one. First, let S=å rank(S) j=1 s j (S)u j v T j be the singular value decomposition of S, where s j (S) is the j-th largest singular value of S. Then we can represent any V2¶kSk 1 by V =å rank(S) j=1 u j v T j + P L ?WP L ? for somekWk 1, where L= spanfu 1 ;:::;u rank(S) g. From this representation, we have thatP L (V)= V P L ?V P L ? =å rank(S) j=1 u j v T j , and D V; b S l S E = D P L (V); b S l S E D P L ?V P L ?; b S l S E 63 = D P L (V); b S l S E D W;P L ? b S l P L ? E D P L (V); b S l S E P L ? b S l P L ? 1 kP L (V)k F b S l S F P L ? b S l P L ? 1 = p rank(S) b S l S F P L ? b S l P L ? 1 ; (A.7) where we chose W such that D W;P L ? b S l P L ? E = P L ? b S l P L ? 1 . Similarly, let L i; j be the image of e U i; j ;(i; j)2 I 2 n , then for properly chosen W 1;2 2¶ e U 1;2 1 ;:::;W n;n1 2¶ e U n;n1 1 , we have that å i6= j D W i; j ; b U i; j e U i; j E å i6= j P L ? i; j b U i; j P L ? i; j 1 + å i6= j D P L i; j (W i; j ); b U i; j e U i; j E å i6= j q rank( e U i; j ) b U i; j e U i; j F å i6= j P L ? i; j b U i; j P L ? i; j 1 å (i; j)2 e J p 2 b U i; j e U i; j F å i6= j P L ? i; j b U i; j P L ? i; j 1 ; (A.8) where we used the fact that rank( e U i; j ) 2, and e U i; j = 0, L i; j =f0g for(i; j)= 2 e J. Next, we denote D := 1 N å i6= j e X i; j S and recall the linear operator defined in (A.1): P(A)= A P L ?AP L ?: It is easy to check thatP L (D)= P L ?DP L + P L D, which implies rank(P L (D)) 2rank(S). There- fore, D D; b S l S E = D P L (D); b S l S E + D P L ?DP L ?; b S l S E = D P L (D); b S l S E + D D;P L ?( b S l S)P L ? E kP L (D)k F b S l S F +kDk P L ? b S l P L ? 1 p rank(P L (D))kP L (D)k b S l S F +kDk P L ? b S l P L ? 1 p 2rank(S)kDk b S l S F +kDk P L ? b S l P L ? 1 ; (A.9) 64 where the last inequality follows from the boundkP L (D)kkDk. 
Finally, it is easy to see that å i6= j D e X i; j S; b U i; j e U i; j E = å i6= j D P L i; j ( e X i; j S); b U i; j e U i; j E + å i6= j D P L ? i; j ( e X i; j S)P L ? i; j ; b U i; j e U i; j E å i6= j P L i; j ( e X i; j S) F b U i; j e U i; j F + å i6= j e X i; j S P L ? i; j b U i; j P L ? i; j 1 å (i; j)2 e J q 2rank( e U i; j ) e X i; j S b U i; j e U i; j F + å i6= j e X i; j S P L ? i; j b U i; j P L ? i; j 1 å (i; j)2 e J p 4 e X i; j S b U i; j e U i; j F + å i6= j e X i; j S P L ? i; j b U i; j P L ? i; j 1 : (A.10) Combining inequalities (A.7, A.8, A.9, A.10) with (A.6), we deduce that 1 N å i6= j S b S l + p N( e U i; j b U i; j ) 2 F + 1 N å i6= j S b S l + p N( e U i; j b U i; j ) 2 F kS Sk 2 F + 2 p 2rank(S)kDk b S l S F +kDk P L ? b S l P L ? 1 + 2 p N 0 @ å (i; j)2 e J p 4 e X i; j S b U i; j e U i; j F + å i6= j e X i; j S P L ? i; j b U i; j P L ? i; j 1 1 A +l 1 p rank(S) b S l S F P L ? b S l P L ? 1 +l 2 0 @ å (i; j)2 e J p 2 b U i; j e U i; j F å i6= j P L ? i; j b U i; j P L ? i; j 1 1 A ; which is equivalent to 1 N å i6= j S b S l + p N( e U i; j b U i; j ) 2 F + 1 N å i6= j S b S l + p N( e U i; j b U i; j ) 2 F +(l 1 2kDk) P L ? b S l P L ? 1 + l 2 2 p N max i6= j e X i; j S P L ? i; j b U i; j P L ? i; j 1 kS Sk 2 F + 2 p 2rank(S)kDk+l 1 p rank(S) b S l S F + å (i; j)2 e J 4 p N e X i; j S +l 2 p 2 b U i; j e U i; j F : (A.11) 65 Now consider the event E 1 := l 1 2kDk;l 2 3 p N max i6= j e X i; j S : We will derive a bound for b S l S 2 F +å i6= j b U i; j e U i; j 2 F onE 1 . Applying the identitykA+ Bk 2 F = kAk 2 F +kBk 2 F + 2hA;Bi to the the left-hand side of (A.11), we get that on the eventE 1 , 1 N å i6= j S b S l 2 F + 1 N å i6= j N e U i; j b U i; j 2 F 2 N å i6= j D S b S l ; p N( e U i; j b U i; j ) E + 1 N å i6= j S b S l 2 F + 1 N å i6= j N e U i; j b U i; j 2 F 2 N å i6= j D S b S l ; p N( e U i; j b U i; j ) E kS Sk 2 F + 2 p 2rank(S)kDk+l 1 p rank(S) b S l S F + å (i; j)2 e J 4 p N e X i; j S +l 2 p 2 b U i; j e U i; j F ; which is equivalent to S b S l 2 F + S b S l 2 F + 2 å i6= j e U i; j b U i; j 2 F kS Sk 2 F + 2 p 2rank(S)kDk+l 1 p rank(S) b S l S F + å (i; j)2 e J 4 p N e X i j S +l 2 p 2 b U i; j e U i; j F + 2 p N å i6= j D S b S l ; e U i; j b U i; j E + 2 p N å i6= j D S b S l ; e U i; j b U i; j E : (A.12) We now bound the inner product terms on the right-hand side. First, combining inequalities (A.7, A.8, A.9, A.10) with (A.5), we deduce the following bound: 2 N å i6= j D S b S l + p N( e U i; j b U i; j );S b S l + p N( e U i; j b U i; j ) E +(l 1 2kDk) P L ? b S l P L ? 1 + l 2 2 p N max i6= j e X i; j S P L ? i; j b U i; j P L ? i; j 1 66 2 p 2rank(S)kDk+l 1 p rank(S) b S l S F + å (i; j)2 e J 4 p N e X i; j S +l 2 p 2 b U i; j e U i; j F : (A.13) On the eventE 1 along with the assumption that 2 N å i6= j D S b S l + p N( e U i; j b U i; j );S b S l + p N( e U i; j b U i; j ) E 0; (A.13) implies that 1 3 l 2 P L ? i; j b U i; j P L ? i; j 1 2 p 2rank(S)kDk+l 1 p rank(S) b S l S F + å (i; j)2 e J 4 p N e X i; j S +l 2 p 2 b U i; j e U i; j F : Recall that L i; j =f0g for any(i; j) = 2 e J, hence l 2 å (i; j)= 2 e J P L ? i; j b U i; j P L ? i; j 1 =l 2 å (i; j)= 2 e J P L ? i; j b U i; j e U i; j P L ? 
i; j 1 =l 2 å (i; j)= 2 e J b U i; j e U i; j 1 3 2 p 2rank(S)kDk+l 1 p rank(S) b S l S F + 3 å (i; j)2 e J 4 p N e X i; j S +l 2 p 2 b U i; j e U i; j F : (A.14) Next, we can estimateå (i; j)= 2 e J D b S l S; b U i; j e U i; j E as follows: å (i; j)= 2 e J D b S l S; b U i; j e U i; j E b S l S å (i; j)= 2 e J b U i; j e U i; j 1 3 b S l S l 2 " 2 p 2rank(S)kDk+l 1 p rank(S) b S l S F + å (i; j)2 e J 4 p N e X i; j S +l 2 p 2 b U i; j e U i; j F # 67 3 b S l S 2 4 ( p 2+ 1) p rank(S) l 1 l 2 b S l S F + 4 3 + p 2 å (i; j)2 e J b U i; j e U i; j F 3 5 ; (A.15) where the last inequality holds on eventE 1 . This implies that 2 p N å i6= j D S b S l ; e U i; j b U i; j E 2 p N å (i; j)2 e J D S b S l ; e U i; j b U i; j E + 6 p N b S l S " ( p 2+ 1) p rank(S) l 1 l 2 b S l S F + 4 3 + p 2 å (i; j)2 e J b U i; j e U i; j F # 2 b S l S F 2 2 1 p N r e J v u u t å (i; j)2 e J b U i; j e U i; j 2 F +(6 p 2+ 6) l 1 l 2 r rank(S) N b S l S 2 F + 2 b S l S F 2 (8+ 6 p 2) 1 p N r e J v u u t å (i; j)2 e J b U i; j e U i; j 2 F " (6 p 2+ 6) r rank(S) N l 1 l 2 + 1 2 # b S l S 2 F + h 4+(6 p 2+ 8) 2 i j e Jj N å (i; j)2 e J b U i; j e U i; j 2 F ; (A.16) where the second inequality follows from the fact thatkAkkAk F for any symmetric matrix A, and the last inequality follows from the fact that 2ab a 2 + b 2 for any real numbers a;b. Similarly, we deduce that 2 p N å i6= j D S b S l ; e U i; j b U i; j E (6 p 2+ 6) r rank(S) N l 1 l 2 b S l S 2 F + " (6 p 2+ 6) r rank(S) N l 1 l 2 + 1 2 # b S l S 2 F + h 4+(6 p 2+ 8) 2 i j e Jj N å (i; j)2 e J b U i; j e U i; j 2 F : (A.17) Combining (A.16, A.17) with (A.12), one sees that on eventE 1 , S b S l 2 F + S b S l 2 F + 2 å i6= j e U i; j b U i; j 2 F 68 kS Sk 2 F + 2 p 2rank(S)kDk+l 1 p rank(S) b S l S F + å (i; j)2 e J 4 p N e X i; j S +l 2 p 2 b U i; j e U i; j F + 2 h 4+(6 p 2+ 8) 2 i j e Jj N å (i; j)2 e J b U i; j e U i; j 2 F + " 2(6 p 2+ 6) r rank(S) N l 1 l 2 + 1 2 # b S l S 2 F + " (6 p 2+ 6) r rank(S) N l 1 l 2 + 1 2 # b S l S 2 F kS Sk 2 F +2l 1 p 2+ 1 p rank(S) 1 2 b S l S F +2 1 p 2 l 2 ( 4 3 + p 2) q j e Jj 1 p 2 v u u t å (i; j)2 e J b U i; j e U i; j 2 F + " 2(6 p 2+ 6) r rank(S) N l 1 l 2 + 1 2 # b S l S 2 F + " (6 p 2+ 6) r rank(S) N l 1 l 2 + 1 2 # b S l S 2 F + 2 h 4+(6 p 2+ 8) 2 i j e Jj N å (i; j)2 e J b U i; j e U i; j 2 F kS Sk 2 F + 1 4 b S l S 2 F + 1 2 å i6= j e U i; j b U i; j 2 F +l 2 1 p 2+ 1 2 rank(S)+ 1 2 l 2 2 ( 4 3 + p 2) 2 j e Jj + " 2(6 p 2+ 6) r rank(S) N l 1 l 2 + 1 2 # b S l S 2 F + " (6 p 2+ 6) r rank(S) N l 1 l 2 + 1 2 # b S l S 2 F + 2 h 4+(6 p 2+ 8) 2 i j e Jj N å (i; j)2 e J b U i; j e U i; j 2 F : (A.18) Assuming that 2(6+ 6 p 2) q rank(S) N l 1 l 2 1 8 and 2 h 4+(6 p 2+ 8) 2 i j e Jj N 11 8 , we conclude that 1 8 S b S l 2 F + å i6= j e U i; j b U i; j 2 F ! kS Sk 2 F + rank(S)l 2 1 p 2+ 1 2 +l 2 2 (4=3+ p 2) 2 2 j e Jj: (A.19) The assumptions above are valid provided rank(S) 1 56000 n 2 l 2 2 l 2 1 andj e Jj N 402 . Note that if we apply the inequality 2ab a 2 + b 2 in the derivation above with different choices of constants, we can reduce the conditions on rank(S) andj e Jj to rank(S) c 1 n 2 l 2 2 l 2 1 ; 8c 1 1 5980 (A.20) 69 and j e Jj c 2 N; 8c 2 1 295 : Case 2: Assume that 2 N å i6= j D S b S l + p N( e U i; j b U i; j );S b S l + p N( e U i; j b U i; j ) E < 0: We start with several lemmas. Lemma A.2. On the event E 2 := ( l 1 4 1 N å i6= j e X i; j S(k) ;l 2 4 p N max i6= j e X i; j S(k) ) ; the following inequality holds l 1 P L(k) ? b S l P L(k) ? 1 +l 2å i6= j P L ? i; j b U i; j P L ? 
i; j 1 3 0 @ l 1 P L(k) ( b S l S(k)) 1 +l 2 å (i; j)2 e J P L i; j ( b U i; j e U i; j ) 1 1 A ; where L(k)= Im(S(k)), L i; j = Im e U i; j , and P L(k) , P L i; j are the orthogonal projections onto the corresponding subspaces. Proof of Lemma A.2. Denote Q(S;U 1;2 ;:::;U n;n1 ) := 1 N å i6= j e Y i; j e Y T i; j S p NU i; j 2 F : By definition of b S l , Q( b S l ; b U 1;2 ;:::; b U n;n1 ) Q(S(k); e U 1;2 ;:::; e U n;n1 ) 70 l 1 kS(k)k 1 k b S l k 1 +l 2å i6= j (k e U i; j k 1 k b U i; j k 1 ): (A.21) By convexity of thekk 1 norm, for any V2¶kS(k)k 1 ,kS(k)k 1 k b S l k 1 D V;S(k) b S l E . Let r = rank(S(k)) k, we have the representation V =å r j=1 v j v T j + P L(k) ?WP L(k) ?, wherekWk 1. By duality between the spectral and nuclear norm (Proposition A.1), we deduce that with an appropriate choice of W, kS(k)k 1 k b S l k 1 D V;S(k) b S l E = D P L(k) (V);S(k) b S l E + D P L(k) ?WP L(k) ?;S(k) b S l E P L(k) (S(k) b S l ) 1 P L(k) ? b S l P L(k) ? 1 : (A.22) Similalry, å i6= j (k e U i; j k 1 k b U i; j k 1 ) å i6= j P L i; j ( e U i; j b U i; j ) 1 P L ? i; j b U i; j P L ? i; j 1 ; (A.23) where L i; j is the image of e U i; j ,8(i; j)2 I 2 n . On the other hand, recall that e Y i; j e Y T i; j = e X i; j + p N e U i; j andÑQ is given by the first term in equation (A.4). Convexity of Q implies that Q( b S l ; b U 1;2 ;:::; b U n;n1 ) Q(S(k); e U 1;2 ;:::; e U n;n1 ) D ÑQ S(k); e U 1;2 ;:::; e U n;n1 ;( b S l S(k); b U 1;2 e U 1;2 ;:::; b U n;n1 e U n;n1 ) E = 2 N å i6= j D e X i; j S(k); b S l S(k) E 2 p N å i6= j D e X i; j S(k); b U i; j e U i; j E = 2 * S(k) 1 N å i6= j e X i; j ; b S l S(k) + + 2 p N å i6= j D S(k) e X i; j ; b U i; j e U i; j E 2 S(k) 1 N å i6= j e X i; j b S l S(k) 1 2 p N max i6= j S(k) e X i; j å i6= j b U i; j e U i; j 1 : (A.24) 71 On the event E 2 := ( l 1 4 1 N å i6= j e X i; j S(k) ;l 2 4 p N max i6= j e X i; j S(k) ) ; the inequality (A.24) implies that Q( b S l ; b U 1;2 ;:::; b U n;n1 ) Q(S(k); e U 1;2 ;:::; e U n;n1 ) 1 2 l 1 b S l S(k) 1 +l 2å i6= j b U i; j e U i; j 1 ! : (A.25) Moreover, note that b S l S(k) 1 P L(k) ( b S l S(k)) 1 + P L(k) ? b S l P L(k) ? 1 and b U i; j e U i; j 1 P L i; j ( b U i; j e U i; j ) 1 + P L ? i; j b U i; j P L ? i; j 1 : Combining these inequalities with (A.25), we get the lower bound Q( b S l ; b U 1;2 ;:::; b U n;n1 ) Q(S(k); e U 1;2 ;:::; e U n;n1 ) 1 2 " l 1 P L(k) ( b S l S(k)) 1 + P L(k) ? b S l P L(k) ? 1 +l 2 P L i; j ( b U i; j e U i; j ) 1 + P L ? i; j b U i; j P L ? i; j 1 # : (A.26) Combining (A.21, A.22, A.23) with the lower bound (A.26), we deduce the “sparsity inequality” l 1 P L(k) ? b S l P L(k) ? 1 +l 2å i6= j P L ? i; j b U i; j P L ? i; j 1 3 0 @ l 1 P L(k) ( b S l S(k)) 1 +l 2 å (i; j)2 e J P L i; j ( b U i; j e U i; j ) 1 1 A : 72 Lemma A.3. Assume that max 6 p 2 l 1 l 2 q k N ;7 q j e Jj N 1 4 . Then on the eventE 2 of Lemma A.2, the following inequality holds: b S l S 2 F + å i6= j b U i; j e U i; j 2 F 2 N å i6= j b S l S+ p N( b U i; j e U i; j ) 2 F : Proof of Lemma A.3. First, we consider the decomposition 2 p N å i6= j D b S l S; b U i; j e U i; j E = 2 p N å (i; j)2 e J D b S l S; b U i; j e U i; j E | {z } I + 2 p N å (i; j)= 2 e J D b S l S; b U i; j e U i; j E | {z } II : (A.27) For the term I, we have that I 2 b S l S F p N å (i; j)2 e J b U i; j e U i; j F 2 b S l S F s j e Jj N v u u t å (i; j)2 e J b U i; j e U i; j 2 F : (A.28) To estimate the term II, note thatP L(k) (S)=S(k)=P L(k) (S(k)) as L(k)= Im(S(k)). Moreover, å (i; j)= 2 e J P L ? i; j b U i; j P L ? 
i; j 1 =å (i; j)= 2 e J b U i; j e U i; j 1 , hence Lemma A.2 yields that on the eventE 2 , II 2 b S l S p N å (i; j)= 2 e J b U i; j e U i; j 1 2 b S l S p N 3 0 @ l 1 l 2 P L(k) ( b S l S(k)) 1 + å (i; j)2 e J P L i; j ( b U i; j e U i; j ) 1 1 A 6 b S l S p N l 1 l 2 p 2rank(S(k)) P L(k) ( b S l S) F + å (i; j)2 e J q 2rank( e U i; j ) P L i; j ( b U i; j e U i; j ) F ! 6 b S l S p N 0 @ l 1 l 2 p 2k b S l S F + å (i; j)2 e J p 4 b U i; j e U i; j F 1 A 73 6 b S l S F 0 B @ l 1 l 2 r 2k N b S l S F + 2 s j e Jj N v u u t å (i; j)2 e J b U i; j e U i; j 2 F 1 C A : (A.29) Therefore, 2 p N å i6= j D b S l S; b U i; j e U i; j E 6 b S l S 2 F l 1 l 2 r 2k N s j e Jj N 2 p 7 b S l S F 0 B @ p 7 v u u t å (i; j)2 e J b U i; j e U i; j 2 F 1 C A 6 b S l S 2 F l 1 l 2 r 2k N s j e Jj N 0 @ 7 b S l S 2 F + 7 å (i; j)2 e J b U i; j e U i; j 2 F 1 A 0 @ 6 p 2 l 1 l 2 r k N + 7 s j e Jj N 1 A b S l S 2 F 7 s j e Jj N å (i; j)2 e J b U i; j e U i; j 2 F ; (A.30) where we used 2ab a 2 + b 2 in the second inequality. Finally, given the assumption that max n 6 p 2 l 1 l 2 r k N ;7 s e J N o 1 4 ; on the eventE 2 we have that 2 N å i6= j b S l S+ p N( b U i; j e U i; j ) 2 F = 2 b S l S 2 F + å i6= j b U i; j e U i; j 2 F + 2 p N å i6= j D b S l S; b U i; j e U i; j E ! 2 b S l S 2 F + å i6= j b U i; j e U i; j 2 F 1 2 b S l S 2 F 1 4 å i6= j b U i; j e U i; j 2 F ! b S l S 2 F + å i6= j b U i; j e U i; j 2 F : (A.31) 74 Remark A.2. We now consider the intersection of eventsE 1 andE 2 . Consider k =b Nl 2 2 1200l 2 1 c, e J N 6400 (implying that e J N 800 ). Corollary A.1 guarantees thatkS(k)SkkSk q rk(S) k , so 4 1 N å i6= j e X i; j S(k) 4 1 N å i6= j e X i j S +kSS(k)k ! 4 1 N å i6= j e X i; j S + 4kSk r rk(S) k 4 1 N å i6= j e X i; j S + 140kSk r rk(S) N : Similarly, for the second term we have that 4 p N max i6= j e X i; j S(k) 4 p N max i6= j e X i; j S +kS(k)Sk 4 p N max i6= j e X i; j S + 4 p N kS(k)Sk 4 1 p N max i6= j e X i; j S + 4 p N kSk r rk(S) k 4 1 p N max i6= j e X i; j S + 140 kSk p rk(S) N : Therefore, the event E := ( l 1 140kSk p N p rk(S)+ 4 1 N å i6= j e X i; j S ; l 2 140kSk N p rk(S)+ 4 1 p N max i6= j e X i; j S ) is a subset of both eventE 1 and eventE 2 , and all previous results hold on the eventE naturally. Now applying law of cosines to the left-hand side of 2 N å i6= j D S b S l + p N( e U i; j b U i; j );S b S l + p N( e U i; j b U i; j ) E < 0; 75 we get that 1 N å i6= j S b S l + p N( e U i; j b U i; j ) 2 F + 1 N å i6= j S b S l + p N( e U i; j b U i; j ) 2 F <kS Sk 2 F : This implies the inequality 1 N å i6= j S b S l + p N( e U i; j b U i; j ) 2 F <kS Sk 2 F : (A.32) On the eventE with k=b Nl 2 2 1200l 2 1 c, e J N 800 , we can combine the result of Lemma A.3 with the equation (A.32) to get that 1 2 b S l S 2 F + å i6= j b U i; j e U i; j 2 F ! kS Sk 2 F : (A.33) This bound is consistent with (A.19), which provides upper bounds for both the estimation ofS and e U i; j ;(i; j)2 I 2 n . To complete the proof, we repeat part of the previous argument to derive an improved bound for the estimation ofS only, while treating e U i; j as “nuisance parameters”. 
Let G(S) := 1 N å i6= j e Y i; j e Y T i; j S p N b U i; j 2 F +l 1 kSk 1 (A.34) as before, and we note that the directional derivative of G at the point b S l in the direction S b S l is nonnegative for any symmetric matrix S, implying that there exists b V2¶ b S l 1 such that 2 N å i6= j D e Y i; j e Y T i; j b S l p N b U i; j ;S b S l E +l 1 D b V;S b S l E 0: Proceeding as before, we see that there exists V2¶kSk 1 such that 2 N å i6= j D S b S l ;S b S l E l 1 D V;S b S l E + 2 N å i6= j D e X i; j S; b S l S E 76 + 2 p N å i6= j D e U i; j b U i; j ; b S l S E : Combining (A.7, A.9) with the inequality above and applying the law of cosines, we deduce that S b S l 2 F + S b S l 2 F +(l 1 2kDk) P L ? b S l P L ? 1 2 p N å i6= j D e U i; j b U i; j ; b S l S E + 2 p 2rank(S)kDk+l 1 p rank(S) b S l S F +kS Sk 2 F ; (A.35) where, as before, D= 1 N å i6= j e X i; j S. On the eventE , we have that l 1 2kDk, so using the inequality 2ab a 2 + b 2 , we get that 2 p 2rank(S)kDk+l 1 p rank(S) b S l S F = 2 p 2 2 p 2rank(S)kDk+l 1 p rank(S) 0 @ b S l S F 2 p 2 1 A 1 8 b S l S 2 F + 2l 2 1 rank(S)( p 2+ 1) 2 : (A.36) On the other hand, we can repeat the reasoning in (A.29) and apply Lemma A.2 to deduce that å (i; j)= 2 e J b U i; j e U i; j 1 = å e J P L ? i; j b U i; j P L ? i; j 1 3 l 1 l 2 P L(k) b S l S(k) 1 + å e J P L i; j b U i; j e U i; j 1 ! 3 l 1 l 2 p 2k b S l S F + 2 å e J b U i; j e U i; j F ! : Therefore, 2 p N å i6= j D e U i; j b U i; j ; b S l S E 77 2 S b S l F p N å (i; j)2 e J e U i; j b U i; j F + 2 S b S l p N å (i; j)= 2 e J e U i; j b U i; j 1 2 S b S l F p N 7 å e J e U i; j b U i; j F + l 1 l 2 3 p 2k b S l S F ! l 1 l 2 3 p 2 r k N S b S l 2 F + b S l S 2 F + 14 S b S l F s j e Jj N s å e J b U i; j e U i; j 2 F : (A.37) To estimate r å e J b U i; j e U i; j 2 F , we apply the inequality (A.19) which entails that s å e J b U i; j e U i; j 2 F 2 p 2 kS Sk F + p rank(S)l 1 ( p 2 + 1) + l 2 (4=3+ p 2) p 2 q j e Jj ! ; given that k =b Nl 2 2 1200l 2 1 c, rank(S) 1 56000 n 2 l 2 2 l 2 1 ,j e Jj N 6400 . Therefore, by applying the bound 2ab a 2 + b 2 several times, we deduce that 14 S b S l F s j e Jj N s å e J b U i; j e U i; j 2 F 2 14 p 2 s j e Jj N S b S l F kS Sk F + 2 S b S l F 2 28(4=3+ p 2)l 2 s j e Jj 2 N + 2 2 p 2( p 2+ 1) s j e Jj N S b S l F 7l 1 p rank(S) 14 p 2 s j e Jj N S b S l 2 F +kS Sk 2 F + 8( p 2+ 1) 2 j e Jj N + 1 4 ! S b S l 2 F + 49l 2 1 rank(S)+ 28(4=3+ p 2) 2 j e Jj 2 N l 2 2 : Combining this with (A.35,A.36,A.37), we obtain that S b S l 2 F 11 5 kS Sk 2 F + 8 5 2( p 2+ 1) 2 + 49 l 2 1 rank(S) + 8 5 28(4=3+ p 2) 2 j e Jj 2 N l 2 2 (A.38) 78 under the assumptions that rank(S) 1 56000 n 2 l 2 2 l 2 1 , 3 p 2 l 1 l 2 q k N +14 p 2 q j e Jj N 3 8 and 8( p 2+1) 2 j e Jj N + 1 4 1 2 . The assumptions hold for rank(S) 1 56000 n 2 l 2 2 l 2 1 , k=b Nl 2 2 1200l 2 1 c andj e Jj N 6400 . Note that the coefficient 11 5 can be made smaller. Givend2(0; 3 8 ), we assume that 3 p 2 l 1 l 2 q k N +14 p 2 q j e Jj N d, which holds with the choices of k Nl 2 2 72l 2 1 d 2 andj e Jj N 1568 d 2 respectively. Also, we assume that rank(S) c 1 n 2 l 2 2 l 2 1 for some constant c 1 1 5980 according to (A.20). Then (A.38) becomes S b S l 2 F 1+d 1d kS Sk 2 F + 1 1d 2( p 2+ 1) 2 + 49 l 2 1 rank(S) + 1 1d 28(4=3+ p 2) 2 j e Jj 2 N l 2 2 ; (A.39) where 1+d 1d 2(1; 11 5 ] is a number close to 1. 
Finally, by (2.3.6), we see that $|\tilde J|/N \le 2|J|/n$, so we can write the last term of inequality (A.39) as
\[
28(4/3+\sqrt2)^2\,\frac{|\tilde J|^2}{N}\,\lambda_2^2
= 28(4/3+\sqrt2)^2\,\lambda_2^2\,\frac{|J|^2(2n-|J|-1)^2}{n(n-1)}
\le 4\cdot 28(4/3+\sqrt2)^2\,\lambda_2^2\,|J|^2
\]
under the assumption that $|J| \le n\delta^2/3136$. This completes the proof.

A.1.3 Proof of Theorem 2.3.2

In this section we prove Theorem 2.3.2, which provides the lower bound for the choice of $\lambda_1$. We start with a well-known theorem on the concentration of the sample covariance matrix.

Theorem A.2 (Koltchinskii and Lounici [31, Theorem 9]). Assume that $Z$ is $L$-sub-Gaussian with mean zero and covariance matrix $\Sigma$. Let $Z_1,\dots,Z_n$ be independent samples of $Z$; then there exists $c(L)>0$ depending only on $L$ such that
\[
\Big\| \frac1n \sum_{j=1}^n Z_j Z_j^T - \Sigma \Big\|
\le c(L)\,\|\Sigma\| \left( \sqrt{\frac{\mathrm{rk}(\Sigma)}{n}} \vee \frac{\mathrm{rk}(\Sigma)}{n} \vee \sqrt{\frac tn} \vee \frac tn \right)
\]
with probability at least $1-e^{-t}$.

Remark A.3. Assuming that $\mathrm{rk}(\Sigma)\le n$ and $t\le n$, the bound can be reduced to
\[
c(L)\,\|\Sigma\|\left( \sqrt{\frac{\mathrm{rk}(\Sigma)}{n}} \vee \sqrt{\frac tn} \right).
\]

Now we prove Theorem 2.3.2.

Proof of Theorem 2.3.2. First, it is well known that
\[
\frac{1}{n(n-1)} \sum_{i\ne j} \frac{(Z_i - Z_j)(Z_i - Z_j)^T}{2}
= \frac{1}{n-1} \sum_{i=1}^n (Z_i - \bar Z)(Z_i - \bar Z)^T, \tag{A.40}
\]
where $\bar Z := \frac1n\sum_{i=1}^n Z_i$. Therefore,
\[
D := \frac{1}{n(n-1)} \sum_{i\ne j} \tilde Z_{ij}\tilde Z_{ij}^T - \Sigma
= \frac{1}{n(n-1)} \sum_{i\ne j} \frac{(Z_i-Z_j)(Z_i-Z_j)^T}{2} - \Sigma
= \tilde\Sigma_s - \Sigma.
\]
Recall that $\mathbb E\, Z_j = \mu$, $j=1,\dots,n$, and note that
\[
\tilde\Sigma_s = \frac{1}{n-1}\left( \sum_{i=1}^n (Z_i-\mu)(Z_i-\mu)^T - n(\bar Z-\mu)(\bar Z-\mu)^T \right),
\]
hence we have the decomposition
\[
(n-1)\|D\| = \big\| (n-1)\tilde\Sigma_s - n\Sigma + \Sigma \big\|
\le \Big\| \sum_{i=1}^n (Z_i-\mu)(Z_i-\mu)^T - n\Sigma \Big\|
+ \big\| \Sigma - n(\bar Z-\mu)(\bar Z-\mu)^T \big\|. \tag{A.41}
\]
We will bound the two terms on the right-hand side of (A.41) one by one. First, note that $Z_j-\mu$, $j=1,\dots,n$, are i.i.d. $L$-sub-Gaussian random vectors with mean zero and covariance matrix $\Sigma$, hence Theorem A.2 immediately gives that
\[
\Big\| \sum_{i=1}^n (Z_i-\mu)(Z_i-\mu)^T - n\Sigma \Big\|
\le n\,c(L)\,\|\Sigma\| \left( \sqrt{\frac{\mathrm{rk}(\Sigma)}{n}} \vee \frac{\mathrm{rk}(\Sigma)}{n} \vee \sqrt{\frac tn} \vee \frac tn \right) \tag{A.42}
\]
with probability at least $1-e^{-t}$. To bound the second term, consider the random variable $Y := \sqrt n(\bar Z-\mu) = \frac{1}{\sqrt n}\sum_{i=1}^n (Z_i-\mu)$. Clearly $\mathbb E[Y]=0$ and
\[
\mathbb E\,YY^T = n\,\mathbb E\,(\bar Z-\mu)(\bar Z-\mu)^T
= \frac1n\,\mathbb E\Big[ \sum_{i=1}^n\sum_{j=1}^n (Z_i-\mu)(Z_j-\mu)^T \Big]
= \frac1n\,\mathbb E\Big[ \sum_{i=1}^n (Z_i-\mu)(Z_i-\mu)^T \Big] = \Sigma,
\]
where we used the independence of $Z_i$, $i=1,\dots,n$, in the third equality. Moreover, Corollary 2.2.1 guarantees that $Y$ is $L$-sub-Gaussian. Therefore, $Y$ satisfies the conditions of Theorem A.2, and a direct application of the theorem implies that
\[
\big\| \Sigma - n(\bar Z-\mu)(\bar Z-\mu)^T \big\| = \big\| YY^T - \Sigma \big\|
\le c(L)\,\|\Sigma\|\,(\mathrm{rk}(\Sigma)\vee t) \tag{A.43}
\]
with probability at least $1-e^{-t}$, given that $t\ge 1$. Combining (A.41, A.42, A.43), we deduce that for any $t\ge 1$,
\[
\|D\| \le c(L)\left[ \frac{n}{n-1}\,\|\Sigma\| \left( \sqrt{\frac{\mathrm{rk}(\Sigma)}{n}} \vee \frac{\mathrm{rk}(\Sigma)}{n} \vee \sqrt{\frac tn} \vee \frac tn \right)
+ \frac{1}{n-1}\,\|\Sigma\|\,(\mathrm{rk}(\Sigma)\vee t) \right]
\le c(L)\,\|\Sigma\|\left[ \sqrt{\frac{\mathrm{rk}(\Sigma)+t}{n}} + \frac{\mathrm{rk}(\Sigma)+t}{n} \right]
\]
with probability at least $1-2e^{-t}$, where $c(L)$ is an absolute constant that only depends on $L$ but could vary from step to step.

A.1.4 Proof of Theorem 2.3.3

In this section we prove Theorem 2.3.3, which provides the lower bound for the choice of $\lambda_2$.

Proof of Theorem 2.3.3. Fix $i\in\{1,\dots,n\}$; applying Theorem A.2 to $Z_i$, we deduce that for any $u\ge 1$,
\[
\big\| Z_i Z_i^T - \Sigma \big\| \le c(L)\,\|\Sigma\|\,(\mathrm{rk}(\Sigma)+u) = c(L)\,\big(\mathrm{tr}(\Sigma)+\|\Sigma\|u\big)
\]
with probability at least $1-e^{-u}$. Therefore, by the union bound we have that for $t\ge 1$ and $n\ge 1$,
\[
\mathbb P\Big( \max_i \big\| Z_i Z_i^T - \Sigma \big\| \ge c(L)\big[\mathrm{tr}(\Sigma)+\log(n)\|\Sigma\|+\|\Sigma\|t\big] \Big)
\le \sum_{i=1}^n \mathbb P\Big( \big\| Z_i Z_i^T - \Sigma \big\| \ge c(L)\big[\mathrm{tr}(\Sigma)+\log(n)\|\Sigma\|+\|\Sigma\|t\big] \Big)
= n\,\mathbb P\Big( \big\| Z_i Z_i^T - \Sigma \big\| \ge c(L)\big[\mathrm{tr}(\Sigma)+\|\Sigma\|(\log(n)+t)\big] \Big)
\le n\,e^{-\log(n)-t} = e^{-t}.
\]
In other words, for $t\ge 1$ we have that, with probability at least $1-e^{-t}$,
\[
\max_i \big\| Z_i Z_i^T - \Sigma \big\| \le c(L)\big[\mathrm{tr}(\Sigma)+\|\Sigma\|(\log(n)+t)\big]
= c(L)\,\|\Sigma\|\,\big(\mathrm{rk}(\Sigma)+\log(n)+t\big),
\]
as desired.
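As an aside, the identity (A.40) relating the pairwise-difference U-statistic to the usual sample covariance is elementary to check numerically. The short Python sketch below is purely illustrative, with arbitrary sample size and dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
Z = rng.standard_normal((n, d))

# Left-hand side of (A.40): (1/(n(n-1))) * sum_{i != j} (Z_i - Z_j)(Z_i - Z_j)^T / 2
lhs = np.zeros((d, d))
for i in range(n):
    for j in range(n):
        if i != j:
            diff = Z[i] - Z[j]
            lhs += np.outer(diff, diff) / 2.0
lhs /= n * (n - 1)

# Right-hand side of (A.40): the sample covariance (1/(n-1)) * sum_i (Z_i - Zbar)(Z_i - Zbar)^T
Zc = Z - Z.mean(axis=0)
rhs = Zc.T @ Zc / (n - 1)

print(np.allclose(lhs, rhs))  # expected: True
```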
A.2 Proof of Lemma 2.4.1 and Theorem 2.4.1

In this subsection we present the proof of Lemma 2.4.1 and Theorem 2.4.1, which provide error bounds for the estimator in (2.4.1) in the operator norm. To simplify the expressions, we introduce the following notations, which are valid in this subsection only:

Denote
\[
\rho(u) := \rho_1(u) =
\begin{cases}
\dfrac{u^2}{2}, & |u|\le 1,\\[4pt]
|u| - \dfrac12, & |u|>1,
\end{cases}
\qquad \forall u\in\mathbb R.
\]
Denote $k_0 = \lfloor n/2\rfloor$ and $N = n(n-1)$. Denote $\theta_2 := \frac{2}{\lambda_2\sqrt{n(n-1)}}$ and $\theta_\sigma := \frac1\sigma\sqrt{\frac{2t}{k_0}}$, where $\sigma>0$, $t>0$ are constants to be specified later.

It is easy to check that $\rho_\lambda(u) = \lambda^2\rho(u/\lambda)$ and $\rho'_\lambda(u) = \lambda\rho'(u/\lambda)$, so with the above notations we can rewrite the loss function in (2.4.2) as
\[
L(S) = \frac{1}{\theta_2^2}\,\frac1N \sum_{i\ne j} \mathrm{tr}\,\rho\big(\theta_2(H_{i,j}-S)\big) + \frac{\lambda_1}{2}\,\|S\|_1. \tag{A.44}
\]
The gradient of the loss function is
\[
\nabla L(S) = -\frac{1}{N\theta_2} \sum_{i\ne j} \rho'\big(\theta_2(H_{i,j}-S)\big) + \frac{\lambda_1}{2}\,\partial\|S\|_1. \tag{A.45}
\]
Given $\alpha\in(0,1)$, one can easily verify that $\rho'(\cdot)$ is Hölder continuous on $\mathbb R$, namely, $|\rho'(x)-\rho'(y)|\le 2|x-y|^\alpha$ for all $x,y\in\mathbb R$. The following theorem shows that $\rho'(\cdot)$ is Hölder continuous in the operator norm, which is crucial for the next part of the proof:

Theorem A.3 (Aleksandrov and Peller [1, Theorem 1.7.2]). Assume that $f(x)$ is Hölder continuous on $\mathbb R$ with $\alpha\in(0,1)$, i.e. $|f(x)-f(y)|\le C_0|x-y|^\alpha$ for all $x,y\in\mathbb R$. Then there exists an absolute constant $c$ such that
\[
\|f(A)-f(B)\| \le c(1-\alpha)^{-1}C_0\,\|A-B\|^\alpha
\]
for any symmetric matrices $A$ and $B$.

We now present the proofs of Lemma 2.4.1 and Theorem 2.4.1.

A.2.1 Proof of Lemma 2.4.1

It is worth noting that the proof follows the argument in Minsker and Wei [39, Section 5]. Recall the loss function and its gradient:
\[
L(S) = \frac{1}{\theta_2^2}\,\frac1N \sum_{i\ne j} \mathrm{tr}\,\rho\big(\theta_2(H_{i,j}-S)\big) + \frac{\lambda_1}{2}\,\|S\|_1,
\qquad
\nabla L(S) = -\frac{1}{N\theta_2} \sum_{i\ne j} \rho'\big(\theta_2(H_{i,j}-S)\big) + \frac{\lambda_1}{2}\,\partial\|S\|_1,
\]
where $H_{i,j} = \tilde Y_{i,j}\tilde Y_{i,j}^T$. Consider the choice $\lambda_1 > \frac{1}{2\theta_\sigma}$, where $\theta_\sigma := \frac1\sigma\sqrt{\frac{2t}{k_0}}$. We assume that the minimizer $\tilde S = V \ne 0$. Since $L(S)$ is convex, we have
\[
L(V) - L(0) \ge \langle \nabla L(0),\, V-0\rangle.
\]
Plugging in the explicit form of $\nabla L(0)$, we get that for any $W\in\partial\|S\|_1\big|_{S=0}$,
\[
L(V) - L(0) \ge \Big\langle -\frac{1}{N\theta_2}\sum_{i\ne j}\rho'(\theta_2 H_{i,j}) + \frac{\lambda_1}{2}W,\; V\Big\rangle,
\]
hence
\[
L(V) - L(0) \ge \sup_{W\in\partial\|S\|_1|_{S=0}} \Big\langle -\frac{1}{N\theta_2}\sum_{i\ne j}\rho'(\theta_2 H_{i,j}) + \frac{\lambda_1}{2}W,\; V\Big\rangle. \tag{A.46}
\]
Consider the random variable $\chi_{i,j} := \mathbb 1\big\{\|H_{i,j}-\Sigma\|\le \frac{1}{a\theta_\sigma}\big\}$, $a\ge 2$, and set $\theta_2 = \theta_\sigma$ in what follows. By Chebyshev's inequality,
\[
\mathbb P(\chi_{i,j}=0) \le a^2\theta_\sigma^2\,\mathrm{tr}\,\mathbb E(H_{i,j}-\Sigma)^2 \le a^2\,\frac{2t}{k_0}\,r_H,
\]
where $r_H = \mathrm{rk}\big(\mathbb E(H_{i,j}-\Sigma)^2\big)$. Define the event
\[
\mathcal E := \Big\{ \frac1N\sum_{i\ne j}(1-\chi_{i,j}) \le r_H\,\frac{2a^2t}{k_0}\Big(1+\sqrt{\tfrac{3}{2a^2 r_H}}\Big) \Big\}.
\]
By the finite difference inequality (see, for example, Minsker and Wei [39, Fact 4-6]),
\[
\mathbb P\Big( \frac1N\sum_{i\ne j}(1-\chi_{i,j}) \ge r_H\,\frac{2a^2t}{k_0}(1+\tau) \Big) \le e^{-\tau^2\,2a^2 t\,r_H/3}, \qquad 0<\tau<1.
\]
Setting $\tau = \sqrt{\frac{3}{2a^2 r_H}}$ we get $\mathbb P(\mathcal E)\ge 1-e^{-t}$. Therefore, for $k_0\ge 32a^2 t\,r_H$, we have that
\[
\frac1N\sum_{i\ne j}(1-\chi_{i,j}) \le \frac18
\]
with probability at least $1-e^{-t}$. Note that on the event $\{\chi_{i,j}=0\}$, $\|\rho'(\theta_\sigma H_{i,j})\|\le 1$ since $|\rho'(x)|\le 1$ for any $x\in\mathbb R$. On the other hand, on the event $\{\chi_{i,j}=1\}$, we have that $\|H_{i,j}-\Sigma\|\le \frac{1}{a\theta_\sigma}$, hence
\[
\|H_{i,j}\| \le \frac{1}{a\theta_\sigma} + \|\Sigma\| \le \frac{1}{a\theta_\sigma} + \frac{1}{b\theta_\sigma},
\]
given that $k_0 \ge 2b^2 t\,\|\Sigma\|^2\sigma^{-2}$.
Therefore, by Theorem A.3 we have that r 0 (q s H i; j ) 2c(1a) 1 q s H i; j a 2c(1a) 1 1 a + 1 b a : Setting a;b large enough such that 2c(1a) 1 1 a + 1 b a + 1 8 1 4 , we have that 1 q s N å i6= j r 0 (q s H i; j ) 1 q s N å i6= j r 0 (q s H i; j )X i; j + 1 q s N å i6= j r 0 (q s H i; j )(1X i; j ) 1 q s 2c(1a) 1 ( 1 a + 1 b ) a 1 N å i6= j X i; j + 1 8q s 1 4 1 q s : Therefore, L(V) L(0) sup W2¶kSk 1 j S=0 * 1 Nq 2 å i6= j r 0 (q 2 H i; j )+ l 1 2 W;V + 85 = * 1 Nq 2 å i6= j r 0 (q 2 H i; j );V + + l 1 2 kVk 1 1 4q s kVk 1 + l 1 2 kVk 1 > 0; where we used the fact that¶kSkj S=0 =fW :kWk 1g and sup W:kWk1 hW;Vi=kVk 1 . This is a contradiction to the fact that V is a minimizer of the loss function L(S), and hence we conclude that argmin S L(S)= 0 with probability at least 1 e t . A.2.2 Proof of Theorem 2.4.1 Recall the loss function L(S)= 1 q 2 2 1 N å i6= j r q 2 (H i; j S) + l 1 2 kSk 1 and its gradient ÑL(S)= 1 Nq 2 å i6= j r 0 q 2 (H i; j S) + l 1 2 ¶kSk 1 ; where H i; j = e Y i; j e Y T i; j ,r()=r 1 () andq 2 = 2 p Nl 2 . Consider the proximal gradient descent iteration: (1) S 0 :=E[H]=S. (2) For t= 1;2;:::, do: T t+1 := S t + 1 Nq 2 å i6= j r 0 q 2 (H i; j S t ) . S t+1 := argmin S f 1 2 S T t+1 2 F + l 1 2 kSk 1 g. We will show that with an appropriate choice of q 2 , S t+1 does not escape a small neighborhood of S with high probability, and the result will easily follow. First, the following lemma bounds S t+1 T t+1 : Lemma A.4. S t+1 T t+1 l 1 2 : 86 Proof. Repeating the reasoning for equation (A.85) in Appendix A.5.1, we can solve for S t+1 explicitly: S t+1 = argmin S 1 2 S T t+1 2 F + l 1 2 kSk 1 =g l 1 2 (T t+1 ); whereg l (u)= sign(u)(jujl) + is the function that shrinks eigenvalues to 0. A direct calculation gives that S t+1 T t+1 = S t+1 T t+1 = g l 1 2 (T t+1 ) T t+1 l 1 2 : Applying Lemma A.4 , we see that S t+1 S S t+1 T t+1 + T t+1 S l 1 2 + T t+1 S : It remains to bound T t+1 S . Note that T t+1 S = S t S+ 1 Nq 2 å i6= j r 0 q 2 (H i; j S t ) 1 Nq 2 å i6= j r 0 q 2 (H i; j S t ) r 0 (q 2 (H i; j S)) + S t S | {z } :=I + 1 Nq 2 å i6= j r 0 q 2 (H i; j S) | {z } :=II : (A.47) We will bound terms I and II separately. Set k 0 =bn=2c and define Y i; j (S;q) :=r 0 (q(H i; j S)); W i 1 ;:::;i n (S;q) := 1 k 0 h Y i 1 ;i 2 (S;q)++Y i 2k 0 1 ;i 2k 0 (S;q) i ; where(i 1 ;:::;i n )2p n is a permutation. Fact 6 in Minsker and Wei [39] implies that I= 1 q 2 n! å p n W i 1 ;:::;i n (S t ;q 2 )W i 1 ;:::;i n (S;q 2 ) + S t S ; (A.48) 87 II= 1 q 2 n! å p n W i 1 ;:::;i n (S;q 2 ) : (A.49) For a given s 2 E (H i; j S) 2 and q s := 1 s q 2t k 0 , the following lemma provides a bound for the term II: Lemma A.5. Recall that r H = rk(E (H i; j S) 2 ). Given t 1, we have that 1 q 2 n! å p n W i 1 ;:::;i n (S;q 2 ) q 2 s 2 + t q 2 k 0 with probability at least 1 8 3 r H e t . Whenq 2 =q s , the upper bound takes the form 3 p 2 s q t k 0 . Proof. It is easy to verify that for any x2R, log(1 x+ x 2 )r 0 (x) log(1+ x+ x 2 ); and the rest of the proof follows from the argument in Minsker and Wei [39, Section 5.5]. To estimate the term I, consider the random variable L n (d) := sup kSSkd 1 q s n! å p n (W i 1 ;:::;i n (S;q s )W i 1 ;:::;i n (S;q s ))+ SS : Lemma A.6. Givena2(0;1), we have that for alld 1 2 1 q s , L n (d) r H 13t k 0 d 1+ 4c(1a) 1 with probability at least 1 e t , where c> 0 is an absolute constant specified in Theorem A.3. 88 Proof. Define X i; j := 1 H i; j S 1 2q s and consider the event E := å i6= j (1X i; j ) N 8t k 0 r H (1+ q 3 8r H ) . 
Minsker and Wei [39] proves that P(E) 1 e t . For S withkSSk d 1 2q s , we have that 1 q s n! å p n (W i 1 ;:::;i n (S;q s )W i 1 ;:::;i n (S;q s ))+ SS = 1 Nq s å i6= j r 0 q s (H i; j S) r 0 (q s (H i; j S)) + SS = 1 Nq s å i6= j r 0 q s (H i; j S) r 0 (q s (H i; j S)) X i; j + SS + 1 Nq s å i6= j r 0 q s (H i; j S) r 0 (q s (H i; j S)) (1X i; j ) : We will separately control the two terms on the right-hand side of the equality above. First, note that whenX i; j = 1, we have that H i; j S 1 2q s 1 q s , and H i; j S H i; j S +kS Sk 1 q s . Therefore, on the eventE , 1 Nq s å i6= j r 0 q s (H i; j S) r 0 (q s (H i; j S)) X i; j + SS = 1 Nq s å i6= j q s (H i; j S)q s (H i; j S) X i; j + SS = 1 N å i6= j (SS)(1X i; j ) r H 8t k 0 (1+ r 3 8r H )d r H 13t k 0 d: (A.50) Next, recall that for anya2(0;1),jr 0 (x)r 0 (y)j 2jxyj a for any x;y2R, so by Theorem A.3, there exists a constant c> 0 such that r 0 (A)r 0 (B) 2c(1a) 1 kA Bk a for any symmetric matrices A and B. Therefore, on the eventE , 89 1 Nq s å i6= j r 0 q s (H i; j S) r 0 (q s (H i; j S)) (1X i; j ) 2c(1a) 1 kq s (S S)k a 1 Nq s å i6= j (1X i; j ) 2c(1a) 1 ( 1 2 ) a1 d r H 13t k 0 4c(1a) 1 r H 13t k 0 d: (A.51) Combining (A.50), (A.51) and P(E) 1 e t , we have that L n (d) r H 13t k 0 d 1+ 4c(1a) 1 with probability at least 1 e t . Now we can bound S t+1 S as follows: For t= 0;1;:::, define d 0 = 0; d t+1 = r H 13t k 0 1+ 4c(1a) 1 d t + 5:75s r t k 0 + l 1 2 : Choose t;k such that r H 13t k 0 1+ 4c(1a) 1 1 20 and t k 0 520 , we have that 5:75s q t k 0 1 40q s , hence d t+1 1 20 d t + 1 40q s + l 1 2 1 2q s given thatd t 1 2q s andl 1 1 2q s . Since S 0 S = 0 1 2q s , we have that for t= 0;1;:::, S t+1 S l 1 2 +L n (d t )+ 3 p 2 s r t k 0 l 1 2 +r H 13t k 0 1+ 4c(1a) 1 d t + 3 p 2 s t k 0 d t+1 with probability at least 1( 8 3 r H +1)e t . Finally, forg := r H 13t k 0 1+ 4c(1a) 1 1 40 , it is easy to check that for t= 0;1;:::, 90 d t+1 =g t+1 d 0 + t å l=0 g t ( l 1 2 + 3 p 2 s r t k 0 ) å l0 1 40 l ( l 1 2 + 3 p 2 s r t k 0 )= 20 39 l 1 + 20 p 2 13 s r t k 0 : By Theorem 2.5.1, S t ! b S pointwise as t!¥, so the result follows. To this end, we note that the proof above can be repeated with q 2 3t, which requires the order of t to be at most log(n). A.3.2 Proof of Theorem 2.4.2 In this section we present the proof of Theorem 2.4.2, which gives an improved error bound in the Frobenius norm for heavy-tailed data. The main idea making the improvement possible is the fact that for heavy-tailed data, the “outliers” e V i; j are nonzero if and only if the “well-behaved” term e Z i; j equals zero. We will repeat parts of the proof of Theorem 2.3.1 using this fact along with the inequality of Theorem 2.4.1 instead of the inequalitykAkkAk F to derive an improved bound. 93 We start with some notations, which are specific to this proof. Let N = n(n 1), and c(K) be a constant depending on K only, which can vary from step to step. Consider the events E 1 = n e c(K) rk(S Y )log n rk(S Y ) n o ; E 2 = ( b SS Y c(K)kS Y k r rk(S Y )log(n) 3 n ) ; and E = ( l 1 140kS Z k p n(n 1) p rk(S Z )+ 4 1 n(n 1) å i6= j e Z i; j e Z T i; j S Z ; l 2 140kS Z k n(n 1) p rk(S Z )+ 4 1 p n(n 1) max i6= j e Z i; j e Z T i; j S Z ) : We will need to condition on these three events throughout the proof, so we will first estimate their probabilities. (i) In the view of Lemma 2.4.2, we have that P(E 1 ) 1 1 n . (ii) ForE , we need to choosel 1 andl 2 appropriately in order to guarantee thatE happens with high probability. 
Since e Z i; j 2 R almost surely, we can invoke the following version of matrix Bernstein inequality, which is a corollary of Theorem 3.1 from Minsker [38], Theorem A.4. Let Z 1 ;:::;Z n be i.i.d. random vectors withE[Z 1 ]= 0 andkZ 1 k 2 R almost surely. DenoteS Z =E Z 1 Z T 1 and B=E (Z 1 Z T 1 ) 2 , then for t 9nkBk 16R 4 , 1 n å i Z i Z T i S Z C r kBk(log(rk(B))+t) n _ R 2 (log(rk(B))+t) n ! with probability at least 1 e t . Following the same argument as in Section A.1.3, we can derive the following corollary for the transformed data: 94 Corollary A.3. Let e Z i; j be defined as in (2.4.3), namely, b Z=1 n e Y i; j 2 R o . Then 1 N å i6= j e Z i; j e Z T i; j S Z C r kBk(log(rk(B))+t) n _ R 2 (log(rk(B))+t) n ! (A.55) with probability at least 1 2e t , where B=E h ( e Z i; j e Z T i; j ) 2 i . It remains to estimatekBk and rk(B). Mendelson and Zhivotovskiy [37] showed that if Y satisfies an L 4 L 2 norm equivalence with constant K, then ckS Z ktr(S Z )kBk c(K)kS Y ktr(S Y ) and tr(B) c(K)tr(S Y ) 2 : Combining these two bounds, we have that rk(B)= tr(B) kBk c(K) tr(S Y ) 2 kS Z ktr(S Z ) : (A.56) On the other hand, we have the following lemma which guarantees thatS Z is close toS Y . Lemma A.7. Let Y2R d be a mean zero random vector satisfying the L 4 L 2 norm equiva- lence with constant K. Then kS Z S Y k c(K) kS Y ktr(S Y ) R 2 = c(K) kS Y k 2 rk(S Y ) R 2 ; (A.57) jtr(S Z ) tr(S Y )j c(K) tr 2 (S Y ) R 2 = c(K) kS Y k 2 rk(S Y ) 2 R 2 ; (A.58) kS Z S Y k F c(K) kS Y k 1 2 tr(S Y ) 3 2 R 2 = c(K) kS Y k 2 rk(S Y ) 3 2 R 2 ; (A.59) 95 whereS Y =E YY T , Z= Y1fkYk 2 Rg,S Z =E ZZ T , and c(K) is a constant depending only on K. The proof of Lemma A.7 is presented in Appendix A.5.2. In particular, it implies that bothkS Z k and tr(S Z ) are equivalent up to a multiplicative constant factor tokS Y k and tr(S Y ) respectively, as long as R c(K) p tr(S Y ). The condition is valid given that n c(K)rk(S Y ) h log rk(S Y ) + log(n) i , and hence by (A.56), rk(B) c(K)rk(S Y ): Combining the bounds onkBk and rk(B) with Corollary A.3, the choice of R as R= tr(S Y )kS Y kn log n rk(S Y ) ! 1 4 ; and the choice of t= log(n), we deduce that 1 n(n 1) å i6= j e Z i; j e Z T i; j S Z c(K)kS Y k s rk(S Y ) log n rk(S Y )) n (A.60) with probability at least 1 2 n . Similarly, applying Theorem A.4 to each single point e Z i; j and proceeding in a similar way in Section A.1.4, we deduce that max i6= j e Z i; j e Z T i; j S Z c(K)kS Y k q n rk(S Y ) log n rk(S Y ) (A.61) with probability at least 1 1 n . It follows that with the choices of l 1 c(K)kS Y k s rk(S Y ) log n rk(S Y )) n (A.62) and l 2 c(K)kS Y k s rk(S Y ) log n rk(S Y ) n ; (A.63) 96 we have that P(E) 1 3 n . (iii) To estimate the probability of the eventE 2 , we first state a modified version of Theorem 2.4.1. Remark A.4. Following the same argument as in the proof of Theorem 2.4.1, we can show that for A 1, with the choice ofl 1 c(K)kS Y k p nlog(n)rk(S Y ),l 2 = c(K)kS Y k q rk(S Y )log(n) An and under the assumptions that r H log(n) n 1+ 2c(1a) 1 1 1280 ; the following inequality holds with probability at least 1( 8 3 r H + 1) 1 n A : b S l S Y 20 39 l 1 + 40 13 c(K)kS Y k r rk(S Y )Alog(n) 3 n : (A.64) Applying Remark A.4 withl 1 = c(K)kS Y k r rk(S Y ) h log nrk(S Y )) i n ,l 2 = c(K)kS Y k q rk(S Y )log(n) An and the assumption n c(K)rk(S Y ) log n rk(S Y ) , we get that b SS Y c(K)kS Y k r rk(S Y )Alog(n) 3 n (A.65) with probability at least 1 ( 8 3 r H +1) n A . This confirms that P(E 2 ) 1 ( 8 3 r H +1) n A . 
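For concreteness, the tuning choices appearing above can be written as short formulas in code. The sketch below only illustrates the scalings of the truncation level R and the penalties λ1, λ2 from (A.62)–(A.63) and the display above; the constant c_K (standing in for the unspecified c(K)), the exponent A, and the use of a crude pilot estimate in place of the true ‖Σ_Y‖ and tr(Σ_Y) are all assumptions made for the sake of illustration, not part of the proof.

```python
import numpy as np

def tuning_parameters(Y, c_K=1.0, A=1.0):
    """Illustrative scalings of R, lambda_1, lambda_2 for the heavy-tailed covariance estimator.

    Y: (n, d) data matrix.  c_K and A are assumed constants; the true ||Sigma_Y||
    and tr(Sigma_Y) are replaced by plug-ins from a crude pilot covariance estimate.
    """
    n, d = Y.shape
    Yc = Y - Y.mean(axis=0)
    S_pilot = Yc.T @ Yc / n                      # crude pilot estimate of Sigma_Y
    op_norm = np.linalg.norm(S_pilot, 2)         # stand-in for ||Sigma_Y||
    trace = np.trace(S_pilot)                    # stand-in for tr(Sigma_Y)
    eff_rank = trace / op_norm                   # rk(Sigma_Y)
    log_term = np.log(n / eff_rank)
    R = (trace * op_norm * n / log_term) ** 0.25                    # truncation level
    lam1 = c_K * op_norm * np.sqrt(eff_rank * log_term / n)         # scaling of (A.62)
    lam2 = c_K * op_norm * np.sqrt(eff_rank * np.log(n) / (A * n))  # scaling as in Remark A.4
    return R, lam1, lam2
```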
Note that the choices ofl 1 andl 2 coincide with (A.62) and (A.63). For what follows, we will condition on the eventsE ,E 1 andE 2 . Repeating parts of the argument in Section A.1.2, we can arrive at the inequality S Z b S l 2 F + S b S l 2 F kS Z Sk 2 F + 1 8 b S l S 2 F + 2l 2 1 rank(S)( p 2+ 1) 2 + 2 p N å i6= j D e U i; j b U i; j ; b S l S E : (A.66) 97 By Lemma A.7 and choosing R as R= tr(S Y )kS Y kn log n rk(S Y ) ! 1 4 ; (A.67) we have that kS Z S Y k F c(K)kS Y k rk(S Y ) q log rk(S Y ) + log(n) p n : Therefore, we can deduce from (A.66) that S Y b S l 2 F + S b S l 2 F kS Z Sk 2 F + 1 8 b S l S 2 F + 2l 2 1 rank(S)( p 2+ 1) 2 + 2 p N å i6= j D e U i; j b U i; j ; b S l S E + c(L)kS Y k 2 rk(S Y ) 2 log n rk(S Y ) n : (A.68) It remains to bound the expression 2 p N å i6= j D e U i; j b U i; j ; b S l S E : First, note that å (i; j)= 2 e J b U i; j e U i; j 1 = å (i; j)= 2 e J P L ? i; j b U i; j P L ? i; j 1 and that å (i; j)2 e J b U i; j e U i; j 1 å (i; j)2 e J P L i; j ( b U i; j e U i; j ) 1 + å (i; j)2 e J P L ? i; j b U i; j P L ? i; j 1 : By Lemma A.2, we have that å i6= j b U i; j e U i; j 1 å (i; j)2 e J P L i; j ( b U i; j e U i; j ) 1 + å i6= j P L ? i; j b U i; j P L ? i; j 1 å (i; j)2 e J P L i; j ( b U i; j e U i; j ) 1 + 3 0 @ l 1 l 2 P L(k) ( b SS(k)) 1 + å (i; j)2 e J P L i; j ( b U i; j e U i; j ) 1 1 A 98 4 å (i; j)2 e J P L i; j ( b U i; j e U i; j ) 1 + 3 l 1 l 2 P L(k) ( b SS(k)) 1 Repeating the argument behind (A.29), we have that 2 p N å i6= j D e U i; j b U i; j ; b S S E 2 b S S p N å i6= j e U i; j b U i; j 1 2 b S S p N 0 @ 4 å (i; j)2 e J P L i; j ( b U i; j e U i; j ) 1 + 3 l 1 l 2 P L(k) ( b SS(k)) 1 1 A 2 b S S p N 0 @ 3 p 2k l 1 l 2 b SS Z F + 8 å (i; j)2 e J b U i; j e U i; j F 1 A 6 b S S l 1 l 2 r 2k N b SS Z F + 16 b S S s j e Jj N v u u t å (i; j)2 b J b U i; j b U i; j 2 F : (A.69) We will estimate the two terms on the right-hand side of the above inequality one by one. Note that we did not apply the crude bound b S S b S S F since b S S is strictly smaller for the heavy tailed data due to the independence of the “outliers”. By triangle inequality, Lemma A.7 and the choice of R= tr(S)kSkn log nrk(S) 1 4 ; we have that on the eventE 2 , b S l S b S l S Y +kS Y S Z k+kS Z Sk c(L) kS Y k r rk(S Y )Alog(n) 3 n + kS Y ktr(S Y ) R 2 ! +kS Z Sk c(L)kS Y k r rk(S Y )Alog(n) 3 n | {z } :=I +kS Z Sk: (A.70) Note that the term I is of the order q rk(S Y ) n , up to the logarithmic factors. For what follows, we set S=S Y , and (A.70) implies that b S l S = b S l S Y I: To estimate r å e J b U i; j e U i; j 2 F , we can apply inequality (A.18) which entails that 99 s å e J b U i; j e U i; j 2 F 2 p 2 kS Z S Y k F + p rank(S)l 1 ( p 2+ 1)+l 2 (4=3+ p 2) p 2 q j e Jj + s 2(6 p 2+ 6) l 1 l 2 rank(S Y ) N 1 4 b S l S Y F ! ; given that k=b Nl 2 2 1200l 2 1 c andj e Jj N 6400 . For simplicity, we denote B= q 2(6 p 2+ 6) l 1 l 2 rank(S Y ) N 1 4 . Now we will estimate the two terms in (A.69): First, 6 b S l S Y l 1 l 2 r 2k N b S l S Z F 6 I l 1 l 2 r 2k N b S l S Z F : (A.71) This term is independent of the outliers, and a direct application of the inequality 2ab a 2 + b 2 gives that 6 b S l S Y l 1 l 2 r 2k N b S l S Z F 3 l 1 l 2 r 2k N I 2 + b S l S Z 2 F : (A.72) Second, 16 b S l S Y s j e Jj N v u u t å (i; j)2 e J b U i; j e U i; j 2 F 16 b S l S Y s j e Jj N 2 p 2 kS Z S Y k F + p rank(S Y )l 1 ( p 2+ 1)+l 2 (4=3+ p 2) p 2 q j e Jj + B b S l S Y F ! 
16 p 2 s j e Jj N S Y b S l 2 +kS Z S Y k 2 F + 8( p 2+ 1) 2 j e Jj N S Y b S l 2 + 64l 2 1 rank(S Y ) + 32(4=3+ p 2) Il 2 s j e Jj 2 N + 16 p 2 p n I 2 B 2 + j e Jj N 1 p n b S l S Y 2 F ! : (A.73) 100 Combining (A.68, A.69, A.72,A.73), and assuming that 6 l 1 l 2 q 2k N +96 p 2 q j e Jj N d 3 8 and B 2 d p A 16 p 2n , we deduce that (1d) S Y b S l 2 F c(d)76l 2 1 rank(S Y )+d I 2 + 32(4=3+ p 2) Il 2 en + c(L;d)kS Y k 2 rk(S Y ) 2 log n rk(S Y ) n ; (A.74) wheree = j e Jj N is the proportion of outliers. Finally, we recall that the choices ofl 1 andl 2 are l 1 = c(K)kS Y k s rk(S Y ) log n rk(S Y )) n (A.75) and l 2 = c(K)kS Y k r rk(S Y )log(n) An : (A.76) Also, recall the definition of I in (A.70): I= c(K)kS Y k r rk(S Y )Alog(n) 3 n : (A.77) Combining the equations (A.75, A.75, A.77) with (A.74), we derive that S Y b S l 2 F c(K;d)kS Y k 2 rk(S Y )log n rk(S Y ) n rank(S Y )+ c(K;d)kS Y k 2 rk(S Y )Alog(n) 3 n + c(K;d)e nkS Y k 2 rk(S Y )log(n) 2 n + c(K;d)kS Y k 2 rk(S Y ) 2 log n rk(S Y ) n c(K;d)kS Y k 2 rk(S Y )log n rk(S Y ) n rank(S Y )+ c(K;d)kS Y k 2 rk(S Y ) 2 Alog(n) 3 n (A.78) under the assumptions that B 2 d p A 16 p 2n and n c(K)rk(S Y ) log n rk(S Y ) , where the last step in (A.78) follows from Lemma 2.4.2. Note that the assumption B 2 d p A 16 p 2n is valid as long as rank(S Y ) c 1 d 2 n Al 2 2 l 2 1 for any constant c 1 1 4(6 p 2+6) 2 . Finally, by the union bound over the 101 eventsE ,E 1 andE 2 , inequality (A.78) will hold with probability at least 1 ( 8 3 r H +1) n A 4 n . To this end, note that the condition rank(S Y ) c 1 d 2 n Al 2 2 l 2 1 is equivalent to rank(S Y ) c(K) n log(n) log n rk(S Y ) whenl 1 ,l 2 are chosen as (A.75) and (A.76) respectively. The upper bound on rank(S Y ) is in the order of n up to logarithmic factors. A.4 Proofs ommitted in Section 2.5 A.4.1 Convergence analysis of the proximal gradient method (Theorem 2.5.1) In this section we present the convergence analysis of the proximal gradient method (with matrix variables). It is worth noting that our analysis follows the argument in Beck [2, Chapter 10]. Recall that our loss function can be written in the form L(S)= g(S)+h(S), where h is convex, and g is the average of N functions g i; j (S)= tr r p Nl 2 2 ( e Y i; j e Y T i; j S) . Note thatÑg i; j (S)=r 0 p Nl 2 2 ( e Y i; j e Y T i; j S), and by Bhatia [6, Lemma VII.5.5] we have thatÑg i; j (S) is Lipschitz in Frobenius norm with L= 1, i.e. Ñg i; j (U)Ñg i; j (V) F LkUVk F ; hence g(S) is also Lipschitz in Frobenius norm with L= 1. We have the following matrix form of the descent lemma: Lemma A.8. Assume that g(S) is Lipschitz in Frobenius norm with constant L> 0. Then g(S 2 ) g(S 1 )+hÑg(S 1 );S 2 S 1 i+ L 2 kS 2 S 1 k 2 F : Proof. First, denote U t = S 1 +t(S 2 S 1 ), we have that g(S 2 )= g(S 1 )+ Z 1 0 hÑg(U t );S 2 S 1 idt; 102 hence g(S 2 ) g(S 1 )hÑg(S 1 );S 2 S 1 i = Z 1 0 hÑg(U t )Ñg(S 1 );S 2 S 1 idt Z 1 0 kÑg(U t )Ñg(S 1 )k F kS 2 S 1 k F dt L 2 kS 2 S 1 k 2 F : Now recall that the proximal gradient descent algorithm update is S t+1 = prox a t ;h (S t a t Ñg(S t )): Set G a (S)= 1 a S prox a;h (SaÑg(S)) , then S t+1 = S t a t G a t (S t ). The following lemma guarantees that the PGD makes progress at each step. Lemma A.9. Assume that 0a t L for all t= 1;2;:::, then for any symmetric matrix U, L(S t+1 ) L(U)+ G a t (S t );S t U a t 2 G a t (S t ) 2 F : Proof. 
Since g() is convex, we have that for any symmetric matrix U, g(U) g(S t )+ Ñg(S t );U S t : Combining this with Lemma A.8, we have that g(S t+1 ) g(S t )+ Ñg(S t );S t+1 S t + a t 2 G a t (S t ) 2 F g(U) Ñg(S t );U S t + Ñg(S t );S t+1 S t + a t 2 G a t (S t ) 2 F = g(U)+ Ñg(S t );S t+1 U + a t 2 G a t (S t ) 2 F : 103 Since h() is convex, for any V2¶h(S t+1 ), h(U) h(S t+1 )+ V;U S t+1 : Recall that S t+1 = argmin S n h(S)+ 1 2a t S(S t a t Ñg(S t )) 2 F o : By the optimality conditions, 02¶h(S t+1 )+ 1 a t (S t+1 S t +a t Ñg(S t )): Therefore, G a t (S t )Ñg(S t )2¶h(S t+1 ); and L(S t+1 ) g(U)+ h(U)+ Ñg(S t );S t+1 U + a t 2 G a t (S t ) 2 F + G a t (S t )Ñg(S t );S t+1 U = L(U)+ G a t (S t );S t+1 U + a t 2 G a t (S t ) 2 F L(U)+ G a t (S t );S t U a t 2 G a t (S t ) 2 F ; where in the last step we used the fact that S t+1 = S t a t G a t (S t ). Taking U = S t in Lemma A.9, we have that L(S t+1 ) L(S t ) a t 2 G a t (S t ) 2 F ; i.e. the PGD method is making progress at each iteration. Taking U = S in Lemma A.9, where S is the true minimizer of L(S), we have that 104 L(S t+1 ) L(S ) G a t (S t );S t S a t 2 G a t (S t ) 2 F = 1 2a t 2a t G a t (S t );S t S a t G a t (S t ) 2 F = 1 2a t S t S 2 F a t G a t S t + S 2 F = 1 2a t S t S 2 F S t+1 S 2 F : Assuming that the step size is fixed (i.e. a t =a) or diminishing (i.e. a t a T+1 =a), summing up both sides of the above inequality for t = 0;1;:::;T , and recalling that L(S t+1 ) L(S t ), we have that (T+ 1)(L(S t+1 L(S )) 1 2a S 0 S 2 F S t+1 S 2 F ; hence L(S t+1 ) L S 0 S 2 F 2a(T+ 1) ; as desired. Note that the convergence rate can be improved toO( 1 T 2 ), see Nesterov [42, 41], Tseng [52] for details. A.4.2 Numerical method of updating eigenvalues (solving equation (2.5.4)) In this section, we present the numerical method introduced by Bunch et al. [8] which computes the roots ofw i (m)= 0 for i= 1;:::;k d, wherew i (m) is defined as w i (m)= 1+ k å j=1 z 2 j d j m and d j =(d j d i )=r. Recall that the eigenvalues of C = D+rzz T , denoted as e d 1 ;:::; e d k , and the eigenvalues of D, denoted as d 1 ;:::;d k , satisfy the identity e d i = d i +rm i with w i (m i ) = 0. Therefore, it remains to solve equationsw i (m)= 0, i= 1;:::;k. Fix i2f1;:::;kg, and define y i (t)= i å j=1 z 2 j d j t ; i= 1;:::;k; 105 and f i (t)= 8 > < > : 0; i= k; å k j=i+1 z 2 j d j t ; 1 i< k: It is clear that w i (t)= 1+y i (t)+f i (t). Without loss of generality, we shall assume that r > 0; otherwise, we can replace d i byd ki+1 andr byr. Also, we assume that k> 1; otherwise, we have the trivial casem 1 =z 2 1 . We will deal with the case i< k and i= k separately. (1) Assume that i2f1;:::;k 1g is fixed. We are seeking m i such that 0< m i < minf1 å i1 j=1 ;d i+1 g (by Theorem 2.5.2) and y i (m i )=f i (m i )+ 1: Assume that we have an approximation t 1 to the root m i with 0< t 1 <m i , and we want to get an updated approximation t 2 . As suggested by Bunch et al. [8], we shall consider the local approximation to the rational functionsf i andy i at t 1 , namely, p 1 q 1 t 1 =y i (t 1 ); p 1 (q 1 t 1 ) 2 =y 0 i (t 1 ); (A.79) r 1 + s 1 dt 1 =f i (t 1 ); s 1 (dt 1 ) 2 =f 0 i (t 1 ): (A.80) whered =d i+1 . 
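The remainder of this subsection derives a quadratically convergent rational-interpolation update for the roots μ_i; see (A.81)–(A.84) and Theorem A.5 below. For orientation, the sketch that follows solves the same secular equations w_i(μ) = 0 with a generic bracketed root finder from SciPy instead of the specialized iteration, using the interlacing bounds 0 < μ_i < δ_{i+1} (for i < k) and 0 < μ_k ≤ ‖z‖² quoted above; deflation of repeated eigenvalues and of zero components of z is omitted, so this is an illustration rather than the algorithm analyzed here.

```python
import numpy as np
from scipy.optimize import brentq

def rank_one_eigenvalues(d, z, rho, eps=1e-12):
    """Eigenvalues of diag(d) + rho * z z^T obtained from the secular equations w_i(mu) = 0.

    Assumes rho > 0, strictly increasing d, and no zero components in z."""
    k = len(d)
    eigs = np.empty(k)
    for i in range(k):
        delta = (d - d[i]) / rho                            # delta_j = (d_j - d_i) / rho
        w = lambda mu: 1.0 + np.sum(z ** 2 / (delta - mu))  # secular function w_i(mu)
        if i < k - 1:
            lo, hi = eps * delta[i + 1], (1.0 - eps) * delta[i + 1]  # 0 < mu_i < delta_{i+1}
        else:
            lo, hi = eps, float(np.dot(z, z))                        # 0 < mu_k <= ||z||^2
        eigs[i] = d[i] + rho * brentq(w, lo, hi)            # eigenvalue = d_i + rho * mu_i
    return eigs

# sanity check against a dense eigensolver
rng = np.random.default_rng(1)
d = np.array([0.5, 1.0, 2.0, 3.5])
z = rng.standard_normal(4); z /= np.linalg.norm(z)
rho = 0.7
print(np.allclose(rank_one_eigenvalues(d, z, rho),
                  np.linalg.eigvalsh(np.diag(d) + rho * np.outer(z, z))))  # expected: True
```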
It can be easily verified that p 1 ;q 1 ;r 1 ;s 1 satisfies p 1 =y i (t 1 ) 2 =y 0 i (t 1 ); q 1 = t 1 +y i (t 1 )=y 0 i (t 1 ); (A.81) r 1 =f i (t 1 )(dt 1 )f 0 i (t 1 ); s=(dt 1 ) 2 f 0 i (t 1 ): (A.82) The updated approximation t 2 is then obtained by solving the following equation: p 1 q 1 t 2 = 1+ r 1 + s 1 dt 2 : (A.83) 106 Direct computation shows that t 2 = t 1 + 2b=(a+ p a 2 4b); where a= (dt 1 )(1+f i (t 1 ))+y i (t 1 ) 2 =y 0 i (t 1 ) c +y i (t 1 )=y 0 i (t 1 ); b= (dt 1 )wy i (t 1 ) y 0 i (t 1 )c ; c= 1+f i (t 1 )(dt 1 )f 0 i (t 1 ); w= 1+f i (t 1 )+y i (t 1 ): The following theorem shows that the update (A.83) is guaranteed to converge tom i : Theorem A.5 (Bunch et al. [8]). Let t 0 2 (0;m i ) and t j+1 be the solution of p j q j t = 1+ r j + s j dt , j 0, where p j ;q j ;r j ;s j are defined by (A.81). Then we have that t j < t j+1 < m i and lim j!¥ = m i . Moreover, the rate of convergence is quadratic, meaning that for any j sufficiently large,jt j+1 m i j Cjt j m i j 2 , where C is an absolute constant independent of iteration. It remains to determine an initial guess t 0 such that t 0 2(0;m i ). Recall thatw i (m i )= 0, which is equivalent to 1+ k å j=1; j6=i;i+1 z 2 j d j m i + z 2 i+1 d i+1 m i = z 2 i m i : Sincem i <d i+1 , we can define t 0 to be the positive solution of the equation 1+ k å j=1; j6=i;i+1 z 2 j d j d i+1 + z 2 i+1 d i+1 t 0 = z 2 i t 0 : By monotonicity, we have that t 0 2(0;m i ), as desired. 107 (2) Now we assume that i= k. In this case,f k (t)= 0 and we want to solve the equationy k (t)= 1. Theorem A.5 is still valid, and the update (A.83) can be simplified as t j+1 = t j + 1+y k (t j ) y 0 k (t j ) y k (t j ): (A.84) To choose t 0 2(0;m k ), we again recall thatw k (m k )= 0, which is equivalent to 1 z 2 k m k + k1 å j=1 z 2 j d j m k = 0: Sincem k < 1, we define t 0 to be the solution of 1 z 2 k t 0 + k1 å j=1 z 2 j d j 1 = 0: By monotonicity, we have that t 0 <m k . Moreover, note thatå k1 j=1 z 2 j kzk 2 = 1 and d j < 0, 8 j= 1;:::;k 1, so 1+å k1 j=1 z 2 j d j 1 > 0. Therefore, t 0 2(0;m k ), as desired. A.5 Auxiliary technical results A.5.1 Detailed derivation of the claim of Remark 2.3.1 In this section we present the detailed derivation of Remark 2.3.1. First, consider the function as follows F(S;U 1 ;;U n ) := 1 2 n å i=1 Y i Y T i SU i 2 F +l 1 kSk 1 +l 2 n å i=1 kU i k 1 : For a fixed matrix S, the matrix Y i Y T i S has a spectral decomposition Y i Y T i S= d å j=1 l (i) j v (i) j v (i)T j ; for all i= 1;:::;n; 108 where l (i) j is the j-th eigenvalue of Y i Y T i S and v (i) j is the corresponding eigenvector. We claim that F(S;) can be minimized by choosing e U i = d å j=1 sign(l (i) j ) jl (i) j jl 2 + v (i) j v (i)T j =g l 2 (Y i Y T i S); (A.85) where g l (u) := sign(u)(jujl) + ;8u2R;l2R + , and (x) + := max(x;0). Indeed, note that (U 1 ;:::;U n )7! G S (U 1 ;:::;U n ) := F(S;U 1 ;:::;U n ) is strictly convex, so a sufficient and necessary condition for( e U 1 ;:::; e U n ) to be a point of minimum is 02¶G S ( e U 1 ;:::; e U n )= Y 1 Y T 1 SU 1 +l 2 e V 1 ;:::; Y n Y T n SU n +l 2 e V n ! ; where e V i 2¶ e U i ;i= 1;:::;n. By choosing e V i :=å j:jl (i) j j>l 2 sign(l (i) j )v (i) j v (i)T j +å j:jlj (i) j l 2 l (i) j l 2 v (i) j v (i)T j 2 ¶ e U i 1 , it is easy to verify that ¶G S ( e U 1 ;:::; e U n )= 0, hence( e U 1 ;:::; e U n ) is the minimizer. 
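The spectral shrinkage map γ_λ appearing in (A.85) is straightforward to implement; the sketch below is an illustration only, with hypothetical inputs, and applies the scalar soft-thresholding sign(ℓ)(|ℓ| − λ)_+ to the eigenvalues of a symmetric matrix while keeping its eigenvectors.

```python
import numpy as np

def spectral_soft_threshold(A, lam):
    """gamma_lam(A): soft-threshold the eigenvalues of the symmetric matrix A."""
    eigvals, eigvecs = np.linalg.eigh(A)
    shrunk = np.sign(eigvals) * np.maximum(np.abs(eigvals) - lam, 0.0)
    return (eigvecs * shrunk) @ eigvecs.T   # V diag(shrunk) V^T

# example: the minimizer U_i = gamma_{lam2}(Y_i Y_i^T - S) for a single observation
d = 5
Y_i = np.random.default_rng(2).standard_normal(d)
S = np.eye(d)
U_i = spectral_soft_threshold(np.outer(Y_i, Y_i) - S, lam=0.5)
```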
Plug- ging in to F(S;U 1 ;:::;U n ), we get that F(S; e U 1 ;; e U n )= 1 2 n å i=1 Y i Y T i S e U i 2 F +l 1 kSk 1 +l 2 n å i=1 e U i 1 = 1 2 n å i=1 d å j=1 h l (i) j g l 2 (l (i) j ) i v (i) j v (i)T j 2 F +l 2 n å i=1 d å j=1 g l 2 (l (i) j )+l 1 kSk 1 = n å i=1 0 B @ å j:jl (i) j j>l 2 (l 2 jl (i) j j l 2 2 2 )+ å j:jl (i) j jl 2 l (i)2 j 2 1 C A +l 1 kSk 1 = tr n å i=1 r l 2 (Y i Y T i S) ! +l 1 kSk 1 ; (A.86) where r l (u)= 8 > < > : u 2 2 ; jujl ljuj l 2 2 ; juj>l 109 is the Huber’s loss function. Note that our loss function L(S;U U U I 2 n ) can be expressed as L(S;U U U I 2 n )= 1 N å i6= j e Y i; j e Y T i; j S p NU i; j 2 F +l 1 kSk 1 +l 2å i6= j U i; j 1 = 2 N " 1 2 å i6= j e Y i; j e Y T i; j S p NU i; j 2 F + p Nl 2 2 å i6= j p NU i; j 1 # +l 1 kSk 1 : Therefore, (A.86) implies that min S;U U U I 2 n L(S;U U U I 2 n )= min S min U U U I 2 n L(S;U U U I 2 n )= min S L(S; e U U U I 2 n ) = min S ( 2 N tr å i6= j r p Nl 2 2 ( e Y i; j e Y T i; j S) ! +l 1 kSk 1 ) ; where e U i; j = 1 p N d å k=1 sign(l (i; j) k ) jl (i; j) k j p Nl 2 2 + v (i; j) k v (i; j)T k =g p Nl 2 2 ( e Y i; j e Y T i; j S) with e Y i; j e Y T i; j S=å d k=1 l (i; j) k v (i; j) k v (i; j)T k ; for all i= 1;:::;n. A.5.2 Proof of Lemma A.7 We denoteS Y =S, which is valid throughout this proof only. The proof of relations (A.57,A.58) was presented in Mendelson and Zhivotovskiy [37, Lemma 2.1] with constants c(K)= K 3 and c(K) = K 4 respectively. For (A.59), assume that S Z S has eigenvalues l 1 l d with corresponding orthonormal eigenvector setfu 1 ;:::;u d g. Define T 1 = 0 and T j =å j1 l=1 l l ; j = 2;:::;d+ 1. Thenl j = T j+1 T j , and we have that kS Z Sk 2 F = d å j=1 l 2 j = d å j=1 l j (T j+1 T j ): Summation by parts implies that 110 d å j=1 l j (T j+1 T j )=(l d T d+1 l 1 T 1 ) d å j=2 T j (l j l j1 ) =l d T d+1 d å j=2 T j (l j l j1 )jl d jjT d+1 j+ d å j=2 T j (l j l j1 ) : Sincel j l j1 0, we have that d å j=2 T j (l j l j1 ) max 2 jd jT j j d å j=2 (l j l j1 )=(l d l 1 ) max 2 jd jT j j; hence kS Z Sk 2 F jl d jjT d+1 j+(l d l 1 ) max 2 jd jT j j 2kS Z Sk max 2 jd+1 jT j j: (A.87) It remains to bound max 2 jd+1 jT j j. Note that for j= 2;:::;d+ 1, jT j j= j1 å l=1 l l = j1 å l=1 h(S Z S)u l ;u l i = j1 å l=1 hS Z u l ;u l ihSu l ;u l i = j1 å l=1 E h hY;u l i 2 1fkYk 2 Rg i E h hY;u l i 2 i = j1 å l=1 E h hY;u l i 2 1fkYk 2 > Rg i : Applying Cauchy-Schwartz inequality and L 4 L 2 norm equivalence, we deduce that jT j j j1 å l=1 E h hY;u l i 4 i 1 2 P(kYk 2 > R) 1 2 j1 å l=1 K 2 E h hY;u l i 2 i P(kYk 2 > R) 1 2 K 2 P(kYk 2 > R) 1 2 d å l=1 E h hY;u l i 2 i : (A.88) Observe thatfu 1 ;:::;u d g is an orthonormal set onR d , so Parseval’s identity implies that d å l=1 E h hY;u l i 2 i =E h kYk 2 2 i = tr(E Y T Y )=E tr(YY T ) = tr(E YY T )= tr(S): (A.89) 111 On the other hand, applying Cauchy-Schwartz inequality and L 4 L 2 norm equivalence again, we have that E h kYk 4 2 i =E " d å j=1 Y;e j 2 2 # =E " å i; j hY;e i i 2 Y;e j 2 # å i; j E h hY;e i i 4 i 1 2 E h Y;e j 4 i 1 2 K 4 å i; j E h hY;e i i 2 i E h Y;e j 2 i = K 4 å i; j S i;i S j; j = K 4 tr(S) 2 : Markov’s inequality implies that P(kYk 2 > R) 1 2 0 @ E h kYk 4 2 i R 4 1 A 1 2 K 2 tr(S) R 2 : (A.90) Combining (A.88,A.89,A.90) together, we have that jT j j K 2 K 2 tr(S) R 2 tr(S)= K 4 tr(S) 2 R 2 for j= 2;:::;d+ 1. Therefore, kS Z Sk 2 F 2kS Z Sk max 2 jd+1 jT j j 2 K 3 kSktr(S) R 2 K 4 tr(S) 2 R 2 = 2K 7 kSktr(S) 3 R 4 ; hence kS Z Sk F p 2K 7 2 kSk 1 2 tr(S) 3 2 R 2 = c(K) kSk 2 rk(S) 3 2 R 2 ; as desired. 
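To close the appendix for Chapter 2, note that the truncation analysed in Lemma A.7 is a one-line operation on the data. The sketch below only illustrates the map Y ↦ Z = Y·1{‖Y‖₂ ≤ R} and compares the empirical second-moment matrices of the raw and truncated samples as a rough proxy for the bias ‖Σ_Z − Σ_Y‖; the heavy-tailed distribution and the value of R are arbitrary choices made for the example.

```python
import numpy as np

def truncate(Y, R):
    """Z_i = Y_i * 1{||Y_i||_2 <= R}: observations with large Euclidean norm are set to zero."""
    keep = np.linalg.norm(Y, axis=1) <= R
    return Y * keep[:, None]

rng = np.random.default_rng(3)
Y = rng.standard_t(df=5, size=(5000, 10))      # heavy-tailed sample (arbitrary example)
Z = truncate(Y, R=10.0)
second_moment = lambda M: M.T @ M / len(M)     # empirical version of E[Y Y^T]
bias_proxy = np.linalg.norm(second_moment(Z) - second_moment(Y), 2)
print(bias_proxy)
```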
112 B Appendix to Chapter 3 B.1 Proofs ommited in Section 3.2.3 In this section, we present the proofs that were omitted from the main exposition. Throughout this section, we denoteD b = b bb ,D q = b qq to be the error vectors, andD=[D b ;D q ]2R p+n to be the augmented error vector. B.1.1 Proof of Theorem 3.2.1 It is worth noting that the main part of the proof follows the argument in Dalalyan and Thompson [15]. First, recall that S= j :b j 6= 0 , O= j :q j 6= 0 and s= Card(S), o= Card(O) respec- tively. The following lemma ensures that the augmented error vectorD belongs to the cone defined in (3.2.6), with c 0 = 3 andg = l s l o . Lemma B.1. On the event E = n l s kxk 2 r 2 n X T x ¥ ;l o kxk 2 p 2kxk ¥ o ; the following inequality holds: l s ( b bb ) S C 1 +l o ( b qq ) O C 1 3 l s ( b bb ) S 1 +l o ( b qq ) O 1 : Equivalently,D2C S;O 3; l s l o . Proof of Lemma B.1. By the definition of b b; b q, we have that Q( b b; b q) 1 2 Q(b ;q ) 1 2 l s kb k 1 b b 1 +l o kq k 1 b q 1 : (B.1) Sincekk 1 is convex, for any v2¶kb k 1 , kb k 1 b b 1 D v;b b b E : 113 Letb =å p j=1 b j e j , wherefe j : j= 1;:::; pg is the canonical basis ofR p . Then v2¶kb k 1 has the representation v= å j:b j 6=0 sign(b j )e j (p)+ å j:b j =0 w j e j (p); where w=(w 1 ;:::;w p ) T ,kwk ¥ 1. Therefore, kb k 1 b b 1 D v;b b b E = * å j:b j 6=0 sign(b j )e j (p);b b b + + * å j:b j =0 w j e j (p);b b b + å j:b j 6=0 sign(b j )e j (p) ¥ (b b b) S 1 + * å j:b j =0 w j e j (p);b b b + (b b b) S 1 (b b b) S C 1 ; where in the last step we chose w j =sign(b j b b j ) for j2 S C . Similarly, one can show that kq k 1 b q 1 (q b q) O 1 (q b q) O C 1 : Combining the inequalities above with (B.1), we get that Q( b b; b q) 1 2 Q(b ;q ) 1 2 l s (b b b) S 1 (b b b) S C 1 +l o (q b q) O 1 (q b q) O C 1 : (B.2) On the other hand, convexity of Q(b;q) 1 2 implies that Q( b b; b q) 1 2 Q(b ;q ) 1 2 D ¶ b Q 1 2 (b ;q ); b bb E + D ¶ q Q 1 2 (b ;q ); b qq E ; where¶ b Q 1 2 (b ;q ) or¶ q Q 1 2 (b ;q ) represents the subgradient of Q 1 2 with respect tob (orq), evaluated at the point(b ;q ). If Q(b ;q )6= 0, we have that ¶ b Q 1 2 (b ;q )= 1 2 Q(b ;q ) 1 2 1 n X T Y Xb p nq 114 and ¶ q Q 1 2 (b ;q )= 1 2 Q(b ;q ) 1 2 1 p n Y Xb p nq : Therefore, Q( b b; b q) 1 2 Q(b ;q ) 1 2 1 n å n j=1 (y j X T j b p nq j )X T j ( b bb ) 2Q(b ;q ) 1 2 1 p n å n j=1 (y j X T j b p nq j )( b q j q j ) 2Q(b ;q ) 1 2 1 n å n j=1 x j X j ¥ b bb 1 2Q(b ;q ) 1 2 1 p n max j=1;:::;n jx j j b qq 1 2Q(b ;q ) 1 2 = 1 n X T x ¥ b bb 1 2Q(b ;q ) 1 2 1 p n kxk ¥ b qq 1 2Q(b ;q ) 1 2 : which is equivalent to 2Q(b ;q ) 1 2 Q( b b; b q) 1 2 Q(b ;q ) 1 2 1 n X T x ¥ b bb 1 1 p n kxk ¥ b qq 1 : (B.3) If Q(b ;q )= 0, we note that 1 p 2n kxk 2 = Q(b;q ) 1 2 = 0, and hence inequality (B.3) is still valid. Next, on the eventE , we have that 1 n X T x ¥ l s r 1 2n kxk 2 =l s Q(b ;q ) 1 2 and 1 p n kxk ¥ l o r 1 2n kxk 2 =l o Q(b ;q ) 1 2 : Hence in the view of (B.3), we have that Q(b ;q ) 1 2 Q( b b; b q) 1 2 Q(b ;q ) 1 2 Q(b ;q ) 1 2 l s 2 b bb 1 l o 2 b qq 1 : (B.4) 115 Combining (B.2, B.4) and noting that b bb 1 = ( b bb ) S 1 + ( b bb ) S C 1 , b qq 1 = ( b qq ) O 1 + ( b qq ) O C 1 , we deduce the following inequality: l s ( b bb ) S C 1 +l o ( b qq ) O C 1 3 l s ( b bb ) S 1 +l o ( b qq ) O 1 ; as desired. 
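For reference, the estimator (β̂, θ̂) analysed in this appendix is the minimizer of the convex objective Q(β, θ)^{1/2} + λ_s‖β‖₁ + λ_o‖θ‖₁ with Q(β, θ) = ‖Y − Xβ − √n θ‖₂²/(2n), and can therefore be computed with any generic conic solver. The following sketch uses CVXPY; it is meant only as an illustration of the optimization problem, not as the implementation used in the dissertation, and the data and penalty levels are placeholders.

```python
import numpy as np
import cvxpy as cp

def robust_sqrt_lasso(X, Y, lam_s, lam_o):
    """Minimize Q(beta, theta)^{1/2} + lam_s*||beta||_1 + lam_o*||theta||_1,
    with Q(beta, theta) = ||Y - X @ beta - sqrt(n)*theta||_2^2 / (2n)."""
    n, p = X.shape
    beta = cp.Variable(p)
    theta = cp.Variable(n)
    residual = Y - X @ beta - np.sqrt(n) * theta
    objective = (cp.norm(residual, 2) / np.sqrt(2 * n)
                 + lam_s * cp.norm1(beta) + lam_o * cp.norm1(theta))
    cp.Problem(cp.Minimize(objective)).solve()
    return beta.value, theta.value
```

Any pair (λ_s, λ_o) satisfying the assumptions of Theorem 3.2.1 may be used; Proposition 3.2.1, proved in Section B.1.2 below, supplies explicit choices when the noise is Gaussian.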
Next, we present a result which bounds the errors of the estimators b b and b q simultaneously, assuming that the augmented matrix 1 p n [X; p nI n ] satisfies an augmented restricted eigenvalue con- dition, namely, 1 n Xu+ p nv 2 2 k 2 (kuk 2 2 +kvk 2 2 ) (B.5) for all[u;v]2C S;O (3;g). Proposition B.1. Assume that the design matrix X2R np satisfies the augmented restricted eigenvalue condition (B.5). Letl s ,l o be the positive numbers such that (l 2 s s+l 2 o o) 1 2 k 1 4 p 2 d (B.6) for some absolute constantd < 1. Then on the event E = n l s kxk 2 r 2 n X T x ¥ ;l o kxk 2 p 2kxk ¥ o ; we have the following bounds: b b b 2 + q b q 2 3(l 2 s s+l 2 o o) 1 2 k 2 (1d) kxk 2 p n : (B.7) l s b b b 1 +l o q b q 1 12(l 2 s s+l 2 o o) k 2 (1d) kxk 2 p n : (B.8) Proof of Theorem B.1. To simplify the expressions, we introduce the following notations, which are used in this proof only. 116 b Q := Q( b b; b q) and Q := Q(b ;q ); A (n) := 1 p n A, whenever A is a number, vector or matrix; b x := Y X b b p n b q; Also, we note that b Q= 1 2n b x 2 2 and Q = 1 2n kxk 2 2 . First, since the loss function L(b;q) defined in (3.2.2) is convex, the first-order optimality condition ensures the existence ofb v2¶ b b 1 ,b u2¶ b q 1 such thatb v Tb b = b b 1 ,b u T b q = b q 1 , and 0= 1 n X T (Y X b b p n b q) 2 b Q 1 2 +l s b v; 0= 1 p n (Y X b b p n b q) 2 b Q 1 2 +l o b u under the assupmtion that b Q6= 0. The two equations above are equivalent to [X (n) ;I n ] T (Y (n) X (n) b b b q)= 2 b Q 1 2 h l s (b v) T ;l o (b u) T i T : (B.9) Note that when b Q= 0, we have that b x = 0 and hence (B.9) is still valid. Next, recall that Y (n) = X (n)b b+ b q+x (n) , so by (B.9), [X (n) ;I n ] T [X (n) ;I n ]D=[X (n) ;I n ] T x (n) 2 b Q 1 2 h l s (b v) T ;l o (b u) T i T : Multiplying both sides of this equation byD T from the left, we get X (n) D b +D q 2 2 = (D b ) T (X (n) ) T x (n) +(D q ) T x (n) 2 b Q 1 2 l s (D b ) T b v 2 b Q 1 2 l o (D q ) T b u: (B.10) Note thatkb vk ¥ 1 andb v Tb b = b b 1 , hence (D b ) T b v=(b b b) T b v=(b ) T b v b b 1 kb k 1 b b 1 : 117 Similarly, we can show that (D q ) T b ukq k 1 b q 1 : Combining these bounds with equation (B.10) and noticing that(Q ) 1 2 = 1 p 2n kxk 2 , we deduce that on the eventE , X (n) D b +D q 2 2 l s (Q ) 1 2 D b 1 +l o (Q ) 1 2 D q 1 + 2 b Q 1 2 l s kb k 1 b b 1 + 2 b Q 1 2 l o kq k 1 b q 1 : (B.11) For brevity, set x= X (n) D b +D q 2 . Observe that b Q 1 2 (Q ) 1 2 + 1 p 2 X (n) D b +D q 2 . Therefore, we derive from inequality (B.11) that x 2 l s (Q ) 1 2 D b 1 +l o (Q ) 1 2 D q 1 +l s 2(Q ) 1 2 + p 2x (kb k 1 b b 1 )+l o 2(Q ) 1 2 + p 2x (kq k 1 b q 1 ) p 2x l s D b 1 +l o D q 1 +l s (Q ) 1 2 D b 1 + 2kb k 1 2 b b 1 +l o (Q ) 1 2 D q 1 + 2kq k 1 2 b q 1 ; (B.12) where in the last step we used the inequalitykuk 1 kvk 1 ku vk 1 that holds for any vectors u and v. 
Lemma B.1 implies that l s D b 1 +l o D q 1 4 l s D b S 1 +l o D q O 1 : Recall that S=f j :b j 6= 0g, hence D b 1 + 2kb k 1 2 b b 1 = D b 1 + 2kb S k 1 2 b b S 1 2 D b S C 1 D b 1 + 2 D b S 1 2 D b S C 1 = 3 D b S 1 D b S C 1 : 118 Similarly, D q 1 + 2kq k 1 2 b q 1 3 D q O 1 D q O C 1 : Combining these bounds with (B.12), we get x 2 4 p 2x l s D b S 1 +l o D q J 1 +l s (Q ) 1 2 3 D b S 1 D b S C 1 +l o (Q ) 1 2 3 D q O 1 D q O C 1 4 p 2x+ 3(Q ) 1 2 l s D b S 1 +l o D q O 1 : (B.13) SinceD2C S;O (3; l s l o ), by the augmented RE condition (B.5), we have that l s D b S 1 +l o D q O 1 l s p s k k D b 2 + l o p o k k D q 2 (l 2 s+l o o) 1 2 k X (n) D b +D q 2 2 1 2 = (l 2 s+l o o) 1 2 k x: Combing the above display with (B.13), we see that 1 4 p 2 (l 2 s+l o o) 1 2 k ! x 2 3(Q ) 1 2 (l 2 s+l o o) 1 2 k x: Under the assumption that 4 p 2 (l 2 s+l o o) 1 2 k d for somed < 1, we deduce that x 3(Q ) 1 2 1d (l 2 s+l o o) 1 2 k 3(l 2 s s+l 2 o o) 1 2 p 2k(1d) x (n) 2 : (B.14) Finally, using the augmented RE assumption again, we obtain the following inequality: x= X (n) D b +D q 2 k q D b 2 2 + D q 2 2 k p 2 D b 2 + D q 2 ; where we used the bound p 2(a+ b) p a+ p b that holds for any positive numbers a;b in the last step. Therefore, D b 2 + D q 2 3(l 2 s s+l 2 o o) 1 2 k 2 (1d) x (n) 2 : 119 Moreover, l s D b 1 +l o D q 1 4(l s D b S 1 +l o D q J 1 ) 4 (l 2 s s+l 2 o o) 1 2 k k( D b 2 + D q 2 ) 12(l 2 s s+l 2 o o) k 2 (1d) x (n) 2 ; which concludes the proof. Proposition B.1 requires that the design matrix X satisfies the augmented restricted eigenvalue condition (B.5), while our main result only requires ATP and IP conditions on X. The following lemma connects the ATP condition with the augmented restricted eigenvalue condition. Lemma B.2. Let[u;v]2C S;O (3;g). Assume that the design matrix X2R np satisfies the ATP(c 1 ;c 2 ;c 3 ) with (c 2 _gc 3 ) s+ o g 2 1 2 c 1 8 : Then X also satisfies the augmented eigenvalue condition (B.5) withk = c 1 2 . Proof of Lemma B.2. Since[u;v]2C S;O (3;g), by Cauchy-Schwartz inequality we have that c 2 kuk 1 + c 3 kvk 1 c 2 g _ c 3 (gkuk 1 +kvk 1 ) 4 c 2 g _ c 3 (gku S k 1 +kv O k 1 ) 4 c 2 g _ c 3 g 2 s+ o 1 2 k[u;v]k 2 : (B.15) So as long as (c 2 _gc 3 ) s+ o g 2 1 2 c 1 8 ; we have that c 2 kuk 1 + c 3 kvk 1 c 1 2 k[u;v]k 2 : By ATP(c 1 ;c 2 ;c 3 ), we have that 1 p n Xu+ p nv 2 c 1 k[u;v]k 2 c 1 2 k[u;v]k 2 = c 1 2 k[u;v]k 2 : 120 Since[u;v]2C S;O (3;g) is arbitrary, we proved that X satisfies the augmented eigenvalue condition (B.5) withk = c 1 2 . Combining Lemma B.2 with Proposition B.1, we immediately get the following corollary: Corollary B.1. Let c 1 ;c 2 ;c 3 andl s ;l o be positive constants such that l 2 s s+l 2 o o 1 2 c 1 8 min 8 < : d p 2 ; 1 c 2 l s _ c 3 l o 9 = ; (B.16) for some absolute constantd< 1. Assume that the design matrix X2R np satisfies the ATP(c 1 ;c 2 ;c 3 ), Then on the event E = n l s kxk 2 r 2 n X T x ¥ ;l o kxk 2 p 2kxk ¥ o ; we have the following bounds: b b b 2 + q b q 2 12(l 2 s s+l 2 o o) 1 2 c 2 1 (1d) kxk 2 p n : (B.17) l s b b b 1 +l o q b q 1 48(l 2 s s+l 2 o o) c 2 1 (1d) kxk 2 p n : (B.18) Remark B.1. Direct computation shows that (Q ) 1 2 ( b Q) 1 2 = 1 p 2 Y X (n) b q 2 Y X (n) b b b q 2 1 p 2 X (n) D b +D q 2 : Combining with (B.14) and (B.16), we derive that (Q ) 1 2 ( b Q) 1 2 3(l 2 s s+l 2 o o) 1 2 p 2k(1d) (Q ) 1 2 3d 8(1d) (Q ) 1 2 for any d < 1. This shows that under the assumptions of Proposition B.1, ( b Q) 1 2 can be close to (Q ) 1 2 . This fact will be important in the next proof. 
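Since the cone C_{S,O}(c₀, γ) recurs throughout this section, a direct transcription of its defining inequality may help fix notation. The sketch below is only illustrative; it takes the supports S and O as index arrays and uses the equivalent parametrization through (λ_s, λ_o) with γ = λ_s/λ_o.

```python
import numpy as np

def in_cone(u, v, S, O, lam_s, lam_o, c0=3.0):
    """Check lam_s*||u_{S^c}||_1 + lam_o*||v_{O^c}||_1 <= c0*(lam_s*||u_S||_1 + lam_o*||v_O||_1)."""
    S_mask = np.zeros(len(u), dtype=bool); S_mask[np.asarray(S, dtype=int)] = True
    O_mask = np.zeros(len(v), dtype=bool); O_mask[np.asarray(O, dtype=int)] = True
    lhs = lam_s * np.abs(u[~S_mask]).sum() + lam_o * np.abs(v[~O_mask]).sum()
    rhs = c0 * (lam_s * np.abs(u[S_mask]).sum() + lam_o * np.abs(v[O_mask]).sum())
    return lhs <= rhs
```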
121 Finally, we present the proof for Theorem 3.2.1, which provides an error bound for the estima- tor b b only. The main idea of the proof is to treatq as a “nuisance parameter” and repeat part of the previous argument. Proof of Theorem 3.2.1 . Recall the notations: b Q := Q( b b; b q) and Q := Q(b ;q ); A (n) := 1 p n A, whenever A is a number, vector or matrix; b x := Y X b b p n b q; Also, we note that b Q= 1 2n b x 2 2 and Q = 1 2n kxk 2 2 . Since b b2 argmin b 1 p 2n Y Xb p n b q 2 +l s kbk 1 ; there existsb v2¶ b b 1 withb v Tb b = b b 1 such that 1 n X T (Y X b b p n b q)+ 2( b Q) 1 2 l s b v= 0: This implies that (recall Y (n) = X (n)b b+ b q+x (n) ) (X (n) ) T X (n) D b +D q x (n) + 2( b Q) 1 2 l s b v= 0: Multiplying both sides from the left by D b T , we get: X (n) D b 2 2 = D X (n) D b ;D q E + D X (n) D b ;x (n) E 2( b Q) 1 2 l s D D b ;b v E : (B.19) Recall that D D b ;b v E kb k 1 b b 1 D b S 1 D b S C 1 = 2 D b S 1 D b 1 ; 122 and Remark B.1 implies that 25 26 (Q ) 1 2 ( b Q) 1 2 27 26 (Q ) 1 2 as long asd 4 43 . Also, on the eventE , we have that D X (n) D b ;x (n) E D b 1 1 n X T x ¥ l s (Q ) 1 2 D b 1 : Combining these results with (B.19) , we deduce that on the eventE , X (n) D b 2 2 D X (n) D b ;D q E +l s (Q ) 1 2 D b 1 + 2 b Q 1 2 l s 2 D b S 1 D b 1 D X (n) D b ;D q E + 12 13 l s (Q ) 1 2 9 2 D b S 1 D b 1 : Now it remains to control D X (n) D b ;D q E . Since X satisfies IP(b 1 ;b 2 ;b 3 ) as defined in Definition 3.2.2, we have that D X (n) D b ;D q E b 1 D b 2 D q 2 + b 3 D b 2 D q 1 + b 2 D b 1 D q 2 : Therefore, X (n) D b 2 2 12 13 l s (Q ) 1 2 ( 9 2 D b S 1 D b 1 )+ b 2 D b 1 D q 2 + b 1 D q 2 + b 3 D q 1 D b 2 : (B.20) By Corollary B.1, we have that forg :=l s =l o , D q 2 12 p 2 1d (s+ o=g 2 ) 1 2 c 2 1 l s (Q ) 1 2 3 13 l s b 2 (Q ) 1 2 ; provided that (s+o=g 2 ) 1 2 c 2 1 1d 52 p 2b 2 , and hence 123 12 13 l s (Q ) 1 2 9 2 D b S 1 D b 1 + b 2 D b 1 D q 2 = 9 13 5 D b S 1 D b S C 1 l s (Q ) 1 2 3 13 D b 1 l s (Q ) 1 2 + b 2 D b 1 D q 2 9 13 5 D b S 1 D b S C 1 l s (Q ) 1 2 : Combining the above result with (B.20), we deduce that X (n) D b 2 2 b 1 D q 2 + b 3 D q 1 D b 2 + 9 13 5 D b S 1 D b S C 1 l s (Q ) 1 2 : In what follows, denote A= b 1 D q 2 + b 3 D q 1 , B= 9 13 l s (Q ) 1 2 5 D b S 1 D b S C 1 + and x= D b 2 . Note that the ATP(c 1 ;c 2 ;c 3 ) condition defined in Definition 3.2.1 implies that 1 p n kXuk 2 c 1 kuk 2 c 2 kuk 1 for any u2R p . Therefore, we have that c 1 x c 2 D b 1 2 + Ax+ B: This implies that either x c 2 c 1 D b 1 , or c 1 x c 2 D b 1 A 2c 1 2 B+ A 2 4c 2 1 + Ac 2 c 1 D b 1 : In both cases, we have that x c 2 c 1 D b 1 + A 2c 2 1 + 1 c 1 B+ A 2 4c 2 1 + Ac 2 c 1 D b 1 1 2 3c 2 2c 1 D b 1 + 3A 2c 2 1 + B 1 2 c 1 ; (B.21) 124 where in the last step we used the algebraic inequalities p a+ b+ c p a+ p b+ p c and p ab 1 2 (a+ b) valid for any a;b;c 0. Note that B 45 13 l s (Q ) 1 2 p s D b 2 c 1 x 2 + 45l s p s 26c 1 (Q ) 1 2 2 ; (B.22) where the last inequality follows from the algebraic relation 4ab(a+ b) 2 . Combining (B.21) and (B.22), we arrive at x 2 3c 2 2c 1 D b 1 + 3A 2c 2 1 + 45l s p s 26c 2 1 (Q ) 1 2 3 2 c 2 c 1 D b 1 + b 1 + b 3 c 2 1 D q 1 + 45l s p s 26c 2 1 (Q ) 1 2 3 2 c 2 gc 1 _ b 1 + b 3 c 2 1 g D b 1 + D q 1 + 45l s p s 26c 2 1 (Q ) 1 2 64 p 2l o (1d) c 2 gc 1 _ b 1 + b 3 c 2 1 (g 2 s+ o) k 2 (Q ) 1 2 + 45l s p s 26c 2 1 (Q ) 1 2 ; where the last step follows from Corollary B.1. This completes the proof. 
B.1.2 Proof of Proposition 3.2.1

In this subsection we present the proof of Proposition 3.2.1, which gives the smallest choices of $\lambda_s$ and $\lambda_o$ when the noise $\xi$ is Gaussian. The proof is essentially due to Giraud [23, p. 112] and the union bound.

First, Giraud [23, p. 112] shows that
\[
\mathbb P\Big( \frac{\|\xi\|_2}{\sqrt n} \le \Big(1-\frac{1}{\sqrt 2}\Big)\sigma \Big) \le (1+e^2)\,e^{-n/24}.
\]
Second, we denote by $X_j$, $j=1,\dots,p$, the $j$-th column of $X$. Then by the union bound, we have that
\[
\mathbb P\Big( \frac{1}{\sqrt n}\|X^T\xi\|_\infty \ge \sqrt2\,\sigma\rho\sqrt{\log(p)+L_1} \Big)
= \mathbb P\Big( \max_{j=1,\dots,p} \frac{1}{\sqrt n}|X_j^T\xi| \ge \sqrt2\,\sigma\rho\sqrt{\log(p)+L_1} \Big)
\le \sum_{j=1}^p \mathbb P\Big( \frac{1}{\sqrt n}|X_j^T\xi| \ge \sqrt2\,\sigma\rho\sqrt{\log(p)+L_1} \Big). \tag{B.23}
\]
Since $Z := \frac{1}{\sqrt n}X_j^T\xi$ is normally distributed with mean $0$ and variance $\mathbb E\,Z^2 = \sigma^2\,\frac1n\|X_j\|_2^2 \le \sigma^2\rho^2$, (B.23) implies that
\[
\mathbb P\Big( \frac{1}{\sqrt n}\|X^T\xi\|_\infty \ge \sqrt2\,\sigma\rho\sqrt{\log(p)+L_1} \Big)
\le p\,\mathbb P\Big( |Z| \ge \sqrt2\,\sigma\rho\sqrt{\log(p)+L_1} \Big).
\]
By the concentration inequality for the standard normal distribution, we have that for any $x>0$,
\[
\mathbb P(|Z|\ge \sigma\rho x) \le \mathbb P\big(|Z|\ge x\sqrt{\mathbb E[Z^2]}\big) \le e^{-x^2/2}.
\]
Setting $x=\sqrt2\sqrt{\log(p)+L_1}$, we derive that
\[
\mathbb P\Big( \frac{1}{\sqrt n}\|X^T\xi\|_\infty \ge \sqrt2\,\sigma\rho\sqrt{\log(p)+L_1} \Big) \le p\,e^{-\log(p)-L_1} = e^{-L_1}.
\]
Finally, by the union bound and the concentration inequality for the standard normal distribution again, we have that
\[
\mathbb P\Big( \|\xi\|_\infty \ge \sigma\sqrt2\sqrt{\log(n)+L_2} \Big)
\le \sum_{j=1}^n \mathbb P\Big( |\xi_j| \ge \sigma\sqrt{2(\log(n)+L_2)} \Big)
\le n\,e^{-\log(n)-L_2} = e^{-L_2}.
\]
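The proof above rests only on standard Gaussian tail bounds, so the probability of the event E = {λ_s‖ξ‖₂ ≥ √(2/n)‖Xᵀξ‖_∞, λ_o‖ξ‖₂ ≥ √2‖ξ‖_∞} is easy to estimate by simulation for any candidate pair (λ_s, λ_o). The sketch below is a purely illustrative Monte Carlo check with arbitrary problem sizes.

```python
import numpy as np

def event_E_frequency(X, sigma, lam_s, lam_o, n_rep=2000, seed=0):
    """Monte Carlo estimate of P(E) for Gaussian noise xi ~ N(0, sigma^2 I_n)."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    hits = 0
    for _ in range(n_rep):
        xi = sigma * rng.standard_normal(n)
        ok_s = lam_s * np.linalg.norm(xi) >= np.sqrt(2.0 / n) * np.max(np.abs(X.T @ xi))
        ok_o = lam_o * np.linalg.norm(xi) >= np.sqrt(2.0) * np.max(np.abs(xi))
        hits += bool(ok_s and ok_o)
    return hits / n_rep
```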
Abstract
This dissertation investigates the learning scenarios where a high-dimensional parameter needs to be estimated from grossly corrupted data. The motivation for the topics we study comes from the real-world applications, where the data are often obtained from expensive experiments resulting in possible contamination and have to be analyzed robustly and efficiently. For example, in the sensor network model, one employs a huge network of sensors to collect independent measurements of a signal and send them to the central hub for estimation analysis. However, a small portion of the sensors might possibly fail to send the measurements correctly, or even transmit irrelevant measurements. Therefore, it is important that the estimation is robust to those irregular measurements.

In the first part of the dissertation, we consider a problem of covariance estimation under adversarial contamination. Specifically, we are given a set of independent observations from some distribution with unknown mean and covariance, and we assume that a small fraction of them are adversarially corrupted, namely, they are missing or replaced by arbitrary values. Our goal is to construct an estimator that reliably recovers the covariance in this framework. We will show that the proposed estimator admits a theoretically optimal error bound in some regimes and can be computed efficiently.

In the second part of the dissertation, we consider a problem of robustly estimating the regression coefficients in a linear model under adversarial contamination. Specifically, we are given n pairs of predictor-response values, which are assumed to satisfy a linear model. Moreover, we assume that a small fraction of the response values are adversarially corrupted, and we are aiming to construct an estimator that reliably recovers the regression coefficients. We will show that the proposed estimator admits nearly minimax optimal error bound under appropriate conditions and is computationally efficient. Moreover, our method is pivotal in the sense that it does not require any prior knowledge of the standard deviation of the noise term.