Improved Computational and Statistical Guarantees for High Dimensional Linear Regression
by
Ruolan Wang
A Thesis Presented to the
FACULTY OF THE USC DORNSIFE COLLEGE OF LETTERS, ARTS AND
SCIENCES
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(Applied Mathematics)
May 2021
Copyright 2021 Ruolan Wang
Table of Contents

List of Tables
List of Figures
Abstract
Chapter 1: Introduction
    1.1 Background
    1.2 Motivation
    1.3 Contributions
Chapter 2: Algorithms and numerical experiments
    2.1 Algorithms
    2.2 Numerical experiments
        2.2.1 Compare ISTA and AdaIHT
        2.2.2 Condition number issue
References
List of Tables

2.1 Generalization error of ISTA and AdaIHT for different $\kappa$

List of Figures

2.1 Generalization error of ISTA and AdaIHT as training proceeds.
2.2 Generalization error as training proceeds under different gradient descent steps, including GD, NGD and Newton, with AdaIHT as the thresholding step.
2.3 Change of the generalization error with respect to the condition number for ISTA, GD AdaIHT and Newton AdaIHT. The number of iterations of these methods is 200.
2.4 Generalization error as training proceeds under different thresholding methods, including Ada-IHT and Ada-HTP.
2.5 Generalization error as training proceeds under Fast Newton's and Newton's method.
2.6 Change of the generalization error with respect to the condition number for ISTA, GD+AdaIHT, GD+AdaHTP, FastNewton+AdaIHT, and FastNewton+AdaHTP. The true signal $a = 20$ and the number of iterations is 600.
Abstract
In high dimensional linear regression, the iterative soft thresholding algorithm (ISTA) that solves
the LASSO is a classical and popular method that outputs a sparse weight vector. While this method
is widely used in practice, it still has several issues, including shrinkage bias, slow convergence
rate, and a dependence on the condition number. Iterative hard thresholding (IHT) and hard
thresholding pursuit (HTP) are two practical methods that may help solve these issues. Unfortunately, both
of these methods rely heavily on prior knowledge of the sparsity parameter. This thesis focuses on
a recent adaptive version of IHT and HTP that combines the best of the two realms of LASSO and
IHT. During our methodological study we also consider second order variants of IHT that turn out
to be superior in terms of their dependence on the condition number. Finally, we propose an accelerated
version of Newton IHT to speed up the computation. This thesis is mainly experimental,
and the goal is to compare many existing and new algorithms using several criteria, including the
generalization error.
All the code can be found at https://github.com/Ruola/Sparse-Linear-Regression.git.
Keywords: high dimensional linear regression, sparse linear regression, iterative soft thresholding, iterative hard thresholding, hard thresholding pursuit, Newton method.
Chapter 1
Introduction
Nowadays, large amounts of data are collected in business, industry and science. As
datasets grow wide, they may have far more features than samples, and there is a huge demand to
analyze such high dimensional data. To make this tractable, one often assumes that only a few
features are important; this structural simplicity is called sparsity. How sparsity influences signal
recovery has been studied extensively, for example in compressed sensing. A famous and leading
instance of this setting is sparse linear regression. This thesis aims to compare new and
existing solutions to sparse linear regression.
1.1 Background
Linear regression is a classic statistical model. Suppose that we are given a design matrix
$H \in \mathbb{R}^{n \times p}$ that contains $n$ observations of $p$ features, and a response $y \in \mathbb{R}^n$ that contains one scalar
response for each observation. The normal linear model assumes the existence of $x \in \mathbb{R}^p$ that
satisfies
\[
y = Hx + \sigma \xi, \tag{1.1}
\]
where $H$ is the design matrix, $\sigma$ is the noise level, and the noise $\xi$ is assumed to be a
standard normal random vector. Throughout this thesis, we assume that $H$ is a random Gaussian
design with i.i.d. rows of distribution $\mathcal{N}(0, \Sigma)$, where $\Sigma$ is a covariance matrix.
If $n > p$, estimating $x$ can be done using the least squares method. If $n \ll p$, the model becomes
high dimensional linear regression, and minimizing the sum of squares objective
\[
\operatorname*{arg\,min}_{x \in \mathbb{R}^p} \; \|Hx - y\|_2^2, \qquad H \in \mathbb{R}^{n \times p}, \; y \in \mathbb{R}^n,
\]
has infinitely many solutions. In the high dimensional scenario, the design matrix has too many
features. In order to make the above problem well posed, we can assume that only a few features
are important. In other words, the signal $x$ has at most $s$ nonzero elements, i.e. $\|x\|_0 \le s$. Sparsity
is a very reasonable constraint, and the corresponding problem is called sparse linear regression. In
practice, sparsity is good for model interpretation and computational efficiency. These advantages
make sparse linear regression an important topic.
Solving the least squares problem under a sparsity constraint is not trivial. Recall that, when $n > p$,
the least squares solution is unique and given by $\hat{x} = (H^\top H)^{-1} H^\top y$, as long as $H^\top H$ is invertible.
However, if $n \ll p$, this formula is invalid because $H^\top H$ is no longer invertible. Moreover,
exhaustively searching for the optimal solution among all possible $s$-sparse candidates is not practical,
because it requires examining $\binom{p}{s}$ supports, which has exponential time complexity. Thus, it is
meaningful to develop algorithms to solve the sparse linear regression problem.
In this thesis we proceed to an extensive numerical study that aims to compare the performance
of several algorithms. In order to evaluate the statistical performance of an estimator $\hat{x}$, we use the
generalization error $\|\Sigma^{1/2}(x - \hat{x})\|_2$ instead of the prediction error $\frac{1}{n}\|H(x - \hat{x})\|_2$, because it represents
the ability of our model to generalize to new data.
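For reference, here is a minimal sketch of how this generalization error can be computed; the function name and the use of a matrix square root of $\Sigma$ are illustrative assumptions, not part of the thesis code.

```python
import numpy as np
from scipy.linalg import sqrtm

def generalization_error(x_true, x_hat, Sigma):
    """Generalization error ||Sigma^{1/2} (x - x_hat)||_2 for a known design covariance Sigma."""
    Sigma_half = np.real(sqrtm(Sigma))   # matrix square root of the covariance
    return np.linalg.norm(Sigma_half @ (x_true - x_hat))
```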
For a given invertible covariance matrix $\Sigma$, we define the condition number $\kappa$ as
\[
\kappa = \frac{\lambda_{\max}(\Sigma)}{\lambda_{\min}(\Sigma)},
\]
where $\lambda_{\max}(\Sigma)$ (resp. $\lambda_{\min}(\Sigma)$) is the largest (resp. smallest) eigenvalue of $\Sigma$. When $\kappa = 1$, we
refer to the model as the isotropic case; when $\kappa > 1$, the model is viewed as the an-isotropic
case. Recall that with a large condition number, a small change in the response $y$ can cause a large
change in the estimated signal $x$, so signal estimation becomes more challenging. Among other methods to
solve sparse linear regression, the LASSO uses an $\ell_1$ regularization:
\[
\operatorname*{arg\,min}_{x \in \mathbb{R}^p} \; \frac{1}{2}\|Hx - y\|_2^2 + \lambda \|x\|_1,
\]
for some $\lambda > 0$. It is well established that the LASSO suffers from a large condition number
of the design matrix. As $\kappa$ increases, both the computational efficiency and the statistical accuracy of
the LASSO diminish. This issue has inspired researchers to develop novel algorithms that achieve better
performance in the an-isotropic scenario.
One of the goals of this thesis is to understand empirically how the generalization error of certain
methods depends on the condition number. To make this statement more concrete, we list in the next
section the motivations behind the present work.
1.2 Motivation
1. Debiasing
LASSO tends to return a sparse solution. Because of this property, some practitioners prefer
LASSO to Ridge. However, LASSO also shrinks the selected coefficients toward zero, which can be seen as
an estimation bias. To address this issue, practitioners use several debiasing methods. Some
classical methods use LASSO as an initial estimator and then plug it into a debiasing procedure
to generate a better estimate. For example, when $n > p$, the relaxed LASSO [1] applies
least squares to the nonzero elements of the LASSO estimate and leads to a result that is better
than that of LASSO. When $n \ll p$, the paper [2] uses a debiased (or desparsified) LASSO
estimator $\hat{x} = \hat{x}_{\mathrm{LASSO}} + \frac{1}{n} M H^\top (y - H \hat{x}_{\mathrm{LASSO}})$, where $\hat{x}_{\mathrm{LASSO}}$ is the LASSO estimator and $M$
is an approximation of the inverse design covariance. Because of their hard thresholding,
IHT and HTP [3] can be seen as unbiased methods. In practice, they perform better than
the iterative soft thresholding algorithm (ISTA) in terms of the generalization error.
We will run several simulations to display the bias issue of ISTA and the performance of
some of the above debiasing methods.
2. Fast methods
Blumensath [4] states that IHT needs at most $\log\frac{\|x\|_2^2}{\tilde{\varepsilon}_s}$ iterations to meet the squared prediction
risk $\tilde{\varepsilon}_s$. Also, Haoyang Liu [5] states that IHT converges after $\log\frac{n\|x\|_2^2}{\sigma^2 s\log(pe/s)}$ steps
to the minimax prediction risk $\sigma^2 s\log(pe/s)$ under restricted strong convexity and restricted
smoothness conditions on the design. Note that $e$ is the base of the natural logarithm.
However, IHT relies on knowledge of the sparsity $s$, which is hard to obtain in practice.
Ndaoud [6] presents a novel method, Adaptive Iterative Hard Thresholding (AdaIHT), that
avoids this issue. It achieves a fast convergence rate and optimality without knowledge of
the sparsity. AdaIHT is more practical than IHT, and this thesis will focus on the performance
of variants of AdaIHT in both the isotropic and an-isotropic scenarios.
As for LASSO, Agarwal [7] shows that projected gradient descent applied to the LASSO
converges at a globally geometric rate under global smoothness and strong convexity assumptions.
As for second order methods, Chen [8] proposes Fast Newton Hard Thresholding Pursuit.
It reduces the time complexity per iteration to linear without damaging the convergence
performance, compared for instance to Newton's method.
We will compare the convergence rates of the different algorithms in terms of the generalization
error.
3. Condition number issue
Both the generalization performance and the convergence rate of LASSO suffer under a large
condition number. There exist results showing that, in a minimax sense, it is impossible to
get rid of this dependence for polynomial time methods. This issue, in particular, ruins signal
recovery in the an-isotropic scenario. We call this problem the condition number issue.
The lower bound on the minimax generalization risk is $\frac{\sigma^2 s\log p}{n}$, and it is attained by exhaustively
searching over all $s$-sparse solutions, which is NP-hard. So in practice, this optimal
solution may not be achievable. Haoyang Liu [5] shows that a variant of IHT achieves the
following prediction error:
\[
\frac{1}{n}\|\Sigma^{1/2}(x - \hat{x})\|_2^2 \;\le\; C\,\kappa\,\frac{\sigma^2 s\log p}{n},
\]
where $\kappa$ is the condition number of $\Sigma$. This suggests that iterative thresholding methods
depend on $\kappa$ in terms of the generalization error. Moreover, Yuchen Zhang [9] shows that
the prediction error of polynomial time algorithms cannot avoid a dependence on the restricted
eigenvalue constant for some designs satisfying RE. While the dependence on $\kappa$ is mainly due to the
bias, we conjecture that this dependence vanishes in scenarios where the signal entries are
large enough (and hence support recovery is possible). We investigate this phenomenon and confirm
our conjecture through simulations for certain hard thresholding methods in Section 2.2.2.
1.3 Contributions
Our work is inspired by the above issues in Section 1.2. To be more specific, our contributions are
as follows:
1. Debiasing. In Section 2.2.1, we compare the generalization error per iteration of ISTA and
Ada-IHT. For large signals, we illustrate the bias issue of ISTA.
2. Convergence rate. We run simulations to display the generalization error at each step of
Ada-IHT, Ada-HTP and ISTA. Unsurprisingly, Ada-IHT and Ada-HTP converge faster than
ISTA, which also means that they can be trained with fewer iterations than LASSO.
3. Condition number. In Section 2.2.2, we show that the performance of Ada-IHT and Ada-HTP
does not depend on the condition number for large signals.
Chapter 2
Algorithms and numerical experiments
2.1 Algorithms
We will compare several algorithms that solve sparse linear regression. Among the family of
iterative thresholding methods, we consider both iterative soft thresholding (for LASSO) and iterative
hard thresholding (IHT, Ada-IHT, ...). All these methods can be viewed as a gradient descent step
(which does not require knowledge of the sparsity), followed by a thresholding step to sparsify the estimate.
Each step of iterative soft thresholding (ISTA) corresponds to
\[
\hat{x}_t = \mathrm{Soft}_{\lambda}\bigl[\hat{x}_{t-1} - \eta H^\top (H\hat{x}_{t-1} - y)\bigr],
\]
with $\mathrm{Soft}_{\tau}(z) = \mathrm{sign}(z)\,(|z| - \tau)_+$ and $\eta < \frac{1}{\lambda_{\max}(H^\top H)}$. This is a first order proximal method that
solves the LASSO. To choose the threshold $\lambda$, we use cross validation to tune it.
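To make the update concrete, here is a minimal ISTA sketch in Python; the helper names are hypothetical, and the step size and the per-iteration threshold $\lambda$ are chosen as discussed above (e.g. by cross validation).

```python
import numpy as np

def soft_threshold(z, tau):
    """Soft thresholding operator: sign(z) * max(|z| - tau, 0), applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(H, y, lam, n_iter=400):
    """Minimal ISTA sketch: gradient step on 0.5*||Hx - y||^2 followed by soft thresholding."""
    n, p = H.shape
    # Step size strictly below 1 / lambda_max(H^T H).
    eta = 0.99 / np.linalg.norm(H, ord=2) ** 2
    x_hat = np.zeros(p)
    for _ in range(n_iter):
        grad_step = x_hat - eta * H.T @ (H @ x_hat - y)
        x_hat = soft_threshold(grad_step, lam)   # lam is the per-iteration threshold (tuned by CV)
    return x_hat
```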
Similar to ISTA, adaptive iterative hard thresholding (Ada-IHT) aims to solve the $\ell_0$-penalized problem
\[
\operatorname*{arg\,min}_{x \in \mathbb{R}^p} \; \frac{1}{2}\|Hx - y\|_2^2 + \lambda \|x\|_0.
\]
For a sequence of thresholds $(\lambda_t)$, Ada-IHT corresponds to
\[
\hat{x}_t = T_{\lambda_t}\bigl[\hat{x}_{t-1} - \eta H^\top (H\hat{x}_{t-1} - y)\bigr],
\]
where $T_{\lambda}(a) = a\,\mathbb{1}_{\{|a| > \lambda\}}$. We will consider a thresholding sequence of the form $\lambda_t = (0.95)^t \lambda_{\max} \vee \lambda$
(the maximum of $(0.95)^t \lambda_{\max}$ and $\lambda$), where the initial value $\lambda_{\max}$ is specified later and $\lambda$ is tuned
using CV. The thresholding sequence decreases geometrically until it hits the final threshold $\lambda$. The
choice of the constant $0.95$ is arbitrary here, and we could tune that parameter as well. We decided to
keep this value because it gave us a good trade-off between speed of convergence and good statistical
performance.
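As an illustration, here is a minimal Ada-IHT sketch under the same assumptions as the ISTA sketch above; the helper names are hypothetical, and the decay factor 0.95 and the final threshold are the tuning parameters discussed in the text.

```python
import numpy as np

def hard_threshold(z, lam):
    """Hard thresholding operator T_lambda: keep entries with |z_i| > lambda, zero out the rest."""
    return np.where(np.abs(z) > lam, z, 0.0)

def ada_iht(H, y, lam_final, lam_max=10.0, decay=0.95, n_iter=400):
    """Minimal Ada-IHT sketch: gradient step followed by hard thresholding
    with a geometrically decaying threshold floored at lam_final."""
    n, p = H.shape
    eta = 0.99 / np.linalg.norm(H, ord=2) ** 2   # step size below 1 / lambda_max(H^T H)
    x_hat = np.zeros(p)
    lam_t = lam_max
    for _ in range(n_iter):
        grad_step = x_hat - eta * H.T @ (H @ x_hat - y)
        x_hat = hard_threshold(grad_step, lam_t)
        lam_t = max(decay * lam_t, lam_final)    # lambda_t = (0.95^t * lambda_max) v lambda
    return x_hat
```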
These two iterative thresholding methods are both first order methods and share the same gradient
descent step $\hat{x}_{t-1} - \eta H^\top (H\hat{x}_{t-1} - y)$. However, LASSO needs the design matrix to satisfy
the restricted eigenvalue property (RE), which does not always hold in practice. In order to address this issue,
we consider using second order methods, including natural gradient descent (NGD) and Newton's
method (Newton). Recall that gradient descent (GD) is given by $\hat{x}_t = \hat{x}_{t-1} - \eta H^\top (H\hat{x}_{t-1} - y)$. We
define the second order methods as
\[
\hat{x}_t = \hat{x}_{t-1} - \eta M H^\top (H\hat{x}_{t-1} - y), \tag{2.1}
\]
where $\eta$ is the step size and $M$ is a matrix. The choice of $M = \Sigma^{-1}$, where $\Sigma$ is the design covariance,
leads to NGD, while $M = (H^\top H)^{-1}$ leads to Newton's gradient descent (for the sake of
brevity we denote by $(H^\top H)^{-1}$ the Moore–Penrose pseudo-inverse of the matrix $H^\top H$). In other
words,
\[
M =
\begin{cases}
I_{p \times p} & \text{for gradient descent,} \\
\Sigma^{-1} & \text{for natural gradient descent,} \\
(H^\top H)^{-1} & \text{for Newton's method.}
\end{cases} \tag{2.2}
\]
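Below is a minimal sketch of the preconditioned gradient step (2.1) for the three choices of $M$ in (2.2); the function name, method labels and pseudo-inverse handling are illustrative assumptions.

```python
import numpy as np

def preconditioned_gradient_step(x_hat, H, y, eta, method="gd", Sigma=None):
    """One update x_t = x_{t-1} - eta * M H^T (H x_{t-1} - y) for the choices of M in (2.2)."""
    residual_grad = H.T @ (H @ x_hat - y)
    if method == "gd":          # M = I
        M_grad = residual_grad
    elif method == "ngd":       # M = Sigma^{-1}, with Sigma the known design covariance
        M_grad = np.linalg.solve(Sigma, residual_grad)
    elif method == "newton":    # M = (H^T H)^+, the Moore-Penrose pseudo-inverse
        M_grad = np.linalg.pinv(H.T @ H) @ residual_grad
    else:
        raise ValueError(f"unknown method: {method}")
    return x_hat - eta * M_grad
```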
It turns out that in the an-isotropic scenario Newton's method outperforms the other candidates.
However, this method is computationally expensive due to matrix inversions, whose computational
complexity is $O(p^3)$. We therefore propose Fast Newton to decrease the computational complexity
compared to Newton's method without harming its statistical performance. Instead of inverting
the Hessian matrix at each step, the Fast Newton method runs several gradient descent steps without
thresholding and then applies a thresholding operator. Intuitively, the consecutive gradient steps
approximate the Newton gradient in some sense.
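To illustrate the idea, here is a minimal Fast Newton sketch under the same assumptions as the sketches above; the number of inner gradient steps (e.g. 10) corresponds to the FastNewton+AdaIHT10 configuration used later, and the parameter defaults are illustrative.

```python
import numpy as np

def fast_newton_ada_iht(H, y, lam_final, lam_max=10.0, decay=0.95,
                        inner_steps=10, n_iter=60):
    """Minimal Fast Newton sketch: several plain gradient steps approximate one Newton step,
    then a single hard thresholding is applied with a decaying threshold."""
    n, p = H.shape
    eta = 0.99 / np.linalg.norm(H, ord=2) ** 2
    x_hat = np.zeros(p)
    lam_t = lam_max
    for _ in range(n_iter):
        z = x_hat
        for _ in range(inner_steps):                  # consecutive gradient steps, no thresholding
            z = z - eta * H.T @ (H @ z - y)
        x_hat = np.where(np.abs(z) > lam_t, z, 0.0)   # hard thresholding step
        lam_t = max(decay * lam_t, lam_final)
    return x_hat
```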
As for the thresholding step, besides soft and hard thresholding, there is also a greedy
method, called Hard Thresholding Pursuit (HTP). It updates $x$ by
\[
\hat{x}_t = (H_{S_t}^\top H_{S_t})^{-1} H_{S_t}^\top y,
\]
where $S_t$ is the set of the $s$ largest (in absolute value) coordinates of the gradient step. To make
HTP adaptive, we consider Adaptive Hard Thresholding Pursuit (Ada-HTP), where $S_t = \{i : |\hat{x}_i| > \lambda_t,\; i \in [1,p]\}$.
In this work, we will compare different configurations based on combinations of a gradient descent
step (GD, NGD, Newton, Fast Newton) and a thresholding step (soft, hard, HTP) in terms
of speed of convergence and generalization error, beyond the isotropic scenario. We give below a
pseudo-code that covers all of the above procedures.
Algorithm 1: Algorithms to solve sparse linear regression
Result: the estimate of the signal $\hat{x}$.
Inputs: the design matrix $H$, the response $y$;
Initialize $\hat{x}$ to the zero vector, the gradient descent step size $\eta = \frac{1}{2\|MH^\top H\|_2}$, and $\lambda_{\max}$ to a large number;
Tune the final threshold $\lambda$ and the decay factor $\beta$ of $\lambda_t$ by CV;
while not converged do
    Gradient descent step $\hat{x}_t = \hat{x}_{t-1} - \eta M H^\top (H\hat{x}_{t-1} - y)$;
    Thresholding step (IHT or HTP) with the threshold $\lambda_t$;
    if $\lambda_t > \lambda$ then
        Update $\lambda_t = \beta \lambda_t$;
    end
end
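Putting the pieces together, here is a minimal sketch of the generic loop in Algorithm 1; the function and parameter names are hypothetical, and it mirrors the preconditioned gradient step and the thresholding operators sketched earlier.

```python
import numpy as np

def sparse_regression(H, y, lam_final, beta=0.95, M=None,
                      thresholding="iht", lam_max=10.0, n_iter=200):
    """Generic loop of Algorithm 1: preconditioned gradient step + (adaptive) IHT or HTP step."""
    n, p = H.shape
    if M is None:
        M = np.eye(p)                               # plain gradient descent
    eta = 1.0 / (2.0 * np.linalg.norm(M @ H.T @ H, ord=2))
    x_hat = np.zeros(p)
    lam_t = lam_max
    for _ in range(n_iter):
        grad_step = x_hat - eta * M @ (H.T @ (H @ x_hat - y))
        support = np.abs(grad_step) > lam_t
        x_hat = np.zeros(p)
        if thresholding == "iht":
            x_hat[support] = grad_step[support]     # hard thresholding
        elif thresholding == "htp" and support.any():
            x_hat[support], *_ = np.linalg.lstsq(H[:, support], y, rcond=None)
        if lam_t > lam_final:
            lam_t = beta * lam_t                    # geometric decay of the threshold
    return x_hat
```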
2.2 Numerical experiments
For our numerical experiments, we generate data according to the sparse linear regression model
with Gaussian error $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$, $\sigma = 0.1$. Each experiment is run 200 times and the results we
show are empirical averages over all runs. Each signal vector $w = (w_1, \dots, w_s, 0, \dots, 0)^\top$ has
exactly $s$ non-zero coordinates, all set to the same value $a$. We will mostly focus on the choice $a = 1$,
which leads to a high signal-to-noise ratio (SNR). For each data set, we consider the number of
observations $n = 200$, the number of features $p = 1000$ and the sparsity $s = 10$. The $n \times p$ design matrix $H$
is sampled such that its rows are independent and identically distributed according to a multivariate
normal distribution $\mathcal{N}(0, \Sigma)$. We consider two cases: isotropic (RIP design), corresponding to $\Sigma = I_p$,
and an-isotropic (non-RIP design). In the an-isotropic case, we take $\Sigma$ to be diagonal, where
half of its entries are set to $\kappa > 1$ and the other half to 1, so that the condition number
of $\Sigma$ is $\kappa$.
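For concreteness, here is a minimal data generation sketch matching this setup; the function name is hypothetical, while the default values $n = 200$, $p = 1000$, $s = 10$, $\sigma = 0.1$ are those stated above.

```python
import numpy as np

def generate_data(n=200, p=1000, s=10, a=1.0, sigma=0.1, kappa=1.0, seed=0):
    """Generate one sparse linear regression data set: rows of H are N(0, Sigma) with
    a diagonal Sigma of condition number kappa, and y = H x + sigma * noise."""
    rng = np.random.default_rng(seed)
    # Diagonal covariance: half the entries equal kappa, the other half equal 1.
    diag = np.concatenate([np.full(p // 2, kappa), np.ones(p - p // 2)])
    H = rng.standard_normal((n, p)) * np.sqrt(diag)   # rows ~ N(0, diag(Sigma))
    x_true = np.zeros(p)
    x_true[:s] = a                                    # s non-zero coordinates, all equal to a
    y = H @ x_true + sigma * rng.standard_normal(n)
    return H, y, x_true, np.diag(diag)
```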
2.2.1 Compare ISTA and AdaIHT
We start with a comparison of ISTA and AdaIHT in terms of convergence rate and the resulting
generalization error.
Table 2.1: Generalization error of ISTA and AdaIHT for different $\kappa$

    $\kappa$    ISTA                  AdaIHT
    1           0.356243829472815     0.0230619505397733
    10          2.72661267796143      2.24893688566476
    40          3.35024283953281      2.33923797653021
From the simulations in Figure 2.1 and Table 2.1, we can draw the following conclusions. First,
the convergence rate of AdaIHT is much faster than that of ISTA. This is a very useful feature when
it comes to cheap training. Second, in the isotropic case, AdaIHT reaches a smaller generalization
error than ISTA, because ISTA has a bias issue while AdaIHT is supposed to be unbiased for large
signals. Third, for both methods, the resulting error in the an-isotropic scenario is larger than
in the isotropic case because of the condition number. As the condition number $\kappa$ increases, a
slight change in the observation $y$ leads to a big change in the estimated signal, so the resulting error gets
worse. Fourth, ISTA suffers more from the an-isotropic scenario than AdaIHT, in terms of both the
convergence rate and the generalization error. To be more specific, as for the convergence rate,
ISTA converges in about 200 iterations for $\kappa = 1$, 300 iterations for $\kappa = 10$ and 400 iterations
for $\kappa = 40$. As for the generalization error, the values can be checked in Table 2.1. The anisotropy
of the design matrix ruins the restricted eigenvalue property (RE), which is a condition for optimal
performance of LASSO. In other words, for LASSO to achieve both computational efficiency and an
optimal result, it needs to satisfy a stronger assumption, for example the restricted isometry property
(RIP). Thus, as the condition number increases, the performance of ISTA, in terms of both convergence
rate and resulting error, becomes worse.

[Figure 2.1: Generalization error of ISTA and AdaIHT as training proceeds (generalization error vs. number of iterations). Panels: (a) isotropic $\kappa = 1$, (b) an-isotropic $\kappa = 10$, (c) an-isotropic $\kappa = 40$.]
2.2.2 Condition number issue
To get more insight into the condition number issue, we analyze the performance of other
methods, including second order ones.
In Figure 2.2, we notice that NGD performs badly in the an-isotropic scenario. We also observe
that Newton's method converges faster than GD. However, if we take into account the cost of inverting
the Hessian matrix, it is not clear that Newton's method outperforms GD. We will focus on GD and
Newton's method from now on.
In Figure 2.3, we compare the generalization error of three methods in terms of $\kappa$. As expected,
ISTA suffers from large condition numbers and the dependence is linear. Second, Ada-IHT and
Newton IHT both have a similar performance. Moreover, the plot suggests that neither depends
on the condition number for large signals. To the best of our knowledge this feature has not been
explored in the literature. Figure 2.6 will also suggest this feature.
In Figure 2.4, we observe that the adaptive HTP methods converge faster than Ada-IHT. Because
Ada-HTP and Ada-IHT have a similar resulting generalization error, and because the cost of inverting a
submatrix of size $s$ in Ada-HTP is usually cheap, we conclude that Ada-HTP performs better than
Ada-IHT.
While Ada-HTP only requires inverting a submatrix, this submatrix may still be large if $\lambda$ is
too small, since we will then be selecting many coordinates. That may happen during CV and causes
computational inefficiency. For this reason, we decided to check the performance of Fast Newton
in Figure 2.5. FastNewton+AdaIHT10 represents an approximation of Newton's method where we
run 10 consecutive gradient steps before thresholding, while FastNewton+AdaIHT20 uses 20 steps.
It turns out that 10 steps are enough to approximate Newton's method. Moreover, the cost in speed
of convergence is less than 6 times that of Newton's method. Keeping in mind that inverting the Hessian
costs roughly $p^3$ operations while a gradient step costs around $np$ operations, it is clear that Fast Newton
represents a good trade-off, combining low computational complexity and good generalization error.
Finally, in Figure 2.6, we dive deeper into the change of the generalization error with respect to the
condition number. In this experiment, we increase the true signal $a$ from 1 to 20, and we use
Fast Newton with 10 consecutive gradient steps before thresholding. It turns out that Ada-HTP
and Ada-IHT have a similar resulting generalization error; a similar conclusion was also obtained
in Figure 2.4. Moreover, Fast Newton performs slightly worse than GD in terms of the resulting
generalization error. Last but not least, for small condition numbers these algorithms have
similar performance, while for large condition numbers AdaHTP and AdaIHT do not
depend on the condition number for large signals. To the best of our knowledge this phenomenon
has not been explored in the literature and is worth further research.
[Figure 2.2: Generalization error as training proceeds under different gradient descent steps, including GD, NGD and Newton, with AdaIHT as the thresholding step (generalization error vs. number of iterations). Panels: (a) isotropic $\kappa = 1$, (b) an-isotropic $\kappa = 10$.]
[Figure 2.3: Change of the generalization error with respect to the condition number for ISTA, GD AdaIHT and Newton AdaIHT. The number of iterations of these methods is 200.]
[Figure 2.4: Generalization error as training proceeds under different thresholding methods, including Ada-IHT and Ada-HTP (generalization error vs. number of iterations). Panels: (a) isotropic $\kappa = 1$, (b) an-isotropic $\kappa = 20$.]
[Figure 2.5: Generalization error as training proceeds under Fast Newton's and Newton's method (generalization error vs. number of iterations). Panels: (a) isotropic $\kappa = 1$, (b) an-isotropic $\kappa = 10$.]
[Figure 2.6: Change of the generalization error with respect to the condition number for ISTA, GD+AdaIHT, GD+AdaHTP, FastNewton+AdaIHT, and FastNewton+AdaHTP. The true signal $a = 20$ and the number of iterations is 600.]
Conclusion
Although ISTA is a popular algorithm to solve sparse linear regression, it can perform badly
in terms of convergence speed and the resulting generalization error under an-isotropic design.
We have presented several adaptive thresholding methods that seem to outperform ISTA in general.
These methods are combinations of a gradient descent step (gradient descent, Newton, Fast Newton)
and a thresholding step (Ada-IHT, Ada-HTP). In conclusion, the Fast Newton method appears to be the
best candidate for being fast while achieving good generalization performance.
All the code can be found at https://github.com/Ruola/Sparse-Linear-Regression.git.
References
[1] Nicolai Meinshausen. Relaxed lasso. Computational Statistics & Data Analysis, 52(1):374–393, 2007.
[2] Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869–2909, 2014.
[3] Simon Foucart. Hard thresholding pursuit: An algorithm for compressive sensing. SIAM Journal on Numerical Analysis, 49(6), 2011.
[4] Thomas Blumensath and Mike E. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.
[5] Haoyang Liu and Rina Foygel Barber. Between hard and soft thresholding: optimal iterative thresholding algorithms, 2018.
[6] Mohamed Ndaoud. Scaled minimax optimality in high-dimensional linear regression: A non-convex algorithmic regularization approach, 2020.
[7] Alekh Agarwal, Sahand Negahban, and Martin J. Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery. The Annals of Statistics, pages 2452–2482, 2012.
[8] Jinghui Chen and Quanquan Gu. Fast Newton hard thresholding pursuit for sparsity constrained nonconvex optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 757–766, 2017.
[9] Yuchen Zhang, Martin J. Wainwright, and Michael I. Jordan. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. In Conference on Learning Theory, pages 921–948, 2014.