Performance trade-offs of accelerated first-order optimization
algorithms
by
Samantha Samuelson
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2024
Copyright 2024 Samantha Samuelson
Acknowledgements
I would like to express my sincere gratitude to the many family members, friends, and colleagues who have
supported me along my journey towards my PhD.
First and foremost, I am incredibly grateful to my advisor, Professor Mihailo R. Jovanović,
whose unwavering support has made this dissertation possible. He helped me pursue my research
interests with curiosity and creativity, and provided me with excellent advice and guidance. His
considerable experience and expertise have been truly invaluable over the course of my career, and
I am continually learning from him. He has worked through challenging problems, pored over
paper drafts, and provided countless opportunities for growth and learning. I am truly thankful for
all his support and encouragement throughout this challenging journey.
I am also grateful to Prof. Mahdi Soltanolkotabi, Prof. Meisam Razaviyayn, and Prof.
Ashutosh Nayyar for serving on my qualifying and defense committees, as well as for everything
I learned from their excellent courses. Their inspiring feedback and research suggestions, as well
as their friendliness and patience, have helped make this dissertation possible, and I have learned a
great deal from each of them.
I am thankful for the support and friendship of all my lab mates both past and present, including
Hesam Mohammadi, Dongsheng Ding, Ibrahim Ozaslan, Mohammad Tinati, and Dusan Bozic.
Their companionship and friendship have helped me weather discouragement and disappointment,
and their encouragement has helped me overcome challenges both technical and personal.
Finally, I would like to thank my family for their unwavering support. I would like to thank
my father Larry, whose love of mathematics inspired me to pursue my PhD and who always has
something new to teach me, and my partner Jason for reminding me there is more to life than
research.
Table of Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Transient growth of accelerated first order optimization algorithms . . . . . . . . . 3
1.2 Variants of the standard two-step accelerated algorithm . . . . . . . . . . . . . . . 3
1.3 Structure of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Contributions of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.1 Transient growth of accelerated first order algorithms . . . . . . . . . . . . 9
1.4.2 Averaging over algorithmic iterates of two-step momentum based first-order algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.3 Accelerated gradient flow dynamics of order d . . . . . . . . . . . . . . . 10
1.4.4 Three step accelerated first-order algorithms . . . . . . . . . . . . . . . . . 10
Chapter 2: Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Quadratic optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 LTI formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.2 Convergence rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.3 Transient growth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.4 Noise amplification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Chapter 3: Transient growth of accelerated optimization algorithms . . . . . . . . . . . . . 21
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Convex quadratic problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.1 Transient growth of accelerated algorithms . . . . . . . . . . . . . . . . . 24
3.2.2 Analytical expressions for transient response . . . . . . . . . . . . . . . . 27
3.2.3 The role of initial conditions . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 General strongly convex problems . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.5 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.5.1 Proofs of Section 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.5.1.1 Proof of Lemma 1 . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.5.1.2 Proof of Lemma 2 . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5.1.3 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5.1.4 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5.1.5 Proof of Proposition 2 . . . . . . . . . . . . . . . . . . . . . . . 39
3.5.2 Proof of Theorem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5.3 Proofs of Section 3.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5.3.1 Proof of Lemma 3 . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5.3.2 Proof of Lemma 4 . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5.3.3 Proof of Theorem 4 . . . . . . . . . . . . . . . . . . . . . . . . 44
Chapter 4: The effect of averaging on accelerated first-order optimization algorithms for
strongly convex quadratic problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Summary of main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Convergence rate and transient growth . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4 Variance amplification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.5 An example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.6 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.7 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.7.1 Proof of Lemma 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.7.2 Proof of Lemma 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.7.3 Proof of Lemma 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.7.4 Proof of Lemma 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.7.5 Proof of Lemma 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.7.6 Proofs of Propositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Chapter 5: Performance of noisy higher-order accelerated gradient flow dynamics for
strongly convex quadratic optimization problems . . . . . . . . . . . . . . . . . . . . . 86
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.2 Background for gradient flow dynamics . . . . . . . . . . . . . . . . . . . . . . . 88
5.2.1 Quadratic optimization problems . . . . . . . . . . . . . . . . . . . . . . . 89
5.2.2 Exponential stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2.3 Variance amplification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.3 Main results for gradient flow dynamics . . . . . . . . . . . . . . . . . . . . . . . 94
5.4 Analysis for gradient flow dynamics with d = 3 . . . . . . . . . . . . . . . . . . . 99
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.6 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.6.1 Proof of Theorem 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.6.2 Proof of Theorem 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.6.3 Proof of Theorem 13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.6.4 Proof of Theorem 14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.6.5 Proof of Theorem 15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Chapter 6: Performance of noisy three-step accelerated first-order optimization algorithms
for strongly convex quadratic problems . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.2 Three-step accelerated algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.2.1 Defining the ρ-convergence region . . . . . . . . . . . . . . . . . . . . . . 132
6.2.2 Deriving parameters which optimize convergence rate . . . . . . . . . . . 134
6.2.3 Bounding the variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.2.4 Results for specific parameters . . . . . . . . . . . . . . . . . . . . . . . . 137
6.3 Results for general discrete time algorithms of order d . . . . . . . . . . . . . . . 138
6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.5 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.5.1 Proof of Theorem 17 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.5.2 Proof of Theorem 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.5.3 Proof of Proposition 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.5.4 Proof of Proposition 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.5.5 Proof of Theorem 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
6.5.6 Proof of Theorem 18 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.5.7 Outline of Proof of Conjecture 1 . . . . . . . . . . . . . . . . . . . . . . . 156
Chapter 7: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
List of Tables
2.1 Conventional values of parameters and the corresponding convergence rates for
f ∈ F_m^L, where κ := L/m [13, Theorems 2.1.15, 2.2.1]. The heavy-ball method
does not offer acceleration guarantees for all f ∈ F_m^L. . . . . . . . . . . . . . . . . 15
2.2 Optimal parameters and the corresponding convergence rate bounds for a strongly
convex quadratic objective function f ∈ F_m^L with λmax(∇²f) = L, λmin(∇²f) = m,
and κ := L/m [9, Proposition 1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1 Parameters that provide optimal convergence rates for a convex quadratic objective
function (2.7) with κ := L/m. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1 Behavior of v2(µ1(a0,a1), µ2(a0,a1), j) on the ρ-stability region shown in Figure 4.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
List of Figures
3.1 Error in the optimization variable for Polyak's heavy-ball (black) and Nesterov's
(red) algorithms with the parameters that optimize the convergence rate for a
strongly convex quadratic problem with the condition number 10^3 and a unit-norm
initial condition with x^0 ≠ x^⋆. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Dependence of the error in the optimization variable on the iteration number for the
heavy-ball (black) and Nesterov's methods (red), as well as the peak magnitudes
(dashed lines) obtained in Proposition 2 for two different initial conditions with
∥x^1∥_2 = ∥x^0∥_2 = 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1 On the left we show numerically calculated values of the argmax of h1(t) and h2(t)
for various values of ρ (dotted lines), alongside the estimate of tmax given in equation (4.12) (solid red line). On the right we compare numerically calculated
values of the maxima of h1(t) and h2(t) for varying values of ρ (dotted dark purple and light purple respectively), and the estimates presented in equation (4.14)
(solid blue lines). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2 Stability region and ρ-convergence region ∆ρ of the two-step accelerated algorithm. As introduced in [35], for the linear sub-system associated with eigenvalue
λi of Q defined in (2.10), we see the set of a0(λi) and a1(λi) for which the system is stable (blue) and the set of a0(λi) and a1(λi) for which it converges with
rate ρ (yellow). The ρ-convergence region ∆ρ is defined by a0 ∈ [−ρ^2, ρ^2] and
a1 ∈ [−ρ^{-1}(a0 + ρ^2), ρ^{-1}(a0 + ρ^2)]. . . . . . . . . . . . . . . . . . . . . . 60
4.3 For a given ρ-convergence region ∆ρ, dashed lines show the line segments
(a0(λ), a1(λ)) for λ ∈ [m, L] for the subset of two-step momentum algorithms
with parameters given in (4.20). The hyperparameter c gives the normalized distance of the (a0(λ), a1(λ)) line from the a1 axis. The heavy-ball parameters with
c = 1 lie along the XρYρ edge and are shown in blue, while the gradient descent
parameters with c = 0 lie along the a1 axis and are shown in green. . . . . . . . . . 63
4.4 Transient response of the heavy-ball algorithm with optimal parameters and ρ =
0.98, in the case of no averaging (d = 1), averaging over a moving window of
fixed integer length (d = 10 and d = 30), and averaging over the entire algorithmic
history (d = t). We consider three separate cases where the Hessian Q has s eigenvalues at the Xρ corner of ∆ρ and n − s eigenvalues at the Yρ corner of ∆ρ, for s = 9,
s = 5, and s = 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.5 Variance of the averaged output at time t of the heavy-ball algorithm with optimal
parameters where the Hessian Q has s eigenvalues at the Xρ corner of ∆ρ and n−s
eigenvalues at the Yρ corner of ∆ρ , with ρ = 0.98. . . . . . . . . . . . . . . . . . 68
4.6 Steady-state variance as a function of the number of history terms d. The figures show
how the steady-state variance V_d^∞ of the output averaged over a window of fixed
integer length d decreases as window length d increases, for systems where the
Hessian Q has s eigenvalues at the Xρ corner of ∆ρ and n − s eigenvalues at the Yρ
corner of ∆ρ, with ρ = 0.98. On the left we see the overall trend in d; the figure
on the right focuses on the behavior when d is small. Similarly to Figure 4.5, it is
apparent that the variance is more greatly reduced when Q has more eigenvalues
at λ = L. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.7 Variance V̂_d^∞(µ1(a0,a1), µ2(a0,a1)) as a function of a1 for fixed a0 and given window length d, along with proposed lower bounds f1(µ1,µ2,d) and f3a(µ1,µ2,d)
defined in (4.37) and (4.39) respectively, for different values of a0 and d. The
vertical line in red marks a1 = −a0(1 + a0), while the red dots mark the crossover points of f1 and f3a with V̂_d^∞, at which points the lower bounds fail. Figure
(a) demonstrates that, in accordance with equations (4.38) and (4.40), V̂_d^∞ > f1 for
all a1 to the right of the red line, while V̂_d^∞ > f3a for all a1 to the left of the red
line. Figure (b) demonstrates the non-convexity of V̂_d^∞ which arises when µ1 and
µ2 are complex, which makes determining an exact minimum difficult. Figure (c)
demonstrates the lower bound given in (4.41). . . . . . . . . . . . . . . . . . . . . 77
4.8 Geometry of the ρ-convergence region for two-step accelerated algorithms. Figure (a): For a given (a0,a1) point within the ρ-convergence region ∆ρ, w :=
(1 + a0 + a1) gives the horizontal distance to the XZ edge, while h := (1 − a0) gives
the vertical distance to the XY edge. Figure (b): Regions of different behavior for the variance component v(µ1,µ2, j) = lim_{k→∞} E[x̂_i^k x̂_i^{k+j}] defined in (4.43),
with µ1(a0,a1) and µ2(a0,a1) defined in (4.15c). The green triangle, defined by
a1 ∈ [−√(4a0), √(4a0)], shows the region where eigenvalues µ1 and µ2 are complex
conjugates, on which v(µ1,µ2, j) is bounded. When eigenvalues µ1 and
µ2 are real and a1 ≤ 0, shown by the yellow triangle, v(µ1,µ2, j) is strictly decreasing in a1. When eigenvalues µ1 and µ2 are real and a1 ≥ 0, shown by the orange
triangle, v(µ1,µ2, j) is either strictly increasing or decreasing in a1 depending on
the parity of j, as shown in Table 4.1. . . . . . . . . . . . . . . . . . . . . . . . . 79
5.1 Stability region for third-order gradient flow dynamics. On the left is the ρ-stability
region in terms of coefficients a0, a1, and a2 for ρ = κ^{-1/3} at κ = 100. Each
color indicates a level set of a0(λ) = αλ. Notice that the set of a1(λ), a2(λ)
for which the system achieves ρ-stability grows larger as λ increases, and at λ =
m the stability region condenses to a single point. The black line corresponds
to parameters β1 = 2κ^{-2/3}, γ1 = κ^{1/3}, β2 = 3κ^{-1/3}, γ2 = 0, for which the end
points of the line segment (a0(λ), a1(λ), a2(λ)) for λ = m and λ = L are given
by (κ^{-1}, 3κ^{-2/3}, 3κ^{-1/3}) and (1, 2κ^{-2/3} + κ^{1/3}, 3κ^{-1/3}), respectively. The red
line corresponds to the parameter set β1 = κ^{-2/3}, γ1 = 2κ^{1/3}, β2 = 2κ^{-1/3}, γ2 =
κ^{2/3}, with end points (κ^{-1}, 3κ^{-2/3}, 3κ^{-1/3}) and (1, κ^{-2/3} + 2κ^{1/3}, 2κ^{-1/3} + κ^{2/3}).
These parameter sets both yield the optimal rate. On the right is shown a level set of
the ρ-stability region at λ = (L + m)/2. The edges of the region are determined
by the four constraints given by the Routh-Hurwitz criterion as described in (5.31).
The system is stable when all constraints are positive, as shown by the shaded
region. Notice that the level set is convex; the non-convexity appears in a0 as we
vary λ, as seen on the left. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2 The possible placement of complex conjugate roots with fixed product µµ̄ = αλ
lies on the circle in red. It is evident that if we wish to minimize their real part,
both roots must lie on the real axis at µ = µ̄ = −√(αλ). . . . . . . . . . . . . . . . 107
6.1 Stability and ρ-convergence regions for the three-step momentum algorithm. Figure (a) shows the 3-D stability region in a0, a1, and a2. Different shades of blue
correspond to the level sets of a0. Figure (b) shows a level set of the stability region
and the ρ-convergence region for ρ = 0.7, at a0 = 0.1, as defined by the positivity
constraints of (6.8) and (6.10). At a0 = ±ρ^3, the convergence region collapses to
a single line, and at a0 = 0, ∆ρ(0) recovers the 2-D case. . . . . . . . . . . . . . . 133
6.2 Overlaid a0 level sets of ∆ρ(a0) at ρ = 0.9, for a0(m) = −ρ^3/3 and a0(L) = ρ^3/3
in red and blue respectively. In black and gray we see two examples of a parameterized line (a2(λ), a1(λ)) which runs from the Xρ(m)Zρ(m) edge to the
Zρ(L)Yρ(L) edge. We can see that in order to satisfy the constraint corresponding to the Xρ(λ)Yρ(λ) edge as a0 changes, the a1/a2 slope must be more negative
than one might expect, and (a1, a2)(L) are both smaller than they would be at the
Yρ(m) vertex. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Abstract
First-order accelerated optimization algorithms are widely used in a variety of data-driven and
distributed control and learning applications, in which gradient estimates may be corrupted by
stochastic noise. In this dissertation, we employ tools of classical control theory to examine the performance of momentum-based accelerated first-order optimization algorithms, used primarily to solve
strongly convex quadratic problems in the presence of additive white stochastic disturbances. We
consider three key performance metrics: the convergence rate of the optimization error in expectation,
the worst-case transient growth of the normalized optimization error in non-asymptotic time frames, and
the steady-state variance (expected squared error of the optimization variable) arising from gradient
noise.
We first examine the transient behavior of accelerated first-order optimization algorithms in
the absence of noise. For convex quadratic problems, we employ tools from linear systems theory
to quantify the transient deviation from the optimal solution. For strongly convex smooth
optimization problems, we utilize the theory of integral quadratic constraints (IQCs) to establish an
upper bound on the magnitude of the transient response of Nesterov's accelerated algorithm. For
Nesterov's accelerated method with a smooth strongly convex objective function, we show that both
the maximum normalized optimization error, over iteration number and possible initial conditions,
and the rise time to the worst-case transient peak are proportional to the square root of the condition
number κ of the problem.
We next propose two variations to the class of standard two-step momentum-based algorithms,
and investigate their effects on the performance metrics above. Our goal is to reduce worst-case
transient growth and steady-state variance without compromising convergence rate. We consider
post-algorithmic averaging over the iterations of the optimization algorithms, and introducing additional momentum terms which reach further into algorithmic history in order to update each
iteration.
In the case of output averaging, for strongly convex quadratic problems, we show that averaging
over the entire algorithmic history eliminates steady-state variance of the averaged output at the
expense of slowing down convergence to a sub-linear rate. In contrast, averaging over a finite
window of fixed length achieves convergence with a linear rate, but leads to a non-zero value of
the steady-state variance. While this value is smaller than the steady-state variance of the iterates
of the heavy-ball algorithm, it has the same order-wise dependence on the condition number. We
also show that output averaging reduces the worst-case transient peak, up to a certain minimum
value, while retaining the dependence of transient behavior on condition number κ.
For algorithms with additional momentum terms, we consider accelerated dynamics in both
continuous and discrete time. In continuous time, for strongly convex quadratic problems with a
condition number κ, we determine the best possible convergence rate of continuous-time gradient
flow dynamics of order d and demonstrate that higher order terms do not affect the trade-offs
between convergence rate and variance amplification that exist for gradient flow dynamics with
d = 2. We show that increasing the order d improves both convergence rate and optimal steady-state variance, while the product of steady-state variance and algorithmic settling time is lower
bounded by a constant factor of κ for any order d. In discrete time, for strongly convex quadratic
problems and first-order algorithms with three momentum terms, we determine parameters which
achieve optimal rate of convergence ρ with respect to condition number κ, which previous research
has established cannot be improved. We additionally investigate how additional momentum affects
the trade-offs between rate of convergence, noise amplification, and condition number. Similarly
to the continuous time case, the lower bound on the product of steady-state variance and settling
time in terms of the square of the condition number is preserved with the introduction of the third
history term. In contrast to gradient flow dynamics, we show that moving from the two-step to
the three-step algorithm increases steady-state variance at optimal parameters and does not improve
convergence.
Chapter 1
Introduction
First-order optimization algorithms are widely used in a variety of fields including statistics, signal and image processing, control, and machine learning [1–8], due to their favorable asymptotic
behavior [9–13] and low per-iteration complexity. There is a vast literature focusing
on the convergence properties of accelerated algorithms for different stepsize rules and acceleration parameters, including [13–16]. A growing body of work also considers the implementation
of accelerated first-order algorithms in stochastic settings [17–27]. The motivation for the study
of accelerated first-order algorithms in the presence of noise arises from applications in which
only estimates of the gradient are available, or when algorithmic iterates themselves may be
corrupted by a noisy communication channel.
For example, in machine learning applications [21, 22, 28] and model-free optimal control [29–
31], estimates of the objective function are obtained by inexact simulations or interaction with a
real-time system via noisy sensor measurements. In stochastic gradient descent, the gradient at
each iteration is calculated from a small subset of samples, and gives only a noisy estimate of
the true gradient. Finally, in distributed optimization problems, communication between agents
may be corrupted by noise. In addition, first-order algorithms are often employed with a limited
number of iterations as part of larger multi-step optimization methods. In such applications, the study
of algorithmic behavior in non-asymptotic time frames is equally important.
Previous work has established that the favorable convergence behavior of accelerated algorithms
relative to gradient descent comes at the expense of undesirable transient responses [32, 33] due to
non-normal dynamics, and of increased sensitivity to gradient noise [21–27], which exhibits undesirable scaling with the condition number [34–36]. The trade-off between acceleration and robustness
has been well studied [28, 37–39], determining that increased acceleration comes at the price of
increased sensitivity to noise. Previous work [34, 35] establishes a fundamental limitation on the
product of noise amplification and settling time imposed by the condition number. For strongly convex quadratic problems, [34] provides bounds on noise amplification for standard methods with
optimal parameters which show that accelerated algorithms increase noise amplification by a factor of √κ relative to gradient descent. For the general class of two-step first-order momentum
algorithms, [35] establishes lower bounds on the product of settling time and steady-state variance
which scale with κ^2 and indicate a fundamental limitation of this class of algorithms.
In this dissertation we investigate the transient behavior of accelerated first-order algorithms
for strongly convex problems and, specifically for quadratic problems, extend the results of [34, 35]
quantifying the trade-off between convergence speed and accuracy of first-order algorithms in a
stochastic setting. We begin by quantifying the worst-case transient response of standard two-step accelerated first-order methods for the class of smooth strongly convex objective functions.
We next propose two modifications to the class of standard two-step momentum algorithms in
an attempt to improve the trade-off between convergence rate and steady-state variance. First, we
propose averaging the algorithmic iterates x^t over time, and consider the effect on convergence rate,
transient growth, and variance, considering both the average over the entire algorithmic history and
the average over a moving window of fixed length. Next, we propose introducing additional history
terms, so that the current estimate x^t is based on more than two previous estimates, and consider
the effect on the trade-off between convergence rate and steady-state variance in both discrete- and
continuous-time settings.
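The standard two-step momentum updates discussed throughout can be sketched as follows. This is a minimal illustration, not the dissertation's code; the quadratic instance and the parameter choices (the standard rate-optimal tunings for quadratics, e.g. α = 4/(√L + √m)² and β = ((√κ − 1)/(√κ + 1))² for heavy-ball) are assumptions for the example.

```python
import numpy as np

def heavy_ball(grad, x0, alpha, beta, iters):
    """Polyak's heavy-ball: x_{t+1} = x_t + beta*(x_t - x_{t-1}) - alpha*grad(x_t)."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        x, x_prev = x + beta * (x - x_prev) - alpha * grad(x), x
    return x

def nesterov(grad, x0, alpha, beta, iters):
    """Nesterov's method: the gradient is evaluated at the extrapolated point."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        y = x + beta * (x - x_prev)
        x, x_prev = y - alpha * grad(y), x
    return x

# Strongly convex quadratic f(x) = (1/2) x^T Q x with condition number kappa.
m, L = 1.0, 100.0
kappa = L / m
Q = np.diag([m, L])
grad = lambda x: Q @ x
x0 = np.array([1.0, 1.0])

# Heavy-ball with the parameters that optimize the convergence rate on quadratics.
alpha_hb = 4.0 / (np.sqrt(L) + np.sqrt(m)) ** 2
beta_hb = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2
err_hb = np.linalg.norm(heavy_ball(grad, x0, alpha_hb, beta_hb, 200))

# Nesterov with the conventional alpha = 1/L, beta = (sqrt(kappa)-1)/(sqrt(kappa)+1).
alpha_nag = 1.0 / L
beta_nag = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
err_nag = np.linalg.norm(nesterov(grad, x0, alpha_nag, beta_nag, 200))
```

Both runs drive the error to essentially zero in 200 iterations on this well-behaved instance; the differences studied in this dissertation appear in the transient and in the response to gradient noise.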
1.1 Transient growth of accelerated first order optimization
algorithms
While the optimization error of the standard gradient descent algorithm is monotonically decreasing for strongly convex problems, accelerated first-order algorithms may display aberrant transient
behavior, where, depending on the initialization of the algorithm, the optimization error increases
in initial iterations. For applications where unlimited iterations are not available, such behavior
may be problematic. In particular, multi-stage optimization algorithms such as ADMM perform
only a few iterations of first-order methods at each stage, and undesirable transient growth may
negate the benefits of acceleration.
For strongly convex quadratic problems, we quantify the transient response of accelerated first-order algorithms by bounding the magnitude of the largest optimization error over iterations, normalized by the magnitude of the initial conditions. We provide upper and lower bounds on the
transient response at any iteration, as well as bounds on the iteration number and magnitude of the worst-case transient peak, in terms of the convergence rate. In addition, we show these bounds scale proportionally to the square root of the condition number of the objective function, and investigate the
influence of initial conditions on transient behavior. For general strongly convex problems, we use
integral quadratic constraints to determine bounds on the transient peak, specifically for Nesterov's
accelerated method.
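A quick simulation makes the transient growth phenomenon concrete. The instance, the initialization, and κ = 10³ below are assumptions for illustration, not the dissertation's worst-case construction: with rate-optimal heavy-ball parameters, the error rises well above its initial value before decaying.

```python
import numpy as np

# Ill-conditioned quadratic f(x) = (1/2) x^T Q x with kappa = 1e3.
m, L = 1.0, 1e3
Q = np.diag([m, L])
kappa = L / m

# Heavy-ball parameters that optimize the convergence rate on quadratics.
alpha = 4.0 / (np.sqrt(L) + np.sqrt(m)) ** 2
beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2

# Unit-norm initial condition x^0, with x^{-1} = 0 (an assumed initialization
# chosen to excite the transient; the optimum is x^* = 0).
x = np.array([1.0, 0.0])
x_prev = np.zeros(2)

errs = [np.linalg.norm(x)]
for _ in range(300):
    x, x_prev = x + beta * (x - x_prev) - alpha * (Q @ x), x
    errs.append(np.linalg.norm(x))

# The error peaks several times above its initial value before the
# linear convergence rate takes over.
peak = max(errs)
```

Gradient descent on the same instance would decrease the error monotonically; the growth here comes from the non-normal dynamics introduced by the momentum term.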
1.2 Variants of the standard two-step accelerated algorithm
In many applications of first-order algorithms, exact gradient information is unavailable. For example, the gradient estimate may be obtained via interaction with a real-world system through
noisy sensor data, and in data-driven control problems sample gradients provide a noisy estimate
of the true gradient. The uncertainty in the gradient in these cases can be modeled as additive white
noise, with zero mean and uncorrelated samples. When the gradient is corrupted by zero-mean additive noise, the algorithm converges in expectation, but without a zero gradient at the optimal
value, iterates converge to a steady-state distribution around the optimal value, rather than to a fixed
point. In order to quantify the accuracy and robustness of first-order algorithms in the presence of
gradient uncertainty, we examine the expected squared error of this steady-state distribution, called
the steady-state variance.
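The steady-state variance can be estimated by Monte Carlo simulation. The sketch below is an assumed instance (unit-variance white noise added to the gradient, a mildly ill-conditioned quadratic, rate-optimal heavy-ball parameters), not a result from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Strongly convex quadratic f(x) = (1/2) x^T Q x; the gradient is corrupted
# by additive white noise w_t with zero mean and identity covariance.
m, L = 1.0, 10.0
Q = np.diag([m, L])
kappa = L / m
alpha = 4.0 / (np.sqrt(L) + np.sqrt(m)) ** 2
beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2

x_prev = x = np.zeros(2)   # start at the optimum x^* = 0
samples = []
for t in range(100_000):
    noisy_grad = Q @ x + rng.standard_normal(2)
    x, x_prev = x + beta * (x - x_prev) - alpha * noisy_grad, x
    if t > 10_000:                      # discard the initial transient
        samples.append(x @ x)

# Time average of ||x_t - x^*||^2 approximates the steady-state variance.
variance = float(np.mean(samples))
```

Even though the iterates start exactly at the optimum, the noise keeps them in a persistent steady-state distribution around it; `variance` estimates the expected squared error of that distribution.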
The robustness of first-order algorithms to additive inexact gradient information has been well
studied [23–27]. One approach to the persistence of excitation at steady-state due to gradient noise
is to use a diminishing step-size, which drives the update term to zero even in the presence of
additive noise and ensures the algorithm converges to a single stationary point [40]. It has been
shown that using a diminishing step-size can achieve acceleration for the class of general convex
problems in the presence of stochastic disturbances [37], but in general decaying step-sizes result in
sub-linear convergence rates [18,41]. Another approach to overcoming gradient noise is averaging
in conjunction with slower decaying step-sizes [17, 23, 42, 43], which has been shown to improve
the convergence behavior of stochastic gradient descent in a variety of applications [44–46]. In a
similar approach, recent work [10] introduced the triple momentum method, which performs an
additional convex combination on the algorithm state across two time-steps in order to achieve an
improved convergence rate for the class of smooth strongly convex objective functions.
Previous work [34, 35] has quantified the affect of acceleration on the steady-state variance in
terms of the convergence rate of the algorithm and condition number of the objective function for
strongly convex problems. They present two parameterized families of algorithms that are orderwise (in terms of the condition number) Pareto-optimal for simultaneous minimization of settling
time and steady-state variance. Recent work [47] further introduces a multi-stage stochastic gradient algorithm, where momentum parameters are held constant throughout each stage but updated
sequentially at the beginning of each stage, in order to adjust performance to favor convergence
rate or variance reduction at each stage. The results in [34, 35] demonstrate a fundamental limitation in the performance of two-step accelerated algorithms regarding the balance between speed
and accuracy. Motivated by that limitation, in this dissertation, we consider two variations on
the standard two-step accelerated algorithms in the hopes of improving the steady-state variance
without sacrificing convergence speed.
As mentioned above, one of the simplest approaches to mitigating the cost of increased steady-state variance associated with accelerated algorithms is to average the algorithmic iterates over
time. Intuitively, the average output is expected to produce a smaller steady-state variance, given
that in general sample variance decreases with sample size. Similarly, the expected value of the
error in the averaged output at a given iteration t should be larger than the error resulting from
non-averaged algorithmic iterates. In this chapter, we quantify the effects of averaging on the expected error and expected squared error of the general class of two-step accelerated algorithms applied
to strongly convex quadratic problems. In particular, we investigate whether output averaging can
reduce steady-state variance sufficiently to overcome the fundamental limitation on the product of
variance and settling time given in [34–36] in terms of the condition number κ.
We consider two approaches to averaging: first, the average x̄^t over a moving window of a fixed length d, and second, the average over the entire algorithmic history. We show that averaging over
a fixed window length d reduces variance by a factor of approximately 1/d while maintaining the rate of convergence, but preserves a lower bound on the product of variance and settling time that scales
with the condition number of the objective function. On the other hand, averaging over the entire
algorithmic history drives steady-state variance to zero, at the price of reducing the convergence
rate to sub-linear, similarly to algorithmic variants which use a decaying step-size.
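The variance-reduction mechanism behind windowed averaging can be illustrated with a minimal numerical sketch. The scalar noisy recursion below is only a stand-in for the two-step algorithms analyzed later, and because successive iterates are correlated, the finite-d reduction in this toy model is weaker than the 1/d rate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy gradient descent on f(x) = x^2/2 (minimizer x* = 0):
# x^{t+1} = (1 - alpha) x^t + w^t with white noise w^t.
alpha, T = 0.1, 200_000
x, xs = 0.0, np.empty(T)
for t in range(T):
    x = (1 - alpha) * x + rng.normal()
    xs[t] = x

def windowed_variance(d):
    """Steady-state sample variance of the length-d moving average."""
    avg = np.convolve(xs, np.ones(d) / d, mode="valid")
    return np.mean(avg[len(avg) // 2:] ** 2)   # discard the transient half

v1, v10, v100 = windowed_variance(1), windowed_variance(10), windowed_variance(100)
print(v1, v10, v100)   # variance decreases monotonically with the window d
```

The choice of step-size, noise model, and window lengths here is arbitrary; the point is only the qualitative trend of decreasing steady-state variance with increasing window length.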
Next we consider the effect of additional momentum on algorithmic performance, specifically convergence rate and steady-state variance. In discrete time, additional momentum means increasing the number of previous iterates x^{t−k} used to update the current estimate of the optimum x^t.
In continuous time, we consider gradient flow dynamics of order d which converge to the minimizer of the optimization objective. Given the established effects on convergence rate and steady-state variance of moving from first-order algorithms which update based on only one previous iterate, i.e., gradient descent, to first-order algorithms which update based on two previous iterates, i.e., heavy-ball
and Nesterov’s accelerated methods, we are motivated to investigate how these quantities, and the
trade-off between them, are affected by the introduction of further momentum terms. For unconstrained strongly convex optimization problems, previous work [34, 35] establishes a fundamental
limitation on the product of noise amplification and settling time imposed by the condition number of
the objective function.
We first consider the continuous time setting, which is easier to analyze. The connection between ordinary differential equations and iterative optimization algorithms is well established [48–56]. Recently, a second-order continuous-time dynamical system with constant coefficients for which a certain implicit-explicit Euler discretization yields Nesterov’s accelerated
algorithm was introduced in [57]. For strongly convex problems, these accelerated gradient flow
dynamics were shown to be exponentially stable with rate 1/√κ, where κ is the condition number
of the problem. A more recent work [35] examined the tradeoffs between convergence rate and
robustness to additive white noise of accelerated gradient flow dynamics and established a lower
bound on the product between steady-state variance of the error in the optimization variable and
the settling time that scales with κ². For this class of accelerated dynamical systems, there appears
to be a fundamental limitation between convergence rate and variance amplification imposed by
the condition number.
We consider gradient flow dynamics of order d which can be used to solve unconstrained
strongly convex quadratic optimization problems, and establish the connection between order d and
convergence rate and steady-state variance. We establish the optimal convergence rate ρ = κ^{−1/d}
and identify the complete set of constant algorithmic parameters that achieve the optimal rate, in
addition to identifying parameters which minimize steady-state variance for a given convergence
rate ρ. Finally, we show that the lower bound on the product of variance and settling time scales with
the square of the condition number for any order d.
We next consider three-step accelerated first-order algorithms implemented in discrete time with a strongly convex quadratic objective function, and investigate how the addition of a single
additional momentum term affects steady-state variance. Existing work [13] has established that
for strongly convex quadratic problems, the optimal convergence rate achieved by the heavy-ball method cannot be improved upon by any first-order method, indicating that further acceleration will
not result in further improvements in convergence rate, unlike the results we obtain in continuous
time for gradient flow dynamics. Instead, we present the generalized set of parameters which
optimize convergence rate, which includes parameters corresponding to the heavy-ball method,
and examine the steady-state variance at these parameters. We additionally show that for any stabilizing parameters, the product of variance and settling time is lower bounded by a constant factor of κ², indicating that additional acceleration does not offer significant improvement upon
the standard two-step methods.
1.3 Structure of the Dissertation
The rest of the dissertation is structured as follows. In the rest of Chapter 1, we describe the problem
setting and provide key control theoretic tools we will use throughout the rest of the work. The
majority of our results are restricted to the class of strongly convex quadratic objective functions,
for which first-order accelerated algorithms can be cast as a linear time-invariant (LTI) system,
which allows us to leverage tools from linear systems to determine convergence behavior, maximum transient growth, and variance. An overview of these tools is presented in Section 2.2 of Chapter 2.
In Chapter 3, we establish bounds on the transient response in terms of the convergence rate
and the iteration number. Without imposing restrictions on initial conditions, we show that both the
peak value of the transient response and the rise time to this value increase with the square root of
the condition number of the problem. In Section 3.2 we consider convex quadratic problems, and
utilize the tools from linear systems theory to fully capture transient responses. In Section 3.3 we
consider general strongly convex problems, and employ the theory of integral quadratic constraints
to establish an upper bound on transient growth. We show this upper bound to be tight by identifying quadratic problem instances for which the worst-case transient growth of the normalized
optimization error is within a constant factor of this upper bound.
In Chapter 4, we propose averaging the algorithmic output over time and investigate the effects
on convergence rate, transient growth, and variance. We show that averaging the algorithmic output
over the entire algorithmic history results in zero steady-state variance at the expense of yielding a
sub-linear convergence rate. On the other hand, averaging over a moving window of fixed length d
reduces steady-state variance relative to the case without averaging while maintaining linear convergence rate. We demonstrate that the steady-state variance of the windowed average is reduced
by a factor of roughly 1/d compared to the steady-state variance of the non-averaged output x^t, but the quadratic dependence on the condition number still remains.
In Chapter 5 we propose adding additional momentum terms to accelerated gradient flow dynamics, and investigate the effects on convergence rate and steady-state variance. In the continuous
time setting, we show that additional momentum can increase convergence rate while decreasing
variance. The product of steady-state variance and settling time maintains scaling with the square
of the condition number. In Section 5.4 we provide a detailed examination of third-order gradient
flow dynamics.
In Chapter 6 we consider the three-step momentum algorithm, where the algorithmic state x^t at iteration t is updated based on three previous iterates rather than two. We determine that additional
momentum does not impair convergence rate, and the lower bound on the product of variance and
settling time maintains scaling with the square of the condition number of the problem.
1.4 Contributions of the Dissertation
In this section we summarize the main contributions of each chapter of the dissertation.
1.4.1 Transient growth of accelerated first-order algorithms
We quantify the transient behavior of the optimization error of accelerated first-order algorithms
by bounding the largest value of the Euclidean distance between the optimization variable and
the global minimizer, normalized by the magnitude of the initial conditions, in terms of the algorithmic convergence rate. For strongly convex quadratic problems, for the heavy-ball method
and Nesterov’s accelerated method, the magnitude of the worst-case transient peak, and the iteration at which it occurs, both scale with the square root of the condition number of the objective
function. We additionally identify initial conditions which produce large transients. For strongly
convex problems, we provide an upper bound on transient growth which, for Nesterov’s accelerated
method, scales with the square root of the condition number, and show the bound is tight.
1.4.2 Averaging over algorithmic iterates of two-step momentum-based first-order algorithms
We examine the effect of averaging over algorithmic iterates of two-step accelerated first-order optimization algorithms for strongly convex quadratic problems on convergence rate, transient growth,
and steady-state variance. We first show that averaging algorithmic iterates over the entire history
eliminates steady-state variance at the expense of reducing the convergence rate to sub-linear. We additionally show that averaging over a moving window of fixed length d maintains the linear rate of convergence, while reducing steady-state variance by a factor of approximately 1/d. For
this averaging scheme, we determine that the product of steady-state variance and settling time is
proportional to the square of the condition number of the objective function, as is the case for the
standard two-step algorithm, indicating the fundamental trade-off between variance and convergence is not avoided. Similarly, we show that windowed averaging reduces the magnitude of the worst-case transient peak up to a minimal value achieved by averaging over the entire algorithmic
history. In both cases, for heavy-ball and Nesterov’s accelerated methods, the transient peak is
proportional to the square root of the condition number.
1.4.3 Accelerated gradient flow dynamics of order d
We study performance of momentum-based accelerated first-order optimization algorithms in the
form of gradient flow dynamics in the presence of additive white stochastic disturbances. For
strongly convex quadratic problems with a condition number κ, we determine the best possible
convergence rate of continuous-time gradient flow dynamics of order d, and present the complete family of parameters which achieve this rate. For any stabilizing parameters, we show that
the product of steady-state variance and settling time is proportional to the square of the condition
number for any integer d, which indicates that additional momentum terms do not affect the fundamental trade-offs between convergence rate and variance amplification that exist for second-order
gradient flow dynamics. Finally, we present parameters which minimize steady-state variance for
any given convergence rate ρ, and present tight bounds on the variance at these parameters, which
are proportional to 1/√d, indicating that increasing the order d can both improve convergence and
decrease steady-state variance, even though the dependence on the square of the condition number
remains.
1.4.4 Three step accelerated first-order algorithms
For strongly convex quadratic problems, we investigate the performance of accelerated first-order
algorithms with three momentum terms, in which the optimization variable is updated using information from three previous iterations, as opposed to two. While two-step momentum algorithms
such as heavy-ball and Nesterov’s accelerated methods achieve the optimal convergence rate for
quadratic problems, it is an open question whether the three-step momentum method can offer advantages in steady-state variance. For strongly convex quadratic problems, we identify algorithmic
parameters which achieve the optimal convergence rate and examine how additional momentum
terms affect the trade-offs between acceleration and noise amplification. We show that introducing
additional momentum increases the distance between the minimal and maximal contributions to
variance, and that for parameters which achieve the optimal convergence rate, the steady-state variance increases as the update equation puts more weight on the third iterate in history. Furthermore,
we show that for any stabilizing parameters, the product of steady-state variance and settling time
admits a lower bound proportional to the square of the condition number. Our results suggest that
introducing further momentum to two-step accelerated algorithms offers limited advantage regarding the trade-off between convergence rate and variance amplification, and only for parameters
which achieve a sub-optimal convergence rate.
Chapter 2
Problem formulation
In this chapter we present the problem setting and background information for all chapters. In particular, most of our work concerns optimization over the class of strongly convex quadratic objective functions, for which first-order optimization algorithms can be cast as linear time-invariant systems. The transformation is outlined in Section 2.2.
2.1 Background
The unconstrained optimization problem

minimize_x  f(x)   (2.1)

where f : R^n → R is a convex and smooth function, can be solved using the class of two-step momentum algorithms,

x^{t+2} = β_1 x^{t+1} + β_0 x^t − α ∇f(γ_1 x^{t+1} + γ_0 x^t) + w^t   (2.2)
where t is the iteration index, α is the stepsize, β_k and γ_k are the algorithmic parameters, and w^t is a white noise that can account for uncertainty in gradient evaluation, with

E[w^t] = 0,   E[w^t (w^τ)^T] = I δ(t − τ).   (2.3)
First-order optimality conditions impose the following constraints on β_k and γ_k,

β_0 + β_1 = 1,   γ_0 + γ_1 = 1   (2.4)
and for particular choices of these parameters we uncover familiar gradient-based methods: (i)
gradient descent (γ0 = β0 = 0, γ1 = β1 = 1); (ii) Polyak’s heavy-ball method (γ0 = 0, γ1 = 1); and
(iii) Nesterov’s accelerated algorithm (γ0 = β0, γ1 = β1).
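To make these parameter correspondences concrete, here is a minimal sketch of update (2.2) with the noise w^t set to zero; the quadratic test function and parameter values are hypothetical choices for illustration only:

```python
import numpy as np

def two_step_momentum(grad, x0, x1, alpha, beta1, gamma1, iters):
    """Two-step momentum update (2.2) with w^t = 0; the constraints (2.4)
    fix beta0 = 1 - beta1 and gamma0 = 1 - gamma1."""
    beta0, gamma0 = 1.0 - beta1, 1.0 - gamma1
    xp, xc = x0, x1                               # x^t and x^{t+1}
    for _ in range(iters):
        xp, xc = xc, beta1 * xc + beta0 * xp - alpha * grad(gamma1 * xc + gamma0 * xp)
    return xc

# Illustrative quadratic f(x) = 1/2 x^T Q x with m = 1, L = 10; minimizer x* = 0.
Q = np.diag([1.0, 10.0])
grad = lambda x: Q @ x
x0 = np.array([1.0, 1.0])

# gamma1 = beta1 = 1 recovers gradient descent; gamma1 = 1 with
# beta1 = 1 + b recovers the heavy-ball method with momentum b.
x_gd = two_step_momentum(grad, x0, x0, alpha=0.1, beta1=1.0, gamma1=1.0, iters=200)
x_hb = two_step_momentum(grad, x0, x0, alpha=0.23, beta1=1.27, gamma1=1.0, iters=200)
print(np.linalg.norm(x_gd), np.linalg.norm(x_hb))  # both near zero
```

The specific stepsize and momentum values above are ad hoc; they are chosen only so that both runs are stable on this test problem.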
We denote by F^L_m the set of functions f that are m-strongly convex and L-smooth; f ∈ F^L_m means that f(x) − (m/2)∥x∥² is convex and that the gradient ∇f is L-Lipschitz continuous. In particular, for a twice continuously differentiable function f with Hessian matrix ∇²f, we have

f ∈ F^L_m  ⇔  mI ⪯ ∇²f(x) ⪯ LI,  ∀x ∈ R^n   (2.5)
where I is the identity matrix. For f ∈ F^L_m, the parameters α and β can be selected such that gradient descent and Nesterov's accelerated method converge to the global minimum x⋆ of (2.1) at a linear rate,

∥x^t − x⋆∥ ≤ c ρ^t ∥x^0 − x⋆∥   (2.6)

for all t and some positive scalar c > 0, where ∥·∥ is the Euclidean norm.
Table 2.1 provides the conventional values of these parameters and the corresponding guaranteed convergence rates [13]. Gradient descent achieves the convergence rate ρ = √(1 − 2/(κ+1)), where κ := L/m is the condition number associated with F^L_m. Thus, for reaching the accuracy level ∥x^t − x⋆∥ ≤ ε, gradient descent requires O(κ log(1/ε)) iterations. This dependence on the condition number can be significantly improved using Nesterov's accelerated method, which achieves the rate

ρ_na = √(1 − 1/√κ) ≤ 1 − 1/(2√κ)

thereby requiring only O(√κ log(1/ε)) iterations. This rate is orderwise optimal in the sense that no first-order algorithm can optimize all f ∈ F^L_m at a rate faster than ρ_hb = (√κ − 1)/(√κ + 1) [13, Theorem 2.1.13]. Note that 1 − ρ_hb = O(1/√κ) and 1 − ρ_na = Ω(1/√κ). In contrast to Nesterov's method, the heavy-ball method does not offer any acceleration guarantees for all f ∈ F^L_m. However, for strongly convex quadratic f, the parameters can be selected to guarantee linear convergence of the heavy-ball method with a rate that outperforms the one achieved by Nesterov's method [9]; see Table 2.2.
While the convergence rate is a commonly used metric for evaluating performance of optimization algorithms, this quantity only determines the asymptotic behavior of the expected error, and it
does not provide useful insight into transient behavior or variance amplification, given by expected
squared error.
Due to the presence of noisy gradient estimates in implementation, the expected squared error in the asymptotic regime, referred to as the steady-state variance, may be as important a metric as the convergence rate. Similarly, in applications with a limited iteration budget, the expected optimization error in non-asymptotic time-frames, referred to as the transient growth, is another important metric. In this work, we investigate both performance metrics and their trade-offs against the convergence rate.
Our main focus is on optimization problems with strongly convex quadratic objective functions. For this class of problems, accelerated first-order algorithms can be formulated as a linear time-invariant system. In the next section, using tools from linear systems theory, we present formal
definitions of all three performance metrics: convergence rate, worst-case transient growth, and
Method     Parameters                               Linear rate
Gradient   α = 1/L                                  ρ = √(1 − 2/(κ+1))
Nesterov   α = 1/L,  β = (√κ − 1)/(√κ + 1)          ρ = √(1 − 1/√κ)

Table 2.1: Conventional values of parameters and the corresponding convergence rates for f ∈ F^L_m, where κ := L/m [13, Theorems 2.1.15, 2.2.1]. The heavy-ball method does not offer acceleration guarantees for all f ∈ F^L_m.
Method     Parameters                                               Linear rate
Gradient   α = 2/(L + m)                                            ρ = (κ − 1)/(κ + 1)
Nesterov   α = 4/(3L + m),  β = (√(3κ+1) − 2)/(√(3κ+1) + 2)         ρ = (√(3κ+1) − 2)/√(3κ+1)
Polyak     α = 4/(√L + √m)²,  β = ((√κ − 1)/(√κ + 1))²              ρ = (√κ − 1)/(√κ + 1)

Table 2.2: Optimal parameters and the corresponding convergence rate bounds for a strongly convex quadratic objective function f ∈ F^L_m with λ_max(∇²f) = L, λ_min(∇²f) = m, and κ := L/m [9, Proposition 1].
steady-state variance. Throughout the rest of the work, we will use these definitions to establish
the connection between performance metrics and algorithmic parameters.
2.2 Quadratic optimization
The majority of our results focus on strongly convex quadratic objective functions f of the form

f(x) = (1/2) x^T Q x − q^T x   (2.7a)

where Q = Q^T ⪰ 0 is a positive semi-definite matrix. The set of strongly convex quadratic functions is a subset of F^L_m, with the parameters of strong convexity and smoothness m and L given by the smallest and largest eigenvalues of Q = ∇²f, respectively.
2.2.1 LTI formulation
For the quadratic objective function (2.7a), the gradient ∇f(x) = Qx − q is an affine function of x and (2.2) with constant algorithmic parameters admits an LTI state-space representation,

ψ^{t+1} = A ψ^t + B w^t
y^t = C ψ^t.   (2.8a)

Here, y^t := x^t − x⋆ = x^t is the distance to the optimal solution x⋆ = Q^{−1} q = 0 (without loss of generality, we take q = 0), ψ^t is the state vector,

ψ^t = [ (y^t)^T  (y^{t+1})^T ]^T   (2.8b)

and A, B, C are constant matrices determined by

A = [ 0, I; β_0 I − α γ_0 Q, β_1 I − α γ_1 Q ],   B = [ 0  I ]^T,   C = [ I  0 ].   (2.8c)
The eigenvalue decomposition of the Hessian matrix, Q = V Λ V^T, can be used to bring the matrices in (2.8) into their block diagonal forms. Here, V is an orthogonal matrix of the eigenvectors of Q, Λ is a diagonal matrix of the corresponding eigenvalues, and the change of variables,

x̂ := V^T x,   ŵ := V^T w   (2.9)

allows us to transform system (2.8) into a family of n decoupled subsystems parameterized by the ith eigenvalue λ_i of the Hessian matrix Q ∈ R^{n×n},

ψ̂^{t+1}_i = Â(λ_i) ψ̂^t_i + B̂ ŵ^t_i
ŷ^t_i = Ĉ ψ̂^t_i.   (2.10a)

The ith component of the vector ŵ is given by ŵ_i and

Â(λ_i) = [ 0, 1; β_0 − α γ_0 λ_i, β_1 − α γ_1 λ_i ],   B̂ = [ 0  1 ]^T,   Ĉ = [ 1  0 ].   (2.10b)
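The block-diagonalization in (2.9)–(2.10) can be verified numerically; in the sketch below, the matrix Q and the parameter values are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
M = rng.normal(size=(n, n))
Q = M @ M.T + n * np.eye(n)              # a random symmetric positive definite Q

alpha, beta1, gamma1 = 0.01, 1.5, 1.0    # hypothetical two-step parameters
beta0, gamma0 = 1 - beta1, 1 - gamma1

I, Z = np.eye(n), np.zeros((n, n))
A = np.block([[Z, I],
              [beta0 * I - alpha * gamma0 * Q, beta1 * I - alpha * gamma1 * Q]])

# Change of variables (2.9): the eigen-decomposition Q = V Lam V^T reduces
# (2.8) to n independent 2x2 subsystems A_hat(lambda_i) as in (2.10b).
lam = np.linalg.eigvalsh(Q)
def A_hat(l):
    return np.array([[0.0, 1.0],
                     [beta0 - alpha * gamma0 * l, beta1 - alpha * gamma1 * l]])

# The spectrum of the full matrix A is the union of the subsystem spectra.
full = np.sort(np.abs(np.linalg.eigvals(A)))
modal = np.sort(np.abs(np.concatenate([np.linalg.eigvals(A_hat(l)) for l in lam])))
print(np.allclose(full, modal))          # True
```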
We now use tools from linear systems theory to define the key performance metrics of accelerated first-order algorithms for strongly convex quadratic objective functions.
2.2.2 Convergence rate
System (2.8) is stable if the absolute values of the eigenvalues of the matrices Â(λ_i) are less than one for each i = 1, ..., n. We recall that stability of (2.8) implies that, in the absence of white noise, iterations of the algorithm satisfy

∥ψ^t − ψ⋆∥ ≤ c ρ^t ∥ψ^0 − ψ⋆∥   (2.11)

for some positive constant c, where the convergence rate is given by

ρ = max_{λ ∈ [m, L]} |eig(Â(λ))|.   (2.12)
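The maximization in (2.12) can be carried out on a grid; this sketch evaluates it for the heavy-ball parameters of Table 2.2 on a hypothetical problem with κ = 100, and confirms that the result matches (√κ − 1)/(√κ + 1):

```python
import numpy as np

m, L = 1.0, 100.0
kappa = L / m

# Heavy-ball parameters from Table 2.2 (gamma1 = 1, beta1 = 1 + beta).
alpha = 4 / (np.sqrt(L) + np.sqrt(m)) ** 2
beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2

def spectral_radius(lam):
    """Spectral radius of the subsystem matrix A_hat(lambda) in (2.10b)."""
    A_hat = np.array([[0.0, 1.0], [-beta, 1 + beta - alpha * lam]])
    return np.abs(np.linalg.eigvals(A_hat)).max()

# Convergence rate (2.12): worst-case spectral radius over lambda in [m, L].
rho = max(spectral_radius(lam) for lam in np.linspace(m, L, 2001))
rho_hb = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
print(rho, rho_hb)   # both approximately 0.818
```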
In the presence of white noise, the expected value of the state vector ψ^t is governed by E(ψ^{t+1}) = A E(ψ^t). Thus, E(ψ^t) = A^t E(ψ^0) and

∥E(ψ^t)∥ ≤ c ρ^t ∥E(ψ^0)∥.   (2.13)
The best achievable convergence rate for strongly convex quadratic problems is achieved by the heavy-ball method [13],

ρ = 1 − 2/(√κ + 1)   (2.14a)

with the following values of the algorithmic parameters

α = 4/(√L + √m)²,   β = (1 − 2/(√κ + 1))² = ρ².   (2.14b)
This improves upon the optimal rate achieved by gradient descent,

ρ = 1 − 2/(κ + 1)   (2.15)

obtained with the stepsize α = 2/(L + m).
The speed of the algorithms can be equivalently quantified by the settling time T_s,

T_s := 1/(1 − ρ)   (2.16)

which indicates how many iterations are required to reach a desired level of accuracy. Based on (2.11), ε-accuracy, ∥ψ^t∥₂/∥ψ^0∥₂ ≤ ε, is achieved when c ρ^t ≤ ε. Taking the logarithm of c ρ^t ≤ ε and using the Taylor series expansion ln(1 − x) ≈ −x, it is clear that ε-accuracy occurs when

t ≥ ln(ε/c)/ln(ρ) ≈ (1/(1 − ρ)) ln(c/ε)   (2.17)

which allows us to use T_s to indicate the number of iterations required to reach any specified level of accuracy.
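The approximation in (2.17) is easy to sanity-check numerically; the rate, constant, and accuracy level below are hypothetical:

```python
import numpy as np

rho, c, eps = 0.95, 1.0, 1e-6

# Exact iteration count from c rho^t <= eps versus the settling-time
# approximation t ~ T_s ln(c/eps) from (2.16)-(2.17).
t_exact = np.log(eps / c) / np.log(rho)
Ts = 1 / (1 - rho)
t_approx = Ts * np.log(c / eps)
print(t_exact, t_approx)   # about 269 vs 276; the gap shrinks as rho -> 1
```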
2.2.3 Transient growth
For linear systems, transient growth in the expected optimization error in the non-asymptotic regime is due to the non-normal dynamics of A. While the convergence rate is independent of initial conditions, for early iterations the expected error does depend on the specific values used to initialize the algorithm. When investigating the transient behavior of the expected optimization error, we consider the worst-case transient growth, that is, the magnitude of the largest expected optimization error over all initial conditions.
The optimization error at time t is given by y^t = C A^t ψ^0, and thus the norm of the optimization error is determined by the singular values of Φ(t) := C A^t, with

sup_{ψ^0 ≠ 0} ∥y^t∥₂/∥ψ^0∥₂ = sup_{∥ψ^0∥₂ = 1} ∥Φ(t) ψ^0∥₂ = σ_max(Φ(t))   (2.18)

where σ_max(·) is the largest singular value. We define the worst-case transient peak by

Tr_max := sup_{t ≥ 1, ψ^0 ≠ 0} ∥y^t∥₂/∥ψ^0∥₂ = max_t σ_max(Φ(t)).   (2.19)
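The quantity in (2.19) can be computed directly from the powers of A. This sketch evaluates σ_max(C A^t) for a single heavy-ball subsystem at the smallest eigenvalue λ = m; the problem instance (κ = 10³) is a hypothetical choice:

```python
import numpy as np

m, L = 1.0, 1000.0
alpha = 4 / (np.sqrt(L) + np.sqrt(m)) ** 2
beta = ((np.sqrt(L / m) - 1) / (np.sqrt(L / m) + 1)) ** 2

lam = m                                    # the slow mode, lambda = m
A = np.array([[0.0, 1.0], [-beta, 1 + beta - alpha * lam]])
C = np.array([[1.0, 0.0]])

# Track sigma_max(Phi(t)) with Phi(t) = C A^t, as in (2.18)-(2.19).
Phi, peaks = C.copy(), []
for t in range(1, 2000):
    Phi = Phi @ A
    peaks.append(np.linalg.svd(Phi, compute_uv=False)[0])

Tr_max = max(peaks)
print(Tr_max)   # well above 1: the transient peak grows with sqrt(kappa)
```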
2.2.4 Noise amplification
The variance of the error in the optimization variable y^t := x^t − x⋆ = x^t is determined by

J^t := E[∥y^t∥²] = Σ_{i=1}^n Ĵ^t(λ_i)   (2.20)
where Ĵ^t(λ_i) = E[∥ŷ^t_i∥²] denotes the variance amplification of the ith subsystem (2.10b). In particular, the Lyapunov recursion

P̂^{t+1}_i = Â(λ_i) P̂^t_i Â^T(λ_i) + B̂ B̂^T   (2.21)

can be used to compute the modal contribution of the ith eigenvalue λ_i of Q to the variance amplification as

Ĵ^t(λ_i) = trace(Ĉ P̂^t_i Ĉ^T)   (2.22)

where P̂^t_i denotes the covariance matrix of the state vector ψ̂^t_i, P̂^t_i := E[ψ̂^t_i (ψ̂^t_i)^T]. For stable systems, P̂^t_i approaches its steady-state value P̂_i as t goes to infinity, which satisfies the algebraic Lyapunov equation

P̂_i = Â(λ_i) P̂_i Â^T(λ_i) + B̂ B̂^T   (2.23)

and the steady-state variance is determined by Ĵ(λ_i) = trace(Ĉ P̂_i Ĉ^T).
For the heavy-ball and gradient descent algorithms, the steady-state variance scales with the condition number κ according to [34–36]

J_HB = Θ(κ^{3/2}),   J_GD = Θ(κ).   (2.24)
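The steady-state variance in (2.23) can be computed by solving the Lyapunov equation directly, e.g. via vectorization. The parameter values below are illustrative; for the rate-optimal heavy-ball parameters, the two extreme eigenvalues m and L give equal modal contributions, which dominate the interior modes:

```python
import numpy as np

def steady_state_variance(A, B, C):
    """Solve the algebraic Lyapunov equation (2.23), P = A P A^T + B B^T,
    by vectorization, and return J = trace(C P C^T)."""
    k = A.shape[0]
    P = np.linalg.solve(np.eye(k * k) - np.kron(A, A),
                        (B @ B.T).reshape(-1)).reshape(k, k)
    return np.trace(C @ P @ C.T)

m, L = 1.0, 100.0
alpha = 4 / (np.sqrt(L) + np.sqrt(m)) ** 2
beta = ((np.sqrt(L / m) - 1) / (np.sqrt(L / m) + 1)) ** 2

def J_hat(lam):
    """Modal contribution (2.22) for the heavy-ball subsystem (2.10b)."""
    A = np.array([[0.0, 1.0], [-beta, 1 + beta - alpha * lam]])
    B = np.array([[0.0], [1.0]])
    C = np.array([[1.0, 0.0]])
    return steady_state_variance(A, B, C)

# At these rate-optimal parameters the contributions of lambda = m
# and lambda = L coincide, and interior modes contribute less.
print(J_hat(m), J_hat(L), J_hat((m + L) / 2))
```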
Chapter 3
Transient growth of accelerated optimization algorithms
In this chapter, we examine the transient behavior of accelerated first-order optimization algorithms. For convex quadratic problems, we employ tools from linear systems theory to show that
transient growth arises from the presence of non-normal dynamics. We identify the existence of
modes that yield an algebraic growth in early iterations and quantify the transient excursion from
the optimal solution caused by these modes. For strongly convex smooth optimization problems,
we utilize the theory of integral quadratic constraints (IQCs) to establish an upper bound on the
magnitude of the transient response of Nesterov's accelerated algorithm. We show that both the largest Euclidean distance between the optimization variable and the global minimizer and the rise time to the transient peak are proportional to the square root of the condition number of the problem.
Finally, for problems with large condition numbers, we demonstrate tightness of the bounds that
we derive up to constant factors.
3.1 Introduction
Optimization algorithms are increasingly being used in applications with limited time budgets. In
many real-time and embedded scenarios, only a few iterations can be performed and traditional
convergence metrics cannot be used to evaluate performance in these non-asymptotic regimes.
There is a vast literature focusing on the convergence properties of accelerated algorithms for
different stepsize rules and acceleration parameters, including [13–16]. There is also a growing
body of work which investigates robustness of accelerated algorithms to various types of uncertainty [34, 37, 38, 58–61]. These studies demonstrate that acceleration increases sensitivity to uncertainty in gradient evaluation.
In addition to deterioration of robustness in the face of uncertainty, asymptotically stable accelerated algorithms may also exhibit undesirable transient behavior [32]. This is in contrast to
gradient descent which is a contraction for strongly convex problems with suitable stepsize [62].
In real-time optimization and in applications with limited time budgets, the transient growth can
limit the appeal of accelerated methods. In addition, first-order algorithms are often used as a
building block in multi-stage optimization including ADMM [63] and distributed optimization
methods [64]. In these settings, at each stage we can perform only a few iterations of first-order
updates on primal or dual variables and transient growth can have a detrimental impact on the performance of the entire algorithm. This motivates an in-depth study of the behavior of accelerated
first-order methods in non-asymptotic regimes.
It is widely recognized that large transients may arise from the presence of resonant modal
interactions and non-normality of linear dynamical generators [65]. Even in the absence of unstable modes, these can induce large transient responses, significantly amplify exogenous disturbances, and trigger departure from nominal operating conditions. For example, in fluid dynamics,
[Figure 3.1 plots ∥x^t − x⋆∥²₂ versus the iteration number t.]

Figure 3.1: Error in the optimization variable for Polyak's heavy-ball (black) and Nesterov's (red) algorithms with the parameters that optimize the convergence rate for a strongly convex quadratic problem with condition number 10³ and a unit-norm initial condition with x^0 ≠ x⋆.
such mechanisms can initiate departure from stable laminar flows and trigger transition to turbulence [66, 67].
While momentum-based algorithms have faster convergence rates compared to the standard
gradient descent (γ = β = 0), they may suffer from large transient responses; see Fig. 3.1 for an
illustration. To quantify the transient behavior, we examine the ratio of the largest error in the
optimization variable to the initial error.
We focus on Nesterov’s accelerated and Polyak’s heavy-ball methods for convex quadratic
problems, where (2.2) can be cast as a linear time-invariant (LTI) system as seen in Section 2.2.1, for which
modal analysis of the state-transition matrix can be performed. For both accelerated algorithms, we
identify non-normal modes that create large transient growth, derive analytical expressions for the
state-transition matrices, and establish bounds on the transient response in terms of the convergence
rate and the iteration number. We show that both the peak value of the transient response and
the rise time to this value increase with the square root of the condition number of the problem.
Moreover, for general strongly convex problems, we combine a Lyapunov-based approach with
the theory of IQCs to establish an upper bound on the transient response of Nesterov’s accelerated
algorithm. As for quadratic problems, we demonstrate that this bound scales with the square root
of the condition number.
Adaptive restarting, which was introduced in [32] to address the oscillatory behavior of Nesterov’s accelerated method, provides heuristics for improving transient responses. In [33], the transient growth of second-order systems was studied and a framework for establishing upper bounds
was introduced, with a focus on real eigenvalues. The result was applied to the heavy-ball method
but was not applicable to quadratic problems in which the dynamical generator may have complex
eigenvalues. We account for complex eigenvalues and conduct a thorough analysis for Nesterov’s
accelerated algorithm as well. Furthermore, for convex quadratic problems, we provide tight upper
and lower bounds on transient responses in terms of the condition number and identify the initial
condition that induces the largest transient response. Similar results with extensions to the Wasserstein distance have been recently reported in [68]. Previous work on non-asymptotic bounds for
Nesterov’s accelerated algorithm includes [69], where bounds on the objective error in terms of
the condition number were provided. However, in contrast to our work, this result introduces a
restriction on the initial conditions. Finally, while [70] presents computational bounds, we develop analytical bounds on the non-asymptotic value of the estimated optimizer.
3.2 Convex quadratic problems
In this section, we examine transient responses of accelerated algorithms for convex quadratic
objective functions. The minimizers of (2.7) are determined by the null space of the matrix Q, x⋆ ∈ N(Q). The constant parameters α and β can be selected to provide stability of the subsystems
in (2.10) for all λ_i ∈ [m, L], and guarantee convergence of x̂^t_i to x̂⋆_i := 0 with a linear rate determined by the spectral radius ρ(A_i) < 1. On the other hand, for i = r+1, ..., n, where r is the rank of Q, the eigenvalues of A_i are β and 1. In this case, the solution to (2.10) is given by
x̂^t_i = (1 − β^t)/(1 − β) · (x̂^1_i − x̂^0_i) + x̂^0_i   (3.1a)

and the steady-state limit of x̂^t_i,

x̂⋆_i := 1/(1 − β) · (x̂^1_i − x̂^0_i) + x̂^0_i   (3.1b)

is achieved with a linear rate β < 1. Thus, the iterates of (2.2) converge to the optimal solution x⋆ = V x̂⋆ ∈ N(Q) with a linear rate ρ < 1 and Table 3.1 provides the parameters α and β that optimize the convergence rate [9, Proposition 1].
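Formula (3.1a) can be checked against direct iteration of the λ_i = 0 recursion; β and the initial values below are arbitrary illustrative choices:

```python
import numpy as np

beta = 0.7                        # hypothetical momentum parameter, |beta| < 1
x0, x1 = 2.0, 3.0                 # initial iterates xhat^0_i and xhat^1_i

# Direct iteration of xhat^{t+2} = (1 + beta) xhat^{t+1} - beta xhat^t,
# the zero-eigenvalue (lambda_i = 0) subsystem of the heavy-ball method.
xp, xc = x0, x1
traj = [x0, x1]
for _ in range(50):
    xp, xc = xc, (1 + beta) * xc - beta * xp
    traj.append(xc)

# Closed-form solution (3.1a) and its steady-state limit (3.1b).
closed = [(1 - beta ** t) / (1 - beta) * (x1 - x0) + x0 for t in range(len(traj))]
x_star = (x1 - x0) / (1 - beta) + x0
print(np.allclose(traj, closed), abs(traj[-1] - x_star))   # True, small
```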
3.2.1 Transient growth of accelerated algorithms
In spite of a significant improvement in the rate of convergence, acceleration may deteriorate performance on finite time intervals and lead to large transient responses. In particular, the constant $c$ in (2.6) may become significantly larger than 1 for the accelerated methods, whereas $c = 1$ for gradient descent because of its contractive property for strongly convex problems. Figure 3.1 shows the transient growth of the error in the optimization variable for Nesterov's accelerated method and Polyak's heavy-ball algorithm, with parameters given in Table 3.1. A strongly convex quadratic problem with $\kappa = 10^3$ is considered and the parameters $\alpha$ and $\beta$ in Table 3.1 that optimize the linear convergence rate are used.

| Method | Optimal parameters | Linear rate $\rho$ |
| Nesterov | $\alpha = \dfrac{4}{3L+m}$, $\quad \beta = \dfrac{\sqrt{3\kappa+1}-2}{\sqrt{3\kappa+1}+2}$ | $1 - \dfrac{2}{\sqrt{3\kappa+1}}$ |
| Polyak | $\alpha = \dfrac{4}{(\sqrt{L}+\sqrt{m})^2}$, $\quad \beta = \dfrac{(\sqrt{\kappa}-1)^2}{(\sqrt{\kappa}+1)^2}$ | $1 - \dfrac{2}{\sqrt{\kappa}+1}$ |

Table 3.1: Parameters that provide optimal convergence rates for a convex quadratic objective function (2.7) with $\kappa := L/m$.
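The experiment just described is easy to reproduce. The sketch below (our own illustrative instance $Q = \mathrm{diag}(L, m)$ with $\kappa = 10^3$ and the heavy-ball parameters from Table 3.1; it is not the code behind Figure 3.1) runs Polyak's method and records the error norm, which peaks far above its initial value before the linear rate takes over:

```python
import numpy as np

def heavy_ball(Q, x0, x1, alpha, beta, T):
    """Run Polyak's heavy-ball method on f(x) = (1/2) x^T Q x and record ||x^t||_2."""
    xs = [x0, x1]
    for _ in range(T):
        x_prev, x = xs[-2], xs[-1]
        xs.append(x - alpha * (Q @ x) + beta * (x - x_prev))
    return np.array([np.linalg.norm(x) for x in xs])

m, L = 1.0, 1.0e3                      # strong convexity and smoothness constants
kappa = L / m                          # condition number
Q = np.diag([L, m])                    # quadratic objective with eigenvalues L and m

# parameters from Table 3.1 that optimize the linear convergence rate
alpha = 4.0 / (np.sqrt(L) + np.sqrt(m)) ** 2
beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2

x0 = np.array([0.0, 1.0])              # initialize along the weak eigendirection
errs = heavy_ball(Q, x0, -x0, alpha, beta, 2000)
print(f"peak error {errs.max():.2f} at t = {errs.argmax()}")
```

A similar run with Nesterov's parameters exhibits the same qualitative transient peak before linear convergence sets in.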
This is in contrast to gradient descent, which is a contraction [62]. At any $t$, we are interested in the worst-case ratio of the two-norm of the error in the optimization variable $y^t := x^t - x^\star$ to the two-norm of the initial condition $\psi^0 - \psi^\star = [\,(y^0)^T \;\; (y^1)^T\,]^T$,

$$\mathrm{Tr}^2(t) \;:=\; \sup_{\psi^0 \neq \psi^\star} \frac{\|x^t - x^\star\|_2^2}{\|\psi^0 - \psi^\star\|_2^2}. \qquad (3.2)$$
Proposition 1. For accelerated algorithms applied to convex quadratic problems, $\mathrm{Tr}(t)$ in (3.2) is determined by

$$\mathrm{Tr}^2(t) \;=\; \max\left\{\, \max_{i \le r} \|C_i A_i^t\|_2^2, \;\; \beta^{2t}/(1 + \beta^2) \,\right\}. \qquad (3.3)$$
Proof. Since $V$ is unitary and the dynamics (2.10) that govern the evolution of each $\hat{x}_i^t$ are decoupled, $\mathrm{Tr}(t)$ is determined by

$$\mathrm{Tr}^2(t) \;=\; \max_i \; \sup_{\hat{\psi}_i^0 \neq \hat{\psi}_i^\star} \frac{(\hat{x}_i^t - \hat{x}_i^\star)^2}{\|\hat{\psi}_i^0 - \hat{\psi}_i^\star\|_2^2} \qquad (3.4)$$

where $\hat{\psi}_i^\star := [\,\hat{x}_i^\star \;\; \hat{x}_i^\star\,]^T$. Furthermore, the mapping from $\hat{\psi}_i^0 - \hat{\psi}_i^\star$ to $\hat{x}_i^t - \hat{x}_i^\star$ is given by $\Phi_i(t) := C_i A_i^t$, where the state-transition matrix $A_i^t$ is determined by the $t$th power of $A_i$,

$$\hat{x}_i^t - \hat{x}_i^\star \;=\; C_i A_i^t (\hat{\psi}_i^0 - \hat{\psi}_i^\star) \;=:\; \Phi_i(t)(\hat{\psi}_i^0 - \hat{\psi}_i^\star). \qquad (3.5)$$

For $\lambda_i \neq 0$, $\hat{\psi}_i^0 - \hat{\psi}_i^\star = \hat{\psi}_i^0$ is an arbitrary vector in $\mathbb{R}^2$. Thus,

$$\sup_{\hat{\psi}_i^0 \neq \hat{\psi}_i^\star} \frac{(\hat{x}_i^t - \hat{x}_i^\star)^2}{\|\hat{\psi}_i^0 - \hat{\psi}_i^\star\|_2^2} \;=\; \|C_i A_i^t\|_2^2, \qquad i = 1, \ldots, r. \qquad (3.6)$$
This expression, however, does not hold when $\lambda_i = 0$ in (2.10) because $\psi_i^0 - \psi_i^\star$ is restricted to a line in $\mathbb{R}^2$. Namely, from (3.1),

$$\hat{x}_i^t - \hat{x}_i^\star \;=\; \frac{-\beta^t}{1 - \beta}\,(\hat{x}_i^1 - \hat{x}_i^0), \qquad \psi_i^0 - \psi_i^\star \;=\; \begin{bmatrix} \hat{x}_i^0 - \hat{x}_i^\star \\ \hat{x}_i^1 - \hat{x}_i^\star \end{bmatrix} \;=\; \frac{-(\hat{x}_i^1 - \hat{x}_i^0)}{1 - \beta} \begin{bmatrix} 1 \\ \beta \end{bmatrix} \qquad (3.7)$$

which, for any initial condition with $\hat{x}_i^0 \neq \hat{x}_i^1$, leads to

$$\frac{(\hat{x}_i^t - \hat{x}_i^\star)^2}{\|\psi_i^0 - \psi_i^\star\|_2^2} \;=\; \frac{\beta^{2t}}{1 + \beta^2}, \qquad i = r+1, \ldots, n. \qquad (3.8)$$

Finally, substitution of (3.6) and (3.8) into (3.4) yields (3.3).
3.2.2 Analytical expressions for transient response

We next derive analytical expressions for the state-transition matrix $A_i^t$ and the response matrix $\Phi_i(t) = C_i A_i^t$ in (2.10).

Lemma 1. Let $\mu_1$ and $\mu_2$ be the eigenvalues of the matrix

$$M \;=\; \begin{bmatrix} 0 & 1 \\ a & b \end{bmatrix}$$

and let $t$ be a positive integer. For $\mu_1 \neq \mu_2$,

$$M^t \;=\; \frac{1}{\mu_2 - \mu_1} \begin{bmatrix} \mu_1\mu_2(\mu_1^{t-1} - \mu_2^{t-1}) & \mu_2^t - \mu_1^t \\ \mu_1\mu_2(\mu_1^t - \mu_2^t) & \mu_2^{t+1} - \mu_1^{t+1} \end{bmatrix}.$$

Moreover, for $\mu := \mu_1 = \mu_2$, the matrix $M^t$ is determined by

$$M^t \;=\; \begin{bmatrix} (1-t)\mu^t & t\mu^{t-1} \\ -t\mu^{t+1} & (t+1)\mu^t \end{bmatrix}. \qquad (3.9)$$
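The closed-form powers in Lemma 1 are easy to check against direct matrix powers; the sketch below (with arbitrary test eigenvalue pairs) covers both the distinct and the repeated eigenvalue branches:

```python
import numpy as np

def companion_power(a, b, t):
    """Closed-form t-th power of M = [[0, 1], [a, b]] via Lemma 1."""
    mu1, mu2 = np.roots([1.0, -b, -a])   # eigenvalues of the companion matrix
    if not np.isclose(mu1, mu2):
        return np.array([
            [mu1*mu2*(mu1**(t-1) - mu2**(t-1)), mu2**t - mu1**t],
            [mu1*mu2*(mu1**t - mu2**t),         mu2**(t+1) - mu1**(t+1)],
        ]) / (mu2 - mu1)
    mu = mu1                             # repeated eigenvalue case (3.9)
    return np.array([[(1-t)*mu**t,    t*mu**(t-1)],
                     [-t*mu**(t+1), (t+1)*mu**t]])

# (a, b) giving distinct roots {0.3, 0.6} and a repeated root {0.5, 0.5}
for a, b in [(-0.18, 0.9), (-0.25, 1.0)]:
    M = np.array([[0.0, 1.0], [a, b]])
    assert np.allclose(companion_power(a, b, 7),
                       np.linalg.matrix_power(M, 7))
```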
Lemma 1 with $M = A_i$ determines explicit expressions for $A_i^t$. These expressions allow us to establish a bound on the norm of the response for each decoupled subsystem (2.10). In Lemma 2, we provide a tight upper bound on $\|C_i A_i^t\|_2^2$ for each $t$ in terms of the spectral radius of the matrix $A_i$.

Lemma 2. The matrix $M$ in Lemma 1 satisfies

$$\left\| \begin{bmatrix} 1 & 0 \end{bmatrix} M^t \right\|_2^2 \;\le\; (t-1)^2 \rho^{2t} \,+\, t^2 \rho^{2t-2} \qquad (3.10)$$

where $\rho$ is the spectral radius of $M$. Moreover, (3.10) becomes an equality if $M$ has repeated eigenvalues.
Remark 1. For Nesterov's accelerated algorithm with the parameters that optimize the convergence rate (cf. Table 2.2), the matrix $\hat{A}_r$, which corresponds to the smallest non-zero eigenvalue of $Q$, $\lambda_r = m$, has an eigenvalue $1 - 2/\sqrt{3\kappa+1}$ with algebraic multiplicity two and an incomplete set of eigenvectors. Similarly, for both $\lambda_1 = L$ and $\lambda_r = m$, the matrices $\hat{A}_1$ and $\hat{A}_r$ for the heavy-ball method with the parameters provided in Table 2.2 have repeated eigenvalues which are, respectively, given by $(1-\sqrt{\kappa})/(1+\sqrt{\kappa})$ and $-(1-\sqrt{\kappa})/(1+\sqrt{\kappa})$.
We next use Lemma 2 with $M = A_i$ to establish an analytical expression for $\mathrm{Tr}(t)$.
Theorem 1. For accelerated algorithms applied to convex quadratic problems, $\mathrm{Tr}(t)$ in (3.2) satisfies

$$\mathrm{Tr}^2(t) \;\le\; \max\left\{\, (t-1)^2\rho^{2t} + t^2\rho^{2(t-1)}, \;\; \beta^{2t}/(1+\beta^2) \,\right\}$$

where $\rho := \max_{i \le r} \rho(A_i)$. Moreover, for the parameters provided in Table 2.2,

$$\mathrm{Tr}^2(t) \;=\; (t-1)^2\rho^{2t} \,+\, t^2\rho^{2(t-1)}. \qquad (3.11)$$
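Identity (3.11) can be verified numerically. The sketch below assumes the modal companion-form realization $A_i = [[0,\,1],\,[-\beta,\,1+\beta-\alpha\lambda_i]]$, $C_i = [\,1\;\;0\,]$ for the heavy-ball method (our own test instance with $\kappa = 100$):

```python
import numpy as np

# Heavy-ball subsystem matrices A_i = [[0, 1], [-beta, 1 + beta - alpha*lam]]
m, L = 1.0, 100.0
kappa = L / m
alpha = 4.0 / (np.sqrt(L) + np.sqrt(m)) ** 2
beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2
rho = 1.0 - 2.0 / (np.sqrt(kappa) + 1)          # optimal linear rate (Table 3.1)
C = np.array([[1.0, 0.0]])

def norm_CAt(lam, t):
    """Spectral norm of C A^t for the subsystem associated with eigenvalue lam."""
    A = np.array([[0.0, 1.0], [-beta, 1.0 + beta - alpha * lam]])
    return np.linalg.norm(C @ np.linalg.matrix_power(A, t), 2)

for t in [5, 10, 20]:
    lhs = max(norm_CAt(lam, t) for lam in [m, L]) ** 2
    rhs = (t - 1) ** 2 * rho ** (2 * t) + t ** 2 * rho ** (2 * t - 2)
    assert np.isclose(lhs, rhs)                  # identity (3.11)
```

The check passes with equality because the subsystems at $\lambda = m$ and $\lambda = L$ have repeated eigenvalues $\pm\rho$, the case in which (3.10) is tight.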
Theorem 1 highlights the source of disparity between the long- and short-term behavior of the response. While the geometric decay of $\rho^t$ drives $x^t$ to $x^\star$ as $t \to \infty$, early stages are dominated by the algebraic term which induces transient growth. We next provide tight bounds on the time $t_{\max}$ at which the largest transient response takes place and the corresponding peak value $\mathrm{Tr}(t_{\max})$. Even though we derive explicit expressions for these two quantities, our tight upper and lower bounds are more informative and easier to interpret.
Theorem 2. For accelerated algorithms with the parameters provided in Table 2.2, let $\rho \in [1/e, 1)$. Then the rise time $t_{\max} := \mathrm{argmax}_t\, \mathrm{Tr}(t)$ and the peak value $\mathrm{Tr}(t_{\max})$ satisfy

$$-1/\log(\rho) \;\le\; t_{\max} \;\le\; 1 - 1/\log(\rho)$$
$$\frac{-\sqrt{2}\,\rho}{e\log(\rho)} \;\le\; \mathrm{Tr}(t_{\max}) \;\le\; \frac{-\sqrt{2}}{e\,\rho\log(\rho)}.$$
Figure 3.2: Dependence of the error in the optimization variable $\|x^t\|_2^2$ on the iteration number $t$ for the heavy-ball (black) and Nesterov's methods (red), as well as the peak magnitudes (dashed lines) obtained in Proposition 2, for two different initial conditions, $x^1 = x^0$ and $x^1 = -x^0$, with $\|x^1\|_2 = \|x^0\|_2 = 1$.
For accelerated algorithms with the parameters provided in Table 2.2, Theorem 2 can be used to determine the rise time to the peak in terms of the condition number $\kappa$. We next establish that both $t_{\max}$ and $\mathrm{Tr}(t_{\max})$ scale as $\sqrt{\kappa}$.
Proposition 2. For accelerated algorithms with the parameters provided in Table 2.2, the rise time $t_{\max} := \mathrm{argmax}_t\, \mathrm{Tr}(t)$ and the peak value $\mathrm{Tr}(t_{\max})$ satisfy:

(i) Polyak's heavy-ball method with $\kappa \ge 4.69$,

$$(\sqrt{\kappa}-1)/2 \;\le\; t_{\max} \;\le\; (\sqrt{\kappa}+3)/2$$
$$\frac{(\sqrt{\kappa}-1)^2}{\sqrt{2}\,e\,(\sqrt{\kappa}+1)} \;\le\; \mathrm{Tr}(t_{\max}) \;\le\; \frac{(\sqrt{\kappa}+1)^2}{\sqrt{2}\,e\,(\sqrt{\kappa}-1)}$$

(ii) Nesterov's accelerated method with $\kappa \ge 3.01$,

$$(\sqrt{3\kappa+1}-2)/2 \;\le\; t_{\max} \;\le\; (\sqrt{3\kappa+1}+2)/2$$
$$\frac{(\sqrt{3\kappa+1}-2)^2}{\sqrt{2}\,e\,\sqrt{3\kappa+1}} \;\le\; \mathrm{Tr}(t_{\max}) \;\le\; \frac{3\kappa+1}{\sqrt{2}\,e\,(\sqrt{3\kappa+1}-2)}.$$
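The $\sqrt{\kappa}$ scaling in part (i) can be confirmed by evaluating (3.11) at the heavy-ball rate $\rho = 1 - 2/(\sqrt{\kappa}+1)$; the sketch below uses an arbitrary grid of condition numbers:

```python
import numpy as np

def hb_peak(kappa):
    """Peak of Tr(t) = sqrt((t-1)^2 rho^{2t} + t^2 rho^{2t-2}) at the heavy-ball rate."""
    rho = 1.0 - 2.0 / (np.sqrt(kappa) + 1.0)
    t = np.arange(1, int(20 * np.sqrt(kappa)))   # grid long enough to contain the peak
    return np.max(np.sqrt((t - 1) ** 2 * rho ** (2 * t) + t ** 2 * rho ** (2 * t - 2)))

for kappa in [10.0, 100.0, 1000.0]:              # all satisfy kappa >= 4.69
    s = np.sqrt(kappa)
    lo = (s - 1) ** 2 / (np.sqrt(2) * np.e * (s + 1))
    hi = (s + 1) ** 2 / (np.sqrt(2) * np.e * (s - 1))
    assert lo <= hb_peak(kappa) <= hi            # Proposition 2(i)
```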
In Proposition 2, the lower bounds on $\kappa$ are only required to ensure that the convergence rate $\rho$ satisfies $\rho \ge 1/e$, which allows us to apply Theorem 2. We also note that the upper and lower bounds on $t_{\max}$ and $\mathrm{Tr}(t_{\max})$ are tight in the sense that their ratio converges to 1 as $\kappa \to \infty$.
3.2.3 The role of initial conditions

The accelerated algorithms need to be initialized with $x^0$ and $x^1 \in \mathbb{R}^n$. This provides a degree of freedom that can be used to potentially improve their transient performance. To provide insight, let us consider the quadratic problem with $Q = \mathrm{diag}(\kappa, 1)$. Figure 3.2 shows the error in the optimization variable for Polyak's and Nesterov's algorithms as well as the peak magnitudes obtained in Proposition 2 for two different types of initial conditions, with $x^1 = x^0$ and $x^1 = -x^0$, respectively. For $x^1 = -x^0$, both algorithms recover their worst-case transient responses. However, for $x^1 = x^0$, Nesterov's method shows no transient growth.
Our analysis shows that large transient responses arise from the existence of non-normal modes in the matrices $A_i$. However, such modes do not move the entries of the state-transition matrix $A_i^t$ in arbitrary directions. For example, using Lemma 1, it is easy to verify that $A_r$ in (2.10b), associated with the smallest non-zero eigenvalue $\lambda_r = m$ of $Q$, in Nesterov's algorithm with the parameters provided by Table 2.2, has the repeated eigenvalue $\mu = 1 - 2/\sqrt{3\kappa+1}$ and $A_r^t$ is determined by (3.9) with $M = A_r$. Even though each entry of $A_r^t$ experiences transient growth, its row sums are determined by

$$A_r^t \begin{bmatrix} 1 \\ 1 \end{bmatrix} \;=\; \begin{bmatrix} 1 + 2t/(\sqrt{3\kappa+1}-2) \\[2pt] 1 + 2t/\sqrt{3\kappa+1} \end{bmatrix} \left(1 - \frac{2}{\sqrt{3\kappa+1}}\right)^{t}$$

and the entries of this vector are monotonically decaying functions of $t$. Furthermore, for $i < r$, it can be shown that the entries of $A_i^t [\,1 \;\; 1\,]^T$ remain smaller than 1 for all $i$ and $t$. In Theorem 3, we provide a bound on the transient response of Nesterov's method for balanced initial conditions with $x^1 = x^0$.
Theorem 3. For convex quadratic optimization problems, the iterates of Nesterov's accelerated method with a balanced initial condition $x^1 = x^0$ and the parameters provided in Table 2.2 satisfy

$$\|x^t - x^\star\|_2 \;\le\; \|x^0 - x^\star\|_2.$$
Proof. See Appendix 3.5.2.
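Theorem 3, and its failure for unbalanced initializations, can be illustrated numerically. The sketch below (our own instance $Q = \mathrm{diag}(L, m)$ with the Nesterov parameters from Table 3.1) compares $x^1 = x^0$ with $x^1 = -x^0$:

```python
import numpy as np

def nesterov(Q, x0, x1, alpha, beta, T):
    """Nesterov's method: y^t = (1+beta)x^{t+1} - beta x^t, x^{t+2} = y^t - alpha*grad f(y^t)."""
    xs = [x0, x1]
    for _ in range(T):
        y = (1 + beta) * xs[-1] - beta * xs[-2]
        xs.append(y - alpha * (Q @ y))
    return np.array([np.linalg.norm(x) for x in xs])

m, L = 1.0, 1.0e3
kappa = L / m
Q = np.diag([L, m])
alpha = 4.0 / (3 * L + m)                               # Table 3.1 parameters
beta = (np.sqrt(3 * kappa + 1) - 2) / (np.sqrt(3 * kappa + 1) + 2)

x0 = np.array([0.0, 1.0])
balanced = nesterov(Q, x0, x0, alpha, beta, 1000)       # x^1 = x^0
unbalanced = nesterov(Q, x0, -x0, alpha, beta, 1000)    # x^1 = -x^0
assert balanced.max() <= np.linalg.norm(x0) + 1e-9      # no transient growth (Theorem 3)
assert unbalanced.max() > 2.0                           # large transient peak
```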
It is worth mentioning that the transient growth of the heavy-ball method cannot be eliminated with the use of balanced initial conditions. To see this, we note that the matrices $A_r^t$ and $A_1^t$ for the heavy-ball method with the parameters provided in Table 2.2 also take the form in (3.9), with $\mu = -(1-\sqrt{\kappa})/(1+\sqrt{\kappa})$ and $\mu = (1-\sqrt{\kappa})/(1+\sqrt{\kappa})$, respectively (cf. Remark 1). In contrast to $A_r^t [\,1 \;\; 1\,]^T$, which decays monotonically,

$$A_1^t \begin{bmatrix} 1 \\ 1 \end{bmatrix} \;=\; \begin{bmatrix} 1 + 2t\sqrt{\kappa}/(1-\sqrt{\kappa}) \\[2pt] 1 + 2t\sqrt{\kappa}/(1+\sqrt{\kappa}) \end{bmatrix} \frac{(1-\sqrt{\kappa})^t}{(1+\sqrt{\kappa})^t}$$
experiences transient growth. It was recently shown that an averaged version of the heavy-ball method experiences smaller peak deviation than the heavy-ball method [71]. We also note that
adaptive restarting provides effective heuristics for reducing oscillatory behavior of accelerated
algorithms [32].
Remark 2. For accelerated algorithms with the parameters provided in Table 2.2, the initial condition that leads to the largest transient growth at any time $\tau$ is determined by

$$\hat{\psi}_r^0 \;=\; c\,\big[\, (1-\tau)\rho^{\tau} \;\;\; \tau\rho^{\tau-1} \,\big]^T, \qquad \hat{\psi}_i^0 = 0 \;\text{ for } i \neq r$$

where $c \neq 0$ and $\hat{\psi}_r^0$ is the principal right singular vector of $C_r A_r^{\tau}$. Thus, the largest peak $\mathrm{Tr}(t_{\max})$ occurs for $\{\hat{\psi}_i^0 = 0,\; i \neq r\}$ and $\hat{\psi}_r^0 = c\,[\,(1-t_{\max})\rho^{t_{\max}} \;\;\; t_{\max}\rho^{t_{\max}-1}\,]^T$, where tight bounds on $t_{\max}$ are established in Proposition 2.
Remark 3. For $\lambda_i = 0$ in (2.10), $|\hat{x}_i^t - \hat{x}_i^\star|$ decays monotonically with a linear rate $\beta$ and only non-zero eigenvalues of $Q$ contribute to the transient growth. Furthermore, for the parameters provided in Table 2.2, our analysis shows that $\mathrm{Tr}^2(t) = \max_{i \le r} \|C_i A_i^t\|_2^2$. In what follows, we provide bounds on the largest deviation from the optimal solution for Nesterov's algorithm for general strongly convex problems.
3.3 General strongly convex problems

In this section, we combine a Lyapunov-based approach with the theory of IQCs to provide bounds on the transient growth of Nesterov's accelerated algorithm for the class $\mathcal{F}_m^L$ of $m$-strongly convex and $L$-smooth functions. When $f$ is not quadratic, first-order algorithms are no longer LTI systems and eigenvalue decomposition cannot be utilized to simplify the analysis. Instead, to handle nonlinearity and obtain upper bounds on $\mathrm{Tr}$ in (3.2), we augment standard quadratic Lyapunov functions with the objective error.

For $f \in \mathcal{F}_m^L$, algorithm (2.2) is invariant under translation. Thus, without loss of generality, we assume that $x^\star = 0$ is the unique minimizer of (2.1) with $f(0) = 0$. In what follows, we present a framework based on Linear Matrix Inequalities (LMIs) that allows us to obtain time-independent bounds on the error in the optimization variable. This framework combines certain IQCs [72] with Lyapunov functions of the form

$$V(\psi) \;=\; \psi^T X \psi \,+\, \theta f(C\psi) \qquad (3.12)$$

which consist of the objective function evaluated at $C\psi$ and a quadratic function of $\psi$, where $X$ is a positive definite matrix.

The theory of IQCs provides a convex control-theoretic approach to analyzing optimization algorithms [9] and it was recently employed to study convergence and robustness of first-order methods [34, 38, 54, 55, 70, 73]. The type of Lyapunov functions in (3.12) was introduced in [70, 74] to study convergence for convex problems. For Nesterov's accelerated algorithm, we demonstrate that this approach provides orderwise-tight analytical upper bounds on $\mathrm{Tr}(t)$.

Nesterov's accelerated algorithm can be viewed as a feedback interconnection of linear and nonlinear components

$$\psi^{t+1} \;=\; A\psi^t \,+\, Bu^t, \qquad y^t \;=\; C_y\psi^t, \qquad u^t \;=\; \Delta(y^t) \qquad (3.13a)$$
where the LTI part of the system is determined by

$$A \;=\; \begin{bmatrix} 0 & I \\ -\beta I & (1+\beta)I \end{bmatrix}, \qquad B \;=\; \begin{bmatrix} 0 \\ -\alpha I \end{bmatrix}, \qquad C_y \;=\; \begin{bmatrix} -\beta I & (1+\beta)I \end{bmatrix} \qquad (3.13b)$$

and the nonlinear mapping $\Delta\colon \mathbb{R}^n \to \mathbb{R}^n$ is $\Delta(y) := \nabla f(y)$. Moreover, the state vector $\psi^t$ and the input $y^t$ to $\Delta$ are determined by

$$\psi^t \;:=\; \begin{bmatrix} x^t \\ x^{t+1} \end{bmatrix}, \qquad y^t \;:=\; (1+\beta)x^{t+1} \,-\, \beta x^t. \qquad (3.13c)$$
For smooth and strongly convex functions $f \in \mathcal{F}_m^L$, $\Delta$ satisfies the quadratic inequality [9, Lemma 6]

$$\begin{bmatrix} y - y_0 \\ \Delta(y) - \Delta(y_0) \end{bmatrix}^T \Pi \begin{bmatrix} y - y_0 \\ \Delta(y) - \Delta(y_0) \end{bmatrix} \;\ge\; 0 \qquad (3.14a)$$

for all $y, y_0 \in \mathbb{R}^n$, where the matrix $\Pi$ is given by

$$\Pi \;:=\; \begin{bmatrix} -2mLI & (L+m)I \\ (L+m)I & -2I \end{bmatrix}. \qquad (3.14b)$$

Using $u^t := \Delta(y^t)$ and $y^t := C_y\psi^t$ and evaluating (3.14a) at $y = y^t$ and $y_0 = 0$ leads to

$$\begin{bmatrix} \psi^t \\ u^t \end{bmatrix}^T M_1 \begin{bmatrix} \psi^t \\ u^t \end{bmatrix} \;\ge\; 0 \qquad (3.14c)$$

where

$$M_1 \;:=\; \begin{bmatrix} C_y^T & 0 \\ 0 & I \end{bmatrix} \Pi \begin{bmatrix} C_y & 0 \\ 0 & I \end{bmatrix} \;=\; \begin{bmatrix} -2mL\,C_y^T C_y & (L+m)C_y^T \\ (L+m)C_y & -2I \end{bmatrix}. \qquad (3.14d)$$
In Lemma 3, we provide an upper bound on the difference between the objective function at two consecutive iterations of Nesterov's algorithm. In combination with (3.14), this result allows us to utilize Lyapunov functions of the form (3.12) to establish an upper bound on transient growth. We note that variations of this lemma have been presented in [70, Lemma 5.2] and in [34, Lemma 3].

Lemma 3. Along the solution of Nesterov's accelerated algorithm (3.13), the function $f \in \mathcal{F}_m^L$ with $\kappa := L/m$ satisfies

$$f(x^{t+2}) \,-\, f(x^{t+1}) \;\le\; \frac{1}{2} \begin{bmatrix} \psi^t \\ u^t \end{bmatrix}^T M_2 \begin{bmatrix} \psi^t \\ u^t \end{bmatrix} \qquad (3.15a)$$

where the matrix $M_2$ is given by

$$M_2 \;:=\; \begin{bmatrix} -m\,C_2^T C_2 & C_2^T \\ C_2 & -\alpha(2 - \alpha L)I \end{bmatrix}, \qquad C_2 \;:=\; \begin{bmatrix} -\beta I & \beta I \end{bmatrix}. \qquad (3.15b)$$
Using Lemma 3, we next demonstrate how a Lyapunov function of the form (3.12) with $\theta := 2\theta_2$ and $C := [\,0 \;\; I\,]$, in conjunction with property (3.14) of the nonlinear mapping $\Delta$, can be utilized to obtain an upper bound on $\|x^t\|_2^2$.

Lemma 4. Let $M_1$ be given by (3.14d) and let $M_2$ be defined in Lemma 3. Then, for any positive semi-definite matrix $X$ and nonnegative scalars $\theta_1$ and $\theta_2$ that satisfy

$$W \;:=\; \begin{bmatrix} A^T X A - X & A^T X B \\ B^T X A & B^T X B \end{bmatrix} \,+\, \theta_1 M_1 \,+\, \theta_2 M_2 \;\preceq\; 0 \qquad (3.16)$$

the transient growth of Nesterov's accelerated algorithm (3.13) for all $t \ge 1$ is upper bounded by

$$\|x^t\|_2^2 \;\le\; \frac{\lambda_{\max}(X)\|x^0\|_2^2 \,+\, \big(\lambda_{\max}(X) + L\theta_2\big)\|x^1\|_2^2}{\lambda_{\min}(X) \,+\, m\theta_2}. \qquad (3.17)$$
In Lemma 4, the Lyapunov function candidate $V(\psi) := \psi^T X \psi + 2\theta_2 f([\,0 \;\; I\,]\psi)$ is used to show that the state vector $\psi^t$ is confined within the sublevel set $\{\psi \in \mathbb{R}^{2n} \,|\, V(\psi) \le V(\psi^0)\}$ associated with $V(\psi^0)$. We next establish an order-wise tight upper bound on $\|x^t\|_2$ that scales linearly with $\sqrt{\kappa}$ by finding a feasible point to LMI (3.16) in Lemma 4.
Theorem 4. For $f \in \mathcal{F}_m^L$ with the condition number $\kappa := L/m$, the iterates of Nesterov's accelerated algorithm (3.13) for any stabilizing parameters $\alpha \le 1/L$ and $\beta < 1$ satisfy

$$\|x^t\|_2^2 \;\le\; \kappa \left( \frac{1+\beta^2}{\alpha\beta L}\,\|x^0\|_2^2 \,+\, \Big(1 + \frac{1+\beta^2}{\alpha\beta L}\Big)\|x^1\|_2^2 \right). \qquad (3.18a)$$

Furthermore, for the conventional values of the parameters

$$\alpha \;=\; 1/L, \qquad \beta \;=\; (\sqrt{\kappa}-1)/(\sqrt{\kappa}+1) \qquad (3.18b)$$

the largest transient error, defined in (3.2), satisfies

$$\frac{\sqrt{2}\,(\sqrt{\kappa}-1)^2}{e\sqrt{\kappa}} \;\le\; \sup_{\{t \in \mathbb{N},\, f \in \mathcal{F}_m^L\}} \mathrm{Tr}(t) \;\le\; \sqrt{3\kappa + \frac{4\kappa}{\kappa-1}}. \qquad (3.18c)$$
For balanced initial conditions, i.e., $x^1 = x^0$, Nesterov established the upper bound $\sqrt{\kappa+1}$ on $\mathrm{Tr}$ in [13]. Theorem 4 shows that similar trends hold without restriction on the initial conditions. Linear scaling of the upper and lower bounds with $\sqrt{\kappa}$ illustrates a potential drawback of using Nesterov's accelerated algorithm in applications with limited time budgets. As $\kappa \to \infty$, the ratio of these bounds converges to $e\sqrt{3/2} \approx 3.33$, thereby demonstrating that the largest transient response over all $f \in \mathcal{F}_m^L$ is within a factor of 3.33 of the bounds established in Theorem 4.
3.4 Concluding remarks
We have examined the impact of acceleration on transient responses of first-order optimization
algorithms. Without imposing restrictions on initial conditions, we establish bounds on the largest
value of the Euclidean distance between the optimization variable and the global minimizer. For
convex quadratic problems, we utilize the tools from linear systems theory to fully capture transient
responses, and for general strongly convex problems we employ the theory of integral quadratic
constraints to establish an upper bound on transient growth. This upper bound is proportional
to the square root of the condition number and we identify quadratic problem instances for which
accelerated algorithms generate transient responses which are within a constant factor of this upper
bound. Future directions include extending our analysis to nonsmooth optimization problems and
devising algorithms that balance acceleration with quality of transient responses.
3.5 Proofs
3.5.1 Proofs of Section 3.2
We first present a technical lemma that we use in our proofs.
Lemma 5. For any $\rho \in [1/e, 1)$, $a(t) := t\rho^t$ satisfies

$$\mathrm{argmax}_{t \ge 1}\, a(t) \;=\; -1/\log(\rho), \qquad \max_{t \ge 1}\, a(t) \;=\; -1/(e\log(\rho)).$$

Proof. Follows from the fact that $\mathrm{d}a/\mathrm{d}t = \rho^t(1 + t\log(\rho))$ vanishes at $t = -1/\log(\rho)$.
3.5.1.1 Proof of Lemma 1

For $\mu_1 \neq \mu_2$, the eigenvalue decomposition of $M$ is determined by

$$M \;=\; \frac{1}{\mu_2 - \mu_1} \begin{bmatrix} 1 & 1 \\ \mu_1 & \mu_2 \end{bmatrix} \begin{bmatrix} \mu_1 & 0 \\ 0 & \mu_2 \end{bmatrix} \begin{bmatrix} \mu_2 & -1 \\ -\mu_1 & 1 \end{bmatrix}.$$

Computing the $t$th power of the diagonal matrix and multiplying throughout completes the proof for $\mu_1 \neq \mu_2$. For $\mu_1 = \mu_2 =: \mu$, $M$ admits the Jordan canonical form

$$M \;=\; \begin{bmatrix} 1 & 0 \\ \mu & 1 \end{bmatrix} \begin{bmatrix} \mu & 1 \\ 0 & \mu \end{bmatrix} \begin{bmatrix} 1 & 0 \\ -\mu & 1 \end{bmatrix}$$

and the proof follows from

$$\begin{bmatrix} \mu & 1 \\ 0 & \mu \end{bmatrix}^t \;=\; \begin{bmatrix} \mu^t & t\mu^{t-1} \\ 0 & \mu^t \end{bmatrix}.$$
3.5.1.2 Proof of Lemma 2

From Lemma 1, it follows that

$$\begin{bmatrix} 1 & 0 \end{bmatrix} M^t \;=\; \left[\; -\sum_{i=0}^{t-2} \mu_1^{i+1}\mu_2^{t-1-i} \qquad \sum_{i=0}^{t-1} \mu_1^{i}\mu_2^{t-1-i} \;\right]$$

where $\mu_1$ and $\mu_2$ are the eigenvalues of $M$. Moreover, by the triangle inequality,

$$\left| \sum_{i=0}^{t-2} \mu_1^{i+1}\mu_2^{t-1-i} \right| \;\le\; \sum_{i=0}^{t-2} |\mu_1|^{i+1}|\mu_2|^{t-1-i} \;\le\; \sum_{i=0}^{t-2} \rho^t \;\le\; (t-1)\rho^t$$

$$\left| \sum_{i=0}^{t-1} \mu_1^{i}\mu_2^{t-1-i} \right| \;\le\; \sum_{i=0}^{t-1} |\mu_1|^{i}|\mu_2|^{t-1-i} \;\le\; \sum_{i=0}^{t-1} \rho^{t-1} \;\le\; t\rho^{t-1}.$$

Finally, for $\mu_1 = \mu_2 \in \mathbb{R}$, we have $\rho = |\mu_1| = |\mu_2|$ and these inequalities become equalities.
3.5.1.3 Proof of Theorem 1

Let $\mu_{1i}$ and $\mu_{2i}$ be the eigenvalues and let $\rho_i = \max\{|\mu_{1i}|, |\mu_{2i}|\}$ be the spectral radius of $A_i$. We can use Lemma 2 with $M := A_i$ to obtain

$$\max_{i \le r} \|C_i A_i^t\|_2^2 \;\le\; \max_{i \le r}\, \big( (t-1)^2\rho_i^{2t} + t^2\rho_i^{2t-2} \big) \;\le\; (t-1)^2\rho^{2t} + t^2\rho^{2t-2} \qquad (3.19)$$

where $\rho := \max_{i \le r} \rho_i$. For the parameters provided in Table 2.2, the matrices $A_1$ and $A_r$, which correspond to the largest and smallest non-zero eigenvalues of $Q$, i.e., $\lambda_1 = L$ and $\lambda_r = m$, respectively, have the largest spectral radius [34, Eq. (64)],

$$\rho \;=\; \rho_1 \;=\; \rho_r \;\ge\; \rho_i, \qquad i = 2, \ldots, r-1 \qquad (3.20)$$

and $A_r$ has repeated eigenvalues. Thus, we can write

$$\max_{i \le r} \|C_i A_i^t\|_2^2 \;\ge\; \big\|[\,1 \;\; 0\,]A_r^t\big\|_2^2 \;=\; (t-1)^2\rho_r^{2t} + t^2\rho_r^{2t-2} \;=\; (t-1)^2\rho^{2t} + t^2\rho^{2t-2} \qquad (3.21)$$

where the first equality follows from Lemma 2 applied to $M := A_r$ and the second equality follows from (3.20). Finally, combining (3.19) and (3.21) with $\beta < \rho$ and Proposition 1 completes the proof.
3.5.1.4 Proof of Theorem 2

Let $a(t) := t\rho^t$. Theorem 1 implies $\mathrm{Tr}^2(t) = \rho^2 a^2(t-1) + \rho^{-2}a^2(t)$ and, for $t \ge 1$, $\mathrm{Tr}(t)$ has only one critical point, which is a maximizer. Moreover, since $\mathrm{d}\,\mathrm{Tr}^2(t)/\mathrm{d}t$ is positive at $t = -1/\log(\rho)$ and negative at $t = 1 - 1/\log(\rho)$, we conclude that the maximizer lies between $-1/\log(\rho)$ and $1 - 1/\log(\rho)$. Regarding $\max_t \mathrm{Tr}(t)$, we note that $\sqrt{2}\,\rho\, a(t-1) \le \mathrm{Tr}(t) \le \sqrt{2}\,a(t)/\rho$ and the proof follows from $\max_{t \ge 1} a(t) = -1/(e\log(\rho))$ (cf. Lemma 5).
3.5.1.5 Proof of Proposition 2

Since, for all $0 \le a < 1$, we have [75]

$$a \;\le\; -\log(1-a) \;\le\; a/(1-a)$$

the rates $\rho_{\mathrm{hb}} = 1 - 2/(\sqrt{\kappa}+1)$ and $\rho_{\mathrm{na}} = 1 - 2/\sqrt{3\kappa+1}$ satisfy

$$2/(\sqrt{\kappa}+1) \;\le\; -\log(\rho_{\mathrm{hb}}) \;\le\; 2/(\sqrt{\kappa}-1)$$
$$2/\sqrt{3\kappa+1} \;\le\; -\log(\rho_{\mathrm{na}}) \;\le\; 2/(\sqrt{3\kappa+1}-2).$$

The conditions on $\kappa$ ensure that $\rho_{\mathrm{hb}}$ and $\rho_{\mathrm{na}}$ are not smaller than $1/e$, and we combine the above bounds with Theorem 2 to complete the proof.
3.5.2 Proof of Theorem 3

The condition $x^0 = x^1$ is equivalent to $\hat{x}_i^0 = \hat{x}_i^1$ in (2.10). Thus, for $\lambda_i = 0$, equation (3.7) yields $\hat{x}_i^t = \hat{x}_i^0 = \hat{x}_i^\star$. For $\lambda_i \neq 0$, we have $\hat{\psi}_i^0 - \hat{\psi}_i^\star = [\,\hat{x}_i^0 \;\; \hat{x}_i^0\,]^T$ and, hence,

$$\frac{\|x^t - x^\star\|_2}{\|x^0 - x^\star\|_2} \;\le\; \max_{i \le r} \frac{|\hat{x}_i^t - \hat{x}_i^\star|}{|\hat{x}_i^0 - \hat{x}_i^\star|} \;=\; \max_{i \le r} \left| C_i A_i^t \begin{bmatrix} 1 \\ 1 \end{bmatrix} \right| \qquad (3.22a)$$

where the equality follows from (3.5). To bound the right-hand side, we use Lemma 1 with $M = A_i$ to obtain

$$\omega_t(\mu_{1i}, \mu_{2i}) \;=\; \begin{bmatrix} 1 & 0 \end{bmatrix} A_i^t \begin{bmatrix} 1 & 1 \end{bmatrix}^T \qquad (3.22b)$$

where $\mu_{1i}$ and $\mu_{2i}$ are the eigenvalues of $A_i$ and

$$\omega_t(z_1, z_2) \;:=\; \sum_{i=0}^{t-1} z_1^i z_2^{t-1-i} \;-\; \sum_{i=1}^{t-1} z_1^i z_2^{t-i} \qquad (3.23)$$

for any $t \in \mathbb{N}$ and $z_1, z_2 \in \mathbb{C}$.
For Nesterov's accelerated method, the characteristic polynomial $\det(zI - A_i) = z^2 - (1+\beta)h_i z + \beta h_i$ yields $\mu_{1i}, \mu_{2i} = \big((1+\beta)h_i \pm \sqrt{(1+\beta)^2 h_i^2 - 4\beta h_i}\big)/2$, where $\lambda_i$ is the eigenvalue of $Q$ and $h_i := 1 - \alpha\lambda_i$. For the parameters provided in Table 2.2, it is easy to show that:

• For $\lambda_i \in [m, 1/\alpha]$, we have $h_i \in [0, 4\beta/(1+\beta)^2]$ and $\mu_{1i}$ and $\mu_{2i}$ are complex conjugates of each other and lie on a circle of radius $\beta/(1+\beta)$ centered at $z = \beta/(1+\beta)$.

• For $\lambda_i \in (1/\alpha, L]$, $\mu_{1i}$ and $\mu_{2i}$ are real with opposite signs and can be sorted to satisfy $|\mu_{2i}| < |\mu_{1i}|$ with $-1 \le \mu_{1i} \le 0 \le \mu_{2i} \le 1/3$.

The next lemma provides a unit bound on $|\omega_t(\mu_{1i}, \mu_{2i})|$ in both of the above cases.

Lemma 6. For any $z = l\cos(\theta)e^{\mathrm{i}\theta} \in \mathbb{C}$ with $|\theta| \le \pi/2$ and $0 \le l \le 1$, and for any real scalars $(z_1, z_2)$ such that $-1 \le z_1 \le 0 \le z_2 \le 1/3$ and $z_2 < -z_1$, the function $\omega_t$ in (3.23) satisfies $|\omega_t(z, \bar{z})| \le 1$ and $|\omega_t(z_1, z_2)| \le 1$ for all $t \in \mathbb{N}$, where $\bar{z}$ is the complex conjugate of $z$.
Proof. Since $\omega_1(z_1, z_2) = 1$, we assume $t \ge 2$. We first address $\theta = 0$, i.e., $z = l \in \mathbb{R}$ and $\omega_t(z, \bar{z}) = t l^{t-1} - (t-1)l^t$. We note that $\mathrm{d}\omega_t/\mathrm{d}l = t(t-1)(l^{t-2} - l^{t-1}) = 0$ only if $l \in \{0, 1\}$. This in combination with $l \in [0, 1]$ yields $|\omega_t(l, l)| \le \max\{|\omega_t(1,1)|, |\omega_t(0,0)|\} \le 1$.

To address $\theta \neq 0$, we note that $b(t) := \sin(t\theta)/t$ satisfies

$$|b(t)| \;\le\; |\sin(\theta)| \qquad (3.24)$$

which follows from

$$|\sin(t\theta)| \;=\; |\sin((t-1)\theta)\cos(\theta) + \cos((t-1)\theta)\sin(\theta)| \;\le\; |\sin((t-1)\theta)| + |\sin(\theta)|.$$

For $z = l\cos(\theta)e^{\mathrm{i}\theta}$, we have

$$\omega_t(z, \bar{z}) \;=\; \big(z^t - \bar{z}^t - z\bar{z}(z^{t-1} - \bar{z}^{t-1})\big)/(z - \bar{z}) \;=\; (l\cos(\theta))^{t-1}\big(\sin(t\theta) - l\cos(\theta)\sin((t-1)\theta)\big)/\sin(\theta). \qquad (3.25)$$

Thus, $\mathrm{d}\omega_t/\mathrm{d}l = 0$ only if $l = 0$, $l = 1$, or $l^\star := b(t)/(b(t-1)\cos(\theta))$. Moreover, it is easy to show that

$$\omega_t(z, \bar{z}) \;=\; \begin{cases} 0, & l = 0 \\ (\cos(\theta))^{t-1}\cos((t-1)\theta), & l = 1 \\ (l^\star\cos(\theta))^{t-1}\, b(t)/\sin(\theta), & l = l^\star. \end{cases}$$

Combining this with (3.24) completes the proof for complex $z$.
To address the case of $z_1, z_2 \in \mathbb{R}$, we note that $\omega_t(z_1, z_2) = \big(z_1^t(1-z_2) - z_2^t(1-z_1)\big)/(z_1 - z_2)$. Thus, differentiating with respect to $z_1$ yields

$$\frac{\mathrm{d}\omega_t}{\mathrm{d}z_1} \;=\; (1-z_2)\, \frac{(t-1)z_1^{t-1} \,-\, z_2\sum_{i=0}^{t-2} z_1^{t-2-i}z_2^{i}}{z_1 - z_2}.$$

Moreover, from $|z_2| < |z_1|$, it follows that

$$(t-1)\big|z_1^{t-1}\big| \;>\; |z_2| \sum_{i=0}^{t-2} \big|z_1^{t-2-i}z_2^{i}\big| \;\ge\; \left| z_2 \sum_{i=0}^{t-2} z_1^{t-2-i}z_2^{i} \right|.$$

Therefore, $\mathrm{d}\omega_t/\mathrm{d}z_1 \neq 0$ over our range of interest for $z_1, z_2$. Thus, $\omega_t(z_1, z_2)$ may take its extremum only at the boundary $z_1 \in \{0, -1\}$, i.e., $|\omega_t(z_1, z_2)| \le \max\{|\omega_t(0, z_2)|, |\omega_t(-1, z_2)|\}$. Finally, it is easy to show that $|\omega_t(0, z_2)| = z_2^{t-1} < 1$ and $|\omega_t(-1, z_2)| = \big|(-1)^t(z_2 - 1) + 2z_2^t\big|/(1+z_2) \le 1$.
We complete the proof of Theorem 3 by noting that the eigenvalues of $A_i$ for Nesterov's algorithm with the parameters provided in Table 2.2 satisfy the conditions in Lemma 6.
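The unit bound of Lemma 6 can be probed by sampling admissible arguments of $\omega_t$ in (3.23) at random (a verification sketch; the sample size and the random seed are arbitrary):

```python
import numpy as np

def omega(z1, z2, t):
    """omega_t(z1, z2) from (3.23)."""
    i = np.arange(t)
    s1 = np.sum(z1 ** i * z2 ** (t - 1 - i))
    j = np.arange(1, t)
    s2 = np.sum(z1 ** j * z2 ** (t - j))
    return s1 - s2

rng = np.random.default_rng(0)
for _ in range(100):
    t = int(rng.integers(1, 30))
    # complex case: z = l*cos(theta)*exp(i*theta) with 0 <= l <= 1, |theta| <= pi/2
    l, th = rng.uniform(0, 1), rng.uniform(-np.pi / 2, np.pi / 2)
    z = l * np.cos(th) * np.exp(1j * th)
    assert abs(omega(z, np.conj(z), t)) <= 1 + 1e-9
    # real case: -1 <= z1 <= 0 <= z2 <= 1/3 with z2 < -z1
    z1 = rng.uniform(-1, -1 / 3)
    z2 = rng.uniform(0, 1 / 3)
    assert abs(omega(z1, z2, t)) <= 1 + 1e-9
```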
3.5.3 Proofs of Section 3.3
3.5.3.1 Proof of Lemma 3
For any $f \in \mathcal{F}_m^L$, the $L$-Lipschitz continuity of the gradient $\nabla f$,

$$f(x^{t+2}) \,-\, f(y^t) \;\le\; (\nabla f(y^t))^T(x^{t+2} - y^t) \,+\, \frac{L}{2}\|x^{t+2} - y^t\|_2^2 \qquad (3.26a)$$

and the $m$-strong convexity of $f$,

$$f(y^t) \,-\, f(x^{t+1}) \;\le\; (\nabla f(y^t))^T(y^t - x^{t+1}) \,-\, \frac{m}{2}\|y^t - x^{t+1}\|_2^2 \qquad (3.26b)$$

can be used to show that (3.15) holds along the solution of Nesterov's accelerated algorithm (3.13). In particular, for (3.13) we have $u^t := \nabla f(y^t)$ and

$$x^{t+2} - y^t \;=\; -\alpha u^t, \qquad y^t - x^{t+1} \;=\; \beta(x^{t+1} - x^t) \;=\; \begin{bmatrix} -\beta I & \beta I \end{bmatrix}\psi^t. \qquad (3.27)$$

Substituting (3.27) into (3.26a) and (3.26b) and adding the resulting inequalities completes the proof.
3.5.3.2 Proof of Lemma 4

Pre- and post-multiplication of LMI (3.16) by $(\eta^t)^T$ and $\eta^t := [\,(\psi^t)^T \;\; (u^t)^T\,]^T$ yields

$$0 \;\ge\; (\eta^t)^T \begin{bmatrix} A^T X A - X & A^T X B \\ B^T X A & B^T X B \end{bmatrix} \eta^t \,+\, \theta_1(\eta^t)^T M_1 \eta^t \,+\, \theta_2(\eta^t)^T M_2 \eta^t \;\ge\; (\eta^t)^T \begin{bmatrix} A^T X A - X & A^T X B \\ B^T X A & B^T X B \end{bmatrix} \eta^t \,+\, \theta_2(\eta^t)^T M_2 \eta^t \qquad (3.28)$$

where the second inequality follows from (3.14c). This yields

$$0 \;\le\; \hat{V}(\psi^t) \,-\, \hat{V}(\psi^{t+1}) \,-\, \theta_2(\eta^t)^T M_2 \eta^t \qquad (3.29)$$

where $\hat{V}(\psi) := \psi^T X \psi$. Also, since Lemma 3 implies

$$-(\eta^t)^T M_2 \eta^t \;\le\; 2\big(f(x^{t+1}) - f(x^{t+2})\big) \qquad (3.30)$$

combining (3.29) and (3.30) yields

$$\hat{V}(\psi^{t+1}) \,+\, 2\theta_2 f(x^{t+2}) \;\le\; \hat{V}(\psi^t) \,+\, 2\theta_2 f(x^{t+1}).$$

Thus, using induction, we obtain the uniform upper bound

$$\hat{V}(\psi^t) \,+\, 2\theta_2 f(x^{t+1}) \;\le\; \hat{V}(\psi^0) \,+\, 2\theta_2 f(x^1). \qquad (3.31)$$

This allows us to bound $\hat{V}$ by writing

$$\lambda_{\min}(X)\|\psi\|_2^2 \;\le\; \hat{V}(\psi) \;\le\; \lambda_{\max}(X)\|\psi\|_2^2. \qquad (3.32a)$$

We can also upper and lower bound $f \in \mathcal{F}_m^L$ as

$$m\|x\|_2^2 \;\le\; 2f(x) \;\le\; L\|x\|_2^2. \qquad (3.32b)$$

Finally, combining (3.31) and (3.32) yields

$$\lambda_{\min}(X)\|\psi^t\|_2^2 \,+\, m\theta_2\|x^{t+1}\|_2^2 \;\le\; \lambda_{\max}(X)\|\psi^0\|_2^2 \,+\, L\theta_2\|x^1\|_2^2.$$

We complete the proof by noting that $\|x^{t+1}\|_2 \le \|\psi^t\|_2$.
3.5.3.3 Proof of Theorem 4

To prove (3.18a), we need to find a feasible solution for $\theta_1$, $\theta_2$, and $X$ in terms of the condition number $\kappa$. Let us define

$$X \;:=\; \begin{bmatrix} x_1 I & x_0 I \\ x_0 I & x_2 I \end{bmatrix} \;=\; x_2 \begin{bmatrix} \beta^2 I & -\beta I \\ -\beta I & I \end{bmatrix}, \qquad \theta_2 \;:=\; \theta_1(L+m)\beta/(1-\beta), \qquad x_2 \;:=\; ((L+m)\theta_1 + \theta_2)/\alpha \;=\; \theta_2/(\alpha\beta). \qquad (3.33)$$

If (3.33) holds, it is easy to verify that $X \succeq 0$ with $\lambda_{\min}(X) = 0$, $\lambda_{\max}(X) = (1+\beta^2)x_2 = \theta_2(1+\beta^2)/(\alpha\beta)$, and $A^T X A - X = 0$. Moreover, the matrix $W$ on the left-hand side of (3.16) is block diagonal, $W := \mathrm{diag}(W_1, W_2)$, and negative semi-definite for all $\alpha \le 1/L$, where

$$W_1 \;=\; -m\big(2\theta_1 L\, C_y^T C_y \,+\, \theta_2 C_2^T C_2\big) \;\preceq\; 0$$
$$W_2 \;=\; -\big((2 - \alpha(L+m))\theta_1 \,+\, \alpha(1-\alpha L)\theta_2\big) I \;\preceq\; 0.$$

Thus, the choice of $(\theta_1, \theta_2, X)$ in (3.33) satisfies the conditions of Lemma 4. Using the expressions for the largest and smallest eigenvalues of the matrix $X$ in equation (3.17) in Lemma 4 leads to the upper bound on $\|x^t\|_2^2$ in (3.18a). Furthermore, from (3.18a) we have

$$\|x^t\|_2^2 \;\le\; \kappa\big(1 + (1+\beta^2)/(\alpha\beta L)\big)\|\psi^0\|_2^2$$

and the upper bound in (3.18c) follows from the fact that, for $\alpha$ and $\beta$ in (3.18b), $1 + (1+\beta^2)/(\alpha\beta L) = 3 + 4/(\kappa-1)$.

To obtain the lower bound in (3.18c), we employ our framework for quadratic objective functions in Section 3.2. In particular, for the parameters $\alpha$ and $\beta$ in (3.18b), the largest spectral radius $\rho(A_i)$ corresponds to $A_n$, which is associated with the smallest eigenvalue $\lambda_n = m$ of $Q$. Since $A_n$ has repeated real eigenvalues $\rho = 1 - 1/\sqrt{\kappa}$, using similar arguments as in Theorem 1 for quadratic problems we obtain

$$\mathrm{Tr}(t_{\max}) \;=\; \sqrt{(t_{\max}-1)^2\rho^{2t_{\max}} + t_{\max}^2\rho^{2(t_{\max}-1)}} \;\ge\; \sqrt{2}\,(t_{\max}-1)\rho^{t_{\max}} \;\ge\; \sqrt{2}\,(\sqrt{\kappa}-1)^2/(e\sqrt{\kappa})$$

which completes the proof.
Chapter 4
The effect of averaging on accelerated first-order optimization
algorithms for strongly convex quadratic problems
In this chapter we study the effect of averaging over optimization iterates on convergence speed and
noise amplification of two-step accelerated algorithms in the presence of additive white stochastic disturbances. For strongly convex quadratic problems, we show that averaging over the entire
algorithmic history eliminates steady-state variance of the averaged output at the expense of slowing down convergence to a sub-linear rate. In contrast, finite window averaging converges with a
linear rate but it leads to a non-zero value of the steady-state variance. While this value is smaller
than the steady-state variance of the iterates of the heavy-ball algorithm, it has the same orderwise
dependence on the condition number. We also show that the finite window averaging increases the
upper bound on the expected error at iteration t by a constant factor that depends on the length of
the averaging window.
4.1 Introduction
One of the simplest approaches to mitigating the cost of increased steady-state variance associated
with accelerated algorithms is to average the algorithmic iterates over time. Intuitively, the average
output is expected to produce a smaller steady-state variance, given that in general sample variance
decreases with sample size. Similarly, the expected value of the error in the averaged output at a
given iteration t should be larger than the error resulting from non-averaged algorithmic iterates. In
this chapter, we quantify the effects of averaging on expected error and expected squared error of
the general class of two-step accelerated algorithms applied to strongly convex quadratic problems.
Averaging of stochastic gradient methods was first proposed by Polyak [17, 42] in the context
of stochastic approximation, where it was shown that applying averaging to stochastic gradient
descent with a more slowly decaying step-size resulted in optimal convergence behavior robust
to gradient noise. In general, for first-order algorithms in the presence of gradient noise, averaging over the entire algorithmic history leads to better convergence rates as it allows larger step-sizes [43, 76, 77]. Subsequent work has examined the application of averaged stochastic gradient descent, sometimes called Polyak-Ruppert averaging, to various problems and investigated the
link between averaging and step-size [44, 46]. In addition, [45] introduced accelerated stochastic
averaged gradient descent specifically for least-squares regression problems with quadratic objective function. They show that with a constant step size, averaging an accelerated algorithm with
parameters designed for strongly convex objective functions recovers the optimal convergence rate
for the class of non-strongly convex functions, that is, optimization error on the order of $O(1/t)$, as opposed to $O(1/t^2)$ achieved by Nesterov's accelerated method.
Our work additionally introduces the practice of averaging over a moving window of fixed
length, rather than the entire algorithmic history, and expands upon the convergence behavior and
steady-state value of the expected squared error in the optimization variable. Furthermore, we
consider the effect of condition number κ of the objective function on algorithmic performance. In
particular, we investigate whether output averaging can reduce steady state variance sufficiently to
overcome the fundamental limitation on the product of variance and settling time given in [34–36]
in terms of the condition number κ.
Another recent work [71] shows that averaging over the entire algorithmic history of the heavy-ball algorithm for quadratic and general strongly convex problems offers a reduction of worst-case transient growth. We additionally consider the impact of averaging over a moving window of a fixed length, but focus on the asymptotic behavior of the expected value and variance of the averaged output. We show that while averaging over the entire algorithmic history eliminates steady-state variance, it reduces the convergence speed to a sub-linear rate. In contrast, averaging over a moving window of fixed length $d$ offers some reduction in steady-state variance while maintaining a linear convergence rate.
We consider two approaches to averaging: first, the average of $x^t$ over a moving window of a fixed length $d$,

$$z_d^t \;=\; \frac{1}{d}\sum_{k=t-d+1}^{t} x^k, \qquad (4.1a)$$

and second, the running average over the entire algorithmic history,

$$z_t^t \;=\; \frac{1}{t}\sum_{k=1}^{t} x^k. \qquad (4.1b)$$

For iterations with $t < d$, the definition of $z_d^t$ in (4.1a) is not applicable and we define $z_d^t = z_t^t$ when $t < d$. In addition, we define the averaged outputs associated with subsystem $i$, corresponding to the eigenvalue $\lambda_i$ of the Hessian $Q$, as

$$\hat{z}_d^t(\lambda_i) \;:=\; \frac{1}{d}\sum_{k=t-d+1}^{t} \hat{x}_i^k, \qquad \hat{z}_t^t(\lambda_i) \;:=\; \frac{1}{t}\sum_{k=1}^{t} \hat{x}_i^k. \qquad (4.2)$$
In Section 4.2, we summarize our main results that quantify the influence of averaging on the
trade-off between convergence speed and noise amplification for the heavy-ball method applied to
strongly convex quadratic problems. In Section 4.5, we provide an example that demonstrates the
merits and the effectiveness of different averaging schemes on convergence properties and variance
amplification for the heavy-ball method. In Section 4.6, we conclude our presentation. The proofs
of technical results are relegated to the appendix.
4.2 Summary of main results
In this section we will summarize our main results concerning the expected value and variance
of the optimization error of the averaged output of the two-step momentum algorithm, which will
be discussed in greater detail in Sections 4.3 and 4.4 respectively. Our first result concerns the
convergence rate of the expected optimization error.
Theorem 5 (Convergence rate). Let the two-step momentum algorithm (2.2) with the strongly convex quadratic objective function $f$ converge linearly with rate $\rho$,

$$\|\mathbb{E}(x^t)\| \;\le\; c\,\rho^t\, \|\mathbb{E}(\psi^0)\|. \qquad (4.3a)$$

Then the expected optimization error of the average over a window of fixed integer length $d$, $\mathbb{E}(z_d^t)$, converges linearly with rate $\rho$,

$$\|\mathbb{E}(z_d^t)\| \;\le\; c\, \frac{1 - \rho^d}{d(1-\rho)\rho^{d-1}}\, \rho^t\, \|\mathbb{E}(\psi^0)\|. \qquad (4.3b)$$

When averaging is done over the entire algorithmic history, the expected error converges sub-linearly with rate $1/t$,

$$\|\mathbb{E}(z_t^t)\| \;\le\; c\, \frac{\rho\,(1-\rho^t)}{(1-\rho)\,t}\, \|\mathbb{E}(\psi^0)\| \qquad (4.3c)$$

where $c$ is the same positive constant in (4.3a), (4.3b), and (4.3c).
Averaging over a window of fixed length d may increase the expected error at iteration t but preserves the linear convergence rate ρ, while averaging over the entire history reduces the convergence rate to sub-linear. The next theorem considers the effect of windowed averaging on transient growth.
Theorem 6 (Transient peak). Let the two-step momentum algorithm (2.2) with the strongly convex quadratic objective function f converge linearly with rate ρ. Then the worst-case transient peak of the expected error of z_d^t, where d is a fixed integer,

    max_{t≥d} sup_{∥ψ^0∥₂=1} ∥E(z_d^t)∥₂,

is monotonically decreasing in d.
Precise expressions for the transient peak of the averaged output and the iteration at which it occurs are given in Section 4.3, where we demonstrate how averaging smooths the transient peak by both decreasing the worst-case magnitude and increasing the iteration at which it occurs. When d grows large, the magnitude and location of the worst-case transient peak both vary approximately linearly with d. However, for sufficiently large window length (d > −2 ln 2/ln ρ), the transient peak of the windowed average is lower bounded by that of the average over the entire history, meaning that there is an unavoidable limitation to the reduction in transient growth that averaging can provide. For both averaging schemes, for parameters which achieve the optimal convergence rate of the heavy-ball algorithm, the iteration and magnitude of the worst-case transient peak both scale with the square root of the condition number of the objective function, as is the case for the non-averaged algorithm. The following theorems concern the variance of the optimization error.
Theorem 7 (Variance of running average). Let the two-step momentum algorithm (2.2) with the strongly convex quadratic objective function f converge linearly with rate ρ. Then the variance at time t of the averaged output over the entire algorithmic history, V_t^t := E(∥z_t^t∥²), converges to zero at the sub-linear rate 1/t, with

    n/(1+ρ)⁴ ≤ lim_{t→∞} t V_t^t ≤ n/(1−ρ)⁴

for t > 1, where n is the problem dimension, with Q ∈ ℝ^{n×n}.
When taking the average of the algorithmic output over all iterations, the variance converges to zero. This result is consistent with previous work on averaging of stochastic gradient descent [17, 45]. The behavior also matches that of first-order algorithms with decaying stepsize [78], which also achieve sub-linear convergence of the expected error and expected squared error towards zero. However, when taking the average over the entire history, the variance is non-monotonic in the iteration and, in a similar fashion to the expected error, exhibits transient growth in early iterations before decreasing asymptotically towards zero. Next, we consider the variance of the averaged algorithmic output over a fixed window length.
Theorem 8 (Variance of windowed average). Let the two-step momentum algorithm (2.2) with the strongly convex quadratic objective function f converge linearly with rate ρ. Then the variance at time t of the averaged output over a fixed window of length d ≥ 1, given by V_d^t := E(∥z_d^t∥²), converges to the steady-state value V_d^∞ as t goes to infinity. For a fixed window length d ≥ 1, the total variance is reduced by a factor of approximately 1/d, with

    n/(d(1+ρ)⁴) ≤ V_d^∞ ≤ n/(d(1−ρ)⁴),

where n is the problem dimension, with Q ∈ ℝ^{n×n}.
Theorem 9 (Product of variance and settling time). Let the two-step momentum algorithm (2.2) with the strongly convex quadratic objective function f converge linearly with rate ρ. Then the product of the steady-state variance of the averaged output over a fixed window of length d ≥ 1, V_d^∞, and the settling time T_s = 1/(1−ρ) is bounded by

    V_d^∞ × T_s ≥ κ²/(64 d),

where κ = L/m is the condition number of the problem.
The bounds in Theorem 8 are given by approximations of the maximal and minimal possible modal contributions to variance, multiplied by the problem dimension n, with the approximations becoming more accurate as d grows larger. Together, both theorems indicate that while increasing the length of the averaging window reduces the steady-state variance at a rate of approximately 1/d, the dependence on the condition number of the trade-off between convergence speed and accuracy is unchanged. In the next section, we discuss the results presented in Theorems 5 and 6.
4.3 Convergence rate and transient growth
In this section, we examine the effect of averaging on the expected error ∥E(z_d^t − x⋆)∥, where for ease of notation we set the optimal value x⋆ = 0 without loss of generality. Theorem 5 describes the asymptotic behavior of the expected error and is easily proven using geometric sums.
Proof of Theorem 5. For a given window length d, the error vector z_d^t is a linear combination of the terms x^t, . . . , x^{t−d+1}, the norms of which all approach zero at rate ρ. In particular, given the convergence bound

    ∥E(x^t)∥ ≤ ∥C∥ ∥E(ψ^t)∥ ≤ c ρ^t ∥E(ψ^0)∥

introduced in (2.13), we can write

    ∥E(z_d^t)∥ = ∥(1/d) ∑_{k=t−d+1}^{t} E(x^k)∥ ≤ (1/d) ∑_{k=t−d+1}^{t} ∥E(x^k)∥ ≤ (1/d) ∑_{k=t−d+1}^{t} c ρ^k ∥E(ψ^0)∥ = c ρ^t (ρ^{1−d}(1−ρ^d)/(d(1−ρ))) ∥E(ψ^0)∥,    (4.4)

where the additional constant factor is given by the partial geometric sum of ρ^k. When considering the average over the entire algorithmic history, we can use the same technique to recover

    ∥E(z_t^t)∥ ≤ c (ρ(1−ρ^t)/(1−ρ)) (1/t) ∥E(ψ^0)∥,

where we have now considered the geometric sum of ρ^k from 1 to t.
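The constant factor in (4.4) is nothing more than the partial geometric sum ∑_{k=t−d+1}^{t} ρ^k written in closed form; a quick numerical sanity check of this identity (the values of ρ, t, and d below are arbitrary test points):

```python
# Check the partial geometric sum identity behind (4.4):
# sum_{k=t-d+1}^{t} rho^k = rho^t * rho^(1-d) * (1 - rho^d) / (1 - rho).

def partial_geometric_sum(rho, t, d):
    return sum(rho ** k for k in range(t - d + 1, t + 1))

def geometric_closed_form(rho, t, d):
    return rho ** t * rho ** (1 - d) * (1 - rho ** d) / (1 - rho)
```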
The linear convergence of ∥E(z_d^t)∥ is expected, as this output term is in fact a linear combination of d terms, all of which converge to the optimal value in expectation with rate ρ. The bound on ∥E(z_t^t)∥, on the other hand, while easily recovered from the bound on ∥E(z_d^t)∥ by setting d = t in (4.3b), demonstrates that allowing the window length to vary with time introduces significantly different convergence behavior. Namely, in contrast to ∥E(x^t)∥ and ∥E(z_d^t)∥, ∥E(z_t^t)∥ does not enjoy a linear convergence rate; rather, it converges to zero at the sub-linear rate 1/t. While the expected values of the algorithmic iterates and of the averaged output over the fixed window length d converge at the same linear rate ρ, the bound on ∥E(z_d^t)∥ is larger than the bound on ∥E(x^t)∥ by a factor that depends on ρ and d, namely (1 − ρ^d)/(d(1 − ρ)ρ^{d−1}). This factor is a monotonically increasing function of d, thereby suggesting that averaging over a longer window d increases the expected error ∥E(z_d^t)∥ at a given iteration t. Furthermore, since the second derivative of this factor with respect to d is always positive, the relative increase in the error bound in (4.3b) grows with each additional increase in d.
As we discussed in Chapter 2, accelerated algorithms trade increased convergence rate against unfavorable transient growth in the non-asymptotic time frame. Theorem 6 describes how fixed-window averaging affects the transient behavior. In particular, we present bounds on the worst-case transient growth; that is, we consider how large the 2-norm of the expected error, ∥E(z_d^t)∥₂, grows given initial conditions which maximize this growth. The proof of Theorem 6 is given below.
Proof of Theorem 6. For a linear system of the form given in (2.10), while the asymptotic behavior is determined by the eigenvalues of A, the non-asymptotic behavior depends on the singular values of the state transition matrix C A^t. For the averaged output, we introduce the state transition matrix Φ, defined by

    Φ_d^t := (1/d) ∑_{k=t−d+1}^{t} C A^k,    E(z_d^t) = Φ_d^t E(ψ^0).    (4.5)
As shown in equation (2.18), the worst-case transient response of the averaged output z_d^t is determined by the largest singular value of the matrix Φ_d^t, which is bounded by the singular values of the n subsystems given in (2.10), with

    σ²_max(Φ_d^t) ≤ max_i σ²(Φ̂_d^t(λ_i)),    Φ̂_d^t(λ_i) := (1/d) ∑_{k=t−d+1}^{t} Ĉ (Â_i)^k.    (4.6)

Thus, in order to bound the transient behavior, we must determine the largest singular value of the modal state transition matrix.
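For a single 2×2 modal block, σ_max(Φ̂_d^t) can be evaluated numerically. The sketch below assumes the companion form of Â from (4.15a) and takes Ĉ = [1 0] as a representative output map (the exact Ĉ comes from the realization in (2.10)); both choices are stated assumptions, not taken from the text.

```python
import numpy as np

# Largest singular value of the averaged modal state-transition matrix
# Phi_d^t = (1/d) sum_{k=t-d+1}^{t} C_hat @ A_hat^k from (4.6).
# A_hat is the companion form [[0, 1], [-a0, -a1]]; C_hat = [1, 0] is a
# representative output map assumed here for illustration.

def phi_sigma_max(a0, a1, t, d):
    A = np.array([[0.0, 1.0], [-a0, -a1]])
    C = np.array([[1.0, 0.0]])
    Phi = sum(C @ np.linalg.matrix_power(A, k)
              for k in range(t - d + 1, t + 1)) / d
    return np.linalg.svd(Phi, compute_uv=False)[0]
```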
Proposition 3. The singular values of the state transition matrix Φ̂_d^t(λ_i) defined in (4.6) are maximized at λ_i such that Â_i has repeated eigenvalues μ_1 = μ_2 = ρ.
The proof is given in Section 4.7.6 of the Appendix. We denote the modal state transition matrix associated with this block by Φ̂_d^t(λ_1). The singular values of this matrix are given by

    σ²(Φ̂_d^t(λ_1)) = (g_1(t))² + (g_2(t))²,

    g_1(t) := ρ^{t−d+1} (d(1−ρ) + (1−ρ^d)(ρt − ρ − t)) / (d(1−ρ)²),

    g_2(t) := −ρ^{t−d} (d(1−ρ) + (1−ρ^d)(t − ρt + 1)) / (d(1−ρ)²).    (4.7)
By taking the derivatives of g_1(t) and g_2(t) with respect to t, we determine that the two functions have maxima with respect to t given by ρ b(ρ,d) and ρ^{−1} b(ρ,d), respectively, where

    b(ρ,d) := ρ^{ζ(ρ,d)} (1−ρ^d) / (d(1−ρ)(−e ln ρ)),    ζ(ρ,d) := d ρ^d/(1−ρ^d) − 1/(1−ρ).    (4.8)
Thus, the maximum singular value of Φ_d^t over the domain t ≥ d must satisfy

    ρ b(ρ,d) ≤ max_{t≥d} σ_max(Φ_d^t) ≤ ρ^{−1} b(ρ,d),

    d/(1−ρ^d) − 1/(1−ρ) − 1/ln ρ ≤ argmax_{t≥d} σ_max(Φ_d^t) ≤ d/(1−ρ^d) − ρ/(1−ρ) − 1/ln ρ,    (4.9)

which bounds the magnitude of the worst-case transient growth and the iteration at which it occurs.
Here, the maximum given by the term b(ρ,d) simplifies to the existing result −ρ/(e ln ρ) given in [79] when d = 1, and is strictly decreasing in d. As d grows large, the exponent ζ approaches −1/(1−ρ) and 1−ρ^d approaches one, meaning b varies proportionally to 1/d for large d. On the other hand, the iteration t at which the peak magnitude occurs is monotonically increasing in d, approximately linearly.
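The monotone decrease and 1/d decay of b(ρ,d) are simple to confirm numerically by evaluating (4.8) directly:

```python
import math

# Direct evaluation of zeta(rho, d) and b(rho, d) from (4.8).

def zeta(rho, d):
    return d * rho ** d / (1 - rho ** d) - 1 / (1 - rho)

def b(rho, d):
    return (rho ** zeta(rho, d) * (1 - rho ** d)
            / (d * (1 - rho) * (-math.e * math.log(rho))))
```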
Remark 1. For the Nesterov and heavy-ball methods, we can use the values of ρ in terms of the condition number κ given in Table 2.2 to determine that both the magnitude of the worst-case transient peak and the iteration at which it occurs scale proportionally to √κ for any positive integer d.
For d = 1, the lower bound becomes redundant when ρ ≤ e^{−1}, in which case the expected error is strictly decreasing and the lower bound on the iteration of the peak is negative. For d ≥ 2, the lower bound is always positive, but the result only holds for t ≥ d.
While the maximum transient peak of E(z_d^t) over the regime t ≥ d continues to decrease as d increases, the transient cannot be reduced indefinitely. For early iterations with t < d, using the definition z_d^t = z_t^t implies that the maximum transient peak over all t ≥ 1 of z_d^t has a floor given by the maximum transient peak of z_t^t.
Lemma 7. For the two-step momentum algorithm (2.10) with linear convergence rate ρ, the maximum transient peak of z_t^t is less than or equal to that of z_d^t,

    max_t sup_{∥ψ^0∥₂=1} ∥E(z_d^t)∥₂ ≥ max_t sup_{∥ψ^0∥₂=1} ∥E(z_t^t)∥₂.
The proof is given in the Appendix. Thus it is clear that fixed-window averaging cannot produce superior transient behavior compared to averaging over the entire algorithmic history. Furthermore, the benefits of increasing the window length d plateau at d = t_max, where t_max indicates the iteration at which the peak of the transient of z_t^t occurs. The behavior of the transient of z_t^t is described below.
Lemma 8. For the two-step momentum algorithm (2.10) with linear convergence rate ρ, the maximum transient peak of z_t^t is lower bounded by

    max_t sup_{∥ψ^0∥₂=1} ∥E(z_t^t)∥₂ ≥ −ρ² (ρ(1−ρ) ln 4 + (4−2ρ+ρ²) ln ρ) / (8(1−ρ)² (2 ln ρ − ln(2ρ)))

and occurs at approximately

    argmax_t sup_{∥ψ^0∥₂=1} ∥E(z_t^t)∥₂ ≈ 2(ln ρ)^{−1} + 3√2 (1−ρ)(ln ρ)^{−2} ≥ −2 ln 2/ln ρ − 1.
Proof. Unfortunately, it is difficult to solve for the exact magnitude and iteration of the worst-case transient peak of z_t^t. Using the same approach shown in the proof above, we can determine

    σ²(Φ̂_t^t(λ_1)) = (h_1(t))² + (h_2(t))²,

    h_1(t) := (ρ²(1−ρ^t) + ρ(−ρ^t t + ρ^{t+1} t)) / (t(1−ρ)²),

    h_2(t) := (1 − ρ^t − ρ^t t + ρ^{t+1} t) / (t(1−ρ)²).    (4.10)
Setting the derivatives of h_1(t) and h_2(t) to zero yields the equations

    ρ^{−t} = 1 − (ln ρ) t − ((1−ρ)/ρ)(ln ρ) t²,    ρ^{−t} = 1 − (ln ρ) t + (ρ − ln ρ) t²,    (4.11)

neither of which can be solved exactly. Instead, we approximate the function ρ^{−t} with a fourth-order Taylor series approximation,

    ρ^{−t} ≈ 1 − (ln ρ) t + (1/2)(ln ρ)² t² − (1/6)(ln ρ)³ t³ + (1/24)(ln ρ)⁴ t⁴,
which can be used to find approximate solutions to the equations in (4.11), yielding

    argmax_t h_1(t) ≈ 2/ln ρ − 2√2 √ρ (ln ρ)³ / √(−3(ln ρ)³ + 3ρ(ln ρ)³ − ρ(ln ρ)⁴),

    argmax_t h_2(t) ≈ 2/ln ρ − 2√2 (ln ρ)³ / √(−3(ln ρ)³ + 3ρ(ln ρ)³ − (ln ρ)⁴),    (4.12)

which both admit the simplification

    argmax_t h_1(t) ≈ argmax_t h_2(t) ≈ 2/ln ρ + 3√2 (1−ρ)/(ln ρ)².    (4.13)
The approximation of t_max is greater than or equal to one only when ρ ≥ 0.425. Numerically, we determine that h_2(t), which satisfies h_2(1) = 1, is strictly decreasing for all t ≥ 1 when ρ ≲ 0.434, while h_1(t), which satisfies h_1(1) = 0, is increasing at t = 1 for all ρ ∈ [0, 1]. Substituting the approximation of t_max from (4.12) into h_1 and h_2 as defined in (4.10) does not yield a useful expression. Instead, we derive a lower-bounding approximation to the maxima of h_1 and h_2 as follows. It is straightforward to verify that the inequality (1−ρ^t)/(1−ρ) ≥ t ρ^{t/2} holds for all t ≥ 1 and ρ ∈ [0, 1] by noting that it is equivalent to ρ^t + t ρ^{t/2}(1−ρ) ≤ 1, which is satisfied at t = 1 for all ρ, with the left side decreasing in t. Thus, we bound both functions by

    h_1(t) ≥ (ρ²/(1−ρ)) ρ^{t/2} − ρ^{t+1}/(1−ρ),    h_2(t) ≥ (1/(1−ρ)) ρ^{t/2} − ρ^t/(1−ρ),
Figure 4.1: (a) argmax_t h(t); (b) max_t h(t), both plotted against ρ. On the left, we show numerically calculated values of the argmax of h_1(t) and h_2(t) for various values of ρ (dotted lines), alongside the estimate of t_max given in equation (4.12) (solid red line). On the right, we compare numerically calculated values of the maxima of h_1(t) and h_2(t) for varying values of ρ (dotted dark purple and light purple lines, respectively), and the estimates presented in equation (4.14) (solid blue lines).
and determine that these lower bounds are maximized at t = 4 − 2 ln(2ρ)/ln ρ and t = −2 ln 2/ln ρ, respectively. Using these values as an approximation of t_max results in the lower bounds

    h_1(t) ≥ −ρ² (ρ(1−ρ) ln 4 + (4−2ρ+ρ²) ln ρ) / (8(1−ρ)² (2 ln ρ − ln(2ρ))),    h_2(t) ≥ −((1−ρ) ln 4 + 3ρ ln ρ) / (4ρ(1−ρ)² ln(4ρ)),    (4.14)

and considering the fact that h_1(t) < h_2(t) for all t, we obtain the lower bound given in Lemma 8.
While we cannot provide an exact expression for the magnitude and iteration of the maximum transient growth, Figure 4.1 compares the approximations given in Lemma 8 to the magnitude and iteration of the maxima of h_1 and h_2 computed numerically for a range of ρ values. It is apparent from the figure that the estimates are close to the true numerical values. Together, Lemmas 7 and 8 demonstrate the limits of the benefits of averaging regarding the transient behavior of the expected optimization error.
4.4 Variance amplification
In this section, we examine how the variance of the optimization variable is affected by averaging. Just as in equation (2.20), we can write the variance of the averaged output, V_d^t, as the sum of the modal contributions to variance associated with the eigenvalues λ_i of the Hessian Q,

    V_d^t = ∑_{i=1}^{n} V̂_d^t(λ_i),    V̂_d^t(λ_i) := E[∥ẑ_d^t(λ_i)∥²] − ∥E[ẑ_d^t(λ_i)]∥².
As seen in Section 2.1, the dynamics of the two-dimensional linear system associated with λ_i are given by

    ψ̂_i^{t+1} = Â_i ψ̂_i^t + B̂ ŵ_i^t,    Â_i = [0 1; −a_0(λ_i) −a_1(λ_i)],    (4.15a)

    a_0(λ_i) := −β_0 + αγ_0 λ_i,    a_1(λ_i) := −β_1 + αγ_1 λ_i,    (4.15b)

with the eigenvalues of Â_i given by

    μ_1(λ_i) = (1/2)(−a_1(λ_i) − √(a_1(λ_i)² − 4a_0(λ_i))),    μ_2(λ_i) = (1/2)(−a_1(λ_i) + √(a_1(λ_i)² − 4a_0(λ_i))),

    a_0(λ_i) = μ_1(λ_i) μ_2(λ_i),    a_1(λ_i) = −μ_1(λ_i) − μ_2(λ_i).    (4.15c)

The sub-system converges linearly with rate less than or equal to ρ if both eigenvalues satisfy |μ| ≤ ρ. As seen in [35], this results in the geometric convergence region ∆ρ for a_0 and a_1 shown in Figure 4.2, where

    ∆ρ := {(a_0, a_1) : a_0 ∈ [−ρ², ρ²], a_1 ∈ [−ρ^{−1}(a_0 + ρ²), ρ^{−1}(a_0 + ρ²)]}.    (4.16)
Figure 4.2: Stability region, with vertices X = (−2, 1), Y = (2, 1), and Z = (0, −1) in the (a_1, a_0) plane, and ρ-convergence region ∆ρ, with vertices X_ρ = (−2ρ, ρ²), Y_ρ = (2ρ, ρ²), and Z_ρ = (0, −ρ²), of the two-step accelerated algorithm. As introduced in [35], for the linear sub-system associated with eigenvalue λ_i of Q defined in (2.10), we see the set of a_0(λ_i) and a_1(λ_i) for which the system is stable (blue) and the set for which it converges with rate ρ (yellow). The ρ-convergence region ∆ρ is defined by a_0 ∈ [−ρ², ρ²] and a_1 ∈ [−ρ^{−1}(a_0 + ρ²), ρ^{−1}(a_0 + ρ²)].
In this section, we gain insight into the variance V_d^t by considering how the modal contributions to variance V̂_d^t(λ_i) vary over the ρ-convergence region ∆ρ. We can consider the variance term V̂_d^t a function of both the eigenvalues μ_1, μ_2 and the coordinates a_0, a_1, keeping in mind that for a given parameter set {α, β_0, β_1, γ_0, γ_1} there is a one-to-one mapping between an eigenvalue λ_i of the Hessian, the coordinates (a_0, a_1), and the eigenvalues (μ_1, μ_2).
Lemma 9. Let the two-dimensional system given in (2.10) converge linearly with rate ρ. Then the variance at iteration t of the averaged output over a window of fixed integer length d, ẑ_d^t(λ_i), is given by

    V̂_d^t(λ_i) = (1/d²) Ĉ ( ∑_{k=t−d+1}^{t} ( P̂_i^k + ∑_{j=1}^{t−k} ( Â_i^j P̂_i^k + P̂_i^k (Â_i^j)^T ) ) ) Ĉ^T,    (4.17)

where P̂_i^k solves the time-dependent algebraic Lyapunov equation and satisfies P̂_i^{k+1} = Â_i P̂_i^k (Â_i)^T + B̂ B̂^T.
The proof is given in Section 4.7.2 of the Appendix. Using Lemma 9, we present the following expression for the steady-state variance associated with λ_i in terms of the eigenvalues μ_1 and μ_2, where we have dropped the dependence of the eigenvalues on λ_i for ease of notation.
Lemma 10. Let the two-step momentum algorithm (2.2) with the strongly convex quadratic objective function f converge linearly with rate ρ. Then V̂_d^∞(λ_i), the modal contribution to the steady-state variance of the averaged output over fixed window length d ≥ 1 associated with eigenvalue λ_i of the Hessian Q, is given by

    V̂_d^∞(μ_1, μ_2) = 1/(d(1−μ_1)²(1−μ_2)²) + 2(μ_1^d − 1)μ_1² / (d²(μ_1−μ_2)(1−μ_1)³(1+μ_1)(1−μ_1μ_2)) − 2(μ_2^d − 1)μ_2² / (d²(μ_1−μ_2)(1−μ_2)³(1+μ_2)(1−μ_1μ_2)),    (4.18)

where μ_1 and μ_2 depend on λ_i according to (4.15c).
Based on Lemma 10, we can see that as d grows large the first term dominates, and the variance
at any given λi decreases at a rate of approximately 1/d.
Next we will identify and bound the smallest and largest contributions to variance over the
ρ-convergence region.
Lemma 11. The function V̂_d^∞(μ_1(a_0,a_1), μ_2(a_0,a_1)) defined in Lemma 10, on the domain (a_0,a_1) ∈ ∆ρ defined in (4.16), is maximized at a_0 = ρ² and a_1 = −2ρ, and satisfies

    max_{a_0,a_1} V̂_d^∞(μ_1(a_0,a_1), μ_2(a_0,a_1)) ≤ 1/(d(1−ρ)⁴),

    min_{a_0,a_1} V̂_d^∞(μ_1(a_0,a_1), μ_2(a_0,a_1)) ≥ 1/(d(1+ρ)⁴).
The proof of Lemma 11 is given in Section 4.7.4 of the Appendix. Based on the lemma, Theorem 8 is immediate by noting that V_d^∞ = ∑_{i=1}^{n} V̂_d^∞(λ_i) must satisfy

    n min_{λ_i} V̂_d^∞(λ_i) ≤ ∑_{i=1}^{n} V̂_d^∞(λ_i) ≤ n max_{λ_i} V̂_d^∞(λ_i).
Theorem 8 indicates that although averaging may affect the steady-state modal contribution to variance V̂_d^∞(λ_i) differently for different values of λ_i, overall the total steady-state variance V_d^∞ decreases at a rate of approximately 1/d as the length d of the averaging window increases, which is supported by the expression for V̂_d^∞ given in Lemma 10.
While it is clear that averaging the output over a window of length d does reduce the steady-state variance, this reduction does not overcome the fundamental relationship between noise amplification and the condition number of the problem, as shown in Theorems 9 and 10. In order to derive the result given in Theorem 9, we first introduce an alternate lower bound on V̂_d^∞(λ) which is strictly decreasing in a_1.
Lemma 12. The function V̂_d^∞(μ_1(a_0,a_1), μ_2(a_0,a_1)) defined in Lemma 10, on the domain (a_0,a_1) ∈ ∆ρ defined in (4.16), satisfies

    V̂_d^∞(μ_1(a_0,a_1), μ_2(a_0,a_1)) ≥ (1−ρ) / (2d(1+a_0+a_1)(1−2ρ + (1+a_0+a_1)ρ − ρ²)).
The proof is given in Section 4.7.5 of the Appendix. We can then use the fact that β_0+β_1 = γ_0+γ_1 = 1 to establish

    a_0(λ) + a_1(λ) = −β_0 + αγ_0λ − β_1 + αγ_1λ = −1 + αλ

and express κ as

    κ := L/m = αL/(αm) = (1+a_0(L)+a_1(L)) / (1+a_0(m)+a_1(m)),

then rearrange terms to write

    1 + a_0(m) + a_1(m) = (1/κ)(1+a_0(L)+a_1(L)) ≤ (1+ρ)²/κ.    (4.19)

The last inequality is obtained by setting a_0 and a_1 as large as possible within the boundaries of the convergence region ∆ρ defined in (4.16), which requires a_0 ≤ ρ² and a_1 ≤ ρ^{−1}(a_0+ρ²).
Figure 4.3: For a given ρ-convergence region ∆ρ with vertices X_ρ, Y_ρ, and Z_ρ, dashed lines show the line segments (a_0(λ), a_1(λ)) for λ ∈ [m, L] for the subset of two-step momentum algorithms with parameters given in (4.20). The hyperparameter c gives the normalized distance of the (a_0(λ), a_1(λ)) line from the a_1 axis. The heavy-ball parameters with c = 1 lie along the X_ρY_ρ edge and are shown in blue, while the gradient descent parameters with c = 0 lie along the a_1 axis and are shown in green.
Combining equation (4.19) with Lemma 12 lets us write

    V̂_d^∞(m)/(1−ρ) ≥ κ²/(2d(1+ρ)⁴(1−ρ)) ≥ κ²/(2d(1+ρ)⁵) > κ²/(64 d),

where the final lower bound is obtained by noting that 1/(1+ρ)⁵ is decreasing in ρ and ρ < 1. Thus, for any fixed d ≥ 1, the product of the steady-state variance of the averaged output and the settling time maintains its scaling with κ².
In order to establish an upper bound on the product of variance and settling time, we restrict consideration to a subset of two-step momentum algorithms which generalizes the gradient descent and heavy-ball algorithms, parameterized by c and shown in Figure 4.3. Consider the parameter set Θ̃ defined in terms of the hyperparameter c,

    Θ̃ := { α = (1+ρ)(1+cρ)/L, β = cρ², γ = 0 },    c ∈ [0, 1],    (4.20)

where c = 1 recovers the heavy-ball method and c = 0 recovers gradient descent, as shown in Figure 4.3. We can determine an upper bound on the product of the steady-state variance of the finite-window average z_d^t and the settling time that scales with the square of the condition number.
Theorem 10. Let the two-step momentum algorithm with parameters belonging to the class Θ̃ with the strongly convex quadratic objective function f converge linearly with rate ρ. Then the product of the steady-state variance of the averaged output over a fixed window of length d ≥ 1, V_d^∞, and the settling time T_s = 1/(1−ρ) is bounded by

    V_d^∞ × T_s ≤ (κ+1)³/(8κ).
Proof. For this class of algorithms, we know that (a_0(m), a_1(m)) and (a_0(L), a_1(L)) lie on the X_ρZ_ρ and Z_ρY_ρ edges of the stability region, respectively, with a_0(m) = a_0(L) = cρ². Thus, we express each of ρ, κ, and c as functions of the others,

    κ = (1+ρ)(1+cρ) / ((1−ρ)(1−cρ)),    c = (κ − 1 − ρ − κρ) / (ρ(κ + 1 + ρ − κρ)),

    ρ = ((1+κ)(1+c) − √((1+κ)²(1+c)² − 4(κ−1)² c)) / (2(κ−1) c).    (4.21)
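The three relations in (4.21) can be sanity-checked by round-tripping: fix (ρ, c), compute κ, and recover c and ρ back (the numerical values below are arbitrary test points):

```python
import math

# Round-trip check of the parameter relations (4.21) for the class
# Theta-tilde: kappa from (rho, c), then c and rho recovered.

def kappa_of(rho, c):
    return (1 + rho) * (1 + c * rho) / ((1 - rho) * (1 - c * rho))

def c_of(rho, kappa):
    return (kappa - 1 - rho - kappa * rho) / (rho * (kappa + 1 + rho - kappa * rho))

def rho_of(kappa, c):
    disc = (1 + kappa) ** 2 * (1 + c) ** 2 - 4 * (kappa - 1) ** 2 * c
    return ((1 + kappa) * (1 + c) - math.sqrt(disc)) / (2 * (kappa - 1) * c)
```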
Based on Proposition 4, we know that for this algorithm class, V̂_d^∞(λ) is maximized at λ = m, where a_1 = −ρ^{−1}(a_0+ρ²), and μ_2 = ρ, μ_1 = a_0/ρ = cρ. In this case,

    V̂_d^∞(m) = 1/(d(1−ρ)²(1−cρ)²) − 2ρ(1−ρ^d) / (d²(1−c)(1−ρ)³(1+ρ)(1−cρ²)) + 2c²ρ(1−(cρ)^d) / (d²(1−c)(1−cρ)³(1+cρ)(1−cρ²)).    (4.22)
(4.22)
Combining (4.22) with the expression for c given in (4.21) yields an expression for Vˆ ∞
d
(m) in terms
of κ and ρ. The proof of Proposition 4 also tells us that the resulting expression of Vˆ ∞
d
(m)/(1−ρ)
is increasing in ρ. It is also clear that ρHB ≤ ρ ≤ ρGD for the algorithm class under consideration, where ρHB and ρGD are the optimal convergence rates for the heavy-ball and gradient decent
algorithms given in Table 2.2. Together this implies that Vˆ ∞
d
(m)/(1 − ρ) is upper bounded by
Vˆ ∞
d
(m)/(1−ρGD), which results in
Vˆ ∞
d
(m)
1−ρ
≤
(κ +1)
3
16d
2 κ
2d κ + (κ
2 −1)
κ −1
κ +1
d
−1
!!. (4.23)
64
Given that

    lim_{κ→∞} (2dκ + (κ²−1)(((κ−1)/(κ+1))^d − 1)) = 2d²,

and that the term inside the limit is monotonically increasing in κ, the upper bound can be simplified to (κ+1)³/(8κ).
Ultimately, we conclude that while averaging may reduce the steady-state variance, that reduction is not sufficient to overcome the fundamental trade-off between variance amplification and convergence speed, which scales with the square of the condition number.

While the variance V_d^t grows monotonically in time, the variance V_t^t exhibits transient growth before converging asymptotically to zero at the sub-linear rate 1/t, as stated in Theorem 7, the proof of which is provided below.
Proof of Theorem 7. Based on Lemma 10 and allowing d → t, we can express the modal contribution to variance V̂_t^t(λ_i) at time t as a function of the convergence rate ρ, the time t, and the eigenvalues μ_1 and μ_2, which depend on λ_i according to (4.30). At time t, the modal contribution to variance associated with λ_i is given by

    V̂_t^t(λ_i) = (1/t) ( (1/t) ∑_{k=1}^{t} Ĉ P̂_i^k Ĉ^T + (1/t) Ĉ ( ∑_{k=1}^{t} ∑_{j=1}^{t−k} ( Â_i^j P̂_i^k + P̂_i^k (Â_i^j)^T ) ) Ĉ^T ).    (4.24)
As k grows large, both P̂_i^k and P̂_i^k ∑_{j=1}^{t−k} (Â_i^j)^T approach steady-state values, and thus their averages do as well, with

    lim_{t→∞} (1/t) ∑_{k=1}^{t} Ĉ P̂_i^k Ĉ^T = (1 + μ_1μ_2) / ((1−μ_1²)(1−μ_2²)(1−μ_1μ_2)),

    lim_{t→∞} (1/t) ∑_{k=1}^{t} ∑_{j=1}^{t−k} Ĉ ( Â_i^j P̂_i^k + P̂_i^k (Â_i^j)^T ) Ĉ^T = 2 (μ_1 + μ_2 − μ_1μ_2 − μ_1²μ_2²) / ((1−μ_1)²(1+μ_1)(1−μ_2)²(1+μ_2)(1−μ_1μ_2)),    (4.25)
which together leads to the limiting behavior of the variance,

    lim_{t→∞} t V̂_t^t(λ_i) = 1 / ((1−μ_1)²(1−μ_2)²).

Given that the eigenvalues μ_1 and μ_2 are both bounded in magnitude by ρ, we have

    1/(t(1+ρ)⁴) ≤ 1/(t(1−μ_1)²(1−μ_2)²) ≤ 1/(t(1−ρ)⁴).
4.5 An example

In order to illustrate the benefits and properties of averaging, we provide an example of averaging applied to the heavy-ball algorithm with parameters that optimize the convergence rate, as given in Table 2.2. We consider the objective function f(x) = x^T Q x, where the Hessian Q is a diagonal matrix of order n = 10,

    Q = [ L I_{n−s}  0 ; 0  m I_s ],

with s eigenvalues at λ = m and (n−s) eigenvalues at λ = L. For the optimal choice of parameters, the (a_0, a_1) coordinates associated with these eigenvalues lie at the X_ρ and Y_ρ corners of the region ∆ρ defined in (4.16), respectively. We set κ = 10000, which results in ρ ≈ 0.98, and choose the initial conditions ψ̂_i^0 = [1 0]^T.
Figure 4.4 shows the effect of different averaging schemes on the expected error of z_d^t for different distributions of eigenvalues. As expected based on Theorem 6 and Lemma 7, we can see that the transient peak decreases as the averaging period d increases, with the transient peak being
Figure 4.4: Transient response ∥E(z_d^t)∥ of the heavy-ball algorithm with optimal parameters and ρ = 0.98, in the case of no averaging (d = 1), averaging over a moving window of fixed integer length (d = 10 and d = 30), and averaging over the entire algorithmic history (d = t). We consider three separate cases where the Hessian Q has s eigenvalues at the X_ρ corner of ∆ρ and n−s eigenvalues at the Y_ρ corner of ∆ρ: (a) s = 9, n−s = 1; (b) s = 5, n−s = 5; (c) s = 1, n−s = 9.
smallest when averaging is performed over the entire history. On the other hand, Figure 4.4 also clearly shows the cost of averaging over the entire history, demonstrating significantly slower convergence.

In Figure 4.5 we see the effect of averaging on the variance as a function of the iteration t. In accordance with Theorem 8, we observe that increasing the averaging period d decreases the steady-state variance. We also observe that while the variance of the average over the entire algorithmic history, V_t^t, decreases approximately linearly for large t, in the non-asymptotic regime the variance is increasing in t. We conjecture that for any fixed window length d and iteration t ≥ d, the variance satisfies V_t^t ≤ V_d^t, meaning that even in the non-asymptotic regime, averaging over the entire history always yields the smallest variance.
Both figures examine the effect of the distribution of the eigenvalues λ of the Hessian Q. In general, the state is updated according to x̂_i^{t+2} = −a_0(λ_i) x̂_i^t − a_1(λ_i) x̂_i^{t+1}, and can be written in terms of the eigenvalues as

    x̂^t = μ_1μ_2 ((μ_1^{t−1} − μ_2^{t−1})/(μ_2 − μ_1)) x̂^0 + ((μ_2^t − μ_1^t)/(μ_2 − μ_1)) x̂^1.    (4.26)
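The closed form (4.26) follows from the characteristic roots μ₁, μ₂ of the modal recursion and can be checked against the recursion directly; the specific roots and initial values below are arbitrary test points.

```python
# Check the closed form (4.26) against the modal recursion
# x^{t+2} = -a0 * x^t - a1 * x^{t+1}, where a0 = mu1*mu2 and
# a1 = -(mu1 + mu2) as in (4.15c).

def modal_recursion(mu1, mu2, x0, x1, T):
    a0, a1 = mu1 * mu2, -(mu1 + mu2)
    xs = [x0, x1]
    for _ in range(T - 1):
        xs.append(-a0 * xs[-2] - a1 * xs[-1])
    return xs

def modal_closed_form(mu1, mu2, x0, x1, t):
    num1 = mu1 * mu2 * (mu1 ** (t - 1) - mu2 ** (t - 1))
    num2 = mu2 ** t - mu1 ** t
    return (num1 * x0 + num2 * x1) / (mu2 - mu1)
```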
Figure 4.5: Variance V_d^t of the averaged output at time t of the heavy-ball algorithm with optimal parameters, where the Hessian Q has s eigenvalues at the X_ρ corner of ∆ρ and n−s eigenvalues at the Y_ρ corner of ∆ρ, with ρ = 0.98: (a) s = 9, n−s = 1; (b) s = 5, n−s = 5; (c) s = 1, n−s = 9.
By definition, we let μ_2 ≥ μ_1 without loss of generality. When a_1 ≥ 0, μ_1 and μ_2 are either complex conjugates with negative real part, or both negative real values with |μ_2| ≤ |μ_1|. As a result, the sub-system state x̂_i^t alternates sign with every iteration t when a_1 > 0, in which case any amount of averaging greatly reduces both the expected error and the variance of z_d^t. For our choice of initial conditions, x̂_i^t alternates sign for a_1 > 0. In Figure 4.4 we can see that when Q has more eigenvalues at L rather than at m, the effect of averaging is more pronounced. Of course, the variance results are independent of the initial conditions. As is seen in Table 4.1 in the proof of Proposition 6, the individual covariance term lim_{t→∞} E[x̂^t x̂^{t+j}] alternates sign depending on the parity of j when a_1 ≥ 0. Figure 4.5 shows how this affects the variance V̂_d^∞, which is the sum of these individual covariance terms. In fact, at λ_i = L, the modal variance term V̂_d^∞(L) scales with κ as opposed to κ² as long as d is even, with lim_{κ→∞} V̂_d^∞(L)/(κ(1−ρ)) = 1/64.
Figure 4.6 highlights the dependence of the steady-state variance V_d^∞ on the window length d, for various distributions of the eigenvalues of Q. The figure on the left demonstrates the result given in Theorem 8, which states that for large d, the variance V_d^∞ varies approximately linearly with 1/d.
Figure 4.6: Steady-state variance V_d^∞ as a function of the number of history terms d. The figures show how the steady-state variance V_d^∞ of the output averaged over a window of fixed integer length d decreases as the window length d increases, for systems where the Hessian Q has s eigenvalues at the X_ρ corner of ∆ρ and n−s eigenvalues at the Y_ρ corner of ∆ρ, with ρ = 0.98. On the left we see the overall trend in d; the figure on the right focuses on the behavior when d is small. Similarly to Figure 4.5, it is apparent that the variance is more greatly reduced when Q has more eigenvalues at λ = L.
The figure on the right demonstrates how, for small d, the steady-state variance is smaller when d is even, which is due to the alternating sign of x̂_i^t at λ_i = L mentioned above.
4.6 Concluding remarks
We have presented the effect of averaging over algorithmic iterates on the class of two-step momentum algorithms applied to strongly convex quadratic problems, examining the impact on the convergence rate, the worst-case transient growth of the expected error, and the steady-state variance. We have shown that averaging over the entire algorithmic history reduces the rate of convergence to sub-linear; however, the variance of the averaged output also decays to zero at a sub-linear rate. In contrast, averaging algorithmic iterates over a moving window of fixed integer length d preserves the linear rate of convergence, while increasing the expected error at a given iteration t by a factor independent of t. In addition, the worst-case transient growth in the expected error over the set of normalized initial conditions is reduced by a factor of approximately 1/d, though the iteration past which the expected error is strictly decreasing is increased by averaging. The steady-state variance is also reduced by a factor of approximately 1/d, while the product of the steady-state variance and the settling time maintains its scaling with the square of the problem condition number κ, as is the case for the variance of the non-averaged output. In an example of windowed averaging of the heavy-ball algorithm for a quadratic problem, we observe that the effectiveness of averaging for variance reduction depends on the distribution of the eigenvalues of the Hessian of the objective function. Ultimately, while averaging can improve the steady-state variance due to gradient uncertainty, the fundamental trade-off between convergence rate and noise amplification cannot be avoided.
4.7 Proofs
4.7.1 Proof of Lemma 7
Proof. Suppose the maximum transient peak of the infinite-horizon averaged output $z_t^t$ occurs at time $t = t_1$. In the case of fixed $d > t_1$, the lemma holds by definition, as $z_d^t = z_t^t$ for $t < d$. In the case $d < t_1$, it suffices to show that at time $t_1$ the windowed average $z_d^{t_1}$ must be larger than the infinite-horizon average $z_t^{t_1}$. We note that $z_d^{t_1}$ can be written as
$$z_d^{t_1} \;=\; \frac{1}{d} \sum_{k=t_1+1-d}^{t_1} x^k \;=\; \frac{1}{d} \left( \sum_{k=1}^{t_1} x^k \;-\; \sum_{k=1}^{t_1-d} x^k \right) \;=\; \frac{1}{d} \left( t_1 \cdot \frac{1}{t_1} \sum_{k=1}^{t_1} x^k \;-\; (t_1-d)\,\frac{1}{t_1-d} \sum_{k=1}^{t_1-d} x^k \right) \;=\; \frac{t_1}{d}\, z_t^{t_1} \;-\; \frac{t_1-d}{d}\, z_t^{t_1-d},$$
and thus $z_d^{t_1}$ is greater than $z_t^{t_1}$ if $z_d^{t_1} - z_t^{t_1} > 0$, which is equivalent to
$$\frac{t_1}{d}\, z_t^{t_1} \;-\; \frac{t_1-d}{d}\, z_t^{t_1-d} \;-\; z_t^{t_1} \;=\; \left( \frac{t_1}{d} - 1 \right) z_t^{t_1} \;-\; \frac{t_1-d}{d}\, z_t^{t_1-d} \;>\; 0.$$
As $z_t^t$ is maximized at $t_1$, it must be true that $z_t^{t_1} > z_t^{t_2}$ for any $t_2 < t_1$, and thus the inequality is proven.
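The running-average identity at the heart of this proof is easy to spot-check numerically. The sketch below uses arbitrary random test data (an assumption for illustration, not data from the dissertation) and verifies that the windowed average equals the stated combination of running averages:

```python
# Numeric check of the identity used in the proof of Lemma 7:
# z_d^{t1} = (t1/d) * zbar^{t1} - ((t1 - d)/d) * zbar^{t1 - d},
# where zbar^t is the running (infinite-horizon) average of x^1, ..., x^t.
# The sequence x is arbitrary test data.
import random

random.seed(0)
t1, d = 20, 7
x = [random.gauss(0.0, 1.0) for _ in range(t1)]  # x[k - 1] stores x^k

def running_avg(t):
    """Infinite-horizon average zbar^t = (1/t) * sum_{k=1}^{t} x^k."""
    return sum(x[:t]) / t

# Windowed average over the last d iterates: (1/d) * sum_{k=t1+1-d}^{t1} x^k
z_window = sum(x[t1 - d:t1]) / d

# Right-hand side of the identity derived in the proof
z_identity = (t1 / d) * running_avg(t1) - ((t1 - d) / d) * running_avg(t1 - d)

assert abs(z_window - z_identity) < 1e-12
```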
4.7.2 Proof of Lemma 9
Proof. The variance of the average $\hat z_d^t(\lambda_i)$ is given by
$$\mathrm{Var}[\hat z_d^t(\lambda_i)] \;=\; \mathbb{E}\!\left[ \left( \hat z_d^t(\lambda_i) - \mathbb{E}[\hat z_d^t(\lambda_i)] \right)^2 \right] \;=\; \frac{1}{d^2}\, \mathbb{E}\!\left[ \left( \sum_{k=t-d+1}^{t} \left( \hat x_i^k - \mathbb{E}[\hat x_i^k] \right) \right)^{\!2} \right]$$
where the last equality is true by linearity of expectation. Rearranging terms gives
$$\hat V_d^t(\lambda_i) \;=\; \frac{1}{d^2}\, \hat C\, \mathbb{E}\!\left[ \left( \sum_{k=t-d+1}^{t} \left( \hat\psi_i^k - \mathbb{E}[\hat\psi_i^k] \right) \right) \left( \sum_{j=t-d+1}^{t} \left( \hat\psi_i^j - \mathbb{E}[\hat\psi_i^j] \right) \right)^{\!T} \right] \hat C^T$$
$$=\; \frac{1}{d^2}\, \hat C \sum_{k=t-d+1}^{t} \mathbb{E}\!\left[ \left( \hat\psi_i^k - \mathbb{E}[\hat\psi_i^k] \right) \left( \hat\psi_i^k - \mathbb{E}[\hat\psi_i^k] \right)^{T} \right] \hat C^T$$
$$+\; \frac{1}{d^2}\, \hat C \sum_{k=t-d+1}^{t} \left[ \sum_{j=1}^{t-k} \mathbb{E}\!\left[ \left( \hat\psi_i^k - \mathbb{E}[\hat\psi_i^k] \right) \left( \hat\psi_i^{k+j} - \mathbb{E}[\hat\psi_i^{k+j}] \right)^{T} \right] \;+\; \sum_{j=1}^{t-k} \mathbb{E}\!\left[ \left( \hat\psi_i^{k+j} - \mathbb{E}[\hat\psi_i^{k+j}] \right) \left( \hat\psi_i^k - \mathbb{E}[\hat\psi_i^k] \right)^{T} \right] \right] \hat C^T \tag{4.27}$$
In general, the covariance between states $\hat\psi_i^k$ and $\hat\psi_i^{k+j}$ is given by
$$\mathbb{E}\!\left[ \left( \hat\psi_i^k - \mathbb{E}[\hat\psi_i^k] \right) \left( \hat\psi_i^{k+j} - \mathbb{E}[\hat\psi_i^{k+j}] \right)^{T} \right] \;=\; \mathbb{E}\!\left[ \left( \hat A_i^k \hat\psi_i^0 + \sum_{n=0}^{k-1} \hat A_i^n \hat B \hat w_i^n - \hat A_i^k \hat\psi_i^0 \right) \left( \hat A_i^{k+j} \hat\psi_i^0 + \sum_{m=0}^{k+j-1} \hat A_i^m \hat B \hat w_i^m - \hat A_i^{k+j} \hat\psi_i^0 \right)^{\!T} \right]$$
$$=\; \mathbb{E}\!\left[ \left( \sum_{n=0}^{k-1} \hat A_i^n \hat B \hat w_i^n \right) \left( \sum_{m=0}^{k+j-1} \hat A_i^m \hat B \hat w_i^m \right)^{\!T} \right] \;=\; \left( \sum_{n=0}^{k-1} \hat A_i^n \hat B\, \mathbb{E}[(\hat w_i^n)^2]\, \hat B^T (\hat A_i^n)^T \right) (\hat A_i^j)^T \;=\; \hat P_i^k\, (\hat A_i^j)^T \tag{4.28}$$
where we have used the fact that $\mathbb{E}[\hat w_i^n \hat w_i^m] = 0$ for $n \neq m$ and $\mathbb{E}[(\hat w_i^n)^2] = \sigma_w = 1$, and noting that the recursive definition of $\hat P_i^k$ with $\hat P_i^0 = \mathrm{Var}[\hat\psi_i^0] = 0$ leads to
$$\hat P_i^k \;=\; \hat A_i \hat P_i^{k-1} \hat A_i^T \;+\; \hat B \hat B^T \;=\; \sum_{n=0}^{k-1} \hat A_i^n \hat B \hat B^T (\hat A_i^n)^T. \tag{4.29}$$
Thus by substituting the final expression in (4.28) into (4.27) we obtain the expression given in
Lemma 9.
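The equivalence between the recursive and summation forms of $\hat P_i^k$ in (4.29) can be spot-checked numerically. In the minimal sketch below, the companion matrix and its coefficients $a_0$, $a_1$ are illustrative test values, not parameters from the text:

```python
# Check that the recursion P^k = A P^{k-1} A^T + B B^T (with P^0 = 0) agrees with
# the summation form P^k = sum_{n=0}^{k-1} A^n B B^T (A^n)^T for a 2x2 companion matrix.
def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def mat_add(X, Y):
    return [[X[i][j] + Y[i][j] for j in range(2)] for i in range(2)]

def transpose(X):
    return [[X[j][i] for j in range(2)] for i in range(2)]

a0, a1 = 0.18, -0.9             # test coefficients of a stable two-step recursion
A = [[0.0, 1.0], [-a0, -a1]]    # companion-form A_hat
BBt = [[0.0, 0.0], [0.0, 1.0]]  # B_hat B_hat^T with B_hat = [0, 1]^T
k = 15

# Recursive form
P_rec = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(k):
    P_rec = mat_add(mat_mul(mat_mul(A, P_rec), transpose(A)), BBt)

# Summation form: accumulate A^n B B^T (A^n)^T for n = 0, ..., k-1
P_sum = [[0.0, 0.0], [0.0, 0.0]]
An = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(k):
    P_sum = mat_add(P_sum, mat_mul(mat_mul(An, BBt), transpose(An)))
    An = mat_mul(A, An)

assert all(abs(P_rec[i][j] - P_sum[i][j]) < 1e-12 for i in range(2) for j in range(2))
```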
4.7.3 Proof of Lemma 10
Proof. Lemma 9 gives an expression for $\hat V_d^t(\lambda_i)$ in terms of the matrices $\hat A_i$ and $\hat P_i^k$. It remains to determine expressions for these matrices in terms of the eigenvalues $\mu_1$ and $\mu_2$ of $\hat A_i$. We use Lemma 1 of Chapter 2 to express $\hat A_i^t$ in terms of the eigenvalues $\mu_1$ and $\mu_2$, yielding
$$\hat A_i^t \;=\; \frac{1}{\mu_2 - \mu_1} \begin{bmatrix} \mu_1\mu_2\,(\mu_1^{t-1} - \mu_2^{t-1}) & \mu_2^{t} - \mu_1^{t} \\ \mu_1\mu_2\,(\mu_1^{t} - \mu_2^{t}) & \mu_2^{t+1} - \mu_1^{t+1} \end{bmatrix} \tag{4.30}$$
and use Theorem 1 of [58] to write $\hat P_i^t$ in terms of the eigenvalues $\mu_1$, $\mu_2$ as
$$\hat P_i^\infty \;=\; \frac{1}{p_i} \begin{bmatrix} 1 + \mu_1\mu_2 & \mu_1 + \mu_2 \\ \mu_1 + \mu_2 & 1 + \mu_1\mu_2 \end{bmatrix}, \qquad p_i \;=\; (1 - \mu_1\mu_2)(1 - \mu_1 - \mu_2 + \mu_1\mu_2)(1 + \mu_1 + \mu_2 + \mu_1\mu_2) \tag{4.31}$$
where the steady-state covariance matrix $\hat P_i^\infty := \lim_{t\to\infty} \hat P_i^t$ satisfies
$$\hat P_i^t \;=\; \hat P_i^\infty \;-\; \hat A_i^t\, \hat P_i^\infty\, (\hat A_i^t)^T, \qquad \hat P_i^\infty \;=\; \hat A_i\, \hat P_i^\infty\, \hat A_i^T \;+\; \hat B \hat B^T. \tag{4.32}$$
We substitute equations (4.30) and (4.31) into (4.32) to determine $\hat P_i^t$ as a function of the eigenvalues $\mu_1$, $\mu_2$ and time $t$. In combination with equation (4.17), we can thus express $\hat V_d^t(\lambda_i)$ as a function of the window length $d$, time $t$, and the eigenvalues $\mu_1$ and $\mu_2$, which are in turn functions of the eigenvalues $\lambda_i$ of the Hessian $Q$. Expressions for $\hat P_i^t$ and $\hat V_d^t(\lambda_i)$ are omitted due to length. Taking the limit as $t \to \infty$ yields the expression for the steady-state variance $\hat V_d^\infty(\lambda_i)$ given in Theorem 10.
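As a numeric sanity check of (4.31) and (4.32) (with assumed test eigenvalues; this sketch is not part of the proof), the closed-form steady-state covariance can be substituted back into the Lyapunov relations:

```python
# Verify that the closed-form P_inf of (4.31) is the fixed point of the Lyapunov
# recursion in (4.32), and that P^t = P_inf - A^t P_inf (A^t)^T matches the
# finite-time covariance built up from the recursion. mu1, mu2 are test values.
def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def mat_add(X, Y):
    return [[X[i][j] + Y[i][j] for j in range(2)] for i in range(2)]

def mat_sub(X, Y):
    return [[X[i][j] - Y[i][j] for j in range(2)] for i in range(2)]

def transpose(X):
    return [[X[j][i] for j in range(2)] for i in range(2)]

mu1, mu2 = 0.3, 0.6                   # assumed test eigenvalues of A_hat
a0, a1 = mu1 * mu2, -(mu1 + mu2)      # companion-form coefficients
A = [[0.0, 1.0], [-a0, -a1]]
BBt = [[0.0, 0.0], [0.0, 1.0]]

p = (1 - mu1 * mu2) * (1 - mu1 - mu2 + mu1 * mu2) * (1 + mu1 + mu2 + mu1 * mu2)
P_inf = [[(1 + mu1 * mu2) / p, (mu1 + mu2) / p],
         [(mu1 + mu2) / p, (1 + mu1 * mu2) / p]]

# (4.32): P_inf = A P_inf A^T + B B^T
lhs = mat_add(mat_mul(mat_mul(A, P_inf), transpose(A)), BBt)
assert all(abs(lhs[i][j] - P_inf[i][j]) < 1e-12 for i in range(2) for j in range(2))

# Finite-time covariance vs P_inf - A^t P_inf (A^t)^T
t = 6
P_t = [[0.0, 0.0], [0.0, 0.0]]
At = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(t):
    P_t = mat_add(mat_mul(mat_mul(A, P_t), transpose(A)), BBt)
    At = mat_mul(A, At)
pred = mat_sub(P_inf, mat_mul(mat_mul(At, P_inf), transpose(At)))
assert all(abs(P_t[i][j] - pred[i][j]) < 1e-10 for i in range(2) for j in range(2))
```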
4.7.4 Proof of Lemma 11
Proof. As shown in equation (4.15c), the eigenvalues $\mu_1$ and $\mu_2$ are functions of $a_0$ and $a_1$, the elements of the matrix $\hat A_i$. The system given in (2.10) converges with rate $\rho$ on the region
$$-\rho^2 \;\leq\; a_0 \;\leq\; \rho^2, \qquad -\rho^{-1}(a_0 + \rho^2) \;\leq\; a_1 \;\leq\; \rho^{-1}(a_0 + \rho^2)$$
depicted in Figure 4.2. In the following proof, we will drop the dependence of $\mu_1$ and $\mu_2$ on $a_0$ and $a_1$ for ease of notation, and alternate between writing functions in terms of $(\mu_1,\mu_2)$ and $(a_0,a_1)$, keeping in mind the one-to-one mapping between the two pairs.
We first prove the upper bound $\max \hat V_d^\infty(\mu_1,\mu_2) \leq 1/(d\,(1-\rho)^4)$ by showing that $\hat V_d^\infty(\mu_1,\mu_2)$ achieves its maximum at $\mu_1 = \mu_2 = \rho$. Based on Lemma 9, the total steady-state variance is given by a summation of individual variance terms $v(\mu_1,\mu_2,j) = \lim_{t\to\infty} \mathbb{E}[\hat x_i^t \hat x_i^{t+j}]$,
$$\hat V_d^\infty(\mu_1,\mu_2) \;=\; \frac{1}{d^2} \sum_{k=1}^{d} \left[ v(\mu_1,\mu_2,0) \;+\; 2 \sum_{j=1}^{d-k} v(\mu_1,\mu_2,j) \right], \tag{4.33}$$
where $v(\mu_1,\mu_2,j)$ can be written in terms of the eigenvalues $\mu_1$ and $\mu_2$ as
$$v(\mu_1,\mu_2,j) \;:=\; \lim_{t\to\infty} \mathbb{E}[\hat x_i^t \hat x_i^{t+j}] \;=\; \frac{\mu_2^{j+1}(1-\mu_1^2) \;-\; \mu_1^{j+1}(1-\mu_2^2)}{(\mu_2-\mu_1)(1-\mu_1\mu_2)(1-\mu_1^2)(1-\mu_2^2)}. \tag{4.34}$$
In order to determine the maximum of $\hat V_d^\infty$, we examine the maxima of the individual variance terms $v(\mu_1,\mu_2,j)$.
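Equation (4.34) can be cross-checked against the state-space quantities of Lemma 9: the lag-$j$ output covariance should equal the $(1,1)$ entry of $\hat P_i^\infty (\hat A_i^j)^T$. The eigenvalues in the sketch below are assumptions for illustration:

```python
# Compare the closed-form v(mu1, mu2, j) of (4.34) with the state-space
# cross-covariance [P_inf (A^j)^T]_{11}, where P_inf is found by iterating
# the Lyapunov recursion to convergence. mu1, mu2 are test values.
def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def mat_add(X, Y):
    return [[X[i][j] + Y[i][j] for j in range(2)] for i in range(2)]

def transpose(X):
    return [[X[j][i] for j in range(2)] for i in range(2)]

mu1, mu2 = 0.3, 0.6
a0, a1 = mu1 * mu2, -(mu1 + mu2)
A = [[0.0, 1.0], [-a0, -a1]]
BBt = [[0.0, 0.0], [0.0, 1.0]]

# Steady-state covariance by iterating P <- A P A^T + B B^T (spectral radius 0.6)
P = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(200):
    P = mat_add(mat_mul(mat_mul(A, P), transpose(A)), BBt)

def v_formula(j):
    num = mu2 ** (j + 1) * (1 - mu1**2) - mu1 ** (j + 1) * (1 - mu2**2)
    den = (mu2 - mu1) * (1 - mu1 * mu2) * (1 - mu1**2) * (1 - mu2**2)
    return num / den

# Lag-j output covariance from the state-space side: [P (A^j)^T]_{11}
Aj = [[1.0, 0.0], [0.0, 1.0]]
for j in range(5):
    cross = mat_mul(P, transpose(Aj))
    assert abs(cross[0][0] - v_formula(j)) < 1e-9
    Aj = mat_mul(A, Aj)
```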
Proposition 4. The steady-state covariance term $v(\mu_1,\mu_2,j)$ achieves its maximum at $\mu_1 = \mu_2 = \rho$, in which case $a_0 = \rho^2$ and $a_1 = -2\rho$, with
$$\max_{\mu_1,\mu_2}\, v(\mu_1,\mu_2,j) \;=\; \frac{\rho^{j}\left(1 + \rho^2 + j(1-\rho^2)\right)}{(1-\rho^2)^3}. \tag{4.35}$$
The proof is given in Section 4.7.6 of the Appendix. Given that each variance term $v(\mu_1,\mu_2,j)$ achieves its maximum at $\mu_1 = \mu_2 = \rho$, the maximum of $\hat V_d^\infty(\mu_1,\mu_2)$ is given by the sum of the maxima of $v(\mu_1,\mu_2,j)$,
$$\max_{\mu_1,\mu_2}\, \hat V_d^\infty(\mu_1,\mu_2) \;=\; \frac{1}{d^2} \sum_{k=1}^{d} \left[ \frac{1+\rho^2}{(1-\rho^2)^3} \;+\; 2 \sum_{j=1}^{d-k} \frac{\rho^{j}\left(1+\rho^2+j(1-\rho^2)\right)}{(1-\rho^2)^3} \right] \;=\; \frac{1}{d\,(1-\rho)^4} \;+\; \frac{2\rho^{d+1}}{d\,(1-\rho)^4(1+\rho)^2} \;-\; \frac{4\rho(1+\rho+\rho^2)(1-\rho^{d})}{d^2(1-\rho)^5(1+\rho)^3}. \tag{4.36}$$
The upper bound is proven by showing that the higher-order terms in d are negative.
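A quick numeric check (with illustrative values of $\rho$ and $d$; a sanity check, not a proof) that the closed form on the right of (4.36) agrees with the double sum on the left:

```python
# Compare the double sum in (4.36) with its closed-form evaluation for a few
# test values of the window length d and convergence rate rho.
def vmax(j, rho):
    return rho**j * (1 + rho**2 + j * (1 - rho**2)) / (1 - rho**2) ** 3

def double_sum(d, rho):
    return sum(vmax(0, rho) + 2 * sum(vmax(j, rho) for j in range(1, d - k + 1))
               for k in range(1, d + 1)) / d**2

def closed_form(d, rho):
    return (1 / (d * (1 - rho) ** 4)
            + 2 * rho ** (d + 1) / (d * (1 - rho) ** 4 * (1 + rho) ** 2)
            - 4 * rho * (1 + rho + rho**2) * (1 - rho**d)
              / (d**2 * (1 - rho) ** 5 * (1 + rho) ** 3))

for d, rho in [(1, 0.7), (2, 0.5), (5, 0.9), (13, 0.98)]:
    assert abs(double_sum(d, rho) - closed_form(d, rho)) < 1e-8 * closed_form(d, rho)
```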
Proposition 5. For any $d$ and $\rho$ such that $0 \leq \rho \leq 1$ and $d \geq 1$, the following inequality is true:
$$\frac{2\rho^{d+1}}{d\,(1-\rho)^4(1+\rho)^2} \;-\; \frac{4\rho(1+\rho+\rho^2)(1-\rho^{d})}{d^2(1-\rho)^5(1+\rho)^3} \;\leq\; 0.$$
Thus we obtain the bound
$$\max_{\mu_1,\mu_2}\, \hat V_d^\infty(\mu_1,\mu_2) \;\leq\; \frac{1}{d\,(1-\rho)^4}.$$
We will now prove the lower bound $\min \hat V_d^\infty(\mu_1,\mu_2) \geq 1/(d\,(1+\rho)^4)$.
Based on the definition of $\hat V_d^\infty(\lambda)$ given in (4.18), we can split the function into terms which vary with $1/d$ and terms which vary with $1/d^2$, with
$$\hat V_d^\infty(\mu_1,\mu_2) \;=\; f_1(\mu_1,\mu_2,d) \;+\; f_2(\mu_1,\mu_2,d),$$
$$f_1(\mu_1,\mu_2,d) \;:=\; \frac{1}{d\,(1-\mu_1)^2(1-\mu_2)^2}$$
$$f_2(\mu_1,\mu_2,d) \;:=\; \frac{2\mu_1^2(1-\mu_1^{d})(1-\mu_2)^3(1+\mu_2) \;-\; 2\mu_2^2(1-\mu_2^{d})(1-\mu_1)^3(1+\mu_1)}{d^2\,(1-\mu_2)^3(1+\mu_2)(1-\mu_1)^3(1+\mu_1)(\mu_2-\mu_1)(1-\mu_1\mu_2)}. \tag{4.37}$$
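The decomposition (4.37) can be verified numerically against the double-sum definition of $\hat V_d^\infty$ built from the terms $v(\mu_1,\mu_2,j)$ in (4.34); the eigenvalues and window length in the sketch below are test values:

```python
# Check that f1 + f2 from (4.37) reproduces the windowed double sum of the
# covariance terms v(mu1, mu2, j) from (4.34). mu1, mu2, d are test values.
mu1, mu2, d = 0.3, 0.6, 4

def v(j):
    num = mu2 ** (j + 1) * (1 - mu1**2) - mu1 ** (j + 1) * (1 - mu2**2)
    den = (mu2 - mu1) * (1 - mu1 * mu2) * (1 - mu1**2) * (1 - mu2**2)
    return num / den

# Double-sum definition of the steady-state variance of the windowed average
V = sum(v(0) + 2 * sum(v(j) for j in range(1, d - k + 1)) for k in range(1, d + 1)) / d**2

# Split into the 1/d and 1/d^2 terms of (4.37)
f1 = 1 / (d * (1 - mu1) ** 2 * (1 - mu2) ** 2)
f2_num = (2 * mu1**2 * (1 - mu1**d) * (1 - mu2) ** 3 * (1 + mu2)
          - 2 * mu2**2 * (1 - mu2**d) * (1 - mu1) ** 3 * (1 + mu1))
f2_den = (d**2 * (1 - mu2) ** 3 * (1 + mu2) * (1 - mu1) ** 3 * (1 + mu1)
          * (mu2 - mu1) * (1 - mu1 * mu2))
f2 = f2_num / f2_den

assert abs(V - (f1 + f2)) < 1e-12
```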
It is clear that the first term $f_1(\mu_1,\mu_2,d)$ is monotonically decreasing in $\mu_1$ and $\mu_2$ and achieves the minimum $1/(d\,(1+\rho)^4)$ at $a_0 = \rho^2$, $a_1 = 2\rho$. However, $f_1$ only serves as a lower bound to $\hat V_d^\infty$ when $f_2$ is positive. While $f_2(\mu_1,\mu_2,d)$ is non-monotonic in $a_0$ and $a_1$, it is clear that $f_2(\mu_1(a_0,a_1),\mu_2(a_0,a_1),d) = 0$ has a single solution at $\tilde a_1(a_0,d)$, and for given values of $a_0$ and $d$, $f_2(\mu_1(a_0,a_1),\mu_2(a_0,a_1),d) \geq 0$ for $a_1 \geq \tilde a_1(a_0,d)$. At $d = 1$, we have $\tilde a_1(a_0,1) = -a_0(1+a_0)$, and as $d$ grows larger, $\tilde a_1$ becomes more negative, with $\lim_{d\to\infty} \tilde a_1(a_0,d) = -2a_0\,\frac{1+a_0}{1+a_0^2}$. Thus
$$\hat V_d^\infty(\mu_1,\mu_2) \;\geq\; \frac{1}{d\,(1-\mu_1)^2(1-\mu_2)^2} \;=\; \frac{1}{d\,(1+a_0+a_1)^2} \tag{4.38}$$
for $a_1 \geq -a_0(1+a_0)$, $a_0 \geq 0$.
We can also split the variance function $\hat V_d^\infty(\mu_1,\mu_2)$ into the sum of variance terms $\mathbb{E}[\hat x_i^k \hat x_i^k]$ and the sum of cross-covariance terms $\mathbb{E}[\hat x_i^k \hat x_i^j]$, with
$$\hat V_d^\infty(\mu_1,\mu_2) \;=\; f_{3a}(\mu_1,\mu_2,d) \;+\; f_{3b}(\mu_1,\mu_2,d) \;+\; f_4(\mu_1,\mu_2,d)$$
$$f_{3a}(\mu_1,\mu_2,d) \;=\; \frac{1}{2d\,(1-\mu_1)(1-\mu_2)(1-\mu_1\mu_2)}, \qquad f_{3b}(\mu_1,\mu_2,d) \;=\; \frac{1}{2d\,(1+\mu_1)(1+\mu_2)(1-\mu_1\mu_2)}$$
$$f_4(\mu_1,\mu_2,d) \;=\; \frac{2\mu_2^2(1+\mu_1)(1-\mu_1)^2\!\left(d - \frac{1-\mu_2^{d}}{1-\mu_2}\right) \;-\; 2\mu_1^2(1+\mu_2)(1-\mu_2)^2\!\left(d - \frac{1-\mu_1^{d}}{1-\mu_1}\right)}{d^2\,(1+\mu_1)(1-\mu_1)^2(\mu_2-\mu_1)(1-\mu_1\mu_2)(1+\mu_2)(1-\mu_2)^2}. \tag{4.39}$$
The function $f_{3a}(\mu_1,\mu_2,d)$ is monotonically decreasing in $\mu_1$ and $\mu_2$ and achieves the minimum $1/(2d\,(1+\rho)^3(1-\rho))$ at $a_0 = \rho^2$, $a_1 = 2\rho$. Given that $f_{3b}(\mu_1,\mu_2,d)$ is always positive, $f_{3a}(\mu_1,\mu_2,d)$ clearly serves as a lower bound for $\hat V_d^\infty(\mu_1,\mu_2)$ when $f_4(\mu_1,\mu_2,d)$ is positive. While $f_4(\mu_1,\mu_2,d)$ is non-monotonic in $a_0$ and $a_1$, $f_4(\mu_1(a_0,a_1),\mu_2(a_0,a_1),d) = 0$ has a single solution at $\tilde a_1(a_0,d)$, and for given values of $a_0$ and $d$, $f_4(\mu_1(a_0,a_1),\mu_2(a_0,a_1),d) \geq 0$ for $a_1 \leq \tilde a_1(a_0,d)$. At $d = 1$, $f_4$ is identically zero, while at $d = 2$, $f_4$ changes sign at $\tilde a_1(a_0,2) = 0$. As $d$ increases, $\tilde a_1(a_0,d)$ decreases, with $\lim_{d\to\infty} \tilde a_1(a_0,d) = -a_0(1+a_0)$. Thus
$$\hat V_d^\infty(\mu_1,\mu_2) \;\geq\; \frac{1}{2d\,(1-\mu_1)(1-\mu_2)(1-\mu_1\mu_2)} \;=\; \frac{1}{2d\,(1-a_0)(1+a_0+a_1)} \tag{4.40}$$
for $a_1 \leq -a_0(1+a_0)$, $a_0 \geq 0$. When $a_0$ is negative, the function $f_{3a}(\mu_1,\mu_2,d)$ is a valid lower bound for all values of $a_1$ within the $\rho$-convergence region: in this case both eigenvalues are real with $\mu_1 < \mu_2$, and it is straightforward to verify that the sum $f_{3a}(\mu_1,\mu_2,d) + f_4(\mu_1,\mu_2,d)$ is always positive, and thus
$$\hat V_d^\infty(\mu_1,\mu_2) \;\geq\; \frac{1}{2d\,(1-\mu_1)(1-\mu_2)(1-\mu_1\mu_2)} \;=\; \frac{1}{2d\,(1-a_0)(1+a_0+a_1)}, \qquad a_0 \leq 0. \tag{4.41}$$
Figure 4.7: Variance $\hat V_d^\infty(\mu_1(a_0,a_1), \mu_2(a_0,a_1))$ as a function of $a_1$ for fixed $a_0$ and given window length $d$, with (a) $a_0 = 0.6$, $d = 3$; (b) $a_0 = 0.85$, $d = 13$; (c) $a_0 = -0.5$, $d = 7$, along with the proposed lower bounds $f_1(\mu_1,\mu_2,d)$ and $f_{3a}(\mu_1,\mu_2,d)$ defined in (4.37) and (4.39), respectively. The vertical line in red marks $a_1 = -a_0(1+a_0)$, while the red dots mark the cross-over points of $f_1$ and $f_{3a}$ with $\hat V_d^\infty$, at which points the lower bounds fail. Panel (a) demonstrates that, in accordance with equations (4.38) and (4.40), $\hat V_d^\infty > f_1$ for all $a_1$ to the right of the red line, while $\hat V_d^\infty > f_{3a}$ for all $a_1$ to the left of the red line. Panel (b) demonstrates the non-convexity of $\hat V_d^\infty$ which arises when $\mu_1$ and $\mu_2$ are complex, which makes determining an exact minimum difficult. Panel (c) demonstrates the lower bound given in (4.41).
We can combine the lower bounds in equations (4.38), (4.40), and (4.41) to determine the lower bound
$$\hat V_d^\infty(\mu_1(a_0,a_1),\mu_2(a_0,a_1)) \;\geq\; \min\!\left( \frac{1}{2d\,(1-a_0)(1+a_0+a_1)},\; \frac{1}{d\,(1+a_0+a_1)^2} \right) \quad \forall\, (a_0,a_1) \in \Delta_\rho. \tag{4.42}$$
The first function is minimized at $a_0 = 0$, $a_1 = \rho$, while the second is minimized at $a_0 = \rho^2$, $a_1 = 2\rho$, and the minima give
$$\hat V_d^\infty(\mu_1(a_0,a_1),\mu_2(a_0,a_1)) \;\geq\; \min\!\left( \frac{1}{2d\,(1+\rho)},\; \frac{1}{d\,(1+\rho)^4} \right) \;\geq\; \frac{1}{d\,(1+\rho)^4}.$$
4.7.5 Proof of Lemma 12
The result is an immediate consequence of the piecewise lower bound given in equation (4.42). Motivated by the relation between $(1+a_0+a_1)$ and $\kappa$ given in (4.19), we would like to determine an upper bound on $(1-a_0)$ in terms of $(1+a_0+a_1)$ and $\rho$. Using the geometric interpretation introduced in [35], we note that $w := (1+a_0+a_1)$ gives the distance from the current $(a_0,a_1)$ point to the $XZ$ edge of the stability region, while $h := (1-a_0)$ gives the distance to the $XY$ edge of the stability region, as shown in Figure 4.8a. For a given $(a_0,a_1)$ point, the distance $h$ is maximized by decreasing $a_0$ until we reach the $X_\rho Z_\rho$ edge of the $\rho$-convergence region, given by the affine relation $a_0 = -\rho(a_1+\rho)$, which is equivalent to $h = (1-\rho) + \frac{\rho}{1-\rho}\,w$ on that edge, and which implies
$$h \;\leq\; (1-\rho) \;+\; \frac{\rho}{1-\rho}\, w.$$
In addition, given that the distance $w$ to the $XZ$ edge must satisfy $(1-\rho)^2 \leq w \leq (1+\rho)^2$, it is straightforward to verify that
$$w \;\leq\; 2(1-\rho) \;+\; \frac{2\rho}{1-\rho}\, w,$$
which together guarantees
$$\hat V_d^\infty(\mu_1(a_0,a_1),\mu_2(a_0,a_1)) \;\geq\; \frac{1-\rho}{2d\,(1+a_0+a_1)\left((1-\rho)^2 + (1+a_0+a_1)\rho\right)}.$$
4.7.6 Proofs of Propositions
Proof of Proposition 3
Figure 4.8: Geometry of the $\rho$-convergence region for two-step accelerated algorithms.
Figure (a): For a given $(a_0,a_1)$ point within the $\rho$-convergence region $\Delta_\rho$ (with vertices $X_\rho$, $Y_\rho$, $Z_\rho$ inside the stability triangle $XYZ$, whose edges satisfy $a_1 = \pm\rho^{-1}(a_0+\rho^2)$), $w := (1+a_0+a_1)$ gives the horizontal distance to the $XZ$ edge, while $h := (1-a_0)$ gives the vertical distance to the $XY$ edge.
Figure (b): Regions of different behavior for the variance component $v(\mu_1,\mu_2,j) = \lim_{k\to\infty} \mathbb{E}[\hat x_i^k \hat x_i^{k+j}]$ defined in (4.43), with $\mu_1(a_0,a_1)$ and $\mu_2(a_0,a_1)$ defined in (4.15c). The green triangle, defined by $a_1 \in [-\sqrt{4a_0}, \sqrt{4a_0}]$, shows the region where the eigenvalues $\mu_1$ and $\mu_2$ are complex conjugates, on which $v(\mu_1,\mu_2,j)$ is bounded. When the eigenvalues $\mu_1$ and $\mu_2$ are real and $a_1 \leq 0$, shown by the yellow triangle, $v(\mu_1,\mu_2,j)$ is strictly decreasing in $a_1$. When the eigenvalues $\mu_1$ and $\mu_2$ are real and $a_1 \geq 0$, shown by the orange triangle, $v(\mu_1,\mu_2,j)$ is either strictly increasing or decreasing in $a_1$ depending on the parity of $j$, as shown in Table 4.1.
Proof. The singular values of the state transition matrix associated with the windowed average $z_d^t$ are given by the 2-norm of $\frac{1}{d}\sum_{k=t-d+1}^{t} \hat C \hat A^k(\lambda_i)$, where $\hat A$ is defined in equation (4.30). For any positive integer $k$, the term
$$\frac{\mu_2^k - \mu_1^k}{\mu_2 - \mu_1} \;=\; \sum_{j=0}^{k-1} \mu_2^{j}\, \mu_1^{k-1-j}$$
is increasing in the magnitude of both $\mu_1$ and $\mu_2$, and is thus maximized at $\mu_1 = \mu_2 = \rho$. Thus the 2-norm of the vector $\hat C \hat A^k(\lambda_i)$ is largest when $\mu_1 = \mu_2 = \rho$, as is the 2-norm of the summation of this vector over any interval of positive integers.
Proof of Proposition 4:
Proof. We will first show that for any given $a_0$, $v(\mu_1(a_0,a_1),\mu_2(a_0,a_1),j)$ is maximized at $a_1 = -\rho^{-1}(a_0+\rho^2)$, which is on the $X_\rho Z_\rho$ edge of the stability region described in Figure 4.2, and then show that the resulting maximum is in turn maximized at $a_0 = \rho^2$. In the following proof, we will drop the dependence on $a_0$ and $a_1$ in $\mu_{1,2}(a_0,a_1)$ for ease of notation, while keeping in mind that the eigenvalues $\mu_{1,2}$ are in fact functions of $a_0$ and $a_1$.
The result is shown by factoring $v(\mu_1,\mu_2,j)$ into two functions in order to isolate the effect of $j$, by introducing
$$v(\mu_1,\mu_2,j) \;=\; v_1(\mu_1,\mu_2)\, v_2(\mu_1,\mu_2,j)$$
$$v_1(\mu_1,\mu_2) \;:=\; \frac{1}{(1-\mu_1\mu_2)(1-\mu_2^2)(1-\mu_1^2)} \;=\; \frac{1}{(1-a_0)(1+a_0+a_1)(1+a_0-a_1)}$$
$$v_2(\mu_1,\mu_2,j) \;:=\; \frac{\mu_2^{j+1} - \mu_1^{j+1}}{\mu_2 - \mu_1} \;-\; \mu_1^2\mu_2^2\, \frac{\mu_2^{j-1} - \mu_1^{j-1}}{\mu_2 - \mu_1}, \tag{4.43}$$
where we have used the relationship between $\mu_1,\mu_2$ and $a_0,a_1$ to simplify $v_1(\mu_1,\mu_2)$. It is evident that $v_1(\mu_1,\mu_2)$ is convex in $a_1$, and achieves its maximum value at the extremes of the feasible region of $a_1$, with
$$\max_{a_1}\, v_1(\mu_1,\mu_2) \;=\; \frac{\rho^2}{(1-a_0)(\rho^2-a_0^2)(1-\rho^2)}, \tag{4.44}$$
$$\operatorname*{argmax}_{a_1}\, v_1(\mu_1,\mu_2) \;=\; \pm\rho^{-1}(a_0+\rho^2). \tag{4.45}$$
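The factorization $v = v_1 v_2$ in (4.43) is straightforward to confirm numerically against (4.34); the eigenvalues in the sketch below are assumed test values:

```python
# Verify v(mu1, mu2, j) = v1(mu1, mu2) * v2(mu1, mu2, j), with v given by (4.34)
# and v1, v2 given by (4.43). mu1, mu2 are test values.
mu1, mu2 = 0.3, 0.6
a0 = mu1 * mu2

def v(j):  # closed form (4.34)
    num = mu2 ** (j + 1) * (1 - mu1**2) - mu1 ** (j + 1) * (1 - mu2**2)
    den = (mu2 - mu1) * (1 - mu1 * mu2) * (1 - mu1**2) * (1 - mu2**2)
    return num / den

v1 = 1 / ((1 - mu1 * mu2) * (1 - mu2**2) * (1 - mu1**2))

def s(n):  # (mu2^n - mu1^n) / (mu2 - mu1)
    return (mu2**n - mu1**n) / (mu2 - mu1)

def v2(j):
    return s(j + 1) - mu1**2 * mu2**2 * s(j - 1)

for j in range(6):
    assert abs(v(j) - v1 * v2(j)) < 1e-12

# At j = 0, v2 reduces to the constant 1 + a0 noted in the text
assert abs(v2(0) - (1 + a0)) < 1e-12
```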
The function v2(µ1,µ2, j) captures the effect of j on the variance function v(µ1,µ2, j). When j = 0,
the function simplifies to v2(µ1,µ2,0) = 1 + a0, which is constant for fixed a0. When j is larger
than zero, the behavior of v2 depends on whether the eigenvalues are real or not, as well the signs
of a0 and a1 and the parity of j.
Proposition 6. The function $v_2(\mu_1,\mu_2,j)$ defined in (4.43) is maximized at $a_1 = -\rho^{-1}(a_0+\rho^2)$ for a fixed value of $a_0 \in [-\rho^2, \rho^2]$ and any positive integer $j$.
The proof of Proposition 6 is given later in this section. As both $v_1(\mu_1,\mu_2)$ and $v_2(\mu_1,\mu_2,j)$ are maximized at $a_1 = -\rho^{-1}(a_0+\rho^2)$, the variance function $v(\mu_1,\mu_2,j)$ must also be maximized at
$a_1 = -\rho^{-1}(a_0+\rho^2)$ for a given $a_0 \in [-\rho^2, \rho^2]$ and integer $j \geq 0$. At this value of $a_1$, $\mu_2 = \rho$, $\mu_1 = a_0/\rho$, and the maximum value is given by
$$\max_{a_1}\, v\!\left( \mu_1(a_0,a_1), \mu_2(a_0,a_1), j \right) \;=\; \frac{\rho^{j+2}}{(1-a_0)(1-\rho^2)(\rho^2-a_0)} \;-\; \frac{a_0^{j+1}\,\rho^{-j+2}}{(1-a_0)(\rho^2-a_0^2)(\rho^2-a_0)}. \tag{4.46}$$
It then remains to determine the maximum of $v(\mu_1,\mu_2,j)$ with respect to $a_0$ at $a_1 = -\rho^{-1}(a_0+\rho^2)$.
Proposition 7. The function $v_2(\mu_1,\mu_2,j)$ defined in (4.43) is maximized at $a_0 = \rho^2$ for fixed $\mu_2 = \rho$, $\rho \in [0, 1]$, and any positive integer $j$.
The proof of Proposition 7 is given later in this section. It is straightforward to verify that $v_1(a_0/\rho,\rho)$ is increasing in $a_0$, and maximized at $a_0 = \rho^2$. The derivative is given by
$$\frac{\partial}{\partial a_0}\, v_1(a_0/\rho,\rho) \;=\; \frac{\rho^2\left(2a_0 + \rho^2 - 3a_0^2\right)}{(1-a_0)^2(\rho^2-a_0^2)^2(1-\rho^2)}$$
and is positive given that $\rho^2 \geq a_0$.
As both $v_1$ and $v_2$ are maximized at $a_0 = \rho^2$ for $\mu_2 = \rho$, the maximum of $v(\mu_1,\mu_2,j)$ is
$$\max_{a_0,a_1}\, v\!\left( \mu_1(a_0,a_1), \mu_2(a_0,a_1), j \right) \;=\; \frac{\rho^{j}\left(1+\rho^2+(1-\rho^2)j\right)}{(1-\rho^2)^3} \tag{4.47}$$
occurring at $a_0 = \rho^2$, $a_1 = -2\rho$, with $\mu_1 = \mu_2 = \rho$.
Proof of Proposition 6
Proof. First we consider the case when both eigenvalues are complex conjugates, which occurs when $a_0 \geq 0$ and $a_1 \in [-\sqrt{4a_0}, \sqrt{4a_0}]$, as depicted by the green region in Figure 4.8b, in which case the function $v_2(\mu_1,\mu_2,j)$ is bounded. We begin by using the relation $\mu_1 = a_0/\mu_2$ and the identity $x^{n+1} - y^{n+1} = (x-y)\sum_{k=0}^{n} x^k y^{n-k}$ to rewrite $v_2$ as
$$v_2(a_0/\mu_2,\mu_2,j) \;=\; \mu_2^{-j} \sum_{k=0}^{j} \mu_2^{2k}\, a_0^{j-k} \;-\; \mu_2^{2-j} \sum_{k=0}^{j-2} \mu_2^{2k}\, a_0^{j-k} \tag{4.48}$$
so that $\mu_2$ is the only argument of $v_2$ which varies with $a_1$. Since $\mu_2$ is a complex number, its powers can be written as $\mu_2^k = r^k(\cos(k\theta) + \mathrm{i}\sin(k\theta))$, where $r$ and $\theta$ give the value of $\mu_2$ in polar coordinates. In this case, given the definition of $\mu_2(a_0,a_1)$ in (4.15c), we have $r = \sqrt{a_0}$ for $a_1 \in [-\sqrt{4a_0}, \sqrt{4a_0}]$, which ensures $\mu_2^k \in [-a_0^{k/2},\, a_0^{k/2}]$. Thus we can bound the summations by
$$\sum_{k=0}^{j-1} \mu_2^{2k}\, a_0^{j-k} \;\in\; \left[ -j\, a_0^{j/2},\; j\, a_0^{j/2} \right],$$
which in turn allows us to bound $v_2$ by
$$v_2(\mu_1,\mu_2,j) \;\in\; \left[ -a_0^{j/2}\left(1+a_0+(1-a_0)j\right),\; a_0^{j/2}\left(1+a_0+(1-a_0)j\right) \right], \qquad a_0 = \mu_1\mu_2. \tag{4.49}$$
At the edges of the current domain of $a_1$, where $a_1 = \pm\sqrt{4a_0}$, $\mu_1$ and $\mu_2$ are repeated real eigenvalues $\mu_1 = \mu_2 = \pm\sqrt{a_0}$ and the bounds presented in (4.49) are achieved. When $j$ is even, $v_2(\mu_1,\mu_2,j)$ is equal to the upper bound at both endpoints of the current domain, while when $j$ is odd, $v_2(\mu_1,\mu_2,j)$ achieves the upper bound at $a_1 = -\sqrt{4a_0}$ and the lower bound at $a_1 = \sqrt{4a_0}$.
We next examine the behavior of $v_2(\mu_1,\mu_2,j)$ in the case both eigenvalues are real. Using the expression for $v_2$ given in (4.48), we determine the derivative with respect to $\mu_2$,
$$\frac{\partial}{\partial \mu_2}\, v_2\!\left( \frac{a_0}{\mu_2}, \mu_2, j \right) \;=\; \mu_2^{-j-1} \sum_{k=0}^{j} \mu_2^{2k}\, a_0^{j-k}\, (2k-j) \;+\; \mu_2^{-j+1} \sum_{k=0}^{j-2} \mu_2^{2k}\, a_0^{j-k}\, (2+2k-j). \tag{4.50}$$
82
The behavior in each region is summarized below, listing the sign of $\frac{\partial}{\partial\mu_2} v_2(a_0/\mu_2,\mu_2,j)$ and the resulting monotonicity of $v_2$ in $a_1$:

  a0 >= 0:
    a1 <= -sqrt(4 a0)                  j in N    derivative >= 0    v2 decreasing in a1
    -sqrt(4 a0) <= a1 <= sqrt(4 a0)    j in N    (complex case)     v2 bounded
    sqrt(4 a0) <= a1                   j even    derivative <= 0    v2 increasing in a1
                                       j odd     derivative >= 0    v2 decreasing in a1
  a0 <= 0:
    a1 <= 0                            j in N    derivative >= 0    v2 decreasing in a1
    0 <= a1                            j even    derivative <= 0    v2 increasing in a1
                                       j odd     derivative >= 0    v2 decreasing in a1

Table 4.1: Behavior of $v_2(\mu_1(a_0,a_1), \mu_2(a_0,a_1), j)$ on the $\rho$-stability region shown in Figure 4.2.
The derivative can be further simplified depending on whether $j$ is even ($\exists n \in \mathbb{N}: j = 2n$) or odd ($\exists n \in \mathbb{N}: j = 2n-1$),
$$\frac{\partial}{\partial \mu_2}\, v_2\!\left( \frac{a_0}{\mu_2}, \mu_2, 2n-1 \right) \;=\; a_0^n (1-a_0) \sum_{k=0}^{n-2} (2k+1) \left( a_0^{-(k+1)} \mu_2^{2k} - a_0^{k}\, \mu_2^{-2(k+1)} \right) \;+\; (2n-1)\, a_0^n \left( a_0^{-n} \mu_2^{2(n-1)} - a_0^{n-1} \mu_2^{-2n} \right) \tag{4.51a}$$
$$\frac{\partial}{\partial \mu_2}\, v_2\!\left( \frac{a_0}{\mu_2}, \mu_2, 2n \right) \;=\; \frac{2 a_0^n}{\mu_2} (1-a_0) \sum_{k=0}^{n-2} (k+1) \left( a_0^{-(k+1)} \mu_2^{2(k+1)} - a_0^{k+1} \mu_2^{-2(k+1)} \right) \;+\; \frac{2 a_0^n}{\mu_2}\, n \left( a_0^{-n} \mu_2^{2n} - a_0^{n} \mu_2^{-2n} \right) \tag{4.51b}$$
and we can then determine whether both derivatives are positive or negative. We will divide the domain into four regions based on the signs of $a_0$ and $a_1$.
In the case both are positive, with $a_0 \in [0, \rho^2]$ and $\mu_2 \in [\sqrt{a_0}, \rho]$, the result is immediate given $\mu_2^2 \geq a_0$. In the case $a_0$ is positive and $\mu_2$ is negative, with $\mu_2 \in [-\rho, -\sqrt{a_0}]$: when $j$ is odd, we note that all instances of $\mu_2$ in the derivative given in (4.51a) are squared; thus, given $\mu_2^2 \geq a_0$, the derivative is positive. When $j$ is even, the term $\mu_2^{-1}$ is negative and the derivative is negative. When $a_0$ is negative, the term inside the summation alternates sign as the summation index $k$ increases. Given that the magnitude of this term is increasing in $k$, the sign of the entire sum is determined by the sign of the final term. For $a_1 \leq 0$, we have $\mu_2^2 > a_0^2$, and the sign of the summation matches the sign of $a_0^n$, and thus the derivative is always positive. When $a_1$ is positive, $\mu_2$ satisfies $0 \leq \mu_2 \leq \sqrt{-a_0}$, and while the term inside the summation again alternates sign as the summation index $k$ increases, the magnitude of this term is decreasing in $k$. Thus the sign of the entire sum is determined by the sign of the first term, and the derivative is negative when $j$ is even and positive when $j$ is odd.
Given the definition $\mu_2 = \frac{1}{2}\left(-a_1 + \sqrt{a_1^2 - 4a_0}\right)$, it is clear that for fixed $a_0$, $\mu_2(a_0,a_1)$ is decreasing in $a_1$ when the eigenvalues are real. Using the chain rule along with the results summarized in Table 4.1, we conclude that for any fixed $a_0 \in [-\rho^2, \rho^2]$, the function $v_2(a_0/\mu_2,\mu_2,j)$ must be maximized at either of the extremes of $a_1$ when $j$ is even and at the smallest possible value of $a_1$ when $j$ is odd. Checking the functional value at the minimum and maximum possible values of $a_1$ confirms that the function is equal at both points, and thus the maximum is always achieved at the smallest possible value of $a_1$, where $a_1 = -\rho^{-1}(a_0+\rho^2)$, for any fixed $a_0$ and positive integer $j$.
Proof of Proposition 7:
Proof. It is similarly straightforward to verify that $v_2(a_0/\rho,\rho,j)$ is increasing in $a_0$ for any positive integer $j$ and fixed $\mu_2 > 0$. Using the expression for $v_2$ given in (4.48), the derivative with respect to $a_0$ can be written as
$$\frac{\partial}{\partial a_0}\, v_2(a_0/\mu_2,\mu_2,j) \;=\; (1-\mu_2^2)\, \mu_2^{-j} \sum_{k=0}^{j-2} \mu_2^{2k}\, a_0^{j-k-1}\, (j-k) \;+\; \mu_2^{j-2} \tag{4.52}$$
which is clearly positive on the domain $0 \leq a_0 \leq 1$, $0 \leq \mu_2 \leq 1$. When $a_0$ is negative and $j$ is even, the summation term is negative, and thus it is unclear whether the derivative is positive. However, it is evident that
$$v_2(|a_0|/\rho,\rho,j) \;>\; v_2(-|a_0|/\rho,\rho,j),$$
and together with the fact that $v_2(a_0/\rho,\rho,j)$ is increasing in $a_0$ when $a_0$ is positive, the maximum occurs at $a_0 = \rho^2$.
Proof of Proposition 5:
Proof. We first rearrange the inequality given in the proposition to obtain the equivalent condition
$$\frac{d\,\rho^{d}}{1-\rho^{d}} \;\leq\; \frac{2(1+\rho+\rho^2)}{1-\rho^2}.$$
It is straightforward to verify that the inequality must be true at $d = 1$ for any $\rho \in [0, 1]$, and by examining the derivative we will show that $(d\rho^d)/(1-\rho^d)$ is decreasing in $d$. The derivative is given by
$$\frac{\partial}{\partial d}\, \frac{d\,\rho^{d}}{1-\rho^{d}} \;=\; \frac{\rho^{d}\left(1-\rho^{d}+d\ln\rho\right)}{(1-\rho^{d})^2}$$
and is negative when $\rho^{d} > 1 + d\ln\rho$. Once again, this inequality is straightforward to verify for $d = 1$, $\rho \in [0, 1]$. We can then use the relations
$$\frac{\partial}{\partial d}\, \rho^{d} \;=\; \rho^{d}\ln\rho, \qquad \frac{\partial}{\partial d}\left(1 + d\ln\rho\right) \;=\; \ln\rho, \qquad \ln\rho \;<\; \rho^{d}\ln\rho \;<\; 0$$
to show that while both sides of the inequality $\rho^d > 1 + d\ln\rho$ are decreasing with respect to $d$, the right side is decreasing faster. Thus $(d\rho^{d})/(1-\rho^{d})$ is decreasing in $d$, and the inequality given in the proposition holds for any $d \geq 1$, $\rho \in [0, 1]$.
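The rearranged inequality in this proof can also be checked by brute force over a grid of test values (a sanity check, not a substitute for the proof):

```python
# Grid check of d * rho^d / (1 - rho^d) <= 2 * (1 + rho + rho^2) / (1 - rho^2)
# over sampled window lengths d and rates rho in (0, 1).
for d in range(1, 50):
    for i in range(1, 99):
        rho = i / 100
        lhs = d * rho**d / (1 - rho**d)
        rhs = 2 * (1 + rho + rho**2) / (1 - rho**2)
        assert lhs <= rhs + 1e-12
```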
Chapter 5
Performance of noisy higher-order accelerated gradient flow dynamics for strongly convex quadratic optimization problems
In this chapter we study the effect of additional momentum on the performance of momentum-based accelerated first-order optimization algorithms in the presence of additive white stochastic disturbances, in continuous time. For strongly convex quadratic problems with a condition number $\kappa$, we determine the best possible convergence rate of continuous-time gradient flow dynamics of order $d$, and identify optimal parameters. We also demonstrate that additional momentum terms do not affect the trade-offs between convergence rate and variance amplification that exist for gradient flow dynamics with $d = 2$, by deriving a lower bound on the product of variance and settling time which scales with the square of the condition number $\kappa$.
5.1 Introduction
Previous work [21–28] has established that accelerated methods are more sensitive to noise than gradient descent, while providing superior rates of convergence. In this chapter we explore how additional history terms affect the tradeoff between noise amplification and convergence rate.
We first consider the continuous-time setting, which is easier to analyze. The connection between ordinary differential equations and iterative optimization algorithms is well established [48–56]. Recently, a second-order continuous-time dynamical system with constant coefficients, for which a certain implicit-explicit Euler discretization yields Nesterov's accelerated algorithm, was introduced in [80]. For strongly convex problems, these accelerated gradient flow dynamics were shown to be exponentially stable with rate $1/\sqrt{\kappa}$, where $\kappa$ is the condition number of the problem. A more recent work [35] examined the tradeoffs between convergence rate and robustness to additive white noise of accelerated gradient flow dynamics and established a lower bound on the product between the steady-state variance of the error in the optimization variable and the settling time that scales with $\kappa^2$. For this class of accelerated dynamical systems, there appears to be a fundamental limitation between convergence rate and variance amplification imposed by the condition number. In addition, similar phenomena were shown to persist in the discrete-time setting for the class of noisy two-step momentum algorithms [34, 35]. Corroborating results for discrete-time algorithms were also presented in [36] by examining a parameterized family of two-step momentum algorithms that enable systematic tradeoffs between these quantities.
In this chapter, we extend the results in [35, 80] by considering $d$th-order accelerated gradient flow dynamics that generalize the system presented in [80]. For strongly convex quadratic problems, we analyze the convergence properties of this system and its sensitivity to additive white noise. In particular, we establish the optimal convergence rate $\rho = \kappa^{-1/d}$ and identify the complete set of constant algorithmic parameters that achieve the optimal rate. In addition, we derive general analytical expressions for the steady-state variance based on system parameters, for systems of any order $d$. This characterization allows us to show that the product of variance amplification $J$ (in the error of the optimization variable) and settling time $1/\rho$ is lower bounded by $\kappa^2/(2d)$. In addition, we present parameters which minimize steady-state variance for a given convergence rate, and show that the resultant variance scales with the square of the condition number.
Previous work [81] obtained similar results regarding the parameters which achieve the convergence rate $\rho = \kappa^{-1/d}$. We provide the complete set of optimal parameters and additional analysis of system behavior, as well as an in-depth examination of the $d = 3$ case. Furthermore, our results additionally consider the noise amplification properties of this class of accelerated gradient flow dynamics, particularly the trade-off between noise amplification and settling time.
The rest of the chapter is structured as follows. In Section 5.2, we provide preliminaries and background material which reintroduce momentum-based algorithms as gradient flows. In Section 5.3, we present our results regarding convergence rate and steady-state variance amplification. We first determine the optimal rate of exponential convergence $\rho$ in terms of the condition number $\kappa$ and system order $d$, and show that there exists a set of parameters which achieves this rate. Next, we determine an analytical expression for the steady-state variance amplification in terms of Routh-Hurwitz coefficients and identify a lower bound on the product between variance amplification and settling time which scales with $\kappa^2$. Finally, in Section 5.4, we present an example of our results for the specific case with $d = 3$, where a single additional momentum term has been added to the traditional two-step accelerated gradient flow dynamics considered in [35, 80].
5.2 Background for gradient flow dynamics
We consider a class of dynamical systems,
$$x^{(d)}(t) \;+\; \sum_{k=0}^{d-1} \beta_k\, x^{(k)}(t) \;+\; \alpha\, g\!\left( \sum_{k=0}^{d-1} \gamma_k\, x^{(k)}(t) \right) \;=\; w(t) \tag{5.1}$$
where $x^{(k)}(t)$ is the $k$th derivative of $x$ with respect to time $t$, $g$ is a nonlinear function, $\alpha$, $\beta_k$, and $\gamma_k$ are constant parameters, and $w$ is a white-noise input with
$$\mathbb{E}[w(t)] \;=\; 0, \qquad \mathbb{E}[w(t_1)w(t_2)] \;=\; \sigma^2 I\, \delta(t_1-t_2) \tag{5.2}$$
and $\delta(\cdot)$ is the Dirac delta. Our motivation for studying system (5.1) comes from optimization. In the absence of noise, we can use system (5.1) with $g(x) := \nabla f(x)$ to solve unconstrained optimization problems
$$\underset{x}{\text{minimize}} \quad f(x) \tag{5.3}$$
where $f: \mathbb{R}^n \to \mathbb{R}$ is an $m$-strongly convex function with an $L$-Lipschitz continuous gradient $\nabla f$. Throughout the paper, we make the following assumption.
Assumption 1. The parameters in system (5.1) satisfy $\beta_0 = 0$, $\gamma_0 = 1$.
Assumption 1 ensures that the equilibrium points $x^\star$ of system (5.1) satisfy the first-order optimality conditions for (5.3),
$$g(x^\star) \;=\; \nabla f(x^\star) \;=\; 0. \tag{5.4}$$
As varying the parameter $\alpha$ is a matter of time-scaling, we set $\alpha = 1/L$ without loss of generality, and note that, for $d = 1$, system (5.1) simplifies to the gradient flow dynamics,
$$x^{(1)}(t) \;+\; (1/L)\, \nabla f(x(t)) \;=\; w(t)$$
and, for $d = 2$, the noisy accelerated gradient flow dynamics are obtained [80],
$$x^{(2)}(t) \;+\; \beta_1 x^{(1)}(t) \;+\; (1/L)\, \nabla f\!\left(x(t) + \gamma_1 x^{(1)}(t)\right) \;=\; w(t).$$
5.2.1 Quadratic optimization problems
For strongly convex quadratic optimization problems,
$$f(x) \;=\; \tfrac{1}{2}\, x^T Q x \;-\; q^T x \tag{5.5}$$
with $Q \in \mathbb{R}^{n\times n}$, the parameters of strong convexity and Lipschitz continuity, $m$ and $L$, are respectively determined by the smallest and the largest eigenvalues of the Hessian matrix $Q$,
$$mI \;\preceq\; Q \;\preceq\; LI$$
and the condition number is given by $\kappa := L/m$. In this case, differential equation (5.1) with $g = \nabla f$ becomes linear,
$$x^{(d)}(t) \;+\; \sum_{k=0}^{d-1} \left( \beta_k I + \gamma_k \alpha Q \right) x^{(k)}(t) \;=\; w(t) \tag{5.6}$$
and the optimization algorithm admits an LTI state-space representation,
$$\dot\psi \;=\; A\psi + Bw, \qquad z \;=\; C\psi \tag{5.7a}$$
where $z := x - x^\star$ is the error in the optimization variable, $\psi$ is the state vector defined by
$$\psi \;=\; \begin{bmatrix} \psi_1^T & \psi_2^T \end{bmatrix}^T, \qquad \psi_1 = z, \qquad \psi_2 \;:=\; \begin{bmatrix} (x^{(1)})^T & \cdots & (x^{(d-1)})^T \end{bmatrix}^T \tag{5.7b}$$
and $A$, $B$, $C$ are constant matrices that are partitioned conformably with the state vector $\psi$,
$$A \;=\; \begin{bmatrix} 0 & I \\ A_{21} & A_{22} \end{bmatrix}, \qquad B \;=\; \begin{bmatrix} 0 \\ I \end{bmatrix}, \qquad C \;=\; \begin{bmatrix} I & 0 \end{bmatrix} \tag{5.7c}$$
$$A_{21} \;=\; -(\beta_0 I + \gamma_0 \alpha Q), \qquad A_{22} \;=\; \begin{bmatrix} -(\beta_1 I + \gamma_1 \alpha Q) & \cdots & -(\beta_{d-1} I + \gamma_{d-1} \alpha Q) \end{bmatrix}. \tag{5.7d}$$
The eigenvalue decomposition of the Hessian matrix, $Q = V\Lambda V^T$, can be utilized to bring the matrices in (5.7) into their block-diagonal forms, where $V$ is an orthogonal matrix of the eigenvectors of $Q$ and $\Lambda$ is a diagonal matrix of its eigenvalues. In particular, the change of variables,
$$\hat x := V^T x, \qquad \hat w := V^T w \tag{5.8}$$
allows us to transform system (5.7) into a parameterized family of $n$ decoupled subsystems indexed by $i = 1,\ldots,n$,
$$\dot{\hat\psi}_i \;=\; \hat A(\lambda_i)\,\hat\psi_i + \hat B\hat w_i, \qquad z_i \;=\; \hat C\hat\psi_i \tag{5.9a}$$
where $\lambda_i$ is the $i$th eigenvalue of the matrix $Q \in \mathbb{R}^{n\times n}$, $\hat w_i$ is the $i$th component of the vector $\hat w$,
$$\hat A(\lambda) \;=\; \begin{bmatrix} 0 & I \\ -a_0(\lambda) & \begin{bmatrix} -a_1(\lambda) & \cdots & -a_{d-1}(\lambda) \end{bmatrix} \end{bmatrix}, \qquad \hat B \;=\; \begin{bmatrix} 0 & \cdots & 0 & 1 \end{bmatrix}^T, \qquad \hat C \;=\; \begin{bmatrix} 1 & 0 & \cdots & 0 \end{bmatrix}. \tag{5.9b}$$
Here,
$$a_k(\lambda) \;:=\; \beta_k + \gamma_k \alpha \lambda, \qquad k \in \{0,\ldots,d-1\} \tag{5.9c}$$
and the characteristic polynomial of $\hat A(\lambda)$ is given by
$$F(s) \;:=\; \sum_{k=0}^{d} a_k(\lambda)\, s^k \;=\; \prod_{k=1}^{d} (s - \mu_k) \tag{5.9d}$$
where we let $a_d(\lambda) := 1$ and $\mu_k$ are the eigenvalues of $\hat A(\lambda)$.
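A small sketch (with assumed test parameters for $d = 3$) confirming that the companion matrix $\hat A(\lambda)$ in (5.9b) has the characteristic polynomial (5.9d) with coefficients $a_k(\lambda)$ from (5.9c):

```python
# Compare det(sI - A_hat(lambda)) with F(s) = s^d + sum_k a_k(lambda) s^k for a
# d = 3 companion matrix. The parameter values below are illustrative assumptions.
d = 3
L_lip, lam = 4.0, 2.5              # Lipschitz constant and a sample Hessian eigenvalue
alpha = 1.0 / L_lip
beta = [0.0, 1.2, 2.0]             # beta_0 = 0 per Assumption 1; others are test values
gamma = [1.0, 0.7, 0.3]            # gamma_0 = 1 per Assumption 1
a = [beta[k] + gamma[k] * alpha * lam for k in range(d)]  # a_0, a_1, a_2 from (5.9c)

# Companion form: last row is [-a_0, -a_1, -a_2]
A = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [-a[0], -a[1], -a[2]]]

def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def F(s):
    return s**d + sum(a[k] * s**k for k in range(d))

for s in (-1.0, 0.5, 2.0, 3.7):
    sI_minus_A = [[(s if i == j else 0.0) - A[i][j] for j in range(3)] for i in range(3)]
    assert abs(det3(sI_minus_A) - F(s)) < 1e-9
```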
5.2.2 Exponential stability
System (5.7) is exponentially stable if all eigenvalues of the matrix $A$ have negative real parts, i.e., if $A$ is Hurwitz,
$$\|\psi(t)\| \;=\; \|e^{At}\psi(0)\| \;\leq\; c\, e^{-\rho t}\, \|\psi(0)\| \tag{5.10}$$
and the convergence rate $\rho$ is determined by
$$\rho \;=\; |\max \Re(\mathrm{eig}(A))| \tag{5.11}$$
where $\Re(\mathrm{eig}(\cdot))$ denotes the real part of the eigenvalues of a given matrix. From the modal decomposition (5.9), it can be seen that the convergence rate of system (5.7) is determined by the slowest mode of the matrices $\hat A(\lambda)$,
$$\rho \;=\; \min_{\lambda \in [m,L]} \hat\rho(\lambda),$$
where $\hat\rho(\lambda) := |\max \Re(\mathrm{eig}(\hat A(\lambda)))|$.
For a desired level of accuracy $\varepsilon$, we require $c\, e^{-\rho t} \leq \varepsilon$, which is equivalent to the condition $t \geq \log(c/\varepsilon)/\rho$, where
$$T_s \;:=\; \frac{1}{\rho} \tag{5.12}$$
is the settling time. Our goal is to examine the impact of the algorithmic parameters
$$\theta \;:=\; \begin{bmatrix} \beta_1 & \cdots & \beta_{d-1} & \gamma_1 & \cdots & \gamma_{d-1} \end{bmatrix}^T \tag{5.13}$$
on the eigenvalues $\mu_k(\lambda)$ of $\hat A(\lambda)$ and, thus, the stability of the system. Under Assumption 1 on $(\beta_0, \gamma_0)$ and with $\alpha = 1/L$, the vector of parameters $\theta$ defines the coefficients $a_k(\lambda)$ in (5.9c) of the characteristic polynomial (5.9d).
The Routh-Hurwitz (RH) criterion provides necessary and sufficient conditions on the coefficients $a_k(\lambda)$ in (5.9c) to ensure stability. Furthermore, by introducing the shifted characteristic polynomial
$$F_\rho(s) \;:=\; \sum_{k=0}^{d} a_k(\lambda)\,(s-\rho)^k \;=\; \prod_{k=1}^{d} (s - \nu_k) \tag{5.14}$$
we can utilize the RH criterion to determine conditions for $\rho$-exponential stability. In particular, since the roots of $F(s)$ and $F_\rho(s)$ are related by $\nu_k = \mu_k + \rho$, $F(s)$ is $\rho$-exponentially stable (i.e., all its roots have real parts smaller than $-\rho$) if and only if $F_\rho(s)$ is stable. Thus, it suffices to examine the RH conditions on the coefficients of
$$F_\rho(s) \;:=\; \sum_{k=0}^{d} \tilde a_k^{\rho}(\lambda)\, s^k, \qquad \tilde a_k^{\rho}(\lambda) \;:=\; \sum_{i=0}^{d-k} a_{d-i}(\lambda) \binom{d-i}{k} (-\rho)^{d-k-i} \tag{5.15}$$
where $\tilde a_d^{\rho}(\lambda) := 1$.
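The shifted-coefficient formula (5.15) can be checked by comparing the resulting polynomial against $F(s-\rho)$ directly; the coefficients $a_k$ in the sketch below are test values, not parameters from the text:

```python
# Verify that the coefficients atilde_k from (5.15) reproduce F(s - rho), so that
# each root of F_rho equals a root of F shifted by +rho. a_0, ..., a_d are test values.
from math import comb

d, rho = 3, 0.4
a = [0.6, 1.5, 2.1, 1.0]   # a_0, ..., a_d with a_d = 1

atilde = [sum(a[d - i] * comb(d - i, k) * (-rho) ** (d - k - i) for i in range(d - k + 1))
          for k in range(d + 1)]
assert abs(atilde[d] - 1.0) < 1e-12  # leading coefficient stays 1

F = lambda s: sum(a[k] * s**k for k in range(d + 1))
F_rho = lambda s: sum(atilde[k] * s**k for k in range(d + 1))

# F_rho(s) = F(s - rho) at sample points
for s in (-2.0, -0.3, 0.0, 1.1, 2.5):
    assert abs(F_rho(s) - F(s - rho)) < 1e-9
```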
In Section 5.3, we determine the largest rate of convergence for a given condition number κ
and identify the vector of parameters θ that achieves this rate.
5.2.3 Variance amplification
In addition to the convergence rate, we are also interested in quantifying the steady-state variance
of the error in the optimization variable (variance amplification),
J := lim
t →∞
1
t
Z t
0
E
∥x(τ) − x
⋆
∥
2
dτ. (5.16)
For the LTI system (5.7), the variance J is given by J = trace(C X C^T), where the steady-state covariance matrix

X = lim_{t→∞} E[ ψ(t) ψ(t)^T ]    (5.17)

solves the algebraic Lyapunov equation A X + X A^T = −σ^2 B B^T. The eigenvalue decomposition of the Hessian matrix Q can be utilized to express the variance amplification as

J = ∑_{i=1}^{n} Ĵ(λ_i),  Ĵ(λ_i) := Ĉ X̂(λ_i) Ĉ^T = X̂_{11}(λ_i),    (5.18)
where Ĵ(λ_i) denotes the contribution of the ith eigenvalue λ_i of Q to the variance amplification. In Section 5.3, we derive the expression for the element X̂_{11}(λ_i) of the matrix X̂(λ_i) that solves the modal Lyapunov equation

Â(λ_i) X̂(λ_i) + X̂(λ_i) Â^T(λ_i) = −σ^2 B̂ B̂^T    (5.19)

in terms of the coefficients of the Routh-Hurwitz table and utilize this relation to determine lower bounds on J in terms of the convergence rate ρ and the condition number κ.
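As a numerical illustration (our own sketch; scipy's Lyapunov solver stands in for the analytical developments of Section 5.3), the modal variance X̂_{11} in (5.18)-(5.19) can be computed directly:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def modal_variance(a, sigma2=1.0):
    """X_hat_11 for the companion matrix with monic characteristic
    coefficients a = [a_0, ..., a_{d-1}], driven by noise through B = e_d."""
    d = len(a)
    A = np.zeros((d, d))
    A[:-1, 1:] = np.eye(d - 1)           # integrator chain
    A[-1, :] = -np.asarray(a)            # last row carries the coefficients
    B = np.zeros((d, 1)); B[-1, 0] = 1.0
    # solves A X + X A^T = -sigma^2 B B^T, cf. (5.19)
    X = solve_continuous_lyapunov(A, -sigma2 * B @ B.T)
    return X[0, 0]
```

For d = 2, this recovers the classical value σ²/(2 a_0 a_1).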
5.3 Main results for gradient flow dynamics
In this section, we present our main results regarding the convergence rate and steady-state variance amplification for the class of dth-order gradient flow dynamics described by (5.6). For strongly convex quadratic problems with condition number κ, we establish the optimal rate of exponential convergence ρ = κ^{−1/d} and identify algorithmic parameters that achieve this optimal rate. We also provide analytical expressions for the modal contributions Ĵ(λ) to the variance amplification, first in terms of the Routh-Hurwitz coefficients of the linear time-invariant system modeling the gradient flow dynamics, and second in terms of the eigenvalues of said LTI system. We use these expressions to establish that a lower bound on the product between the variance amplification and the settling time scales as κ^2, for any stabilizing parameters. Our results extend the observations made in [35], which only considered the case d = 2, to general d, and they recover the same trade-off between the settling time and variance amplification. Finally, for a given convergence rate ρ, we present parameters which minimize the steady-state variance, and bounds on the product of the variance at said parameters with the settling time. Proofs of all results are relegated to the Appendix.
Theorem 11 establishes the optimal rate of convergence and determines parameters that achieve
the optimal rate.
Theorem 11. For a strongly convex quadratic objective function f with condition number κ, under Assumption 1 regarding the optimality conditions and the normalization condition α = 1/L, the optimal rate of exponential convergence of system (5.6) is given by ρ = κ^{−1/d} and is achieved only by the parameter family θ^⋆ parameterized by γ_{d−1},

γ_k = \binom{d−2}{k} ρ^{−k} + \binom{d−2}{k−1} ρ^{d−k−1} γ_{d−1},

β_k = ( \binom{d}{k} − \binom{d−2}{k} ) ρ^{d−k} − \binom{d−2}{k−1} ρ^{2d−k−1} γ_{d−1},  k = 1, ..., d−2,

γ_{d−1} ∈ [0, ρ^{−(d−1)}],  ρ = κ^{−1/d}.    (5.20)
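The family (5.20) can be checked numerically; the sketch below (our own code, with `math.comb` for the binomial coefficients) builds Â(λ) on a grid of λ and verifies that the slowest mode decays at ρ = κ^{−1/d}:

```python
import numpy as np
from math import comb

def optimal_params(d, kappa, g=0.0):
    """Family (5.20); g is the free hyperparameter gamma_{d-1} in [0, rho^{-(d-1)}]."""
    rho = kappa ** (-1.0 / d)
    gamma = {d - 1: g}
    beta = {d - 1: d * rho - g / kappa}      # from a_{d-1}(m) = d*rho with alpha*m = 1/kappa
    for k in range(1, d - 1):
        gamma[k] = comb(d - 2, k) * rho ** (-k) + comb(d - 2, k - 1) * rho ** (d - k - 1) * g
        beta[k] = ((comb(d, k) - comb(d - 2, k)) * rho ** (d - k)
                   - comb(d - 2, k - 1) * rho ** (2 * d - k - 1) * g)
    return beta, gamma, rho

def achieved_rate(d, kappa, beta, gamma):
    """Decay rate of the slowest mode of A_hat(lambda) over a grid of alpha*lambda."""
    worst = -np.inf
    for al in np.linspace(1.0 / kappa, 1.0, 200):
        a = [al] + [beta[k] + gamma[k] * al for k in range(1, d)]
        A = np.zeros((d, d)); A[:-1, 1:] = np.eye(d - 1); A[-1, :] = -np.array(a)
        worst = max(worst, np.max(np.linalg.eigvals(A).real))
    return -worst
```

The tolerance in any such check must be loose near λ = m, where the d-fold eigenvalue at −ρ is numerically ill-conditioned.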
As we demonstrate in the proof, the constraint on the optimal rate established by Theorem 11 is imposed by

αλ = ∏_{k=1}^{d} (−µ_k),

where the µ_k are the eigenvalues of Â(λ) in (5.9), for all λ ∈ [m, L]. The parameters are derived by defining the ρ-convergence region, the set of coefficients a_k(λ) defined in (5.9c) which ensure that the algorithm converges with rate ρ. The region is defined by imposing the Routh-Hurwitz conditions for stability (see [82]) on the shifted characteristic polynomial F^ρ_λ(s) given in (5.14). At λ = m, for the optimal rate of convergence ρ = κ^{−1/d} = (αm)^{1/d}, the ρ-convergence region consists of a single point at which µ_k = −ρ for k = 1, ..., d. Thus the coefficient values a_k(m) are fixed, with ã^ρ_k(m) = 0 for all k. As λ increases towards L, the non-convexity of the ρ-convergence region, arising from the non-convex nature of the RH stability constraints, together with the linearity of the coefficients ã^ρ_k(λ) in λ, requires ã^ρ_k(λ) = 0 for k = 0, ..., d−3. This places d−2 eigenvalues at µ_k = −ρ, with the two free eigenvalues constrained by the requirement µ_{d−1} µ_d = (αλ) ρ^{−(d−2)}.
The next theorem presents two equivalent analytical expressions for the modal contribution to variance amplification Ĵ(λ) for all λ, based on the characteristic polynomial F_λ(s) given in (5.9d). The first expression is in terms of the entries of the RH table associated with (5.9d) (see [82]), and the second is in terms of the roots µ_k of F_λ(s).
Theorem 12. For a strongly convex quadratic objective function f , under Assumption 1 regarding the optimality conditions, the modal contribution Ĵ(λ) to the steady-state variance amplification of system (5.6) with stabilizing parameters θ and α = 1/L is given by both

Ĵ(λ) = σ^2 / ( 2 a_0(λ) r(λ) ) = σ^2 ∑_{i=1}^{d} (−1)^d / ( 2µ_i ∏_{k=1, k≠i}^{d} (µ_i^2 − µ_k^2) ),

where r(λ) is the first (and only) entry in the last row of the Routh-Hurwitz table associated with the characteristic polynomial F_λ(s) in (5.9d) of the matrix Â(λ) in (5.9), and µ_k, k = 1, ..., d, denote the d roots of F_λ(s), equivalently the eigenvalues of Â(λ).
Both analytical expressions derived above are determined directly by the characteristic polynomial (5.9d), and provide a way to examine how the steady-state noise amplification is determined by our choice of parameters θ. The expressions are obtained by two different approaches to solving the algebraic Lyapunov equation given in (2.22), leveraging the fact that the matrix Â(λ) can be expressed equivalently in terms of the coefficients a_k, as seen in (6.5), or in terms of the eigenvalues µ_k through a Jordan decomposition. Theorem 12 leads directly to our next results: the relationship between Ĵ and the eigenvalues µ_k allows us to determine parameters which minimize the variance, and the expression for Ĵ in terms of RH coefficients allows us to bound the variance at those parameters.
First, we determine a lower bound on the product of variance and settling time for any stabilizing parameters.
Theorem 13. For a strongly convex quadratic objective function f , under Assumption 1 regarding the optimality conditions, with stabilizing parameters θ and the normalization condition α = 1/L, the product of the modal contribution to the steady-state variance amplification and the settling time T_s = 1/ρ is lower bounded by

Ĵ(λ)/ρ > σ^2 / ( 2d (αλ)^2 ).

For the case λ = m, this simplifies to

Ĵ(m)/ρ > σ^2 κ^2 / (2d).
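A quick Monte Carlo spot-check of this bound for d = 3 (our own sketch; the sampling ranges are arbitrary, and the closed form used for Ĵ is the d = 3 expression derived in Section 5.4):

```python
import numpy as np

rng = np.random.default_rng(0)
d, checked = 3, 0
for _ in range(200):
    lam = rng.uniform(0.05, 1.0)                # alpha*lambda, so a_0 = lam
    a1, a2 = rng.uniform(0.1, 5.0, size=2)
    if a1 * a2 <= lam:                          # RH condition fails: skip unstable draws
        continue
    A = np.array([[0., 1., 0.], [0., 0., 1.], [-lam, -a1, -a2]])
    rho = -np.max(np.linalg.eigvals(A).real)    # modal decay rate
    Jhat = a2 / (2 * lam * (a1 * a2 - lam))     # modal variance (sigma = 1)
    assert Jhat / rho > 1.0 / (2 * d * lam ** 2)
    checked += 1
```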
We observe that this lower bound is decreasing in λ and scales as κ^2 for λ = m. This leads directly to a lower bound on the product of the variance of the complete system J and the settling time T_s.
Corollary 1. Under the settings of Theorem 13, we have the lower bound

J/ρ > ∑_{i=1}^{n} σ^2 / ( 2d (αλ_i)^2 ) ≥ σ^2 κ^2 / (2d) + σ^2 (n−1) / (2d).
The proof is immediate by combining Theorem 13 with equation (5.18). Based on this result, we conclude that for systems of type (5.6), there is a fundamental trade-off between settling time and variance amplification, assuming d ≪ κ. While Theorem 11 provides parameters which optimize the convergence rate for a given condition number, the next theorem provides parameters which optimize the variance for a given convergence rate.
Theorem 14. For a strongly convex quadratic objective function f , under Assumption 1 regarding the optimality conditions and the normalization condition α = 1/L, the steady-state variance amplification J of system (5.6) is minimized by the parameters

γ_k = ρ^{−k} \binom{d−1}{k},  β_k = ρ^{d−k} \binom{d−1}{k−1}    (5.21)

for k = 0, ..., d−1, which places the coefficients a_k(λ) at

a_k = \binom{d−1}{k} a_0 ρ^{−k} + \binom{d−1}{d−k} ρ^{d−k}.
It is important to note that these parameters are included in the family θ^⋆ which achieves the optimal rate of convergence ρ = κ^{−1/d} provided in Theorem 11, specifically with hyperparameter γ_{d−1} = ρ^{−(d−1)}. While the exact value of the minimum variance provided by these parameters depends on the distribution of the eigenvalues λ between m and L, we can determine the following bound.
Theorem 15. For a strongly convex quadratic objective function f , with the parameters given in Theorem 14 at the optimal convergence rate ρ = κ^{−1/d} and the normalization condition α = 1/L, the product of the steady-state variance amplification and the settling time T_s = 1/ρ is bounded by

(n−1) / ( 2(d−1) + 2ρ^d ) + κ^2 √d / ( 2d √π ) ≤ J × T_s ≤ n κ^2 √d / ( (2d−1) √π ).
The result is obtained by noting that for the given parameters, the modal contribution to the variance at λ = L can be lower bounded, and the modal contribution at λ = m can be approximated. Importantly, Theorem 15 shows that the parameters presented in Theorem 14 not only achieve the optimal convergence rate ρ = κ^{−1/d} for a given condition number κ but also admit a steady-state variance J whose product with the settling time scales directly with κ^2, regardless of the distribution of the eigenvalues of Q.
Finally, we note how choosing α ≠ 1/L influences the relation between the convergence rate ρ and the variance amplification J in terms of the condition number κ and the Lipschitz constant L. Based on Theorem 11, the optimal rate of exponential convergence is now given by

ρ = (αL/κ)^{1/d}.    (5.22)

The relationship between the noise amplification term Ĵ(m) and the RH coefficients is independent of the scale parameter α. At ρ = (αL κ^{−1})^{1/d}, the binomial theorem gives the following values of the coefficients a_0(m) and a_1(m):

a_0(m) = ρ^d,  a_1(m) = d ρ^{d−1}.    (5.23)

Using these coefficient values results in

Ĵ/ρ ≥ 1 / ( 2 ρ^d · dρ^{d−1} · ρ ) = (1/(2d)) (αL)^{−2} κ^2.    (5.24)

Thus we conclude that adjusting our step size α does not fundamentally affect the relationship between variance, settling time, and condition number outlined above.
5.4 Analysis for gradient flow dynamics with d = 3
In this section, we examine the behavior of the gradient flow (5.6) with d = 3 and identify parameters that achieve ρ = κ
−1/3
, as well as bounds on the product between noise amplification and
settling time. The techniques employed in this section extend directly to larger d, and the ability to
visualize they system behavior in three dimensions offers valuable insight into the proof provided
in the previous section. When compared to the case of d = 2 as seen in [35], in the d = 3 case the
3-D region of coefficients a0, a1, a2 for which the system achieves ρ-stability becomes non convex,
as seen in Figure 5.1. In this section, we illustrate how non-convexity of the convergence region
99
influences the choice of parameters which optimize convergence rate. The matrix Aˆ(λ) in (5.9)
takes the form

Â(λ_i) = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ −a_0(λ_i) & −a_1(λ_i) & −a_2(λ_i) \end{bmatrix}    (5.25a)

a_0(λ) = αλ,  a_1(λ) = β_1 + γ_1 αλ,  a_2(λ) = β_2 + γ_2 αλ.    (5.25b)
The characteristic polynomial can be written either in terms of the coefficients a_k or in terms of the roots µ_k,

F_λ(s) = (s − µ_1)(s − µ_2)(s − µ_3) = s^3 + a_2 s^2 + a_1 s + a_0,    (5.26)

leading to the relationships between the coefficients a_k and the eigenvalues µ_k

a_0(λ) = −µ_1 µ_2 µ_3,  a_1(λ) = µ_1µ_2 + µ_1µ_3 + µ_2µ_3,  a_2(λ) = −µ_1 − µ_2 − µ_3.    (5.27)
The shifted characteristic polynomial is given by

F^ρ_λ(s) = (s − ρ)^3 + a_2 (s − ρ)^2 + a_1 (s − ρ) + a_0 = s^3 + ã^ρ_2 s^2 + ã^ρ_1 s + ã^ρ_0,    (5.28)

leading to the following definitions of the shifted coefficients:

ã^ρ_2(λ) = a_2(λ) − 3ρ,  ã^ρ_1(λ) = a_1(λ) − 2a_2(λ)ρ + 3ρ^2,
ã^ρ_0(λ) = a_0(λ) − a_1(λ)ρ + a_2(λ)ρ^2 − ρ^3.    (5.29)
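The shifted coefficients (5.29) can be verified with a two-line root check (our own sketch; the coefficient values are arbitrary):

```python
import numpy as np

rho, (a0, a1, a2) = 0.3, (0.4, 1.5, 2.0)
at2 = a2 - 3 * rho
at1 = a1 - 2 * a2 * rho + 3 * rho ** 2
at0 = a0 - a1 * rho + a2 * rho ** 2 - rho ** 3
mu = np.sort(np.roots([1, a2, a1, a0]))      # roots of F_lambda
nu = np.sort(np.roots([1, at2, at1, at0]))   # roots of the shifted polynomial
```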
When the roots of the shifted characteristic polynomial are negative, the roots of the original characteristic polynomial satisfy µ_k ≤ −ρ, ensuring that the system converges with rate ρ. Theorem 11 states the optimal convergence rate to be κ^{−1/3}. As stated in the proof provided in Section 5.6.1, at λ = m we have a_0(m) = αm = κ^{−1}, which according to equation (5.27) fixes µ_1µ_2µ_3 = −κ^{−1}. It is impossible to place any eigenvalue µ_k to the left of −κ^{−1/3} without forcing another eigenvalue to become more positive than −κ^{−1/3}, thus decreasing the convergence rate. Thus ρ = κ^{−1/3} is the optimal rate of convergence.
We will now derive the parameters which achieve this rate. At λ = m, the ρ-convergence region for ρ = κ^{−1/3} consists of a single point at which all eigenvalues µ_k = −ρ, which based on equations (5.27) and (5.29) fixes the coefficients of F_λ(s) and F^ρ_λ(s) at

a_0(m) = ρ^3,  a_1(m) = 3ρ^2,  a_2(m) = 3ρ,    (5.30a)
ã^ρ_0(m) = 0,  ã^ρ_1(m) = 0,  ã^ρ_2(m) = 0.    (5.30b)
As λ increases incrementally from m, the point given by (a_0(λ), a_1(λ), a_2(λ)) must remain within the ρ-stability region. We will use the Routh-Hurwitz criterion to determine conditions on the γ_k's which ensure ρ-convergence as λ increases.
For general λ, the Routh-Hurwitz stability criterion on F^ρ_λ(s) yields the constraints

ã^ρ_0(λ) ≥ 0,  ã^ρ_1(λ) ≥ 0,  ã^ρ_2(λ) ≥ 0,  ã^ρ_2(λ) ã^ρ_1(λ) − ã^ρ_0(λ) ≥ 0,    (5.31)

which ensure ρ-convergence.
See Figure 5.1 for an illustration. As we increase λ from m to m + ε for ε ∈ [0, L − m], we can use definitions (5.29) and (5.25b) to write

ã^ρ_2(m+ε) = ã^ρ_2(m) + (εα) γ_2,  ã^ρ_1(m+ε) = ã^ρ_1(m) + (εα)(γ_1 − 2γ_2 ρ),
ã^ρ_0(m+ε) = ã^ρ_0(m) + (εα)(1 − γ_1 ρ + γ_2 ρ^2).    (5.32)

The positivity constraints on ã^ρ_k(λ), together with ã^ρ_k(m) = 0, result in the following constraints on γ_1 and γ_2:

0 ≤ γ_2 ≤ γ_1 / (2ρ),  γ_1 ≤ ρ^{−1} + γ_2 ρ.    (5.33)
The last constraint in (5.31) can be written as

ã^ρ_1(m+ε) ã^ρ_2(m+ε) − ã^ρ_0(m+ε) = (εα)^2 γ_2 (γ_1 − 2γ_2 ρ) − (εα)(1 − γ_1 ρ + γ_2 ρ^2) ≥ 0.    (5.34)

Notice that the first term is scaled by ε^2 while the second is only scaled by ε, and both must be positive, which forces ã^ρ_0(m+ε) = 0 as ε → 0. Combining this with (5.33) and ρ = κ^{−1/3} yields

0 ≤ γ_2 ≤ κ^{2/3},  γ_1 = κ^{1/3} + γ_2 κ^{−1/3}.    (5.35a)
Finally, we solve for the β_k parameters using β_k = a_k(m) − γ_k α m, which yields

β_1 = 2κ^{−2/3} − κ^{−4/3} γ_2,  β_2 = 3κ^{−1/3} − κ^{−1} γ_2.    (5.35b)

The coefficients a_k(λ) generated by these parameters at γ_2 = 0 and γ_2 = κ^{2/3} are shown in Figure 5.1, and it is evident that the parameters given in (5.35) match those provided in Theorem 11.
As stated in the introduction, the variance is given by J = ∑_{i=1}^{n} Ĵ(λ_i) with Ĵ(λ_i) = Ĉ X̂(λ_i) Ĉ^T, where the covariance matrix X̂(λ) solves

Â(λ) X̂(λ) + X̂(λ) Â(λ)^T = −σ^2 B̂ B̂^T.

For d = 3, using the expression for Â(λ) given in (5.25a), it is straightforward to solve for the (1,1) entry of X̂(λ) to obtain

Ĵ(λ) = σ^2 a_2(λ) / ( 2 a_0(λ) ( a_1(λ) a_2(λ) − a_0(λ) ) ).    (5.36)
Figure 5.1: Stability region for third-order gradient flow dynamics. On the left is the ρ-stability region in terms of the coefficients a_0, a_1, and a_2 for ρ = κ^{−1/3} at κ = 100. Each color indicates a level set of a_0(λ) = αλ. Notice that the set of a_1(λ), a_2(λ) for which the system achieves ρ-stability grows larger as λ increases, and at λ = m the stability region condenses to a single point. The black line corresponds to the parameters β_1 = 2κ^{−2/3}, γ_1 = κ^{1/3}, β_2 = 3κ^{−1/3}, γ_2 = 0, for which the end points of the line segment (a_0(λ), a_1(λ), a_2(λ)) for λ = m and λ = L are given by (κ^{−1}, 3κ^{−2/3}, 3κ^{−1/3}) and (1, 2κ^{−2/3} + κ^{1/3}, 3κ^{−1/3}), respectively. The red line corresponds to the parameter set β_1 = κ^{−2/3}, γ_1 = 2κ^{1/3}, β_2 = 2κ^{−1/3}, γ_2 = κ^{2/3}, with end points (κ^{−1}, 3κ^{−2/3}, 3κ^{−1/3}) and (1, κ^{−2/3} + 2κ^{1/3}, 2κ^{−1/3} + κ^{2/3}). These parameters both yield the optimal rate.
On the right is shown a level set of the ρ-stability region at λ = (L+m)/2. The edges of the region are determined by the four constraints given by the Routh-Hurwitz criterion as described in (5.31). The system is stable when all constraints are positive, which is shown by the shaded region. Notice that the level set is convex; the non-convexity appears in a_0 as we vary λ, as seen on the left.
For any rate of convergence ρ ≤ κ^{−1/3}, the smallest possible noise amplification Ĵ(λ) is given by

min_{a_1(λ), a_2(λ)} Ĵ(λ) = ρ ( a_0(λ) + 2ρ^3 ) / ( 4 a_0(λ) ( a_0(λ) + ρ^3 )^2 ),    (5.37a)

achieved at

a_1(λ) = 2 a_0(λ) ρ^{−1} + ρ^2,  a_2(λ) = a_0(λ) ρ^{−2} + 2ρ,    (5.37b)
γ_1 = 2ρ^{−1},  β_1 = ρ^2,  γ_2 = ρ^{−2},  β_2 = 2ρ.    (5.37c)
The result is proven by noting that for a fixed λ, the variance Ĵ(λ) given in (5.36) is strictly decreasing in a_1(λ) and a_2(λ) on the region of convergence. As seen in Figure 5.1, for any given λ, a_1 and a_2 are both maximized at the corner of the convergence region where ã^ρ_0(λ) and ã^ρ_1(λ) are both zero. It is then straightforward to solve ã^ρ_0(λ) = 0 and ã^ρ_1(λ) = 0 for a_1(λ) and a_2(λ). This parameter choice is equivalent to placing two eigenvalues of the system at µ_1 = µ_2 = −ρ and one at µ_3 = −a_0(λ) ρ^{−2}.
The slope parameters γ_k determine how a_k(λ) varies with λ, and so do not depend on the intercept a_k(m). Given that the a_k(λ)'s are linear in λ, for λ = m + ε we can write

a_1(m+ε) = a_1(m) + γ_1 α ε = ( 2α(m+ε) + ρ^3 ) / ρ = a_1(m) + 2α ε / ρ,
a_2(m+ε) = a_2(m) + γ_2 α ε = ( α(m+ε) + 2ρ^3 ) / ρ^2 = a_2(m) + α ε / ρ^2,
which leads to γ_1 = 2ρ^{−1}, γ_2 = ρ^{−2}. The intercept parameters β_k determine the value of a_k(λ) at λ = m. Using the relation a_0(m) = α m = m/L = κ^{−1} and rearranging (5.25b) to use β_k = a_k(m) − γ_k κ^{−1}, we can determine

β_1 = 2κ^{−1} ρ^{−1} + ρ^2 − 2ρ^{−1} κ^{−1} = ρ^2,
β_2 = κ^{−1} ρ^{−2} + 2ρ − ρ^{−2} κ^{−1} = 2ρ.
For the family of parameters which achieve the optimal convergence rate, established in (5.35), it is straightforward to verify that

max_{λ∈[m,L]} Ĵ(λ) = Ĵ(m) = (3/16) σ^2 κ^{5/3}.    (5.38)

To see this, notice that Ĵ(λ) is a decreasing function of λ, as dĴ(λ)/dλ ≤ 0, and at λ = m the values a_k(m) are the same for any γ_2 ∈ [0, κ^{2/3}]. Thus, we can state that Ĵ(m) ≤ J ≤ n Ĵ(m), which leads directly to the result

(3/16) σ^2 κ^2 < J T_s ≤ (3/16) σ^2 n κ^2.
5.5 Conclusion
For gradient flow dynamics of order d which correspond to accelerated first-order optimization algorithms, we prove the optimal convergence rate ρ = κ^{−1/d}, and determine parameters which optimize the convergence rate, as well as parameters which optimize the variance for a fixed convergence rate. In addition, we determine a lower bound on the product of variance and settling time. Our results demonstrate that regardless of the number of momentum terms in accelerated gradient flow dynamics, the product between variance amplification and settling time scales as κ^2 and 1/d. We conclude that gradient flows of higher order offer increased convergence rate and reduced variance, although the product of variance and settling time still scales with the square of the condition number.
5.6 Proofs
5.6.1 Proof of Theorem 11
Proof. We first prove that the optimal rate cannot exceed κ^{−1/d}. We begin by examining the best possible rate ρ(λ) associated with the matrix Â(λ) for a fixed λ. The eigenvalues of Â(λ) are given by the roots of the characteristic polynomial given in (5.9d). By matching the constant terms in (5.9d) of the product and summation expressions of F_λ(s), we can write

a_0(λ) = αλ = ∏_{k=1}^{d} (−µ_k).    (5.39)

Notice that the coefficient a_0 is fixed with respect to λ, while the freedom to choose β_k and γ_k without constraints allows for any desired placement of a_k(λ) for k ≠ 0. Therefore, to find the optimal convergence rate ρ(λ) we must solve

minimize_{µ_1,...,µ_d} max_k ℜ(µ_k)  subject to  ∏_{k=1}^{d} (−µ_k) = αλ.    (5.40)

In essence, we wish to place the real part of the largest eigenvalue as far from the imaginary axis as possible, given that the product of all eigenvalues is fixed. The solution to this problem is given by

µ_k = −(αλ)^{1/d},  k = 1, ..., d,    (5.41)
Figure 5.2: The possible placement of complex conjugate roots with fixed product µ µ̄ = αλ lies on the circle of radius √(αλ) shown in red. It is evident that if we wish to minimize their real part, both roots must lie on the real axis at µ = µ̄ = −√(αλ).
which yields ρ(λ) ≤ (αλ)^{1/d}. The result is apparent when the µ_k are real, and the extension to complex roots is straightforward upon noting that all non-real roots must come in complex conjugate pairs, whose product is real and greater than the product of their real parts, as seen in Figure 5.2. Then, the rate of the system (5.7) is upper bounded by

ρ = min_{λ∈[m,L]} ρ(λ) ≤ min_{λ∈[m,L]} (αλ)^{1/d} = (m/L)^{1/d} = κ^{−1/d}

since α = 1/L. It follows that for a fixed λ, the parameters θ can be chosen to achieve ρ ≤ κ^{−1/d}. We next show that there exists a θ which yields ρ = κ^{−1/d} for all λ ∈ [m, L].
In total, we must select 2d − 2 parameters β_k, γ_k in order to design the linear functions a_k(λ) = β_k + γ_k αλ for k = 0, ..., d−1. This amounts to placing the line segment

a(λ) := [a_0(λ), ⋯, a_{d−1}(λ)]^T,  λ ∈ [m, L],

in R^d. We begin the process of selecting these parameters by noticing that the end point a(m) of this line segment is fixed, as a unique set of a_k(m) allows ρ(m) = κ^{−1/d}.
This follows directly from the optimal solution (5.41) for λ = m. The relationship between a(m) and the eigenvalues µ_k(Â(m)) shown in (5.9d) yields

a_k(m) = \binom{d}{k} ρ^{d−k}    (5.42)
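Equation (5.42) simply says that a(m) consists of the coefficients of (s + ρ)^d, which admits a one-line check (our own sketch):

```python
import numpy as np
from math import comb

d, rho = 5, 0.7
a_m = [comb(d, k) * rho ** (d - k) for k in range(d + 1)]   # ascending, cf. (5.42)
poly = np.poly(np.full(d, -rho))                            # (s + rho)^d, descending
```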
where ρ = κ^{−1/d}. Now that we have determined the values of a_k which give the desired rate of convergence at λ = m, we examine the conditions under which this margin of stability is maintained as λ increases.
As stated in Section 5.2, we will determine conditions for ρ-exponential stability by imposing the Routh-Hurwitz stability criterion on the coefficients of the shifted characteristic polynomial shown in (5.15). We decompose the solution into two parts: given that the rate of convergence ρ is achieved at λ = m, we will determine the slope parameters γ_k which ensure the stability constraints are not violated as λ increases from m to L. Once we have obtained the values of γ, we use equation (5.9c), which directly determines the values of β, given that the a_k(m) are designed according to (5.42).
As we demonstrated in Section 5.2, the eigenvalues µ_k and ν_k of the characteristic polynomial F_λ(s) in (5.9d) and its shifted counterpart F^ρ_λ(s) in (5.14), respectively, satisfy µ_k = ν_k − ρ. Thus, the optimal placement of eigenvalues in (5.41) for λ = m yields ν_k = 0, which ensures the coefficients of F^ρ_λ(s) satisfy ã^ρ_k(m) = 0 for all k ≤ d−1 and ã^ρ_d(m) = 1.
Thus at λ = m, the coefficients ã^ρ_k(λ), k ≤ d−1, of the shifted characteristic polynomial F^ρ_λ(s) are all zero, and the RH criterion is met.
We first establish how the coefficients ã^ρ_k(λ) defined in (5.15) vary with λ = m + ε, where ε ∈ [0, L−m]:

ã^ρ_k(m+ε) = ∑_{i=0}^{d−k} (β_{d−i} + γ_{d−i} α m) \binom{d−i}{k} (−ρ)^{d−k−i} + α ε ∑_{i=0}^{d−k} γ_{d−i} \binom{d−i}{k} (−ρ)^{d−k−i}
          = ã^ρ_k(m) + α ε b̃^ρ_k(γ) = α ε b̃^ρ_k(γ),    (5.43)

where γ := (γ_0, ..., γ_{d−1}) and we let

b̃^ρ_k(γ) := ∑_{i=1}^{d−k} γ_{d−i} \binom{d−i}{k} (−ρ)^{d−k−i}.    (5.44)
Therefore, the necessary Routh-Hurwitz stability conditions [82] require b̃^ρ_k(γ) ≥ 0, k = 0, ..., d−1. The sufficient conditions for ρ-exponential stability obtained by the RH criterion require the first column of the RH table associated with the coefficients of the shifted characteristic polynomial in (5.15) to be non-negative. To achieve this, motivated by our observations for the special cases d = 2, ..., 5, we present the following proposition.
Proposition 8. The characteristic polynomial F^ρ_λ(s) defined in (5.15) with a_k(λ) = β_k + γ_k αλ and ρ = κ^{−1/d} has strictly negative roots if and only if b̃^ρ_k(γ) = 0 for k = 0, ..., d−3.
Proof. For a general characteristic polynomial of order n, g^n(s) = ∑_{k=0}^{n} c_k s^k, let f_{i,j} denote the entries of the RH table, where i indicates the row and j indicates the column. The RH table consists of n rows, each of which has ⌈(n+2−i)/2⌉ columns. The entries are given by

f_{1,j} := c_{n−2(j−1)},  f_{2,j} := c_{n+1−2j},  f_{i,j} := f_{i−2,j+1} − f_{i−2,1} f_{i−1,j+1} / f_{i−1,1}.    (5.45)

Necessary conditions for stability require all c_k ≥ 0, and sufficient conditions for stability require f_{i,1} ≥ 0 for all rows i. The proof of Proposition 8 then rests on the following claims.
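The recursion (5.45) translates directly into code; the sketch below (our own; the epsilon substitution for a zero pivot follows the remark in the proof of Claim 2) returns the first column of the table:

```python
import numpy as np

def rh_first_column(c):
    """First column of the Routh-Hurwitz table of sum_k c_k s^k (c ascending),
    following the recursion (5.45); a zero pivot is replaced by a tiny epsilon."""
    n = len(c) - 1
    w = (n + 2) // 2 + 1                   # column budget, padded with zeros
    T = np.zeros((n + 1, w))
    cd = np.asarray(c, dtype=float)[::-1]  # descending c_n, ..., c_0
    T[0, :len(cd[0::2])] = cd[0::2]
    T[1, :len(cd[1::2])] = cd[1::2]
    for i in range(2, n + 1):
        piv = T[i - 1, 0] if T[i - 1, 0] != 0 else 1e-12
        for j in range(w - 1):
            T[i, j] = T[i - 2, j + 1] - T[i - 2, 0] * T[i - 1, j + 1] / piv
    return T[:, 0]
```

All entries of the first column are positive exactly when the polynomial is Hurwitz stable.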
Claim 1. For any k = 0, ..., d−2, the constraint ã^ρ_{k+1}(λ) − ã^ρ_k(λ)/ã^ρ_{d−1}(λ) ≥ 0 for all λ ∈ [m, L], given ã^ρ_k(λ) defined in (5.15) and ã^ρ_j(m) = 0 for all j = 0, ..., d−1, requires b̃^ρ_k(γ) = 0, with b̃^ρ_k(γ) defined in (5.44).
Proof. Given that all ã^ρ_k(λ) = ã^ρ_k(m) + α ε b̃^ρ_k(γ) must be non-negative for all λ according to the necessary RH conditions, and ã^ρ_k(m) = 0 for all k, the constraint can be written as

α^2 ε^2 b̃^ρ_{k+1}(γ) b̃^ρ_{d−1}(γ) − α ε b̃^ρ_k(γ) ≥ 0  ⇒  α ε b̃^ρ_{k+1}(γ) b̃^ρ_{d−1}(γ) ≥ b̃^ρ_k(γ).

Given that the terms b̃^ρ_j(γ) must be non-negative for any j, b̃^ρ_k(γ) must be zero in order for the inequality to hold as ε → 0.
Claim 2. For the RH coefficients as defined in (5.45), if f_{i,2} ≤ 0 then f_{i+1,2} must also satisfy f_{i+1,2} ≤ 0.
Proof. First suppose f_{i,2} < 0. Then the ith sufficient condition for stability, f_{i+2,1} = f_{i,2} − f_{i,1} f_{i+1,2} / f_{i+1,1} ≥ 0, results in the requirement f_{i+1,2} < 0, as f_{i,1} and f_{i+1,1} must both be non-negative.
Next suppose f_{i,2} = 0. Then the ith sufficient condition for stability, f_{i+2,1} = f_{i,2} − f_{i,1} f_{i+1,2} / f_{i+1,1} ≥ 0, can seemingly be resolved by setting either f_{i+1,2} ≤ 0 or f_{i,1} = 0. However, when the first element of a row of the RH table is zero, we replace the element with 0 < ε ≪ 1 in order to construct the remainder of the table, and consider the sign of the elements in the first column in the limit ε → 0. Thus setting f_{i,1} = 0 results in f_{i+2,1} = −ε f_{i+1,2} / f_{i+1,1}, which still requires f_{i+1,2} ≤ 0.
Thus, f_{i,2} ≤ 0 implies f_{i+1,2} ≤ 0.
Claim 3. For a characteristic polynomial g^n(s) = ∑_{k=0}^{n} c_k s^k with c_n = 1, f_{n+1−2k, k+1} = c_0 for every k = 1, ..., ⌊n/2⌋.
Proof. The claim is shown using backwards induction on k. Suppose the assertion is true for a given k. Then at k − 1 we see

f_{n+3−2k, k} = f_{n+1−2k, k+1} − f_{n+1−2k, 1} f_{n+2−2k, k+1} / f_{n+2−2k, 1} = f_{n+1−2k, k+1} = c_0,

meaning the assertion is true for k − 1. Here we have used the fact that each row i has only ⌈(n+2−i)/2⌉ non-zero elements, and so f_{n+2−2k, k+1} must be zero.
Based on the initializations of f_{i,j} for i = 1, 2 given in (5.45), c_0 is the last element of the first row when n is even, and of the second row when n is odd, ensuring the assertion is true for k = ⌊n/2⌋. By induction, the assertion is true for all integers k = 1, ..., ⌊n/2⌋.
Claim 4. For a characteristic polynomial g^n(s) = ∑_{k=0}^{n} c_k s^k with c_n = 1 and c_{n−3} = 0, the RH necessary and sufficient requirements for stability require c_0 = 0.
Proof. First note that c_{n−3} is the second entry of the second row of the RH table, so f_{2,2} = c_{n−3} = 0. Given that f_{i,2} = 0 for i = 2, using induction together with Claim 2 shows that f_{i,2} ≤ 0 for all i = 2, ..., n.
However, based on Claim 3 with k = 1, we know f_{n−1,2} = c_0, which must be non-negative according to the necessary RH conditions. Thus, in order to satisfy both constraints, we must have c_0 = 0.
Lemma 13. For a characteristic polynomial h^d(s) = ∑_{k=0}^{d} b_k s^k with b_d = 1 and b_{d−3} = 0, the RH necessary and sufficient requirements for stability require b_k = 0 for k = 0, ..., d−4.
Proof. The result follows immediately from Claim 4. Clearly h^d(s) satisfies the premise of Claim 4 with n = d and c_k = b_k, in which case for g^n(s) = h^d(s) to be stable it must have c_0 = b_0 = 0. Then the polynomial h^d(s) can be written as

h^d(s) = ∑_{k=1}^{d} b_k s^k = s ∑_{k=0}^{d−1} b_{k+1} s^k = s ĥ^{d−1}(s),  ĥ^{d−1}(s) := ∑_{k=0}^{d−1} b_{k+1} s^k,

in which case stability is determined by the roots of the degree-(d−1) polynomial ĥ^{d−1}(s). Given that b_d = 1 and b_{d−3} = 0 according to our premise, it is clear that ĥ^{d−1}(s) satisfies the premise of Claim 4 with n = d−1 and c_k = b_{k+1}, in which case for g^n(s) = ĥ^{d−1}(s) to be stable we must have c_0 = b_1 = 0.
The process is repeated until we have set b_k = 0 for k = 0, ..., d−4, at which point the polynomial has become

h^d(s) = s^{d−2} ( s^2 + b_{d−1} s + b_{d−2} ).

At this point Claim 4 can no longer be applied, and the stability of h^d(s) depends on the values of b_{d−1} and b_{d−2}, which need not be zero.
Now we apply Claim 1 and Lemma 13 to the shifted characteristic polynomial F^ρ_λ(s) defined in (5.15). The first sufficient condition, which gives the positivity constraint f_{3,1} ≥ 0, is equivalent to ã^ρ_{d−2}(λ) − ã^ρ_{d−3}(λ)/ã^ρ_{d−1}(λ) ≥ 0. According to Claim 1, satisfying this constraint requires b̃^ρ_{d−3}(γ) = 0 and ã^ρ_{d−3}(λ) = 0 for all λ.
According to the definition of the coefficients ã^ρ_k(λ) in (5.15), we have ã^ρ_d(λ) = 1. Given that we have just established the requirement ã^ρ_{d−3}(λ) = 0, the polynomial F^ρ_λ(s) now satisfies the conditions of Lemma 13, and thus stability requires ã^ρ_k(λ) = 0 for k = 0, ..., d−3 for all λ ∈ [m, L]. As previously established in equation (5.43), given that ã^ρ_k(m) = 0 for all k, the requirement ã^ρ_k(λ) = 0 is equivalent to b̃^ρ_k(γ) = 0 for k = 0, ..., d−3.
This reduces the shifted characteristic polynomial to

F^ρ_λ(s) = s^{d−2} ( s^2 + s ã^ρ_{d−1}(λ) + ã^ρ_{d−2}(λ) ),

which has d−2 roots at zero (placing d−2 roots of F_λ(s) at −ρ), and two roots determined by ã^ρ_{d−1}(λ) and ã^ρ_{d−2}(λ).
We will now determine parameters γ which satisfy b̃^ρ_k(γ) = 0 for k = 0, ..., d−3 and b̃^ρ_k(γ) ≥ 0 for k = d−2, d−1. As stated in Assumption 1, γ_0 = 1 is fixed, which leaves d−1 free variables subject to d−2 linear equality constraints, resulting in a solution with one degree of freedom, which will be constrained by the inequality constraints b̃^ρ_k(γ) ≥ 0 for k = d−2, d−1.
In general, b̃^ρ_k(γ) is a function of γ_k, ..., γ_{d−1}. We leverage this structure by solving b̃^ρ_k(γ) = 0 sequentially, starting at k = 0, to express γ_{k+1} in terms of γ_{k+2}, ..., γ_{d−1} by using the previous expression for γ_k. At k = d−3, this yields γ_{d−2} = ρ^{−(d−2)} ( 1 + (d−2) ρ^{d−1} γ_{d−1} ), and we can then use back substitution to express γ_k solely in terms of γ_{d−1},

γ_k = ρ^{−k} [ 1 − (−1)^k ∑_{i=k+1}^{d−1} \binom{i−1}{k−1} γ_i (−ρ)^i ] = \binom{d−2}{k} ρ^{−k} + \binom{d−2}{k−1} ρ^{d−k−1} γ_{d−1}.    (5.46)
It is straightforward to verify that the solution set above satisfies b̃^ρ_k(γ) = 0 for k = 0, ..., d−3, using the binomial theorem, which states (x+y)^n = ∑_{i=0}^{n} \binom{n}{i} x^{n−i} y^i and thus implies ∑_{i=0}^{n} \binom{n}{i} (−1)^i = ( 1 + (−1) )^n = 0.
Finally, the inequalities b̃^ρ_{d−2}(γ) ≥ 0 and b̃^ρ_{d−1}(γ) ≥ 0 provide bounds on the free variable γ_{d−1},

b̃^ρ_{d−1}(γ) = γ_{d−1} ≥ 0,
b̃^ρ_{d−2}(γ) = γ_{d−2} − (d−1) ρ γ_{d−1} = ρ^{−(d−2)} − ρ γ_{d−1} ≥ 0,

which together simplify to 0 ≤ γ_{d−1} ≤ ρ^{−(d−1)}.
It now remains to determine the parameters β_k. We use the definition a_k(λ) = β_k + γ_k αλ at λ = m, where a_k(m) is given in (5.42), which results in

β_k = \binom{d}{k} ρ^{d−k} − γ_k m/L = ( \binom{d}{k} − \binom{d−2}{k} ) ρ^{d−k} − \binom{d−2}{k−1} ρ^{2d−k−1} γ_{d−1},

where we have used α m = m/L = κ^{−1} = ρ^d and the γ_k's defined in (5.46).
5.6.2 Proof of Theorem 12
For ease of notation, we set σ^2 = 1 without loss of generality. We will first prove the relation

Ĵ(λ) = ∑_{i=1}^{d} (−1)^d / ( 2µ_i ∏_{k=1, k≠i}^{d} (µ_i^2 − µ_k^2) ),    (5.47)

where the µ_k are the roots of the characteristic polynomial F_λ(s) given in (5.9d).
Proof. As stated in Section 5.2,

Ĵ(λ) := Ĉ X̂(λ) Ĉ^T = X̂_{11}(λ),
where the covariance matrix X̂(λ) is the solution to the algebraic Lyapunov equation given in (5.19). It is well known that the analytical solution to the continuous-time algebraic Lyapunov equation is given by

X̂(λ) = ∫_0^∞ e^{Â(λ) τ} B̂ B̂^T e^{Â(λ)^T τ} dτ.    (5.48)
Using the Jordan decomposition Â(λ) = U V U^{−1}, where we drop the λ for ease of notation, we can write

X̂ = ∫_0^∞ U e^{V τ} U^{−1} B̂ B̂^T U^{−T} e^{V^T τ} U^T dτ.    (5.49)

If the eigenvalues µ_k of Â(λ) are distinct, then

V = diag(µ_1, µ_2, ..., µ_d),  U_{i,j} = µ_j^{−(d−i)},    (5.50)
in which case the (1,1) element of the covariance matrix can be written as

X̂_{1,1} = ∫_0^∞ ( ∑_{i=1}^{d} e^{µ_i τ} / ∏_{k=1, k≠i}^{d} (µ_i − µ_k) )^2 dτ
        = ∫_0^∞ ∑_{i=1}^{d} ∑_{j=1}^{d} e^{(µ_i + µ_j) τ} / ( ∏_{k≠i} (µ_i − µ_k) ∏_{k≠j} (µ_j − µ_k) ) dτ
        = ∑_{i=1}^{d} ∑_{j=1}^{d} −1 / ( (µ_i + µ_j) ∏_{k≠i} (µ_i − µ_k) ∏_{k≠j} (µ_j − µ_k) )
        = ∑_{i=1}^{d} (−1)^d / ( 2µ_i ∏_{k≠i} (µ_i^2 − µ_k^2) ).    (5.51)
Finally, we consider the possibility of repeated eigenvalues. Suppose µ_1 and µ_2 are repeated eigenvalues. In this case,

V = \begin{bmatrix} µ_1 & 1 & 0 & ⋯ & 0 \\ 0 & µ_1 & 0 & ⋯ & 0 \\ & & ⋱ & & \\ 0 & ⋯ & & 0 & µ_d \end{bmatrix},
U = \begin{bmatrix} µ_1^{−(d−1)} & −(d−1) µ_1^{−d} & µ_3^{−(d−1)} & ⋯ & µ_d^{−(d−1)} \\ µ_1^{−(d−2)} & −(d−2) µ_1^{−(d−1)} & µ_3^{−(d−2)} & ⋯ & µ_d^{−(d−2)} \\ ⋮ & ⋮ & ⋮ & & ⋮ \\ µ_1^{−1} & −µ_1^{−2} & µ_3^{−1} & ⋯ & µ_d^{−1} \\ 1 & 0 & 1 & ⋯ & 1 \end{bmatrix},    (5.52)
and the first two terms of the final summation in (5.51) become

lim_{µ_m → µ_n} [ (−1)^d / ( 2µ_m ∏_{k≠m} (µ_m^2 − µ_k^2) ) + (−1)^d / ( 2µ_n ∏_{k≠n} (µ_n^2 − µ_k^2) ) ]
  = (−1)^d ∑_{k=1}^{d−1} (−1)^k (2k−1) µ_n^{2(k−1)} ∑_{l=1}^{\binom{d−2}{d−k−1}} ( g[d,k;l] )^2 / ( 4µ_n^3 ∏_{k≠n} (µ_n^2 − µ_k^2)^2 ),    (5.53)

where g[d,k;l] is the lth element of the list of all possible products consisting of d−k−1 elements chosen from the set {µ_1, ..., µ_d} \ {µ_1, µ_2}. For example, g[5,2] = {µ_3µ_4, µ_3µ_5, µ_4µ_5}. Using equations (5.52) and (5.53), it is straightforward to verify that if µ_1 = µ_2,

X̂_{1,1} = lim_{µ_1 → µ_2} ∑_{i=1}^{d} (−1)^d / ( 2µ_i ∏_{k≠i} (µ_i^2 − µ_k^2) ).    (5.54)

Thus (5.47) holds for any distribution of the eigenvalues µ_k.
We next prove the relation

    \hat{J}(\lambda) = \frac{\sigma^2}{2\, a_0(\lambda)\, r(\lambda)}.
Proof. We provide a proof for the case where d is even. The case of odd d can be proven in a
similar way and is omitted for brevity. Recall Xˆ(λ) solves
    \hat{A}(\lambda)\hat{X}(\lambda) + \hat{X}(\lambda)\hat{A}(\lambda)^T = -\sigma^2 \hat{B}\hat{B}^T.
We define the symmetric d × d matrix

    \hat{Z}(\lambda) := \hat{A}(\lambda)\hat{X}(\lambda) + \hat{X}(\lambda)\hat{A}(\lambda)^T

whose entries are given by

    z_{i,j} = \begin{cases} x_{i+1,j} + x_{j+1,i}, & i \neq d \neq j \\ x_{i+1,d} - \sum_{k=1}^d a_{k-1}\, x_{i,k}, & i \neq d = j \\ -2 \sum_{k=1}^d a_{k-1}\, x_{k,d}, & i = d = j. \end{cases}
Here, we have dropped the (λ) indicators for ease of reading, and without loss of generality, set
σ = 1.
Given the structure of B̂B̂ᵀ, whose entries are all zero except the {d,d} entry, we first note that all
diagonal terms of Ẑ except the dth must be zero, yielding

    x_{i,i+1} = 0, \quad i = 1, \ldots, d-1.    (5.55a)

Due to our definitions of z_{i,j} for the 1-1 block of Ẑ, (5.55a) forces additional zero constraints. In
particular, for j = i+2,

    0 = z_{i,j} = x_{i+1,j} + x_{j+1,i} = x_{i+1,i+2} + x_{i+3,i}.
Together with (5.55a), this yields x_{i,i+3} = 0 for i = 1,...,d−1. Repeating this procedure for
j = i+2m yields

    x_{i,\,i+(2m-1)} = 0, \quad \text{for } m, i \geq 1.    (5.55b)
The z_{i,d} can now be written as linear functions of the x_{i,j},

    z_{i,d} = \begin{cases} x_{i+1,d} - \sum_{k=0}^{d/2-1} a_{2k}\, x_{i,2k+1}, & i \text{ odd} \\ -\sum_{k=0}^{d/2-1} a_{2k+1}\, x_{i,2(k+1)}, & i \text{ even} \\ -2 \sum_{k=0}^{d/2-1} a_{2k+1}\, x_{2(k+1),d}, & i = d. \end{cases}
We observe that, for i < d, the coefficients of the z_{i,d} entries replicate the first and second rows

    a_e := [a_0, a_2, \cdots, a_d], \qquad a_o := [a_1, a_3, \cdots, a_{d-1}]    (5.56)

of the RH table associated with the characteristic polynomial F_λ(s) in (5.9d), containing the even
and odd coefficients a_k.
After the considerations in (5.55), the remaining terms in the 1-1 block of Ẑ must also equal
zero, requiring

    x_{i,\,1+2m} = -x_{i+1,\,2m}, \quad \text{for } m, i \geq 1.    (5.57)

This allows us to reduce the number of unknowns x_{i,j} to d. To see this, note that as X̂ is
symmetric, we begin with d(d+1)/2 variables x_{i,j}. The eliminations in (5.55b) set d²/4 variables
to zero and the eliminations in (5.57) fix d(d−2)/4 variables in terms of others, leaving d free
variables.
By iterating across the matrix in a column-wise order, we can denote the unknown variables by
x̄ = [x̂_1, x̂_2, ..., x̂_d]ᵀ such that x̂_1 := x_{1,1}, x̂_d := x_{d,d}, and

    z_{i,d} = \begin{cases} \sum_{k=0}^{d/2} a_{2k}\, \hat{x}_{(i+1)/2+k}, & i \text{ odd} \\ \sum_{k=0}^{d/2-1} a_{2k+1}\, \hat{x}_{i/2+1+k}, & i \text{ even.} \end{cases}
Combining the above expressions with z_{d,d} = −1 and z_{1,d} = ... = z_{d−1,d} = 0 brings us to

    \bar{a}_i^T \bar{x} = y_i    (5.58a)

where we let y_1 := −1/2 and y_i := 0 for i = 2,...,d, and

    \bar{a}_i^T = \begin{cases} \big[\, 0_{(d-i+1)/2} \;\; a_o \;\; 0_{(i-1)/2} \,\big], & i \text{ odd} \\ \big[\, 0_{(d-i)/2} \;\; a_e \;\; 0_{(i-2)/2} \,\big], & i \text{ even.} \end{cases}    (5.58b)
Here, the zero vector 0_k ∈ R^k, and a_o and a_e are given by (5.56). Solving the algebraic Lyapunov
equation for the entries of X̂ is thus reduced to a linear system of equations, with d equations
ā_iᵀ x̄ = y_i and d unknowns x̂_i. We now use the same technique of polynomial quotients and remainders that
is used to derive the RH coefficients to recursively generate linear equations of the form (5.58a)
that have fewer non-zero coefficients and ultimately obtain the value of

    \hat{J}(\lambda) = \hat{C}\hat{X}(\lambda)\hat{C}^T = \hat{x}_1.    (5.59)
Based on the coefficients given in the third row of the RH table as seen in [82], we are motivated
to define

    \bar{b}_i := \bar{a}_{2i} - \frac{a_d}{a_{d-1}}\, \bar{a}_{2i-1}, \quad i = 1, \ldots, d/2.
Note that the vectors b̄_i are of the same structure as those in (5.58b) except they have d/2 non-zero
entries that constitute the 3rd row of the RH table. In addition, combining (5.58b) and the definition
of b̄_i yields

    \bar{b}_i^T \bar{x} = \begin{cases} 0, & i = 1, \ldots, d/2 - 1 \\ (-1)^d\, a_0 / (2 a_1), & i = d/2. \end{cases}    (5.60)
We continue this procedure to recover all rows of the RH array. To generalize to any system of size
d, let

    q^k := \frac{\bar{f}_1^{k-2}(d-k+4)}{\bar{f}_1^{k-1}(d-k+3)}, \qquad \bar{f}_i^k := \bar{f}_{i+1}^{k-2} - q^k\, \bar{f}_i^{k-1}    (5.61a)

for k = 4, ..., d+1, initialized with

    \bar{f}_i^1 = \bar{a}_{2i}, \qquad \bar{f}_i^2 = \bar{a}_{2i-1}, \qquad \bar{f}_i^3 = \bar{b}_i    (5.61b)
where in f̄ᵏᵢ(·), the superscript k denotes the recursion index, the subscript i ranges from 1 to
⌈(d−k+2)/2⌉, and the argument denotes the entry or column number. It is easy to verify
that the vectors f̄ᵏᵢ are of the same structure as those in (5.58b) except they have ⌈(d−k+2)/2⌉
non-zero entries that constitute the kth row of the RH table. The subscript i determines the position
of the non-zero entries in the vector. Note that we perform one additional iteration compared to the
rows of the RH table, so the vectors f̄ᵏᵢ only directly correspond to rows of the RH table for k = 1,...,d.
Using the notation for the entries of the RH table introduced in the proof of Theorem 11, let f_{i,j}
denote the jth entry of the ith row of the table, with

    f_{1,j} := a_{d-2(j-1)}, \qquad f_{2,j} := a_{d+1-2j}, \qquad f_{i,j} := f_{i-2,j+1} - f_{i-2,1}\, f_{i-1,j+1} / f_{i-1,1}.    (5.62)
Then at iteration k the vector f̄ᵏᵢ and the quotient qᵏ satisfy

    \bar{f}_i^k = \begin{cases} \big[\, 0_{d/2-(k+1)/2-i} \;\; f_{k,\lceil(d-k+2)/2\rceil} \; \cdots \; f_{k,1} \;\; 0_{k-i-1} \,\big], & k \text{ odd} \\ \big[\, 0_{d/2-k/2-i} \;\; f_{k,\lceil(d-k+2)/2\rceil} \; \cdots \; f_{k,1} \;\; 0_{k-i-1} \,\big], & k \text{ even} \end{cases}
    \qquad q^k = f_{k-2,1} / f_{k-1,1}    (5.63)

for k = 3,...,d, and i = 1,...,⌈(d−k+2)/2⌉. From this relationship it is clear that qᵏ must be
positive, and according to Claim 3, the first non-zero entry of f̄ᵏᵢ is equal to a_0 for odd k.
It is now straightforward to show that

    (\bar{f}_i^k)^T \bar{x} = \begin{cases} 0, & i \neq 1 \\ \dfrac{(-1)^k\, a_d}{2\, \bar{f}_1^{k-1}(d-k+3)}, & i = 1 \end{cases}    (5.64)
using induction. Suppose the assertion is true for k = 4, ..., k̂−1. Then using the relation

    (\bar{f}_i^{\hat{k}})^T \bar{x} = (\bar{f}_{i+1}^{\hat{k}-2})^T \bar{x} - \frac{\bar{f}_1^{\hat{k}-2}(d-\hat{k}+4)}{\bar{f}_1^{\hat{k}-1}(d-\hat{k}+3)}\, (\bar{f}_i^{\hat{k}-1})^T \bar{x}
it is immediate that the assertion must hold for k = k̂ at i ≠ 1, as both (f̄^{k̂−1}_i)ᵀ x̄ and (f̄^{k̂−2}_{i+1})ᵀ x̄ must
equal zero. At i = 1,
    (\bar{f}_1^{\hat{k}})^T \bar{x} = -\frac{\bar{f}_1^{\hat{k}-2}(d-\hat{k}+4)}{\bar{f}_1^{\hat{k}-1}(d-\hat{k}+3)} \cdot \frac{(-1)^{\hat{k}-1}\, a_d}{2\, \bar{f}_1^{\hat{k}-2}(d-\hat{k}+4)} = \frac{(-1)^{\hat{k}}\, a_d}{2\, \bar{f}_1^{\hat{k}-1}(d-\hat{k}+3)}

and thus the assertion holds for k̂. Given (5.60), it is immediate that the assertion holds at k = 4,
and thus by induction the assertion holds for all k = 4,...,d+1.
At iteration k = d+1, we are left with a single vector f̄^{d+1}_1 with a single non-zero element
f̄^{d+1}_1(1) in the first position. This allows us to solve for x̂_1 and obtain

    (\bar{f}_1^{d+1})^T \bar{x} = \bar{f}_1^{d+1}(1)\, \hat{x}_1 = \frac{a_d}{2\, \bar{f}_1^{d}(2)}    (5.65)
which yields x̂_1 = a_d / (2 f̄ᵈ₁(2) f̄^{d+1}₁(1)). By construction, f̄ᵈ₁(2) and f̄^{d+1}₁(1) are the last two terms
in the first column, with f̄ᵈ₁(2) =: r being the first (and only) entry in the dth row of the
RH table. It is now easy to verify that the last term f̄^{d+1}₁(1) is given by a_0. Combining this with
a_d = 1 and (5.59) completes the proof.
5.6.3 Proof of Theorem 13
Proof. By Theorem 12, Ĵ(λ) = σ²/(2 r(λ) a_0(λ)). According to the definition given in (5.61a),
we can write

    r(\lambda) := \bar{f}_1^{d}(2) = \bar{f}_2^{d-2}(2) - q^d\, \bar{f}_1^{d-1}(2) < \bar{f}_2^{d-2}(2)
leading to the general chain of inequalities

    r(\lambda) := \bar{f}_1^{d}(2) \leq \bar{f}_2^{d-2}(2) \leq \cdots \leq \bar{f}_{d/2}^{2}(2) = a_1(\lambda)

which follows from positivity of qᵏ and positivity of the term f̄ᵏᵢ(2) = a_0(λ) at i = ⌊(d−k+2)/2⌋.
Thus, we obtain that

    \hat{J}(\lambda) \geq \sigma^2 / \big( 2\, a_1(\lambda)\, a_0(\lambda) \big).    (5.66a)
According to (5.9d), it is easy to verify that

    a_0(\lambda) = \prod_{k=1}^d (-\mu_k), \qquad a_1(\lambda) = a_0(\lambda) \sum_{k=1}^d \frac{-1}{\mu_k}.    (5.66b)

In addition, from ℜ(−µk) ≥ ρ, it follows that

    \sum_{k=1}^d \frac{-1}{\mu_k} \leq \frac{d}{\rho}.    (5.66c)
Combining αλ = a0(λ) with (5.66a), (5.66b), and (5.66c) completes the proof.
5.6.4 Proof of Theorem 14
Proof. Based on Theorem 12, for a given λ ∈ [m, L] the modal contribution to variance is given
by

    \hat{J}(\lambda) = \sum_{i=1}^d \frac{(-1)^d}{2 \mu_i \prod_{k \neq i} (\mu_i^2 - \mu_k^2)},    (5.67)

where µ1,...,µd are the roots of the characteristic equation F_λ(s) = ∑_{k=0}^{d} a_k(λ) s^k. We will now show
that for any given λ and convergence rate ρ, the variance is minimized by placing d−1 eigenvalues
at µk = −ρ, and one eigenvalue at µd = −a_0 ρ^{−(d−1)}.
Based on equations (5.9d) and (5.9c), for a given λ, the coefficient

    a_0(\lambda) = \alpha \lambda = \prod_{k=1}^d (-\mu_k)

is fixed. Given that the function Ĵ(λ) is symmetric in the µk's, we can assume without loss of
generality that

    -\rho \geq \mu_1 \geq \cdots \geq \mu_{d-1} \geq \mu_d = \frac{-a_0}{\prod_{i=1}^{d-1} (-\mu_i)}.    (5.68)
Considering that µd is a function of the other eigenvalues, the derivative of Ĵ(λ) with respect to µj is
given by

    \frac{\mathrm{d}\hat{J}(\lambda)}{\mathrm{d}\mu_j} = (-1)^d \Bigg( \frac{-1}{2\mu_j^2 \prod_{k \neq j} (\mu_j^2 - \mu_k^2)} + \frac{1}{2\mu_j \mu_d \prod_{k \neq d} (\mu_d^2 - \mu_k^2)} - \frac{1}{\prod_{k \neq j} (\mu_j^2 - \mu_k^2)} \sum_{k \neq j} \frac{1}{\mu_j^2 - \mu_k^2}
    + \sum_{k \neq j} \frac{\mu_j}{\mu_k (\mu_k^2 - \mu_j^2) \prod_{m \neq k} (\mu_k^2 - \mu_m^2)} + \sum_{k \neq d} \frac{-\mu_d^2}{\mu_j \mu_k (\mu_k^2 - \mu_d^2) \prod_{m \neq k} (\mu_k^2 - \mu_m^2)} + \frac{\mu_d}{\mu_j \prod_{k \neq d} (\mu_d^2 - \mu_k^2)} \sum_{k \neq d} \frac{1}{\mu_d^2 - \mu_k^2} \Bigg).    (5.69)
Given the assumption in (5.68), it is straightforward if tedious to verify that the derivative in (5.69)
is always negative as long as the order of the eigenvalues remains unchanged, meaning the variance
is minimized by making eigenvalues µ1 through µ_{d−1} as close to the imaginary axis as possible.
Given that the convergence rate requires ℜ(µk) ≤ −ρ, the variance is minimized at

    \mu_k = -\rho, \quad k = 1, \ldots, d-1, \qquad \mu_d = -a_0\, \rho^{-(d-1)}.    (5.70)
Next we show that the above configuration of eigenvalues results in the coefficients presented
in Theorem 14. If the roots of the characteristic equation F_λ(s) defined in (5.9d) are given by µk,
k = 1,...,d, then the roots of the shifted characteristic equation F^ρ_λ(s) defined in (5.14) are given
by νk = µk + ρ. Thus for the eigenvalue configuration described in (5.70), F^ρ_λ(s) has roots at
νk = 0 for k = 1,...,d−1 and νd = ρ − a_0 ρ^{−(d−1)}, meaning the shifted characteristic polynomial
associated with a given λ is given by

    F_\lambda^\rho(s) = s^{d-1} \big( s - \rho + \alpha \lambda\, \rho^{-(d-1)} \big).    (5.71)
Based on the definition of F^ρ_λ(s) given in (5.15) in terms of the coefficients ã^ρ_k(λ), we can conclude
that for a given λ, variance is minimized when

    \tilde{a}_k^\rho(\lambda) = 0, \quad k = 0, \ldots, d-2, \qquad \tilde{a}_{d-1}^\rho(\lambda) = -\rho + \alpha \lambda\, \rho^{-(d-1)}.    (5.72)
Based on the definition of ã^ρ_k(λ) given in (5.15), where the ã^ρ_k(λ)'s are linear functions of the
coefficients a_k(λ) of the original characteristic equation F_λ(s), (5.72) presents a linear system with d
equality constraints and d unknowns a_k(λ). Solving (5.72) for the coefficients a_k(λ) yields

    a_k(\lambda) = \binom{d-1}{k} a_0(\lambda)\, \rho^{-k} + \binom{d-1}{d-k} \rho^{d-k}.    (5.73)
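A quick way to check (5.73) (an illustration, not part of the proof) is to verify that these coefficients reproduce the intended eigenvalue placement from (5.70): d−1 roots at −ρ and one root at −a_0 ρ^{−(d−1)}. Comparing coefficient vectors avoids the numerical ill-conditioning of root-finding at a repeated root:

```python
import numpy as np
from math import comb

def coeffs_from_573(d, rho, a0):
    """a_k from (5.73): C(d-1,k) a0 rho^{-k} + C(d-1,d-k) rho^{d-k}, k = 0,...,d-1."""
    return [comb(d - 1, k) * a0 * rho ** (-k) + comb(d - 1, d - k) * rho ** (d - k)
            for k in range(d)]

d, rho, a0 = 4, 0.8, 0.2                          # illustrative values
a = coeffs_from_573(d, rho, a0)
monic = np.concatenate(([1.0], a[::-1]))          # s^d + a_{d-1} s^{d-1} + ... + a_0
target_roots = np.append(np.full(d - 1, -rho), -a0 * rho ** (-(d - 1)))
print(np.allclose(monic, np.poly(target_roots)))  # True
```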
Finally, it remains to determine the parameters βk, γk which achieve the coefficients above. The slope
parameters γk determine how a_k(λ) varies with λ, and so do not depend on the intercept a_k(m).
Using the definition a_k(λ) = βk + γk αλ along with λ = m + ε, we can write

    a_k(\lambda) = a_k(m) + \alpha \varepsilon\, \gamma_k = a_k(m) + \binom{d-1}{k} \rho^{-k}\, \alpha \varepsilon    (5.74)

which immediately results in γk = \binom{d-1}{k} \rho^{-k}. The intercept parameters βk determine the value
of a_k(λ) at λ = m. Using the relation a_0(m) = αm = m/L = κ^{−1} and rearranging (5.9c) to use
βk = a_k(m) − γk κ^{−1}, we can determine βk = \binom{d-1}{k-1} \rho^{d-k}, which completes the proof.
5.6.5 Proof of Theorem 15
Proof. We begin by establishing that the modal contribution to steady-state variance Ĵ(λ) =
σ²/(2a_0(λ)r(λ)) given in Theorem 12, evaluated at the variance-minimizing parameters given in
Theorem 14, is decreasing in λ.
The claim follows directly from the proof of Theorem 14, where we established that for a given λ, Ĵ(λ)
is minimized when the roots of the characteristic equation F_λ(s) defined in (5.9d) are given by
µk = −ρ, k = 1,...,d−1, and µd = −a_0 ρ^{−(d−1)}. It is clear that as λ increases, µd increases in
magnitude, moving further away from the other eigenvalues at µk = −ρ. Based on the expression
for Ĵ(λ) given in (5.67), it is clear that Ĵ(λ) is decreasing as µd increases in magnitude and moves
away from all µk for k ≠ d, and thus for the specified parameters Ĵ(λ) is decreasing in λ.
Together with equation (5.18) we have
Jˆ(m)/ρ + (n−1)Jˆ(L)/ρ ≤ J/ρ ≤ (n−1)Jˆ(m)/ρ + Jˆ(L)/ρ (5.75)
Given that a_0(λ) = αλ, bounding Ĵ(λ) is equivalent to bounding r(λ), the only entry of the
last row of the RH table associated with the F_λ(s) given in (5.9d). Using the notation established
in (5.61a) in the proof of Theorem 13, r(λ) can be written as

    r(\lambda) := \bar{f}_1^{d}(2) = a_1(\lambda) - a_0(\lambda) \sum_{i=0}^{\lceil (d-4)/2 \rceil} q^{d-2i}.    (5.76)
Using the expression a_k(m) = \binom{d}{k} \rho^{d-k} to initialize the first two rows of the RH table, we can
determine analytical expressions for the remaining entries f_{k,i} of the RH table, as well as qᵏ, for
k = 3,...,d. When k is even we have

    f_{k,i} = \binom{d}{2i-3+k} \rho^{2i-3+k}\; 2^{(k-2)/2} \prod_{j=0}^{(k-4)/2} \frac{(i+j)(d+2j+2)}{(2i+k-1+2j)(d-2j-1)}, \qquad q^k = \frac{2k-5}{d\rho} \prod_{j=0}^{(k-4)/2} \frac{d^2-(2j)^2}{d^2-(2j+1)^2}    (5.77)
and when k is odd we have

    f_{k,i} = \binom{d}{2i-3+k} \rho^{2i-3+k}\; 2^{(k-1)/2} \prod_{j=0}^{(k-3)/2} \frac{(i+j)(d+2j+1)}{(2i+k-2+2j)(d-2j)}, \qquad q^k = \frac{2k-5}{d\rho} \prod_{j=0}^{(k-5)/2} \frac{d^2-(2j+1)^2}{d^2-(2j+2)^2}.    (5.78)
Note that for a negative upper limit J < 0, the product ∏_{j=0}^{J} is taken to be 1 (the empty product).
Using induction it is straightforward (though incredibly tedious) to verify that these expressions hold
for general k. We now derive a closed-form expression for
    \hat{q} := \sum_{i=0}^{\lceil (d-4)/2 \rceil} q^{d-2i}.    (5.79)
When d is even, we have

    \hat{q} = \sum_{i=2}^{d/2} q^{2i} = \sum_{i=2}^{d/2} \frac{4i-5}{d\rho} \prod_{j=0}^{i-2} \frac{d^2-(2j)^2}{d^2-(2j+1)^2} = \frac{d}{\rho} - \frac{2d-1}{\rho} \prod_{i=1}^{d-1} \frac{i}{i+1/2} = \frac{d}{\rho} - \frac{\sqrt{\pi}\, \Gamma(d)}{\rho\, \Gamma(d-1/2)}    (5.80)
and when d is odd, we have

    \hat{q} = \sum_{i=1}^{(d-1)/2} q^{2i+1} = \sum_{i=1}^{(d-1)/2} \frac{4i-3}{d\rho} \prod_{j=0}^{i-2} \frac{d^2-(2j+1)^2}{d^2-(2j+2)^2} = \frac{d}{\rho} - \frac{2d-2}{\rho} \prod_{i=1}^{d-2} \frac{i}{i+1/2} = \frac{d}{\rho} - \frac{\sqrt{\pi}\, \Gamma(d)}{\rho\, \Gamma(d-1/2)}.    (5.81)
Motivated by the approximation ∏_{i=1}^{d−1} i/(i+1/2) ≈ √π/(2√d), we introduce the bounds

    \frac{(2d-1)\sqrt{\pi}}{2\sqrt{d}} \;\leq\; d - \rho \hat{q} \;\leq\; \frac{2d\sqrt{\pi}}{2\sqrt{d+1/2}}.    (5.82)
Given that a_0(m) = ρ^d and a_1(m) = d ρ^{d−1}, we have

    2\rho\, a_0(m)\, r(m) = 2\rho\, a_0(m) \big( a_1(m) - a_0(m)\hat{q} \big) = 2\rho^{2d} (d - \hat{q}\rho).    (5.83)
Together, equations (5.82) and (5.83) result in

    \frac{\sqrt{d}}{\rho^{2d}\, 2d \sqrt{\pi}} \;\leq\; \hat{J}(m)/\rho \;\leq\; \frac{\sqrt{d}}{\rho^{2d}\, (2d-1)\sqrt{\pi}}.
The bounds presented above hold for parameters which minimize variance for any given convergence rate. At the optimal convergence rate ρ = κ^{−1/d}, we have

    \frac{\kappa^2 \sqrt{d}}{2d\sqrt{\pi}} \;\leq\; \hat{J}(m)/\rho \;\leq\; \frac{\kappa^2 \sqrt{d}}{(2d-1)\sqrt{\pi}}.    (5.84)
We can also use the almost exact approximation Γ(d)/Γ(d−1/2) = (2d−1)√(1/(4d−1)) to
write

    \hat{J}(m)/\rho = \frac{\kappa^2 \sqrt{4d-1}}{2(2d-1)\sqrt{\pi}}.
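The quality of this Gamma-ratio approximation is easy to check numerically (a quick sketch, not part of the proof):

```python
from math import gamma, sqrt

def ratio_exact(d):
    return gamma(d) / gamma(d - 0.5)

def ratio_approx(d):
    # (2d - 1) * sqrt(1 / (4d - 1)), the approximation used above
    return (2 * d - 1) * sqrt(1.0 / (4 * d - 1))

for d in (2, 3, 5, 10, 50):
    print(d, abs(ratio_approx(d) / ratio_exact(d) - 1) < 1e-2)  # True for every d
```

Already at d = 2 the relative error is below 0.5%, and it shrinks rapidly as d grows.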
At λ = L, the coefficients a_k(L) do not simplify as nicely, and close bounds on r(L) are difficult
to obtain. Instead we use

    r(L) \leq a_1(L) = (d-1)\rho^{-1} + \rho^{d-1}

which yields

    \frac{n-1}{2(d-1) + 2\rho^{d}} + \frac{\kappa^2 \sqrt{d}}{2d\sqrt{\pi}} \;\leq\; J/\rho \;\leq\; \frac{n\, \kappa^2 \sqrt{d}}{(2d-1)\sqrt{\pi}}.
Chapter 6
Performance of noisy three-step accelerated first-order
optimization algorithms for strongly convex quadratic problems
In this chapter we study the class of first-order algorithms in which the optimization variable is
updated using information from three previous iterations. While two-step momentum algorithms
such as the heavy-ball and Nesterov's accelerated methods may achieve the optimal convergence rate
with the appropriate choice of parameters, it is an open question whether the three-step momentum method
can offer advantages for problems in which exact gradients are not available. For strongly convex
quadratic problems, we identify algorithmic parameters which achieve the optimal convergence
rate and examine how additional momentum terms affect the trade-offs between acceleration and
noise amplification. We show that for any stabilizing parameters, the product of variance and
settling time is lower bounded by a constant factor times the square of the condition number,
as is true for the standard two-step momentum algorithm. Our results suggest that introducing
additional momentum terms does not provide improvement in variance amplification relative to
standard accelerated algorithms.
6.1 Introduction
Previous work [21–28] has established that accelerated methods are more sensitive to noise than
gradient descent, while providing superior rates of convergence. In this chapter we explore how
adding additional history terms affects the trade-off between noise amplification and convergence
rate. In practice, accelerated algorithms are implemented in discrete time, according to the update
equation

    x^{t+d} = \sum_{k=0}^{d-1} \beta_k x^{t+k} \;-\; \alpha \nabla f\!\left( \sum_{k=0}^{d-1} \gamma_k x^{t+k} \right) + w^t,    (6.1)

where d is the number of previous iterations being utilized, t is the iteration index, α is the stepsize, the βk
and γk are algorithmic parameters, and w^t is a white noise with

    \mathbb{E}[w^t] = 0, \qquad \mathbb{E}\big[ w^t (w^\tau)^T \big] = \sigma^2 I\, \delta(t - \tau).    (6.2)

First-order optimality conditions impose the following constraints on the parameters βk and γk:

    \sum_{k=0}^{d-1} \beta_k = 1, \qquad \sum_{k=0}^{d-1} \gamma_k = 1.    (6.3)
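For a scalar quadratic f(x) = λx²/2, the update (6.1) is a linear recursion, and the companion-form state-space description of it underlies the analysis that follows. A small simulation (with hypothetical parameter values chosen only to satisfy (6.3)) confirming that the two descriptions coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam, alpha = 3, 2.0, 0.1
beta = np.array([0.1, -0.4, 1.3])     # hypothetical, satisfies sum(beta) = 1 per (6.3)
gam = np.array([0.0, -0.2, 1.2])      # hypothetical, satisfies sum(gam) = 1 per (6.3)
a = alpha * gam * lam - beta          # a_k(lambda), cf. (6.5b)

T = 20
w = rng.standard_normal(T)

# direct recursion (6.1) for f(x) = lam * x^2 / 2, so grad f(x) = lam * x
x = np.zeros(T + d)
for t in range(T):
    x[t + d] = beta @ x[t:t + d] - alpha * lam * (gam @ x[t:t + d]) + w[t]

# equivalent companion-form state-space recursion psi^{t+1} = A psi^t + B w^t
A = np.zeros((d, d)); A[:-1, 1:] = np.eye(d - 1); A[-1, :] = -a
B = np.zeros(d); B[-1] = 1.0
psi, traj = np.zeros(d), []
for t in range(T):
    psi = A @ psi + B * w[t]
    traj.append(psi[0])               # first state component is x^{t+1}

print(np.allclose(x[1:T + 1], traj))  # True
```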
For strongly convex quadratic problems, we analyze the effect of the additional momentum
term on sensitivity to noise by providing upper and lower bounds on variance amplification in
terms of the convergence rate. We first characterize the family of parameters which achieve the optimal
convergence rate, which is a generalization of the heavy-ball algorithm. Next we determine upper
and lower bounds on the modal contribution to variance associated with eigenvalues λ of the
Hessian Q for three-step algorithms.
Our results show that moving from d = 2 to d = 3 stretches the distance between the upper and
lower bounds on noise amplification. For optimal parameters, the maximum modal contribution to
variance increases as the third-order parameters increase. In addition, the product of variance and
settling time is lower bounded by a factor of κ², as is the case in the two-step algorithm. Ultimately
we conclude that the additional momentum term offers no advantage in convergence rate or steady-state variance.
6.2 Three-step accelerated algorithms
In this section, we examine the behavior of the three-step discrete-time accelerated algorithm,
where the update equation of the estimate x^t is given by (6.1) with d = 3,

    x^{t+3} = \beta_2 x^{t+2} + \beta_1 x^{t+1} + \beta_0 x^{t} - \alpha \nabla f\!\left( \gamma_2 x^{t+2} + \gamma_1 x^{t+1} + \gamma_0 x^{t} \right) + w^t.    (6.4)

For the strongly convex quadratic objective function given in (2.7a), the matrix Â(λ) in (2.10) takes
the form

    \hat{A}(\lambda_i) = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ -a_0(\lambda_i) & -a_1(\lambda_i) & -a_2(\lambda_i) \end{bmatrix}    (6.5a)

    a_k(\lambda_i) = \alpha \gamma_k \lambda_i - \beta_k, \quad k \in \{0, 1, 2\}    (6.5b)

with characteristic polynomial

    F_\lambda(z) = z^3 + a_2(\lambda) z^2 + a_1(\lambda) z + a_0(\lambda).    (6.6)
Since the two-step momentum algorithm is obtained by setting a_0(λ) = 0 for all λ, we focus
our examination on the influence of a_0(λ) on convergence rate and noise amplification.
In this section we present our main results concerning the effect of an additional momentum
term on the trade-offs between convergence rate, steady-state variance amplification, and condition
number for the class of three-step first-order accelerated algorithms described in (2.2). We first
present general bounds on the minimum and maximum modal contributions to noise amplification

    \hat{J}_{\min} := \min_\lambda \hat{J}(\lambda), \qquad \hat{J}_{\max} := \max_\lambda \hat{J}(\lambda)

for any set of stabilizing parameters

    \theta := \{\alpha, \beta_0, \beta_1, \beta_2, \gamma_0, \gamma_1, \gamma_2\}.    (6.7)

Next we characterize the set of parameters which achieve the optimal rate of convergence in
Theorem 16 and provide bounds for Ĵmin and Ĵmax for these parameters in Proposition 10.
We also consider a class of parameters θ̂, which includes Polyak's heavy-ball method and
gradient descent, and which for a fixed convergence rate allows a trade-off between noise amplification
and condition number. We show that the minimum of Ĵmax over parameters of this class increases
as |a_0| increases. We also present a specific example of these parameters for which allowing a_0 > 0
for fixed ρ improves the trade-off between Ĵmax and condition number.
6.2.1 Defining the ρ-convergence region
We first describe the stability region as defined by the parameters a_k introduced in (6.5b). System (6.5)
is stable when the roots of the characteristic equation (6.6), equivalently the eigenvalues of Â(λ),
have magnitude less than one. The Jury stability criterion [83] applied to the characteristic polynomial (6.6) imposes the positivity constraints

    l := 1 - a_0 + a_1 - a_2 \geq 0
    w := 1 + a_0 + a_1 + a_2 \geq 0
    h := 1 - a_0^2 - a_1 + a_0 a_2 \geq 0
    g := 1 - a_0^2 \geq 0    (6.8)

to guarantee stability. The resulting three-dimensional stability region is shown in Figure 6.1a. To
determine constraints which ensure convergence with rate ρ, we repeat the process for the scaled
characteristic equation

    F_\lambda(\rho z) = \rho^3 z^3 + a_2(\lambda) \rho^2 z^2 + a_1(\lambda) \rho z + a_0(\lambda),    (6.9)

which yields the positivity constraints
Figure 6.1: Stability and ρ-convergence regions for the three-step momentum algorithm. Figure
(a) shows the 3-D stability region in a_0, a_1, and a_2. Different shades of blue correspond to the level
sets of a_0. Figure (b) shows a level set of the stability region and the ρ-convergence region for
ρ = 0.7, at a_0 = 0.1, as defined by the positivity constraints of (6.8) and (6.10). At a_0 = ±ρ³, the
convergence region collapses to a single line, and at a_0 = 0, ∆ρ(0) recovers the 2-D case.
    \rho^3 - a_2 \rho^2 + a_1 \rho - a_0 \geq 0
    \rho^3 + a_2 \rho^2 + a_1 \rho + a_0 \geq 0
    \rho^6 - a_1 \rho^4 + a_0 a_2 \rho^2 - a_0^2 \geq 0
    \rho^6 - a_0^2 \geq 0    (6.10)

for convergence with rate ρ.
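The constraints (6.8) and (6.10) are easy to evaluate directly. The following sketch (illustrative values, not from the thesis) computes the margins of (6.10), which reduce to the stability conditions (6.8) at ρ = 1, and checks them against the root magnitudes of two example cubics:

```python
import numpy as np

def jury_margins(a0, a1, a2, rho=1.0):
    """Positivity margins (6.10) for F(z) = z^3 + a2 z^2 + a1 z + a0;
    rho = 1 recovers the stability conditions (6.8)."""
    return (rho**3 - a2 * rho**2 + a1 * rho - a0,
            rho**3 + a2 * rho**2 + a1 * rho + a0,
            rho**6 - a1 * rho**4 + a0 * a2 * rho**2 - a0**2,
            rho**6 - a0**2)

def max_root(a0, a1, a2):
    return max(abs(np.roots([1.0, a2, a1, a0])))

# roots well inside the unit circle -> all margins positive
_, a2, a1, a0 = np.poly([0.5, -0.3, 0.2])
print(all(m > 0 for m in jury_margins(a0, a1, a2)), max_root(a0, a1, a2) < 1)   # True True

# a root outside the unit circle -> some margin fails
_, a2, a1, a0 = np.poly([1.5, -0.3, 0.2])
print(all(m > 0 for m in jury_margins(a0, a1, a2)), max_root(a0, a1, a2) < 1)   # False False
```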
While the ρ-convergence region is non-convex, the non-convexity appears exclusively in a_0,
as seen in Figure 6.1a. If we examine the stability region for a fixed a_0, the region is a convex
triangle defined by the first three constraints given in (6.10), denoted ∆ρ(a_0), as seen in Figure 6.1b.
As expected, setting a_0 = 0 recovers the 2-D stability region described in [35].
We have labeled the vertices of the ρ-convergence region Xρ, Yρ, and Zρ as shown in Figure 6.1b. For a fixed a_0, the (a_1, a_2) coordinates of these vertices are given by

    Z_\rho: \ a_2 = -a_0 \rho^{-2}, \quad a_1 = -\rho^2
    Y_\rho: \ a_2 = 2\rho + a_0 \rho^{-2}, \quad a_1 = \rho^2 + 2 a_0 \rho^{-1}
    X_\rho: \ a_2 = -2\rho + a_0 \rho^{-2}, \quad a_1 = \rho^2 - 2 a_0 \rho^{-1}.    (6.11)
Similarly, the edges of the stability and ρ-convergence regions occur along the lines where the
constraints in (6.8) and (6.10), respectively, are exactly zero.
We observe that the values w, h, and l as defined in (6.8) have intuitive geometric interpretations,
as seen in Figure 6.1b. For a fixed point p = (a_0, a_1, a_2) within the ρ-convergence region, l gives
the distance from p to the ZY edge of ∆1(a_0), w gives the distance to the XZ edge of ∆1(a_0),
and h gives the distance to the XY edge of ∆1(a_0). Furthermore, the value g can be split into the
product of (1−a_0) and (1+a_0), each of which gives the distance along the a_0 axis to the edge of
the stability region, as seen in Figure 6.1a.
We will now examine how the addition of the third term a_0 affects the behavior of the algorithm.
6.2.2 Deriving parameters which optimize convergence rate
First, we characterize the set of parameters which achieve the optimal rate of convergence.
Theorem 16. For a strongly convex quadratic objective function f ∈ Q^L_m with condition number κ,
the optimal rate of convergence of system (6.5) of order d = 3, given by ρ = 1 − 2/(√κ + 1), is only
achieved by the set of parameters θ⋆ρ(a_0) consisting of

    \beta_0 = -a_0, \qquad \gamma_0 = 0
    \beta_1 = \frac{-\rho^4 + a_0 \rho^2 + a_0}{\rho^2}, \qquad \gamma_1 = \frac{a_0}{\rho^2 + a_0}
    \beta_2 = \frac{\rho^4 + \rho^2 - a_0}{\rho^2}, \qquad \gamma_2 = \frac{\rho^2}{\rho^2 + a_0}
    \alpha = \frac{4\rho + 4 a_0 \rho^{-1}}{L - m}, \qquad a_0 \in [-\rho^3, \rho^3].    (6.12)

The parameters are unique for a given a_0, where the feasible range a_0 ∈ [−ρ³, ρ³] is a stability
constraint imposed by the last positivity constraint of (6.10). For a_0 = 0, these parameters reduce
to Polyak's heavy-ball method. For a_0 ≠ 0, the parameters generalize Polyak's heavy-ball method,
with the coefficients a_1(λ) and a_2(λ) lying along the XρYρ edge of the ρ-stability region ∆ρ(a_0)
seen in Figure 6.1b. We conclude that any set of parameters θ with γ_0 ≠ 0 which allows a_0(λ) to
vary with λ must achieve a rate of convergence strictly inferior to the optimal. The derivation of
the optimal parameters is given in Section 6.5.2.
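The claim of Theorem 16 can be checked numerically: for every λ ∈ [m, L], the spectral radius of Â(λ) should equal ρ = 1 − 2/(√κ + 1). A sketch with an arbitrary feasible a_0 (the values of m, L, and a_0 are illustrative):

```python
import numpy as np

m, L = 1.0, 10.0
kappa = L / m
rho = 1 - 2 / (np.sqrt(kappa) + 1)
a0 = 0.5 * rho ** 3                   # any a0 in the feasible range [-rho^3, rho^3]

# theta_rho^*(a0) from (6.12)
beta = np.array([-a0,
                 (-rho**4 + a0 * rho**2 + a0) / rho**2,
                 (rho**4 + rho**2 - a0) / rho**2])
gam = np.array([0.0, a0 / (rho**2 + a0), rho**2 / (rho**2 + a0)])
alpha = (4 * rho + 4 * a0 / rho) / (L - m)

def spectral_radius(lam):
    a = alpha * gam * lam - beta      # a_k(lambda), cf. (6.5b)
    A = np.array([[0, 1, 0], [0, 0, 1], [-a[0], -a[1], -a[2]]])
    return max(abs(np.linalg.eigvals(A)))

print(np.isclose(beta.sum(), 1), np.isclose(gam.sum(), 1))   # True True, per (6.3)
print(all(abs(spectral_radius(lam) - rho) < 1e-6
          for lam in np.linspace(m, L, 25)))                 # True
```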
We now consider the effect of additional momentum on the variance.
6.2.3 Bounding the variance
For d = 3, it is possible to directly solve the algebraic Lyapunov equation given in (2.22) for the
covariance matrix P̂(λ), and extract the {1,1} term to obtain

    \hat{J}(\lambda) = \frac{(1 + a_0(\lambda))\, l(\lambda) + (1 - a_0(\lambda))\, w(\lambda)}{2\, w(\lambda)\, l(\lambda)\, h(\lambda)}    (6.13)

where w, h, and l defined in (6.8) give the distances to the edges of the ρ-convergence region as
seen in Figure 6.1b.
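The closed form (6.13) can be validated against a direct solve of the discrete algebraic Lyapunov equation P = ÂPÂᵀ + B̂B̂ᵀ with σ = 1 (an independent numerical check, not part of the thesis):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def J_closed_form(a0, a1, a2):
    """Modal variance (6.13) built from the margins l, w, h of (6.8)."""
    l = 1 - a0 + a1 - a2
    w = 1 + a0 + a1 + a2
    h = 1 - a0 ** 2 - a1 + a0 * a2
    return ((1 + a0) * l + (1 - a0) * w) / (2 * w * l * h)

_, a2, a1, a0 = np.poly([0.5, -0.3, 0.2])     # a Schur-stable cubic
A = np.array([[0, 1, 0], [0, 0, 1], [-a0, -a1, -a2]])
B = np.array([[0.0], [0.0], [1.0]])
P = solve_discrete_lyapunov(A, B @ B.T)       # solves P = A P A^T + B B^T
print(np.isclose(P[0, 0], J_closed_form(a0, a1, a2)))  # True
```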
We now present bounds on the modal contribution to noise amplification for any set of stabilizing parameters θ.

Proposition 9. For a strongly convex quadratic objective function f ∈ Q^L_m, with parameters θ which
achieve rate of linear convergence ρ, the modal contribution Ĵ(λ) to the steady-state variance
amplification of system (2.10b) is bounded by

    \frac{16\rho^2}{(1+\rho)^5} \;\leq\; \hat{J}(\lambda) \;\leq\; \frac{1 + 4\rho^2 + \rho^4}{(1-\rho^2)^5},    (6.14)

which in the 2-D case where a_0 = 0 can be improved to

    1 \;\leq\; \hat{J}(\lambda) \;\leq\; \frac{1+\rho^2}{(1-\rho^2)^3}.    (6.15)
Introducing the third momentum term a0 decreases the lower bound while increasing the upper
bound. Essentially, additional momentum widens the range of best-case and worst-case noise amplification, as we would expect to result from the introduction of an additional degree of freedom.
We now examine the effect of additional momentum on noise amplification specifically for
parameters designed to optimize convergence rate.
135
Proposition 10. For a strongly convex quadratic objective function f ∈ Q^L_m, with the parameters θ⋆ρ(a_0)
given in Theorem 16, which achieve the optimal rate of convergence for any a_0 ∈ [−ρ³, ρ³], the modal
contribution Ĵ(λ) to the steady-state variance amplification of system (2.10b) is bounded by

    \frac{1 + \rho + \rho^2}{2(1+\rho)^5} \;\leq\; \hat{J}(\lambda) \;\leq\; \frac{1 + 4\rho^2 + \rho^4}{(1-\rho^2)^5},    (6.16)

which in the 2-D case where a_0 = 0 can be tightened to

    \frac{1}{1-\rho^4} \;\leq\; \hat{J}(\lambda) \;\leq\; \frac{1+\rho^2}{(1-\rho^2)^3}.    (6.17)
As in Proposition 9, we can see that introducing an additional momentum term allows more
variability in noise amplification. Even for optimal parameters, the modal contribution to variance
Ĵ(λ) shows more variation across eigenvalues λ for d = 3 than for d = 2.
Since the optimal convergence rate ρ in Theorem 5 depends on the condition number κ, the
above bounds can be expressed in terms of κ. The following corollary extends the bounds on the
product between modal contributions to the variance amplification and the settling time for the
two-step momentum method [34, 35] to the three-step momentum method with parameters that
optimize the convergence rate.
Corollary 2. For a strongly convex quadratic objective function f ∈ Q^L_m, with parameters θ⋆ρ(a_0),
the product of the noise amplification Ĵ(λ) and the settling time Ts = 1/(1−ρ) is bounded by

    O(\sqrt{\kappa}) \;\leq\; \hat{J}(\lambda) \times T_s \;\leq\; O(\kappa^3)    (6.18)

for any a_0 ∈ [−ρ³, ρ³], and

    O(\kappa) \;\leq\; \hat{J}(\lambda) \times T_s \;\leq\; O(\kappa^2)    (6.19)

for the 2-D case given by a_0 = 0.
Here we can clearly see how moving a_0 away from zero dilates the bounds on the product of noise
amplification and settling time. With the additional momentum term a_0, the upper bound increases
by a factor of κ. The proof of Corollary 2 is given in Section 6.5.4. Next, specifically for the case
d = 3, we present a lower bound on the product of the total system variance J and the settling time
Ts = 1/(1−ρ) in terms of κ.
Theorem 17. Let the three-step momentum algorithm given by (6.1) with d = 3 and strongly
convex quadratic objective function f ∈ Q^L_m with condition number κ converge linearly with
rate ρ. Then the product of steady-state variance and settling time is lower bounded by

    J \times T_s \;\geq\; \frac{\kappa^2}{96}.
Theorem 17 shows that moving from two history terms to three does not improve the trade-off between variance and settling time, as their product is still lower bounded proportionally to
κ². While the lower bound on variance and settling time in terms of κ² has only been proven for
d = 2 and d = 3, we speculate that it is likely to hold for larger values of d as well. Given that the
convergence rate cannot be improved, and increasing d from 2 to 3 failed to improve the product
of variance and settling time, we conclude that introducing additional momentum to the two-step
algorithm does not provide any benefit.
Ultimately we conclude that if we wish to prioritize convergence rate, introducing an additional
history term to the two-step accelerated algorithm is strictly detrimental, as we cannot improve the
convergence rate, and the additional momentum increases steady-state variance.
6.2.4 Results for specific parameters
Both Propositions 9 and 10 establish that as |a_0| increases, the upper bound on Ĵmax increases.
However, if we wish to reduce noise amplification across λ ∈ [m, L], it is useful to know not only
how large Ĵmax can be, but how small it can be made. With this motivation, we consider the lower
bound on the maximum modal contribution to noise amplification min_θ max_λ Ĵ(λ) and examine
how it is affected by moving a_0 away from zero.
We restrict our analysis to the class of parameters θ̂, parameterized by the variables cρ and a_0, consisting of

    \beta_0 = -a_0, \qquad \beta_1 = \frac{a_0^2 + a_0\rho^2 + a_0^2\rho^2 + c_\rho \rho^4 - \rho^6}{\rho^2 (a_0 + \rho^2)}, \qquad \beta_2 = \frac{-a_0^2 + \rho^4 - c_\rho \rho^4 + a_0\rho^4 + \rho^6}{\rho^2 (a_0 + \rho^2)}
    \gamma_0 = 0, \qquad \gamma_1 = \frac{a_0}{a_0 + \rho^2}, \qquad \gamma_2 = \frac{\rho^2}{a_0 + \rho^2}
    c_\rho \in [0,\; 2\rho^2 - 2a_0^2 \rho^{-4}], \qquad a_0 \in [-\rho^3, \rho^3], \qquad \alpha = \frac{2(a_0 + a_0\rho + \rho^3)(2a_0^2 + c_\rho \rho^4 - 2\rho^6)}{(L-m)\, \rho^2\, (a_0^2 - \rho^6)}.    (6.20)
This class of parameters includes the heavy-ball method at a_0 = 0, cρ = 0, and gradient descent
at a_0 = 0, cρ = ρ², but does not include Nesterov's method. As we increase cρ and move between
heavy-ball and gradient descent, we can adjust the trade-off between convergence rate and noise
amplification. We now present the effect of allowing a_0 ≠ 0 on min_θ max_λ Ĵ(λ) for parameters of
this class.
Proposition 11. For a strongly convex quadratic objective function f ∈ Q^L_m, with parameters in θ̂
which achieve rate of convergence ρ, the minimum of Ĵmax := max_λ Ĵ(λ) over the parameter set θ̂ is
increasing in |a_0|.
Essentially, for any parameters in θ̂, increasing |a_0| away from a_0 = 0 must increase not only
the upper bound on Ĵmax, but the lower bound on Ĵmax as well. This means that for any parameters
in the superset θ̂, including the extensions of heavy-ball and gradient descent, increasing |a_0|
strictly increases the maximum modal contribution to noise amplification.
6.3 Results for general discrete time algorithms of order d
In this section we present our main results concerning the effect of additional history terms on the
trade-offs between convergence rate, steady-state variance amplification, and condition number for
the class of d-step accelerated algorithms described in (6.1).
As previously mentioned, unlike the continuous-time case, for discrete-time first-order algorithms it has already been proven that the optimal convergence rate is given by ρ = 1 − 2/(√κ + 1)
[13], and cannot be improved upon. For quadratic objective functions, this rate is achieved by
Polyak's heavy-ball algorithm. A general class of parameters which achieve this rate for d = 3 is
given in Section 6.2. Given that we cannot improve the convergence rate by increasing the number
of previous iterations used to update the current iterate, we instead consider how allowing d > 2
affects the variance, and the product of variance with settling time.
First, for first order accelerated algorithms with d history terms, we provide two equivalent
analytical expressions for the modal contribution to variance Jˆ(λ) associated with eigenvalue λ of
Q, defined in (2.22). Similarly to Theorem 12, the variance Jˆ(λ) is determined by the characteristic
equation Fλ
(z) given in (6.21). We give two equivalent expressions for Jˆ(λ), first in terms of the
Jury stability table associated with polynomial (6.21), and second in terms of the roots of (6.21).
Theorem 18. Let the d-step momentum algorithm (6.1) with strongly convex quadratic objective
function f ∈ Q^L_m converge linearly with rate ρ. Then the modal contribution to steady-state
variance Ĵ(λ) is given by

    \hat{J}(\lambda) = -\frac{\prod_{k=3}^{d-1} c_{k,0}(\lambda)}{c_{d,0}(\lambda)} = \sum_{i=1}^d \frac{\mu_i^{d-1}}{(1 - \mu_i^2) \prod_{j \neq i} (\mu_i - \mu_j)(1 - \mu_i \mu_j)}

where the c_{k,0}(λ) are the first entries of row k of the Jury stability table associated with the characteristic polynomial

    F_\lambda(z) = z^d + \sum_{k=0}^{d-1} a_k(\lambda) z^k, \qquad a_k(\lambda) = \alpha \gamma_k \lambda - \beta_k,    (6.21)

and µ1,...,µd are the roots of said polynomial.
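The root-based expression in Theorem 18 can likewise be checked against a discrete Lyapunov solve for a companion realization (a sketch assuming σ = 1 and distinct roots):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def J_from_roots(mu):
    """Theorem 18: sum_i mu_i^{d-1} / ((1 - mu_i^2) prod_{j != i} (mu_i - mu_j)(1 - mu_i mu_j))."""
    d = len(mu)
    total = 0.0
    for i in range(d):
        denom = (1 - mu[i] ** 2) * np.prod(
            [(mu[i] - mu[j]) * (1 - mu[i] * mu[j]) for j in range(d) if j != i])
        total += mu[i] ** (d - 1) / denom
    return total

mu = [0.5, -0.3, 0.2]                         # distinct roots inside the unit circle
coeffs = np.poly(mu)                          # [1, a_{d-1}, ..., a_0] of F_lambda(z)
d = len(mu)
A = np.zeros((d, d)); A[:-1, 1:] = np.eye(d - 1); A[-1, :] = -coeffs[1:][::-1]
B = np.zeros((d, 1)); B[-1, 0] = 1.0
P = solve_discrete_lyapunov(A, B @ B.T)
print(np.isclose(P[0, 0], J_from_roots(mu)))  # True
```

For d = 2 this reduces to the familiar AR(2) steady-state variance, which provides an additional hand-checkable case.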
Both expressions provide a key connection between stability and variance properties for accelerated algorithms of any order. The entries of the Jury stability table indicate how far the parameters
lie from the edge of the stability region, while the roots of the characteristic polynomial, which
are also the eigenvalues of the matrix Â(λ) in (6.5), directly influence the convergence rate. We
believe that these closed-form expressions for the variance will directly allow us to generalize the
bounds on the product of variance and settling time established for d = 2 to accelerated algorithms
of any order d.
Conjecture 1. Let the d-step momentum algorithm (6.1) with strongly convex quadratic objective
function f ∈ Q^L_m converge linearly with rate ρ. Then the product of steady-state variance and
settling time is lower bounded by

    J \times T_s \;\geq\; \binom{2d-5}{d-3} \frac{\kappa^2}{4^d}    (6.22)

for all positive integers d.
The conjecture is motivated by observations for small values of d, and can be verified
numerically for d = 2,...,5, but remains unproven for algorithms of general dimension d. An
overview of the proof structure is provided in Section 6.5.7, although a key proposition remains
unproven. However, numerical observations for small values of d support our conjecture that
the fundamental limitation on the trade-off between convergence rate and steady-state variance in
terms of the condition number persists for any number d of history terms used in the update equation.
6.4 Conclusion
We consider the class of three-step momentum algorithms in discrete time for strongly convex quadratic problems. We present the family of parameters which achieves the optimal rate of convergence; it generalizes the heavy-ball algorithm and specifically requires γ0 = 0. We establish a lower bound on the product of variance and settling time which scales with the square of the condition number, indicating that the fundamental trade-off between the two metrics cannot be mitigated by additional momentum. Furthermore, examining the variance for the class of optimal parameters indicates that as the third momentum coefficient a0 moves away from zero, the steady-state variance increases. Finally, we present two general analytical expressions for the variance of d-step accelerated algorithms for any d, which can be used to extend the lower bound on variance and settling time to higher-order algorithms.
6.5 Proofs
6.5.1 Proof of Theorem 17
Proof. Equation (6.13) gives the modal contribution to variance associated with the ith eigenvalue λ_i of the Hessian Q. Given that a0 ≤ 1 according to constraint (6.10), we have

Jˆ(λ_i) ≥ (1 + a0) / (2wh).   (6.23)
We will next determine a lower bound for Jˆ(λ_i) for a given a0. For a given point (a1, a2) ∈ ∆ρ(a0), h can be maximized by decreasing a1 until the point lies along the XρZρ edge of the ρ-convergence region. The XρZρ edge is defined by the second constraint of (6.10) holding with equality, and thus satisfies a1 = −ρ² − a2ρ − a0ρ⁻¹, which means that along this edge the values w and h satisfy

h_{XρZρ} = (a0 + ρ)/(ρ(1 − ρ)) · (1 + ρ² + a0(1 − ρ)² + (w_{XρZρ} − 2)ρ).

Thus for any given distance w from the XρZρ edge, h is bounded by

h ≤ (a0 + ρ)/(ρ(1 − ρ)) · (1 + ρ² + a0(1 − ρ)² + (w − 2)ρ)

and thus the modal contribution to variance Jˆ(λ) is bounded by

Jˆ(λ_i)/(1 − ρ) ≥ (1 + a0)/(2wh(1 − ρ)) ≥ (1 + a0)ρ / (2w(a0 + ρ)(1 + a0(1 − ρ)² + (w − 2)ρ + ρ²)).
Based on first-order optimality conditions, ∑_{k=0}^{d−1} a_k(λ) = −1 + αλ, and thus

κ = (αL)/(αm) = (1 + a0(L) + a1(L) + a2(L)) / (1 + a0(m) + a1(m) + a2(m)) = w(L)/w(m).   (6.24)
Together with the fact that κ ≤ (1 + ρ)²/(1 − ρ)², we recover the bounds

w(m) ≤ (1 + ρ)²/κ,   (1 − ρ)² ≤ (1 + ρ)²/κ   (6.25)

which allow us to write

Jˆ(m)/(1 − ρ) ≥ κ² (1 + a0)ρ / (2(a0 + ρ)(1 + a0 + ρ)(1 + ρ)⁴).   (6.26)
It is straightforward to verify that (1 + a0)/((a0 + ρ)(1 + a0 + ρ)) is strictly decreasing in a0 on the domain a0 ∈ [−ρ³, ρ³] for any ρ ∈ [0, 1]. Thus

Jˆ(m)/(1 − ρ) ≥ κ² (1 + ρ³) / (2(1 + ρ)⁴(1 + ρ²)(1 + ρ + ρ³))   (6.27)

which is strictly decreasing in ρ for ρ ≤ 1, which leads to

Jˆ(m)/(1 − ρ) ≥ κ²/96.

Given that J/(1 − ρ) ≥ Jˆ(m)/(1 − ρ), the proof is complete.
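The monotonicity claim in the last step can be spot-checked numerically; the snippet below evaluates the right-hand side of (6.27) as a function of ρ, with the κ² factor stripped, on a grid and confirms its value 1/96 at ρ = 1:

```python
# Spot-check of the last step: the rho-dependent factor of (6.27),
#   g(rho) = (1 + rho^3) / (2 (1+rho)^4 (1+rho^2)(1+rho+rho^3)),
# is decreasing on [0, 1] and equals 1/96 at rho = 1.
def g(rho):
    return (1 + rho ** 3) / (2 * (1 + rho) ** 4 * (1 + rho ** 2) * (1 + rho + rho ** 3))

vals = [g(i / 100) for i in range(101)]
assert all(x > y for x, y in zip(vals, vals[1:]))  # decreasing on the grid
assert abs(g(1.0) - 1 / 96) < 1e-12
```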
6.5.2 Proof of Theorem 16
Proof. Choosing parameters β_k and γ_k amounts to placing a line in the 3-D space of a0, a1, a2, parameterized by λ, which must remain within the ρ-convergence region defined in (6.10), as seen in Figure 6.1a. Based on the endpoints of this line at eigenvalues m and L, the largest permissible
condition number κ is determined by
κ = (αL)/(αm) = (1 + a0(L) + a1(L) + a2(L)) / (1 + a0(m) + a1(m) + a2(m)),   (6.28)
based on the equalities in (6.3). In order to maximize κ for a given ρ, we wish to choose endpoints which maximize this ratio. We will determine parameters which achieve the optimal rate of convergence by considering endpoints (a0, a1, a2)(m) and (a0, a1, a2)(L) which achieve κ = (1 + ρ)²/(1 − ρ)².

We will first consider the simple case where a0(λ) is fixed, with γ0 = 0. In the case of fixed a0, given that we wish (a1(L) + a2(L)) to be greater than (a1(m) + a2(m)), we can restrict our search to lines with the m endpoint on the XρZρ edge and the L endpoint along the ZρYρ edge. Since the longest line in a triangle always lies along an edge, we can further restrict our search to the XρYρ edge and the ZρYρ edge. In this case it is straightforward to verify that the optimal line placement lies along the XρYρ edge of the ρ-convergence region, defined in (6.10).

Using the (a1, a2) coordinates of the endpoints Xρ and Yρ given in (6.11) in conjunction with (6.28) yields κ = (1 + ρ)²/(1 − ρ)², which is independent of a0 and matches the optimal value. By solving the equations a_k(λ) = −β_k + αλγ_k at m and L we produce the parameters given in (6.12).
We will now examine the case where γ0 is nonzero. In this case, due to non-convexity in a0, it is insufficient to simply choose endpoints within the ρ-convergence region. Considering the definition of κ given in (6.28), it is evident that increasing a0(L) should increase κ; however, as a0(L) increases, the convergence region ∆ρ(L) shifts, resulting in a decrease in a1(L) and a2(L). In order to quantify this trade-off, we consider the following question: as a0(L) is increased to a0(m) + ∆a0, how are (a1(L), a2(L)) affected?

As in the previous case, given that we wish (a0, a1, a2)(L) to be greater than (a0, a1, a2)(m), it suffices to consider lines with the m endpoint on the XρZρ edge and the L endpoint along the ZρYρ edge, with a0(L) > a0(m).
Figure 6.2: Overlaid a0 level sets of ∆ρ(a0) at ρ = 0.9, for a0(m) = −ρ³/3 and a0(L) = ρ³/3 in red and blue, respectively. In black and gray we see two examples of a parameterized line (a2(λ), a1(λ)) which runs from the Xρ(m)Zρ(m) edge to the Zρ(L)Yρ(L) edge. We can see that in order to satisfy the constraint corresponding to the Xρ(λ)Yρ(λ) edge as a0 changes, the a1/a2 slope must be more negative than one might expect, and (a1, a2)(L) are both smaller than they would be at the Yρ(m) vertex.
As we increase αλ incrementally by ε_λ, a1 must continue to satisfy the constraint ρ⁶ − a1ρ⁴ + a0a2ρ² − a0² ≥ 0, which requires

a1(m + ε_λ) = a1(m) + γ1ε_λ ≤ ρ² + (a0(m) + γ0ε_λ)(a2(m) + γ2ε_λ)ρ⁻² − (a0(m) + γ0ε_λ)²ρ⁻⁴   (6.29)
and results in the following constraint on γ1 in terms of γ0:

γ1 ≤ [a1^max(m) − a1(m)] + a0(m)γ2ρ⁻² + a2(m)γ0ρ⁻² − 2a0γ0ρ⁻⁴.   (6.30)
Using the relations

γ2 = 1 − γ1 − γ0,   a1^max = ρ² − 2a0(m)ρ⁻¹,   a2(m) = −ρ − a1(m)ρ⁻¹ − a0(m)ρ⁻²   (6.31)

we can express (6.30) as a function solely of a0(m), a1(m), and ρ.
Setting (a1, a2)(m) along XρZρ and (a1, a2)(L) along ZρYρ, together with the definitions of γ_k, requires

2ρ³ + 2a1(m)ρ − ∆a2 ρ² + ∆a1 ρ − ∆a0 = 0,
γ0∆a1 = γ1∆a0,   γ1∆a2 = (1 − γ1 − γ0)∆a1.   (6.32)

We can now solve the system of equations given in (6.30) and (6.32), where we have chosen to set γ1 equal to its upper bound: in order to maximize a1(L) and a2(L) we wish to make γ1/(1 − γ1 − γ0) large, which is achieved by making γ1 as large as allowed.
Thus we obtain γ0, γ1, ∆a1, and ∆a2 as functions of a0(m), a1(m), and ∆a0. The expressions are not included due to their complexity. Along the XρYρ edge, with ∆a0 = 0, we have ∆a1 = 4a0ρ⁻¹ and ∆a2 = 4ρ. Using the symbolic computation engine Mathematica we can verify that, for a1(m) ∈ [−ρ², ρ² − a0(m)ρ⁻¹],

(∆a0 + ∆a1 + ∆a2) ≤ 4ρ + 4a0(m)ρ⁻¹   (6.33)

with equality only in the case ∆a0 = 0 and a1(m) = ρ² − a0(m)ρ⁻¹, the a1 coordinate of Xρ. Repetition of this process for ∆a0 < 0 and for endpoints not on the XρZρ and ZρYρ edges is straightforward and not included for the sake of brevity.
6.5.3 Proof of Proposition 9
We first establish the upper bound on Jˆ(λ) given in both Propositions 9 and 10.

Proof. For a fixed a0, Jˆ(λ) is convex in w, l, h on the positive orthant, which is where stability requires them to lie. Given that w, l, h are affine functions of a1 and a2, as seen in (6.8), Jˆ(λ) is convex in a1, a2 and must achieve its maximum at one of the vertices Xρ, Zρ, Yρ. The (a1, a2) coordinates of the vertices are defined in (6.11), which allows us to determine Jˆ(λ) at each vertex. Exact functional forms are omitted due to length. We thus establish the bound Jˆ(λ) ≤ Jˆ_max(a0), where
Jˆ_max(a0) = ρ⁴ (2|a0|ρ(1 − ρ²) + (ρ² − a0²)(1 + ρ²)) / ((ρ⁴ − a0²)(ρ − |a0|)²(1 − ρ²)³)   (6.34)

is achieved by Jˆ at Yρ for a0 ≥ 0 and at Xρ for a0 ≤ 0. It is straightforward to verify that ∂Jˆ_max(a0)/∂|a0| > 0, and at a0 = ±ρ³ the bound achieves its maximum value

max_{a0} max_λ Jˆ(λ) = (1 + 4ρ² + ρ⁴) / (1 − ρ²)⁵.   (6.35)
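A quick numerical spot-check confirms that evaluating the bound (6.34), as we read it, at a0 = ρ³ recovers the maximum value in (6.35); the short sketch below uses a few illustrative values of ρ:

```python
# Spot-check: the bound (6.34),
#   Jhat_max(a0) = rho^4 (2|a0| rho (1-rho^2) + (rho^2 - a0^2)(1+rho^2))
#                  / ((rho^4 - a0^2)(rho - |a0|)^2 (1-rho^2)^3),
# evaluated at a0 = rho^3 recovers (1 + 4 rho^2 + rho^4)/(1-rho^2)^5 from (6.35).
def jhat_max(a0, rho):
    num = rho ** 4 * (2 * abs(a0) * rho * (1 - rho ** 2)
                      + (rho ** 2 - a0 ** 2) * (1 + rho ** 2))
    den = (rho ** 4 - a0 ** 2) * (rho - abs(a0)) ** 2 * (1 - rho ** 2) ** 3
    return num / den

for rho in (0.3, 0.6, 0.9):
    lhs = jhat_max(rho ** 3, rho)
    rhs = (1 + 4 * rho ** 2 + rho ** 4) / (1 - rho ** 2) ** 5
    assert abs(lhs - rhs) < 1e-9 * rhs
    assert jhat_max(0.0, rho) < lhs  # the bound grows as |a0| moves away from zero
```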
We will now prove the lower bound given in Proposition 9. Both here and in the proof of the lower bounds given in Proposition 10 we will make use of the decomposition

Jˆ(λ) = F1(λ) + F2(λ),   F1(λ) := (1 + a0(λ))/(2w(λ)h(λ)),   F2(λ) := (1 − a0(λ))/(2l(λ)h(λ)),   (6.36)

and take advantage of the inequality

min_λ [F1(λ) + F2(λ)] ≥ min_λ [F1(λ)] + min_λ [F2(λ)]   (6.37)

by bounding F1(λ) and F2(λ) independently.
Proof. For any (a1, a2) interior to ∆ρ(a0) it is possible to move along the line of constant h and increase either w or l until either the XρZρ or the ZρYρ edge of ∆ρ(a0) is reached. Thus F1 and F2 must be minimized along the ZρYρ and XρZρ edges of ∆ρ(a0), respectively. Using the equality constraints which define these edges, we can express the values w and l as linear functions of h and rewrite F1 and F2 as functions of h. We can now present the following propositions giving the minimum of F1 and F2 along these edges.
Proposition 12. For fixed a0 ∈ [−ρ³, ρ³], along the XρZρ edge of ∆ρ(a0), F2 is lower bounded by

F2(h; a0) ≥ 2ρ² / ((1 − a0)(1 + ρ)³(a0 + ρ))   (6.38)

and for any a0 ∈ [−ρ³, ρ³] is lower bounded by

F2(h) ≥ 8ρ² / (1 + ρ)⁵.   (6.39)
Proof. Along the XρZρ edge we use the relation

ρ³ + a2ρ² + a1ρ + a0 = 0

to determine l as a linear function of h,

l(h) = (1 + ρ)/(ρ(a0 + ρ)) · ((1 + ρ)(1 − a0)(ρ + a0) − hρ).   (6.40)

F2(h; a0) is convex in both h and a0, with the minimum over both given in (6.39). The minimum is only achieved within the domains of both h and a0 in certain circumstances, while the bound is valid regardless.
Proposition 13. For fixed a0 ∈ [−ρ³, ρ³], along the ZρYρ edge of the ρ-convergence region, F1 is lower bounded by

F1(h; a0) ≥ 2ρ² / ((1 + a0)(1 + ρ)³(ρ − a0))   (6.41)

and for any a0 ∈ [−ρ³, ρ³] is lower bounded by

F1(h) ≥ 8ρ² / (1 + ρ)⁵.   (6.42)
Proof. Along the ZρYρ edge we use the relation

ρ³ − a2ρ² + a1ρ − a0 = 0   (6.43)

to determine w as a function of h,

w(h) = (1 + ρ)/((ρ − a0)ρ) · ((1 + ρ)(1 + a0)(ρ − a0) − hρ),   (6.44)

and observe that F1(h; a0) is convex in both h and a0. The minimum over both values is given in (6.42).
The propositions above, combined with (6.37), complete the proof. Evaluating the minima of F2 along XρZρ and F1 along ZρYρ given in Propositions 12 and 13 at a0 = 0, and evaluating the upper bound Jˆ_max(a0) given in (6.34) at a0 = 0, provides the upper and lower bounds for the two-step case.
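The passage from the fixed-a0 bound (6.38) to the uniform bound (6.39) can be spot-checked numerically: the a0-dependent bound is smallest where (1 − a0)(a0 + ρ) peaks, at a0 = (1 − ρ)/2, where it equals 8ρ²/(1 + ρ)⁵. A short sketch, with an illustrative value of ρ:

```python
# The fixed-a0 bound in (6.38), 2 rho^2 / ((1 - a0)(1 + rho)^3 (a0 + rho)),
# is minimized over a0 where (1 - a0)(a0 + rho) peaks, at a0 = (1 - rho)/2,
# where it equals the uniform bound 8 rho^2 / (1 + rho)^5 of (6.39).
def per_a0_bound(a0, rho):
    return 2 * rho ** 2 / ((1 - a0) * (1 + rho) ** 3 * (a0 + rho))

rho = 0.7
uniform = 8 * rho ** 2 / (1 + rho) ** 5
assert abs(per_a0_bound((1 - rho) / 2, rho) - uniform) < 1e-12
# and the per-a0 bound dominates the uniform one everywhere on (-rho, 1)
grid = [-rho + (1 + rho) * i / 1000 for i in range(1, 1000)]
assert all(per_a0_bound(a, rho) >= uniform - 1e-9 for a in grid)
```

The same calculation, with a0 replaced by −a0, covers the symmetric bound (6.41)–(6.42).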
6.5.4 Proof of Proposition 10
Proof. As stated in Section 6.5.2, the only parameters which achieve the optimal rate of convergence place (a1, a2) along the XρYρ edge of the ρ-convergence region, for any fixed a0 ∈ [−ρ³, ρ³]. We introduce the notation Jˆ_XY(λ; a0) to refer to the modal noise amplification for these specific parameters. The upper bound on Jˆ(λ) for any stabilizing parameters is already given in equation (6.34) in the proof of Proposition 9.
While it is possible to solve explicitly for min_w Jˆ_XY(w; a0) in terms of a0 and ρ, the result is prohibitively complex. Instead, we will lower bound F1(λ) and F2(λ) independently along the XρYρ edge of the ρ-convergence region, first for fixed a0, and then for any a0. The XρYρ edge of ∆ρ(a0) is defined by

ρ⁶ − a1ρ⁴ + a0a2ρ² − a0² = 0

which allows us to determine l and h as linear functions of w, which is in turn linear in λ:

h(w) = (1 − ρ²)(a0³ + ρ⁴ + a0ρ²(2 − w + ρ²) + a0²(1 + 2ρ²)) / (ρ²(a0 + ρ²)),
l(w) = (−2a0² + w(a0ρ² − ρ⁴) − 2a0²ρ² + 2ρ⁴ + 2ρ⁶) / (ρ²(a0 − 3ρ²)).   (6.45)
For the remainder of the proof we consider F1 and F2 solely as functions of w and a0 along the XρYρ edge. Since the minimum min_w Jˆ_XY(w; a0) is symmetric about a0 = 0, we assume without loss of generality that a0 ≥ 0. F2(w; a0) is strictly increasing in w and is minimized at the Xρ end of the XρYρ edge, while F1(w; a0) is convex in w, although the minimizer is not always interior to XρYρ, leading to the lower bounds

F2(w; a0) ≥ (1 − a0)ρ⁴ / (2(1 − ρ²)(1 + ρ)²(a0 + ρ)²(ρ² − a0)),
F1(w; a0) ≥ 2a0ρ⁴ / ((1 + a0)(1 − ρ²)(a0 + ρ²)³).   (6.46)
While minimizing the sum of both lower bounds over a0 is quite difficult, we can determine a loose lower bound by replacing a0 with the maximum a0 = ρ³ or the minimum a0 = 0 where appropriate, resulting in

Jˆ_XY(λ) ≥ (1 + ρ + ρ²) / (2(1 + ρ)⁵).   (6.47)
When a0 is fixed at a0 = 0, it is straightforward to verify that Jˆ_XY(λ) ≥ 1/(1 − ρ⁴). We established in Theorem 11 that parameters which place a1(λ), a2(λ) along the XρYρ edge achieve the optimal convergence rate ρ = 1 − 2/(√κ + 1). Using this expression for ρ in terms of κ together with the upper and lower bounds on Jˆ(λ) given in (6.35) and (6.47) results in the bounds given in Corollary 2.
6.5.5 Proof of Theorem 11
In this section we consider the parameter set θˆ given in (6.20), which is designed to place (a1, a2)(m) and (a1, a2)(L) along the XρZρ and ZρYρ edges of ∆ρ(a0), respectively. For any fixed a0, we can find parameters θ ∈ θˆ which minimize Jˆ_max := max_λ Jˆ(λ). We now prove that as |a0| increases, so does the minimum value of Jˆ_max.
Proof. As Jˆ is convex in a1 and a2, the maximum must occur at either λ = m or λ = L. We now show that

∂/∂|a0| max{ min_h Jˆ_XZ(h; a0), min_h Jˆ_ZY(h; a0) } > 0

where we use the definitions

Jˆ_XZ(h; a0) = (1 + a0)/(2w_XZ(h)h) + (1 − a0)/(2l_XZ(h)h),
Jˆ_ZY(h; a0) = (1 + a0)/(2w_ZY(h)h) + (1 − a0)/(2l_ZY(h)h)   (6.48)
where w, l as functions of h along the XρZρ and ZρYρ edges are defined in (6.40) and (6.44), respectively. By combining

Jˆ_XZ(h; a0) > Jˆ_ZY(h; a0),   ∂Jˆ_ZY(h; a0)/∂h < 0   for a0 ≥ 0   (6.49)

we see that max{ min_h Jˆ_XZ(h; a0), min_h Jˆ_ZY(h; a0) } occurs along the XρZρ edge for a0 ≥ 0. From the inequalities

∂Jˆ_XZ(h; a0)/∂a0 > 0,   ∂h(Xρ)/∂a0 ≤ 0,   ∂h(Zρ)/∂a0 ≥ 0   for a0 ≥ 0,   (6.50)

we can see that min_h Jˆ_XZ(h; a0) is increasing in a0 > 0. With the symmetry provided by Jˆ_XZ(h; a0) = Jˆ_ZY(h; −a0), it is straightforward to determine that min_θˆ max{Jˆ(m), Jˆ(L)} is increasing as a0 moves away from zero.
6.5.6 Proof of Theorem 18
Without loss of generality, we set σ² = 1 for ease of notation. We first derive the expression for Jˆ(λ) in terms of the entries of the Jury stability table of the associated characteristic equation Fλ(z).

Proof. We begin by establishing some notation. For a general characteristic equation of degree k,

F^k(z) = ∑_{j=0}^{k} a_j z^j,   (6.51)

we consider both the associated Jury stability table [84] and an associated matrix M^k, whose motivation will be given next.
We let c_{i,j} denote the entries of the Jury stability table associated with (6.51), with i indicating the row and j indicating the column, starting with j = 0 to match common notation. The entries are defined recursively by

c_{i+1,j} = c_{i,0} c_{i,j} − c_{i,k+1−i} c_{i,k+1−i−j}.   (6.52)

Note that for a given row i, the table will have k + 2 − i nonzero entries. For the first row, c_{1,j} = a_j for j = 0, ..., k.
We define the matrix M^k associated with (6.51) by

M^k = [ a_k      a_{k−1}  a_{k−2}  ...  a_2  a_1  a_0
        a_{k−1}  a_{k−2}  a_{k−3}  ...  a_1  a_0  0
        a_{k−2}  a_{k−3}  a_{k−4}  ...  a_0  0    0
        ⋮
        a_2      a_1      a_0      ...  0    0    0
        a_1      a_0      0        ...  0    0    0
        a_0      0        0        ...  0    0    0 ]
    + [ 0  0        0        ...  0        0        0
        0  a_k      0        ...  0        0        0
        0  a_{k−1}  a_k      ...  0        0        0
        ⋮
        0  a_3      a_4      ...  a_k      0        0
        0  a_2      a_3      ...  a_{k−1}  a_k      0
        0  a_1      a_2      ...  a_{k−2}  a_{k−1}  a_k ].   (6.53)

We label the rows of the matrix M^k as r_j^k, where j = 1, ..., k + 1 indicates the row number, and the superscript denotes that the rows come from M^k.
We now proceed to prove the result. For the d-step momentum algorithm given in (6.1), the modal contribution to variance Jˆ(λ) as defined in (2.22) is determined by the covariance matrix Pˆ(λ), which solves the algebraic Lyapunov equation

Pˆ(λ) = Aˆ(λ)Pˆ(λ)Aˆ(λ)^T + BˆBˆ^T   (6.54)

where

Aˆ(λ) = [ 0         I
          −a0(λ)    −a1(λ) ··· −a_{d−1}(λ) ],   Bˆ = [0 ··· 0 1]^T.   (6.55)

Based on the structure of the matrices Aˆ(λ) and Bˆ and the symmetry of Pˆ(λ), it is straightforward to confirm that Pˆ(λ) is a symmetric Toeplitz matrix with d unknown elements p_i, with p1 on the main diagonal, p2 on the first off-diagonals, p3 on both second off-diagonals, and so on until p_d, which appears only in the upper-right and lower-left corners. Thus we can rewrite (6.54) as a system of linear equations

M^d p̄ = 1   (6.56)

where M^d is the matrix defined in (6.53) associated with the characteristic equation of Aˆ(λ) given in (6.21), and

p̄ := [p1 p2 p3 ... p_{d−1} p_d p0]^T,   1 := [1 0 ... 0]^T.   (6.57)

Here p1, ..., p_d denote the d unknown elements of the covariance matrix Pˆ(λ), with Pˆ(λ)_{1,1} = p1, and p0 is introduced as an additional slack variable which satisfies the linear relation p0 = −∑_{k=1}^{d} a_{k−1} p_k.
To solve this linear system we will employ a series of vector operations on the rows of M^d to reduce the system to a single equation. Our approach mirrors the technique for creating the Jury stability table, as described in [85]. As described in the proof of the Jury stability criterion provided in [85], we can use polynomial division to reduce a system of order d to an equivalent system of order d − 1, and thus recursively continue to reduce the system. At each iteration, we will present an equivalent, reduced-order linear system of equations M^k p̄^k = b^k, until we reduce the system to a single linear equation.

The first order-reduction step is performed by constructing a new matrix with the rows

r_j^{d−1} = a0 r_{j+1}^d − a_d r_{d+1−j}^d.   (6.58)
By construction, the rows satisfy the linear equations

(r_d^{d−1})^T p̄ = −(r_1^d)^T p̄ = −1,   (r_j^{d−1})^T p̄ = 0,   j = 1, ..., d − 1.

Additionally, each row has a zero in the last column. Most importantly, the rows are constructed such that, when the zero in the last column is dropped, the rows r_j^{d−1} are identically the rows of the matrix M^{d−1} defined in (6.53) associated with the characteristic equation ∑_{j=0}^{d−1} c_{2,j} z^j, where c_{2,j} are the entries of the second row of the Jury stability table associated with (6.21). Thus we have constructed the linear system

M^{d−1} p̄^{d−1} = b^{d−1},
p̄^{d−1} := [p1 p2 p3 ... p_{d−1} p_d]^T,   b^{d−1} := [0 0 ... −1]^T   (6.59)

where p̄^{d−1} ∈ R^d and b^{d−1} ∈ R^d.
At the next iteration, we define the rows

r_j^{d−2} = c_{2,0} r_{j+1}^{d−1} − c_{2,d−1} r_{d+1−j}^{d−1}   (6.60)

and, after dropping the last column of zeros, obtain the linear system

M^{d−2} p̄^{d−2} = b^{d−2},
p̄^{d−2} := [p1 p2 p3 ... p_{d−1}]^T,   b^{d−2} := [0 0 ... −c_{2,0}]^T   (6.61)
where the vector b^{d−2} is determined by noting

(r_{d−1}^{d−2})^T p̄^{d−2} = c_{2,0} (r_d^{d−1})^T p̄^{d−1} = −c_{2,0}.

In general, we define the rows at iteration d − k by

r_j^{d−k} = c_{k+1,0} r_{j+1}^{d−k+1} − c_{k+1,d−k+1} r_{d−k+1−j}^{d−k+1}   (6.62)

minus the last column of zeros, resulting in the linear system of equations

M^{d−k} p̄^{d−k} = b^{d−k},
p̄^{d−k} := [p1 p2 ... p_{d−k+1}]^T,   b^{d−k} := [0 0 ... −c_{2,0} c_{3,0} ··· c_{k,0}]^T   (6.63)

of size d − k + 1. At iteration d, we are left with a single scalar equation

c_{d+1,0} p1 = −c_{2,0} c_{3,0} ··· c_{d,0}

which implies that the variance Jˆ(λ) is given by

Jˆ(λ) = p1 = −(c_{2,0} c_{3,0} ··· c_{d,0}) / c_{d+1,0}

where c_{k,0} are the entries in the first column of the Jury stability table associated with (6.21).
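The reduction above can be transcribed directly into code: starting from the coefficient row c_{1,j} = a_j, repeatedly apply the Jury recursion and collect the first-column entries. The sketch below, with illustrative coefficients of our choosing, checks the resulting quotient against the two-step closed form obtained by specializing the root-based expression of Theorem 18 to d = 2:

```python
def jury_variance(a):
    """Jhat = -(c_{2,0} c_{3,0} ... c_{d,0}) / c_{d+1,0} for the monic
    polynomial F(z) = z^d + a[d-1] z^{d-1} + ... + a[0], using the
    Jury recursion c_{i+1,j} = c_{i,0} c_{i,j} - c_{i,m} c_{i,m-j},
    where m is the index of the last nonzero entry of row i."""
    d = len(a)
    row = list(a) + [1.0]          # row 1: c_{1,j} = a_j (monic, so c_{1,d} = 1)
    prod = 1.0
    for i in range(1, d + 1):      # build rows 2, ..., d + 1
        m = len(row) - 1
        row = [row[0] * row[j] - row[m] * row[m - j] for j in range(m)]
        if i < d:
            prod *= row[0]         # collect c_{2,0}, ..., c_{d,0}
    return -prod / row[0]          # row[0] is now c_{d+1,0}

# d = 2 check against the root-based form of Theorem 18: for roots mu1, mu2,
# a0 = mu1*mu2, a1 = -(mu1+mu2), and the variance equals
# (1 + mu1 mu2) / ((1 - mu1^2)(1 - mu2^2)(1 - mu1 mu2)).
mu1, mu2 = 0.4, 0.5
expected = (1 + mu1 * mu2) / ((1 - mu1 ** 2) * (1 - mu2 ** 2) * (1 - mu1 * mu2))
assert abs(jury_variance([mu1 * mu2, -(mu1 + mu2)]) - expected) < 1e-12
```

For d = 1 the product is empty and the quotient reduces to −1/(a0² − 1) = 1/(1 − a0²), the familiar first-order variance.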
We next derive the expression for Jˆ(λ) in terms of the roots of the associated characteristic equation Fλ(z).

Proof. As stated in the previous proof, Jˆ(λ) is determined by the covariance matrix Pˆ(λ), which solves the algebraic Lyapunov equation

Pˆ(λ) = Aˆ(λ)Pˆ(λ)Aˆ(λ)^T + BˆBˆ^T.   (6.64)
The closed-form solution to the discrete-time algebraic Lyapunov equation is given by

Pˆ(λ) = ∑_{k=0}^{∞} Aˆ(λ)^k BˆBˆ^T (Aˆ(λ)^T)^k.   (6.65)

We use the eigenvalue decomposition Aˆ(λ) = U V U⁻¹, where

V = diag(µ1, µ2, ..., µd),   U_{i,j} = µ_j^{−(d−i)},   (6.66)

to determine the matrix powers Aˆ(λ)^k = U V^k U⁻¹; it follows that the jth element of the vector Aˆ(λ)^k Bˆ is given by

[Aˆ(λ)^k Bˆ]_j = ∑_{i=1}^{d} µ_i^{k+j} / ∏_{m=1, m≠i}^{d} (µ_i − µ_m).   (6.67)
Thus the variance Jˆ(λ), given by the {1,1} element of the matrix Pˆ(λ), has the structure

Jˆ(λ) = Pˆ_{1,1}(λ) = ∑_{k=0}^{∞} ( ∑_{i=1}^{d} µ_i^k / ∏_{j=1, j≠i}^{d} (µ_i − µ_j) )² = ∑_{i=1}^{d} µ_i^{d−1} / ( (1 − µ_i²) ∏_{j=1, j≠i}^{d} (µ_i − µ_j)(1 − µ_i µ_j) ).   (6.68)
6.5.7 Outline of Proof of Conjecture 1
The outline of the proof of Conjecture 1 is provided below. While the proof is incomplete, the outline provides the motivation and reasoning behind the conjecture. The derivation is inspired by the approach used in the proof of Theorem 17, which provides the lower bound on the product of variance and settling time for d = 3. We first determine a lower bound on Jˆ(λ) for a fixed

w := ∏_{i=1}^{d} (1 − µ_i)

and then use the bound on w at λ = m,

w(m) ≤ (1 + ρ)²/κ,

to lower bound the result. The conjecture depends on two propositions which provide lower bounds on Jˆ(λ). First we fix

µ_d = 1 − w / ∏_{i=1}^{d−1} (1 − µ_i)

in order to ensure a fixed value of w, and provide a configuration of the remaining d − 1 eigenvalues which results in a lower bound on Jˆ(λ) for the given w.
Proposition 14. Let the d-step momentum algorithm (6.1) with strongly convex quadratic objective function f ∈ Q_m^L converge linearly with rate ρ. Then, for any fixed value of w = ∏_{i=1}^{d} (1 − µ_i), the modal contribution to variance Jˆ(λ) is lower bounded by

Jˆ(λ) ≥ ∑_{k=0}^{∞} lim_{(µ1,...,µ_{d−1})→ρ} ( ∑_{i=1}^{d} µ_i^k ((1 + µ_i)/2) / ∏_{j=1, j≠i}^{d} (µ_i − µ_j) ) ( ∑_{i=1}^{d} µ_i^k / ∏_{j=1, j≠i}^{d} (µ_i − µ_j) ) =: Jˆ⋆(w)   (6.69)

where µ1, ..., µd are the roots of the characteristic equation Fλ(z) given in (6.21).
The first step of the proposition,

Jˆ(λ) ≥ ∑_{k=0}^{∞} ( ∑_{i=1}^{d} ((1 + µ_i)/2) µ_i^k / ∏_{j=1, j≠i}^{d} (µ_i − µ_j) ) ( ∑_{i=1}^{d} µ_i^k / ∏_{j=1, j≠i}^{d} (µ_i − µ_j) ),   (6.70)

follows immediately from Theorem 18. The factors (1 + µ_i)/2 ≤ 1 have been added preemptively to produce an expression which is strictly decreasing in w. In the d = 3 case, this step is equivalent to introducing the lower bound in (6.23). Next we propose that, given the constraints µ_i ∈ [−ρ, ρ] for i = 1, ..., d − 1, the above expression is minimized by placing all d − 1 free eigenvalues at +ρ. This claim can be verified numerically for d = 2, ..., 5, but remains unproven for general integers d. Next we derive a lower bound for the value of the variance at this eigenvalue placement.
Proposition 15. Let the d-step momentum algorithm (6.1) with strongly convex quadratic objective function f ∈ Q_m^L with condition number κ converge linearly with rate ρ. Then, for any value of w = ∏_{i=1}^{d} (1 − µ_i), the function Jˆ⋆(w) defined in (6.69) satisfies

Jˆ⋆(w)/(1 − ρ) ≥ (2d − 3) κ² / 2^{2(d+1)}.   (6.71)
The proof is provided below.

Proof. We first compute the inner limit in (6.69), placing d − 1 eigenvalues at ρ, and then compute the infinite summation over k. While in the proof of Theorem 18 we determined the infinite summation over k first, computing the limit µ1, ..., µ_{d−1} → ρ of the expression in (6.68) is prohibitively difficult. We first establish
lim_{(µ1,...,µ_{d−1})→ρ} ( ∑_{i=1}^{d} µ_i^k ((1 + µ_i)/2) / ∏_{j≠i} (µ_i − µ_j) ) ( ∑_{i=1}^{d} µ_i^k / ∏_{j≠i} (µ_i − µ_j) )

= µ_d^{2k} (1 + µ_d) / ( 2(µ_d − ρ)^{2(d−1)} )
+ ( µ_d^k ρ^k (2 + µ_d) / ( 2(µ_d − ρ)^{2(d−1)} ) ) ∑_{m=0}^{d−2} \binom{k}{m} (µ_d − ρ)^m ρ^{−m}
+ ( µ_d^k ρ^{k+1} / ( 2(µ_d − ρ)^{2(d−1)} ) ) ∑_{m=0}^{d−2} \binom{k+1}{m} (µ_d − ρ)^m ρ^{−m}
+ ∑_{m=0}^{d−2} ∑_{n=0}^{d−2} ( ρ^{2k−m−n} / ( 2(µ_d − ρ)^{2d−2−m−n} ) ) \binom{k}{n} ( \binom{k}{m} + ρ \binom{k+1}{m} )   (6.72)
and introduce the identities

∑_{k=0}^{∞} µ_d^k ρ^k \binom{k}{n} = µ_d^n ρ^n / (1 − µ_d ρ)^{n+1}   (6.73a)

∑_{k=0}^{∞} µ_d^k ρ^k \binom{k+1}{m} = µ_d^{m−1} ρ^{m−1} / (1 − µ_d ρ)^{m+1}
   = µ_d^{m−1} ρ^{m−1} / (1 − µ_d ρ)^{m} + µ_d^m ρ^m / (1 − µ_d ρ)^{m+1}   for m ≥ 1,
   = µ_d^m ρ^m / (1 − µ_d ρ)^{m+1}   for m = 0   (6.73b)

∑_{k=0}^{∞} ρ^{2k} \binom{k}{m} \binom{k}{n} = ρ^{2max[m,n]} f_{m,n}(ρ²) / (1 − ρ²)^{m+n+1}   (6.73c)

∑_{k=0}^{∞} ρ^{2k} \binom{k}{m} \binom{k+1}{n} = ρ^{2max[m,n−1]} g_{m,n}(ρ²) / (1 − ρ²)^{m+n+1}   (6.73d)
where f_{m,n}(x) is a polynomial of degree min[m, n] with positive integer coefficients and g_{m,n}(x) is a polynomial of degree min[m, n − 1] with positive integer coefficients. The last two identities are obtained by noting that

\binom{k}{m} \binom{k}{n} = (1/(m! n!)) ∏_{i=0}^{m−1} (k − i) ∏_{j=0}^{n−1} (k − j)

is a polynomial of degree n + m in k with integer coefficients, and that

∑_{k=0}^{∞} ρ^{2k} k^n = ρ² ∑_{j=0}^{n−1} ⟨n, j⟩ ρ^{2j} / (1 − ρ²)^{n+1}   (6.74)

where ⟨n, j⟩ are the Eulerian numbers [86]. We can combine the expressions in (6.73c) and (6.73d) to introduce the notation
to introduce the notation
Gm,n(ρ) := (1−ρ
2
)
m+n+1
∞
∑
k=0
ρ
2k
k
n
k
m
+ ρ
k +1
m
=
h
ρ
2max[m,n]
fm,n(ρ
2
) + ρ
2max[m,n−1]+1
gm,n(ρ
2
)
i
(6.75)
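Identities (6.73a) and (6.74) can be verified numerically by truncating the series; the sketch below computes the Eulerian numbers by their standard recurrence, with x standing in for ρ² (or for µ_d ρ in (6.73a)):

```python
from math import comb

def eulerian(n, j):
    """Eulerian number <n, j> via the standard recurrence
    <n, j> = (j + 1)<n-1, j> + (n - j)<n-1, j-1>."""
    if n == 0:
        return 1 if j == 0 else 0
    if j < 0 or j >= n:
        return 0
    return (j + 1) * eulerian(n - 1, j) + (n - j) * eulerian(n - 1, j - 1)

x = 0.3   # stands in for rho^2 (or mu_d rho in (6.73a)); any |x| < 1 works
K = 400   # truncation; the tail is negligible at this x

# (6.73a): sum_k C(k, n) x^k = x^n / (1 - x)^(n+1)
for n in range(5):
    lhs = sum(comb(k, n) * x ** k for k in range(K))
    assert abs(lhs - x ** n / (1 - x) ** (n + 1)) < 1e-10

# (6.74): sum_k k^n x^k = x * sum_j <n, j> x^j / (1 - x)^(n+1), for n >= 1
for n in range(1, 5):
    lhs = sum(k ** n * x ** k for k in range(K))
    rhs = x * sum(eulerian(n, j) * x ** j for j in range(n)) / (1 - x) ** (n + 1)
    assert abs(lhs - rhs) < 1e-10
```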
Using the above identities together with the substitution µ_d = 1 − w/(1 − ρ)^{d−1}, we can now express the infinite summation over k of (6.72) as

Jˆ⋆(w) = ((1 − ρ)^{d−1})^{2d−1} / ( 2((1 − ρ)^d − w)^{2d−2} (2(1 − ρ)^{d−1} − w) )

+ ∑_{m=0}^{d−2} ((1 − ρ)^{d−1})^{2d−2−m} ((2 + ρ)(1 − ρ)^{d−1} − w)((1 − ρ)^{d−1} − w)^m / ( 2((1 − ρ)^d − w)^{2d−2−m} ((1 − ρ)^d + wρ)^{m+1} )

+ ∑_{m=1}^{d−2} ((1 − ρ)^{d−1})^{2d−1−m} ((1 − ρ)^{d−1} − w)^{m−1} / ( 2((1 − ρ)^d − w)^{2d−2−m} ((1 − ρ)^d + wρ)^m )

+ ∑_{m=0}^{d−2} ∑_{n=0}^{d−2} ((1 − ρ)^{d−1})^{2d−2−m−n} G_{m,n}(ρ) / ( 2((1 − ρ)^d − w)^{2d−2−m−n} (1 − ρ²)^{m+n+1} ).   (6.76)
Derivative analysis shows that Jˆ⋆(w) is decreasing in w. Using the bound w ≤ (1 + ρ)²/κ at λ = m introduced in (6.25), and introducing the product with the settling time T_s = 1/(1 − ρ), we can then write

Jˆ⋆(w) × 1/(1 − ρ) ≥ Jˆ⋆((1 + ρ)²/κ) × 1/(1 − ρ)

= κ^{2d−1} (1 − ρ)^{2d²−3d} / ( 2(κ(1 − ρ)^d − (1 + ρ)²)^{2d−2} (2κ(1 − ρ)^{d−1} − (1 + ρ)²) )

+ ∑_{m=0}^{d−2} κ^{2d−2−m} (1 − ρ)^{2d²−4d−dm+m+1} (κ(2 + ρ)(1 − ρ)^{d−1} − (1 + ρ)²)(κ(1 − ρ)^{d−1} − (1 + ρ)²)^m / ( 2(κ(1 − ρ)^d − (1 + ρ)²)^{2d−2−m} (κ(1 − ρ)^d + (1 + ρ)²ρ)^{m+1} )

+ ∑_{m=1}^{d−2} κ^{2d−m−1} (1 − ρ)^{2d²−3d−dm+m} (κ(1 − ρ)^{d−1} − (1 + ρ)²)^{m−1} / ( 2(κ(1 − ρ)^d − (1 + ρ)²)^{2d−2−m} (κ(1 − ρ)^d + (1 + ρ)²ρ)^m )

+ ∑_{m=0}^{d−2} ∑_{n=0}^{d−2} κ^{2d−2−m−n} (1 − ρ)^{d(2d−4−m−n)} G_{m,n}(ρ) / ( 2(κ(1 − ρ)^d − (1 + ρ)²)^{2d−2−m−n} (1 + ρ)^{m+n+1} ).   (6.77)
It now remains to show that the expression above is decreasing in ρ, which is the content of the following lemma.

Lemma 2. The function (1/(1 − ρ)) Jˆ⋆((1 + ρ)²/κ) given in (6.77) is decreasing in ρ for ρ ∈ [0, 1].
We now determine the limit of (1/(1 − ρ)) Jˆ⋆((1 + ρ)²/κ) as ρ approaches one. Due to the presence of (1 − ρ) terms in the numerators, the first three terms all tend to zero as ρ approaches one. For the final double-summation term, as long as m < d − 2 or n < d − 2, the exponent 2d − 4 − m − n is strictly greater than zero, and the term tends to zero as ρ tends to one. For the last term in the summations, at m = n = d − 2, we have d(2d − 4 − m − n) = 0 and m + n + 1 = 2d − 3, and thus

lim_{ρ→1} κ^{2d−2−m−n} (1 − ρ)^{d(2d−4−m−n)} / ( 2(κ(1 − ρ)^d − (1 + ρ)²)^{2d−2−m−n} (1 + ρ)^{m+n+1} ) = κ² / 2^{2d+2}.   (6.78)
It thus remains to establish the limit of the polynomial G_{m,n}(ρ) at m = n = d − 2 as ρ approaches one. Based on the definition given in (6.75) and the intermediate summation given in (6.74), we can use the computational engine Mathematica to verify

G_{d−2,d−2}(1) = 4 \binom{2d−5}{d−3}   for d ≥ 3.   (6.79)
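For d = 3 (so m = n = d − 2 = 1), the series defining G_{1,1} in (6.75), as we read it, can be summed with elementary geometric-series identities, giving G_{1,1}(ρ) = ρ²(1 + ρ)² by our own calculation; its value at ρ = 1 is 4 = 4·\binom{1}{0}, consistent with (6.79). The sketch below checks this, along with the limit (6.78), numerically for illustrative values:

```python
from math import comb

# For d = 3, our reading of (6.75) gives
#   G_{1,1}(rho) = (1 - rho^2)^3 * sum_k rho^{2k} [C(k,1)^2 + rho C(k,1) C(k+1,1)],
# which elementary summation reduces to rho^2 (1 + rho)^2; at rho = 1 this is
# 4 = 4 * C(2d-5, d-3) = 4 * C(1, 0), matching (6.79).
def G11(rho, K=4000):
    s = sum(rho ** (2 * k) * (k * k + rho * k * (k + 1)) for k in range(K))
    return (1 - rho ** 2) ** 3 * s

for rho in (0.5, 0.8, 0.95):
    assert abs(G11(rho) - rho ** 2 * (1 + rho) ** 2) < 1e-6

# Limit (6.78) at m = n = d - 2: the denominator tends to
# 2 * (-4)^(2d-2-m-n) * 2^(m+n+1) = 2^(2d+2) as rho -> 1.
d, kappa = 3, 10.0
def term(rho):
    return kappa ** 2 / (2 * (kappa * (1 - rho) ** d - (1 + rho) ** 2) ** 2
                         * (1 + rho) ** (2 * d - 3))
assert abs(term(0.99999) - kappa ** 2 / 2 ** (2 * d + 2)) < 1e-3
assert comb(2 * d - 5, d - 3) == 1
```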
Chapter 7
Conclusion
We have examined key performance metrics of accelerated first-order optimization algorithms. For smooth strongly convex objective functions, we examined the impact of acceleration on transient growth in optimization error and determined upper and lower bounds on the worst-case transient response which scale with the square root of the condition number. We further proposed two variations of the standard two-step momentum-based algorithms and quantified their effects on algorithmic performance: averaging the iterates x^t over time, and adding additional history terms x^{t−d} to the update equation.

We showed that averaging algorithmic iterates over a moving window of fixed length d reduces steady-state variance by a factor of roughly 1/d compared to the steady-state variance of the non-averaged output while maintaining the linear convergence rate. However, the product of variance and settling time maintains its dependence on the square of the condition number, indicating that variance reduction due to averaging cannot overcome this fundamental limitation.

Our second algorithmic variation achieves similar results. In continuous time, our examination of gradient flow dynamics of order d shows that while increasing the order d improves the convergence rate and improves the upper bound on the best possible variance for a given convergence rate, the product between variance amplification and settling time maintains its scaling with κ². In discrete time for strongly convex quadratic problems, it is an established result that the convergence rate achieved by the heavy-ball algorithm cannot be improved upon. Our results show that for any stabilizing parameters, the product of variance and settling time is lower bounded in terms of κ², and for parameters which achieve the optimal convergence rate, the largest modal contribution to variance increases as the algorithmic dependence on the third history term increases. Together, these results lead us to conclude that adding a third momentum term does not offer an improvement in algorithmic performance. We further conjecture that these results extend to accelerated algorithms with any number d of history terms, which remains to be proven.

Future work includes fully quantifying the trade-off between convergence rate and steady-state variance for accelerated algorithms with d momentum terms, as well as consideration of the connection between continuous- and discrete-time behavior imposed by discretization. Investigation into discretization schemes connecting gradient flow dynamics and discrete-time accelerated algorithms is still ongoing; further study of the connection between accelerated algorithms in discrete and continuous time may offer insights into how to capture the advantages of higher-order gradient flows in practical implementations.
References
[1] L. Bottou and Y. Le Cun. On-line learning for very large data sets. Appl. Stoch. Models Bus.
Ind., 21(2):137–151, 2005.
[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse
problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.
[3] Yu Nesterov. Gradient methods for minimizing composite objective functions. Math. Program., 140(1):125–161, 2013.
[4] Mingyi Hong, Meisam Razaviyayn, Zhi-Quan Luo, and Jong-Shi Pang. A unified algorithmic framework for block-structured optimization involving big data: With applications in
machine learning and signal processing. IEEE Signal Process. Mag., 33(1):57–77, 2016.
[5] L. Bottou, F. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning.
SIAM Rev., 60(2):223–311, 2018.
[6] Fu Lin, Makan Fardad, and Mihailo R. Jovanović. Design of optimal sparse feedback gains
via the alternating direction method of multipliers. IEEE Transactions on Automatic Control,
58(9):2426–2431, 2013.
[7] Sepideh Hassan-Moghaddam and Mihailo R. Jovanović. Topology design for stochastically
forced consensus networks. IEEE Transactions on Control of Network Systems, 5(3):1075–
1086, 2017.
[8] Armin Zare, Hesameddin Mohammadi, Neil K. Dhingra, Tryphon T. Georgiou, and Mihailo R.
Jovanović. Proximal algorithms for large-scale statistical modeling and sensor/actuator
selection. IEEE Transactions on Automatic Control, 65(8):3441–3456, 2019.
[9] Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization
algorithms via integral quadratic constraints. SIAM J. Optim., 26(1):57–95, 2016.
[10] B. Van Scoy, R. A. Freeman, and K. M. Lynch. The fastest known globally convergent first-order
method for minimizing strongly convex functions. IEEE Control Syst. Lett., 2(1):49–
54, 2018.
[11] A. Badithela and P. Seiler. Analysis of the heavy-ball algorithm using integral quadratic
constraints. In Proceedings of the 2019 American Control Conference, pages 4081–4085.
IEEE, 2019.
164
[12] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and
momentum in deep learning. In Proc. ICML, pages 1139–1147, 2013.
[13] Y. Nesterov. Lectures on convex optimization, volume 137. Springer Optimization and Its
Applications, 2018.
[14] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning,
pages 1139–1147. PMLR, 2013.
[15] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR
Comput. Math. & Math. Phys., 4(5):1–17, 1964.
[16] Y. Nesterov. A method for solving the convex programming problem with convergence rate
O(1/k²). In Dokl. Akad. Nauk SSSR, volume 27, pages 543–547, 1983.
[17] Boris Polyak and Anatoli Juditsky. Acceleration of stochastic approximation by averaging.
SIAM Journal on Control and Optimization, 30:838–855, 07 1992.
[18] Sébastien Gadat, Fabien Panloup, and Sofiane Saadane. Stochastic heavy ball. Electronic
Journal of Statistics, 12(1):461–529, 2018.
[19] Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Accelerating stochastic gradient descent for least squares regression. In Conference On Learning Theory, pages 545–604. PMLR, 2018.
[20] Michael Cohen, Jelena Diakonikolas, and Lorenzo Orecchia. On acceleration with noise-corrupted gradients. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1019–1028. PMLR, 10–15 Jul 2018.
[21] Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural Computation,
12:1889–1900, 2000.
[22] Ahmad Beirami, Meisam Razaviyayn, Shahin Shahrampour, and Vahid Tarokh. On optimal
generalizability in parametric learning. In NIPS, 2017.
[23] Zhi-Quan Luo and P. Tseng. Error bounds and convergence analysis of feasible descent
methods: a general approach. Ann. Oper. Res., 46(1):157–178, 1993.
[24] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statist., pages
400–407, 1951.
[25] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574–1609, 2009.
[26] O. Devolder. Exactness, Inexactness and Stochasticity in First-Order Methods for Large-Scale Convex Optimization. PhD thesis, Louvain-la-Neuve, 2013.
[27] P. Dvurechensky and A. Gasnikov. Stochastic intermediate gradient method for convex problems with stochastic inexact oracle. J. Optimiz. Theory App., 171(1):121–145, 2016.
[28] D. Maclaurin, D. Duvenaud, and R. Adams. Gradient-based hyperparameter optimization
through reversible learning. In Proc. ICML, pages 2113–2122, 2015.
[29] Hesameddin Mohammadi, Armin Zare, Mahdi Soltanolkotabi, and Mihailo R. Jovanović.
Convergence and sample complexity of gradient methods for the model-free linear–quadratic
regulator problem. IEEE Transactions on Automatic Control, 67(5):2435–2450, 2022.
[30] Maryam Fazel, Rong Ge, Sham M. Kakade, and Mehran Mesbahi. Global convergence of
policy gradient methods for the linear quadratic regulator. In International Conference on
Machine Learning, 2018.
[31] Hesameddin Mohammadi, Mahdi Soltanolkotabi, and Mihailo R. Jovanović. Random search
for learning the linear quadratic regulator. In 2020 American Control Conference (ACC),
pages 4798–4803. IEEE, 2020.
[32] B. O’Donoghue and E. Candes. Adaptive restart for accelerated gradient schemes. Found.
Comput. Math., 15:715–732, 2015.
[33] B. T. Polyak and G. V. Smirnov. Transient response in matrix discrete-time linear systems.
Autom. Remote Control, 80(9):1645–1652, 2019.
[34] H. Mohammadi, M. Razaviyayn, and M. R. Jovanović. Robustness of accelerated first-order
algorithms for strongly convex optimization problems. IEEE Trans. Automat. Control,
66(6):2480–2495, June 2021.
[35] H. Mohammadi, M. Razaviyayn, and M. R. Jovanović. Tradeoffs between convergence rate
and noise amplification for momentum-based accelerated optimization algorithms. IEEE
Trans. Automat. Control, 2022. Submitted; also arXiv:2209.11920.
[36] B. Van Scoy and L. Lessard. The speed-robustness trade-off for first-order methods with
additive gradient noise. 2021. arXiv:2109.05059.
[37] O. Devolder, F. Glineur, and Y. Nesterov. First-order methods of smooth convex optimization
with inexact oracle. Math. Program., 146(1-2):37–75, 2014.
[38] B. Hu and L. Lessard. Dissipativity theory for Nesterov’s accelerated method. In Proc.
ICML, volume 70, pages 1549–1557, 2017.
[39] Necdet Serhat Aybat, Alireza Fallah, Mert Gurbuzbalaban, and Asuman Ozdaglar. Robust
accelerated gradient methods for smooth strongly convex functions. SIAM Journal on Optimization, 30(1):717–751, 2020.
[40] Igor Gitman, Hunter Lang, Pengchuan Zhang, and Lin Xiao. Understanding the role of momentum in stochastic gradient methods. Advances in Neural Information Processing Systems,
32, 2019.
[41] Wei Tao, Sheng Long, Gaowei Wu, and Qing Tao. The role of momentum parameters in the optimal convergence of adaptive Polyak's heavy-ball methods. arXiv preprint
arXiv:2102.07314, 2021.
[42] Boris T Polyak. New stochastic approximation type procedures. Automat. i Telemekh, 7(98-
107):2, 1990.
[43] Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in neural information processing systems, 24, 2011.
[44] Nilesh Tripuraneni, Nicolas Flammarion, Francis Bach, and Michael I Jordan. Averaging
stochastic gradient descent on riemannian manifolds. In Conference On Learning Theory,
pages 650–687. PMLR, 2018.
[45] Aymeric Dieuleveut, Nicolas Flammarion, and Francis Bach. Harder, better, faster, stronger
convergence rates for least-squares regression. Journal of Machine Learning Research,
18(101):1–51, 2017.
[46] Francis Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity
for logistic regression. The Journal of Machine Learning Research, 15(1):595–627, 2014.
[47] Necdet Serhat Aybat, Alireza Fallah, Mert Gurbuzbalaban, and Asuman Ozdaglar. A
universally optimal multistage accelerated stochastic gradient method. In H. Wallach,
H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances
in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[48] K. J. Arrow, L. Hurwicz, and H. Uzawa. Studies in linear and non-linear programming. 1958.
[49] Andrew A Brown and Michael C Bartholomew-Biggs. Some effective methods for unconstrained optimization based on the solution of systems of ordinary differential equations. J.
Optim. Theory Appl., 62(2):211–224, 1989.
[50] Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. Proc. Neural Information Processing (NIPS), 27, 2014.
[51] Diego Feijer and Fernando Paganini. Stability of primal–dual gradient dynamics and applications to network optimization. Automatica, 46(12):1974–1981, 2010.
[52] Jing Wang and Nicola Elia. A control perspective for centralized and distributed convex
optimization. In 2011 50th IEEE conference on decision and control and European control
conference, pages 3800–3805. IEEE, 2011.
[53] Ashish Kumar Cherukuri, Enrique Mallada, and Jorge Cortés. Asymptotic convergence of
constrained primal-dual dynamics. Syst. Control. Lett., 87:10–15, 2015.
[54] N. K. Dhingra, S. Z. Khong, and M. R. Jovanović. The proximal augmented Lagrangian
method for nonsmooth composite optimization. IEEE Trans. Automat. Control, 64(7):2861–
2868, July 2019.
[55] S. Hassan-Moghaddam and M. R. Jovanović. Proximal gradient flow and Douglas-Rachford
splitting dynamics: global exponential stability via integral quadratic constraints. Automatica, 123:109311 (7 pages), January 2021.
[56] N. K. Dhingra, S. Z. Khong, and M. R. Jovanović. A second order primal-dual method
for nonsmooth convex composite optimization. IEEE Trans. Automat. Control, 67(8):4061–
4076, August 2022.
[57] Michael Muehlebach and Michael I. Jordan. Continuous-time lower bounds for gradient-based algorithms. In Proceedings of the 37th International Conference on Machine Learning,
ICML'20. JMLR.org, 2020.
[58] H. Mohammadi, M. Razaviyayn, and M. R. Jovanović. Variance amplification of accelerated
first-order algorithms for strongly convex quadratic optimization problems. In Proceedings
of the 57th IEEE Conference on Decision and Control, pages 5753–5758, Miami, FL, 2018.
[59] H. Mohammadi, M. Razaviyayn, and M. R. Jovanović. Performance of noisy Nesterov's
accelerated method for strongly convex optimization problems. In Proceedings of the 2019
American Control Conference, pages 3426–3431, Philadelphia, PA, 2019.
[60] S. Michalowsky, C. Scherer, and C. Ebenbauer. Robust and structure exploiting optimisation
algorithms: an integral quadratic constraint approach. Int. J. Control, pages 1–24, 2020.
[61] J. I. Poveda and Na Li. Robust hybrid zero-order optimization algorithms with acceleration
via averaging in time. Automatica, page 109361, 2021.
[62] Dimitri P Bertsekas. Convex optimization algorithms. Athena Scientific, 2015.
[63] Y. Ouyang, Y. Chen, G. Lan, and E. Pasiliao. An accelerated linearized alternating direction
method of multipliers. SIAM J. Imaging Sci., 8(1):644–681, 2015.
[64] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin. On the linear convergence of the admm in
decentralized consensus optimization. IEEE Trans. Signal Process., 62(7):1750–1761, 2014.
[65] L. N. Trefethen and M. Embree. Spectra and pseudospectra: the behavior of nonnormal
matrices and operators. Princeton University Press, Princeton, 2005.
[66] M. R. Jovanović and B. Bamieh. Componentwise energy amplification in channel flows. J.
Fluid Mech., 534:145–183, July 2005.
[67] Mihailo R. Jovanović. From bypass transition to flow control and data-driven turbulence modeling: an input–output viewpoint. Annual Review of Fluid Mechanics, 53:311–345, 2021.
[68] B. Can, M. Gurbuzbalaban, and L. Zhu. Accelerated linear convergence of stochastic momentum methods in Wasserstein distances. In International Conference on Machine Learning,
pages 891–901. PMLR, 2019.
[69] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87.
Kluwer Academic Publishers, 2004.
[70] M. Fazlyab, A. Ribeiro, M. Morari, and V. M. Preciado. Analysis of optimization algorithms via integral quadratic constraints: Nonstrongly convex problems. SIAM J. Optim.,
28(3):2654–2689, 2018.
[71] Marina Danilova and Grigory Malinovskiy. Averaged heavy-ball method. Computer Research and Modeling, 14:277–308, 04 2022.
[72] A. Megretski and A. Rantzer. System analysis via integral quadratic constraints. IEEE Trans.
Autom. Control, 42(6):819–830, 1997.
[73] S. Cyrus, B. Hu, B. Van Scoy, and L. Lessard. A robust accelerated optimization algorithm
for strongly convex functions. In Proceedings of the 2018 American Control Conference,
pages 1376–1381, 2018.
[74] B. T. Polyak and P. Shcherbakov. Lyapunov functions: An optimization theory perspective.
IFAC-PapersOnLine, 50(1):7456–7461, 2017.
[75] F. Topsøe. Some bounds for the logarithmic function. Inequal. Theory Appl., 4:137, 2006.
[76] Nicolas Flammarion and Francis Bach. From averaging to acceleration, there is only a step-size. In Conference on Learning Theory, pages 658–695. PMLR, 2015.
[77] Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation
with convergence rate O(1/n). Advances in Neural Information Processing Systems, 26, 2013.
[78] Boris Polyak. Comparison of convergence rate of one-step and multistep optimization algorithms in the presence of noise. Izv. Akad. Nauk SSSR, Tekh. Kibern., 1:9–12, 01 1977.
[79] H. Mohammadi, S. Samuelson, and M. R. Jovanović. Transient growth of accelerated optimization algorithms. IEEE Trans. Automat. Control, 68(3):1823–1830, 2023.
[80] M. Muehlebach and M. Jordan. A dynamical systems perspective on Nesterov acceleration.
In International Conference on Machine Learning, pages 4656–4662. PMLR, 2019.
[81] Michael Muehlebach and Michael I. Jordan. Continuous-time lower bounds for gradient-based algorithms. In Proceedings of the 37th International Conference on Machine Learning,
ICML'20. JMLR.org, 2020.
[82] Marc Bodson. Explaining the Routh–Hurwitz criterion: A tutorial presentation [focus on
education]. IEEE Control Systems Magazine, 40(1):45–51, 2020.
[83] E. I. Jury. A simplified stability criterion for linear discrete systems. Proceedings of the IRE,
50(6):1493–1500, 1962.
[84] K. Ogata. Discrete-time control systems. Prentice-Hall, New Jersey, 1994.
[85] Lee H. Keel and Shankar P. Bhattacharyya. A new proof of the Jury test. Automatica,
35(2):251–258, 1999.
[86] T. Kyle Petersen. Eulerian Numbers. Birkhäuser, October 2015.
Abstract
First-order algorithms are widely used in a variety of control and learning applications. We employ control-theoretic tools to examine the performance of momentum-based accelerated first-order optimization algorithms used to solve primarily strongly convex quadratic problems in the presence of additive white stochastic disturbances. We consider three key performance metrics: the convergence rate of the optimization error in expectation, the worst-case transient growth of the normalized optimization error over non-asymptotic time frames, and the steady-state variance. We first examine the transient behavior of accelerated first-order optimization algorithms in the absence of noise. We show that both the maximum normalized optimization error, taken over iteration number and possible initial conditions, and the rise time to the worst-case transient peak are proportional to the square root of the condition number of the problem. We next propose two variations on the class of standard two-step momentum-based algorithms and investigate their effects on the performance metrics above. Our goal is to reduce worst-case transient growth and steady-state variance without compromising convergence rate. We consider post-algorithmic averaging over the iterates of the optimization algorithms, as well as the introduction of additional momentum terms that reach further into algorithmic history to update each iterate. Averaging over a fixed number of iterates reduces variance somewhat while maintaining the convergence rate, whereas introducing additional momentum in discrete time preserves the convergence rate while increasing the variance for optimal parameters. Higher-order gradient flow dynamics improve both convergence rate and variance. For both algorithmic variants, the lower bound on the product of steady-state variance and settling time, expressed in terms of the square of the condition number, is preserved.
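The trade-off described in the abstract can be illustrated numerically. The following Python sketch (not taken from the dissertation; the problem instance, noise level, and variable names are illustrative choices) runs Polyak's heavy-ball method, a standard two-step momentum algorithm, on a strongly convex quadratic with additive white gradient noise: the error first contracts at the accelerated rate but then settles at a noise floor determined by the steady-state variance.

```python
import numpy as np

# Two-step momentum iteration x_{k+1} = x_k + beta (x_k - x_{k-1}) - alpha grad f(x_k)
# applied to f(x) = 0.5 x^T Q x with additive white noise on the gradient.
rng = np.random.default_rng(0)
n = 20
kappa = 100.0                      # condition number of the quadratic
eigs = np.linspace(1.0, kappa, n)  # spectrum of Q: m = 1, L = kappa
Q = np.diag(eigs)

m, L = eigs[0], eigs[-1]
# Heavy-ball parameters giving the optimal linear convergence rate on quadratics
alpha = 4.0 / (np.sqrt(L) + np.sqrt(m)) ** 2
beta = ((np.sqrt(L) - np.sqrt(m)) / (np.sqrt(L) + np.sqrt(m))) ** 2

sigma = 0.1        # standard deviation of the additive gradient noise
iters = 5000
x_prev = x = np.ones(n)
errors = []
for _ in range(iters):
    grad = Q @ x + sigma * rng.standard_normal(n)
    x, x_prev = x + beta * (x - x_prev) - alpha * grad, x
    errors.append(np.linalg.norm(x))  # distance to the optimum x* = 0

# Early iterations contract at the accelerated linear rate, but the iterates
# then fluctuate around a noise floor: the steady-state variance that is
# traded off against convergence rate.
print(f"transient peak:     {np.max(errors[:100]):.3e}")
print(f"steady-state level: {np.mean(errors[-1000:]):.3e}")
```

Sweeping `sigma` or `kappa` in this sketch shows the scaling the dissertation studies: faster-converging parameter choices amplify the noise floor, and the transient peak grows with the square root of the condition number.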