A Stochastic Conjugate Subgradient Framework for Large-scale
Stochastic Optimization Problems
by
Di Zhang
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(INDUSTRIAL & SYSTEMS ENGINEERING)
December 2024
Copyright 2024 Di Zhang
I dedicate this thesis to my family and friends,
for their inspiration and support.
Acknowledgements
This dissertation is funded by the following grants: AFOSR 20-1-0006 (Chapter 3), NSF FA9550-
434 (Chapter 4), and DOE Grant DE-SC002361 (Chapter 5).
I would like to thank my advisor Dr. Suvrajeet Sen and all the members in USC3DLAB.
I would like to thank my committee members.
Finally, I would like to thank all of my friends and my family.
Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Preamble
Chapter 1: Introduction and Motivation
    1.1 Introduction
        1.1.1 Stochastic Optimization vs. Robust Optimization
        1.1.2 Stochastic Optimization vs. Distributionally Robust Optimization
        1.1.3 Stochastic Optimization vs. Optimization with Chance Constraints
    1.2 Motivation
        1.2.1 Major Approaches for SO
        1.2.2 Common Challenges
    1.3 Contributions
        1.3.1 Adaptive Sampling
        1.3.2 Stochastic Conjugate Subgradient Direction
        1.3.3 Line-search Algorithm to Determine the Step Size
    1.4 Structure of the Dissertation
Chapter 2: Models, Algorithms and Convergence Properties
    2.1 Non-smooth Conjugate Subgradient Algorithm
        2.1.1 Problem setup
        2.1.2 Non-smooth Conjugate Subgradient Algorithm
        2.1.3 Properties of Non-smooth Conjugate Subgradient Algorithm
    2.2 Constrained Non-smooth Conjugate Subgradient Algorithm
        2.2.1 Problem setup
        2.2.2 Conjugate Subgradient Algorithm for Equality Constrained Problems
        2.2.3 Properties of Active Set Non-smooth Conjugate Subgradient Algorithm
    2.3 Stochastic Conjugate Subgradient Algorithm
        2.3.1 Problem setup
        2.3.2 Basic SCS algorithm
        2.3.3 Comparison with stochastic approximation
Chapter 3: Stochastic Conjugate Subgradient Algorithm for Kernel Support Vector Machines
    3.1 Introduction
        3.1.1 Stochastic first-order methods
        3.1.2 Stochastic methods for general non-linear optimization
        3.1.3 Stochastic conjugate gradient method
        3.1.4 Contributions
    3.2 Stochastic Conjugate Subgradient (SCS) Algorithm
    3.3 Convergence Properties
        3.3.1 Stochastic Local Approximations
        3.3.2 Sufficient Decreases and Supermartingale Property
        3.3.3 Convergence Rate
        3.3.4 Optimality Condition
    3.4 Preliminary Computational Results
    3.5 Conclusion
Chapter 4: Stochastic Conjugate Subgradient Algorithm for Two-stage Stochastic Programming
    4.1 Introduction
    4.2 Connections with the literature
    4.3 Stochastic Conjugate Subgradient Algorithm
    4.4 SCS for Two-stage Stochastic Quadratic Programming
    4.5 Convergence Properties
        4.5.1 Sample Complexity Analysis
        4.5.2 Feasible direction and sufficient decrease
        4.5.3 Convergence Rate and Optimality Condition
    4.6 Preliminary Computational Results
    4.7 Conclusion
Chapter 5: A Sampling-based Progressive Hedging Algorithm for Stochastic Programming
    5.1 Introduction
        5.1.1 Problem setup
        5.1.2 Classic progressive hedging algorithm
        5.1.3 Adaptive sampling
        5.1.4 Stochastic conjugate gradient method
        5.1.5 Penalty parameter and line-search algorithm
    5.2 Sampling-based Progressive Hedging Algorithm
    5.3 Convergence Properties
        5.3.1 Scenario Dual Objective Functions
        5.3.2 Stochastic Local Approximation
        5.3.3 Sufficient Increase and Submartingale Property
        5.3.4 Convergence Rate
        5.3.5 Optimality condition
    5.4 Preliminary Computational Results
    5.5 Conclusion
References
List of Tables

1.1 Comparison of Stochastic Optimization (SO) with other stochastic frameworks
3.1 Classification performance of algorithms for different data sets
3.2 Definitions of Symbols in Section 1
3.3 Definitions of Symbols in Section 2
3.4 Definitions of Symbols in Section 3
List of Figures

3.1 $\{f(\hat{\alpha}_k)\}_{k=1}^{k=50}$ for different combinations (data, algorithm).
3.2 $\{f(\hat{\alpha}_k)\}_{k=-50}^{k=-1}$ for different combinations (data, algorithm).
4.1 $\{f(\hat{x}_k)\}_{k=1}^{k=50}$ for different combinations (data, algorithm).
4.2 $\{f(\hat{x}_k)\}_{k=-50}^{k=-1}$ for different combinations (data, algorithm).
5.1 Dual objective values for different combinations (data, algorithm).
Preamble
Stochastic Optimization is a cornerstone of operations research, providing a framework to solve
optimization problems under uncertainty. Despite the development of numerous algorithms to
tackle these problems, several persistent challenges remain, including: (1) selecting an appropriate
sample size, (2) determining an effective search direction, and (3) choosing a proper step size. This
dissertation introduces a comprehensive framework, the Stochastic Conjugate Subgradient (SCS)
framework, designed to systematically address these challenges. Specifically, the SCS framework offers structured approaches to determining the sample size, the search direction, and the step size. By integrating various stochastic algorithms within the SCS framework, we have developed novel stochastic algorithms. The convergence and convergence rates of these algorithms have been rigorously established, with preliminary computational results supporting the theoretical analysis.
Chapter 1
Introduction and Motivation
1.1 Introduction
Stochastic Optimization (SO) is a fundamental framework in operations research and decision-making under uncertainty [7, 66]. It involves optimizing an objective function that depends on random variables, with the goal of making decisions that perform well on average or with high probability. The typical objective in SO is to minimize (or maximize) the expected value of an objective function:
$$\min_{x \in X} \; \mathbb{E}_{\xi}[f(x,\xi)], \tag{1.1}$$
where $x$ is the decision variable, $\xi$ represents the random vector, and $f(x,\xi)$ is the objective function dependent on $x$ and $\xi$.
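To make the sample-average idea concrete before the formal treatment in later chapters, the following sketch estimates the expectation in (1.1) for one fixed decision by Monte Carlo. The newsvendor-style cost and the exponential demand distribution are hypothetical illustrations, not data from this dissertation.

```python
import numpy as np

def f(x, xi):
    # Hypothetical newsvendor-style cost: order x units at unit cost 1.0,
    # sell up to the realized demand xi at price 1.5 (negative value = profit).
    return 1.0 * x - 1.5 * np.minimum(x, xi)

rng = np.random.default_rng(0)
xi_samples = rng.exponential(scale=10.0, size=100_000)  # draws of the random vector

x = 12.0
saa_estimate = f(x, xi_samples).mean()  # sample average approximation of E[f(x, xi)]
print(f"estimated E[f({x}, xi)] ~ {saa_estimate:.3f}")
```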
While SO is widely used in various applications such as finance [61], machine learning [63], inventory control [40], power systems [25], etc., it is often entangled with other stochastic frameworks
like Robust Optimization (RO), Distributionally Robust Optimization (DRO), and Optimization
with Chance Constraints (OCC). Each of these frameworks offers a different approach to handling
uncertainty, and understanding their relationships with SO is critical for selecting the appropriate
methodology for specific problems. Thus, before we introduce our methodology to solve SO, we
will first explore these relationships, highlighting the similarities, differences, and formulation of
each approach in relation to SO.
1.1.1 Stochastic Optimization vs. Robust Optimization
In contrast to SO, RO does not rely on probabilistic information but instead assumes that uncertain
parameters lie within a predefined uncertainty set U [5, 6]. The objective in RO is to find a solution
that performs well under the worst-case scenario within this set:
$$\min_{x \in X} \max_{\xi \in U} f(x,\xi).$$
Constraints in RO are often similarly robustified to ensure feasibility under all realizations within
the uncertainty set. A specific comparison of SO and RO is summarized as follows:
• Uncertainty Handling: SO models uncertainty using probability distributions, allowing for
an expected-value approach, while RO focuses on worst-case scenarios within a deterministic uncertainty set.
• Flexibility vs. Conservatism: SO is more flexible as it leverages probabilistic information,
but it can be sensitive to inaccuracies in the assumed distributions. RO, on the other hand,
is more conservative, providing solutions that are less dependent on specific probabilistic
assumptions but potentially over-conservative due to its focus on worst-case scenarios.
• Computational Complexity: SO often involves solving large-scale scenario-based models,
which can be computationally intensive. RO tends to be more tractable, especially when the
uncertainty set is convex, but it may result in overly conservative solutions.
1.1.2 Stochastic Optimization vs. Distributionally Robust Optimization
DRO [20, 53] extends the concept of RO by incorporating ambiguity in the probability distribution
itself. Instead of assuming a fixed probability distribution, DRO optimizes the objective function
against the worst-case distribution within a set of plausible distributions, known as the ambiguity
set P:
$$\min_{x \in X} \sup_{P \in \mathcal{P}} \mathbb{E}_P[f(x,\xi)].$$
This formulation generalizes both SO and RO, offering a balance between flexibility and robustness. Here is a comparison between SO and DRO:
• Ambiguity vs. Fixed Distribution: While SO assumes a known and fixed distribution,
DRO considers the possibility of distributional uncertainty and protects against the worst-case distribution within an ambiguity set. This makes DRO more robust than SO in situations
where the true distribution is uncertain or only partially known.
• Balance Between SO and RO: DRO can be seen as a middle ground between SO and
RO. It retains some of the flexibility of SO by using probabilistic models but incorporates
robustness by considering a range of possible distributions, thus mitigating the risk of model
misspecification.
• Robustness and Performance Trade-off: DRO provides a more balanced approach by ensuring robustness to distributional errors without the extreme conservatism of RO. However,
introducing an ambiguity set requires balancing between the representation of model uncertainty, model fidelity, and their interplay with model complexity.
1.1.3 Stochastic Optimization vs. Optimization with Chance Constraints
OCC [7, 15] is a specific approach within SO that focuses on satisfying constraints with a high
probability. Instead of optimizing the expected value of an objective function, OCC ensures that
constraints are met with a probability greater than a specified threshold 1−ε:
$$\min_{x \in X} \; f(x) \quad \text{subject to} \quad P(g(x,\xi) \le 0) \ge 1-\varepsilon,$$
where $\varepsilon$ is a small tolerance level and $g(x,\xi)$ represents the constraint function. A comparison of SO and OCC follows:
• Objective vs. Constraints: SO generally focuses on optimizing the expected value of the
objective function, whereas OCC focuses on ensuring that constraints are satisfied with a
high probability. This makes OCC particularly relevant in risk-sensitive applications where
constraint violations are costly or unacceptable.
• Probability vs. Expectation: While SO uses expectations to handle uncertainty, OCC uses
probability levels to ensure that the solution meets certain reliability standards. This probabilistic constraint approach can be seen as a complementary method to the expectation-based
approach of SO.
• Applicability: OCC is often used in fields like finance, energy, and supply chain management, where the risk of constraint violations must be kept below a certain level. It is especially useful when decision-makers need to guarantee performance; a sketch of how such a probabilistic guarantee can be estimated numerically follows this list.
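As a minimal numerical illustration of the chance constraint above, one can estimate $P(g(x,\xi) \le 0)$ by sampling and compare it with $1-\varepsilon$. The constraint function, distribution, and tolerance below are hypothetical, chosen only to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.array([1.0, 2.0])                          # candidate decision
xi = rng.lognormal(sigma=0.5, size=(50_000, 2))   # samples of the random coefficients

g_vals = xi @ x - 10.0                            # g(x, xi) <= 0 means "capacity respected"
prob = np.mean(g_vals <= 0.0)
eps = 0.05
print(f"estimated P(g(x, xi) <= 0) = {prob:.4f}, feasible: {prob >= 1 - eps}")
```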
Method | Formulation | Uncertainty Handling | Key Focus | Conservatism
SO | $\min_{x\in X} \mathbb{E}_{\xi}[f(x,\xi)]$ | Probabilistic | Expected-value optimization | Moderate
RO | $\min_{x\in X} \max_{\xi\in U} f(x,\xi)$ | Deterministic uncertainty set | Worst-case optimization | High
DRO | $\min_{x\in X} \sup_{P\in\mathcal{P}} \mathbb{E}_P[f(x,\xi)]$ | Distributional ambiguity set | Worst-case distribution optimization | High to Moderate
OCC | $\min_{x\in X} f(x)$ s.t. $P(g(x,\xi)\le 0)\ge 1-\varepsilon$ | Probabilistic | Constraint satisfaction | Moderate

Table 1.1: Comparison of Stochastic Optimization (SO) with other stochastic frameworks
Table 1.1 gives a comprehensive comparison of SO with other stochastic frameworks. For the rest
of the dissertation, we will focus on solving SO problems.
1.2 Motivation
Over the years, several methodologies have been developed to address SO problems, each with its
own strengths and applicability. We will list a few major approaches and some common challenges
among these approaches.
1.2.1 Major Approaches for SO
• Stochastic Gradient Descent (SGD) or Stochastic Approximation is a well-known optimization technique, particularly in machine learning. It iteratively updates the solution by
taking steps proportional to the negative gradient of the objective function, which is calculated using randomly selected data points or scenarios. Originating from the work of Robbins
and Monro [54], SGD has been adapted for stochastic optimization problems where the objective function is an expectation. Its efficiency in handling large-scale problems has made
it a standard tool in fields like deep learning and online optimization.
• Deterministic Algorithm through Sample Average Approximation (SAA). Originating from the paper "Linear programming under uncertainty" by Dantzig [18], the SAA method quickly became a powerful and straightforward approach for stochastic programming, particularly useful when the objective function involves expectations. SAA approximates the expected
useful when the objective function involves expectations. SAA approximates the expected
value by averaging over a finite number of sampled scenarios. This transforms the stochastic
problem into a deterministic one that can be solved using traditional optimization techniques.
While SAA is computationally intensive, requiring a large number of samples for accuracy,
it remains a popular choice in practice due to its simplicity and effectiveness in various
domains such as finance and logistics.
• Stochastic Decomposition (SD) is a widely used technique, particularly for solving large-scale two-stage stochastic linear programs. Initially developed by Higle and Sen [27], SD
extends the principles of Benders Decomposition by incorporating stochastic elements. It
iteratively refines the solution by solving a master problem and generating cuts based on
scenario-specific subproblems. Although the number of cuts in the approximation could
grow without bound at first, this handicap was overcome in 1994 [28], where the objective function is approximated by a finite number ($n_1 + 3$) of inexact subgradients, thus curtailing
the size of the approximations to a fixed number, regardless of the number of outcomes in
the sample space. The method is especially useful in problems where the scenario space is
large, allowing for efficient handling of uncertainty.
• Progressive Hedging (PH) is an algorithm designed for multi-stage stochastic programs.
Introduced by Rockafellar and Wets [55], PH decomposes the original problem by scenarios, solving each independently and then coordinating the solutions through a penalty
mechanism. This approach is particularly effective for non-convex problems and has seen
widespread application in areas like energy management and supply chain optimization due
to its ability to handle complex, scenario-dependent decisions.
• Stochastic Trust Region method. For general non-linear stochastic optimization, there are
two broad classes of algorithms, most of which are stochastic generalizations of smooth deterministic optimization algorithms. For gradient based direction finding, stochastic analogs
were studied by [47], whereas stochastic analogs of trust region methods (STORM) appear
in a series of papers [2, 14, 16]. From a computational standpoint, direction-finding algorithms may have faster iterations, whereas the trust region methods may provide stronger
convergence properties [9]. However, the aforementioned stochastic methods rely on the existence of gradients which may not be appropriate for non-smooth stochastic problems (5.1).
Readers interested in the non-smooth case can refer to [36].
• Dynamic Programming (DP) is a classical approach for solving multi-stage stochastic
problems, where decisions are made sequentially over time. Based on Bellman’s Principle
of Optimality [4], DP breaks down the problem into simpler subproblems, each representing
a stage in the decision process. However, DP is often limited by the "curse of dimensionality," which occurs when the state and action spaces become prohibitively large. To mitigate
these challenges, Reinforcement Learning (RL) has emerged as a powerful tool for sequential decision-making under uncertainty. RL, particularly in its modern form [70, 68], approximates the optimal policy or value function through interactions with the environment,
making it well-suited for problems where the system dynamics are complex or unknown.
RL has gained prominence in fields such as robotics, finance, and operations management,
where it offers a flexible and scalable alternative to traditional DP methods.
• Methods for multi-stage SO: Among the prominent methods developed for multi-stage stochastic programming (MSP), the Stochastic Dual Dynamic Programming (SDDP) approach introduced by Pereira and Pinto [49]
stands out for its effectiveness in solving large-scale problems. However, as the complexity of multistage problems increases, particularly with constraints that change over time,
newer methods have been developed to enhance computational efficiency and solution accuracy. Recent advancements include the Multistage Stochastic Decomposition (MSD) approach by Sen and Zhou [62], which builds upon traditional stochastic decomposition methods to better handle the complexities inherent in multistage problems. Additionally, the
Stochastic Dual Linear Programming (SDLP) method proposed by Gangammavar and Sen [24]
extends the Regularized Stochastic Decomposition (SD) approach from 1994. The SDLP
method is particularly notable for its ability to represent all approximations of the objective
function with a finite number ($n_1 + 3$) of piecewise-linear representations, providing a more refined approach compared to SDDP. These methods highlight the ongoing evolution of techniques in
multi-stage SO, addressing the challenges posed by dynamic constraints and the need for
accurate, computationally feasible solutions.
1.2.2 Common Challenges
Several common challenges arise across different methods. These include determining the appropriate sample size, selecting the correct search direction, and setting the optimal step size. We
provide several specific examples from the above approaches.
• The selection of an appropriate sample size is crucial in methods like SAA, SGD and SD.
In SAA, the sample size directly impacts the accuracy of the approximation of the expected
objective function. A larger sample size generally leads to a more accurate approximation,
reducing the variance of the solution. However, increasing the sample size also escalates the
computational burden. In practice, a common approach is to conduct a sensitivity analysis
by solving the SAA problem for different sample sizes and examining the stability of the
solutions. If the solutions converge as the sample size increases, it indicates that the chosen
sample size is sufficient. In SGD, the sample size typically refers to the number of data points
used to compute the gradient at each iteration (e.g., in mini-batch SGD). Smaller sample
sizes (mini-batches) allow for faster iterations but introduce higher variance in the gradient
estimates. On the other hand, larger sample sizes reduce variance but increase computation
per iteration. The choice often depends on the trade-off between computational efficiency
and the need for stable convergence. Practical guidelines suggest using mini-batches with
sizes that balance these factors, such as 32 or 64 samples per iteration, commonly seen in machine learning applications. In the context of the regularized SD method, Higle and Sen [28]
demonstrated that for a problem with n decision variables, the SD algorithm only needs to
retain n + 3 cuts. This finding indicates that the SD algorithm can iteratively construct an
increasingly accurate approximation of the recourse function, eventually converging to the
optimal solution with a sufficient number of cuts relative to the problem’s dimensionality.
Since each cut corresponds to a sample, this result provides valuable guidance on selecting
an appropriate sample size.
• The search direction is a critical aspect in optimization algorithms, particularly in methods like SGD and DP. The search direction in SGD is determined by the gradient of the
objective function with respect to the decision variables. However, since the gradient is
estimated based on a sample, the direction can be noisy. To address this, techniques like
momentum are used, where the search direction is a weighted average of the previous direction and the current gradient. This helps smooth out fluctuations caused by noisy gradient
estimates and accelerates convergence, particularly in non-convex optimization problems.
In DP, the search direction is implicitly defined by the Bellman equation, which recursively
breaks down the problem into subproblems. However, in high-dimensional spaces, directly
computing the search direction is infeasible. Reinforcement Learning (RL) methods, such
as Q-learning or policy gradient methods, approximate the optimal policy or value function to guide the search direction. In policy gradient methods, for example, the direction
is determined by the gradient of the expected reward with respect to the policy parameters.
Techniques like actor-critic methods further refine this direction by combining value function estimates (critic) with policy updates (actor).
• The step size in optimization algorithms controls the magnitude of the update in each iteration and is crucial in ensuring convergence and efficiency. In SGD, the step size, also
known as the learning rate, is one of the most important hyperparameters. A step size that
is too large can cause the algorithm to overshoot the optimal solution, leading to divergence.
Conversely, a step size that is too small can slow down convergence. Adaptive methods like AdaGrad, RMSprop, and Adam dynamically adjust the step size during the optimization process based on past gradient information, improving convergence speed and stability (see the sketch following this list).
These methods are particularly effective in deep learning, where the landscape of the objective function can be highly complex. In the Progressive Hedging Algorithm, the step size in
dual update corresponds to the penalty parameter used to enforce consistency across scenario
subproblems. The choice of this parameter affects the convergence rate of the algorithm. A
larger penalty can lead to faster convergence but may cause instability if too large. Typically, the penalty parameter is adjusted adaptively based on the progress of the algorithm,
balancing convergence speed and solution stability.
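To see how the three choices (sample size, search direction, step size) interact in the simplest setting, the sketch below runs mini-batch SGD on a least-squares problem. The data, the batch size of 32, and the diminishing rule $t_k = c/k$ are illustrative assumptions only, not this dissertation's method.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(10_000, 5))                        # hypothetical data matrix
x_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
b = A @ x_true + rng.normal(scale=0.1, size=10_000)     # noisy observations

x = np.zeros(5)
batch, c = 32, 0.5                                      # sample size and step constant
for k in range(1, 501):
    idx = rng.choice(len(b), size=batch, replace=False)    # sample size choice
    grad = 2.0 * A[idx].T @ (A[idx] @ x - b[idx]) / batch  # (noisy) search direction
    x -= (c / k) * grad                                 # diminishing step size t_k = c / k
print("recovered x:", np.round(x, 2))
```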
1.3 Contributions
In order to give a comprehensive solution to the challenges in the previous section, we will introduce a stochastic conjugate subgradient (SCS) method which has the following benefits: (1) It
incorporates adaptive sequential sampling and sample complexity analysis to determine the sample
sizes; (2) The method employs a specialized stochastic direction-finding approach to handle both
nonlinearity and non-smooth aspects of the SO problem; (3) It also applies a line-search technique
to guarantee the sufficient decrease of the objective function value as well as the sufficient increase
of the directional derivative over the search direction.
1.3.1 Adaptive Sampling
Most of the aforementioned algorithms employ SAA which has a fixed sample size to estimate
these uncertainties. However, it is difficult to determine the sample size [59, 21] when applying such methods to large-scale SO algorithms. On the other hand, while adaptive sampling [41]
has emerged as a powerful approach for instances with a large number of outcomes (especially
for multi-dimensional uncertainty), such adaptive sampling has remained elusive in the context
of large-scale SO. Some effort has been made to combine adaptive sampling with some SO algorithms. For instance, [3] provides randomized PH methods which are able to produce new iterates as soon as a single scenario subproblem is solved, and [1] proposes an SBPHA approach which combines replication with the PH algorithm. However, these methods, although they provide a "randomized" algorithm, still sample a subset of scenarios from a fixed finite collection in each iteration, without dynamically determining the sample size based on the optimization procedure. On the other
hand, our proposed algorithm will strategically sample a subset of scenario subproblems in each
iteration, facilitating dynamic scenario updates to enhance computational efficiency. We will also
leverage the sample complexity analysis [69] to support the choice of the sample size, ensuring a
theoretically grounded approach.
1.3.2 Stochastic Conjugate Subgradient Direction
Stochastic First-Order (SFO) methods such as stochastic approximation (SA) [54, 11, 44] are very popular for solving SP problems. Simple first-order gradient-based methods dominate the field for convincing reasons: low computational cost, simplicity of implementation, and strong empirical results. However, while these methods have low computational requirements in any one iteration, there are some disadvantages: (a) the convergence rate of such algorithms depends heavily on the dimension of the decision variables [37]; (b) they are not known for their numerical accuracy, especially when the functions are non-smooth; and (c) they do not guarantee a sufficient decrease in each iteration.
In contrast to first-order methods, algorithms based on Conjugate Gradients (CG) provide finite termination when the objective function is a linear-quadratic deterministic function [46]. Furthermore, CG algorithms are also used for large-scale nonsmooth convex optimization problems
[73, 77]. However, despite their many strengths (e.g., fast per-iteration convergence, frequent explicit regularization of the step size, and better parallelization than first-order methods), the use of
conjugate gradient methods for Stochastic Optimization is uncommon. Although [31, 75] proposed some new stochastic conjugate gradient (SCG) algorithms, their assumptions do not support
non-smooth objective functions, and hence they are not directly applicable to general SO problems (which are potentially non-smooth).
In this dissertation, we adopt a direction-finding approach in which the direction is obtained via a specialized problem of finding a vector with the smallest norm over a line segment in $\mathbb{R}^n$.
This approach was originally designed for deterministic non-smooth convex optimization problems [72, 73]. In order to solve stochastic optimization problems when functions are non-smooth,
it is straightforward to imagine that a stochastic subgradient method could be used. Thus, we
propose a new SCS algorithm which accommodates both the curvature of a quadratic function,
as well as the decomposition of subgradients by data points, as is common in SP [27, 28]. This
combination enhances the power of stochastic first-order approaches without the additional burden
of second-order approximations. In other words, our method shares the spirit of Hessian-Free (HF) optimization, which has attracted some attention in the machine learning community [43, 17]. As our theoretical analysis and the computational results reveal, the new method provides a "sweet spot" at the intersection of speed, accuracy, and scalability of algorithms.
1.3.3 Line-search Algorithm to Determine the Step Size
The convergence of many SO algorithms is highly sensitive to the choice of the step size. For instance, in the original Progressive Hedging (PH) algorithm, Rockafellar and Wets briefly discussed
the effect of the penalty parameter on the performance of the PH algorithm and the quality of the
solution, but they did not provide any guidance on how to choose it. Various choices have since
been proposed, often differing with the application under consideration. Readers interested in the
selection of the penalty parameter can refer to [78], which offers a comprehensive literature review
of related methods. Another example is the original stochastic approximation algorithm. The step size in this algorithm decreases over time, specifically following the rule $t_k = c/k$, where $k$ is the iteration number and $c$ is a constant. The diminishing step size ensures that the updates become smaller as the algorithm progresses, allowing it to converge to the true root.
Our approach differs from these methods by employing a stochastic direction-finding and line-search technique [47, 2] derived from Wolfe's line-search method [72, 73]. It guarantees the sufficient decrease of the objective function value as well as the sufficient increase of the directional
derivative over the search direction. Compared with existing approaches, our method is more
mathematically motivated and adheres to the principles of optimization [13].
1.4 Structure of the Dissertation
The structure of the dissertation is summarized as follows. In Chapter 2, a new algorithm called the conjugate subgradient algorithm is introduced to solve non-smooth convex SAA versions of SP problems with or without constraints. It is then generalized to the stochastic conjugate subgradient (SCS) algorithm, which combines sequential SAA with non-smooth conjugate subgradient direction finding. In Chapter 3, we apply the SCS algorithm to kernel support vector machine (SVM) problems (a well-known machine learning model which can be treated as a two-stage SP without first-stage constraints). Convergence and convergence rates are established, and our experiments reveal that we can maintain (or even surpass) the scalability of existing first-order methods while improving both the speed and accuracy of optimization. In Chapter 4, the SCS algorithm is applied to general two-stage convex SP problems with constraints. In Chapter 5, SCS is used together with the progressive hedging (PH) algorithm to solve large-scale stochastic programming problems in parallel.
Chapter 2
Models, Algorithms and Convergence Properties
One of the most popular methods for solving smooth optimization problems is the steepest descent algorithm [39], which combines the gradient descent method with a line search. However,
when applied to quadratic problems, its performance is highly sensitive to the condition number.
In contrast, methods like conjugate gradient descent are proven to achieve finite convergence [46].
Consequently, for non-smooth quadratic programming problems, such as kernel SVMs (involving the sum of a quadratic and a non-smooth function, which we will discuss in Chapter 3), it is
natural to consider a non-smooth conjugate subgradient method. This method should serve as an
appropriate analog to the conjugate gradient descent algorithm and inherit similar convergence
properties. Moreover, since many nonlinear programming problems can be well-approximated locally by quadratic functions, such a method can be highly effective for general nonlinear problems.
Thus, in this chapter, we will first introduce a conjugate subgradient algorithm, which originated from Wolfe [73] and forms the foundation for the remainder of the chapter. In Section 2.2,
we will generalize this algorithm to handle problems with equality constraints. Finally, in Section
2.3, we will extend the algorithm to its stochastic form to address SP problems.
2.1 Non-smooth Conjugate Subgradient Algorithm
In this section, we introduce a conjugate subgradient algorithm originating from Wolfe's algorithm [72, 73] to solve general non-smooth convex functions. This algorithm has many good properties which serve as a foundation for the remainder of the chapter. Specifically, Theorem 2.2 shows that searching along the conjugate subgradient direction $d_k$ will always guarantee a sufficient decrease of $f$, while moving along an arbitrary subgradient direction may not ensure such a property. Theorem 2.3 shows that when the algorithm is applied to a smooth quadratic problem, it behaves almost the same as the smooth conjugate gradient algorithm. Theorem 2.4 shows that the algorithm has finite convergence even when the function is non-smooth, which generalizes the smooth conjugate gradient algorithm.
2.1.1 Problem setup
Many problems of mathematical programming pose the task of minimizing a convex, but not necessarily differentiable, function $f$ of several variables,
$$\min_{x \in \mathbb{R}^n} f(x). \tag{2.1}$$
For example, the mathematical formulation of the SAA version of a two-stage SQLP without first-stage constraints is
$$\min_{x \in \mathbb{R}^{n_1}} f(x) \triangleq \frac{1}{2} x^{T} Q x + c^{T} x + \frac{1}{n}\sum_{i=1}^{n} h_i(x,\omega_i),$$
where $h_i(x,\omega_i)$ is defined as
$$h_i(x,\omega_i) = h_{QL}(x,\omega_i) \triangleq \min_{y \ge 0,\, y \in \mathbb{R}^{n_2}} d^{T} y \quad \text{s.t. } Dy = \xi(\omega_i) - C(\omega_i)x. \tag{2.2}$$
Here, $D \in \mathbb{R}^{m_2 \times n_2}$ is a deterministic matrix, $\xi(\omega_i)$ denotes a random vector, and $C(\omega_i)$ denotes a random matrix. We also assume that $Q$ is a positive definite matrix and the second-stage cost vector $d$ is fixed.
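Since each $h_i(x,\omega_i)$ in (2.2) is itself the optimal value of a linear program, every evaluation of $f$ entails LP solves. A minimal sketch of one such second-stage evaluation, on a small hypothetical instance and using scipy's HiGHS-based linprog, is:

```python
import numpy as np
from scipy.optimize import linprog

def h_QL(x, xi, C, D, d):
    """Second-stage value h(x, omega) = min { d^T y : D y = xi - C x, y >= 0 }."""
    res = linprog(c=d, A_eq=D, b_eq=xi - C @ x, bounds=(0, None), method="highs")
    return res.fun  # None if the recourse problem is infeasible

# Hypothetical instance with n1 = 2, n2 = 3, m2 = 2.
D = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
d = np.array([1.0, 2.0, 0.5])
C = np.array([[0.5, 0.0],
              [0.0, 0.5]])
xi = np.array([4.0, 3.0])
print(h_QL(np.array([1.0, 1.0]), xi, C, D, d))  # -> 4.75 for this data
```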
In general, solving such problems efficiently is very difficult. Although first-order methods can be used, they may converge very slowly because an arbitrary subgradient direction cannot guarantee a sufficient decrease. Meanwhile, higher-order optimization algorithms are not applicable because the function is non-differentiable. Thus, we introduce a non-smooth conjugate subgradient algorithm which has finite convergence to solve the problem.
2.1.2 Non-smooth Conjugate Subgradient Algorithm
In this part, we focus on solving (2.1) using Wolfe’s non-smooth conjugate subgradient algorithm
(Algorithm 1).
Algorithm 1: Wolfe's Non-smooth Conjugate Subgradient Algorithm

Set starting point $x_0$, subgradient $g_0 \in \partial f(x_0)$, $G_0 = \{g_0\}$, $\varepsilon > 0$, $\delta$, and $a_0 > \delta$.
for $k = 0, 1, 2, 3, \ldots$ do
    Calculate $d_k = -\mathrm{Nr}(G_k)$.
    if $\|d_k\| < \varepsilon$ then
        if $|a_k| < \delta$ then
            the algorithm stops.
        else
            $x_{k+1} = x_k$, $g_{k+1} = g_k$, $G_{k+1} = \{g_{k+1}\}$, and $a_{k+1} = 0$.
    else
        Use Algorithm 2 to obtain a step size $t_k$ and set $x_{k+1} = x_k + t_k d_k$.
        Calculate $g_{k+1} \in \partial f(x_{k+1})$; set $G_{k+1} = \{-d_k, g_{k+1}\}$ and $a_{k+1} = a_k + t_k \|d_k\|$.
end for
The algorithm consists of three important components:
• Sequential subgradient calculation. Once the algorithm reaches a new point $x_k$, a new subgradient $g_k \in \partial f(x_k)$ is calculated.
• Direction finding. Wolfe's non-smooth conjugate subgradient method uses the smallest-norm convex combination of the previous search direction ($d_{k-1}$) and the current subgradient ($g_k$). More specifically, we first solve
$$\lambda_k = \operatorname*{argmin}_{\lambda \in [0,1]} \frac{1}{2}\|\lambda d_{k-1} + (1-\lambda) g_k\|^2. \tag{2.3}$$
The solution of (2.3) is
$$\lambda_k = \Pi_{[0,1]}\!\left( \frac{-\langle d_{k-1}, g_k\rangle + \|g_k\|^2}{\|d_{k-1}\|^2 - 2\langle d_{k-1}, g_k\rangle + \|g_k\|^2} \right),$$
where $\Pi_{[0,1]}$ denotes the projection operator onto $[0,1]$. Then we can set the new direction
$$d_k = \lambda_k d_{k-1} + (1-\lambda_k) g_k.$$
The above procedure can be summarized by an operator $\mathrm{Nr}(G)$, which denotes the point with the smallest Euclidean norm in the convex hull of $G$. We will let the new direction $d_k \triangleq -\mathrm{Nr}(G_k)$, where $G_k = \{g_k, -d_{k-1}\}$ (a small sketch of this computation follows this list).
• Choice of step size. This is achieved by Wolfe's line-search method [73]. Further details are provided in Algorithm 2.
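As promised above, here is a minimal sketch of the Nr computation for a two-element set, using the closed-form projection from (2.3). It assumes only numpy and the notation of this section.

```python
import numpy as np

def min_norm_pair(d_prev, g):
    """Smallest-norm point in the convex hull of {d_prev, g}, i.e. the
    minimizer of ||lam * d_prev + (1 - lam) * g||^2 over lam in [0, 1]."""
    denom = float(np.dot(d_prev - g, d_prev - g))
    if denom == 0.0:                                    # the two vectors coincide
        return g.copy()
    lam = (-np.dot(d_prev, g) + np.dot(g, g)) / denom   # unconstrained minimizer
    lam = min(max(lam, 0.0), 1.0)                       # projection onto [0, 1]
    return lam * d_prev + (1.0 - lam) * g

d_prev = np.array([1.0, 0.0])
g = np.array([0.0, 1.0])
print(min_norm_pair(d_prev, g))  # -> [0.5 0.5], the point of conv{d_prev, g} nearest 0
```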
Most of the logic of Algorithm 2 is motivated by Wolfe's line search algorithm [73]. Let $g(t) \in \partial f(x_k + t\, d_k)$ and define the intervals $L$ and $R$:
$$L = \{t > 0 \mid f(x_k + t\, d_k) - f(x_k) \le -m_2 \|d_k\|^2 t\},$$
$$R = \{t > 0 \mid \langle g(t), d_k\rangle \ge -m_1 \|d_k\|^2\}, \tag{2.4}$$
where $m_2 < m_1 < 0.5$. The output step size satisfies two criteria: (i) it identifies a set $L$ of points that sufficiently reduce the objective function value; (ii) it identifies a set $R$ over which the directional derivative estimate is improved. The algorithm seeks points that belong to $L \cap R$. With regard to (i), we note that replacing $d_k$ with an arbitrary subgradient $g_k$ may not ensure sufficient descent unless the function $f$ is smooth.
Algorithm 2: Basic Line Search Algorithm

1:  Set $m_2 < m_1 < 0.5$, $1 < n \in \mathbb{Z}^{+}$, and $b = \delta_k / n$.
2:  Choose $t$ with $t\|d_k\| \in [b, \delta_k]$.
3:  if $t \in L \setminus R$ then
4:      Double $t$ ($t = 2t$) until $t \in R$ or $t\|d_k\| > \delta_k$; set $I = [t/2, t]$.
5:      if $t\|d_k\| > \delta_k$ then
6:          return $t/2$.
7:      else if $t \in L \cap R$ then
8:          return $t$.
9:      else
10:         while $t \notin L \cap R$ do
11:             Set $t$ to be the midpoint of $I$.
12:             if $t \in R \setminus L$ then replace $I$ by its left half;
13:             else replace $I$ by its right half.
14:         end while
15:     end if
16: end if
17: if $t \in R \setminus L$ then
18:     Halve $t$ ($t = t/2$) until $t \in L$ or $t\|d_k\| < b$; set $I = [t, 2t]$.
19:     if $t \in L \cap R$ then
20:         return $t$.
21:     else if $t\|d_k\| < b$ then
22:         return $t = 0$.
23:     else
24:         Repeat lines 10-14 while $t \notin L \cap R$.
25:     end if
26: end if
27: return $t$.
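The following is a compact Python sketch of the doubling/halving-plus-bisection logic of Algorithm 2. It is a simplification under stated assumptions: the $\delta_k$ safeguard and the $t = 0$ exit are omitted, and $f$ is assumed convex and finite with a subgradient oracle, so that $L$ and $R$ behave as established in Theorem 2.2 below.

```python
import numpy as np

def wolfe_line_search(f, subgrad, x, d, m1=0.4, m2=0.2, t0=1.0, max_iter=50):
    """Seek t with f(x + t d) - f(x) <= -m2 * t * ||d||^2      (condition L)
    and <subgrad(x + t d), d> >= -m1 * ||d||^2                 (condition R)."""
    nd2 = float(np.dot(d, d))
    fx = f(x)
    in_L = lambda t: f(x + t * d) - fx <= -m2 * t * nd2
    in_R = lambda t: np.dot(subgrad(x + t * d), d) >= -m1 * nd2

    t = t0
    if in_L(t) and not in_R(t):
        while not in_R(t):          # step too short: keep doubling
            t *= 2.0
    elif in_R(t) and not in_L(t):
        while not in_L(t):          # step too long: keep halving
            t /= 2.0
    lo, hi = t / 2.0, 2.0 * t
    for _ in range(max_iter):       # bisection until t lands in L and R
        if in_L(t) and in_R(t):
            return t
        if in_R(t):                 # in R but not L: shrink from the right
            hi = t
        else:                       # in L but not R: grow from the left
            lo = t
        t = 0.5 * (lo + hi)
    return t

# Example on f(x) = |x1| + |x2| from x = (2, 3) along d = -(1, 1):
f = lambda x: abs(x[0]) + abs(x[1])
subgrad = lambda x: np.sign(x)      # a valid subgradient of the l1 norm
print(wolfe_line_search(f, subgrad, np.array([2.0, 3.0]), np.array([-1.0, -1.0])))
```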
2.1.3 Properties of Non-smooth Conjugate Subgradient Algorithm

Theorem 2.1. If $d_1 = -g_1$ and $d_k = -\operatorname{argmin}\{\|d\| : d \in \operatorname{conv}\{d_{k-1}, g_k\}\}$ for all $k > 1$, then
1. $-d_k \in \operatorname{conv}\{g_1, \ldots, g_k\}$.
2. If also $\|g_k\| \le C$ and $\langle d_k, g_k\rangle \le -m\|d_k\|^2$, then
$$\|d_k\| \le \frac{C}{(1-m)\sqrt{k+1}} \to 0 \quad \text{as } k \to \infty.$$

Proof. See Konnov 1998 [34].
Theorem 2.2. In each step of Algorithm 1, there exists $t_k$ which satisfies both the $L$ and $R$ conditions.

Proof. First, for $t > 0$, define
$$P(t) = \frac{f(x + t\, d) - f(x)}{t\|d\|^2}.$$
Note that $P(t)$ is continuous and monotone non-decreasing because $d$ is the steepest descent direction at $x$ (see [73] for details). Thus, we have $L = \{0 < t < t_1\}$. On the other hand, we claim that $R$ will be satisfied for $t$ large enough. Suppose this is not true, i.e., $\langle g(t), d_k\rangle < -m_1\|d_k\|^2$ for arbitrary $t > 0$. Then by convexity,
$$f(x_k + t\, d_k) - f(x_k) \le \langle g(t), t\, d_k\rangle < -m_1\|d_k\|^2 t \to -\infty,$$
which contradicts the assumption that $f$ is finite. Thus, $R = \{t > t_2\}$. If $t \notin R$, again by convexity and $m_2 < m_1$, we have
$$f(x_k + t\, d_k) - f(x_k) \le \langle g(t), t\, d_k\rangle < -m_1\|d_k\|^2 t < -m_2\|d_k\|^2 t,$$
which means $t \in L$. Finally, since $m_2 < m_1$ and $P(t)$ is continuous and non-decreasing, we must have $t_1 > t_2$, i.e., $L \cap R \ne \emptyset$.
The next theorem shows that if $f$ is a smooth quadratic, with $m_1 = m_2 = \varepsilon = 0$, $\delta = \infty$, and an exact minimization along direction $d_k$,
$$\min_t f(x_k + t d_k),$$
Algorithm 1 performs very similarly to the classic conjugate gradient algorithm. We first summarize Wolfe's smooth conjugate gradient algorithm as follows.

Algorithm 3: Wolfe's Conjugate Gradient Algorithm

Set starting point $x_0$, gradient $g_0 = -\nabla f(x_0)$, $G_0 = \{g_0\}$.
while $\|g_k\| \ne 0$ do
    Calculate $d_k = -\mathrm{Nr}(G_k)$.
    Calculate $t_k = \dfrac{-\langle g_k, d_k\rangle}{\langle d_k, Q d_k\rangle}$.
    Update $x_{k+1} = x_k + t_k d_k$.
    Set $g_{k+1} = \nabla f(x_{k+1}) = g_k + t_k Q d_k$.
    Set $G_{k+1} = \{-d_k, g_{k+1}\}$.
    Set $k = k + 1$.
end while
Theorem 2.3. If the function $f(x) = \langle x, Qx\rangle + \langle p, x\rangle$ is a smooth quadratic, then $\{d_j\}_{j=1}^{k}$ will be a conjugate system with respect to $Q$, and Algorithm 3 will converge in $n+1$ iterations.

In order to prove Theorem 2.3, we first state two lemmas. The first lemma shows that $d_k$ is a convex combination of all the previous subgradients, and the second lemma indicates that the gradients obtained in the iterations of Algorithm 3 are mutually orthogonal.

Lemma 2.1. If $d_0 = -g_0$ and $d_k = -\mathrm{Nr}(G_k)$, then $-d_k \in \operatorname{conv}\{g_1, \ldots, g_k\}$.

Proof. See [34].
Lemma 2.2. At iteration $k+1$ of Algorithm 3, we have $\langle g_{k+1}, g_j\rangle = 0$ for all $j \le k$.

Proof. The proof is based on mathematical induction. Clearly, when $j = 1$, we have
$$g_1 = g_0 + t_0 Q d_0 = g_0 + \frac{\langle g_0, g_0\rangle}{\langle g_0, Q g_0\rangle} Q(-g_0),$$
so that
$$\langle g_1, g_0\rangle = \langle g_0, g_0\rangle - \langle g_0, g_0\rangle \cdot \frac{\langle g_0, Q g_0\rangle}{\langle g_0, Q g_0\rangle} = 0.$$
Suppose the claim holds up to iteration $k$, i.e.,
$$\langle g_i, g_j\rangle = 0, \quad \forall j < i \le k. \tag{2.5}$$
We now show that $\langle g_{k+1}, g_j\rangle = 0$ for all $j \le k$. Since, from Lemma 2.1, $d_k = -\sum_{j \le k} \lambda_j g_j$ with $\sum_{j \le k} \lambda_j = 1$ and $\lambda_j \ge 0$, we have
$$\langle d_k, g_i\rangle = -\sum_{j \le k} \lambda_j \langle g_j, g_i\rangle = -\lambda_i \|g_i\|^2, \quad \forall i \le k, \tag{2.6}$$
where the last equality follows from (2.5). Since there exists $\lambda_i > 0$, we have $d_k \ne \vec{0}$. On the other hand, since $-d_k$ is the point of smallest norm in the convex hull of $\{g_i\}_{i=1}^{k}$, we have $\langle d_k, g_i\rangle \le -\|d_k\|^2$. Thus, together with (2.6), we have
$$\lambda_i \|g_i\|^2 \ge \|d_k\|^2 > 0, \quad \forall i \le k,$$
which implies that $\lambda_i > 0$ for all $i \le k$. With this result, to show $\langle g_{k+1}, g_j\rangle = 0$ for all $j \le k$, it suffices to show $\langle g_{k+1}, d_k\rangle = 0$. Since $g_{k+1} = g_k + t_k Q d_k$, we have
$$\langle g_{k+1}, d_k\rangle = \langle g_k, d_k\rangle - \langle g_k, d_k\rangle \cdot \frac{\langle d_k, Q d_k\rangle}{\langle d_k, Q d_k\rangle} = 0,$$
which shows that the induction hypothesis also holds for $k+1$.
Proof of Theorem 2.3:

Proof. According to Lemma 2.2, we have $d_k = -\sum_{j \le k} \lambda_j g_j$, where $\lambda_j > 0$ for all $j \le k$. Thus,
$$-\sum_{j \le k} \lambda_j \left(\langle g_j, d_k\rangle + \|d_k\|^2\right) = \|d_k\|^2 - \|d_k\|^2 = 0.$$
On the other hand, since $-d_k$ is the point of smallest norm in the convex hull of $\{g_i\}_{i=1}^{k}$, we have $\langle d_k, g_i\rangle \le -\|d_k\|^2$. We can conclude that $\langle d_k, g_i\rangle = -\|d_k\|^2$. Thus, $\langle d_k, g_i\rangle - \langle d_k, g_j\rangle = 0$ for all $i \ne j \le k$. Finally, letting $i = j+1$, we have
$$\langle d_k, g_{j+1}\rangle - \langle d_k, g_j\rangle = t_j \langle d_k, Q d_j\rangle = 0,$$
which implies that $\{d_j\}_{j=1}^{k}$ is a conjugate system with respect to $Q$.
Also, from Lemma 2.2, the $g_i$'s are mutually orthogonal and hence linearly independent. Thus, after at most $n+1$ iterations, we must have $g_{n+1} = \vec{0}$, which is the optimality condition for smooth convex functions.
The next theorem shows that Algorithm 1 will have a reset within a finite number of iterations. The theorem needs an assumption which bounds the norm of the subgradients.

Assumption 1: For any $x$ and any $g \in \partial f(x)$, we have $\|g\| \le C$.

Theorem 2.4. Under Assumption 1, Algorithm 1 will have $\|d_k\| \le \varepsilon$ after at most every $\frac{4C^2}{\varepsilon^2}$ iterations.

Proof. First, according to the definition of $d_{k+1}$, we have
$$d_{k+1} = \mathrm{Nr}(\{-d_k, g_{k+1}\}) = -(1-\theta)\, d_k + \theta g_{k+1}, \quad 0 \le \theta \le 1.$$
Then, the squared norm of $d_{k+1}$ will be
$$\|d_{k+1}\|^2 = (1-\theta)^2 \|d_k\|^2 - 2(1-\theta)\theta \langle d_k, g_{k+1}\rangle + \theta^2 \|g_{k+1}\|^2.$$
Together with the $R$ condition $-\langle d_k, g_{k+1}\rangle \le m_1 \|d_k\|^2 \le \frac{1}{2}\|d_k\|^2$ and Assumption 1, we have
$$\|d_{k+1}\|^2 \le (1-\theta)\|d_k\|^2 + \theta^2 C^2.$$
Letting $f(\theta) = (1-\theta)\|d_k\|^2 + \theta^2 C^2$ and minimizing over $\theta$, we have
$$\|d_{k+1}\|^2 \le f(\theta^{*}) = \|d_k\|^2 - \frac{\|d_k\|^4}{4C^2} = \|d_k\|^2\left(1 - \frac{\|d_k\|^2}{4C^2}\right),$$
which implies that
$$\frac{\|d_k\|^2}{\|d_{k+1}\|^2} \ge \frac{1}{1 - \frac{\|d_k\|^2}{4C^2}} \ge 1 + \frac{\|d_k\|^2}{4C^2}.$$
Thus,
$$\frac{1}{\|d_{k+1}\|^2} \ge \frac{1}{\|d_k\|^2} + \frac{1}{4C^2}.$$
Summing up from $1$ to $L$, the largest $L$ such that $\|d_L\| > \varepsilon$ satisfies
$$\frac{1}{\varepsilon^2} > \frac{1}{\|d_L\|^2} \ge \frac{L}{4C^2} + \frac{1}{\|d_0\|^2},$$
which implies $L \le \frac{4C^2}{\varepsilon^2}$.
2.2 Constrained Non-smooth Conjugate Subgradient Algorithm
In this section, we will introduce some variants of the conjugate subgradient algorithm to solve general non-smooth convex functions with linear equality constraints.
2.2.1 Problem setup
Many problems of mathematical programming pose the task of minimizing a convex, but not necessarily differentiable, function $f$ subject to several linear equality constraints,
$$\min_x f(x) \quad \text{s.t. } a_i^{T} x = b_i, \; i \in \mathcal{E}, \tag{2.7}$$
where $\mathcal{E}$ is a finite set of indices, the $a_i$'s and $x$ are vectors in $\mathbb{R}^n$, and the $b_i$'s are scalars. For example, the mathematical formulation of the SAA version of a two-stage SQLP is given below:
$$\min f(x) \triangleq \frac{1}{2} x^{T} Q x + c^{T} x + \frac{1}{n}\sum_{i=1}^{n} h_i(x,\omega_i) \quad \text{s.t. } x \in X = \{x : Ax = b\} \subseteq \mathbb{R}^{n_1},$$
where $h(x,\omega)$ is defined as
$$h(x,\omega_i) = h_{QL}(x,\omega_i) \triangleq \min \; d^{T} y \quad \text{s.t. } Dy = \xi(\omega_i) - C(\omega_i)x, \; y \ge 0, \; y \in \mathbb{R}^{n_2}. \tag{2.8}$$
Here $A \in \mathbb{R}^{m_1 \times n_1}$ is a deterministic matrix with row vectors denoted by $\{a_i\}_{i=1}^{m_1}$, and $D \in \mathbb{R}^{m_2 \times n_2}$ is a deterministic matrix. In addition, $\xi(\omega_i)$ denotes a random vector and $C(\omega_i)$ denotes a random matrix. We also assume that the second-stage cost vector $d$ is fixed and $Q$ is a positive definite matrix.
2.2.2 Conjugate Subgradient Algorithm for Equality Constrained Problems
In this part, we focus on solving (2.7) using Wolfe's non-smooth conjugate subgradient algorithm together with an active-set method to deal with the constraints. The algorithm consists of four important components:
• Sequential subgradient calculation. Similar to 2.1.2.
• Direction finding. Similar to 2.1.2.
• Direction modification. Since there are linear constraints in (2.7), $d_k$ may not be a feasible search direction. Thus, we modify the search direction $d_k$ to a feasible search direction $\tilde{d}_k$. Specifically, let $Z$ be a null-space basis of $A$, so that $AZ = 0$. Then the modified direction is
$$\tilde{d}_k = ZZ^{T} d_k \tag{2.9}$$
(a small sketch of this projection follows Algorithm 4).
• Choice of step size. This is achieved by Wolfe's line search method [73]. Let $g(t) \in \partial f(x_k + t\,\tilde{d}_k)$ and $\tilde{g}(t) = ZZ^{T} g(t)$. Define the intervals $L$ and $R$:
$$L = \{t > 0 \mid f(x_k + t\,\tilde{d}_k) - f(x_k) \le -m_2\|\tilde{d}_k\|^2 t\},$$
$$R = \{t > 0 \mid 0 > \langle \tilde{g}(t), \tilde{d}_k\rangle \ge -m_1\|\tilde{d}_k\|^2\}, \tag{2.10}$$
where $m_2 < m_1 < 1$. The output step size satisfies two criteria: (i) it identifies a set $L$ of points that sufficiently reduce the objective function value; (ii) it identifies a set $R$ over which the directional derivative estimate is improved. The algorithm seeks points that belong to $L \cap R$. Note that replacing $\tilde{d}_k$ with an arbitrary modified subgradient $\tilde{g}_k = ZZ^{T} g_k$ may not ensure sufficient descent unless the function $f_k$ is smooth.

Algorithm 4: Conjugate Subgradient Algorithm for Equality Constrained Problems

Set starting point $x_0$, subgradient $g_0 \in \partial f(x_0)$, $G_0 = \{g_0\}$, $\varepsilon > 0$, $\delta$, and $a_0 > \delta$.
Obtain the null-space basis $Z$.
for $k = 0, 1, 2, 3, \ldots$ do
    Calculate $d_k = -\mathrm{Nr}(G_k)$.
    Calculate the modified direction $\tilde{d}_k$ by (2.9).
    if $\|\tilde{d}_k\| < \varepsilon$ then
        if $|a_k| < \delta$ then
            the algorithm stops.
        else
            $x_{k+1} = x_k$, $\tilde{g}_{k+1} = \tilde{g}_k$, $G_{k+1} = \{g_{k+1}\}$, and $a_{k+1} = 0$.
    else
        Use Algorithm 2 to obtain a step size $t_k$ and set $x_{k+1} = x_k + t_k \tilde{d}_k$.
        Calculate $g_{k+1} \in \partial f(x_{k+1})$ and $\tilde{g}_{k+1} = ZZ^{T} g_{k+1}$.
        Set $G_{k+1} = \{-\tilde{d}_k, \tilde{g}_{k+1}\}$ and $a_{k+1} = a_k + t_k\|\tilde{d}_k\|$.
end for
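As promised above, a small sketch of the direction modification (2.9), using scipy to compute an orthonormal null-space basis $Z$ for a hypothetical constraint matrix $A$:

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, 1.0, 1.0]])       # one equality constraint: x1 + x2 + x3 = b
Z = null_space(A)                     # orthonormal basis of {z : A z = 0}

d = np.array([3.0, -1.0, 2.0])        # raw search direction (may violate A x = b)
d_tilde = Z @ (Z.T @ d)               # modified direction (2.9): Z Z^T d

print(np.allclose(A @ d_tilde, 0.0))  # True: moving along d_tilde preserves A x = b
```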
2.2.3 Properties of Active Set Non-smooth Conjugate Subgradient Algorithm

Theorem 2.5. In each step of Algorithm 4, $\tilde{d}_k$ is a feasible direction.

Proof. The proof is well-documented in the literature on linearly constrained optimization (e.g., Nocedal and Wright [46]).
Theorem 2.6. In each step of Algorithm 4, if $f$ has a Lipschitz constant $0 < L < \infty$, then there exists $t_k > 0$ such that
$$f(x_k + t_k\, \tilde{d}_k) - f(x_k) \le -m_2 t_k \|\tilde{d}_k\|^2. \tag{2.11}$$

Proof. First, since $-\tilde{d}_k$ is the point of smallest norm in the convex hull of $\{\tilde{g}_i\}_{i=1}^{k}$, we have $\langle \tilde{d}_k, \tilde{g}_i\rangle \le -\|\tilde{d}_k\|^2$; in particular,
$$\|\tilde{d}_k\|^2 \le -\langle \tilde{d}_k, \tilde{g}_k\rangle. \tag{2.12}$$
We then divide the proof into two cases.

Case 1: There exists $t_k$ such that the $R$ condition is satisfied, i.e.,
$$\langle \tilde{g}(t_k), \tilde{d}_k\rangle \ge -m_1\|\tilde{d}_k\|^2.$$
According to (2.12), we can equivalently write
$$\langle \tilde{g}(t_k), \tilde{d}_k\rangle - \langle \tilde{d}_k, \tilde{g}_k\rangle \ge \langle \tilde{g}(t_k), \tilde{d}_k\rangle + \|\tilde{d}_k\|^2 \ge (1-m_1)\|\tilde{d}_k\|^2.$$
Then, we have
$$(1-m_1)\|\tilde{d}_k\|^2 \le \langle \tilde{g}(t_k), \tilde{d}_k\rangle - \langle \tilde{d}_k, \tilde{g}_k\rangle \le t_k L \|\tilde{d}_k\|^2.$$
Thus, we have $t_k \ge (1-m_1)/L$.

Case 2: The $t_k$ in Case 1 makes some constraint $l \notin J_k$ violated. Then, as $t_k$ increases from $0$, we can find $l_k$ such that
$$a_{l_k}^{T}(x_k + t_k \tilde{d}_k) = b_{l_k}.$$
Then,
$$t_k \|a_{l_k}\| \cdot \|\tilde{d}_k\| \ge t_k |a_{l_k}^{T} \tilde{d}_k| = |a_{l_k}^{T} x_k - b_{l_k}| \ge \tau.$$
Thus,
$$t_k \ge \frac{\tau}{\|a_{l_k}\| \cdot \|\tilde{d}_k\|}.$$
2.3 Stochastic Conjugate Subgradient Algorithm
2.3.1 Problem setup
In this section, we focus on the task of solving a two-stage Stochastic Programming (SP) problem
$$\min_{x \in \mathbb{R}^{n_1}} f(x) = c(x) + \mathbb{E}_{\omega}[h(x,\omega)], \tag{2.13}$$
where $c(\cdot)$ is a differentiable convex deterministic function, $\omega \in \mathbb{R}^{m}$ is a continuous random vector, and $h(\cdot,\omega)$ is a non-smooth convex random function. Here, $h(x,\omega)$ is called the recourse function and its value is usually obtained by solving another constrained optimization problem:
$$h(x,\omega) \triangleq \min_{y} g(y) \quad \text{s.t. } Dy = \xi(\omega) - C(\omega)x, \; y \ge 0, \; y \in \mathbb{R}^{n_2},$$
where $g(y)$ can be either a linear function $d^{T} y$ or a convex quadratic function $\frac{1}{2} y^{T} P y + d^{T} y$. For this reason, we will call $x$ and $y$ the first-stage and second-stage decision variables, respectively. For notational convenience, we will let $Y(x,\omega) \triangleq \{0 \le y \in \mathbb{R}^{n_2} \mid Dy = \xi(\omega) - C(\omega)x\}$ throughout this chapter.
2.3.2 Basic SCS algorithm
The main ingredients of Algorithm 5 include three important components:
• Sequential function approximation. In many cases we have a fixed number of data points. However, our focus is on situations where the data can be queried from an oracle sequentially. As a result, we will not fix the value of $m$ in (4.1). Instead, we use an online version of sample average approximation (OSAA) [27, 28] to approximate the function $f$ using $f_k$, where
$$f_k(x) = c(x) + \frac{1}{|S_k|}\sum_{i=1}^{|S_k|} h(x,\omega_i). \tag{2.14}$$
• Direction finding. Similar to 2.1.2.
• Choice of step size. Similar to 2.1.2.

Algorithm 5: Basic SCS Algorithm

Set $\varepsilon > 0$, $k \leftarrow 0$.
Randomly generate $S_0$ samples from the data set.
Set a feasible solution $x_0 \in \mathbb{R}^{n_1}$ and an initial direction $d_0 \in \mathbb{R}^{n_1}$.
$f_0(x) = c(x) + \frac{1}{|S_0|}\sum_{i=1}^{|S_0|} h(x,\omega_i)$.
while $\|d_k\| > \varepsilon$ do
    $k = k + 1$.
    Obtain $g_k \in \partial f_{k-1}(x_k)$.
    Let $G_k = \{g_k, -d_{k-1}\}$ and calculate $d_k = -\mathrm{Nr}(G_k)$.
    Apply Algorithm 2 to find step size $t_k$ and set $x_{k+1} = x_k + t_k d_k$.
    Randomly generate a set of new samples $S_k$.
    Construct $f_k(x) = c(x) + \frac{1}{|S_k|}\sum_{i=1}^{|S_k|} h(x,\omega_i)$.
end while
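A schematic rendering of Algorithm 5 is given below. The oracles for $c$, the recourse subgradient, and the fixed step size are placeholder assumptions (the actual step size comes from Algorithm 2, and the sample-size rules are analyzed in later chapters); the sketch only shows how resampling, the Nr direction, and the update fit together.

```python
import numpy as np

def min_norm_pair(u, v):
    """Smallest-norm point in conv{u, v} (closed form from Section 2.1.2)."""
    denom = float(np.dot(u - v, u - v))
    lam = 0.0 if denom == 0.0 else min(max((np.dot(v, v) - np.dot(u, v)) / denom, 0.0), 1.0)
    return lam * u + (1.0 - lam) * v

rng = np.random.default_rng(3)
c_grad = lambda x: x                                   # c(x) = 0.5 ||x||^2
h_subgrad = lambda x, w: -w if 1.0 - w @ x > 0 else np.zeros_like(x)  # hinge-type recourse proxy

x, d = np.zeros(2), None
for k in range(100):
    S = rng.normal(loc=1.0, size=(50, 2))              # fresh sample set S_k
    g = c_grad(x) + np.mean([h_subgrad(x, w) for w in S], axis=0)  # g_k from f_{k-1}
    d = -g if d is None else -min_norm_pair(g, -d)     # d_k = -Nr({g_k, -d_{k-1}})
    x = x + 0.1 * d                                    # fixed step in place of Algorithm 2
print("x after 100 SCS-style iterations:", np.round(x, 3))
```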
2.3.3 Comparison with stochastic approximation
Consider a standard stochastic programming problem
$$\min_x \mathbb{E}_{\omega}[f(x,\omega)].$$
The main steps of stochastic approximation (SA) at iteration $k$ are:
• (1) Generate $m$ i.i.d. realizations $\omega_i$, $i = 1, 2, \ldots, m$.
• (2) Calculate the subgradients $g_{k,i} \in \partial f(x_k, \omega_i)$ for all $i = 1, 2, \ldots, m$.
• (3) Average the subgradients calculated in step (2) to obtain the stochastic subgradient
$$g_k = \frac{1}{m}\sum_{i=1}^{m} g_{k,i}.$$
• (4) Update the decision variable
$$x_{k+1} = x_k - t_k\, g_k,$$
where $t_k$ is the step size.

The SCS algorithm maintains the first and second steps of SA. However, in the third step, instead of taking an average to obtain the stochastic subgradient, SCS calculates a direction $d_k$ which is a smallest-norm convex combination of all the subgradients. The following lemma indicates that, although both $-d_k$ and $g_k$ remain unbiased estimators of the true subgradient, $-d_k$ is more likely to be a descent direction as $m$ increases. This highlights the advantage of using $d_k$ over $g_k$.
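The contrast can be seen numerically in one dimension. For $f(x,\omega) = |x - \omega|$ with $\omega \sim N(0,1)$, the point $x = 0$ is optimal; the sketch below (a hypothetical example under the notation above) shows that the averaged subgradient retains sampling noise, while the smallest-norm convex combination collapses to zero whenever subgradients of both signs appear.

```python
import numpy as np

rng = np.random.default_rng(4)
x, m = 0.0, 200
subgrads = np.sign(x - rng.normal(size=m))   # subgradients of |x - w|, each in {-1, +1}

g_avg = subgrads.mean()                      # SA direction estimate (step (3) above)
# In 1-D the convex hull of the subgradients is [min, max], so its
# smallest-norm point is 0 as soon as the sample contains both signs.
lo, hi = subgrads.min(), subgrads.max()
d_min_norm = 0.0 if lo <= 0.0 <= hi else (lo if abs(lo) < abs(hi) else hi)

print(f"averaged subgradient : {g_avg:+.3f}  (noise would move the iterate)")
print(f"min-norm combination : {d_min_norm:+.3f}  (correctly flags near-optimality)")
```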
Lemma 2.3. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex function. Then for every $x \in \mathbb{R}^n$:
• The subdifferential $\partial f(x)$ is a non-empty, convex, and compact set. Moreover, for all $y \in \mathbb{R}^n$, there holds
$$\inf_{\alpha > 0} \frac{f(x+\alpha y) - f(x)}{\alpha} = \max_{g \in \partial f(x)} \langle y, g\rangle.$$
• If $0 \notin \partial f(x)$, then the direction
$$d = -\operatorname*{argmin}_{g \in \partial f(x)} \|g\|$$
satisfies $\langle d, g\rangle < 0$ for all $g \in \partial f(x)$.
• The direction $d$ will be a descent direction, i.e., there exists $\alpha > 0$ such that
$$f(x+\alpha d) - f(x) < 0.$$

Note that from Theorem 2.1, the direction $-d_k$ is a convex combination of all the subgradients $g_{k,i} \in \partial f(x_k)$. Thus, combined with Lemma 2.3 (a), we conclude that $-d_k$ is still a subgradient at $x_k$. Moreover, since $d_k$ has the smallest norm (among convex combinations of the $g_{k,i}$), it will converge to the direction $d$ in Lemma 2.3 (b) as $m \to \infty$, while $g_k$ does not have this property.
Chapter 3
Stochastic Conjugate Subgradient Algorithm for Kernel
Support Vector Machines
Stochastic First-Order (SFO) methods have been a cornerstone in addressing a broad spectrum of
modern machine learning (ML) challenges. However, their efficacy is increasingly questioned,
especially in large-scale applications where empirical evidence indicates potential performance
limitations. In response, this paper proposes an innovative method specifically designed for kernel
support vector machines (SVMs). This method not only achieves faster convergence per iteration
but also exhibits enhanced scalability when compared to conventional SFO techniques. Diverging from traditional sample average approximation strategies that typically frame kernel SVM as
an “all-in-one” Quadratic Program (QP), our approach adopts adaptive sampling. This strategy
incrementally refines approximation accuracy on an “as-needed” basis. Crucially, this approach
also inspires a decomposition-based algorithm, effectively decomposing parameter selection from
error estimation, with the latter being independently determined for each data point. To exploit
the quadratic nature of the kernel matrix, we introduce a stochastic conjugate subgradient method.
This method preserves many benefits of first-order approaches while adeptly handling both nonlinearity and non-smooth aspects of the SVM problem. Thus, it extends beyond the capabilities
of standard SFO algorithms for non-smooth convex optimization. The convergence rate of this
novel method is thoroughly analyzed within this paper. Our experimental results demonstrate that
the proposed algorithm not only maintains but potentially exceeds the scalability of SFO methods.
Moreover, it significantly enhances both speed and accuracy of the optimization process.
3.1 Introduction
Kernel support vector machines (SVM), a subset of supervised learning models, are instrumental in data-driven classification tasks [58]. Consider a dataset of covariates represented as S = {(z_i, w_i)}_{i=1}^m, where the pairs correspond to features and their associated labels, respectively. The primary objective is to predict the label W given a random feature Z utilizing the classifier β. In scenarios where the feature vector Z is exceedingly large or the data deviates from linear separability, it becomes necessary to apply a non-linear transformation to the feature vector Z, commonly referred to as the kernel mapping φ. Consequently, the kernel SVM problem is formulated as:

  min_β g(β) := (1/2)||β||² + E[max{0, 1 − W⟨β, φ(Z)⟩}],  (3.1)

where the expectation E is taken with respect to the joint distribution of (Z, W). Directly addressing the problem as defined in Equation (3.1) is infeasible due to two significant challenges: (a) the joint distribution of the pair (Z, W) is typically unknown, and (b) the kernel mapping φ is not directly accessible. To overcome challenge (a), the focus generally shifts to solving an empirical risk minimization (ERM) problem, specifically through a sample average approximation (SAA) approach. With regard to challenge (b), while direct access to φ may not be available, the inner product within the kernel mapping, ⟨φ(z_i), φ(z_j)⟩, may be available. This provides a valuable workaround [58]. Essentially, one uses the Representer Theorem [33] to express β in the form β = Σ_{i=1}^m α_i φ(z_i). This transformation enables a finite-sample kernel SVM approximation as follows:
  min_α f_S(α) := (1/2)⟨α, Qα⟩ + (1/m) Σ_{i=1}^m max{0, 1 − w_i⟨α, Q_i⟩},  (3.2)
where α = [α_1, α_2, ..., α_m] is the decision variable, Q ∈ R^{m×m} is the kernel matrix with Q_{ij} = ⟨φ(z_i), φ(z_j)⟩, and Q_i refers to the i-th row of Q. It is significant to observe that as long as we have access to Q_{ij} = ⟨φ(z_i), φ(z_j)⟩, we can solve (3.2) without direct knowledge of the kernel mapping φ. Notably, a variety of kernel mappings, including but not limited to polynomial, Gaussian, and Laplacian kernels, often result in a positive definite matrix Q [58]. This characteristic implies that (3.2) constitutes a convex optimization problem.
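To make the objective in (3.2) concrete, the following minimal Python sketch builds a Gaussian (RBF) kernel matrix and evaluates f_S(α) for a given α. The function names and the bandwidth parameter gamma are illustrative assumptions, not part of the formal development.

```python
import numpy as np

def gaussian_kernel_matrix(Z, gamma=1.0):
    # Q[i, j] = <phi(z_i), phi(z_j)> = exp(-gamma * ||z_i - z_j||^2)
    sq = np.sum(Z**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-gamma * sq_dists)

def f_S(alpha, Q, w):
    # Objective (3.2): 0.5 <alpha, Q alpha> + mean hinge loss over the m data points
    hinge = np.maximum(0.0, 1.0 - w * (Q @ alpha))
    return 0.5 * alpha @ Q @ alpha + hinge.mean()

# Tiny illustration with synthetic data
rng = np.random.default_rng(0)
Z = rng.normal(size=(10, 3))           # 10 samples, 3 features
w = rng.choice([-1.0, 1.0], size=10)   # labels in {-1, +1}
Q = gaussian_kernel_matrix(Z)
alpha = np.zeros(10)
print(f_S(alpha, Q, w))                # hinge loss of the zero classifier: 1.0
```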
Equation (3.2) can be interpreted as an "all-in-one" Quadratic Program (aioQP). However, solving the aioQP becomes computationally infeasible for extremely large datasets, such as those containing millions of data points. To address this problem, our method introduces an adaptive sampling scheme aimed at improving the quality of the approximation locally, while enhancing the accuracy of the global approximation on an "as-needed" basis. The approximation at iteration k, utilizing a significantly smaller sample size |S_k| << m, is expressed as:
  min_α f_k(α) := (1/2)⟨α, Q^k α⟩ + (1/|S_k|) Σ_{i=1}^{|S_k|} max{0, 1 − w_i⟨α, Q^k_i⟩},  (3.3)
where α = [α_1, α_2, ..., α_{|S_k|}] is the decision variable, Q^k ∈ R^{|S_k|×|S_k|} is the kernel matrix with Q^k_{ij} = ⟨φ(z_i), φ(z_j)⟩, and Q^k_i refers to the i-th row of Q^k. We will demonstrate that our proposed approach not only offers enhanced accuracy but also greater scalability compared to Stochastic First-Order (SFO) methods in tackling large-scale kernel SVM problems. To lay the groundwork, we commence by reviewing a range of classic stochastic methods, many of which are applicable to kernel SVM.
3.1.1 Stochastic first-order methods
SFO methods such as stochastic gradient descent (SGD) [54, 11, 44] or dual coordinate ascent [50, 32] are very popular for solving machine learning problems. For black-box non-smooth optimization, SGD-type algorithms using secant approximations have been proposed by [76]. It is also worth noting that a recent survey [35] summarizes this type of optimization method comprehensively. Simple first-order gradient-based methods dominate the field for several convincing reasons: low computational cost, simplicity of implementation, and strong empirical results. For example, in order to solve (3.2), it is common to adopt a specialized version of SGD such as Pegasos [63], where the subgradient calculations can be carried out very rapidly and in parallel (for a given α ∈ R^n). [64] shows that the Pegasos algorithm obtains an ε_p-sub-optimal solution in time O(1/(λε_p)), where λ is the condition number of Q. This run-time does not depend on the number of variables n and is therefore favorable when n is very large. Finally, asymptotic analysis via the work of Polyak and Juditsky [51] has shown that SFO exhibits asymptotic normality for smooth stochastic optimization, and a similar result has recently been obtained in [19] for non-smooth stochastic convex optimization problems.
While the aforementioned methods have low computational requirements in any one iteration, they have some disadvantages: (a) the convergence rate of such algorithms depends heavily on the condition number of the Hessian of the quadratic function; (b) SFO methods are not known for their numerical accuracy, especially when the functions are non-smooth; and (c) the convergence rate mentioned in the previous paragraph only applies under the assumption that the component functions are smooth. Since (3.2) has non-smooth component functions, such a convergence rate is not guaranteed.
3.1.2 Stochastic methods for general non-linear optimization
For general non-linear stochastic optimization, there are two broad classes of algorithms, most of which are stochastic generalizations of smooth deterministic optimization algorithms. For gradient-based direction finding, stochastic analogs were studied by [47], whereas stochastic analogs of trust region methods (STORM) appear in a series of papers [2, 14, 16]. From a computational standpoint, direction-finding algorithms may have faster iterations, whereas trust region methods may provide stronger convergence properties [9]. However, the aforementioned stochastic methods rely on the existence of gradients, which may not be appropriate for non-smooth stochastic problems such as the kernel SVM in (3.2). Readers interested in the non-smooth, non-convex case may refer to [36].
3.1.3 Stochastic conjugate gradient method
In contrast to first-order methods, algorithms based on Conjugate Gradients (CG) provide finite termination when the objective function is a linear-quadratic deterministic function [46]. Furthermore, CG algorithms are also used for large-scale non-smooth convex optimization problems [73, 77]. However, despite its many strengths (e.g., fast per-iteration convergence, frequent explicit regularization of step sizes, and better parallelization than first-order methods), the use of conjugate gradient methods for stochastic optimization is uncommon. Although [31, 75] proposed new stochastic conjugate gradient (SCG) algorithms, their assumptions do not support non-smooth objective functions, and hence these methods are not directly applicable to kernel SVM problems. The following assumptions, which are common for kernel SVM models, will be imposed throughout this chapter.
• The optimization problems (3.1)-(3.3) are convex and finite-valued. Thus, extended real-valued functions are not permissible in our framework. As a consequence, we assume the functions f and f_k possess a Lipschitz constant 0 < L_f < ∞ and that their subdifferentials are non-empty for every α. Note that when the sample size is large enough, L_f represents an upper bound on the Lipschitz constant of both f and f_k. Efficient and accurate estimation of the Lipschitz constant holds significant practical importance; readers interested in this topic may find valuable insights in the work of [22, 74].

• The inner product of the kernel function and the classifier is bounded, i.e., for any kernel mapping φ(z) and classifier β, |⟨φ(z), β⟩| < M. Intuitively, this assumption says that the classification error cannot be unbounded, which is reasonable for kernel SVM.
In this chapter, we adopt a direction-finding approach in which the direction is obtained by solving a specialized problem: finding the vector with the smallest norm over a line segment. This approach was originally designed for deterministic non-smooth convex optimization problems [72, 73]. In order to solve stochastic optimization problems with non-smooth objectives (as in (3.2)), it is straightforward to imagine that a stochastic subgradient method could be used. In this chapter, we instead treat (3.2) as a stochastic piecewise linear-quadratic optimization problem and propose a new stochastic conjugate subgradient (SCS) algorithm which accommodates both the curvature of the quadratic function and the decomposition of subgradients by data points, as is common in stochastic programming [27, 28]. This combination enhances the power of stochastic first-order approaches without the additional burden of second-order approximations. In other words, our method shares the spirit of Hessian-Free (HF) optimization, which has attracted some attention in the machine learning community [43, 17]. As our theoretical analysis and computational results reveal, the new method provides a "sweet spot" at the intersection of speed, accuracy, and scalability (see Theorems 3.1 and 3.3).
3.1.4 Contributions
(1) Our methodology extends Wolfe's deterministic conjugate subgradient method [73] into the stochastic domain. While a straightforward application of [73] to stochastic programming (SP) problems using Sample Average Approximation (SAA) [65] might be considered, our focus is on solving kernel SVM problems with extremely large datasets, such as Hepmass with 3.5 million samples. Consequently, we employ an adaptive sampling strategy instead of a deterministic finite-sum approximation. This approach bears some resemblance to Stochastic Decomposition (SD) [27, 28], although SD employs a "trust-region" strategy for iterative progress. More notably, deterministic algorithms for finite-sum SAA can impose significant computational and memory demands, especially with large datasets; this observation is corroborated by our experimental results. To address the challenges arising from non-smoothness as well as extremely large datasets, our algorithm integrates several disparate notions: adaptive sampling, decomposition, conjugate subgradients, and line-search techniques. We demonstrate that this amalgamation effectively approximates solutions of kernel SVM problems with extremely large datasets.

(2) The ensuing analysis of our algorithm establishes a convergence rate of O(1/ε²) for kernel SVM, where ε denotes the desired accuracy. This contrasts with the convergence rate of O(1/k^{1/5}) reported in [76], applicable to smooth, albeit far more demanding, black-box and non-strongly convex problems.

(3) In contrast to trust-region methods, which necessitate solving a constrained convex optimization problem in each iteration, our algorithm derives a descent direction by solving a one-dimensional convex optimization problem.

(4) Our computational experiments demonstrate that, from an optimization standpoint, the solutions produced by SCS yield lower objective function values than SFO methods, while maintaining competitive computational times for small to medium-sized datasets. More crucially, as the dataset size increases, the efficiency and accuracy of our Stochastic Conjugate Subgradient (SCS) algorithm surpass those of algorithms such as Pegasos, a specialized first-order method for kernel SVM.
The structure of this chapter is as follows. In §3.2, the discussion evolves from mathematical concepts to computational algorithms. We discuss the SCS algorithm for solving large-scale kernel SVM via a combination of Wolfe's non-smooth conjugate subgradient method [73] and adaptive sampling, which helps leverage a successive sampling methodology as in Stochastic Decomposition [27, 28]. This combination not only introduces elements of recursive estimation into kernel SVM, but also bypasses the need for constrained QP algorithms, or even the need to solve a large linear system (of equations) rapidly. In §3.3, we present the convergence analysis of the SCS algorithm by showing that the algorithm generates a supermartingale sequence which converges in expectation. In §3.4, we provide some technical details about the implementation and compare the computational results with those of its two progenitors: the Pegasos algorithm (an SFO method) and Wolfe's algorithm [73], a deterministic non-smooth optimization method. In the context of this comparison, we discuss some advantages and disadvantages of adopting the SCS algorithm for the kernel SVM problem. Finally, our conclusions are provided in §3.5. The notation used in this chapter is summarized in the appendix to this chapter.
3.2 Stochastic Conjugate Subgradient (SCS) Algorithm
In this section, we focus on solving (3.3) using the SCS algorithm, in which we combine the non-smooth CG method [73] with adaptive sampling and decomposition as in stochastic programming [8]. In this setting, decomposition refers to treating the piecewise linear (max) function and the quadratic part of (3.3) separately to find the direction of change for the decision variables. We first state the SCS algorithm (Algorithm 6) and then explain the key concepts underlying this approach. In the following, we let Nr(G) denote the projection of "0" onto any finite collection of vectors G. During the course of the algorithm, we adaptively enlarge the data set represented within the SVM problem. Before the start of the k-th iteration, the set of data available to the algorithm is denoted S_{k−1}. In iteration k we include additional data points denoted by the set s_k, so that |S_k| = |S_{k−1}| + |s_k|. The SCS algorithm consists of four important components:
• Sequential function approximation. In many cases we have a fixed number of data points. However, our focus is on situations where the data can be queried from an oracle sequentially. Thus, we use adaptive sampling [27, 28, 48]:

  min_α f_k(α) = (1/2)⟨α, Q^k α⟩ + (1/|S_k|) Σ_{i=1}^{|S_k|} max{0, 1 − w_i⟨α, Q^k_i⟩}.  (3.4)

It is important to note that w_i and Q^k_i are based on realizations of the random variables (Z, W). Consequently, f_k is a realization of the expectation functional in (3.1). In the remainder of this chapter, the kernel matrix Q^k agrees with the size of the data set (|S_k|) in any iteration.
• Direction finding. The idea here is inspired by Wolfe's non-smooth conjugate subgradient method, which uses the smallest-norm convex combination of the previous search direction (−d_{k−1}) and the current subgradient (g_k). More specifically, we first solve the following one-dimensional QP:

  λ*_k = argmin_{λ∈[0,1]} (1/2)||λ·(−d_{k−1}) + (1−λ)·g_k||².  (3.5)

Then we set the new search direction as

  d_k = −[λ*_k·(−d_{k−1}) + (1−λ*_k)·g_k] := −Nr(G_k),

where G_k = {−d_{k−1}, g_k} and λ*_k denotes the optimal weight for the convex combination. Clearly, if one fixes λ = 0, then the search direction reduces to that of the subgradient method. (A Python sketch of this computation appears after this list.)
• Choice of step size. This is achieved by combining concepts from Wolfe's deterministic algorithm with stochastic trust region and line search methods [46, 73]. The step applies Wolfe's line-search criteria to a sampled realization of the expectation functional. Let g(t) ∈ ∂f_{k−1}(α̂_k + t·d_k) and define the intervals L and R:

  L = {t > 0 : f_{k−1}(α̂_k + t·d_k) − f_{k−1}(α̂_k) ≤ −m₁||d_k||²t},
  R = {t > 0 : 0 > ⟨g(t), d_k⟩ ≥ −m₂||d_k||²}.  (3.6)

These conditions are consistent with the line-search rules in Wolfe's deterministic algorithm [72]. The output of the step size satisfies two metrics: (i) the set L contains points which reduce the objective function approximation f_{k−1}, and (ii) the set R contains points for which the directional derivative estimate of f_{k−1} is improved. The algorithm seeks points which belong to L ∩ R. The combination of L and R is called the strong Wolfe condition, and Lemma 1 of [73] shows that L ∩ R is not empty.
• Termination criteria. The algorithm concludes when ||d_k|| < ε and δ_k < δ_min. As we show in Theorem 3.4, a diminutive value of ||d_k|| indicates a small norm of the subgradient g ∈ ∂f(α̂_k), fulfilling the optimality condition for an unconstrained convex optimization problem. Additionally, a threshold δ_min is established to prevent premature termination in the initial iterations. Without this threshold, the algorithm might halt prematurely if the initial samples are correctly classified by the initial classifier α_0, resulting in ||d_k|| < ε. However, if δ_min is introduced, then based on Theorem 3.1, |S_k| should be chosen such that

  |S_k| ≥ −8log(ε/2)·(M+1)²/(κ²δ⁴_min),

where κ is the κ-approximation constant defined in Definition 3.1. This effectively mitigates any early-stopping concern with high probability.
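Since (3.5) is a one-dimensional QP, λ*_k admits a closed form: the unconstrained minimizer projected onto [0, 1]. The following minimal sketch (the function name is ours; inputs are assumed to be numpy vectors) computes λ*_k and the resulting direction d_k.

```python
import numpy as np

def conjugate_direction(d_prev, g):
    # Solve (3.5): minimize 0.5 * || lam * (-d_prev) + (1 - lam) * g ||^2 over lam in [0, 1].
    # Writing the objective as 0.5 * || g + lam * u ||^2 with u = -d_prev - g, the
    # unconstrained minimizer is lam = -<g, u> / ||u||^2, then clipped to [0, 1].
    u = -d_prev - g
    denom = u @ u
    lam = 0.0 if denom == 0.0 else float(np.clip((-g @ u) / denom, 0.0, 1.0))
    # New direction: d_k = -( lam * (-d_prev) + (1 - lam) * g ) = -Nr(G_k)
    return -(lam * (-d_prev) + (1.0 - lam) * g), lam
```

With lam fixed at 0 the returned direction is just −g_k, recovering the plain subgradient step noted above.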
Algorithm 6: Stochastic Conjugate Subgradient (SCS) Algorithm

1.  Set ε > 0, δ_min < δ_0 < δ_max, η₁ ∈ (0, 1), η₂ > 0, γ > 1, and k ← 0.
2.  Randomly generate S_0 from the data set S and build an initial Radial Basis Function (RBF) kernel Q^0 based on S_0.
3.  Define f_0(α) = (1/2)⟨α, Q^0 α⟩ + (1/|S_0|) Σ_{i=1}^{|S_0|} max{0, 1 − w_i⟨α, Q^0_i⟩}.
4.  Set a feasible solution α̂_0 = 0 ∈ R^{|S_0|} and an initial direction d_0 ∈ ∂f_0(α̂_0).
5.  while ||d_k|| > ε or δ_k > δ_min do
6.     k ← k + 1;
7.     Obtain g_k ∈ ∂f_{k−1}(α̂_{k−1}), G_k = {−d_{k−1}, g_k}, and d_k = −Nr(G_k);
8.     Apply Algorithm 7 to find the step size t_k;
9.     Set α_k = α̂_{k−1} + t_k d_k;
10.    Randomly generate a set of new samples s_k of cardinality |s_k|;
11.    Define S_k ≜ S_{k−1} ∪ s_k;
12.    Construct f_k(α) = (1/2)αᵀQ^k α + (1/|S_k|) Σ_{i=1}^{|S_k|} max{0, 1 − w_i⟨α, Q^k_i⟩};
13.    Randomly generate a set of new samples T_k of cardinality |T_k| = |S_k|, independent of S_k;
14.    Construct f̂_k(α) = (1/2)αᵀQ̂^k α + (1/|T_k|) Σ_{i=1}^{|T_k|} max{0, 1 − ŵ_i⟨α, Q̂^k_i⟩} based on T_k;
15.    Let β_k = (α_k, 0) ∈ R^{|S_k|}, β̂_{k−1} = (α̂_{k−1}, 0) ∈ R^{|S_k|}, and d_k = (d_k, 0) ∈ R^{|S_k|};
16.    if f_k(β_k) − f_k(β̂_{k−1}) ≤ η₁(f̂_k(β_k) − f̂_k(β̂_{k−1})) and ||d_k|| > η₂δ_{k−1} then
17.       α̂_k ← β_k, δ_k ← min{γδ_{k−1}, δ_max};
18.    else
19.       α̂_k ← β̂_{k−1}, δ_k ← max{δ_{k−1}/γ, δ_min};
20.    end
21. end
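The acceptance test in steps 16-19 compares the decrease measured on S_k with the decrease measured on the independent sample T_k, in the spirit of stochastic trust-region methods. A minimal sketch of this update, under our own naming conventions, is:

```python
def accept_and_update(f_k, f_hat_k, beta_k, beta_prev, d_norm, delta_prev,
                      eta1, eta2, gamma, delta_min, delta_max):
    # Successful iteration: the decrease on S_k is confirmed (up to eta1) by the
    # independent estimate on T_k, and ||d_k|| is large relative to delta.
    decrease_Sk = f_k(beta_k) - f_k(beta_prev)
    decrease_Tk = f_hat_k(beta_k) - f_hat_k(beta_prev)
    if decrease_Sk <= eta1 * decrease_Tk and d_norm > eta2 * delta_prev:
        return beta_k, min(gamma * delta_prev, delta_max)     # accept; expand region
    return beta_prev, max(delta_prev / gamma, delta_min)      # reject; shrink region
```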
Remark 1: (i) The initial choice of δ_0 may require some tuning depending on the data set, but it does not affect the convergence properties of this algorithm. (ii) According to Theorem 3.2, δ_max := ε/ζ, where

  ζ = max{4nκ, η₂, 4κ/(C₁(1−η₁))}

and C₁ = m₁/(2n), 1 < n ∈ Z⁺. (iii) Note that replacing d_k with an arbitrary subgradient may not ensure descent unless the function f_{k−1} is smooth. Thus, simply substituting g_k for d_k and using an SGD-based method may not guarantee the convergence rate shown in Theorem 3.3. (iv) In general, optimization problems are stated with decision variables of fixed dimension. However, since we have designed an adaptive sampling approach, the dimension of our decision variables grows iteratively with the sample size.
Remark 2: Algorithm 7 has two possible outcomes. One is that it returns a suitable step size t such that t ∈ L ∩ R and t||d_k|| ≥ δ_k/n. The other is that it keeps shrinking the step size until t||d_k|| < δ_k/n; this case is captured in step 22 of Algorithm 7, and the algorithm returns t = 0. Since α_k = α̂_{k−1} + t_k d_k, t = 0 means the decision variable is not updated in this iteration. In the next iteration, Algorithm 6 generates more samples, and these new samples may improve the search direction. This procedure thus first terminates the line search and then improves the search direction by further sampling, a phenomenon similar to retaining the previous incumbent solution in the stochastic decomposition algorithm [28].
Algorithm 7: Line Search Algorithm

1.  Set 1/4 ≤ m₁ < 0.5, 1/4 ≤ m₂ < m₁, 1 < n ∈ Z⁺, and b = δ_k/n.
2.  Choose t such that t||d_k|| ∈ [b, δ_k].
3.  if t ∈ L\R then
4.     Set t = 2t until t ∈ R or t||d_k|| > δ_k; set I = [t/2, t].
5.     if t||d_k|| > δ_k then
6.        Return t/2.
7.     else if t ∈ L ∩ R then
8.        Return t.
9.     else
10.       while t ∉ L ∩ R do
11.          Set t to be the midpoint of I;
12.          if t ∈ R\L then replace I by its left half;
13.          else replace I by its right half;
14.       end
15.    end
16. end
17. if t ∈ R\L then
18.    Set t = t/2 until t ∈ L or t||d_k|| < b; set I = [t, 2t].
19.    if t ∈ L ∩ R then
20.       Return t.
21.    else if t||d_k|| < b then
22.       Return t = 0.
23.    else
24.       Repeat steps 10-14 until t ∈ L ∩ R.
25.    end
26. end
27. Return t.
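A compact Python rendering of this doubling/bisection scheme is given below. It is an illustrative sketch under our own naming conventions, not a verbatim transcription of Algorithm 7: the callables in_L and in_R are assumed to implement the membership tests (3.6) for the current sampled objective.

```python
def wolfe_line_search(in_L, in_R, d_norm, delta_k, n=2, max_iter=50):
    # in_L(t), in_R(t): membership tests for the sets L and R in (3.6).
    b = delta_k / n
    t = 0.5 * (b + delta_k) / d_norm          # initial t with t * ||d_k|| in [b, delta_k]
    if in_L(t) and in_R(t):
        return t
    if in_L(t):                               # t in L \ R: double until t enters R
        while not in_R(t) and t * d_norm <= delta_k:
            t *= 2.0
        if t * d_norm > delta_k:
            return t / 2.0
        lo, hi = t / 2.0, t
    elif in_R(t):                             # t in R \ L: halve until t enters L
        while not in_L(t) and t * d_norm >= b:
            t /= 2.0
        if t * d_norm < b:
            return 0.0                        # signal: keep incumbent, sample more
        lo, hi = t, 2.0 * t
    else:
        return t                              # outside L and R: give up in this sketch
    for _ in range(max_iter):                 # bisection on I = [lo, hi]
        if in_L(t) and in_R(t):
            return t
        t = 0.5 * (lo + hi)
        if in_R(t) and not in_L(t):
            hi = t                            # replace I by its left half
        else:
            lo = t                            # replace I by its right half
    return t
```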
3.3 Convergence Properties

While the SCS algorithm and the stochastic trust region method share some similarities in their proofs of convergence, the specifics of these arguments differ because the former is based on direction finding and the latter on trust regions. For example, the fully-linear property used in stochastic trust region methods plays no role in the convergence proof of this chapter. Similarly, the sufficient decrease condition (Lemma 3.2) in our algorithm relies on Wolfe's line-search algorithm, while the corresponding condition for a stochastic trust region method requires the Cauchy decrease property.
In the following subsections, Theorem 3.1 shows the sample complexity of the algorithm, while Theorem 3.2 and the subsequent corollary show that sufficient descent can be obtained and, as a result, that the stochastic sequence is a supermartingale. In §3.3.3, we apply the renewal reward theorem and the optional stopping theorem [26] to establish the convergence rate of the SCS algorithm; the specific results are provided in Theorem 3.3.

Notational Guide. We adopt the convention that random variables are denoted by capital letters (e.g., B) while realizations are denoted by the corresponding lower-case letters (e.g., b). To avoid notational confusion, note that B denotes a ball and the C_i's are ordinary constants.
3.3.1 Stochastic Local Approximations

In this subsection, we demonstrate that with an adequate number of samples, the function f_k serves as an "effective" stochastic local approximation of f. The criterion for what constitutes "effective" in this context is provided in Definition 3.1, which gives a precise metric; Definition 3.2 extends this concept to a stochastic framework.

Definition 3.1. A function f_k is a local κ-approximation of f on B(α̂_k, δ_k) if, for all α ∈ B(α̂_k, δ_k),

  |f(α) − f_k(α)| ≤ κδ²_k.
Definition 3.2. Let Δ_k, Â_k denote the random counterparts of δ_k, α̂_k, respectively, and let F_k denote the σ-algebra generated by {F_i, F̂_i, F̂_{i+1}}_{i=0}^{k−1}. Notice that this definition allows us to include the values of the so-called incumbent sequences (of Stochastic Decomposition [28]) as part of the σ-algebra. A sequence of random models {F_k} is said to be a μ₁-probabilistically κ-approximation of f with respect to {B(Â_k, Δ_k)} if the event

  I_k ≜ {F_k is a κ-approximation of f on B(Â_k, Δ_k)}

satisfies the condition

  P(I_k | F_{k−1}) ≥ μ₁ > 0.
Remark 3: (i) Unlike [47], which imposes a generic set of guidelines for stochastic line-search methods, including requirements on the gradient approximation, our setup accommodates the non-smoothness of kernel SVM problems. For this reason, our requirements focus simply on sufficiently accurate function estimation, without invoking gradient approximations. (ii) In the interest of clarity, we emphasize that the incumbent decision α̂_k has a random counterpart denoted Â_k.
Given the relation β = Σ_{i=1}^{|S_k|} α_i φ(z_i), we can define the function f(α) as a transformation of g(β):

  g(β) = f(α) := (1/2)⟨α, Q^k α⟩ + E[max{0, 1 − W⟨α, Q^k(Z)⟩}],  (3.7)

where Q^k(Z) = [⟨φ(z_1), φ(Z)⟩, ⟨φ(z_2), φ(Z)⟩, ..., ⟨φ(z_{|S_k|}), φ(Z)⟩]. Similarly, we can express f_k(α) in terms of g_k(β) as follows:

  f_k(α) = g_k(β) := (1/2)||β||² + (1/|S_k|) Σ_{i=1}^{|S_k|} max{0, 1 − w_i⟨β, φ(z_i)⟩}.  (3.8)
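The identity ||β||² = ⟨α, Q^k α⟩ that underlies the equivalence of (3.7)-(3.8) can be checked numerically with any explicit feature map; the homogeneous quadratic kernel used below is purely an illustrative choice.

```python
import numpy as np

def phi(z):
    # Explicit feature map of the homogeneous quadratic kernel: <phi(z), phi(z')> = (z . z')^2
    return np.array([z[0]**2, np.sqrt(2.0) * z[0] * z[1], z[1]**2])

rng = np.random.default_rng(1)
Z = rng.normal(size=(5, 2))
alpha = rng.normal(size=5)
Q = (Z @ Z.T) ** 2                                   # Q[i, j] = <phi(z_i), phi(z_j)>
beta = sum(a * phi(z) for a, z in zip(alpha, Z))     # beta = sum_i alpha_i phi(z_i)
print(np.isclose(beta @ beta, alpha @ Q @ alpha))    # True: ||beta||^2 = <alpha, Q alpha>
```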
Theorem 3.1. Assume that f and f_k are Lipschitz continuous with constant L_f, and let 0 < ε < 1 and κ > 4L_f/δ_min. If for all i we have |⟨β, φ(z_i)⟩| ≤ M and

  |S_k| ≥ −8log(ε/2)·(M+1)²/(κ²δ⁴_k),  (3.9)

then

  P(|f_k(α) − f(α)| ≤ κδ²_k, ∀α ∈ B(α̂_k, δ_k)) ≥ 1 − ε.
Proof. First, for the point α̂_k, given that β̂_k = Σ_{i=1}^{|S_k|} α̂_{k,i} φ(z_i), we have

  f(α̂_k) = g(β̂_k) = (1/2)||β̂_k||² + E[max{0, 1 − W⟨β̂_k, φ(Z)⟩}],
  f_k(α̂_k) = g_k(β̂_k) = (1/2)||β̂_k||² + (1/|S_k|) Σ_{i=1}^{|S_k|} max{0, 1 − w_i⟨β̂_k, φ(z_i)⟩}.

Thus,

  f_k(α̂_k) − f(α̂_k) = g_k(β̂_k) − g(β̂_k)
    = (1/|S_k|) Σ_{i=1}^{|S_k|} max{0, 1 − w_i⟨β̂_k, φ(z_i)⟩} − E[max{0, 1 − W⟨β̂_k, φ(Z)⟩}].

Using the assumption that |⟨β, φ(z_i)⟩| ≤ M for all i, Hoeffding's inequality implies that

  P(|f_k(α̂_k) − f(α̂_k)| ≤ (1/2)κδ²_k) ≥ 1 − 2exp(−κ²δ⁴_k|S_k|²/(8·|S_k|·(M+1)²)) ≥ 1 − ε,

which indicates that when

  |S_k| ≥ −8log(ε/2)·(M+1)²/(κ²δ⁴_k),

we have

  P(|f_k(α̂_k) − f(α̂_k)| ≤ (1/2)κδ²_k) ≥ 1 − ε.

For any other α ∈ B(α̂_k, δ_k), if |f_k(α̂_k) − f(α̂_k)| ≤ (1/2)κδ²_k, then

  |f_k(α) − f(α)| ≤ |f_k(α) − f_k(α̂_k)| + |f_k(α̂_k) − f(α̂_k)| + |f(α̂_k) − f(α)|
    ≤ 2L_f·δ_k + (1/2)κδ²_k ≤ κδ²_k,

where the second-to-last inequality is due to the Lipschitz continuity of f_k and f, and the last inequality holds because κ > 4L_f/δ_min > 4L_f/δ_k. Thus, we conclude that if |S_k| satisfies Equation (3.9), then

  P(|f_k(α) − f(α)| ≤ κδ²_k, ∀α ∈ B(α̂_k, δ_k)) ≥ 1 − ε.
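For concreteness, the sample-size rule (3.9) is a one-line computation; the hedged helper below (names are ours) returns the smallest integer satisfying the bound.

```python
import math

def sample_size_bound(eps, M, kappa, delta_k):
    # Smallest |S_k| satisfying (3.9): |S_k| >= -8 log(eps/2) (M + 1)^2 / (kappa^2 delta_k^4)
    assert 0.0 < eps < 1.0 and kappa > 0.0 and delta_k > 0.0
    return math.ceil(-8.0 * math.log(eps / 2.0) * (M + 1.0)**2 / (kappa**2 * delta_k**4))

print(sample_size_bound(eps=0.05, M=1.0, kappa=10.0, delta_k=0.5))   # 19
```

Note the δ⁴_k in the denominator: as the search region shrinks, the sample size grows rapidly, which is why Algorithm 6 enlarges S_k adaptively rather than fixing it in advance.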
3.3.2 Sufficient Decrease and Supermartingale Property

This section establishes the stochastic variant of the sufficient decrease condition, known as the supermartingale property, in Theorem 3.2. The proof is segmented into four distinct cases, each based on whether the conditions of κ-approximation (as defined in Definition 3.2) and ε_F-accuracy (as outlined in Definition 3.4) are met. Lemmas 3.1-3.3 detail the outcomes for these various situations. Following these lemmas, the supermartingale property and its proof are presented.

Definition 3.3. The value estimates f̂_k ≜ f̂_k(α̂_k) and f̂_{k+1} ≜ f̂_k(α̂_k + t d_k) are ε_F-accurate estimates of f(α̂_k) and f(α̂_k + t d_k), respectively, for a given δ_k if and only if

  |f̂_k − f(α̂_k)| ≤ ε_F δ²_k,  |f̂_{k+1} − f(α̂_k + t d_k)| ≤ ε_F δ²_k.

Definition 3.4. A sequence of model estimates {F̂_k, F̂_{k+1}} is said to be μ₂-probabilistically ε_F-accurate with respect to {Â_k, Δ_k, S_k} if the event

  J_k ≜ {F̂_k, F̂_{k+1} are ε_F-accurate estimates for Δ_k}  (3.10)

satisfies the condition

  P(J_k | F_{k−1}) ≥ μ₂ > 0.  (3.11)
Lemma 3.1. Suppose that f_k is a κ-approximation of f on B(α̂_k, δ_k) and C₁ = 1/(8n), where n is specified in Algorithm 7. If

  δ_k ≤ ||d_k||/(16nκ)  (3.12)

and Algorithm 7 returns a suitable step size t ∈ L with t||d_k|| > δ_k/n, then we have

  f(α̂_k + t d_k) − f(α̂_k) ≤ −C₁||d_k||δ_k.  (3.13)

Proof. Since t ∈ L and t||d_k|| > δ_k/n, we have

  f_k(α̂_k + t d_k) − f_k(α̂_k) ≤ −m₁||d_k||²t ≤ −(m₁/n)||d_k||δ_k.

Also, since f_k is a κ-approximation of f on B(α̂_k, δ_k), we have

  f(α̂_k + t d_k) − f(α̂_k) = [f(α̂_k + t d_k) − f_k(α̂_k + t d_k)] + [f_k(α̂_k + t d_k) − f_k(α̂_k)] + [f_k(α̂_k) − f(α̂_k)]
    ≤ 2κδ²_k − (m₁/n)||d_k||δ_k ≤ −C₁||d_k||δ_k,  (3.14)

which concludes the proof.
Lemma 3.2. Suppose that f_k is a κ-approximation of f on the ball B(α̂_k, δ_k) and the estimates (f̂_k, f̂_{k+1}) are ε_F-accurate with ε_F ≤ κ. Moreover, if

  δ_k ≤ min{(1−η₁)C₁||d_k||/(2κ), ||d_k||/η₂},  (3.15)

and Algorithm 7 returns a suitable step size t ∈ L with t||d_k|| > δ_k/n, then the k-th iteration is successful (i.e., α̂_k ← β_k and δ_k ← min{γδ_{k−1}, δ_max}).

Proof. Since t ∈ L and t||d_k|| > δ_k/n, we have

  f_k(α̂_k + t d_k) − f_k(α̂_k) ≤ −(m₁/n)||d_k||δ_k ≤ −2C₁||d_k||δ_k,

where the last inequality holds because m₁ ≥ 1/4. Also, since f_k is a κ-approximation of f on B(α̂_k, δ_k) and the estimates (f̂_k, f̂_{k+1}) are ε_F-accurate with ε_F ≤ κ, we have

  ρ_k = (f̂_{k+1} − f̂_k)/(f_k(α̂_k + t d_k) − f_k(α̂_k))
      = (f̂_{k+1} − f(α̂_k + t d_k))/(f_k(α̂_k + t d_k) − f_k(α̂_k))
      + (f(α̂_k + t d_k) − f_k(α̂_k + t d_k))/(f_k(α̂_k + t d_k) − f_k(α̂_k))
      + (f_k(α̂_k + t d_k) − f_k(α̂_k))/(f_k(α̂_k + t d_k) − f_k(α̂_k))
      + (f_k(α̂_k) − f(α̂_k))/(f_k(α̂_k + t d_k) − f_k(α̂_k))
      + (f(α̂_k) − f̂_k)/(f_k(α̂_k + t d_k) − f_k(α̂_k)),

which indicates that

  1 − ρ_k ≤ 2κδ_k/(C₁||d_k||) ≤ 1 − η₁.

Thus we have ρ_k ≥ η₁ and ||d_k|| > η₂δ_k, and the iteration is successful.
Lemma 3.3. Suppose the function value estimates (f̂_k, f̂_{k+1}) are ε_F-accurate and

  ε_F < η₁η₂C₁.

If the k-th iteration is successful, then the improvement in f is bounded as

  f(α̂_{k+1}) − f(α̂_k) ≤ −C₂δ²_k,

where C₂ ≜ 2η₁η₂C₁ − 2ε_F.

Proof. Since t ∈ L and t||d_k|| > δ_k/n, we have

  f_k(α̂_k + t d_k) − f_k(α̂_k) ≤ −2C₁||d_k||δ_k.

Also, since the iteration is successful, we have ||d_k|| > η₂δ_k. Thus,

  f̂_k − f̂_{k+1} ≥ η₁(f_k(α̂_k) − f_k(α̂_k + t d_k)) ≥ 2η₁C₁||d_k||δ_k ≥ 2η₁η₂C₁δ²_k.

Then, since the estimates are ε_F-accurate, the improvement in f can be bounded as

  f(α̂_k + t d_k) − f(α̂_k) ≤ [f(α̂_k + t d_k) − f̂_{k+1}] + [f̂_{k+1} − f̂_k] + [f̂_k − f(α̂_k)] ≤ −C₂δ²_k.
Theorem 3.2. Let the random variable V_k ≜ F_{k+1} − F_k, with corresponding realization v_k, and let

  ζ = max{4nκ, η₂, 4κ/(C₁(1−η₁))} and μ₁μ₂ > 1/2.  (3.16)

Then for any ε > 0 such that ||d_k|| ≥ ε and ζΔ_k ≤ ε, we have

  E[V_k | ||d_k|| ≥ ε, ζΔ_k ≤ ε] ≤ −(1/2)C₁||d_k||Δ_k ≤ −Θ ε Δ_k,

where Θ = (1/2)C₁.

Proof. First of all, if the k-th iteration is successful, i.e., α̂_{k+1} = α̂_k + t d_k, we have

  v_k = f(α̂_k + t d_k) − f(α̂_k).  (3.17)

If the k-th iteration is unsuccessful, i.e., α̂_{k+1} = α̂_k, we have

  v_k = f(α̂_k) − f(α̂_k) = 0.  (3.18)

We divide the analysis into four cases according to the states (true/false) observed for the pair (I_k, J_k).

(a) I_k and J_k are both true. Since f_k is a κ-approximation of f on B(α̂_k, δ_k) and condition (3.12) is satisfied, Lemma 3.1 holds. Also, since the estimates (f̂_k, f̂_{k+1}) are ε_F-accurate and condition (3.15) is satisfied, Lemma 3.2 holds. Combining (3.13) with (3.17), we have

  v_k ≤ −C₁||d_k||δ_k ≜ b₁.  (3.19)

(b) I_k is true but J_k is false. Since f_k is a κ-approximation of f on B(α̂_k, δ_k) and condition (3.12) is satisfied, Lemma 3.1 still holds. If the iteration is successful, we have (3.19); otherwise we have (3.18). Thus, v_k ≤ 0.

(c) I_k is false but J_k is true. If the iteration is successful, then since the estimates (f̂_k, f̂_{k+1}) are ε_F-accurate and ε_F < η₁η₂C₁, Lemma 3.3 holds. Hence

  v_k ≤ −C₂δ²_k.

If the iteration is unsuccessful, we have (3.18). Thus v_k ≤ 0 whether the iteration is successful or not.

(d) I_k and J_k are both false. Since f is convex and t||d_k|| < δ_k, for any g(t) ∈ ∂f(α̂_k + t d_k), the assumptions in §3.1.3 imply that ||g(t)|| is bounded, and we have

  v_k = f(α̂_k + t d_k) − f(α̂_k) ≤ g(t)ᵀ(t d_k) ≤ C₃δ_k.

If the iteration is successful, then v_k ≤ C₃δ_k ≜ b₂; if the iteration is unsuccessful, we have (3.18). Thus v_k ≤ b₂ whether the iteration is successful or not.

With the above four cases, we can bound E[V_k | ||d_k|| ≥ ε, ζΔ_k ≤ ε] based on the different outcomes of I_k and J_k. Let B₁ and B₂ be the random counterparts of b₁ and b₂. Then

  E[V_k | ||d_k|| ≥ ε, ζΔ_k ≤ ε] ≤ μ₁μ₂B₁ + (μ₁(1−μ₂) + μ₂(1−μ₁))·0 + (1−μ₁)(1−μ₂)B₂
    = μ₁μ₂(−C₁||d_k||Δ_k) + (1−μ₁)(1−μ₂)C₃Δ_k.

Choosing μ₁ ∈ (1/2, 1) and μ₂ ∈ (1/2, 1) large enough that

  μ₁μ₂ ≥ 1/2 and μ₁μ₂/((1−μ₁)(1−μ₂)) ≥ 2C₃/(C₁||d_k||),

we obtain

  E[V_k | ||d_k|| ≥ ε, ζΔ_k ≤ ε] ≤ −(1/2)C₁||d_k||Δ_k.
A quick observation of the concluding condition reveals that the requirement Δ_k ≤ ε/ζ implies that the supermartingale property holds. This prompts us to impose the restriction δ_max = ε/ζ. The property is formalized in the following corollary.
Corollary 3.1. Let

  T_ε = inf{k ≥ 0 : ||d_k|| < ε}.

Then T_ε is a stopping time for the stochastic process Â_k. Moreover, conditioned on T_ε ≥ k, F_k is a supermartingale.

Proof. From Theorem 3.2, we have

  E[F_k | F_{k−1}, T_ε > k] ≤ F_{k−1} − Θ ε Δ_k.  (3.20)

Hence, F_k is a supermartingale.
3.3.3 Convergence Rate

Building upon the results established in Theorem 3.2, where F_k is shown to be a supermartingale and T_ε is identified as a stopping time, we proceed to construct a renewal reward process for analyzing a bound on the expected value of T_ε. As highlighted in the abstract, Theorem 3.3 confirms the rate of convergence to be O(1/ε²). To begin, define the renewal process {A_l} as follows: set A_0 = 0 and, for each l > 0, A_l = inf{m > A_{l−1} : ζΔ_m ≥ ε}, with ζ specified in (3.16). Additionally, define the inter-arrival times τ_l = A_l − A_{l−1}. Lastly, introduce the counting process N(k) = max{l : A_l ≤ k}, representing the number of renewals occurring up to the k-th iteration.

Lemma 3.4. Let 1/2 < p = μ₁μ₂ ≤ P(I_k ∩ J_k). Then for all l ≥ 1,

  E[τ_l] ≤ p/(2p−1).

Proof. First,

  E[τ_l] = E[τ_l | ζΔ_{A_{l−1}} > ε]·P(ζΔ_{A_{l−1}} > ε) + E[τ_l | ζΔ_{A_{l−1}} = ε]·P(ζΔ_{A_{l−1}} = ε)
         ≤ max{E[τ_l | ζΔ_{A_{l−1}} > ε], E[τ_l | ζΔ_{A_{l−1}} = ε]}.

If ζΔ_{A_{l−1}} > ε, then according to Algorithm 6,

  E[τ_l | ζΔ_{A_{l−1}} > ε] = 1.  (3.21)

If ζΔ_{A_{l−1}} = ε, then by Theorem 3.2 we can treat {Δ_{A_{l−1}}, ..., Δ_{A_l}} as a random walk, and we have

  E[τ_l | ζΔ_{A_{l−1}} = ε] ≤ p/(2p−1).  (3.22)

Combining (3.21) and (3.22) completes the proof.
Lemma 3.5. Let ζ and Θ be as in Theorem 3.2 and Δ_max = ε/ζ. Then

  E[N(T_ε)] ≤ 2ζF_max/(Θε²) + ζΔ_max/ε = 2ζF_max/(Θε²) + 1,

where F_max = max_{i ≤ (k∧T_ε)} |F_i|.

Proof. We first show that

  R_{k∧T_ε} = F_{k∧T_ε} + Θε Σ_{j=0}^{k∧T_ε} Δ_j  (3.23)

is a supermartingale. Using (3.20),

  E[R_{k∧T_ε} | F_{k−1}] = E[F_{k∧T_ε} | F_{k−1}] + E[Θε Σ_{j=0}^{k∧T_ε} Δ_j | F_{k−1}]
    ≤ F_{k−1} − ΘεΔ_k + E[Θε Σ_{j=0}^{k∧T_ε} Δ_j | F_{k−1}]
    = F_{k−1} + E[Θε Σ_{j=0}^{(k−1)∧T_ε} Δ_j | F_{k−1}]
    = F_{k−1} + Θε Σ_{j=0}^{(k−1)∧T_ε} Δ_j = R_{(k−1)∧T_ε},  (3.24)

where the identity involving the last expectation in (3.24) holds by moving ΘεΔ_k inside the summation, so that the sum has one less term when k < T_ε. If k < T_ε, then

  |R_{k∧T_ε}| = |R_k| ≤ F_max + Θεk Δ_max.

If k ≥ T_ε, then

  |R_{k∧T_ε}| = |R_{T_ε}| ≤ F_max + ΘεT_ε Δ_max,

which is also bounded almost surely since T_ε is bounded almost surely. Hence, according to (3.23) and the optional stopping theorem [26], we have

  E[Θε Σ_{j=0}^{T_ε} Δ_j] ≤ E[R_{T_ε}] + F_max ≤ E[R_0] + F_max ≤ 2F_max + ΘεΔ_max.  (3.25)

Furthermore, since the renewal A_l happens when ζΔ_j ≥ ε and N(T_ε) counts a subset of {0, 1, 2, ..., T_ε}, we have

  Θε Σ_{j=0}^{T_ε} ζΔ_j ≥ Θε N(T_ε) ε.  (3.26)

Combining (3.25) and (3.26), we have

  E[N(T_ε)] ≤ (2ζF_max + ζΘεΔ_max)/(Θε²) ≤ 2ζF_max/(Θε²) + ζΔ_max/ε.
Theorem 3.3. Under the conditions enunciated in §3.1.3, we have

  E[T_ε] ≤ (p/(2p−1))·(2ζF_max/(Θε²) + 2).  (3.27)

Proof. First, note that N(T_ε) + 1 is a stopping time for the renewal process {A_l : l ≥ 0}. Thus, using Wald's equation (inequality form) [26], we have

  E[A_{N(T_ε)+1}] ≤ E[τ_1]·E[N(T_ε) + 1].

Moreover, since A_{N(T_ε)+1} ≥ T_ε, we have

  E[T_ε] ≤ E[τ_1]·E[N(T_ε) + 1].

Hence, by Lemma 3.4 and Lemma 3.5,

  E[T_ε] ≤ (p/(2p−1))·(2ζF_max/(Θε²) + 2).
3.3.4 Optimality Condition

Theorem 3.4 demonstrates that if ||d_k|| < ε, then the ε-optimality condition for f is satisfied. The error bound on this optimality is linked to both the sample size and the subgradient of f_k. Readers seeking the in-depth proof of Theorem 3.4 may refer to the appendix of this chapter.

Theorem 3.4. Let d* = −argmin_{g∈∂f(α̂_k)} ||g||. If k is the smallest index for which ||d_k|| < ε and

  |S_k| ≥ −2L²log(δ)/(ε′)²,

then

  P(||d*|| < 4ε + ε′) ≥ 1 − δ.
3.4 Preliminary Computational Results

Our computational experiments are based on data sets available at the UCI Machine Learning Repository [23]. Our study considers three different methods: Wolfe's algorithm [73], the kernel Pegasos algorithm¹, and the SCS algorithm. Wolfe's algorithm uses a deterministic objective function (i.e., the SAA) and is treated as a benchmark, while Pegasos and SCS use (3.2) as the objective function. However, in order to compare the progress of the objective function values for the different algorithms, we track the values of (3.2) for all three algorithms. The values of {f(α̂_k)}_{k=1}^{50} and {f(α̂_k)}_{k=−50}^{−1} are shown in Figures 3.1 and 3.2, where the indices k = 1, ..., 50 and k = −50, ..., −1 represent the solution values in the first fifty and last fifty iterations (respectively) of each algorithm. The algorithms were implemented on a MacBook Pro 2019 with a 2.6GHz 6-core Intel Core i7 processor and 16GB of 2400MHz DDR4 onboard memory.

¹The computations below are based on our implementation of the Pegasos algorithm.
[Figure 3.1: {f(α̂_k)}_{k=1}^{50} for different combinations (data, algorithm). Panels: (a) Breast Cancer, (b) Heart Failure, (c) Wine Quality, (d) Avila Bible, (e) Magic Telescope, (f) Room Occupancy.]
Remark 4: (i) The objective function values attained by the SCS algorithm are lower than those for the Pegasos algorithm. (ii) The SCS algorithm also converges faster than the Pegasos algorithm. (iii) The beauty of the Pegasos algorithm is that it provides a decision rule without providing a solution to (3.2); this decision rule yields a classifier without any reference to optimality. The SCS algorithm, on the other hand, is truly an optimization algorithm with a stopping rule in which the steps (and stopping) are guided by d_k (via ||d_k|| < ε).
The performance of all three algorithms on different data sets is given in Table 3.1. To make the experiment more persuasive, we divide the data into training and testing parts: we obtain a classifier using the training data and apply it to the testing data to estimate the accuracy. Here, accuracy reflects the fraction of data points that are correctly classified. Since Wolfe's algorithm requires the entire data set, instances with very large data sets are beyond its reach and are reported as N/A in Table 3.1. The accuracy and training time reported in Table 3.1 are averages over 20 independent runs for each (data, algorithm) pair using different random seeds.
[Figure 3.2: {f(α̂_k)}_{k=−50}^{−1} for different combinations (data, algorithm). Panels: (a) Breast Cancer, (b) Heart Failure, (c) Wine Quality, (d) Avila Bible, (e) Magic Telescope, (f) Room Occupancy.]
Table 3.1: Classification performance of algorithms for different data sets

data set        | samples   | features | Wolfe accu. | Wolfe time(s) | Pegasos accu. | Pegasos time(s) | SCS accu. | SCS time(s)
heart failure   | 273       | 13       | 0.833       | 1.48          | 0.833         | 0.75            | 0.833     | 6.12
breast cancer   | 500       | 30       | 0.97        | 2.28          | 0.95          | 1.14            | 0.97      | 4.92
wine quality    | 680       | 11       | 0.87        | 4.19          | 0.855         | 3.93            | 0.86      | 12.49
avila Bible     | 2,000     | 10       | 0.726       | 32.97         | 0.714         | 5.60            | 0.718     | 38.26
magic telescope | 5,000     | 10       | N/A         | N/A           | 0.735         | 25.99           | 0.74      | 33.25
room occupancy  | 7,500     | 18       | N/A         | N/A           | 0.974         | 56.54           | 0.976     | 39.88
swarm behaviour | 20,000    | 2400     | N/A         | N/A           | 0.805         | 67.61           | 0.9       | 13.38
mini-BooNE      | 120,000   | 50       | N/A         | N/A           | 0.79          | 145.86          | 0.81      | 39.57
skin-nonskin    | 200,000   | 3        | N/A         | N/A           | 0.93          | 176.45          | 0.97      | 19.68
Hepmass         | 3,500,000 | 28       | N/A         | N/A           | 0.83          | 3762.89         | 0.87      | 42.69
Remark 5: Conventional wisdom in ML suggests that SFO methods provide the best balance between scalability and accuracy for stochastic optimization problems. However, Table 3.1 shows that this general impression may be questionable and may encourage an over-emphasis on SFO methods. In fact, the SCS algorithm performs better than the Pegasos algorithm both in accuracy and in computational speed, especially for the larger data sets in our experiment. Specifically, for the four largest data sets, the computational times and accuracy reported for the SCS algorithm are superior to those obtained using Pegasos. Thus, the SCS algorithm provides an alternative approach which is quite competitive for kernel SVM problems.
3.5 Conclusion

Our approach leads to a class of online algorithms that go beyond first-order approximations and incorporate stochastic analysis of stopping times in a direction-finding method. Specifically, it includes several features of Wolfe's method (e.g., stability and a good convergence rate) as well as online aspects of Pegasos. This combination promotes improvements on both the deterministic and the stochastic methods previously used for SVM. In contrast to the Pegasos algorithm, we find that the optimization performance of the SCS algorithm is more reliable and provides consistently lower objective values (see Figures 3.1 and 3.2) in out-of-sample validation. It appears that such reliability may be difficult to achieve using first-order methods without additional fine-tuning. Although our experimental results are based on fixed data sets, the SCS approach can be effective for classification of streaming data as well. Finally, we emphasize that although the SCS algorithm in this chapter is specifically designed for kernel SVM, it can be generalized to other non-smooth stochastic convex optimization problems, as well as to constrained problems, using primal-dual methods similar to [60, 56]. We also conjecture that many of the asymptotic properties (e.g., asymptotic normality) of SGD methods may also be established for the SCS algorithm [19].
Appendix

Proof of the optimality condition

In this section, we prove that ||d_k|| < ε implies the ε-optimality condition for f(α̂_k), as stated in Theorem 3.4. The error bound on optimality is related to the sample size as well as to the subgradient of f_k(α̂_k). First of all, we define a few well-known notions:

• d* = −argmin_{g∈∂f(α̂_k)} ||g||: the negative of the smallest-norm subgradient of f at α̂_k.
• d*_k = −argmin_{g∈∂f_k(α̂_k)} ||g||: the negative of the smallest-norm subgradient of f_k at α̂_k.
• d_k = −Nr({g_k, −d_{k−1}}), where g_k ∈ ∂f_k(α̂_k).
• Subdifferential of f at z′: ∂f(z′) = {q : f(z) ≥ f(z′) + qᵀ(z − z′) for all z}.
• ε-subdifferential of f at z′: ∂_ε f(z′) = {q : f(z) ≥ f(z′) + qᵀ(z − z′) − ε for all z}.
• Directional derivative of f at z in direction d: f′(z; d) = max{dᵀq : q ∈ ∂f(z)}.
• ε-directional derivative of f at z in direction d: f′_ε(z; d) = max{dᵀq : q ∈ ∂_ε f(z)}.
• X(z) := −min_{||d||=1} f′(z; d) and X_k(z) := −min_{||d||=1} f′_k(z; d).
• X_ε(z) := −min_{||d||=1} f′_ε(z; d) and d̄_k(z) = −argmin_{||d||=1} f′_ε(z; d).
Lemma 3.6. Let the pair (m₁, m₂) satisfy the requirements of Algorithm 7 (1/4 ≤ m₁ < 1/2 and 1/4 ≤ m₂ < m₁). If k is the smallest index for which ||d_k|| < ε, then ||d*_k|| < 4ε.

Proof. Suppose the claim is false. Then, by the definition of d*_k, we have ||g_k|| ≥ ||d*_k|| ≥ 4ε. Thus,

  ||d_k||² = ||λ*_k g_k − (1−λ*_k)d_{k−1}||²
           = (λ*_k)²||g_k||² + (1−λ*_k)²||d_{k−1}||² − 2λ*_k(1−λ*_k)⟨g_k, d_{k−1}⟩
           ≥ (λ*_k)²||g_k||² + (1−λ*_k)²||d_{k−1}||²,  (3.28)

where the inequality holds because Algorithm 7 ensures condition R in (3.6), so that ⟨g_k, d_{k−1}⟩ < 0. We now divide the analysis into two cases, examined below: (a) ||g_k|| > ||d_{k−1}|| ≥ ε and ||g_k|| ≥ 4ε, and (b) ||d_{k−1}|| ≥ ||g_k|| ≥ 4ε.

(a) In this case, from (3.28) we have

  ||d_k||² ≥ (17(λ*_k)² − 2λ*_k + 1)ε².  (3.29)

Note that

  λ*_k = (⟨g_k, d_{k−1}⟩ + ||g_k||²)/(||d_{k−1}||² + ||g_k||² + 2⟨g_k, d_{k−1}⟩)
       ≥ (−m₂||d_{k−1}||² + ||g_k||²)/(||d_{k−1}||² + ||g_k||²),

where the inequality holds because condition R in (3.6) ensures 0 > ⟨g_k, d_{k−1}⟩ ≥ −m₂||d_{k−1}||². We claim that λ*_k ≥ 2/17, because it suffices to show

  −17m₂||d_{k−1}||² + 17||g_k||² ≥ 2||d_{k−1}||² + 2||g_k||²,

which can be verified by observing that m₂ < m₁ < 1/2 and ||g_k|| > ||d_{k−1}||. Thus, based on (3.29) and λ*_k ≥ 2/17, we have ||d_k|| ≥ ε. This contradicts the assumptions of the lemma.

(b) If ||d_{k−1}|| ≥ ||g_k|| ≥ 4ε, then we have

  ||d_k||² ≥ (32(λ*_k)² − 32λ*_k + 16)ε² ≥ 8ε².

Thus ||d_k|| ≥ 2√2 ε, which also contradicts the assumptions of the lemma.
Lemma 3.7. ||d*|| = X(α̂_k) and ||d*_k|| = X_k(α̂_k).

Proof. According to the definition of X(α̂_k),

  X(α̂_k) = −min_{||d||=1} f′(α̂_k; d) = −min_{||d||=1} max_{q∈∂f(α̂_k)} dᵀq.  (3.30)

Note that the min-max problem (3.30) has a stationary point

  q̄ = argmin_{q∈∂f(α̂_k)} ||q||,  d̄ = −q̄/||q̄||.

Thus, X(α̂_k) = ||q̄|| = ||d*||. Similarly, we can prove that X_k(α̂_k) = ||d*_k||.
Theorem 3.5. If |S_k| ≥ −2L²log(δ)/(ε′)², then P(X_ε(α̂_k) − X_k(α̂_k) ≤ ε′) ≥ 1 − δ.

Proof. By definition, d̄_k(α̂_k) = −argmin_{||d||=1} f′_ε(α̂_k; d). Hence,

  X_ε(α̂_k) − X_k(α̂_k) ≤ (1/|S_k|) Σ_{i=1}^{|S_k|} h′(α̂_k; d̄_k(α̂_k), ω_i) − E[h′_ε(α̂_k; d̄_k(α̂_k), ω)]
    = (1/|S_k|) Σ_{i=1}^{|S_k|} [h′(α̂_k; d̄_k(α̂_k), ω_i) − h′_ε(α̂_k; d̄_k(α̂_k), ω_i)]
    + (1/|S_k|) Σ_{i=1}^{|S_k|} h′_ε(α̂_k; d̄_k(α̂_k), ω_i) − E[h′_ε(α̂_k; d̄_k(α̂_k), ω)].

Note that for each ω_i we have h′(α̂_k; d̄_k(α̂_k), ω_i) ≤ h′_ε(α̂_k; d̄_k(α̂_k), ω_i), since ∂h(α̂_k, ω_i) ⊆ ∂_ε h(α̂_k, ω_i). Also, by Hoeffding's inequality, if |S_k| ≥ −2L²log(δ)/(ε′)², then

  P((1/|S_k|) Σ_{i=1}^{|S_k|} h′_ε(α̂_k; d̄_k(α̂_k), ω_i) − E[h′_ε(α̂_k; d̄_k(α̂_k), ω)] ≤ ε′) ≥ 1 − δ,

which concludes the proof.
Proof. (Theorem 3.4) First, by Theorem 3.5 and Lemma 3.7, if |S_k| ≥ −2L²log(δ)/(ε′)², then with probability at least 1 − δ,

  ||d*|| − ||d*_k|| = lim_{ε→0} X_ε(α̂_k) − X_k(α̂_k) ≤ ε′.

Combining this with Lemma 3.6, we have

  P(||d*|| < 4ε + ε′) ≥ 1 − δ.
Notation Tables

Table 3.2: Definitions of symbols in Section 1

Symbol | Definition
m | Number of samples
S = {(z_i, w_i)}_{i=1}^m | The dataset, with pairs corresponding to features and their associated labels
(Z, W) | The random counterpart of (z, w)
φ | The kernel mapping, which maps the feature z to a higher dimension
β | The decision variable (classifier) for the SVM problem
α | The decision variable for the kernel SVM problem
Q ∈ R^{m×m} | The kernel matrix with Q_{ij} = ⟨φ(z_i), φ(z_j)⟩
Q_i | The i-th row of Q
Q^k | The kernel matrix generated at the k-th iteration
Q^k_i | The i-th row of Q^k
S_k | The dataset used at the k-th iteration
|S_k| | The number of data points in S_k
0 < L < ∞ | The Lipschitz constant for the functions f and f_k
M | A large constant bounding |⟨φ(z), β⟩| for any φ(z) and β
Table 3.3: Definitions of symbols in Section 2

Symbol | Definition
d_k | The search direction at the k-th iteration
g_k | The subgradient of f_k at the k-th iteration
G_k = {−d_{k−1}, g_k} | The set whose convex hull contains all convex combinations of −d_{k−1} and g_k
Nr(G_k) | A function which returns the vector with the smallest norm in the convex hull of G_k
ε | An algorithmic parameter related to the termination criterion (||d_k|| < ε)
δ | The region where the line search is executed
δ_0, δ_min, δ_max | The initial, minimum, and maximum choices of the region
η₁ ∈ (0, 1) | An algorithmic parameter to identify whether f_k and f̂_k have similar decrease
η₂ > 0 | An algorithmic parameter to identify whether ||d_k|| is large enough
γ > 1 | An algorithmic parameter to adjust the line-search region δ
α_k | The candidate solution of f_{k−1}
α̂_k | The optimal (incumbent) solution of f_{k−1}
β_k | The candidate solution of f_k
β̂_k | The optimal (incumbent) solution of f_k
T_k | The dataset used in the k-th iteration, independent of S_k
|T_k| | The number of data points in T_k
1/4 ≤ m₁ < 0.5 | An algorithmic parameter to identify whether f_k has sufficient decrease
m₂ (< m₁) | An algorithmic parameter to identify whether the directional derivative of f_k has sufficient increase
1 < n ∈ Z⁺ | An algorithmic parameter to avoid unlimited line-search iterations
ζ | An algorithmic parameter to bound the line-search region
Table 3.4: Definitions of symbols in Section 3

Symbol | Definition
κ | The κ-approximation parameter (see Definition 3.1)
I_k | The event that f_k satisfies the κ-approximation property
μ₁ | The probability of the event I_k
ε_F | The ε_F-accuracy parameter (see Definition 3.4)
J_k | The event that f̂_k is ε_F-accurate
μ₂ | The probability of the event J_k
T_ε = inf{k > 0 : ||d_k|| < ε} | The first time that ||d_k|| < ε
A_l | The renewal process, which renews once ζδ_l ≥ ε
τ_l = A_l − A_{l−1} | The inter-arrival time
N(k) = max{l : A_l ≤ k} | The counting process of the number of renewals up to the k-th iteration
Chapter 4
Stochastic Conjugate Subgradient Algorithm for Two-stage Stochastic Programming
4.1 Introduction

In this chapter, we focus on solving a two-stage Stochastic Program (SP) with equality constraints:

  min_{x∈X⊆R^{n₁}} f(x) = c(x) + E_ω[h(x, ω)],  (4.1)

where X = {x : Ax = b, x ≥ 0} is a polyhedral set, c(·) is a differentiable convex deterministic function, ω ∈ R^m is a continuous random vector, and h(·, ω) is a non-smooth, convex random function. Here, h(x, ω) is called the recourse function, and its value is usually obtained by solving another constrained optimization problem:

  h(x, ω) ≜ min_y g(y)
           s.t. Dy = ξ(ω) − C(ω)x,
                y ≥ 0, y ∈ R^{n₂},

where g(y) can be either a linear function dᵀy or a convex quadratic function (1/2)yᵀPy + dᵀy. For this reason, we call x and y the first-stage and second-stage decision variables, respectively. For convenience, we let Y(x, ω) ≜ {0 ≤ y ∈ R^{n₂} : Dy = ξ(ω) − C(ω)x} throughout the chapter.
It is well known that one can solve (4.1) by adopting first-order methods such as stochastic gradient descent (SGD) [54, 11, 63] and stochastic decomposition [27, 28], where the subgradient calculations can be carried out very rapidly and in parallel (for a given x). However, first-order methods are provably inaccurate in many cases [39].

We propose a new algorithm that accommodates both the curvature of functions and the decomposition of subgradients by data points, as is common in stochastic programming. The algorithm combines the computational power of the non-smooth conjugate subgradient method [73], sequential sampling, decomposition [27, 28], and the active set method [46] to provide both computational reliability and a well-defined convergence rate. The convergence and convergence rate of the new method are established, with experiments showing the performance of the new algorithm. However, our convergence results are restricted to the linear equality constraints (i.e., Ax = b). A more general model with inequalities (i.e., x ≥ 0) will be included in a revised version.
The structure of this chapter is as follows. We begin by setting our work within the context of the literature in §4.2. In §4.3, the discussion evolves from mathematical concepts to computational algorithms. We discuss the stochastic conjugate subgradient (SCS) algorithm for solving large-scale SP of the form (4.1) via a combination of conjugate gradient descent (CGD) [73, 52] and sample average approximation (SAA), which helps leverage a successive sampling methodology. This combination not only introduces elements of recursive computation but also bypasses the need for constrained SP algorithms, or even the need to solve a large linear system (of equations) rapidly. In §4.4, we present the convergence analysis of SCS by showing that the algorithm guarantees that the objective function values form a sequence of supermartingales which converges in expectation. In §4.5, we provide some technical details about the implementation and present computational results comparing the SGD algorithm, the stochastic mirror descent (SMD) algorithm [45], and the SCS algorithm. In the context of that comparison, we discuss some advantages and disadvantages of adopting the SCS algorithm for specific SPs.
4.2 Connections with the literature

First-order methods such as stochastic gradient descent (Robbins and Monro [54]; Bottou and Bousquet [11]; Shalev-Shwartz [63]) or dual coordinate ascent (Platt [50]; Joachims [32]) are very popular for solving SP. Simple first-order gradient-based methods dominate the field for convincing reasons: low computational cost, simplicity of implementation, and strong empirical results.

For example, SGD finds an ε-suboptimal solution in time O(1/ε). This runtime does not depend on the dimension of the decision variable and is therefore favorable for large-scale SP. However, the SGD approach has several limitations: (a) it lacks a clear stopping criterion; (b) it tends to be overly aggressive at the beginning of the optimization process; and (c) while SGD reaches moderate accuracy quite fast, its convergence becomes rather slow when more accurate solutions are of interest [64].

For dual coordinate ascent, several authors (e.g., Mangasarian and Musicant [42]; Hsieh et al. [30]) proved a linear convergence rate. However, there are two handicaps with dual coordinate ascent. First, the linear convergence parameter may be very close to zero, and the initial, unspecified number of iterations might be very large. Second, the analysis only deals with the sub-optimality of the dual objective, while the real goal is to bound the sub-optimality of the primal objective. Recently, Shalev-Shwartz and Zhang [64] proposed a stochastic version of dual coordinate ascent, in which at each round one coordinate is chosen uniformly at random to optimize. However, such dual algorithms make it difficult to predict the quality of the primal solution even when the duality gap is considered to be small.

Methods beyond first order (BFO) are rarely used in SP, despite having several strengths: faster per-iteration convergence, frequent explicit regularization of step sizes, and better parallelization than first-order methods.

In this chapter, we present a decomposition-based conjugate subgradient algorithm that integrates data-point decomposition, commonly used in stochastic programming, with Wolfe's conjugate subgradient method [73] for non-smooth convex SP. Our computational results demonstrate that the proposed method consistently yields solutions with lower objective function values from an optimization standpoint. Our approach combines non-smooth conjugate gradient descent (CGD) [73, 46], stochastic programming, and the active set method. To facilitate this integration, we modify Wolfe's algorithm from a direction-finding method to a stochastic conjugate subgradient (SCS) algorithm, in which stochastic subgradients provide piecewise linear approximations similar to stochastic decomposition (SD) [28]. Direction finding and line search are then applied to these subgradients, projected onto an active set.
4.3 Stochastic Conjugate Subgradient Algorithm

In this section, decomposition refers to the separate treatment of the piecewise linear (max) function and the quadratic part in (4.1) to find the direction of change for the decision variables. We first present the SCS algorithm (Algorithm 8). The algorithm includes four key components:

• Sequential function approximation. In many cases we have a fixed number of data points. However, our focus is on situations where the data can be queried from an oracle sequentially. As a result, we will not fix the value of m in (4.1). Instead, we use an adaptive sampling approach [41, 65] to approximate the function f by f_k, where

  f_k(x) = c(x) + (1/|S_k|) Σ_{i=1}^{|S_k|} h(x, ω_i).  (4.2)

Using Hoeffding's inequality [29], we can approximate (4.1) by (4.2) to an arbitrarily high level of accuracy as the number of samples increases.
• Direction modification. Because of the linear constraints in (4.2), the subgradient g_k ∈ ∂f_k(x_k) (we discuss how to obtain g_k for specific problems in §4.4) and the previous search direction d̃_{k−1} may be infeasible. Thus, we modify the subgradient g_k and the search direction d̃_{k−1} to ensure that they are feasible search directions. Specifically, let Z be a basis for the null space of A (so that AZ = 0); the modified directions are given by

  g̃_k = ZZᵀg_k and d̃_{k−1} = ZZᵀd̃_{k−1}.  (4.3)

(A Python sketch of this projection appears after this list.)
• Conjugate direction finding. The idea is inspired by Wolfe's non-smooth conjugate subgradient method, which uses the smallest-norm convex combination of the previous modified search direction d̃_{k−1} and the current modified subgradient g̃_k. More specifically, we first solve the following one-dimensional QP:

  λ*_k = argmin_{λ∈[0,1]} (1/2)||λ·(−d̃_{k−1}) + (1−λ)·g̃_k||².  (4.4)

Then we set the new search direction as

  d̃_k = −[λ*_k·(−d̃_{k−1}) + (1−λ*_k)·g̃_k] := −Nr(G̃_k),  (4.5)

where G̃_k = {−d̃_{k−1}, g̃_k} and λ*_k denotes the optimal weight for the convex combination. Clearly, if one fixes λ = 0, then the search direction reduces to that of the projected subgradient method.
• Choice of step size. This is achieved by Wolfe's line-search method [73]. Let g(t) ∈ ∂f_k(x_k + t·d̃_k) and g̃(t) = ZZᵀg(t). Define the intervals L and R:

  L = {t > 0 : f_k(x_k + t·d̃_k) − f_k(x_k) ≤ −m₂||d̃_k||²t},
  R = {t > 0 : 0 > ⟨g̃(t), d̃_k⟩ ≥ −m₁||d̃_k||²},  (4.6)

where m₂ < m₁ < 1. The output of the step size satisfies two metrics: (i) the set L contains points which sufficiently reduce the objective function value; (ii) the set R contains points for which the directional derivative estimate is improved. The algorithm seeks points belonging to L ∩ R. Note that replacing d̃_k with an arbitrary modified subgradient g̃_k = ZZᵀg_k may not ensure sufficient descent unless the function f_k is smooth.
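A minimal sketch of the feasible-direction projection in (4.3), using an orthonormal null-space basis computed by SVD (scipy also offers scipy.linalg.null_space); the function and variable names are ours.

```python
import numpy as np

def null_space_basis(A, tol=1e-10):
    # Columns of Z form an orthonormal basis of {v : A v = 0}.
    _, s, Vt = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    return Vt[rank:].T

A = np.array([[1.0, 1.0, 1.0]])       # single equality constraint: x1 + x2 + x3 = b
Z = null_space_basis(A)
g = np.array([3.0, -1.0, 2.0])        # a subgradient g_k
g_tilde = Z @ (Z.T @ g)               # projected direction as in (4.3)
print(np.allclose(A @ g_tilde, 0.0))  # True: moving along g_tilde preserves Ax = b
```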
Algorithm 8: Stochastic Conjugate Subgradient (SCS) Algorithm

1.  Set ε > 0, τ > 0, δ_0, η₁ > 1, η₂ > 0, γ > 1, and k ← 0.
2.  Obtain the null-space basis Z of A.
3.  Randomly generate S_0 samples from the data set.
4.  Set a feasible solution x̂_0 ∈ X and an initial direction d_0 ∈ R^{n₁}.
5.  Define f_0(x) = c(x) + (1/|S_0|) Σ_{i=1}^{|S_0|} h(x, ω_i).
6.  while ||d_k|| > ε do
7.     k ← k + 1;
8.     Obtain g_k ∈ ∂f_{k−1}(x̂_k);
9.     Modify the subgradient g_k and the direction d̃_{k−1} by (4.3);
10.    Calculate the new search direction d̃_k by (4.4) and (4.5);
11.    Apply Algorithm 7 to find the step size t_k and set x_{k+1} = x_k + t_k d̃_k;
12.    Randomly generate a set of new samples s_k and let S_k ≜ S_{k−1} ∪ s_k;
13.    Construct f_k(x) = c(x) + (1/|S_k|) Σ_{i=1}^{|S_k|} h(x, ω_i);
14.    Randomly generate a set of new samples T_k of cardinality |S_k|, independent of S_k;
15.    Construct f̂_k(x) = c(x) + (1/|T_k|) Σ_{i=1}^{|T_k|} h(x, ω_i);
16.    if f_k(x_k) − f_k(x̂_{k−1}) ≤ η₁(f̂_k(x_k) − f̂_k(x̂_{k−1})) and ||d_k|| > η₂δ_k then
17.       x̂_k ← x_k, δ_k ← min{γδ_{k−1}, ε/ξ};
18.    else
19.       x̂_k ← x̂_{k−1}, δ_k ← δ_{k−1}/γ;
20.    end
21. end
Remark: (i) The initial choice of the maximal region δ_0 may require some tuning depending on the dataset, but it does not affect the convergence properties of this algorithm. (ii) Based on Theorem 4.1 in §4.5, S_k should be chosen such that

  |S_k| ≥ −8log(ε/2)·(M−m)²/(κ²δ⁴_k).

(iii) According to Theorem 3.2, ξ = max{4nκ, η₂, 4κ/(C₁(1−η₁))}, where C₁ = m₁/(2n).
4.4 SCS for Two-stage Stochastic Quadratic Programming

In this section, we use the SCS algorithm to solve (4.1) when the first stage is a constrained quadratic program and the second stage is a constrained linear program (SQLP), and when both the first stage and the second stage are constrained quadratic programs (SQQP). In the mathematical formulations, we use the subscripts QL and QQ to identify SQLP and SQQP problems, respectively. Let x and y denote the first-stage and second-stage decision variables, with x belonging to the set X ⊆ R^{n₁} and y belonging to a polyhedron in R^{n₂}. The mathematical formulation of a two-stage SQLP is given below:

  min f(x) ≜ (1/2)xᵀQx + cᵀx + E[h(x, ω)]
  s.t. x ∈ X = {x : Ax = b} ⊆ R^{n₁},

where h(x, ω) is defined as

  h(x, ω) = h_QL(x, ω) ≜ min dᵀy
            s.t. Dy = ξ(ω) − C(ω)x,
                 y ≥ 0, y ∈ R^{n₂}.  (4.7)

Here A ∈ R^{m₁×n₁} and D ∈ R^{m₂×n₂} are deterministic matrices. In addition, ξ(ω) denotes a random vector, C(ω) denotes a random matrix, and E[·] denotes the expectation with respect to the probability measure of ω. Finally, we assume that the second-stage cost vector d is fixed and that Q is a positive definite matrix in SQLP/SQQP problems. A specific example of an SQLP problem is the kernel SVM problem, where h_QL(x, ω) ≜ max{0, 1 − ω⟨x, Q_i⟩} is the hinge-loss function.
To obtain a subgradient of (4.7), we derive its dual at a specific $\bar{x}$ [38] as follows:
$$h_{QL}(\bar{x},\omega) = \max\; \pi^\top\big(\xi(\omega) - C(\omega)\bar{x}\big) \quad\text{s.t.}\;\; \pi \in \Pi = \{\pi : D^\top\pi \le d\}.$$
Let
$$\pi^* = \operatorname*{argmax}_{\pi\in\Pi}\; \pi^\top\big(\xi(\omega) - C(\omega)\bar{x}\big).$$
Then a subgradient $v_h \in \partial_x h(\bar{x},\omega)$ is
$$v_h = -(\pi^*)^\top C(\omega).$$
As for a two-stage SQQP, the mathematical formulation is:
$$\min\; f(x) \triangleq \frac{1}{2}x^\top Q x + c^\top x + \mathbb{E}[h(x,\omega)] \quad\text{s.t.}\;\; x \in X = \{x : Ax = b\} \subseteq \mathbb{R}^{n_1},$$
where $h(x,\omega)$ is defined as
$$h(x,\omega) = h_{QQ}(x,\omega) \triangleq \min_{0\le y\in\mathbb{R}^{n_2}}\; \frac{1}{2}y^\top P y + d^\top y \quad\text{s.t.}\;\; Dy = \xi(\omega) - C(\omega)x. \tag{4.8}$$
To obtain a subgradient of (4.8), we derive its dual at a specific $\bar{x}$ [38] as follows:
$$h_{QQ}(x,\omega) = \max_{0\le s\in\mathbb{R}^{n_2}}\; -\frac{1}{2}s^\top H s + e(x,\omega)^\top s, \tag{4.9}$$
where
$$M = DP^{-1/2}, \qquad H = P^{-1/2}\big(I - M^\top(MM^\top)^{-1}M\big)P^{-1/2},$$
$$e(x,\omega) = Hd - P^{-1/2}M^\top(MM^\top)^{-1}\big(\xi(\omega) - C(\omega)\bar{x}\big).$$
Let
$$s^* = \operatorname*{argmax}_{0\le s\in\mathbb{R}^{n_2}}\; -\frac{1}{2}s^\top H s + e(\bar{x},\omega)^\top s = \mathrm{Proj}_{s\ge 0}\{s = H^{-1}e(\bar{x},\omega)\}.$$
Then a subgradient $g_h \in \partial_x h(\bar{x},\omega)$ is
$$g_h = C(\omega)^\top(MM^\top)^{-1}MP^{-1/2}\,s^*.$$
Given this subgradient of $h(x,\omega)$, it follows that a subgradient of $f(x)$ is
$$g = Qx + c + g_h.$$
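The closed forms above translate directly into code. The Julia sketch below is illustrative rather than definitive: the placement of the $M^\top$ factor and the transpose ordering in $g_h$ are supplied here so that the matrix shapes are consistent and should be treated as assumptions to be cross-checked against the derivation in [38]; also, $H$ is singular by construction, so a pseudo-inverse stands in for $H^{-1}$:

```julia
using LinearAlgebra

# Illustrative evaluation of the SQQP subgradient for one scenario.
# Shapes: P (n2×n2, SPD), D (m2×n2), d (n2), Q (n1×n1), c (n1),
# ξ (m2), C (m2×n1), x̄ (n1).
function sqqp_subgradient(P, D, d, Q, c, ξ, C, x̄)
    Pmh = inv(sqrt(Symmetric(P)))                 # P^{-1/2}
    M   = D * Pmh
    MMi = inv(M * M')
    H   = Pmh * (I - M' * MMi * M) * Pmh
    e   = H * d - Pmh * M' * MMi * (ξ - C * x̄)    # e(x̄, ω)
    s_star = max.(0, pinv(H) * e)                 # Proj_{s≥0}{H⁻¹ e}; pinv since H is singular
    g_h = C' * MMi * M * Pmh * s_star             # g_h ∈ ∂ₓh(x̄, ω)
    return Q * x̄ + c + g_h                        # g ∈ ∂f(x̄)
end
```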
4.5 Convergence Properties

The following assumptions, which are common for two-stage SLP models, will be imposed throughout the convergence rate analysis.

• A1: The optimization problem (4.1) is convex and finite-valued. Thus, extended real-valued functions are not permissible in our setting. As a consequence, the function $f$ has a Lipschitz constant $0 < L < \infty$ and the subdifferential is non-empty at every $x$.

• A2: The relatively complete recourse assumption holds; i.e., for every $\omega_i$ and $x \in X$, the second-stage problem is bounded: $m \le h(x,\omega_i) \le M$.
4.5.1 Sample Complexity Analysis

Theorem 4.1. Given any $0 < \varepsilon < 1$, there exists a constant $\kappa > 0$ such that when
$$|S_k| \ge -8\log(\varepsilon/2)\cdot\frac{(M-m)^2}{\kappa^2\delta_k^4}, \tag{4.10}$$
we have
$$\Pr\big(|f_k(x) - f(x)| \le \kappa\delta_k^2,\;\; \forall x\in B(\hat{x}_k,\delta_k)\big) \ge 1-\varepsilon.$$
Proof. First, for the fixed point $\hat{x}_k$, consider the following identities:
$$f(\hat{x}_k) = c(\hat{x}_k) + \mathbb{E}_\omega[h(\hat{x}_k,\omega)], \qquad f_k(\hat{x}_k) = c(\hat{x}_k) + \frac{1}{|S_k|}\sum_{i=1}^{|S_k|} h(\hat{x}_k,\omega_i).$$
From A2 we have $m \le h(x,\omega_i) \le M$. Applying Hoeffding's inequality,
$$\Pr\Big(|f_k(\hat{x}_k) - f(\hat{x}_k)| \le \tfrac{1}{2}\kappa\delta_k^2\Big) \ge 1 - 2\exp\Big(-\frac{\kappa^2\delta_k^4|S_k|^2}{8\cdot|S_k|\cdot(M-m)^2}\Big) \ge 1-\varepsilon,$$
which indicates that when
$$|S_k| \ge -8\log(\varepsilon/2)\cdot\frac{(M-m)^2}{\kappa^2\delta_k^4},$$
we have
$$\Pr\Big(|f_k(\hat{x}_k) - f(\hat{x}_k)| \le \tfrac{1}{2}\kappa\delta_k^2\Big) \ge 1-\varepsilon.$$
For any other $x\in B(\hat{x}_k,\delta_k)$, if $|f_k(\hat{x}_k) - f(\hat{x}_k)| \le \tfrac{1}{2}\kappa\delta_k^2$, then
$$|f_k(x) - f(x)| \le |f_k(x) - f_k(\hat{x}_k)| + |f_k(\hat{x}_k) - f(\hat{x}_k)| + |f(\hat{x}_k) - f(x)| \le 2L\delta_k + \tfrac{1}{2}\kappa\delta_k^2 \le \kappa\delta_k^2,$$
where the second-to-last inequality follows from the Lipschitz continuity of $f_k$ and $f$ (assumption A1), and the last inequality holds because $\kappa > \frac{4L}{\delta_{\min}} > \frac{4L}{\delta_k}$. Thus, we conclude that if $|S_k|$ satisfies (4.10), then
$$\Pr\big(|f_k(x) - f(x)| \le \kappa\delta_k^2,\;\; \forall x\in B(\hat{x}_k,\delta_k)\big) \ge 1-\varepsilon. \qquad\square$$
4.5.2 Feasible Direction and Sufficient Decrease

Theorem 4.2. In each step of Algorithm 8, $\tilde{d}_k$ is a feasible direction.

Proof. The proof is well-documented in the literature on linearly constrained optimization (e.g., Nocedal and Wright [46]). $\square$
Theorem 4.3. In each step of Algorithm 8, under assumption A1, there exists $t_k > 0$ such that
$$f_k(x_k + t_k\tilde{d}_k) - f_k(x_k) \le -m_2 t_k\|\tilde{d}_k\|^2. \tag{4.11}$$
Proof. First, since $-\tilde{d}_k$ is the smallest-norm element of the convex hull of $\{\tilde{g}_i\}_{i=1}^k$, we have $\langle\tilde{d}_k,\tilde{g}_i\rangle \le -\|\tilde{d}_k\|^2$; in particular,
$$\|\tilde{d}_k\|^2 \le -\langle\tilde{d}_k,\tilde{g}_k\rangle. \tag{4.12}$$
We will then divide the proof into two cases.
Case 1: There exists $t_k$ such that the $R$ condition is satisfied, i.e.,
$$\langle\tilde{g}(t_k),\tilde{d}_k\rangle \ge -m_1\|\tilde{d}_k\|^2.$$
According to (4.12), we can equivalently write
$$\langle\tilde{g}(t_k),\tilde{d}_k\rangle - \langle\tilde{d}_k,\tilde{g}_k\rangle \ge \langle\tilde{g}(t_k),\tilde{d}_k\rangle + \|\tilde{d}_k\|^2 \ge (1-m_1)\|\tilde{d}_k\|^2.$$
Then, by the Lipschitz continuity in assumption A1,
$$(1-m_1)\|\tilde{d}_k\|^2 \le \langle\tilde{g}(t_k),\tilde{d}_k\rangle - \langle\tilde{d}_k,\tilde{g}_k\rangle \le t_k L\|\tilde{d}_k\|^2.$$
Thus, $t_k \ge (1-m_1)/L$.
Case 2: The $t_k$ from Case 1 violates some of the constraints $l \notin J_k$. Then, as $t_k$ increases from 0, we can find $l_k$ such that
$$a_{l_k}^\top(x_k + t_k\tilde{d}_k) = b_{l_k}.$$
Then,
$$t_k\|a_{l_k}\|\cdot\|\tilde{d}_k\| \ge t_k|a_{l_k}^\top\tilde{d}_k| = |a_{l_k}^\top x_k - b_{l_k}| \ge \tau.$$
Thus,
$$t_k \ge \frac{\tau}{\|a_{l_k}\|\cdot\|\tilde{d}_k\|}. \qquad\square$$
4.5.3 Convergence Rate and Optimality Condition

The convergence rate analysis is similar to that in Chapter 3. For the sake of completeness, we list the key theorems here; interested readers can refer to Chapter 3 for the detailed proofs.

Theorem 4.4. Let $T_\varepsilon = \inf\{k \ge 0 : \|\tilde{d}_k\| < \varepsilon\}$. Then $T_\varepsilon$ is a stopping time for the stochastic process $\hat{X}_k$ and
$$\mathbb{E}[T_\varepsilon] \le \frac{p}{2p-1}\Big(\frac{2\zeta F_{\max}}{\Theta\varepsilon^2} + 2\Big). \tag{4.13}$$

Theorem 4.5 demonstrates that if $\|\tilde{d}_k\| < \varepsilon$, the $\varepsilon$-optimality condition for $f(x)$ will be satisfied. The error bound on this optimality is linked to both the sample size and the subgradient of $f_k(x)$.

Theorem 4.5. Let $\tilde{d}^* = -\operatorname*{argmin}_{\tilde{g}=ZZ^\top g,\, g\in\partial f(\hat{x}_k)}\|\tilde{g}\|$. If $k$ is the smallest index for which $\|\tilde{d}_k\| < \varepsilon$ and $|S_k| \ge -\frac{2L^2\log\delta}{(\varepsilon')^2}$, then
$$\Pr\big(\|\tilde{d}^*\| < 4\varepsilon + \varepsilon'\big) \ge 1-\delta.$$
To prove the theorem, we first define some notation.
• $\tilde{d}^* = -\operatorname*{argmin}_{\tilde{g}=ZZ^\top g,\, g\in\partial f(\hat{x}_k)}\|\tilde{g}\|$: the negative of the smallest-norm subgradient of $f$ at $\hat{x}_k$, projected onto the null space $Z$.
• $\tilde{d}_k^* = -\operatorname*{argmin}_{\tilde{g}_k=ZZ^\top g,\, g\in\partial f_k(\hat{x}_k)}\|\tilde{g}_k\|$: the negative of the smallest-norm subgradient of $f_k$ at $\hat{x}_k$, projected onto the null space $Z$.
• $d_k = -\mathrm{Nr}(\{g_k, -d_{k-1}\})$, where $g_k \in \partial f_k(\hat{x}_k)$.
• Subdifferential of $f$ at $z'$: $\partial f(z') = \{q : f(z) \ge f(z') + q^\top(z-z')\;\;\forall z\}$.
• $\varepsilon$-subdifferential of $f$ at $z'$: $\partial_\varepsilon f(z') = \{q : f(z) \ge f(z') + q^\top(z-z') - \varepsilon\;\;\forall z\}$.
• Directional derivative of $f$ at $z$ in direction $d$: $f'(z;d) = \max\{d^\top q : q \in \partial f(z)\}$.
• $\varepsilon$-directional derivative of $f$ at $z$ in direction $d$: $f'_\varepsilon(z;d) = \max\{d^\top q : q \in \partial_\varepsilon f(z)\}$.
• $\tilde{X}(z) := -\min_{\tilde{d}=ZZ^\top d,\,\|\tilde{d}\|=1} f'(z;\tilde{d})$ and $\tilde{X}_k(z) := -\min_{\tilde{d}=ZZ^\top d,\,\|\tilde{d}\|=1} f'_k(z;\tilde{d})$.
• $\tilde{X}_\varepsilon(z) := -\min_{\tilde{d}=ZZ^\top d,\,\|\tilde{d}\|=1} f'_\varepsilon(z;\tilde{d})$ and $\bar{d}_k(z) = -\operatorname*{argmin}_{\tilde{d}=ZZ^\top d,\,\|\tilde{d}\|=1} f'_\varepsilon(z;\tilde{d})$.
Lemma 4.1. Let the tuple $(m_1,m_2)$ satisfy the requirements in Algorithm 7 ($\tfrac{1}{4} \le m_1 < \tfrac{1}{2}$ and $\tfrac{1}{4} \le m_2 < m_1$). If $k$ is the smallest index for which $\|\tilde{d}_k\| < \varepsilon$, then $\|\tilde{d}_k^*\| < 4\varepsilon$.
Proof. Suppose the claim is false. Then, by the definition of $\tilde{d}_k^*$, we have $\|\tilde{g}_k\| \ge \|\tilde{d}_k^*\| \ge 4\varepsilon$. Thus,
$$\begin{aligned}
\|\tilde{d}_k\|^2 &= \|\lambda_k^* ZZ^\top g_k - (1-\lambda_k^*)ZZ^\top d_{k-1}\|^2 \\
&= (\lambda_k^*)^2\|\tilde{g}_k\|^2 + (1-\lambda_k^*)^2\|\tilde{d}_{k-1}\|^2 - 2\lambda_k^*(1-\lambda_k^*)\langle\tilde{g}_k,\tilde{d}_{k-1}\rangle \\
&\ge (\lambda_k^*)^2\|\tilde{g}_k\|^2 + (1-\lambda_k^*)^2\|\tilde{d}_{k-1}\|^2,
\end{aligned} \tag{4.14}$$
where the inequality holds because Algorithm 7 ensures condition $R$ in (4.6), so that $\langle\tilde{g}_k,\tilde{d}_{k-1}\rangle < 0$.
We now divide the analysis into two cases: (a) $\|\tilde{g}_k\| > \|\tilde{d}_{k-1}\| \ge \varepsilon$ and $\|\tilde{g}_k\| \ge 4\varepsilon$; and (b) $\|\tilde{d}_{k-1}\| \ge \|\tilde{g}_k\| \ge 4\varepsilon$.
(a) In this case, from (4.14) we have
$$\|\tilde{d}_k\|^2 \ge \big[17(\lambda_k^*)^2 - 2\lambda_k^* + 1\big]\varepsilon^2. \tag{4.15}$$
Note that
$$\lambda_k^* = \frac{\langle\tilde{g}_k,\tilde{d}_{k-1}\rangle + \|\tilde{g}_k\|^2}{\|\tilde{d}_{k-1}\|^2 + \|\tilde{g}_k\|^2 + 2\langle\tilde{g}_k,\tilde{d}_{k-1}\rangle} \ge \frac{-m_2\|\tilde{d}_{k-1}\|^2 + \|\tilde{g}_k\|^2}{\|\tilde{d}_{k-1}\|^2 + \|\tilde{g}_k\|^2},$$
where the inequality holds because the $R$ condition in (4.6) ensures $0 > \langle\tilde{g}_k,\tilde{d}_{k-1}\rangle \ge -m_2\|\tilde{d}_{k-1}\|^2$. Thus, we claim that $\lambda_k^* \ge \frac{2}{17}$, because it suffices to show
$$-17m_2\|\tilde{d}_{k-1}\|^2 + 17\|\tilde{g}_k\|^2 \ge 2\|\tilde{d}_{k-1}\|^2 + 2\|\tilde{g}_k\|^2.$$
This can be verified by observing that $m_2 < m_1 < \frac{1}{2}$ and $\|\tilde{g}_k\| > \|\tilde{d}_{k-1}\|$. Thus, based on (4.15) and $\lambda_k^* \ge \frac{2}{17}$, we have $\|\tilde{d}_k\| \ge \varepsilon$, which contradicts the assumptions of the lemma.
(b) If $\|\tilde{d}_{k-1}\| \ge \|\tilde{g}_k\| \ge 4\varepsilon$, then we have
$$\|\tilde{d}_k\|^2 \ge \big[32(\lambda_k^*)^2 - 32\lambda_k^* + 16\big]\varepsilon^2 \ge 8\varepsilon^2.$$
Thus, $\|\tilde{d}_k\| \ge 2\sqrt{2}\,\varepsilon$, which also contradicts the assumptions of the lemma. $\square$
Lemma 4.2. $\|\tilde{d}^*\| = \tilde{X}(\hat{x}_k)$ and $\|\tilde{d}_k^*\| = \tilde{X}_k(\hat{x}_k)$.

Proof. According to the definition of $\tilde{X}(\hat{x}_k)$,
$$\tilde{X}(\hat{x}_k) = -\min_{\tilde{d}=ZZ^\top d,\,\|\tilde{d}\|=1} f'(\hat{x}_k;\tilde{d}) = -\min_{\tilde{d}=ZZ^\top d,\,\|\tilde{d}\|=1}\;\max_{q\in\partial f(\hat{x}_k)}\tilde{d}^\top q. \tag{4.16}$$
Note that the min–max problem (4.16) has a stationary point
$$\bar{q} = \operatorname*{argmin}_{\tilde{q}=ZZ^\top q,\, q\in\partial f(\hat{x}_k)}\|\tilde{q}\|, \qquad \bar{d} = -\frac{\bar{q}}{\|\bar{q}\|}.$$
Thus, $\tilde{X}(\hat{x}_k) = \|\bar{q}\| = \|\tilde{d}^*\|$. Similarly, we can also prove that $\tilde{X}_k(\hat{x}_k) = \|\tilde{d}_k^*\|$. $\square$
Theorem 4.6. If $|S_k| \ge -\frac{2L^2\log\delta}{(\varepsilon')^2}$, then $\Pr\big(\tilde{X}_\varepsilon(\hat{x}_k) - \tilde{X}_k(\hat{x}_k) \le \varepsilon'\big) \ge 1-\delta$.
Proof. By definition, $\bar{d}_k(\hat{x}_k) = -\operatorname*{argmin}_{\tilde{d}=ZZ^\top d,\,\|\tilde{d}\|=1} f'_\varepsilon(\hat{x}_k;\tilde{d})$. Hence,
$$\begin{aligned}
\tilde{X}_\varepsilon(\hat{x}_k) - \tilde{X}_k(\hat{x}_k) &\le \frac{1}{|S_k|}\sum_{i=1}^{|S_k|} h'(\hat{x}_k;\bar{d}_k(\hat{x}_k),\omega_i) - \mathbb{E}[h'_\varepsilon(\hat{x}_k;\bar{d}_k(\hat{x}_k),\omega)] \\
&= \frac{1}{|S_k|}\sum_{i=1}^{|S_k|}\big[h'(\hat{x}_k;\bar{d}_k(\hat{x}_k),\omega_i) - h'_\varepsilon(\hat{x}_k;\bar{d}_k(\hat{x}_k),\omega_i)\big] \\
&\quad + \frac{1}{|S_k|}\sum_{i=1}^{|S_k|} h'_\varepsilon(\hat{x}_k;\bar{d}_k(\hat{x}_k),\omega_i) - \mathbb{E}[h'_\varepsilon(\hat{x}_k;\bar{d}_k(\hat{x}_k),\omega)].
\end{aligned}$$
Note that for each $\omega_i$ we have $h'(\hat{x}_k;\bar{d}_k(\hat{x}_k),\omega_i) \le h'_\varepsilon(\hat{x}_k;\bar{d}_k(\hat{x}_k),\omega_i)$, since $\partial h(\hat{x}_k,\omega_i) \subseteq \partial_\varepsilon h(\hat{x}_k,\omega_i)$. Also, by Hoeffding's inequality, if $|S_k| \ge -\frac{2L^2\log\delta}{(\varepsilon')^2}$, then
$$\Pr\Big(\frac{1}{|S_k|}\sum_{i=1}^{|S_k|} h'_\varepsilon(\hat{x}_k;\bar{d}_k(\hat{x}_k),\omega_i) - \mathbb{E}[h'_\varepsilon(\hat{x}_k;\bar{d}_k(\hat{x}_k),\omega)] \le \varepsilon'\Big) \ge 1-\delta,$$
which concludes the proof. $\square$
Proof (Theorem 4.5). First, by Theorem 4.6 and Lemma 4.2, if $|S_k| \ge -\frac{2L^2\log\delta}{(\varepsilon')^2}$, then with probability at least $1-\delta$,
$$\|\tilde{d}^*\| - \|\tilde{d}_k^*\| = \lim_{\varepsilon\to 0}\big(\tilde{X}_\varepsilon(\hat{x}_k) - \tilde{X}_k(\hat{x}_k)\big) \le \varepsilon'.$$
Combining this with Lemma 4.1, we have
$$\Pr\big(\|\tilde{d}^*\| < 4\varepsilon + \varepsilon'\big) \ge 1-\delta. \qquad\square$$
4.6 Preliminary Computational Results

Our computational experiments are based on data sets available at the USC 3D Lab¹. Specifically, these problems are two-stage stochastic linear programming problems, and we have tested the LandS3, lgsc, pgp2, and ssn datasets. The study considers three different methods: the SGD algorithm with projection [54], stochastic mirror descent (SMD) [45], and the SCS algorithm. We track the decrease in objective values with respect to the number of iterations for all algorithms. The figures also include the 95% upper and lower bounds of the objective values obtained from SD [27, 28] as a benchmark. The algorithms were implemented on a MacBook Pro 2023 with a 2.6GHz 12-core Apple M3 Pro processor and 32GB of 2400MHz DDR4 onboard memory. The code used in this research is available at the following GitHub repository: SCS for SP (accessed on August 24, 2024). The computational results are shown in Figures 4.1 and 4.2.

¹https://github.com/USC3DLAB/SD
Figure 4.1: $\{f(\hat{x}_k)\}_{k=1}^{50}$ for different combinations (data, algorithm). Panels: (a) LandS3, (b) lgsc, (c) pgp2, (d) ssn.
Remarks: (i) The upper and lower bounds are derived from SD, which requires significantly more time to compute than the other algorithms. However, because these bounds are obtained through multiple replications, they provide a highly reliable estimate of the objective value, making them an excellent benchmark. (ii) Regarding the objective function value, SCS consistently yields lower values than the SGD and SMD algorithms, and it also converges faster. (iii) The SCS algorithm follows the principles of optimization, featuring a line-search procedure that automatically determines the step size. In contrast, other algorithms, such as SMD, require prior knowledge of a bound on the subgradient norm to set the step size, which is often impractical in many real-world applications.

Figure 4.2: $\{f(\hat{x}_k)\}_{k=-50}^{-1}$ (the final 50 iterations) for different combinations (data, algorithm). Panels: (a) LandS3, (b) lgsc, (c) pgp2, (d) ssn.
4.7 Conclusion

Our approach leads to a class of online algorithms that go beyond first-order approximations. It incorporates several features of Wolfe's method (e.g., stability and a good convergence rate) as well as the online aspects of SGD, which promote good computational times. In contrast to the SGD algorithm, we find that the optimization performance of the SCS algorithm is more reliable and consistently provides lower objective values (see Figure 4.2). It appears that such reliability may be difficult to achieve using first-order methods without additional fine-tuning effort.
Chapter 5

A Sampling-based Progressive Hedging Algorithm for Stochastic Programming

The progressive hedging (PH) algorithm is a cornerstone among algorithms for large-scale stochastic programming problems. However, its traditional implementation is hindered by several limitations, including the requirement to solve all scenario subproblems in each iteration, reliance on an explicit probability distribution, and a convergence process that is highly sensitive to the choice of penalty parameters. This chapter introduces a sampling-based PH algorithm, which aims to overcome these limitations. Our approach employs a dynamic selection process for the number of scenario subproblems solved per iteration. It incorporates adaptive sequential sampling to determine sample sizes, a stochastic conjugate subgradient method for direction finding, and a line-search technique to update the dual variables. Experimental results demonstrate that this novel algorithm not only addresses the bottlenecks of the conventional PH algorithm but also potentially surpasses its scalability, representing a substantial improvement in the field of stochastic programming.
5.1 Introduction

5.1.1 Problem setup

In this chapter, we focus on solving stochastic programming (SP) problems stated as follows:
$$\min_{x\in X\subseteq\mathbb{R}^n} f(x) = \mathbb{E}_{\tilde{\omega}}[h(x,\tilde{\omega})], \tag{5.1}$$
where $\tilde{\omega}\in\mathbb{R}^m$ denotes a random vector, and $h(\cdot,\tilde{\omega})$ is a finite-valued convex function in $x$ for any particular realization $\tilde{\omega}=\omega$. Except in very simplistic situations (e.g., when $\tilde{\omega}$ has a small number of outcomes), solving the optimization problem (5.1) using deterministic algorithms is infeasible. To overcome challenges arising from efforts to solve realistic instances, the focus often shifts towards solving a sample average approximation (SAA) problem. In such an approach, we create an approximation by using $|S|$ i.i.d. realizations of $\tilde{\omega}$ called scenarios, denoted $\omega^1,\ldots,\omega^{|S|}$. In this setting, (5.1) can be reformulated as
$$\min_{x\in X} f(x) = \frac{1}{|S|}\sum_{s=1}^{|S|} h(x,\omega^s). \tag{5.2}$$
Many algorithms have been proposed to find an optimal solution of (5.2); one of the more popular among them is the progressive hedging (PH) algorithm [55]. To lay the groundwork, we commence by reviewing the classic PH algorithm.
5.1.2 Classic progressive hedging algorithm

The PH algorithm is a version of the alternating direction method of multipliers (ADMM) [12] that has attracted significant attention in the stochastic programming community and is extensively utilized for addressing scenario-based uncertainty in large-scale optimization problems. Originally conceptualized by Rockafellar and Wets in 1991 [55], the algorithm has evolved significantly, finding applications in diverse fields such as finance, energy, and logistics [71]. The convergence properties of the PH algorithm have been rigorously analyzed since its inception. Rockafellar and Wets provided the initial proof of convergence for convex problems. Further research, such as that by Ruszczyński and Shapiro [57], explored the conditions under which the PH algorithm converges, including scenarios involving non-convexities and other complexities. However, despite its widespread popularity, the classic PH algorithm remains a deterministic algorithm, which immediately hinders its applicability to broader stochastic optimization problems — for instance, applications where the uncertainty can only be sampled via simulation, or where the sample space is so vast that it is unrealistic to expect knowledge of all possible outcomes associated with the data. More specifically, the limitations include:

• First, in contrast to several SP algorithms, notably stochastic gradient descent (SGD), the classic PH algorithm necessitates the resolution of all subproblems during each iteration. This requirement constitutes a significant computational bottleneck, substantially impeding its scalability and performance.

• Second, the convergence criteria of the PH algorithm are predicated upon the accessibility and knowledge of the probability distribution of the random variables involved. This requirement severely restricts the algorithm's utility in situations where such a distribution remains unknown or is difficult to ascertain, thereby limiting its range of applications.

• Lastly, the algorithm employs a dual update rule which is fundamentally gradient-based and accurate solely at the current dual point. This approach renders the convergence process highly dependent on the appropriate selection of the penalty parameter, which acts as a step size in the dual update. The performance of the algorithm therefore hinges on the specification of the subgradient and the penalty parameter, further complicating its practical performance. (A sketch of the classic iteration follows this list.)
For instances which present such challenges, an adaptive sampling algorithm may be more suitable, because such methods make it possible to create approximations which sequentially adapt to the data revealed during the course of the algorithm. In this chapter, we introduce an innovative sampling-based PH algorithm which adaptively samples a subset of scenario subproblems in each iteration, facilitating dynamic scenario updates to enhance computational efficiency.
5.1.3 Adaptive sampling

The classic PH algorithm relies on knowledge of the probability distribution of the random variables. When the probability distribution is not available, the PH algorithm typically employs sample average approximation (SAA) [67], which uses a fixed sample size to estimate these uncertainties. However, it is difficult to determine the sample size [59, 21] when applying such a method to the PH algorithm, and convergence properties have only been established for cases with a finite number of outcomes. While adaptive sampling [41] has emerged as a powerful approach for instances with a large number of outcomes (especially for multi-dimensional uncertainty), such adaptive sampling has remained elusive in the context of PH. Adaptive sampling dynamically adjusts the sample size based on ongoing optimization progress, providing a more resource-efficient and precise solution. Some effort has been made to combine adaptive sampling with the PH algorithm. For instance, [3] provides randomized PH methods which are able to produce new iterates as soon as a single scenario subproblem is solved, and [1] proposes the SBPHA approach, which is a combination of replication and the PH algorithm. However, while these methods provide a "randomized" PH algorithm, they still sample a subset of scenarios from a fixed finite collection in each iteration, without dynamically determining the sample size based on the optimization procedure. In contrast, our sampling-based PH algorithm strategically samples a subset of scenario subproblems in each iteration, facilitating dynamic scenario updates to enhance computational efficiency. We also leverage sample complexity analysis [69] to support the choice of the sample size, ensuring a theoretically grounded approach.
5.1.4 Stochastic conjugate gradient method

The dual update of the classic PH algorithm can be considered a stochastic first-order (SFO) method [54, 10]. First-order gradient-based methods dominate the field for several convincing reasons: low computational cost, simplicity of implementation, and strong empirical results. However, SFO methods are not known for their numerical accuracy, especially when the functions are non-smooth, and the classic convergence rate [44] only applies under the assumption that the component functions are smooth.

In contrast to first-order methods, algorithms based on conjugate gradients (CG) provide finite termination when the objective function is a linear-quadratic deterministic function [46]. Furthermore, CG algorithms are also used for large-scale nonsmooth convex optimization problems [73, 77]. However, despite their many strengths (e.g., fast per-iteration convergence, frequent explicit regularization of step size, and better parallelization than first-order methods), the use of conjugate gradient methods for SP is uncommon. Although [31, 75] proposed new stochastic conjugate gradient (SCG) algorithms, their assumptions do not support non-smooth objective functions, and thus they are not directly applicable to our problems.

To solve stochastic optimization problems where the functions are non-smooth (as in (5.2)), one might consider using a stochastic subgradient method. In this chapter, we treat the dual optimization problem in PH as a stochastic linear-quadratic optimization problem and propose leveraging a stochastic conjugate subgradient (SCS) direction, originating from [72, 73, 80, 79], which accommodates both the curvature of a quadratic function and the decomposition of subgradients by data points, as is common in stochastic programming [27, 28]. This combination enhances the power of stochastic first-order approaches without the additional burden of second-order approximations. In other words, our method shares the spirit of Hessian-free (HF) optimization, which has gained attention in the machine learning community [43, 17]. As our theoretical analysis and computational results reveal, the new method provides a "sweet spot" at the intersection of speed, accuracy, and scalability (see Theorems 5.4 and 5.5).
5.1.5 Penalty parameter and line-search algorithm

In the original progressive hedging algorithm, Rockafellar and Wets briefly discussed the effect of the penalty parameter on the performance of the PH algorithm and the quality of the solution, but they did not provide guidance on how to choose it. Since then, various choices have been proposed, often differing depending on the application. Readers interested in the selection of the penalty parameter can refer to [78], which offers a comprehensive literature review of related methods.

Our approach differs from these methods by employing a stochastic direction-finding technique [47, 2] derived from Wolfe's line-search method [72, 73]. Compared with existing approaches, our method is more mathematically motivated and adheres closely to optimization principles [13].
5.2 Sampling-based Progressive Hedging Algorithm

In this section, we focus on solving (5.1) using a sampling-based PH algorithm which combines the classic PH algorithm [55] with adaptive sampling in stochastic programming [8]. The algorithm has six main components:

• Sequential scenario sampling. At iteration $k$, unlike the classic PH algorithm, which solves all scenario subproblems, we sample $|S_k|$ scenarios and solve only the subproblems corresponding to these scenarios. This is similar to using sample average approximation (SAA) to approximate the function $f$ by $f_k$,
$$f_k(x) = \frac{1}{|S_k|}\sum_{s=1}^{|S_k|} h(x,\omega^s), \tag{5.3}$$
where $|S_k|$ will be determined based on concentration inequalities in Theorem 5.2.
• Update of the primal variables. For each scenario $s\in S_k$, as in the classic PH algorithm, we first solve the corresponding augmented Lagrangian problem
$$x_k^s = \operatorname*{argmin}_{x^s\in X}\; h(x^s,\omega^s) + \langle\lambda_k^s,\, x^s - \bar{x}_{k-1}\rangle + \frac{\rho}{2}\|x^s - \bar{x}_{k-1}\|^2. \tag{5.4}$$
Then a feasible solution is
$$\bar{x}_k = \frac{1}{|S_k|}\sum_{s=1}^{|S_k|} x_k^s. \tag{5.5}$$
(A minimal sketch of this sampled update appears below.)
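The sampled primal update (5.4)–(5.5) has the following shape in Julia. This is an illustrative skeleton: `solve_augmented_lagrangian` is a placeholder for a scenario-level solver of (5.4), and sampling with replacement is used for simplicity:

```julia
# Sampled primal update, Eqs. (5.4)-(5.5): solve the augmented-Lagrangian
# subproblem only for the sampled scenarios, then average to obtain x̄ₖ.
function sampled_primal_update(scenarios, λ, x̄_prev, ρ, nₖ,
                               solve_augmented_lagrangian)
    Sₖ = rand(1:length(scenarios), nₖ)          # |Sₖ| sampled scenario indices
    xs = Dict(s => solve_augmented_lagrangian(scenarios[s], λ[s], x̄_prev, ρ)
              for s in Sₖ)
    x̄ = sum(values(xs)) / length(Sₖ)            # Eq. (5.5)
    return Sₖ, xs, x̄
end
```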
• Conjugate subgradient direction finding. The idea here is inspired by Wolfe's non-smooth conjugate subgradient method, which uses the smallest-norm convex combination of the previous search direction $d_{k-1}^s$ and the current subgradient $g_k^s = x_k^s - \bar{x}_k$. More specifically, we first solve the following one-dimensional QP:
$$\gamma_k^s = \operatorname*{argmin}_{\gamma^s\in[0,1]}\;\frac{1}{2}\|\gamma^s d_{k-1}^s + (1-\gamma^s)g_k^s\|^2. \tag{5.6}$$
Then we can set the new direction
$$d_k^s = \gamma_k^s d_{k-1}^s + (1-\gamma_k^s)g_k^s := \mathrm{Nr}(G_k^s),$$
where $G_k^s = \{g_k^s, d_{k-1}^s\}$, as sketched below.
• Update of the dual variables using line-search. We update $\hat{\lambda}_{k+1}^s = \lambda_k^s + \theta_k^s d_k^s$. Define
$$L_k^s(\lambda_k^s + \theta^s d_k^s) := h(\hat{x}_k^s,\omega^s) + \langle\lambda_k^s + \theta^s d_k^s,\, \hat{x}_k^s - \hat{x}_k\rangle + \frac{\rho}{2}\|\hat{x}_k^s - \hat{x}_k\|^2, \tag{5.7}$$
where $\delta_k^s$ is the stochastic searching region, and $\hat{x}_k^s$ and $\hat{x}_k$ are obtained by replacing $\lambda_k^s$ in (5.4) with $\lambda_k^s + \theta d_k^s$ in each scenario and re-solving (5.4) and (5.5), respectively. The step size $\theta_k^s$ is obtained by performing an inexact line-search which makes it satisfy the strong-Wolfe condition [72, 73]. Let $g^s(\theta^s) \in \partial_{\lambda^s} L_k^s(\lambda_k^s + \theta^s\cdot d_k^s)$ and define the intervals
$$L = \{\theta^s > 0 \mid L_k^s(\lambda_k^s + \theta^s d_k^s) - L_k^s(\lambda_k^s) \ge m_1\|d_k^s\|^2\theta^s\},$$
$$R = \Big\{\theta^s > 0 \;\Big|\; \sum_{s=1}^{|S_k|}\langle g^s(\theta^s), d_k^s\rangle \le m_2\|d_k^s\|^2\Big\}, \tag{5.8}$$
where $m_2 < m_1 < 1/2$. The output of the step size satisfies two criteria: (i) the set $L$ includes points which sufficiently increase the dual objective function approximation $L_k^s$, and (ii) the set $R$ includes points for which the directional derivative estimate of $L_k^s$ is sufficiently decreased. The algorithm seeks points that belong to $L\cap R$. The combination of $L$ and $R$ constitutes the strong-Wolfe condition, and it has been shown in [73] (Lemma 1) that $L\cap R$ is not empty. Intuitively, the above procedure is a line-search along the conjugate direction $d_k^s$, and the goal is to (approximately) maximize the dual objective function $L_k^s$.
• Stochastic searching region update. Similar to the standard stochastic trust-region method [2], let $\gamma > 1$, $\eta\in(0,1)$, $\lambda_k := (\lambda_k^1,\ldots,\lambda_k^{|S_k|})$ and $L_k(\lambda_k) := \sum_{s=1}^{|S_k|} L_k^s(\lambda_k^s)$. If
$$L_k(\hat{\lambda}_k) - L_k(\lambda_{k-1}) > \eta\big[L_{k-1}(\hat{\lambda}_k) - L_{k-1}(\lambda_{k-1})\big], \tag{5.9}$$
set $\delta_k = \gamma\delta_{k-1}$ and $\lambda_k = \hat{\lambda}_k$; otherwise, set $\delta_k = \delta_{k-1}/\gamma$ and $\lambda_k = \lambda_{k-1}$. (A sketch of this acceptance test appears below.)
• Termination criteria. Define $d_k := (d_k^1,\ldots,d_k^{|S_k|})$ and $\|d_k\| := \frac{1}{|S_k|}\sum_{s=1}^{|S_k|}\|d_k^s\|$. The algorithm concludes when $\|d_k\| < \varepsilon$ and $\delta_k < \delta_{\min}$. As we will show in Theorem 5.5, a diminutive value of $\|d_k\|$ indicates a small norm of $g_k$, fulfilling the optimality condition for an unconstrained convex optimization problem. Additionally, a threshold $\delta_{\min}$ is established to prevent premature termination in the early iterations. Without this threshold, the algorithm might halt prematurely even if $\|d_k\| < \varepsilon$. However, if $\delta_{\min}$ is introduced, then based on Theorem 5.2, $|S_k|$ should be chosen such that
$$|S_k| \ge -8\log(\varepsilon/2)\cdot\frac{M_1^2}{\kappa^2\delta_k^4},$$
where $\kappa$ is given by the $\kappa$-approximation defined in Definition 5.2. This effectively mitigates any early-stopping concerns with high probability.
Algorithm 9: Sampling-based Progressive Hedging Algorithm
1: Initialize $\bar{x}_0 \in W$, $\lambda_0 = (\lambda_0^1,\ldots,\lambda_0^{|S_0|}) = \vec{0}$, $d_0 = (d_0^1, d_0^2,\ldots,d_0^{|S_0|})$, $\rho > 0$, $\gamma > 1$, $\eta\in(0,1)$, $\delta_0\in[\delta_{\min},\delta_{\max}]$, and $\varepsilon > 0$.
2: while $\|d_k\| > \varepsilon$ do
3:   Randomly draw a scenario set with $|S_k|$ scenarios.
4:   $x_k^s = \operatorname*{argmin}_{x^s\in X}\; h(x^s,\omega^s) + \langle\lambda_k^s,\, x^s - \bar{x}_{k-1}\rangle + \frac{\rho}{2}\|x^s - \bar{x}_{k-1}\|^2$, $\forall s$.
5:   $\bar{x}_k = \frac{1}{|S_k|}\sum_{s=1}^{|S_k|} x_k^s$.
6:   Calculate $g_k^s = x_k^s - \bar{x}_k$, $G_k^s = \{d_{k-1}^s, g_k^s\}$, and $d_k^s = \mathrm{Nr}(G_k^s)$, $\forall s$.
7:   Apply the line-search algorithm to find a step size $\theta_k^s$ which belongs to $L\cap R$.
8:   Update $\hat{\lambda}_k^s = \lambda_{k-1}^s + \theta_k^s d_k^s$, $\forall s$.
9:   if (5.9) is satisfied then
10:    $\lambda_k \leftarrow \hat{\lambda}_k$; $\delta_k \leftarrow \min\{\gamma\delta_{k-1},\delta_{\max}\}$;
11:  else
12:    $\lambda_k \leftarrow \lambda_{k-1}$; $\delta_k \leftarrow \max\{\frac{\delta_{k-1}}{\gamma},\delta_{\min}\}$;
13:  end if
14: end while
5.3 Convergence Properties

5.3.1 Scenario Dual Objective Functions

Definition 5.1. For any $\omega^s$, $\bar{x}$ and $\lambda^s$, define the scenario optimal solution
$$x^s(\lambda^s,\omega^s) = \operatorname*{argmin}_{x\in X}\; h(x,\omega^s) + \langle\lambda^s,\, x-\bar{x}\rangle,$$
and the corresponding scenario dual objective function
$$L^s(\lambda^s;\omega^s,\bar{x}) = h(x^s,\omega^s) + \langle\lambda^s,\, x^s-\bar{x}\rangle.$$
Lemma 5.1. In any iteration of Algorithm 9, we have $\sum_{s=1}^{|S_k|}\lambda_k^s = 0$.

Proof. We show the lemma by induction. When $k = 0$, $\lambda_0^s = 0$ for all $s\in S_0$, so the statement holds. Next, when $k = 1$, we have
$$\lambda_1^s = \rho(x_1^s - \bar{x}_1) = \rho\Big(x_1^s - \frac{1}{|S_1|}\sum_{s=1}^{|S_1|} x_1^s\Big),$$
so that
$$\sum_{s=1}^{|S_1|}\lambda_1^s = \rho\sum_{s=1}^{|S_1|}\Big(x_1^s - \frac{1}{|S_1|}\sum_{s=1}^{|S_1|} x_1^s\Big) = 0.$$
Suppose the claim is true for $k = n$, i.e., $\sum_{s=1}^{|S_n|}\lambda_n^s = 0$. Then, for $k = n+1$,
$$\sum_{s=1}^{|S_{n+1}|}\lambda_{n+1}^s = \sum_{s=1}^{|S_n|}\lambda_n^s + \rho\sum_{s=1}^{|S_{n+1}|}(x_{n+1}^s - \bar{x}_{n+1}) = 0,$$
where the first part is 0 by the induction hypothesis and the second part is 0 by the construction of Algorithm 9. We thus conclude the proof. $\square$
Theorem 5.1. In every iteration of Algorithm 9, we have
$$\frac{1}{|S_k|}\sum_{s=1}^{|S_k|} L_k^s(\lambda_k^s;\omega^s,\bar{x}_k) = \frac{1}{|S_k|}\sum_{s=1}^{|S_k|} L_k^s(\lambda_k^s;\omega^s) := \frac{1}{|S_k|}\sum_{s=1}^{|S_k|}\big[h(x_k^s,\omega^s) + \langle\lambda_k^s,\, x_k^s\rangle\big].$$
Proof.
$$\frac{1}{|S_k|}\sum_{s=1}^{|S_k|} L^s(\lambda^s;\omega^s,\bar{x}_k) = \frac{1}{|S_k|}\sum_{s=1}^{|S_k|}\big[h(x_k^s,\omega^s) + \langle\lambda_k^s,\, x_k^s\rangle\big] - \frac{1}{|S_k|}\sum_{s=1}^{|S_k|}\langle\lambda_k^s,\,\bar{x}_k\rangle,$$
where the last term is 0 by Lemma 5.1. $\square$
5.3.2 Stochastic Local Approximation

Assumption 5.1. The primal feasible region $X$ is bounded; i.e., there exists a constant $C$ such that for any $x\in X$ we have $\|x\| \le C$.

Assumption 5.2. For any $\omega^s$ and $\lambda^s$, $L^s(\lambda^s;\omega^s)$ is bounded; i.e., there exists a constant $M$ such that $|L^s(\lambda^s;\omega^s)| \le M$.

Assumption 5.3. For any $\omega^s$, $L^s(\lambda^s;\omega^s)$ possesses a Lipschitz constant $0 < L < \infty$, and the subdifferentials are non-empty for every $\lambda^s$.

Assumption 5.4. For any $\omega^s$ and $\lambda^s$, $L^s(\lambda^s;\omega^s)$ has a bounded subdifferential set; i.e., for any $g^s\in\partial_{\lambda^s}L^s(\lambda^s;\omega^s)$ we have $\|g^s\| \le G$.
Definition 5.2. A multi-variable function $L_k(\lambda^1,\ldots,\lambda^{|S|})$ is a local $\kappa$-approximation of $L(\lambda^1,\ldots,\lambda^{|S|})$ on $B(\lambda_k^s,\delta_k)$ if, for all $\lambda^s\in B(\lambda_k^s,\delta_k)$,
$$|L(\lambda^1,\ldots,\lambda^{|S|}) - L_k(\lambda^1,\ldots,\lambda^{|S|})| \le \kappa\delta_k^2.$$

Definition 5.3. Let $\Delta_k$ and $\Lambda_k^s$ denote the random counterparts of $\delta_k$ and $\lambda_k^s$, respectively, and let $\mathcal{L}_k$ denote the $\sigma$-algebra generated by $\{L_i\}_{i=0}^{k-1}$. A sequence of random models $\{L_k\}$ is said to be a $\mu_1$-probabilistically $\kappa$-approximation of $L$ with respect to $\{B(\Lambda_k^s,\Delta_k)\}$ if the event
$$I_k \triangleq \{L_k \text{ is a }\kappa\text{-approximation of } L \text{ on } B(\Lambda_k^s,\Delta_k)\}$$
satisfies the condition
$$\Pr(I_k \mid \mathcal{L}_{k-1}) \ge \mu_1 > 0.$$
Lemma 5.2. If one added scenario $s_i\in S_k\setminus S_{k-1}$ changes from $s_i$ to $s_i'$, then for $s\in S_k\setminus\{s_i\}$ we have
$$\|g_k^s - (g_k^s)'\| = O\Big(\frac{1}{|S_k|}\Big), \quad |\gamma_k^s - (\gamma_k^s)'| = O\Big(\frac{1}{|S_k|}\Big), \quad\text{and}\quad \|d_k^s - (d_k^s)'\| = O\Big(\frac{1}{|S_k|}\Big).$$
Proof. First, note that for $s\in S_k\setminus\{s_i\}$,
$$g_k^s = x_k^s - \bar{x}_k = x_k^s - \frac{x_k^1 + \cdots + x_k^{s_i} + \cdots + x_k^{|S_k|}}{|S_k|}, \qquad (g_k^s)' = x_k^s - \bar{x}_k' = x_k^s - \frac{x_k^1 + \cdots + x_k^{s_i'} + \cdots + x_k^{|S_k|}}{|S_k|}.$$
Thus, by Assumption 5.1,
$$\|g_k^s - (g_k^s)'\| = \frac{\|x_k^{s_i} - x_k^{s_i'}\|}{|S_k|} \le \frac{2C}{|S_k|}. \tag{5.10}$$
Also, note that
$$\gamma_k^s = \operatorname*{argmin}_{\gamma^s\in[0,1]}\|\gamma^s d_{k-1}^s + (1-\gamma^s)g_k^s\|^2 = \frac{\langle d_{k-1}^s,\, d_{k-1}^s - g_k^s\rangle}{\|d_{k-1}^s - g_k^s\|^2},$$
$$(\gamma_k^s)' = \operatorname*{argmin}_{\gamma^s\in[0,1]}\|\gamma^s d_{k-1}^s + (1-\gamma^s)(g_k^s)'\|^2 = \frac{\langle d_{k-1}^s,\, d_{k-1}^s - (g_k^s)'\rangle}{\|d_{k-1}^s - (g_k^s)'\|^2}.$$
Let $\Delta g_k^s = (g_k^s)' - g_k^s$. Then
$$\begin{aligned}
\gamma_k^s - (\gamma_k^s)' &= \frac{\langle d_{k-1}^s,\, d_{k-1}^s - g_k^s\rangle\cdot\|d_{k-1}^s - g_k^s - \Delta g_k^s\|^2 - \langle d_{k-1}^s,\, d_{k-1}^s - g_k^s - \Delta g_k^s\rangle\cdot\|d_{k-1}^s - g_k^s\|^2}{\|d_{k-1}^s - g_k^s\|^2\cdot\|d_{k-1}^s - g_k^s - \Delta g_k^s\|^2} \\
&= \frac{\langle\Delta g_k^s,\, d_{k-1}^s - g_k^s\rangle\cdot\langle d_{k-1}^s,\, d_{k-1}^s - g_k^s\rangle + \|\Delta g_k^s\|^2\cdot\langle d_{k-1}^s,\, d_{k-1}^s - g_k^s\rangle - \langle\Delta g_k^s,\, d_{k-1}^s\rangle\cdot\|d_{k-1}^s - g_k^s\|^2}{\|d_{k-1}^s - g_k^s\|^2\cdot\|d_{k-1}^s - g_k^s - \Delta g_k^s\|^2}.
\end{aligned}$$
With Assumption 5.4, Eq. (5.10) and the triangle inequality, we have
$$|\gamma_k^s - (\gamma_k^s)'| = O\Big(\frac{1}{|S_k|}\Big). \tag{5.11}$$
Finally, to bound $\|d_k^s - (d_k^s)'\|$, note that
$$d_k^s = \gamma_k^s g_k^s + (1-\gamma_k^s)d_{k-1}^s, \qquad (d_k^s)' = (\gamma_k^s)'(g_k^s)' + \big(1-(\gamma_k^s)'\big)d_{k-1}^s. \tag{5.12}$$
Let $\Delta\gamma_k^s = (\gamma_k^s)' - \gamma_k^s$. Then
$$\begin{aligned}
d_k^s - (d_k^s)' &= \gamma_k^s g_k^s + (1-\gamma_k^s)d_{k-1}^s - (\gamma_k^s + \Delta\gamma_k^s)(g_k^s + \Delta g_k^s) - (1-\gamma_k^s - \Delta\gamma_k^s)d_{k-1}^s \\
&= -\gamma_k^s\Delta g_k^s - \Delta\gamma_k^s g_k^s - \Delta\gamma_k^s\Delta g_k^s + \Delta\gamma_k^s d_{k-1}^s.
\end{aligned} \tag{5.13}$$
With Assumption 5.4, Eq. (5.10) and Eq. (5.11), we have
$$\|d_k^s - (d_k^s)'\| = O\Big(\frac{1}{|S_k|}\Big). \tag{5.14} \qquad\square$$
Lemma 5.3. If one added scenario $s_i\in S_k\setminus S_{k-1}$ changes from $s_i$ to $s_i'$, then for $s\in S_k\setminus\{s_i\}$ we have
$$\|x_k^s(\theta^s) - (x_k^s)'(\theta^s)\| = O\Big(\frac{1}{|S_k|}\Big), \quad\text{and}\quad \|L_k^s(\lambda_k^s + \theta_k^s d_k^s;\omega^s) - L_k^s(\lambda_k^s + (\theta_k^s)'(d_k^s)';\omega^s)\| = O\Big(\frac{1}{|S_k|}\Big).$$
Proof. For any $\theta^s$, we have
$$x_k^s(\theta^s) = \operatorname*{argmin}_{x^s\in X}\; h(x^s,\omega^s) + \langle\lambda_k^s + \theta^s d_k^s,\, x^s - \bar{x}_k\rangle,$$
$$(x_k^s)'(\theta^s) = \operatorname*{argmin}_{x^s\in X}\; h(x^s,\omega^s) + \langle\lambda_k^s + \theta^s(d_k^s + \Delta d_k^s),\, x^s - \bar{x}_k'\rangle.$$
Note that we have only changed the objective coefficient of $x^s$, and the difference is $\theta^s\cdot\Delta d_k^s$. According to sensitivity analysis in linear programming, $\|\Delta d_k^s\| = O(\frac{1}{|S_k|})$ and $\theta^s\in[\delta_{\min},\delta_{\max}]$, so we have
$$\|x_k^s(\theta^s) - (x_k^s)'(\theta^s)\| = O\big(\theta^s\cdot\|\Delta d_k^s\|\big) = O\Big(\frac{1}{|S_k|}\Big). \tag{5.15}$$
Also, for any $\theta^s$, consider
$$g(\theta^s;\omega^s) := h(x_k^s(\theta^s),\omega^s) + \langle\lambda_k^s + \theta^s d_k^s,\, x_k^s(\theta^s) - \bar{x}_k\rangle,$$
$$g'(\theta^s;\omega^s) := h\big((x_k^s)'(\theta^s),\omega^s\big) + \langle\lambda_k^s + \theta^s(d_k^s)',\, (x_k^s)'(\theta^s) - \bar{x}_k'\rangle.$$
Let $\Delta x_k^s(\theta^s) = x_k^s(\theta^s) - (x_k^s)'(\theta^s)$ and $\Delta\bar{x}_k = \bar{x}_k' - \bar{x}_k$. Then we have
$$\begin{aligned}
g(\theta^s;\omega^s) - g'(\theta^s;\omega^s) ={}& h(x_k^s(\theta^s),\omega^s) + \langle\lambda_k^s + \theta^s d_k^s,\, x_k^s(\theta^s) - \bar{x}_k\rangle \\
&- \Big(h\big((x_k^s)'(\theta^s),\omega^s\big) + \langle\lambda_k^s + \theta^s(d_k^s)',\, (x_k^s)'(\theta^s) - \bar{x}_k'\rangle\Big) \\
={}& h(x_k^s(\theta^s),\omega^s) - h\big(x_k^s(\theta^s) + \Delta x_k^s(\theta^s),\omega^s\big) \\
&- \langle\lambda_k^s,\,\Delta x_k^s(\theta^s)\rangle + \langle\lambda_k^s,\,\Delta\bar{x}_k\rangle - \langle\theta^s d_k^s,\,\Delta x_k^s(\theta^s)\rangle + \langle\theta^s d_k^s,\,\Delta\bar{x}_k\rangle \\
&- \langle\theta^s\Delta d_k^s,\, x_k^s(\theta^s) + \Delta x_k^s(\theta^s) - \bar{x}_k + \Delta\bar{x}_k\rangle.
\end{aligned}$$
With Assumption 5.3, Lemma 5.2 and Eq. (5.15), we have
$$\|g(\theta^s;\omega^s) - g'(\theta^s;\omega^s)\| = O\Big(\frac{1}{|S_k|}\Big). \tag{5.16}$$
Since Eq. (5.16) holds for any $\theta^s$, we have
$$\begin{aligned}
\|L_k^s(\lambda_k^s + \theta_k^s d_k^s;\omega^s) - L_k^s(\lambda_k^s + (\theta_k^s)'(d_k^s)';\omega^s)\| &= \Big\|\max_{\theta^s} g(\theta^s;\omega^s) - \max_{\theta^s} g'(\theta^s;\omega^s)\Big\| \\
&\le \max_{\theta^s}\|g(\theta^s;\omega^s) - g'(\theta^s;\omega^s)\| = O\Big(\frac{1}{|S_k|}\Big). \qquad\square
\end{aligned}$$
Theorem 5.2. Let $L(\lambda^1,\ldots,\lambda^{|S_k|}) := \mathbb{E}_{\{\omega^1,\ldots,\omega^{|S_k|}\}}\big[\frac{1}{|S_k|}\sum_{s=1}^{|S_k|} L_k^s(\lambda^s;\omega^s)\big]$ and $L_k(\lambda^1,\ldots,\lambda^{|S_k|}) := \frac{1}{|S_k|}\sum_{s=1}^{|S_k|} L_k^s(\lambda^s;\omega^s)$. Let $M_1$ be a large constant. For any $0 < \varepsilon < 1$, $\kappa > \frac{4L}{\delta_{\min}}$, and
$$|S_k| \ge -8\log(\varepsilon/2)\cdot\frac{M_1^2}{\kappa^2\delta_k^4}, \tag{5.17}$$
we have
$$\Pr\big[|L(\lambda^1,\ldots,\lambda^{|S_k|}) - L_k(\lambda^1,\ldots,\lambda^{|S_k|})| \le \kappa\delta_k^2,\;\forall\lambda^s\in B(\lambda_k^s,\delta_k)\,\big|\,\mathcal{L}_{k-1}\big] \ge 1-\varepsilon.$$
Proof. For the points $\lambda_k^s$, we first show that, given $\mathcal{L}_{k-1}$, $L_k$ has the bounded difference property, and we then use McDiarmid's inequality to prove the theorem. Since $\lambda_k^s$ depends on $\omega^1,\omega^2,\ldots,\omega^{|S_k|}$, McDiarmid's inequality cannot be applied to $L_k(\lambda_k^1,\ldots,\lambda_k^{|S_k|})$ directly. However, we can express $L_k$ as a function of $\omega^1,\omega^2,\ldots,\omega^{|S_k|}$ and show that $L_k(\omega^1,\omega^2,\ldots,\omega^{|S_k|})$ has the bounded difference property. Specifically, the bounded difference property with respect to $L_k(\omega^1,\omega^2,\ldots,\omega^{|S_k|})$ is written as
$$|L_k(\omega^1,\ldots,\omega^s,\ldots,\omega^{|S_k|}) - L_k(\omega^1,\ldots,\omega^{s'},\ldots,\omega^{|S_k|})| = O\Big(\frac{1}{|S_k|}\Big).$$
We divide the analysis into two cases to bound the changes in $L_k^s(\lambda_k^s;\omega^s)$.
• Case 1: $s\in S_k\setminus\{s_i\}$. Given $\mathcal{L}_{k-1}$, based on Lemma 5.3 we have
$$|L_k^s(\lambda_k^s;\omega^s) - L_k^s((\lambda_k^s)';\omega^s)| = O\Big(\frac{1}{|S_k|}\Big).$$
• Case 2: $s = s_i$. Given $\mathcal{L}_{k-1}$, based on Assumption 5.2 we have
$$|L_k^s(\lambda_k^s;\omega^s) - L_k^s((\lambda_k^s)';\omega^{s'})| \le 2M.$$
Thus,
$$|L_k(\omega^1,\ldots,\omega^s,\ldots,\omega^{|S_k|}) - L_k(\omega^1,\ldots,\omega^{s'},\ldots,\omega^{|S_k|})| \le \frac{\sum_{s\ne s_i}|L_k^s(\lambda_k^s;\omega^s) - L_k^s((\lambda_k^s)';\omega^s)|}{|S_k|} + \frac{|L_k^{s_i}(\lambda_k^{s_i};\omega^{s_i}) - L_k^{s_i}((\lambda_k^{s_i})';\omega^{s_i'})|}{|S_k|} = O\Big(\frac{1}{|S_k|}\Big).$$
Let the bounded difference be specified by $\frac{M_1}{|S_k|}$. Applying McDiarmid's inequality,
$$\Pr\Big(|L(\lambda_k^1,\ldots,\lambda_k^{|S_k|}) - L_k(\lambda_k^1,\ldots,\lambda_k^{|S_k|})| \le \tfrac{1}{2}\kappa\delta_k^2\Big) \ge 1 - 2\exp\Big(-\frac{\kappa^2\delta_k^4|S_k|^2}{8\cdot|S_k|\cdot M_1^2}\Big) \ge 1-\varepsilon,$$
which indicates that when
$$|S_k| \ge -8\log(\varepsilon/2)\cdot\frac{M_1^2}{\kappa^2\delta_k^4},$$
we have
$$\Pr\Big(|L(\lambda_k^1,\ldots,\lambda_k^{|S_k|}) - L_k(\lambda_k^1,\ldots,\lambda_k^{|S_k|})| \le \tfrac{1}{2}\kappa\delta_k^2\Big) \ge 1-\varepsilon.$$
For any other $\lambda^s\in B(\lambda_k^s,\delta_k)$, if $|L(\lambda_k^1,\ldots,\lambda_k^{|S_k|}) - L_k(\lambda_k^1,\ldots,\lambda_k^{|S_k|})| \le \tfrac{1}{2}\kappa\delta_k^2$, then
$$\begin{aligned}
|L(\lambda^1,\ldots,\lambda^{|S_k|}) - L_k(\lambda^1,\ldots,\lambda^{|S_k|})| \le{}& |L(\lambda^1,\ldots,\lambda^{|S_k|}) - L(\lambda_k^1,\ldots,\lambda_k^{|S_k|})| \\
&+ |L(\lambda_k^1,\ldots,\lambda_k^{|S_k|}) - L_k(\lambda_k^1,\ldots,\lambda_k^{|S_k|})| + |L_k(\lambda_k^1,\ldots,\lambda_k^{|S_k|}) - L_k(\lambda^1,\ldots,\lambda^{|S_k|})| \\
\le{}& 2L\delta_k + \tfrac{1}{2}\kappa\delta_k^2 \le \kappa\delta_k^2,
\end{aligned}$$
where the second-to-last inequality follows from Assumption 5.3 ($L_k$ and $L$ are Lipschitz continuous) and the last inequality holds because $\kappa > \frac{4L}{\delta_{\min}} > \frac{4L}{\delta_k}$. Thus, we conclude that if $|S_k|$ satisfies Equation (5.17), then
$$\Pr\big[|L(\lambda^1,\ldots,\lambda^{|S_k|}) - L_k(\lambda^1,\ldots,\lambda^{|S_k|})| \le \kappa\delta_k^2,\;\forall\lambda^s\in B(\lambda_k^s,\delta_k)\big] \ge 1-\varepsilon. \qquad\square$$
5.3.3 Sufficient Increase and Submartingale Property

Definition 5.4. Let $\theta_k\odot d_k := (\theta_k^1 d_k^1,\ldots,\theta_k^{|S_k|}d_k^{|S_k|})$. The value estimates $L_k := L_k(\lambda_k)$ and $L_{k+1} = L_k(\lambda_k + \theta_k\odot d_k)$ are $\varepsilon_F$-accurate estimates of $L(\lambda_k)$ and $L(\lambda_k + \theta_k\odot d_k)$, respectively, for a given $\delta_k$ if and only if
$$|L_k - L(\lambda_k)| \le \varepsilon_F\delta_k^2, \qquad |L_{k+1} - L(\lambda_k + \theta_k\odot d_k)| \le \varepsilon_F\delta_k^2.$$

Definition 5.5. A sequence of model estimates $\{L_k, L_{k+1}\}$ is said to be $\mu_2$-probabilistically $\varepsilon_F$-accurate with respect to $\{\Lambda_k,\Delta_k,S_k\}$ if the event
$$J_k \triangleq \{L_k, L_{k+1} \text{ are } \varepsilon_F\text{-accurate estimates for } \Delta_k\} \tag{5.18}$$
satisfies the condition
$$\Pr(J_k \mid \mathcal{L}_{k-1}) \ge \mu_2 > 0. \tag{5.19}$$
Lemma 5.4. Suppose that $L_k$ is a $\kappa$-approximation of $L$ on $B(\lambda_k^s,\delta_k)$. If
$$\delta_k \le \frac{m_1}{4n\kappa}\|d_k\|_1 \quad\text{and}\quad C_1 = \frac{m_1}{2n}, \tag{5.20}$$
then there exist suitable step sizes $\theta_k^s\in L$ with $\theta_k^s\|d_k^s\| > \frac{\delta_k}{n}$ such that
$$L(\lambda_k + \theta_k\odot d_k) - L(\lambda_k) \ge C_1\|d_k\|_1\delta_k. \tag{5.21}$$
Proof. Since $\theta_k^s\in L$ and $\theta_k^s\|d_k^s\| > \frac{\delta_k}{n}$, for each $L_k^s$,
$$L_k^s(\lambda_k^s + \theta_k^s d_k^s) - L_k^s(\lambda_k^s) \ge m_1\|d_k^s\|^2\theta_k^s \ge \frac{m_1}{n}\|d_k^s\|\delta_k.$$
Thus,
$$L_k(\lambda_k + \theta_k\odot d_k) - L_k(\lambda_k) = \sum_{s=1}^{|S_k|}\big[L_k^s(\lambda_k^s + \theta_k^s d_k^s) - L_k^s(\lambda_k^s)\big] \ge \frac{m_1}{n}\|d_k\|_1\delta_k.$$
Also, since $L_k$ is a $\kappa$-approximation of $L$ on $B(\lambda_k^s,\delta_k)$, we have
$$\begin{aligned}
L(\lambda_k + \theta_k\odot d_k) - L(\lambda_k) ={}& L(\lambda_k + \theta_k\odot d_k) - L_k(\lambda_k + \theta_k\odot d_k) \\
&+ L_k(\lambda_k + \theta_k\odot d_k) - L_k(\lambda_k) + L_k(\lambda_k) - L(\lambda_k) \\
\ge{}& -2\kappa\delta_k^2 + \frac{m_1}{n}\|d_k\|_1\delta_k \ge C_1\|d_k\|_1\delta_k,
\end{aligned} \tag{5.22}$$
which concludes the proof. $\square$
Lemma 5.5. Suppose that $L_k$ is a $\kappa$-approximation of $L$ on the ball $B(\lambda_k^s,\delta_k)$ and the estimates $(L_k, L_{k+1})$ are $\varepsilon_F$-accurate with $\varepsilon_F \le \kappa$. If $\delta_k \le 1$ and
$$\delta_k \le \min\Big\{\frac{(1-\eta_1)C_1\|d_k\|}{2\kappa},\; \frac{\|d_k\|}{\eta_2}\Big\}, \tag{5.23}$$
then suitable step sizes $\theta_k^s\in L$ with $\theta_k^s\|d_k^s\| > \frac{\delta_k}{n}$ make the $k$-th iteration successful.

Proof. Since $\theta_k^s\in L$ and $\theta_k^s\|d_k^s\| > \frac{\delta_k}{n}$, from Lemma 5.4,
$$L_k(\lambda_k + \theta_k\odot d_k) - L_k(\lambda_k) \ge 2C_1\|d_k\|_1\delta_k.$$
Also, $L_k$ being a $\kappa$-approximation of $L$ on $B(\lambda_k^s,\delta_k)$ and the estimates $(L_k, L_{k+1})$ being $\varepsilon_F$-accurate with $\varepsilon_F\le\kappa$ imply that
$$\rho_k = \frac{L_k - L_{k-1}}{L_{k+1} - L_k} = \frac{L_k - L(\lambda_k + \theta_k\odot d_k)}{L_{k+1} - L_k} + \frac{L(\lambda_k + \theta_k\odot d_k) - L_{k+1}}{L_{k+1} - L_k} + \frac{L_{k+1} - L_k}{L_{k+1} - L_k} + \frac{L_k - L(\lambda_k)}{L_{k+1} - L_k} + \frac{L(\lambda_k) - L_{k-1}}{L_{k+1} - L_k},$$
which indicates that
$$1 - \rho_k \le \frac{2\kappa\delta_k}{C_1\|d_k\|_1} \le 1 - \eta_1.$$
Thus, we have $\rho_k \ge \eta_1$ and $\|d_k\|_1 > \eta_2\delta_k$, so the iteration is successful. $\square$
Lemma 5.6. Suppose the function value estimates $\{(L_k, L_{k+1})\}$ are $\varepsilon_F$-accurate and
$$\varepsilon_F < \eta_1\eta_2 C_1.$$
If the $k$-th iteration is successful, then the improvement in $L$ is bounded below as
$$L(\lambda_k + \theta_k\odot d_k) - L(\lambda_k) \ge C_2\delta_k^2,$$
where $C_2 \triangleq 2\eta_1\eta_2 C_1 - 2\varepsilon_F$.

Proof. Since $\theta_k^s\in L$ and $\theta_k^s\|d_k^s\| > \frac{\delta_k}{n}$, from Lemma 5.4,
$$L_k(\lambda_k + \theta_k\odot d_k) - L_k(\lambda_k) \ge 2C_1\|d_k\|_1\delta_k.$$
Also, since the iteration is successful, we have $\|d_k\|_1 > \eta_2\delta_k$. Thus,
$$L_k - L_{k+1} \ge \eta_1(L_k - L_{k-1}) \ge 2\eta_1 C_1\|d_k\|\delta_k \ge 2\eta_1\eta_2 C_1\delta_k^2.$$
Then, since the estimates are $\varepsilon_F$-accurate, the improvement in $L$ can be bounded as
$$L(\lambda_k + \theta_k\odot d_k) - L(\lambda_k) = L(\lambda_k + \theta_k\odot d_k) - L_{k+1} + L_{k+1} - L_k + L_k - L(\lambda_k) \ge C_2\delta_k^2. \qquad\square$$
Theorem 5.3. Let the random variable $V_k \triangleq L_{k+1} - L_k$, with corresponding realization $v_k$, and let
$$\zeta = \max\Big\{4n\kappa,\; \eta_2,\; \frac{4\kappa}{C_1(1-\eta_1)}\Big\} \quad\text{and}\quad \mu_1\mu_2 > \frac{1}{2}. \tag{5.24}$$
Then for any $\varepsilon > 0$ such that $\|d_k\|_1 \ge \varepsilon$ and $\zeta\Delta_k \le \varepsilon$, we have
$$\mathbb{E}\big[V_k \mid \|d_k\|_1 \ge \varepsilon,\; \zeta\Delta_k \le \varepsilon\big] \ge \frac{1}{2}C_1\|d_k\|_1\Delta_k \ge \Theta\varepsilon\Delta_k,$$
where $\Theta = \frac{1}{2}C_1$.
Proof. First of all, if the $k$-th iteration is successful, i.e., $\lambda_{k+1} = \lambda_k + \theta_k\odot d_k$, we have
$$v_k = L(\lambda_k + \theta_k\odot d_k) - L(\lambda_k). \tag{5.25}$$
If the $k$-th iteration is unsuccessful, i.e., $\lambda_{k+1} = \lambda_k$, we have
$$v_k = L(\lambda_k) - L(\lambda_k) = 0. \tag{5.26}$$
We then divide the analysis into four cases according to the states (true/false) observed for the pair $(I_k, J_k)$.
(a) $I_k$ and $J_k$ are both true. Since $L_k$ is a $\kappa$-approximation of $L$ on $B(\lambda_k^s,\delta_k)$ and condition (5.20) is satisfied, Lemma 5.4 holds. Also, since the estimates $(L_k, L_{k+1})$ are $\varepsilon_F$-accurate and condition (5.23) is satisfied, Lemma 5.5 holds. Combining (5.21) with (5.25), we have
$$v_k \ge C_1\|d_k\|\delta_k \triangleq b_1. \tag{5.27}$$
(b) $I_k$ is true but $J_k$ is false. Since $L_k$ is a $\kappa$-approximation of $L$ on $B(\lambda_k^s,\delta_k)$ and condition (5.20) is satisfied, Lemma 5.4 still holds. If the iteration is successful, we have (5.27); otherwise we have (5.26). Thus, $v_k \ge 0$.
(c) $I_k$ is false but $J_k$ is true. If the iteration is successful, then since the estimates $(L_k, L_{k+1})$ are $\varepsilon_F$-accurate and the condition $\varepsilon_F < \eta_1\eta_2 C_1$ of Lemma 5.6 is satisfied, Lemma 5.6 holds. Hence,
$$v_k \ge C_2\delta_k^2.$$
If the iteration is unsuccessful, we have (5.26). Thus, $v_k \ge 0$ whether or not the iteration is successful.
(d) $I_k$ and $J_k$ are both false. Since $L$ is convex and $\theta_k^s\|d_k^s\| < \delta_k$, for any $g(\theta_k)\in\partial L(\lambda_k + \theta_k\odot d_k)$, Assumption 5.4 gives $\|g(\theta_k)\| \le G$, so
$$v_k = L(\lambda_k + \theta_k\odot d_k) - L(\lambda_k) \ge -g(\theta_k)^\top(\theta_k\odot d_k) \ge -G\delta_k.$$
If the iteration is successful, then $v_k \ge -G\delta_k \triangleq b_2$. If the iteration is unsuccessful, we have (5.26). Thus, $v_k \ge b_2$ whether or not the iteration is successful.
With the above four cases, we can bound $\mathbb{E}[V_k \mid \|d_k\|_1 \ge \varepsilon,\, \zeta\Delta_k \le \varepsilon]$ based on the different outcomes of $I_k$ and $J_k$. Let $B_1$ and $B_2$ be the random counterparts of $b_1$ and $b_2$. Then we have
$$\mathbb{E}\big[V_k \mid \|d_k\|_1 \ge \varepsilon,\, \zeta\Delta_k \le \varepsilon\big] \ge \mu_1\mu_2 B_1 + \big(\mu_1(1-\mu_2) + \mu_2(1-\mu_1)\big)\cdot 0 + (1-\mu_1)(1-\mu_2)B_2 = \mu_1\mu_2\,C_1\|d_k\|\Delta_k - (1-\mu_1)(1-\mu_2)G\Delta_k.$$
Choosing $\mu_1\in(1/2,1)$ and $\mu_2\in(1/2,1)$ large enough that
$$\mu_1\mu_2 \ge \frac{1}{2} \quad\text{and}\quad \frac{\mu_1\mu_2}{(1-\mu_1)(1-\mu_2)} \ge \frac{2G}{C_1\|d_k\|},$$
we obtain
$$\mathbb{E}\big[V_k \mid \|d_k\|_1 \ge \varepsilon,\, \zeta\Delta_k \le \varepsilon\big] \ge \frac{1}{2}C_1\|d_k\|\Delta_k. \qquad\square$$
A quick observation of the concluding condition reveals that the requirement $\Delta_k \le \frac{\varepsilon}{\zeta}$ implies that the submartingale property holds. This prompts us to impose the condition $\delta_{\max} = \frac{\varepsilon}{\zeta}$. This property is formalized in the following corollary.
Corollary 5.1. Let
$$T_\varepsilon = \inf\{k \ge 0 : \|d_k\|_1 < \varepsilon\}.$$
Then $T_\varepsilon$ is a stopping time for the stochastic process $\Lambda_k$. Moreover, conditioned on $T_\varepsilon \ge k$, $L_k$ is a submartingale.

Proof. From Theorem 5.3, we have
$$\mathbb{E}[L_k \mid \mathcal{L}_{k-1},\, T_\varepsilon > k] \ge L_{k-1} + \Theta\varepsilon\Delta_k. \tag{5.28}$$
Hence, $L_k$ is a submartingale. $\square$
5.3.4 Convergence Rate

Building upon the results established in Theorem 5.3, where $L_k$ is shown to be a submartingale and $T_\varepsilon$ is identified as a stopping time, we proceed to construct a renewal reward process for analyzing a bound on the expected value of $T_\varepsilon$. Theorem 5.4 confirms the rate of convergence to be $O(1/\varepsilon^2)$. To begin, define the renewal process $\{A_l\}$ as follows: set $A_0 = 0$ and, for each $l > 0$, define $A_l = \inf\{m > A_{l-1} : \zeta\Delta_m \ge \varepsilon\}$, with $\zeta$ specified in (5.24). Additionally, define the inter-arrival times $\tau_l = A_l - A_{l-1}$. Lastly, introduce the counting process $N(k) = \max\{l : A_l \le k\}$, representing the number of renewals occurring up to the $k$-th iteration.
Lemma 5.7. Let $\frac{1}{2} < p = \mu_1\mu_2 \le \Pr(I_k\cap J_k)$. Then for all $l \ge 1$,
$$\mathbb{E}[\tau_l] \le \frac{p}{2p-1}.$$
Proof. First,
$$\mathbb{E}[\tau_l] = \mathbb{E}[\tau_l \mid \zeta\Delta_{A_{l-1}} > \varepsilon]\cdot\Pr(\zeta\Delta_{A_{l-1}} > \varepsilon) + \mathbb{E}[\tau_l \mid \zeta\Delta_{A_{l-1}} = \varepsilon]\cdot\Pr(\zeta\Delta_{A_{l-1}} = \varepsilon) \le \max\big\{\mathbb{E}[\tau_l \mid \zeta\Delta_{A_{l-1}} > \varepsilon],\; \mathbb{E}[\tau_l \mid \zeta\Delta_{A_{l-1}} = \varepsilon]\big\}.$$
If $\zeta\Delta_{A_{l-1}} > \varepsilon$, then according to Algorithm 9,
$$\mathbb{E}[\tau_l \mid \zeta\Delta_{A_{l-1}} > \varepsilon] = 1. \tag{5.29}$$
If $\zeta\Delta_{A_{l-1}} = \varepsilon$, then by Theorem 5.3 we can treat $\{\Delta_{A_{l-1}},\ldots,\Delta_{A_l}\}$ as a random walk, and we have
$$\mathbb{E}[\tau_l \mid \zeta\Delta_{A_{l-1}} = \varepsilon] \le \frac{p}{2p-1}. \tag{5.30}$$
Combining (5.29) and (5.30) completes the proof. $\square$
Lemma 5.8. Let $\zeta$ and $\Theta$ be as in Theorem 5.3 and $\Delta_{\max} = \frac{\varepsilon}{\zeta}$. Then
$$\mathbb{E}[N(T_\varepsilon)] \le \frac{2\zeta F_{\max}}{\Theta\varepsilon^2} + \frac{\zeta\Delta_{\max}}{\varepsilon} = \frac{2\zeta F_{\max}}{\Theta\varepsilon^2} + 1,$$
where $F_{\max} = \max_{i\le(k\wedge T_\varepsilon)}|L_i|$.

Proof. We first show that
$$R_{k\wedge T_\varepsilon} = L_{k\wedge T_\varepsilon} - \Theta\varepsilon\sum_{j=0}^{k\wedge T_\varepsilon}\Delta_j \tag{5.31}$$
is a submartingale. Using (5.28),
$$\begin{aligned}
\mathbb{E}[R_{k\wedge T_\varepsilon}\mid\mathcal{L}_{k-1}] &= \mathbb{E}[L_{k\wedge T_\varepsilon}\mid\mathcal{L}_{k-1}] - \mathbb{E}\Big[\Theta\varepsilon\sum_{j=0}^{k\wedge T_\varepsilon}\Delta_j\,\Big|\,\mathcal{L}_{k-1}\Big] \\
&\ge L_{k-1} + \Theta\varepsilon\Delta_k - \mathbb{E}\Big[\Theta\varepsilon\sum_{j=0}^{k\wedge T_\varepsilon}\Delta_j\,\Big|\,\mathcal{L}_{k-1}\Big] \\
&= L_{k-1} - \Theta\varepsilon\sum_{j=0}^{(k-1)\wedge T_\varepsilon}\Delta_j = R_{(k-1)\wedge T_\varepsilon},
\end{aligned} \tag{5.32}$$
where the last equality in (5.32) holds by moving $\Theta\varepsilon\Delta_k$ inside the summation, so that the summation has one less term if $k < T_\varepsilon$.
If $k < T_\varepsilon$, then $|R_{k\wedge T_\varepsilon}| = |R_k| \le F_{\max} + \Theta\varepsilon k\Delta_{\max}$; if $k \ge T_\varepsilon$, then $|R_{k\wedge T_\varepsilon}| = |R_{T_\varepsilon}| \le F_{\max} + \Theta\varepsilon T_\varepsilon\Delta_{\max}$, which is also bounded almost surely since $T_\varepsilon$ is bounded almost surely. Hence, by (5.31) and the optional stopping theorem [26], we have
$$\mathbb{E}\Big[\Theta\varepsilon\sum_{j=0}^{T_\varepsilon}\Delta_j\Big] \le \mathbb{E}[L_{T_\varepsilon}] - \mathbb{E}[R_0] \le 2F_{\max} + \Theta\varepsilon\Delta_{\max}. \tag{5.33}$$
Furthermore, since a renewal $A_l$ occurs only when $\zeta\Delta_j \ge \varepsilon$, and the renewal indices up to $T_\varepsilon$ form a subset of $\{0,1,2,\ldots,T_\varepsilon\}$, we have
$$\Theta\varepsilon\sum_{j=0}^{T_\varepsilon}\zeta\Delta_j \ge \Theta\varepsilon N(T_\varepsilon)\varepsilon. \tag{5.34}$$
Combining (5.33) and (5.34), we have
$$\mathbb{E}[N(T_\varepsilon)] \le \frac{2\zeta F_{\max} + \zeta\Theta\varepsilon\Delta_{\max}}{\Theta\varepsilon^2} \le \frac{2\zeta F_{\max}}{\Theta\varepsilon^2} + \frac{\zeta\Delta_{\max}}{\varepsilon}. \qquad\square$$
Theorem 5.4. Under the conditions enunciated in §5.3, we have
$$\mathbb{E}[T_\varepsilon] \le \frac{p}{2p-1}\Big(\frac{2\zeta F_{\max}}{\Theta\varepsilon^2} + 2\Big). \tag{5.35}$$
Proof. First, note that $N(T_\varepsilon) + 1$ is a stopping time for the renewal process $\{A_l : l \ge 0\}$. Thus, using Wald's equation (inequality form) [26], we have
$$\mathbb{E}[A_{N(T_\varepsilon)+1}] \le \mathbb{E}[\tau_1]\,\mathbb{E}[N(T_\varepsilon)+1].$$
Moreover, since $A_{N(T_\varepsilon)+1} \ge T_\varepsilon$, we have
$$\mathbb{E}[T_\varepsilon] \le \mathbb{E}[\tau_1]\,\mathbb{E}[N(T_\varepsilon)+1].$$
Hence, by Lemma 5.7 and Lemma 5.8,
$$\mathbb{E}[T_\varepsilon] \le \frac{p}{2p-1}\Big(\frac{2\zeta F_{\max}}{\Theta\varepsilon^2} + 2\Big). \qquad\square$$
5.3.5 Optimality Condition

Lemma 5.9. If $k$ is the smallest index for which $\|d_k\| < \varepsilon$, then we have $\|g_k\| < 4\varepsilon$.

Proof. Suppose the claim is false, so that $\|g_k\| \ge 4\varepsilon$. Thus,
$$\begin{aligned}
\|d_k\|^2 &= \|\lambda^* g_k - (1-\lambda^*)d_{k-1}\|^2 \\
&= (\lambda^*)^2\|g_k\|^2 + (1-\lambda^*)^2\|d_{k-1}\|^2 - 2\lambda^*(1-\lambda^*)\langle g_k, d_{k-1}\rangle \\
&\ge (\lambda^*)^2\|g_k\|^2 + (1-\lambda^*)^2\|d_{k-1}\|^2 + 2m_2\lambda^*(1-\lambda^*)\|d_{k-1}\|^2 \\
&\ge (\lambda^*)^2\|g_k\|^2 + (1-\lambda^*)\Big(1-\frac{\lambda^*}{2}\Big)\|d_{k-1}\|^2 \\
&\ge \Big[1 - \frac{3}{2}\lambda^* + \frac{33}{2}(\lambda^*)^2\Big]\varepsilon^2.
\end{aligned}$$
Note that $\lambda^* \ge \frac{3}{33} = \frac{1}{11}$, because it suffices to show that
$$\lambda^* = \frac{\langle g_k, -d_{k-1}\rangle + \|g_k\|^2}{\|d_{k-1}\|^2 + \|g_k\|^2 + 2\langle g_k, -d_{k-1}\rangle} \ge \frac{m_2\|d_{k-1}\|^2 + \|g_k\|^2}{(1+2m_1)\|d_{k-1}\|^2 + \|g_k\|^2} \ge \frac{1}{11},$$
which is equivalent to
$$11m_2\|d_{k-1}\|^2 + 11\|g_k\|^2 \ge (1+2m_1)\|d_{k-1}\|^2 + \|g_k\|^2.$$
This can be verified by setting $m_1 = \frac{1}{2}$ and $m_2 = \frac{1}{4}$. Thus, based on the display above and $\lambda^* \ge \frac{1}{11}$, we have $\|d_k\| \ge \varepsilon$, which contradicts the assumptions of the lemma. $\square$
Theorem 5.5. If $k$ is the smallest index for which $\|d_k\| < \varepsilon$, then
$$|f(x^*) - f(\bar{x}_k)| = O(\varepsilon).$$
Proof. Consider the difference between the primal objective value
$$f_k(x_k) = \sum_{s=1}^{|S_k|} h(x_k^s,\omega^s)$$
and the dual objective value
$$L_k(x_k,\lambda_k) = \sum_{s=1}^{|S_k|}\Big[h(x_k^s,\omega^s) + \langle\lambda_k^s,\, x_k^s - \bar{x}_k\rangle + \frac{\rho}{2}\|x_k^s - \bar{x}_k\|^2\Big].$$
Since $\|d_k\| < \varepsilon$, from Lemma 5.9 we have $\|g_k\| = \|x_k - \bar{x}_k\| \le 4\varepsilon$. Thus, we have
$$L_k(x_k,\lambda_k) - f_k(x_k) = \sum_{s=1}^{|S_k|}\Big[\langle\lambda_k^s,\, x_k^s - \bar{x}_k\rangle + \frac{\rho}{2}\|x_k^s - \bar{x}_k\|^2\Big] = O(\varepsilon). \tag{5.36}$$
Let $x_k^* = \operatorname*{argmin}_x f_k(x)$. From the duality theorem, we have
$$L_k(x_k,\lambda_k) \le f_k(x_k^*) \le f_k(x_k). \tag{5.37}$$
Together, (5.36) and (5.37) imply that
$$|f_k(x_k^*) - f_k(x_k)| = O(\varepsilon). \tag{5.38}$$
On the other hand, based on Assumption 5.4, for every scenario $s$,
$$|h(x_k^s,\omega^s) - h(\bar{x}_k,\omega^s)| \le G\cdot\|x_k^s - \bar{x}_k\|.$$
Thus, we have
$$|f_k(x_k) - f_k(\bar{x}_k)| = O(\varepsilon). \tag{5.39}$$
Eq. (5.38) and Eq. (5.39) imply that
$$|f_k(x_k^*) - f_k(\bar{x}_k)| = O(\varepsilon). \tag{5.40}$$
Also, let $x^* = \operatorname*{argmin}_x f(x)$; we show that
$$|f(x^*) - f(x_k^*)| = O(\varepsilon). \tag{5.41}$$
First, note that we can use the same concentration inequality as in Theorem 5.2 to show that, with probability at least $(1-\alpha)^2$,
$$|f_k(x^*) - f(x^*)| \le \varepsilon \quad\text{and}\quad |f_k(x_k^*) - f(x_k^*)| \le \varepsilon. \tag{5.42}$$
Thus,
$$f_k(x^*) \le f(x^*) + \varepsilon \quad\text{and}\quad f(x_k^*) \le f_k(x_k^*) + \varepsilon.$$
Based on the definitions of $x^*$ and $x_k^*$, we have
$$f(x_k^*) \le f_k(x_k^*) + \varepsilon \le f_k(x^*) + \varepsilon \le f(x^*) + 2\varepsilon,$$
which proves Eq. (5.41). Finally, we conclude the proof by combining Eq. (5.40) and Eq. (5.41). $\square$
5.4 Preliminary Computational Results

Our computational experiments are based on data sets available at the USC 3D Lab¹. Specifically, these problems are two-stage stochastic linear programming problems, and we have tested LandS3, pgp2, 20-term, and baa99-20. The study considers three different methods: the classic PH algorithm [55], randomized PH [3], and the sampling-based PH algorithm. We track the increase in dual objective values with respect to the number of QPs solved for all algorithms. The figures also include the 95% upper and lower bounds of the objective value estimates obtained from stochastic decomposition [27, 28] as a benchmark. The algorithms were implemented on a MacBook Pro 2023 with a 2.6GHz 12-core Apple M3 Pro processor and 32GB of 2400MHz DDR4 onboard memory. The code used in this research is written in Julia and is available at the following GitHub repository: Progressive Hedging SCS (accessed on August 24, 2024). The computational results are shown in Figure 5.1.

¹https://github.com/USC3DLAB/SD
5.5 Conclusion

We propose a novel sampling-based PH algorithm for large-scale stochastic optimization problems. Compared with the classic PH algorithm, the proposed approach (a) does not require knowledge of the explicit form of the distribution of the random variables; (b) allows dynamic updates of the sample size; (c) uses the conjugate subgradient direction and a line-search to update the dual variables; and (d) implicitly updates the penalty parameter based on optimization principles. The illustrative experimental results show that it has good practical performance.
Figure 5.1: Dual objective values for different combinations (data, algorithm). Panels: (a) LandS3, (b) pgp2, (c) 20-term, (d) baa99-20.
Bibliography
[1] Nezir Aydin. Sampling based progressive hedging algorithms for stochastic programming
problems. 2012.
[2] Afonso S Bandeira, Katya Scheinberg, and Luis Nunes Vicente. Convergence of trust-region
methods based on probabilistic models. SIAM Journal on Optimization, 24(3):1238–1264,
2014.
[3] Gilles Bareilles, Yassine Laguel, Dmitry Grishchenko, Franck Iutzeler, and Jérôme Malick. Randomized progressive hedging methods for multi-stage stochastic programming. Annals of Operations Research, 295(2):535–560, 2020.
[4] Richard E Bellman and Stuart E Dreyfus. Applied dynamic programming, volume 2050.
Princeton university press, 2015.
[5] Aharon Ben-Tal and Arkadi Nemirovski. Robust convex optimization. Mathematics of operations research, 23(4):769–805, 1998.
[6] Dimitris Bertsimas, David B Brown, and Constantine Caramanis. Theory and applications of
robust optimization. SIAM review, 53(3):464–501, 2011.
[7] John R Birge and Francois Louveaux. Introduction to Stochastic Programming. Springer
Science & Business Media, 2011.
[8] John R Birge and Francois Louveaux. Introduction to stochastic programming. Springer
Science & Business Media, 2011.
[9] Jose Blanchet, Coralia Cartis, Matt Menickelly, and Katya Scheinberg. Convergence rate
analysis of a stochastic trust-region method via supermartingales. INFORMS journal on
optimization, 1(2):92–119, 2019.
[10] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. Advances in Neural Information Processing Systems, 20, 2007.
[11] Léon Bottou and Olivier Bousquet. The tradeoffs of large-scale learning. Optimization for Machine Learning, page 351, 2011.
[12] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed
optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011.
[13] Stephen P Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university
press, 2004.
[14] Coralia Cartis and Katya Scheinberg. Global convergence rate analysis of unconstrained
optimization methods based on probabilistic models. Mathematical Programming, 169:337–
375, 2018.
[15] Abraham Charnes and William W Cooper. Chance-constrained programming. Management
science, 6(1):73–79, 1959.
[16] Ruobing Chen, Matt Menickelly, and Katya Scheinberg. Stochastic optimization using a
trust-region method and random models. Mathematical Programming, 169:447–487, 2018.
[17] Frank E Curtis and Katya Scheinberg. Optimization methods for supervised machine learning: From linear models to deep learning. In Leading Developments from INFORMS Communities, pages 89–114. INFORMS, 2017.
[18] George B Dantzig. Linear programming under uncertainty. Management science, 1(3-
4):197–206, 1955.
[19] Damek Davis, Dmitriy Drusvyatskiy, and Liwei Jiang. Asymptotic normality and optimality
in nonsmooth stochastic approximation. arXiv preprint arXiv:2301.06632, 2023.
[20] Erick Delage and Yinyu Ye. Distributionally robust optimization under moment uncertainty
with application to data-driven problems. Operations research, 58(3):595–612, 2010.
[21] Shuotao Diao and Suvrajeet Sen. A reliability theory of compromise decisions for large-scale
stochastic programs. arXiv preprint arXiv:2405.10414, 2024.
[22] Mahyar Fazlyab, Alexander Robey, Hamed Hassani, Manfred Morari, and George Pappas.
Efficient and accurate estimation of lipschitz constants for deep neural networks. Advances
in Neural Information Processing Systems, 32, 2019.
[23] Andrew Frank. Uci machine learning repository. http://archive. ics. uci. edu/ml, 2010.
[24] Harsha Gangammanavar and Suvrajeet Sen. Stochastic dynamic linear programming: A
sequential sampling algorithm for multistage stochastic linear programming. SIAM Journal
on Optimization, 31(3):2111–2140, 2021.
[25] Harsha Gangammanavar, Suvrajeet Sen, and Victor M Zavala. Stochastic optimization of
sub-hourly economic dispatch with wind energy. IEEE Transactions on Power Systems,
31(2):949–959, 2015.
[26] Geoffrey Grimmett and David Stirzaker. Probability and random processes. Oxford university press, 2020.
[27] Julia L Higle and Suvrajeet Sen. Stochastic decomposition: An algorithm for two-stage linear
programs with recourse. Mathematics of operations research, 16(3):650–669, 1991.
[28] Julia L Higle and Suvrajeet Sen. Finite master programs in regularized stochastic decomposition. Mathematical Programming, 67(1):143–168, 1994.
[29] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. In The
collected works of Wassily Hoeffding, pages 409–426. Springer, 1994.
[30] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S Sathiya Keerthi, and Sellamanickam Sundararajan. A dual coordinate descent method for large-scale linear svm. In Proceedings of
the 25th international conference on Machine learning, pages 408–415, 2008.
[31] Xiao-Bo Jin, Xu-Yao Zhang, Kaizhu Huang, and Guang-Gang Geng. Stochastic conjugate
gradient algorithm with variance reduction. IEEE Transactions on Neural Networks and
Learning Systems, 30(5):1360–1369, 2018.
[32] Thorsten Joachims. Making large-scale support vector machine learning practical, advances
in kernel methods. Support vector learning, 1999.
[33] George Kimeldorf and Grace Wahba. Some results on tchebycheffian spline functions. Journal of mathematical analysis and applications, 33(1):82–95, 1971.
[34] Igor V Konnov. A combined relaxation method for variational inequalities with nonlinear
constraints. Mathematical Programming, 80(2):239–252, 1998.
[35] Guanghui Lan. Stochastic gradient descent. In Encyclopedia of Optimization, pages 1–3.
Springer, 2023.
[36] Junyi Liu, Guangyu Li, and Suvrajeet Sen. Coupled learning enabled stochastic programming
with endogenous uncertainty. Mathematics of Operations Research, 47(2):1681–1705, 2022.
[37] Junyi Liu and Suvrajeet Sen. Asymptotic results of stochastic decomposition for two-stage
stochastic quadratic programming. SIAM Journal on Optimization, 30(1):823–852, 2020.
[38] Junyi Liu and Suvrajeet Sen. Asymptotic results of stochastic decomposition for two-stage
stochastic quadratic programming. SIAM Journal on Optimization, 30(1):823–852, 2020.
[39] David G Luenberger, Yinyu Ye, et al. Linear and nonlinear programming, volume 2.
Springer, 1984.
[40] Xiyuan Ma, Roberto Rossi, and Thomas Archibald. Stochastic inventory control: A literature
review. IFAC-PapersOnLine, 52(13):1490–1495, 2019.
[41] Wai-Kei Mak, David P Morton, and R Kevin Wood. Monte carlo bounding techniques for
determining solution quality in stochastic programs. Operations research letters, 24(1-2):47–
56, 1999.
[42] Olvi L Mangasarian and David R Musicant. Successive overrelaxation for support vector
machines. IEEE Transactions on Neural Networks, 10(5):1032–1037, 1999.
[43] James Martens et al. Deep learning via hessian-free optimization. In ICML, volume 27, pages
735–742, 2010.
[44] Angelia Nedić and Dimitri Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic Optimization: Algorithms and Applications, pages 223–264. Springer, 2001.
[45] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust
stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009.
[46] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer Science & Business
Media, 2006.
[47] Courtney Paquette and Katya Scheinberg. A stochastic line search method with expected
complexity analysis. SIAM Journal on Optimization, 30(1):349–376, 2020.
[48] Raghu Pasupathy. On choosing parameters in retrospective-approximation algorithms for
stochastic root finding and simulation optimization. Operations Research, 58(4-part-1):889–
901, 2010.
[49] Mario VF Pereira and Leontina MVG Pinto. Multi-stage stochastic optimization applied to
energy planning. Mathematical programming, 52:359–375, 1991.
[50] John Platt. Sequential minimal optimization: A fast algorithm for training support vector
machines. 1998.
[51] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30(4):838–855, 1992.
[52] Michael JD Powell. Nonconvex minimization calculations and the conjugate gradient
method. In Numerical Analysis: Proceedings of the 10th Biennial Conference held at Dundee,
Scotland, June 28–July 1, 1983, pages 122–141. Springer, 1984.
[53] Hamed Rahimian and Sanjay Mehrotra. Distributionally robust optimization: A review. arXiv
preprint arXiv:1908.05659, 2019.
[54] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of
mathematical statistics, pages 400–407, 1951.
[55] R Tyrrell Rockafellar and Roger J-B Wets. Scenarios and policy aggregation in optimization
under uncertainty. Mathematics of operations research, 16(1):119–147, 1991.
[56] R Tyrrell Rockafellar and Roger J-B Wets. Scenarios and policy aggregation in optimization
under uncertainty. Mathematics of operations research, 16(1):119–147, 1991.
[57] Andrzej Ruszczynski and Alexander Shapiro. Stochastic programming models. ´ Handbooks
in operations research and management science, 10:1–64, 2003.
[58] Bernhard Scholkopf and Alexander J Smola. ¨ Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.
112
[59] Suvrajeet Sen and Yifan Liu. Mitigating uncertainty via compromise decisions in two-stage
stochastic linear programming: Variance reduction. Operations Research, 64(6):1422–1437,
2016.
[60] Suvrajeet Sen and Hanif D. Sherali. A class of convergent primal-dual subgradient algorithms
for decomposable convex programs. Mathematical Programming, 35:279–297, 1986.
[61] Suvrajeet Sen, Lihua Yu, and Talat Genc. A stochastic programming approach to power
portfolio optimization. Operations Research, 54(1):55–72, 2006.
[62] Suvrajeet Sen and Zhihong Zhou. Multistage stochastic decomposition: a bridge between
stochastic programming and approximate dynamic programming. SIAM Journal on Optimization, 24(1):127–153, 2014.
[63] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal
estimated sub-gradient solver for svm. Mathematical programming, 127(1):3–30, 2011.
[64] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(2), 2013.
[65] Alexander Shapiro. Monte carlo sampling methods. Handbooks in operations research and
management science, 10:353–425, 2003.
[66] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczynski. Lectures on stochastic
programming: modeling and theory. SIAM, 2021.
[67] Alexander Shapiro and Tito Homem-de Mello. A simulation-based approach to two-stage
stochastic programming with recourse. Mathematical Programming, 81(3):301–325, 1998.
[68] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press,
2018.
[69] Vladimir Naumovich Vapnik, Vlamimir Vapnik, et al. Statistical learning theory. 1998.
[70] Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. 1989.
[71] Jean-Paul Watson and David L Woodruff. Progressive hedging innovations for a class of
stochastic mixed-integer resource allocation problems. Computational Management Science,
8:355–370, 2011.
[72] Philip Wolfe. Note on a method of conjugate subgradients for minimizing nondifferentiable
functions. Mathematical Programming, 7:380–383, 1974.
[73] Philip Wolfe. A method of conjugate subgradients for minimizing nondifferentiable functions. In Nondifferentiable optimization, pages 145–173. Springer, 1975.
[74] GR Wood and BP Zhang. Estimation of the lipschitz constant of a function. Journal of Global
Optimization, 8:91–103, 1996.
113
[75] Zhuang Yang. Adaptive stochastic conjugate gradient for machine learning. Expert Systems
with Applications, page 117719, 2022.
[76] Farzad Yousefian, Angelia Nedic, and Uday V Shanbhag. On stochastic and deterministic
quasi-newton methods for nonstrongly convex optimization: Asymptotic convergence and
rate analysis. SIAM Journal on Optimization, 30(2):1144–1172, 2020.
[77] Gonglin Yuan, Zhou Sheng, and Wenjie Liu. The modified hz conjugate gradient algorithm
for large-scale nonsmooth optimization. PLoS One, 11(10):e0164289, 2016.
[78] Shohre Zehtabian and Fabian Bastin. Penalty parameter update strategies in progressive
hedging algorithm. Cirrelt Montreal, QC, Canada, 2016.
[79] Di Zhang and Suvrajeet Sen. A stochastic conjugate subgradient algorithm for kernelized
support vector machines: The evidence.
[80] Di Zhang and Suvrajeet Sen. The stochastic conjugate subgradient algorithm for kernel support vector machines. arXiv preprint arXiv:2407.21091, 2024.
114
Abstract
Stochastic optimization is a cornerstone of operations research, providing a framework for solving optimization problems under uncertainty. Despite the development of numerous algorithms for such problems, several persistent challenges remain, including: (1) selecting an appropriate sample size, (2) determining an effective search direction, and (3) choosing a proper step size. This dissertation introduces a comprehensive framework, the Stochastic Conjugate Subgradient (SCS) framework, designed to address these challenges systematically. Specifically, the SCS framework offers structured approaches to determining the sample size, the search direction, and the step size. By integrating various stochastic algorithms within the SCS framework, we develop several novel stochastic algorithms. The convergence and convergence rates of these algorithms are rigorously established, and preliminary computational results support the theoretical analysis.