ON THE INTERPLAY BETWEEN STOCHASTIC PROGRAMMING, NON-PARAMETRIC STATISTICS, AND NONCONVEX OPTIMIZATION

by Shuotao Diao

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (INDUSTRIAL AND SYSTEMS ENGINEERING)

August 2022

Copyright 2022 Shuotao Diao

Acknowledgements

First, I would like to express my deepest gratitude to my Ph.D. advisor, Dr. Suvrajeet Sen. During my 6-year Ph.D. study, Dr. Sen taught me how to become a good researcher and an upright person. Without my advisor's unlimited patience and support, I would not have achieved a thorough understanding of both theory and computation in stochastic programming. Without my advisor's fatherly encouragement both inside and outside the research work, I would not have rebuilt my confidence. Dr. Sen took incredible amounts of effort in guiding me onto the right path, and he did much more for me than an advisor is expected to do. He is always the first person with whom I would like to share my happiness and worries during my 6-year Ph.D. study. I will forever remember the wonderful time of having brunch and dinner with my advisor while discussing research in Santa Monica.

Second, I would like to thank the professors who supported me immensely. I have learned to keep a peaceful mind in doing research from Dr. Bennett Eisenberg at Lehigh University. Dr. Huai-Dong Cao at Lehigh University taught me how to see the beauty of mathematics. I am also very grateful for the support of my dissertation committee, Dr. John Carlsson, Dr. Phebe Vayanos, and Dr. Rahul Jain.

Third, I would like to thank AFOSR and ONR for their financial support. I also want to thank Shelly Lewis and Grace Owh for their heartwarming support of my Ph.D. education.

Finally, I want to thank my grandfather, Guanchao Wu, and my parents, Lin Diao and Bin Wu, for their unconditional love. I spent most of my childhood with my grandfather.
My grandfather inspired me to have curiosity about anything unknown. My father is my best buddy, who always encourages me to try some sports and is always happy to play games with me. My mother was my first teacher, who taught me to travel to learn more from nature. Without their tireless support, I would not have had a chance to pursue my dream of getting a Ph.D. degree in my beloved area.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Background
  1.2 Algorithms for Stochastic Programming
    1.2.1 Stochastic Approximation and its Extension
    1.2.2 Stochastic Quasi-Gradient Method
    1.2.3 Stochastic Decomposition
  1.3 Types of Convergence
  1.4 Contributions of this Dissertation

Chapter 2: A Unifying View on the Sample Complexity and Decomposition-based Methods of Multistage Stochastic Linear Programs
  2.1 Overview
  2.2 Sample Complexity of MSLP
    2.2.1 Problem Formulation
    2.2.2 Sample Average Approximation Model of MSLP
    2.2.3 Background: Rademacher Average
    2.2.4 Complexity Analysis of SAA in MSLP
  2.3 Decomposition-Based Methods for MSLP
    2.3.1 Notations
    2.3.2 Problem Formulation
    2.3.3 Stochastic Dual Dynamic Programming
    2.3.4 Literature Review: Regularized SDDP
    2.3.5 SDDP with Proximal Mapping
    2.3.6 SD-SDDP: A Bridge between SD and SDDP
    2.3.7 Convergence Analysis
    2.3.8 Stochastic Dynamic Linear Programming
  2.4 Remarks

Chapter 3: Distribution-free Algorithms for Learning Enabled Optimization with Non-parametric Approximation
  3.1 Overview
    3.1.1 Notations
  3.2 Background
    3.2.1 Non-parametric Statistical Estimation
  3.3 LEON
    3.3.1 Mathematical Setting
    3.3.2 Algorithm Design and Convergence Analysis
    3.3.3 LEON with kNN Estimation
    3.3.4 LEON with Kernel Estimation
  3.4 Convergence Rate of LEON
  3.5 Computational Experiments
    3.5.1 Predictive Newsvendor Problem
    3.5.2 Computational Results with Two-Stage Predictive SLP
      3.5.2.1 Predictive Multi-product Inventory Planning
      3.5.2.2 Two-Stage Predictive Shipment Planning
  3.6 Remarks

Chapter 4: Non-parametric Stochastic Decomposition for Two-Stage Predictive Stochastic Programming
  4.1 Overview
  4.2 Preliminary
  4.3 Two-Stage Predictive SLP
    4.3.1 Problem Formulation
    4.3.2 Algorithm Design
    4.3.3 Theoretical Results
    4.3.4 Non-parametric SD with Batch Sampling
  4.4 Two-Stage Predictive SQQP
  4.5 Numerical Experiments
    4.5.1 Two-Stage Predictive SLP
    4.5.2 Two-Stage Predictive SQQP

Chapter 5: Online Non-parametric Estimation for Nonconvex Stochastic Programming
  5.1 Literature Review
  5.2 Scope and Contributions
    5.2.1 Technical Preliminaries and Notations
  5.3 Nonconvex Predictive Stochastic Programming
    5.3.1 N-LEON
      5.3.1.1 Convergence Analysis
    5.3.2 NSD-MM
  5.4 Computational Results
    5.4.1 Two-Stage PSP
  5.5 Remarks

Chapter 6: On the Combination of Stochastic Decomposition and Majorization Minimization for Nonconvex Stochastic Programming
  6.1 Notations
  6.2 Algorithm Design
  6.3 Convergence Analysis
  6.4 Computational Results
    6.4.1 Two-Stage SLP
    6.4.2 Two-Stage SP with Concave First-Stage Cost

Chapter 7: Conclusions

References

List of Tables

1.1 Decision Variables in the Single-Period Multi-Product Inventory Model
1.2 Model Parameters in the Single-Period Multi-Product Inventory Model
1.3 Decision Variables in the Multistage Hydroelectric Model
1.4 Model Parameters in the Multistage Hydroelectric Model
1.5 Convergence Rate of Algorithms for Stochastic Programming
2.1 Comparisons among Decomposition-Based Algorithms
3.1 Computational Results of the LEON Algorithm and (Robust) SA in the Predictive Newsvendor Problem (x* = 36.1975)
3.2 Computational Results of the LEON Algorithm in BAA99-(25,50)
3.3 Computational Results for BK19
3.4 Computational Results with Larger Sample Size for BK19
4.1 Problem Complexity of the Two-Stage Predictive SLPs
4.2 BK19, Sample Size 8,645
4.3 BAA99, Sample Size 32,445
4.4 Summary of Computation Experiments of the SD-kNN-Batch
4.5 Problem Size of SQQP Problem
4.6 Summary of the Numerical Experiments of SD-kNN-QQ
5.1 Frameworks for PSP Problems
5.2 Notations in the NSD-MM Algorithm
6.1 Notations in the SD-MM Algorithm
6.2 Computational Results of the SD-MM in Two-Stage SLPs

List of Figures

2.1 Scenario tree of the SAA method for a 4-stage MSLP with N_[2:4] = (2, 4, 3)
2.2 Building Blocks of the Regularized SD and its Extension
2.3 Building Blocks of the Regularized SDDP
3.1 Predictive Newsvendor Problem: SA and Robust SA
3.2 Illustration of the kNN method where (ω, ξ) follows a bivariate normal distribution (sample sizes = 100, 1000, 10000)
3.3 Predictive Newsvendor Problem: LEON-kNN (k = ⌊N^β⌋)
4.1 BK19, SD-kNN versus LEON-kNN
4.2 BAA99
5.1 Computational Results of the NSD-MM
6.1 Illustration of the SD-MM methodology. The first figure (starting from the left-hand side) illustrates that SD-MM approximates the concave function f(x) from above. The second figure illustrates that SD-MM approximates the convex function h(x) from below. The last figure shows that the sum of the two function approximations becomes a local approximation of f(x) + h(x).
6.2 Computational Results of the SD-MM
Abstract

In this collection of research, we focus on the study of first-order and decomposition-based algorithms for solving predictive stochastic programming problems, multistage stochastic programming problems, and nonconvex stochastic programming problems. Chapter 1 presents the generic formulation of stochastic programming problems, real-life examples of stochastic programming problems, types of convergence often used in the analysis of stochastic programming algorithms, and a brief introduction to several popular stochastic programming algorithms.

Chapter 2 studies sample complexity and decomposition-based algorithms for multistage stochastic linear programming problems. Firstly, we provide a different view of the sample complexity of multistage stochastic linear programs based on the concepts in compound sample average approximation. Secondly, we present a unifying framework to study various decomposition-based methods for solving multistage stochastic linear programming problems.

Chapter 3 demonstrates a fusion of concepts from stochastic optimization and non-parametric statistical learning in which data is available in the form of covariates interpreted as predictors and responses. In this new class of stochastic programming problems with covariates (which we refer to as predictive stochastic programming problems), the objective function is a conditional expectation of the random cost function with respect to the observation of the predictor. The objective function can predict the outcome given the observed characteristic, as in the hedonic pricing model for housing price prediction. To solve predictive stochastic programming problems, we use non-parametric estimation methods (e.g., kNN and kernel) to compute a weighted average of the subgradients of the cost function for each data pair. This new type of subgradient estimation is defined as the non-parametric stochastic quasi-gradient of the "true" objective function. We then use the notion of non-parametric stochastic quasi-gradients to design a double-loop first-order algorithm with guaranteed asymptotic convergence.

Chapter 4 follows the problem formulation of the previous chapter and makes a non-parametric extension of the Stochastic Decomposition (SD) algorithm (Non-parametric SD) to solve a class of two-stage predictive stochastic programming problems, which includes two-stage predictive stochastic linear programming problems as well as two-stage predictive stochastic quadratic-quadratic programming problems. Compared to the original SD, the minorant is constructed via k nearest neighbors estimation instead of sample averaging. Therefore, the minorant update (i.e., the rescaling of previously generated minorants) in the original SD cannot be used, and a new minorant update is designed to overcome this challenge.

Chapter 5 is about algorithm design for (predictive) stochastic programming beyond convex optimization. In particular, we present a fusion of stochastic decomposition, majorization minimization, and the k nearest neighbors method to solve a class of nonconvex predictive stochastic programs (PSP), where the randomness is presented as predictor and response. The objective function is a conditional expectation of a smooth concave function and a second-stage linear recourse function given the observation of the predictor. This extension not only allows new stochastic difference-of-convex (dc) functions but also allows new applications in which these two crucial paradigms (PSP and dc) can be integrated to provide a more powerful setting for modern applications. This combination also provides an opportunity to study convergence results in a more general setting. In Chapter 6, we show that, under minor modifications, the proposed methodology can be transitioned into a decomposition-based algorithm for solving classic nonconvex stochastic programming problems. Finally, the computational results of various instances demonstrate the efficiency of the methodology.
Chapter 3 is based on the joint working paper ([27]) with Dr. Suvrajeet Sen, and we thank Dr. Phebe Vayanos for the early discussions about predictive stochastic programming. Chapters 5 and 6 are based on the joint working paper ([28]) with Dr. Suvrajeet Sen.

Chapter 1
Introduction

Stochastic programming (SP) is a common way to model decision-making problems under uncertainty. The uncertainties occur when the parameters cannot be observed before the decision is made or when the parameter measurement itself is noisy. In this chapter, we formulate the generic model of the classic stochastic programming problem. We then provide three examples in finance, logistics, and power systems to motivate the uses of stochastic programming. Finally, we briefly overview several popular first-order and decomposition-based stochastic programming algorithms.

1.1 Background

We let (Ω, Σ_Ω, P) denote the probability space, we let ξ̃ : Ω → Ξ ⊆ R^m denote a random variable taking values in the measurable space (Ξ, Σ_Ξ), and we let µ_ξ denote the distribution function of ξ̃. In general, a stochastic programming problem (see Wets [113] and Birge and Qi [15] for more details) can be formulated as follows:

  min_x  E_ξ̃[F_0(x, ξ̃)]
  s.t.   E_ξ̃[F_i(x, ξ̃)] ≤ 0,  i = 1, ..., s,
         E_ξ̃[F_i(x, ξ̃)] = 0,  i = s+1, ..., m,
         x ∈ X ⊆ R^n,                                              (1.1)

where

  F_0 : R^n × Ξ → R ∪ {+∞},                                        (1.2a)
  F_i : R^n × Ξ → R,                                               (1.2b)
  E_ξ̃[F_i(x, ξ̃)] = ∫_Ξ F_i(x, ξ) µ_ξ(dξ)  for i = 0, 1, ..., m,   (1.2c)
  X is closed.                                                     (1.2d)

By Theorem 1.6.9 (Change of Variable Formula) in Durrett [29], the expectations in (1.1) can be written as the following integrals:

  E_ξ̃[F_i(x, ξ̃)] = ∫_Ω F_i(x, ξ̃) dP = ∫_Ξ F_i(x, ξ) µ_ξ(dξ).

One example is when ξ̃ is a standard normal random variable; then the distribution of ξ̃ and the expectation over ξ̃ can be written as follows:

  µ_ξ(dξ) = (1/√(2π)) e^{−ξ²/2} dξ,                                (1.3a)
  E_ξ̃[F_i(x, ξ̃)] = ∫_{−∞}^{∞} F_i(x, ξ) (1/√(2π)) e^{−ξ²/2} dξ.   (1.3b)
In analyzing algorithms to solve SP problems, we often face the technical problem of an infinite product probability space when scenarios are sampled as the algorithm proceeds. For instance, the Stochastic Approximation algorithm generates a sequence of i.i.d. random variables ξ̃_1, ξ̃_2, ..., which gives rise to an infinite product space. The following theorem constructs a unique probability measure on the infinite product space.

Theorem 1.1 (Kolmogorov's extension theorem, Theorem 2.1.14 in [29]). Suppose we are given probability measures µ_n on (R^n, R^n) that are consistent, that is,

  µ_{n+1}((a_1, b_1] × ··· × (a_n, b_n] × R) = µ_n((a_1, b_1] × ··· × (a_n, b_n]).

Then there is a unique probability measure P on (R^N, R^N) (where N = {1, 2, ...} stands for the natural numbers) with

  P(ζ : ζ_i ∈ (a_i, b_i], 1 ≤ i ≤ n) = µ_n((a_1, b_1] × ··· × (a_n, b_n]).

Note that Theorem 1.1 can be extended to the case in which the random variables take values in other measurable spaces (S, S) when (S, S) is nice.¹

Theorem 1.2 (Theorem 2.1.15 in [29]). If S is a Borel subset of a complete separable metric space M, and S is the collection of Borel subsets of S, then (S, S) is nice.

With a slight abuse of notation, we also let P denote the probability measure on (S^N, S^N).

Example 1: Conditional Value at Risk in Portfolio Optimization

Conditional Value at Risk (CVaR; see Rockafellar and Uryasev [95] for more details) is the conditional expectation of the loss given that the loss is greater than some amount. In other words, CVaR measures the average loss over the tail/extreme events. Since β-CVaR (denoted φ_β(x)) is closely related to β-VaR (denoted α_β(x)), we provide both definitions below:

  α_β(x) = min{ a ∈ ℜ : ∫_{F(x,ξ) ≤ a} µ_ξ(dξ) ≥ β },

  φ_β(x) = (1/(1−β)) ∫_{F(x,ξ) ≥ α_β(x)} F(x, ξ) µ_ξ(dξ).

One application of CVaR is portfolio optimization. Given a portfolio of n financial instruments, we define the decision variables and model parameters as follows.
  x_i : position in the i-th instrument
  ξ̃_i : return on the i-th instrument
  α : variable corresponding to β-VaR
  β : probability level, β ∈ (0, 1)

¹ (S, S) is nice if there is a 1-1 map φ from S into ℜ such that φ and φ^{−1} are both measurable.

Let [t]_+ = max{t, 0} for any t ∈ ℜ, x = (x_1, x_2, ..., x_n), and ξ̃ = (ξ̃_1, ..., ξ̃_n). The portfolio optimization problem with the CVaR objective is formulated below:

  min_{x,α}  α + (1/(1−β)) E_ξ̃[ [−ξ̃^⊤x − α]_+ ]
  s.t.       Σ_{i=1}^n x_i = 1,  x ≥ 0.

Example 2: Single-Period Multi-Product Inventory Model

This example is based on the single-period multi-product inventory model in Bassok et al. [6]. The goal is to maximize the expected profit of the single-period multi-product inventory planning problem. We reformulate this problem as a two-stage stochastic programming problem. In the first stage, the inventory level of each product needs to be determined. In the second stage, the demands of all classes are realized, and each product is allocated to satisfy each demand. Both backorders and excess stock are allowed in this model. The notations of the decision variables and the model parameters are defined below.

Table 1.1: Decision Variables in the Single-Period Multi-Product Inventory Model
  y_i : inventory level (in units) of the i-th product
  w_ij : number of units of the i-th product used to satisfy demand class j
  v_i : excess amount of the i-th product
  u_i : backorder amount of the i-th product

Table 1.2: Model Parameters in the Single-Period Multi-Product Inventory Model
  c_i : purchase cost of the i-th product
  d_i : demand of the i-th product
  m_y : maximum inventory
  a_ij : net revenue from using the i-th product to satisfy a unit of demand class j
  s_i : salvage value of excess stock of the i-th product
  π_i : backorder cost of the i-th product

The first-stage problem is

  min_y  Σ_{k=1}^N c_k y_k + E_ξ̃[Q(y, ξ̃)]
  s.t.   0 ≤ y_k ≤ m_y,  for k = 1, ..., N,

where Q(y, ξ) is the optimal value of the following problem.
  Q(y, ξ) = min_{w,v,u}  −Σ_{i=1}^N Σ_{j=1}^i a_ji w_ji − Σ_{i=1}^N s_i v_i + Σ_{i=1}^N π_i u_i
            s.t.  u_i + Σ_{j=1}^i w_ji = d_i(ξ),  for i = 1, ..., N,
                  v_i + Σ_{j=i}^N w_ij = y_i,  for i = 1, ..., N,
                  w_ij ≥ 0,  for i = 1, ..., N, j = i, ..., N,
                  u_i, v_i ≥ 0,  for i = 1, ..., N.

Example 3: Multistage Hydroelectric Model

This example is a simplification of the multistage hydroelectric model in Rotting and Gjelsvik [96]. In this example, the goal is to make a long-term and seasonal plan of the power generation policy under the uncertainties of future inflow. We provide the notations of the decision variables and model parameters below.

Table 1.3: Decision Variables in the Multistage Hydroelectric Model
  x_t : vector of water content of the reservoirs at the end of the t-th stage
  y_t : vector of market transactions and exchanges with other subsystems in the t-th stage
  q_t : vector of inflows of the reservoirs in the t-th stage
  u_t : vector of releases of the reservoirs in the t-th stage

Given the initial decision and the initial parameter, x_0 and ξ_1, we provide a dynamic programming formulation of the multistage stochastic linear programming model below.
Equation (1.4d)) reflect the bounds on the control variables. 1.2 Algorithms for Stochastic Programming The crucial motivations for designing algorithms for solving SP problems are two-fold: Firstly, the complexity of the original SP problem drastically scales up as the dimension and size of the scenario set increases. Secondly, an algorithm needs to be designed so that 6 the computation per iteration is not cumbersome and the final output of the algorithm is asymptotically close to the optimal solution of the original SP problems. 1.2.1 Stochastic Approximation and its Extension Stochastic Approximation (SA) algorithm dates back to the work of Robbins and Monro [92]. We recommend Lai [60] to the reader who has an interest in the development of SA. WeusethetheoreticalframeworkfromNemirovskietal. [79]tointroducetheSAalgorithm. Throughoutthis subsection, wefocus on solvingthe followingstochastic programmingprob- lem. min x∈X f(x)≜E ˜ ξ [F(x, ˜ ξ )], (1.5) where X : nonempty compact convex set, F :X× Ξ 7→R. We let x ∗ denote an optimal solution of the problem above and let X ∗ = argmin x∈X f(x). We assume that F(·,ξ ) is convex on X. We further assume that f(·) is continuous on X and finite value in a neighborhood of x ∈ X. The fundamentals of the SA algorithm relies on the following assumptions. A1 ˜ ξ 1 , ˜ ξ 2 ,... are i.i.d. random variables. A2 There exists an oracle such that for a given input point (x,ξ )∈ X× Ξ , a stochastic subgradient, G(x,ξ ), of f(x) (i.e., G(x,ξ ) ∈ ∂F(x,ξ ) and E ˜ ξ [G(x, ˜ ξ )] ∈ ∂f(x)) is returned. Let∥·∥ 2 denote the L 2 norm and and let Π X denote the projection operator (i.e., Π X (x)= min z∈X ∥z− x∥ 2 2 ). The SA algorithm is provided below. By using the diminishing stepsize rule (i.e., P ∞ ℓ=1 γ ℓ = ∞, P ∞ ℓ=1 γ 2 ℓ < ∞), the SA algorithm is shown to converge to an optimal solution with probability one [11, 93], which is based on the following Supermartingale Convergence Theorem. 
Algorithm 1 Stochastic Approximation
  Input: x_1 ∈ X, L < ∞, γ_ℓ > 0 for ℓ ∈ N.
  Output: x_L.
  for ℓ = 1, 2, ..., L do
    Sample ξ_ℓ from the probability distribution µ_ξ.
    Calculate G(x_ℓ, ξ_ℓ) based on assumption A2.
    Update the solution: x_{ℓ+1} = Π_X(x_ℓ − γ_ℓ G(x_ℓ, ξ_ℓ)).
  end for

Theorem 1.3 (Supermartingale Convergence Theorem; see Proposition 8.2.10 in [11]). Let Y_l, Z_l, and W_l, l = 0, 1, 2, ..., be three sequences of random variables, and let F_l, l = 0, 1, 2, ..., be sets of random variables such that F_l ⊂ F_{l+1} for all l. Suppose that:
  (1) The random variables Y_l, Z_l, and W_l are nonnegative and are functions of the random variables in F_l.
  (2) For each l, we have E[Y_{l+1} | F_l] ≤ Y_l − Z_l + W_l.
  (3) There holds Σ_{l=0}^∞ W_l < ∞.
Then we have Σ_{l=0}^∞ Z_l < ∞, and the sequence Y_l converges to a nonnegative random variable Y with probability 1.

Under the extra assumption that f(x) is differentiable and strongly convex on X, [79] shows that E[∥x_ℓ − x*∥₂] is of order O(ℓ^{−1/2}) (i.e., there exist ℓ_0 < ∞ and M < ∞ such that E[∥x_ℓ − x*∥₂] ≤ M ℓ^{−1/2} for all ℓ ≥ ℓ_0). Under the further assumptions that x* is in the interior of X and ∇f(x) is Lipschitz continuous, Nemirovski et al. [79] also show that E[f(x_ℓ) − f(x*)] is of order O(ℓ^{−1}).

One extension of SA is Robust SA [79, 89], which uses the average of a sequence of estimated solutions from SA as the output solution (i.e., x_1^N = (1/N) Σ_{ℓ=1}^N x_ℓ, where {x_ℓ}_{ℓ=1}^N is generated by the SA algorithm). Suppose that there exists M_G < ∞ such that E[∥G(x, ξ̃)∥₂²] ≤ M_G² for all x ∈ X. The Robust SA algorithm with constant stepsize (the varying-stepsize version can be seen in [79]) is provided below. Without any smoothness assumption on f(x), it is shown by Nemirovski et al. [79] that E[f(x_1^N) − f(x*)] is of order O(N^{−1/2}). Polyak and Juditsky [89] show that x_1^N converges to x* with probability one when X = ℜ^n and some regularity conditions on f(x) and γ_ℓ are satisfied. As mentioned by Nemirovski et al.
[79], Robust SA can be viewed as the Mirror Descent algorithm using the Euclidean structure.

Algorithm 2 Robust Stochastic Approximation
  Input: x_1 ∈ X, N < ∞, D_X = max_{x∈X} ∥x − x_1∥₂, γ_ℓ = D_X/(M_G √N) for ℓ = 1, 2, ..., N.
  Output: x_1^N = (1/N) Σ_{ℓ=1}^N x_ℓ.
  for ℓ = 1, 2, ..., N do
    Sample ξ_ℓ from the probability distribution µ_ξ.
    Calculate G(x_ℓ, ξ_ℓ) based on assumption A2.
    Update the solution: x_{ℓ+1} = Π_X(x_ℓ − γ_ℓ G(x_ℓ, ξ_ℓ)).
  end for

Here, we focus on introducing one version of the Mirror Descent algorithm which utilizes a distance generating function and averaging of a history of estimated solutions. For the connection between the original Mirror Descent ([80, 81]) and the version involving a distance generating function, we recommend [8]. Let ∥·∥ be a general norm and let ∥x∥_* = max_{∥z∥≤1} z^⊤x be its dual norm. The distance generating function is defined as follows.

Definition 1.4. A function ϱ : X → ℜ is a distance generating function with modulus α > 0 with respect to a norm ∥·∥ if it satisfies the following properties:
  a. ϱ(·) is convex and continuous on X.
  b. The set X° = {x ∈ X : ∂ϱ(x) ≠ ∅} is nonempty.
  c. ϱ(·) is continuously differentiable and strongly convex with parameter α with respect to ∥·∥ on X°.

Let V(x, z) = ϱ(z) − [ϱ(x) + ∇ϱ(x)^⊤(z − x)]. Suppose that there exists M_* < ∞ such that E[∥G(x, ξ̃)∥_*²] ≤ M_*². The Mirror Descent algorithm with constant stepsize can be written as follows. Nemirovski et al. [79] show that the expected optimality gap E[f(x_1^N) − f(x*)] of the Mirror Descent algorithm is of order O(N^{−1/2}) when the optimized constant stepsize or the optimized diminishing stepsize is used. It is worth noting that Zhou et al. [116] show almost sure global convergence of the Stochastic Mirror Descent algorithm on a class of variationally coherent² problems (including convex problems).

Algorithm 3 Mirror Descent
  Input: x_1 ∈ X, N < ∞, D_ϱ = √(max_{x∈X} ϱ(x) − min_{z∈X} ϱ(z)), γ_ℓ = √(2α) D_ϱ/(M_* √N) for ℓ = 1, 2, ..., N.
  Output: x_1^N = Σ_{ℓ=1}^N γ_ℓ x_ℓ / Σ_{ℓ=1}^N γ_ℓ.
  for ℓ = 1, 2, ..., N do
    Sample ξ_ℓ from the probability distribution µ_ξ.
    Calculate G(x_ℓ, ξ_ℓ) based on assumption A2.
    Update the solution: x_{ℓ+1} = argmin_{x∈X} γ_ℓ G(x_ℓ, ξ_ℓ)^⊤(x − x_ℓ) + V(x_ℓ, x).
  end for

1.2.2 Stochastic Quasi-Gradient Method

The Stochastic Quasi-Gradient (SQG) method [32, 35] can be considered a more general version of SA. Again, we focus on solving min_{x∈X} f(x) = E[F(x, ξ̃)], which is defined in the previous section. For a given (x, ξ) ∈ X × Ξ, let F̂_ℓ(x; {ξ_{ℓ,i}}_{i=1}^{n_ℓ}) and Ĝ_ℓ(x; {ξ_{ℓ,i}}_{i=1}^{n_ℓ}) denote the estimates of f(x) and g(x) in the ℓ-th iteration, respectively. Let F_ℓ denote the σ-algebra generated by the history of estimated solutions up to iteration ℓ (i.e., {x_0, x_1, ..., x_ℓ}). F̂_ℓ and Ĝ_ℓ satisfy the following properties:

  E[F̂_ℓ(x_ℓ; {ξ̃_{ℓ,i}}_{i=1}^{n_ℓ}) | F_ℓ] = f(x_ℓ) + δ_ℓ,
  E[Ĝ_ℓ(x_ℓ; {ξ̃_{ℓ,i}}_{i=1}^{n_ℓ}) | F_ℓ] = g(x_ℓ) + ς_ℓ,

where g(x_ℓ) ∈ ∂f(x_ℓ). Usually, δ_ℓ and ∥ς_ℓ∥ are required to diminish to 0 as ℓ goes to infinity. An indirect way to measure the accuracy of Ĝ_ℓ is provided below:

  f(x*) − f(x_ℓ) ≥ (x* − x_ℓ)^⊤ E[Ĝ_ℓ(x_ℓ, ξ̃_ℓ) | F_ℓ] + ϵ_ℓ,      (1.6)

where x* is the optimal solution to min_{x∈X} f(x) = E[F(x, ξ̃)] and ϵ_ℓ → 0 as ℓ → ∞.

² Problem (1.5) is variationally coherent if ∇f(x)^⊤(x − x*) ≥ 0 for all x ∈ X and x* ∈ X*, and there exists some x̄* ∈ X* such that ∇f(x̃)^⊤(x̃ − x̄*) ≥ 0 only if x̃ ∈ X*.

Algorithm 4 Stochastic Quasi-Gradient Method
  Input: x_0 ∈ X, L < ∞, γ_ℓ > 0 for ℓ ∈ N, and a sequence of positive integers {n_ℓ}.
  Output: x_L.
  for ℓ = 0, 1, 2, ..., L − 1 do
    Generate an i.i.d. dataset {ξ_{ℓ,i}}_{i=1}^{n_ℓ} from the probability distribution µ_ξ.
    Calculate Ĝ_ℓ(x_ℓ; {ξ_{ℓ,i}}_{i=1}^{n_ℓ}).
    Update the solution: x_{ℓ+1} = Π_X(x_ℓ − γ_ℓ Ĝ_ℓ(x_ℓ; {ξ_{ℓ,i}}_{i=1}^{n_ℓ})).
  end for

Under the following conditions, Ermoliev [32] shows that the sequence {x_ℓ}_{ℓ∈N} in (1.6) generated by the SQG method converges to an optimal solution with probability one:
  f(x) is continuous and convex on X,                                              (1.7a)
  X is compact and convex,                                                         (1.7b)
  there exists M < ∞ such that E[∥Ĝ_ℓ(x_ℓ; {ξ_{ℓ,i}}_{i=1}^{n_ℓ})∥ | F_ℓ] ≤ M,       (1.7c)
  with probability one, γ_ℓ ≥ 0, Σ_{ℓ=0}^∞ γ_ℓ = ∞, Σ_ℓ E[γ_ℓ |ϵ_ℓ| + γ_ℓ²] < ∞.    (1.7d)

Condition (1.7d) implies that the convergence of the SQG method relies on the assumption that the bias from the subgradient calculation is known and possible to compute in practice.

1.2.3 Stochastic Decomposition

Higle and Sen [46, 47] propose the (Regularized) Stochastic Decomposition (SD) algorithm to solve the following two-stage stochastic linear programming problem:

  min_x  c^⊤x + E_ξ̃[h(x, ξ̃)]
  s.t.   Ax ≤ b,                                                                   (1.8)

where h(x, ξ) is the minimum value of the second-stage problem for a given (x, ξ):

  h(x, ξ) = min{ d^⊤y : Dy = e(ξ) − C(ξ)x, y ≥ 0 }.

The key features of the SD algorithm are: (1) simultaneous function-solution updates; (2) proximal mapping in the master problem; (3) incumbent selection for producing a sequence of convergent incumbents. In the k-th iteration, SD utilizes a proximal mapping based on the current approximation function to find the next candidate solution:

  x_k = argmin_x { f_{k−1}(x) + (σ_k/2) ∥x − x̂_{k−1}∥₂² : Ax ≤ b }.

Then a new scenario, ξ_k, is generated from the probability distribution µ_ξ. A dual extreme point is explored by solving the following second-stage problem:

  π_k ∈ argmax_π { π^⊤(e(ξ_k) − C(ξ_k) x_k) : D^⊤π ≤ d }.

Note that the set {π : D^⊤π ≤ d} is fixed, which makes the process of storing dual extreme points meaningful. Let Π_k denote the set of explored dual extreme points, and let Ĥ_k(x) = (1/k) Σ_{i=1}^k h(x, ξ_i) denote the sample-based benchmark value function in iteration k. A minorant, h_k^k(x), of it is constructed as follows:

  π_{k,i} ∈ argmax_π { π^⊤(e(ξ_i) − C(ξ_i) x_k) : π ∈ Π_k },  for i = 1, 2, ..., k,
  h_k^k(x) = (1/k) Σ_{i=1}^k π_{k,i}^⊤ (e(ξ_i) − C(ξ_i) x).

The updates of previous minorants are based on the assumption that h is bounded below by 0. The SD algorithm, based on Liu and Sen [69], is provided in Algorithm 5.
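As a concrete illustration of the minorant (cut) construction above, the following sketch builds the affine minorant h_k^k from a set of stored dual vertices in the scalar case, picking the dual vertex that is tightest at the candidate x_k for each scenario. This is a simplified toy illustration, not the thesis's implementation; all names (dual_vertices, scenarios, build_minorant) are hypothetical.

```python
# Simplified sketch of the SD cut construction, assuming a scalar first-stage
# decision x and scalar e(xi_i), C(xi_i) per scenario. Hypothetical names.

def dual_value(pi, e_i, c_i, x):
    # pi^T (e(xi_i) - C(xi_i) x) in the scalar case
    return pi * (e_i - c_i * x)

def build_minorant(dual_vertices, scenarios, x_k):
    """Return (intercept, slope) of the affine minorant
    h_k^k(x) = (1/k) * sum_i pi_{k,i} (e_i - c_i x),
    where pi_{k,i} is the stored dual vertex tightest at x_k."""
    k = len(scenarios)
    intercept = 0.0
    slope = 0.0
    for (e_i, c_i) in scenarios:
        # best stored dual vertex for scenario i, evaluated at the candidate x_k
        pi_best = max(dual_vertices, key=lambda pi: dual_value(pi, e_i, c_i, x_k))
        intercept += pi_best * e_i / k
        slope += -pi_best * c_i / k
    return intercept, slope
```

For instance, with dual vertices {0, 1} and two scenarios (e_i, c_i) = (2, 1) and (4, 1), the cut at x_k = 1 is h(x) = 3 − x, which matches the average of the per-scenario dual values at x_k.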
Algorithm 5 Stochastic Decomposition
Set Π_0 = ∅, x̂_0 ∈ X := {x ∈ ℜ^n : Ax ≤ b}, q ∈ (0, 1), σ_k ≥ 1, and β ∈ (0, 1).
Set J_0 = {0}, h_0^0(x) = 0, H_0(x) = max{h_0^0(x)}, and f_0(x) = c^⊤ x + H_0(x).
for k = 1, 2, 3, ... do
    (Candidate Selection) Solve the following proximal mapping problem
        x_k ∈ argmin_{x∈X} f_{k−1}(x) + (σ_k/2) ∥x − x̂_{k−1}∥^2,
    and record the Lagrange multipliers {θ_i^{k−1}}_{i∈J_{k−1}} of {h_i^{k−1}(x)}_{i∈J_{k−1}}.
    Generate a new sample ξ_k.
    Compute a dual extreme point π_k ∈ argmax_π {π^⊤ (e(ξ_k) − C(ξ_k)x_k) : D^⊤ π ≤ d}.
    Update Π_k ← Π_{k−1} ∪ {π_k} and J_k = J_{k−1} ∪ {−k, k} \ {i ∈ J_{k−1} : θ_i^{k−1} = 0}.
    for i = 1, 2, ..., k do
        Compute π_{k,i} ∈ argmax_π {π^⊤ (e(ξ_i) − C(ξ_i)x_k) : π ∈ Π_k} and
        π̂_{k,i} ∈ argmax_π {π^⊤ (e(ξ_i) − C(ξ_i)x̂_{k−1}) : π ∈ Π_k}.
    end for
    Construct h_k^k(x) = (1/k) Σ_{i=1}^k π_{k,i}^⊤ (e(ξ_i) − C(ξ_i)x).
    Construct h_{−k}^k(x) = (1/k) Σ_{i=1}^k π̂_{k,i}^⊤ (e(ξ_i) − C(ξ_i)x).
    Update h_i^k = ((k−1)/k) h_i^{k−1}, ∀ i ∈ J_k \ {−k, k}.
    Construct H_k(x) = max{h_i^k(x) : i ∈ J_k} and f_k(x) = c^⊤ x + H_k(x).
    (Incumbent Selection) if f_k(x_k) − f_k(x̂_{k−1}) ≤ q [f_{k−1}(x_k) − f_{k−1}(x̂_{k−1})] then
        Set x̂_k ← x_k.
    else
        Set x̂_k ← x̂_{k−1}.
    end if
end for

Higle and Sen [47] show that with probability one, there exists K ⊂ N such that {x_{k+1}}_{k∈K} and {x̂_k}_{k∈K} converge to the same optimal solution to (1.8). Sen and Liu [98] further refine the analysis and show that the sequence {x̂_k}_{k∈N} converges to a unique optimal solution to (1.8) with probability one. Liu and Sen [69] extend the SD algorithm to solve two-stage stochastic quadratic-linear programming (SQLP) / quadratic-quadratic programming (SQQP) problems and show that E[∥x̂_N − x*∥] is of order O(N^{−1}).

1.3 Types of Convergence

Here, we provide various kinds of convergence of algorithms for solving the classic SP problems in (1.1). Let {x_ℓ}_ℓ denote the sequence generated by an algorithm (e.g., Stochastic Approximation). Again, we let X* = argmin_{x∈X} f(x), where f(x) = E_ξ̃[F(x, ξ̃)].
We assume that X is closed and convex, and that F(·, ξ̃) is convex on X for almost every ξ̃. We also assume that X* is nonempty and that min_{x∈X} f(x) is finite. We summarize the convergence results of various algorithms below.

I. With probability one, x_ℓ → x* as ℓ → ∞, where x* ∈ X*.
II. With probability one, f(x_ℓ) → f(x*) as ℓ → ∞, where x* ∈ X*.
III. E[∥x_ℓ − x*∥] → 0 as ℓ → ∞.
IV. E[f(x_ℓ) − f(x*)] → 0 as ℓ → ∞.

The following famous convergence theorem provides a key connection between almost sure convergence (i.e., convergence with probability one) and convergence in expectation.

Theorem 1.5 (Dominated Convergence Theorem). ([29]) Let {X_n} be a sequence of random variables on (Ω, F, P). If X_n → X w.p.1, and |X_n| < Y w.p.1 for all n, and E[Y] < ∞, then E[X_n] → E[X].

If Y is a constant, then Theorem 1.5 is called the bounded convergence theorem. Now, we can connect the various types of convergence in SP algorithms. If X ⊂ int(dom(f)), then by the continuity of f, Type I convergence implies Type II convergence. When X is compact, there exists M < ∞ such that ∥x_ℓ − x*∥ ≤ M w.p.1. By the bounded convergence theorem, Type I convergence implies Type III convergence. If there exists M_f < ∞ such that |f(x)| <
Stochastic Decomposition algorithm (SD) is a decomposition-based algorithm used to solve two-stage stochastic linear programming problems (two-stage SLP), two-stage stochastic quadratic-linear programming problems(two-stageSQLP)aswellastwo-stagestochasticquadratic-quadraticprogramming problems (two-stage SQQP). It is worth noting that the SD algorithm generates a sequence of candidate solutions and a sequence of incumbent solutions. Sen and Liu [98] show that the sequence of incumbent solutions generated by the SD algorithm has Type I convergence. Stochastic Dual Dynamic Programming (SDDP) is another decomposition-based algorithm usedtosolvemultistagestochasticlinearprogrammingproblems. Itisworthnotingthatthe efficacyoftheSDDPisrestrictedtothecasewhentheprobabilitydistributionisknown, the support of random variables is finite, and the random variables are stage-wise independent. Philpott and Guan [88] show that SDDP has Type I convergence. Algorithm Problem Class Convergence Type Rate SA Smooth and Strongly Convex E[∥x ℓ − x ∗ ∥] O(ℓ − 1 2) [79] SA Smooth, Strongly Convex and Lipschtiz Continuous Gradient E[f(x ℓ )− f(x ∗ )] O(ℓ − 1 ) [79] Robust SA Convex E[f(x ℓ )− f(x ∗ )] O(ℓ − 1 2) [79] Mirror Descent SA Convex E[f(x ℓ )− f(x ∗ )] O(ℓ − 1 2) [79] SD SQLP/SQQP E[∥x ℓ − x ∗ ∥] O(ℓ − 1 ) [69] Table 1.5: Convergence Rate of Algorithms for Stochastic Programming 15 1.4 Contributions of this Dissertation The contributions of the dissertation are three-fold. First, we propose Learning Enabled Optimization with Non-parametric Approximation (LEON) algorithm and Stochastic De- composition with k nearest neighbors estimation (SD-kNN) algorithm to contribute to the mathematical fusion of non-parametric statistical estimation and stochastic programming algorithms. Both algorithms solve a new class of convex stochastic programming problems where the random vectors appear as tuples (i.e., covariates). 
The objective function is a conditional expectation of the cost/profit given a partial observation of the tuple (e.g., the observation of the predictor). Without knowing the true conditional distribution of the tuple, the LEON algorithm uses a dataset of historic tuples and a non-parametric estimation method (kNN or kernel) to estimate the subgradient of the "true" objective function, and the decision is updated by performing a standard projection step using the previously estimated decision and the non-parametric estimate of the subgradient. Such an approach can be handily extended to streaming data applications. Streaming data applications arise in real-time systems (e.g., wind power generation) in which leading indicators (such as wind speed) from an upstream location can be used as a predictor of wind speed at a downstream location, thus improving the quality of predictions. Moreover, when such predictions are used within online algorithms (e.g., SD), they have the potential to improve real-time operations of rapidly evolving systems such as wind power generation. As opposed to standard distribution-driven stochastic programming algorithms (which require full knowledge of the distribution, and the distribution must be fixed), we design a data-driven first-order stochastic programming algorithm that utilizes non-parametric estimation to adapt to the evolution of the data process. In the SD-kNN algorithm, we go one step further by reusing the previously generated kNN estimates of the subgradients to construct a piecewise linear approximation of the objective function. We use the term "minorant" to refer to a piece stored in the piecewise linear approximation of the objective function. Such methodology can be regarded as a non-parametric extension of the SD algorithm for solving two-stage predictive stochastic programming problems.
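The conditional-expectation objective described above can be estimated non-parametrically in a few lines. The sketch below (synthetic data; all names are ours, not the dissertation's code) estimates E[ξ | W = w] by averaging over the k nearest observed predictors, which is the kNN building block underlying LEON and SD-kNN:

```python
# Minimal kNN estimator of a conditional expectation E[cost(xi) | W = w],
# in the spirit of the predictive setup described above.  Synthetic,
# illustrative data; not the dissertation's implementation.
import random

def knn_estimate(history, w, k, cost):
    """history: list of (predictor w_i, outcome xi_i) tuples.
    Returns the average of cost(xi_i) over the k nearest w_i to w."""
    nearest = sorted(history, key=lambda p: abs(p[0] - w))[:k]
    return sum(cost(xi) for _, xi in nearest) / k

random.seed(0)
# synthetic tuples: xi is centered at 2*w, so E[xi | W = w] = 2*w
history = [(w, 2 * w + random.gauss(0, 0.1))
           for w in [random.uniform(0, 1) for _ in range(500)]]

# estimate E[xi | W = 0.5]; the true value is 1.0
est = knn_estimate(history, w=0.5, k=25, cost=lambda xi: xi)
```

Replacing `cost` with a subgradient evaluated at the current decision gives the kNN subgradient estimate used in the projection step.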
To the best of our knowledge, this is the first decomposition-based stochastic programming algorithm that uses non-parametric estimation to attain predictive capability.

Second, we design a fusion of kNN, SD, and MM (NSD-MM) to solve nonconvex predictive stochastic programming problems where the objective function is difference-of-convex. One key feature of the proposed algorithm is that we use a piecewise linear function to approximate the convex component of the kNN estimate of the objective function from below while simultaneously constructing a convex surrogate function to approximate the concave component of the kNN estimate of the objective function from above. As a result, the combined function approximation is neither a lower bound nor an upper bound of the kNN estimate of the objective function, which is very uncommon in the difference-of-convex algorithm literature. To ensure descent in terms of the kNN estimate of the objective function, we design a novel inner loop. The numerical results of a shipment planning problem and a freight fleet scheduling problem validate the computational efficiency of the NSD-MM algorithm. We also show that NSD-MM, under minor modifications, can be converted into a decomposition-based algorithm (SD-MM) for solving two-stage SP where the objective function consists of a smooth concave first-stage objective and a linear second-stage recourse function. We extend the boundary of SD and provide a unifying view on the decomposition-based methods for stochastic programming by identifying an associated supermartingale property of the iterates. Such methodology is proven to be computationally efficient by applying the SD-MM solver to eight instances of the classic two-stage stochastic linear program (e.g., 4NODE, SSN, and so on) and two instances of the two-stage stochastic program with concave first-stage cost.
Finally, we provide a new view of the sample complexity of the multistage stochastic linear program based on the concepts of compound SAA and build a unifying framework to study various decomposition-based algorithms (e.g., SDDP, regularized SDDP, and SDLP) for solving multistage stochastic linear programming problems.

Chapter 2
A Unifying View on the Sample Complexity and Decomposition-based Methods of the Multistage Stochastic Linear Program

2.1 Overview

The multistage stochastic linear program (MSLP) arises when decisions need to be made periodically to adapt to the evolution of exogenous and endogenous random processes. It has been widely used in various areas, such as portfolio optimization [22], hydroelectric scheduling [77], multi-reservoir operation planning [63], electricity energy management [45], and energy resources planning [90]. In this chapter, we start by presenting the sample complexity of MSLP based on the theory of uniform normalized convergence. Then we present a unifying view of the decomposition-based methods for solving MSLP.

2.2 Sample Complexity of MSLP

Shapiro [100, 101] applies large deviation theory to analyze the sample size required for the sample average approximation (SAA) method when applied to the multistage stochastic program. Ermoliev and Norkin [34] use the concepts of uniform normalized convergence of random variables and Rademacher averages of functional sets to analyze the sample complexity of the SAA method for compound stochastic optimization problems. They claim that the objective function in a multistage stochastic programming (MSP) problem can be regarded as a compound function that consists of both internal conditional expectations and nested minimization operations. However, they do not provide a concrete guideline for analyzing the sample complexity of the SAA method for MSP problems.
In this paper, we aim to provide this guideline and utilize the tools of uniform normalized convergence to provide an alternate view on analyzing the sample complexity of the SAA method for the multistage stochastic linear program (MSLP). It is worth noting that this sample complexity analysis is distribution-free. In other words, such complexity results should be regarded as worst-case complexity. Our contributions are two-fold: (1) Under a regularity assumption on the union of the stage-wise effective feasible regions, we decouple the solution path and provide the recursive relation of the stagewise SAA estimation error. (2) Starting from the recursive relation of the stagewise SAA estimation error, we use the notion of uniform convergence and its related results to derive the sample complexity of the MSLP problem.

2.2.1 Problem Formulation

Let ξ̃_{[T]} : Ω → Ξ_2 × Ξ_3 × ... × Ξ_T be a full sample path (i.e., ξ̃_{[T]} = (ξ̃_1, ..., ξ̃_T)) defined on the probability space (Ω, Σ_Ω, P). In particular, let ξ_1 be a realization of the vector of random components of c_1, b_1, A_1. For t ≥ 2, let ξ_t ∈ Ξ_t ⊆ R^{r_t} be the realization of the vector of random components of c_t, b_t, B_t, and A_t in the t-th stage. In the MSLP problem, we let x_1 : Ω → R^{n_1} and X_1(ξ_1) := {x_1 | A_1 x_1 = b_1, x_1 ≥ 0} denote the first-stage decision variable and its corresponding feasible region. For t ≥ 2, we let x_t : Ω → R^{n_t} denote the decision variable in the t-th stage and let X_t(x_{t−1}, ξ_t) := {x_t | B_t x_{t−1} + A_t x_t = b_t, x_t ≥ 0} denote its effective feasible region. Let F_t be the σ-algebra generated by (ξ̃_1, ..., ξ̃_t). We assume that the decision information structure in the MSLP problem is a filtration (i.e., for a collection of σ-algebras {F_t}_{t=1}^T, it satisfies F_t ⊆ F_{t+1} for t = 1, ..., T − 1). For more information about the use of filtration in MSLP, we refer the reader to [18, 115].
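A decision process adapted to this filtration can only use the history observed so far. The toy check below (our own construction, not from the dissertation) makes this concrete: a policy that is a function of the prefix (ξ_1, ..., ξ_t) must assign identical stage-t decisions to any two sample paths that agree through stage t.

```python
# Illustration of an adapted (nonanticipative) decision rule: the stage-t
# decision depends only on the observed history prefix, so two sample
# paths that coincide through stage t receive the same stage-t decision.
# Toy policy, purely illustrative.

def policy(history):
    # stage-t decision: any function of the observed prefix works;
    # here we use the running average of the observations
    return sum(history) / len(history)

path_a = [1.0, 2.0, 5.0]   # (xi_1, xi_2, xi_3)
path_b = [1.0, 2.0, -3.0]  # same history through stage 2, different xi_3

# stage-2 decisions coincide because the prefixes coincide
x2_a = policy(path_a[:2])
x2_b = policy(path_b[:2])

# stage-3 decisions may differ once the histories diverge
x3_a = policy(path_a[:3])
x3_b = policy(path_b[:3])
```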
We consider that the decision x = (x_1, ..., x_T) made in the MSLP problem is nonanticipative. That is, x_t is measurable with respect to the σ-algebra F_t.

Until the end of this chapter, we consider the case when ξ_1 is given. We shall replace X_1(ξ_1) by X_1 because ξ_1 is fixed. We let Pr{·} denote the probability of the event in the brackets. Now we formulate the MSLP problem below.

min_{x_1∈X_1} c_1^⊤ x_1 + E[ inf_{x_2∈X_2(x_1,ξ_2)} c_2^⊤ x_2 + E[ inf_{x_3∈X_3(x_2,ξ_3)} c_3^⊤ x_3 + ... + E[ inf_{x_T∈X_T(x_{T−1},ξ_T)} c_T^⊤ x_T | F_{T−1}] ... | F_2]]    (2.1)

The i.i.d. assumption and the stagewise independence assumption on a collection of sample paths are made below.

(A1) ξ̃_{[T]}^1, ξ̃_{[T]}^2, ... are i.i.d. and stagewise independent random variables.

Let E_ξ̃ denote taking the expectation over a random variable ξ̃; then problem (2.1) can be reformulated as a Dynamic Programming (DP) problem below.

min_{x_1} c_1^⊤ x_1 + E_{ξ̃_2}[Q_2(x_1, ξ̃_2)]
s.t. x_1 ∈ X_1 := {x_1 | A_1 x_1 = b_1, x_1 ≥ 0}    (2.2)

where Q_t(x_{t−1}, ξ_t) is defined as follows:

Q_t(x_{t−1}, ξ_t) = min_{x_t} c_t^⊤ x_t + E_{ξ̃_{t+1}}[Q_{t+1}(x_t, ξ̃_{t+1})]
s.t. x_t ∈ X_t(x_{t−1}, ξ_t) := {x_t | B_t x_{t−1} + A_t x_t = b_t, x_t ≥ 0}    (2.3)

We assume that Q_{T+1} ≡ 0.

Note that x_t is F_t-measurable, so E[Q_{t+1}(x_t, ξ̃_{t+1}) | F_t] means taking the conditional expectation of Q_{t+1}(x_t, ξ̃_{t+1}) given that x_t is fixed. In other words, F_t can be described as information which tells the realization of x_t. For any z_t ∈ Ξ_t, define

X_t(z_t) := X_1 for t = 1, and X_t(z_t) := ∪_{x_{t−1}∈X̄_{t−1}} X_t(x_{t−1}, z_t) for t ≥ 2,

and

X̄_t := X_1 for t = 1, and X̄_t := ∪_{z_t∈Ξ_t} X_t(z_t) for t ≥ 2.

Inspired by Lan [61], we make the following regularity assumptions on the stagewise feasible regions.

(A2) For t = 1, ..., T, X̄_t is compact and conv(X̄_t) is contained in a cube with edge of length D_t.

(A3) For t = 1, ..., T, |Q_{t+1}(x_t, z_{t+1})| < ∞ for all x_t ∈ conv(X̄_t) and z_{t+1} ∈ Ξ_{t+1}.
Furthermore, there exists M_t such that sup_{x_t∈conv(X̄_t), z_{t+1}∈Ξ_{t+1}} |Q_{t+1}(x_t, z_{t+1})| ≤ M_t.

(A4) For t = 2, ..., T, there exists L_t such that |Q_t(x, z_t) − Q_t(y, z_t)| ≤ L_t ∥x − y∥_2 for all x, y ∈ conv(X̄_{t−1}), uniformly in z_t ∈ Ξ_t.

Throughout this section, we shall let X_t denote the convex hull of X̄_t (i.e., X_t ≜ conv(X̄_t)).

2.2.2 Sample Average Approximation Model of MSLP

Let N_t denote the number of child nodes of a node in stage t − 1, and let N_{[t:t′]} = (N_t, N_{t+1}, ..., N_{t′}). In particular, N_{[2:T]} tells the nodal distribution of the SAA method. For instance, the scenario tree of the SAA method for a 4-stage MSLP with N_{[2:4]} = (2, 4, 3) is given in Figure 2.1.

Figure 2.1: Scenario tree of the SAA method for a 4-stage MSLP with N_{[2:4]} = (2, 4, 3)

Under Assumptions (A1)-(A3), the SAA model of the MSLP problem is formulated as follows:

min_{x_1} c_1^⊤ x_1 + (1/N_2) Σ_{i=1}^{N_2} Q̂_{2,N_{[3:T]}}(x_1, ξ_2^i)
s.t. x_1 ∈ X_1    (2.4)

and

Q̂_{t,N_{[t+1:T]}}(x_{t−1}, ξ_t^i) = min_{x_t} c_t^⊤ x_t + (1/N_{t+1}) Σ_{j=1}^{N_{t+1}} Q̂_{t+1,N_{[t+2:T]}}(x_t, ξ_{t+1}^j)
s.t. x_t ∈ X_t(x_{t−1}, ξ_t^i)    (2.5)

For the purpose of analysis, let x̂_t^i ∈ argmin_{x_t∈X_t(x_{t−1},ξ_t^i)} {c_t^⊤ x_t + (1/N_{t+1}) Σ_{j=1}^{N_{t+1}} Q̂_{t+1,N_{[t+2:T]}}(x_t, ξ_{t+1}^j)}. By the definition of Q_t in (2.3), we have

Q_t(x_{t−1}, ξ_t^i) ≤ c_t^⊤ x̂_t^i + E_{ξ̃_{t+1}}[Q_{t+1}(x̂_t^i, ξ̃_{t+1})].

Similarly, let x̄_t^i ∈ argmin_{x_t∈X_t(x_{t−1},ξ_t^i)} {c_t^⊤ x_t + E_{ξ̃_{t+1}}[Q_{t+1}(x_t, ξ̃_{t+1})]}. By the definition of Q̂_{t,N_{[t+1:T]}} in (2.5), we have

Q̂_{t,N_{[t+1:T]}}(x_{t−1}, ξ_t^i) ≤ c_t^⊤ x̄_t^i + (1/N_{t+1}) Σ_{j=1}^{N_{t+1}} Q̂_{t+1,N_{[t+2:T]}}(x̄_t^i, ξ_{t+1}^j).

2.2.3 Background: Rademacher Average

In this subsection, we give the reader an overview of the Rademacher average as used in the SAA method, based on Ermoliev and Norkin [34]. For a set of points (z_1, ..., z_n) = z^n in Ξ and a sequence of functions {f(·, z_i) : X → R}_{i=1}^n, define the Rademacher average R_n(f, z^n) as

R_n(f, z^n) := E_σ sup_{x∈X} |(1/n) Σ_{i=1}^n σ_i f(x, z_i)|,

where the σ_i are i.i.d.
random variables such that σ_i ∈ {±1/2} with probabilities 1/2, and E_σ denotes the mathematical expectation over σ = (σ_1, ..., σ_n). The Rademacher average of a family of functions {f(·, z) : X → R}_{z∈Ξ} is defined as

R_n(f, Ξ) := sup_{z_1∈Ξ,...,z_n∈Ξ} R_n(f, z^n).    (2.6)

Lemma 2.1. ([34]) Let the compact set X ⊂ R^d be contained in a cube with edge of length D, and let the functions {f(·, z_i) : X → R} be uniformly bounded by constants M(z_i) and Hölder continuous on X with constants γ(z_i) ∈ (0, 1], L(z_i) > 0, i.e., |f(x, z_i)| < M(z_i) and |f(x, z_i) − f(y, z_i)| ≤ L(z_i) ∥x − y∥^{γ(z_i)} for any x, y ∈ X and z_i. Then

R_n(f, z^n) ≤ (L_n D^{γ_n} d^{γ_n/2} + M_n √d γ_n^{−1/2} √(ln n)) / √n,

and for any λ ∈ (0, 1/2),

R_n(f, z^n) ≤ N_f(λ, z^n) / n^λ,    N_f(λ, z^n) = L_n D^{γ_n} d^{γ_n/2} + M_n √d / √(γ_n (1 − 2λ) e),

where L_n = L_n(z^n) = max_{1≤i≤n} L(z_i), M_n = M_n(z^n) = max_{1≤i≤n} M(z_i), and γ_n = γ_n(z^n) = min_{1≤i≤n} γ(z_i).

Proof. See [34].

2.2.4 Complexity Analysis of SAA in MSLP

Define the SAA estimation error of the t-th-stage cost-to-go function as follows:

δ_t({ξ_t^i}_{i=1}^{N_t}) := sup_{x∈X_{t−1}} |(1/N_t) Σ_{i=1}^{N_t} Q_t(x, ξ_t^i) − E_{ξ̃_t}[Q_t(x, ξ̃_t)]|    (2.7)

It is worth noting that x is one deterministic feasible solution in X_{t−1}, while x_t is a random variable defined in Section 2.2.1. As a result, E[Q_t(x, ξ̃_t) | F_{t−1}] = E_{ξ̃_t}[Q_t(x, ξ̃_t)] by the stagewise independence assumption in (A1) (i.e., x and ξ_t are independent of F_{t−1}). In this section, we focus on building a recursive relation between the current accumulated SAA estimation error and the subsequent accumulated SAA estimation error so that we can accumulate the SAA estimation errors throughout all stages. We also define the Rademacher average of the family of t-th-stage value functions as follows:

R_{N_t}(Q_t, Ξ_t) = sup_{z_1∈Ξ_t,...,z_{N_t}∈Ξ_t} R_n(Q_t, z^{N_t}),

where z^{N_t} = (z_1, z_2, ..., z_{N_t}). The proposition below shows the boundedness of R_{N_t}(Q_t, Ξ_t).

Proposition 2.2. Suppose that Assumptions (A1)-(A4) hold.
Let e = 2.718.... Then for any α ∈ (0, 1/2), the following holds:

R_{N_t}(Q_t, Ξ_t) ≤ √(n_{t−1}) (L_t D_t + M_t / √((1 − 2α)e)) / N_t^α

for t = 2, ..., T, where R_{N_t}(Q_t, Ξ_t) is defined in (2.6).

Proof. It is a direct result of Lemma 2.1.

We let

N_{Q_t}(α) := √(n_{t−1}) (L_t D_t + M_t / √((1 − 2α)e)).

Then we have

R_{N_t}(Q_t, Ξ_t) ≤ N_{Q_t}(α) / N_t^α.

Proposition 2.3. Suppose that Assumptions (A1)-(A4) hold. Then for any α ∈ (0, 1/2), we have

a. Pr(N_t^α δ_t({ξ̃_t^i}_{i=1}^{N_t}) ≥ 2 r_t N_{Q_t}(α) + s) ≤ 2 exp(−s² / (2M_t²)),

b. lim_{N_t→∞} N_t^α δ_t({ξ̃_t^i}_{i=1}^{N_t}) = 0 a.s. for any power term α ∈ [0, 1/2).

Proof. a. See Corollary 3.2 in [34]. b. See Corollary 3.3 in [34].

Here we show a recursive relation which helps us bound the accumulated SAA estimation error in a recursive way.

Lemma 2.4. The following recursive relation holds for t = 2, 3, ..., T − 1:

sup_{x∈X_{t−1}} |(1/N_t) Σ_{i=1}^{N_t} Q̂_{t,N_{[t+1:T]}}(x, ξ_t^i) − E_{ξ̃_t}[Q_t(x, ξ̃_t)]|
≤ δ_t({ξ_t^i}_{i=1}^{N_t}) + sup_{y∈X_t} |(1/N_{t+1}) Σ_{j=1}^{N_{t+1}} Q̂_{t+1,N_{[t+2:T]}}(y, ξ_{t+1}^j) − E_{ξ̃_{t+1}}[Q_{t+1}(y, ξ̃_{t+1})]|.

Proof. Note that

sup_{x∈X_{t−1}} |(1/N_t) Σ_{i=1}^{N_t} Q̂_{t,N_{[t+1:T]}}(x, ξ_t^i) − E_{ξ̃_t}[Q_t(x, ξ̃_t)]|
≤ sup_{x∈X_{t−1}} |(1/N_t) Σ_{i=1}^{N_t} Q_t(x, ξ_t^i) − E_{ξ̃_t}[Q_t(x, ξ̃_t)]|
+ sup_{x∈X_{t−1}} |(1/N_t) Σ_{i=1}^{N_t} Q̂_{t,N_{[t+1:T]}}(x, ξ_t^i) − (1/N_t) Σ_{i=1}^{N_t} Q_t(x, ξ_t^i)|,    (2.8)

from which

sup_{x∈X_{t−1}} |(1/N_t) Σ_{i=1}^{N_t} Q_t(x, ξ_t^i) − E_{ξ̃_t}[Q_t(x, ξ̃_t)]| = δ_t({ξ_t^i}_{i=1}^{N_t})
By the definition of Q t in (2.3), we have Q t (x t− 1 ,ξ i t )≤ c ⊤ t ˆ x i t +E ˜ ξ t+1 [Q t+1 (ˆ x i t , ˜ ξ t+1 )] which implies that ˆ Q t,N [t+1:T] (x t− 1 ,ξ i t )− Q t (x t− 1 ,ξ i t ) ≥ c ⊤ t ˆ x i t + 1 N t+1 N t+1 X j=1 ˆ Q t+1,N [t+2:T] (ˆ x i t ,ξ j t+1 )− [c ⊤ t ˆ x i t +E ˜ ξ t+1 [Q t+1 (ˆ x i t , ˜ ξ t+1 )]] = 1 N t+1 N t+1 X j=1 ˆ Q t+1,N [t+2:T] (ˆ x i t ,ξ j t+1 )− E ˜ ξ t+1 [Q t+1 (ˆ x i t , ˜ ξ t+1 )] (2.10) Since ˆ x i t ∈X t , we have | 1 N t+1 N t+1 X j=1 ˆ Q t+1,N [t+2:T] (ˆ x i t ,ξ j t+1 )− E ˜ ξ t+1 [Q t+1 (ˆ x i t , ˜ ξ t+1 )]| ≤ sup y∈Xt 1 N t+1 N t+1 X j=1 ˆ Q t+1,N [t+2:T] (y,ξ j t+1 )− E ˜ ξ t+1 [Q t+1 (y, ˜ ξ t+1 )] (2.11) It follows from (2.10) that 1 N t+1 N t+1 X j=1 ˆ Q t+1,N [t+2:T] (ˆ x i t ,ξ j t+1 )− E ˜ ξ t+1 [Q t+1 (ˆ x i t , ˜ ξ t+1 )] ≥− sup y∈Xt 1 N t+1 N t+1 X j=1 ˆ Q t+1,N [t+2:T] (y,ξ j t+1 )− E ˜ ξ t+1 [Q t+1 (y, ˜ ξ t+1 )] 27 Similarly, let ¯x i t ∈ argmin xt∈Xt(x t− 1 ,ξ i t ) {c ⊤ t x t +E ˜ ξ t+1 [Q t+1 (x t , ˜ ξ t+1 )]}. By the definition of ˆ Q t,N [t+1:T] in (2.5), we have ˆ Q t,N [t+1:T] (x t− 1 ,ξ i t )≤ c ⊤ t ¯x i t + 1 N t+1 N t+1 X j=1 ˆ Q t+1,N [t+2:T] (¯x i t ,ξ j t+1 ) which implies that ˆ Q t,N [t+1:T] (x t− 1 ,ξ i t )− Q t (x t− 1 ,ξ i t ) ≤ c ⊤ t ¯x i t + 1 N t+1 N t+1 X j=1 ˆ Q t+1,N [t+2:T] (¯x i t ,ξ j t+1 )− [c ⊤ t ¯x i t +E ˜ ξ t+1 [Q t+1 (¯x i t , ˜ ξ t+1 )] ≤ 1 N t+1 N t+1 X j=1 ˆ Q t+1,N [t+2:T] (¯x i t ,ξ j t+1 )− E ˜ ξ t+1 [Q t+1 (¯x i t , ˜ ξ t+1 )] ≤ sup y∈Xt 1 N t+1 N t+1 X j=1 ˆ Q t+1,N [t+2:T] (y,ξ j t+1 )− E ˜ ξ t+1 [Q t+1 (y, ˜ ξ t+1 )] (2.12) Thus, equations (2.10) and (2.12) imply that | ˆ Q t,N [t+1:T] (x t− 1 ,ξ i t )− Q t (x t− 1 ,ξ i t )| ≤ sup y∈Xt 1 N t+1 N t+1 X j=1 ˆ Q t+1,N [t+2:T] (y,ξ j t+1 )− E ˜ ξ t+1 [Q t+1 (y, ˜ ξ t+1 )] (2.13) Hence, it follows from (2.9) that 1 N t Nt X i=1 sup x∈X t− 1 ˆ Q t,N [t+1:T] (x,ξ i t )− Q t (x,ξ i t ) ≤ 1 N t Nt X i=1 sup x∈X t− 1 sup y∈Xt 1 N t+1 N t+1 X j=1 ˆ Q t+1,N [t+2:T] (y,ξ j t+1 )− E ˜ ξ t+1 [Q t+1 (y, ˜ ξ t+1 )] ≤ 1 N t Nt 
X i=1 sup y∈Xt 1 N t+1 N t+1 X j=1 ˆ Q t+1,N [t+2:T] (y,ξ j t+1 )− E ˜ ξ t+1 [Q t+1 (y, ˜ ξ t+1 )] = sup y∈Xt 1 N t+1 N t+1 X j=1 ˆ Q t+1,N [t+2:T] (y,ξ j t+1 )− E ˜ ξ t+1 [Q t+1 (y, ˜ ξ t+1 )] (2.14) 28 UsingtherecursiverelationprovidedbyLemma2.4,wecanboundtheaccumulatedSAA estimation error starting from the first stage. Lemma 2.5. The following holds sup x 1 ∈X 1 1 N 2 N 2 X i=1 ˆ Q 2,N [3:T] (x 1 ,ξ i 2 )− E ˜ ξ 2 [Q 2 (x 1 , ˜ ξ 2 )] ≤ T X t=2 δ t ({ξ i t } Nt i=1 ). Proof. By Lemma 2.4, we have sup x 1 ∈X 1 1 N 2 N 2 X i=1 ˆ Q 2,N [3:T] (x 1 ,ξ i 2 )− E ˜ ξ 2 [Q 2 (x 1 , ˜ ξ 2 )] ≤ δ 2 ({ξ i 2 } N 2 i=1 )+ sup y∈X 2 1 N 3 N 3 X i=1 ˆ Q 3,N [4:T] (y,ξ i 3 )− E ˜ ξ 3 [Q 3 (y, ˜ ξ 3 )] ≤ δ 2 ({ξ i 2 } N 2 i=1 )+δ 3 ({ξ i 3 } N 3 i=1 )+ sup y∈X 3 1 N 4 N 4 X i=1 ˆ Q 4,N [5:T] (y,ξ i 4 )− E ˜ ξ 4 [Q 4 (y, ˜ ξ 4 )] ... ≤ T X t=2 δ t ({ξ i t } Nt i=1 ) For A,B⊆ℜ n , the distance between A and B is defined to be ∆( A,B)=sup a∈A inf b∈B ∥a− b∥ Denote θ ∗ =inf{c ⊤ 1 x+E ˜ ξ 2 [Q 2 (x, ˜ ξ 2 )]|x∈X 1 } ˆ θ ∗ N [2:T] =inf{c ⊤ 1 x+ 1 N 1 N 1 X i=1 ˆ Q 2,N [3:T] (x, ˜ ξ i 2 )|x∈X 1 } X ∗ ϵ ={x∈X 1 |c ⊤ 1 x+E ˜ ξ 2 [Q 2 (x, ˜ ξ 2 )]≤ θ ∗ +ϵ } 29 X ∗ ϵN [2:T] ={x∈X|c ⊤ 1 x+ 1 N 2 N 2 X i=1 ˆ Q 2,N [3:T] (x, ˜ ξ i 2 )≤ ˆ θ ∗ N [2:T] +ϵ } Proposition 2.6. Let D X 1 denote the diameter of X 1 . The following inequality holds. ∆( X ∗ ϵN [2:T] ,X ∗ ϵ )≤ 2D X 1 ϵ sup x 1 ∈X 1 1 N 2 N 2 X i=1 ˆ Q 2,N [3:T] (x 1 , ˜ ξ i 2 )− E ˜ ξ 2 [Q 2 (x 1 , ˜ ξ 2 )] Furthermore, by Lemma 2.5, we have ∆( X ∗ ϵN [2:T] ,X ∗ ϵ )≤ 2D X 1 ϵ T X t=2 δ t ({ ˜ ξ i t } Nt i=1 ). Proof. letζ N (x 1 )= 1 N 2 P N 2 i=1 ˆ Q 2,N [3:T] (x 1 , ˜ ξ i 2 )− E ˜ ξ 2 [Q 2 (x 1 , ˜ ξ 2 )]. Let∥ζ N ∥ C =sup{|ζ N (x 1 )||x 1 ∈ X 1 } Then one can get the first inequality by using Theorem 3.11 in [33]. AccordingtoShapiro[100],supposethatX andY aretworandomvariablesandϵ 1 +ϵ 2 = ϵ , then we have the following inequality. 
Pr{X +Y ≥ ϵ }≤ Pr{X ≥ ϵ 1 }+Pr{Y ≥ ϵ 2 } (2.15) With (2.15), we shall derive the sample complexity of SAA method for MSLP in the theorem below. Theorem 2.7. Suppose Assumptions (A1)-(A4) hold. For any α ∈ (0, 1 2 ), the following inequality holds. Pr ( T X t=2 N α t δ t ({ ˜ ξ i t } Nt i=1 )≥ (T − 1)s+ T X t=2 2r t N Qt (α ) ) ≤ T X t=2 2exp(− s 2 2M 2 t ) 30 Furthermore, let r = max 2≤ t≤ T {r t }, M = max 2≤ t≤ T {M t }, N =N 2 =...=N T and N(α )= max 2≤ t≤ T N Qt (α ), then Pr ( N α T X t=2 δ t ({ ˜ ξ i t } N i=1 )≥ (T − 1)s+2r(T − 1)N(α ) ) ≤ 2(T − 1)exp(− s 2 2M 2 ) Proof. By (2.15), we have Pr n T X t=2 N α t δ t ({ ˜ ξ i t } Nt i=1 )≥ (T − 1)s+ T X t=2 2r t N Qt (α ) o ≤ Pr n N α 2 δ 2 ({ ˜ ξ i 2 } N 2 i=1 )≥ s+2r 2 N Q2 (α ) o +Pr ( T X t=3 N α t δ t ({ ˜ ξ i t } Nt i=1 })≥ (T − 2)s+ T X t=3 2r t N Qt (α ) ) ... ≤ T X t=2 Pr n N α t δ t ({ ˜ ξ i t } Nt i=1 )≥ s+2r t N Qt (α ) o (2.16) By Proposition 2.3, we have Pr n N α t δ t ({ ˜ ξ i t } Nt i=1 )≥ s+2r t N Qt (α ) o ≤ 2exp(− s 2 2M 2 t ). It follows from (2.16) that T X t=2 Pr n N α t δ t ({ ˜ ξ i t } Nt i=1 )≥ s+2r t N Qt (α ) o ≤ T X t=2 2exp(− s 2 2M 2 t ). 31 Hence, this verifies the first inequality in Theorem 2.7. As for the second inequality, observe that Pr ( T X t=2 N α δ t ({ ˜ ξ i t } N i=1 )≥ (T − 1)s+ T X t=2 2r t N Qt (α ) ) ≤ T X t=2 2exp(− s 2 2M 2 t ) (2.17) and 2(T − 1)rN(α )= T X t=2 2rN(α )≥ T X t=2 2rN Qt (α )≥ T X t=2 2r t N Qt (α ). Hence, we have Pr n T X t=2 N α δ t ({ ˜ ξ i t } N i=1 )≥ (T − 1)s+2(T − 1)rN(α ) o ≤ Pr n T X t=2 N α δ t ({ ˜ ξ i t } N i=1 )≥ (T − 1)s+ T X t=2 2r t N Qt (α ) o (2.18) On the other hand, M t ≤ M implies that 1 2M 2 t ≥ 1 2M 2 which further implies that − 1 2M 2 t ≤ − 1 2M 2 . Hence, we have exp(− s 2 2M 2 t )≤ exp(− s 2 2M 2 ) (2.19) Thus, the combination of equations (2.17)-(2.19) implies that Pr ( N α T X t=2 δ t ({ ˜ ξ i t } N i=1 )≥ (T − 1)s+2r(T − 1)N(α ) ) ≤ 2(T − 1)exp(− s 2 2M 2 ). Corollary 2.8. Suppose that Assumptions (A1)-(A4) hold. 
Let let r =max 2≤ t≤ T {r t }, M = max 2≤ t≤ T {M t }, N =N 2 =...=N T andN(α )=max 2≤ t≤ T N Qt (α ), then the following holds. Pr ϵN α 2D X 1 ∆( X ∗ ϵN [2:T] ,X ∗ ϵ )≥ (T − 1)s+2r(T − 1)N(α ) ≤ 2(T − 1)exp(− s 2 2M 2 ) for any α ∈(0, 1 2 ). 32 Proof. By Proposition 2.6, we have ϵ 2D X 1 ∆( X ∗ ϵN [2:T] ,X ∗ ϵ )≤ T X t=2 δ t ({ ˜ ξ i t } N i=1 ). Multiply both sides of the inequality above by N α ′ ϵN α 2D X 1 ∆( X ∗ ϵN [2:T] ,X ∗ ϵ )≤ N α T X t=2 δ t ({ ˜ ξ i t } N i=1 ). which implies that Pr ( ϵN α 2D X 1 ∆( X ∗ ϵN [2:T] ,X ∗ ϵ )≥ (T − 1)s+2r(T − 1)N(α ) ) ≤ Pr ( N α T X t=2 δ t ({ ˜ ξ i t } N i=1 )≥ (T − 1)s+2r(T − 1)N(α ) ) (2.20) By Theorem 2.7, we have Pr ( N α T X t=2 δ t ({ ˜ ξ i t } N i=1 )≥ (T − 1)s+2r(T − 1)N(α ) ) ≤ 2(T − 1)exp(− s 2 2M 2 ) (2.21) Thus, the combination of (2.20) and (2.21) implies that Pr ϵN α 2D X 1 ∆( X ∗ ϵN [2:T] ,X ∗ ϵ )≥ (T − 1)s+2r(T − 1)N(α ) ≤ 2(T − 1)exp(− s 2 2M 2 ) Recall that N is the number of child nodes of any non-terminal node. As a result, in order to achieve the bound in Corollary 2.8, one needs to create a scenario tree which consists of 33 N T − 1 N− 1 =O(N T− 1 ) nodes. Suppose we want Pr n ∆( X ∗ ϵN [2:T] ,X ∗ ϵ )≥ ρ o ≤ β , where ρ,β > 0. Then 2(T − 1)exp(− s 2 2M 2 )≤ β, exp(− s 2 2M 2 )≤ β 2(T − 1) , − s 2 2M 2 ≤ log( β 2(T − 1) ) which implies s 2 ≥ 2M 2 log( 2(T− 1) β ). On the other hand, 2D X 1 [(T − 1)s+2r(T − 1)N(α )] ϵN α =ρ which implies that N α = 2D X 1 [(T − 1)s+2r(T − 1)N(α )] ϵρ and N =log( 2D X 1 [(T − 1)s+2r(T − 1)N(α )] ϵρ )/log(α ). Hence, to satisfy Pr(∆( X ∗ ϵN ,X ∗ ϵ )≥ ρ )≤ β , we need N ≥ log( 2D X 1 [(T − 1)2M 2 log( 2(T− 1) β )+2r(T − 1)N(α )] ϵρ )/log(α ). N ≥ [log(4D X 1 )+log(T − 1)+log(M 2 (log(2(T − 1))− log(β ))+2rN(α )) − log(ϵρ )]/log(α ) Let us investigate the impact of the complexity of decision space on the sample size. Note that N Qt (α ) := √ n t− 1 (L t D t + M t / p (1− 2α )e), where X t− 1 ⊆ ℜ n t− 1 and N(α ) = max 2≤ t≤ T {N Qt (α )}. 
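The probability bound in Theorem 2.7 and the choice of s discussed above can be checked numerically: the right-hand side 2(T − 1)exp(−s²/(2M²)) drops to a target level β exactly when s = M√(2 log(2(T − 1)/β)). The sketch below uses arbitrary illustrative constants, not values from any instance in this dissertation.

```python
# Numerical illustration of the deviation bound 2(T-1)exp(-s^2/(2M^2))
# from Theorem 2.7 and the choice of s derived above.  The constants
# T, M, beta are arbitrary illustrative values.
import math

def tail_bound(s, T, M):
    """Right-hand side of the deviation inequality."""
    return 2 * (T - 1) * math.exp(-s**2 / (2 * M**2))

def s_for_target(beta, T, M):
    """Smallest s with tail_bound(s, T, M) <= beta, from
    s^2 >= 2 M^2 log(2(T-1)/beta)."""
    return M * math.sqrt(2 * math.log(2 * (T - 1) / beta))

T, M, beta = 5, 10.0, 0.01
s = s_for_target(beta, T, M)
bound = tail_bound(s, T, M)   # meets the target beta exactly
```

Plugging s back in gives tail_bound(s) = 2(T − 1) · β/(2(T − 1)) = β, and any larger s drives the bound strictly below β.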
Generally speaking, N will increase as n_t increases (for all t = 1, 2, ..., T − 1). N will increase as ϵ or ρ decreases. N will increase as the number of stages, T, increases. N will increase as the diameter D_{X_1} increases. On the other hand, the required N is of order O(log(T) + log(log(T))) if T dominates all the other quantities. As for the impact of the complexity of the probability space on the required sample size, note that r = max_{2≤t≤T}{r_t}, where r_t is the dimension of the t-th-stage random variable (i.e., Ξ_t ⊆ ℜ^{r_t}). Generally speaking, N will increase as r_t (for all t = 2, ..., T) increases.

2.3 Decomposition-Based Methods for MSLP

It is well known that the sample complexity of MSLP increases exponentially as the number of stages increases [100], which makes the problem increasingly intractable. One popular way to mitigate the curse of dimensionality in MSLP is to use decomposition-based methods. Decomposition-based methods can be traced back to Benders Decomposition [9] and the L-Shaped method [108]. Birge [14] extends the L-Shaped method and proposes the Nested Decomposition for Stochastic Programming Algorithm (NDSPA) to solve MSLP. Gassmann [40] further develops an efficient implementation of the Nested Benders Decomposition method. A parallel version of the Nested Benders Decomposition method is proposed by Dempster and Thompson [23]. Pereira and Pinto [86] propose Stochastic Dual Dynamic Programming (SDDP), which randomly traverses the scenario tree built upon the sampling-based problem to approximate the large-scale MSLP objective function from below and obtains the solution path by iteratively solving the approximated problem. Linowsky and Philpott [65] provide a proof of asymptotic convergence of SDDP, and Philpott and Guan [88] further refine its theoretical framework. On the other hand, Higle and Sen [46] propose Stochastic Decomposition (SD) to solve the two-stage stochastic linear program.
Sen and Zhou [99] extend SD to the multistage case and propose the Multistage Stochastic Decomposition (MSD) algorithm to solve MSLP with more general structure (e.g., stagewise dependence). Gangammanavar and Sen [39] redesign MSD and propose Stochastic Dynamic Linear Programming (SDLP) to solve MSLP with stagewise independence in an efficient way.

Although the algorithms mentioned above share some similarities (e.g., outer approximation, a forward pass for computing candidate solutions, and a backward pass for updating the function approximation), they have very different features. An algorithm for SP usually consists of statistical estimation and problem approximation. In SDDP and its variants (e.g., Regularized SDDP), a statistical estimate of the problem (i.e., a scenario tree) is set up first. In the next step, the algorithm randomly traverses the sample paths of the fixed scenario tree to iteratively approximate the pre-built SP problem. Stochastic Decomposition (SD) and its extensions (e.g., Multistage Stochastic Decomposition and Stochastic Dynamic Linear Programming) keep refining the statistical estimate by sampling scenarios from an oracle and updating the function approximation for the latest statistical estimate while the algorithm progresses. In other words, the scenario tree in SD-type algorithms grows as the iteration count increases.

Two criteria distinguish stochastic decomposition-based methods. The first criterion is whether the algorithm uses the true probability distribution when constructing the lower bounding functions. In particular, SDDP and its variants use the true probability distribution to build the cuts of the approximated functions, while SD and its extensions use the sample frequency to estimate the true probability distribution and then use the estimated probability distribution to construct the approximations.
In other words, SDDP and its variants solve a problem in which the probability distribution is known, while SD and its extensions are able to solve problems with an unknown probability distribution. The second criterion is whether the underlying scenario tree of the MSLP is fixed during the solution process. In detail, SDDP and its variants solve MSLP by randomly traversing paths of a fixed scenario tree. In the backward pass, a new cut is computed in each stage, and no old cuts need to be updated. SD and its extensions sample new data from an oracle and grow the scenario tree by adding the new sample. The algorithm then traverses the path based on the new data in the grown scenario tree. As a result, all the old cuts need to be updated to adapt to the new scenario tree. The comparisons are summarized in Table 2.1. Motivated by the slow convergence rate of SDDP, Asamov and Powell [3] are the first to add a regularizing term to SDDP. Guigues et al. [43] further generalize the procedure of [3].
Algorithm | Probability Distribution | Scenario Tree | Update Old Cuts?
SDDP and its variants | Known | Fixed | No
SD and its extensions | Unknown | Growing | Yes
Table 2.1: Comparisons among Decomposition-Based Algorithms
Although SD and its extensions also have a regularizing term in the master problem, the purpose of adding the regularizing term, and its hidden impact, are different. The regularizing term in the former stabilizes the sequence of function approximations, while the latter uses the regularizing term to stabilize the sequence of incumbent solutions. Another key difference is the requirement on the coefficient of the regularizing term. Asamov and Powell [3] and Guigues et al. [43] require that the sequence of coefficients converge to 0, while Higle and Sen [47] and Gangammanavar and Sen [39] need the coefficient to be no less than 1. 
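To make the first criterion concrete, the sketch below contrasts a cut aggregated with a known distribution (SDDP-style) against one aggregated with empirical sample frequencies (SD-style). All numbers here (per-scenario cut coefficients, probabilities, sampled indices) are hypothetical and not drawn from any problem in this chapter.

```python
import numpy as np

# Hypothetical per-scenario cut coefficients (alpha_i, beta_i), as would be
# obtained from stage-(t+1) dual solutions for three scenarios.
alphas = np.array([4.0, 2.0, 7.0])
betas = np.array([[1.0, 0.0],
                  [0.5, -1.0],
                  [2.0, 1.5]])

# SDDP-style aggregation: weight each scenario cut by the KNOWN probability.
p_true = np.array([0.5, 0.3, 0.2])
alpha_sddp, beta_sddp = p_true @ alphas, p_true @ betas

# SD-style aggregation: weight by EMPIRICAL frequencies of sampled scenarios.
samples = np.array([0, 0, 1, 2, 0, 1])  # indices of six observed scenarios
p_emp = np.bincount(samples, minlength=3) / len(samples)
alpha_sd, beta_sd = p_emp @ alphas, p_emp @ betas
```

As the sample size grows, the empirical weights approach the true ones and the SD-style cut approaches the SDDP-style cut; this is one way to see why SD-type methods must revisit old cuts as the empirical distribution changes, while a cut built from the known distribution remains valid as computed.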
We summarize the mathematical foundations of Regularized SD and its extension, as well as Regularized SDDP, in Figures 2.2 and 2.3. Inspired by the control of the trajectory of the incumbent-solution path in SD-type algorithms, we propose a new type of decomposition-based method for MSLP with a known probability distribution. The forward pass solves the root-stage master problem with a proximal term, and the rest of the forward pass follows the same schema as SDDP. The necessary optimality condition of the root-stage master problem with a proximal term ensures that the candidate root-stage decision reduces the value function approximation. The minorants with Lagrange multipliers equal to 0 are removed from the value function approximation. Similar to SDDP, the dual of the master problem with the current value function approximation is solved to obtain the cut coefficients for updating the function approximation. At the end of each iteration, incumbent selection is performed to determine whether to update the incumbent solution. The contributions of this chapter are twofold: (1) the new algorithm can be regarded as an intermediate method between SDDP and SD; (2) we study the building blocks of various decomposition-based methods in one unified framework. The rest of the chapter is organized as follows. In Section 2.3.2, we formulate an MSLP problem with state and control variables that generalizes various versions of MSLP problems. The classic SDDP for solving this class of MSLP problems is introduced in Section 2.3.3. We then review various versions of Regularized SDDP in Section 2.3.4. In Section 2.3.6, we propose a new type of Regularized SDDP and show the asymptotic convergence of this algorithm. Finally, we transition from SD-SDDP and explain several unique features of SDLP. 
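The minorant-management step described above — dropping cuts whose Lagrange multipliers are zero — can be sketched as follows. As a simple stand-in for inspecting the master problem's multipliers, this illustration keeps only the cuts that are active (i.e., attain the pointwise maximum) at the current candidate; the cut data are hypothetical.

```python
import numpy as np

def prune_inactive_cuts(alphas, betas, u, tol=1e-8):
    """Keep only cuts alpha_i + <beta_i, u> that attain the max at u.

    In the algorithm, the retained set is determined by the nonzero
    Lagrange multipliers of the root-stage master problem; activity
    at the candidate point is used here as a simple proxy.
    """
    vals = alphas + betas @ u          # value of each cut at u
    keep = vals >= vals.max() - tol    # cuts attaining the pointwise max
    return alphas[keep], betas[keep]

alphas = np.array([1.0, 3.0, -2.0])
betas = np.array([[2.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
u = np.array([1.0, 2.0])
kept_alphas, kept_betas = prune_inactive_cuts(alphas, betas, u)
# cut values at u are [3.0, 5.0, 1.0], so only the second cut is retained
```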
Figure 2.2: Building Blocks of the Regularized SD and its Extension 2.3.1 Notations Throughout the section, we let ⟨·,·⟩ denote the inner product of two column vectors (i.e., ⟨x,y⟩ = x ⊤ y, where x,y ∈ R n ) as well as the product of transposed matrix and a column vector (i.e.,⟨A,π ⟩=A ⊤ π , where A∈R m× n and π ∈R m ). 2.3.2 Problem Formulation Here,weformallydefinetheMSLPproblem. Let ˜ ξ [T] :Ω →Ξ 2 × Ξ 3 × ...× Ξ T beafullsample path (i.e., ˜ ξ [T] = (ξ 1 ,...,ξ T )) defined on the probability space (Ω ,Σ Ω ,P). Let Pr(·) denote 38 Figure 2.3: Building Blocks of the Regularized SDDP the probability of the event in the parentheses. For t = 0,1,..,T, let ξ t ∈ Ξ t ⊆ R rt be the exogenousdata(i.e.,arealizationof ˜ ξ t )whichcontainscomponentsof(a t ,A t ,B t )and(b t ,C t ). For t = 0,1,..,T, we let x t : Ω →R nx t denote the endogenous state variable of the system. We further let s t :=(x t ,ξ t ). Let u t :Ω →R nu t denote the control variable of the system and let U t (s t ) := {u t | D t u t ≤ b t − C t x t ,u t ≥ 0} denote its feasible region for a given s t . The state variable is updated as x t+1 =D t+1 (x t ,ξ t+1 ,u t )=(x t ,ξ t+1 ,u t )=a t+1 +A t+1 x t +B t+1 u t . LetF t be the information σ -algebra generated by (ξ 1 ,...,ξ t ) (the history of random variables up to stage t). We assume that the decision information structure in the MSLP problem is a filtration (i.e., For a collection of σ -algebras{F t } T t=1 , it satisfies F t ⊆F t+1 for t=1,..,T). For more information about the use of filtrations in MSLP, we refer the reader to [115] and [18]. We consider that the decision u = (u 1 ,...,u T ) made in the MSLP problem is nonanticipative. That is, u t is measurable with respect to the σ -algebraF t . We assume that s 0 is given and formulate the MSLP problem below. 39 min u 0 ∈U 0 ⟨c 0 ,x 0 ⟩+⟨d 0 ,u 0 ⟩+E[⟨c 1 ,x 1 ⟩+ inf u 1 ∈U 1 (s 1 ) ⟨d 1 ,u 1 ⟩+E[⟨c 2 ,x 2 ⟩+ inf u 2 ∈U 2 (s 2 ) ⟨d 2 ,u 2 ⟩+E[... 
+E[⟨c T ,x T ⟩+ inf u T ∈U T (s T− 1 ) ⟨d T ,u T ⟩|F T− 1 ]...|F 2 ]|F 1 ]] (2.22) Note that when c t = 0, a t = 0, A t = 0 and B t is an identity matrix for all t = 0,1,2,...,T, then x t+1 =u t . Since s t =(x t ,ξ t ), now problem (2.22) is reduced to min u 0 ∈U 0 ⟨d 0 ,u 0 ⟩+E[ inf u 1 ∈U 1 (u 0 ,ξ 1 ) ⟨d 1 ,u 1 ⟩+E[ inf u 2 ∈U 2 (u 1 ,ξ 2 ) ⟨d 2 ,u 2 ⟩+E[... +E[ inf u T ∈U T (u T− 1 ,ξ T ) ⟨d T ,u T ⟩|F T− 1 ]...|F 2 ]|F 1 ]] (2.23) Therefore, problem (2.23) (see [100] for details) can be regarded as a special case of problem (2.22). Another variation of the problem which is considered by Asamov and Powell [3], is that c t = 0 and A t+1 = 0. Throughout this paper, we shall focus on studying algorithms for solving (2.22). The multistage program can also be stated in the dynamic programming formulation. min u 0 ⟨c 0 ,x 0 ⟩+⟨d 0 ,u 0 ⟩+E ˜ ξ 1 [h 1 (s 1 )] s.t. D 0 u 0 ≤ b 0 − C 0 x 0 , u 0 ≥ 0 (2.24) For t=1,2,...,T, the value function in the t th stage is defined as follows. h t (s t )=⟨c t ,x t ⟩+min ut ⟨d t ,u t ⟩+E ˜ ξ t+1 [h t+1 (s t+1 )] s.t. D t u t ≤ b t (ξ t )− C t (ξ t )x t ,u t ≥ 0 (2.25) We set h T+1 ≡ 0. Throughout this chapter, we shall make the following assumption. (A1) Exogenous random variables are stagewise independent. (A2) The set Ξ t of random outcomes in each stage t=2,3,...,T is discrete and finite (A3) D t is fixed and has full row rank for all t=0,1,2,...,T. 40 Next we mathematically define the relative complete recourse assumption, which is inspired by Lan [61]. Let ξ t denote one realization of exogenous random variable in the t th stage. For ξ t ∈Ξ t , define X t (ξ t ):= {x 0 }, t=0 ∪ x t− 1 ∈ ¯ X t− 1 ∪ u t− 1 ∈Ut(x t− 1 , ˜ ξ t) {D t (x t− 1 ,ξ t ,u t− 1 )}, t≥ 1 ¯ X t := {x 0 }, t=0 ∪ ξ t∈Ξ t X t (ξ t ), t≥ 1 Let aff( X t (ξ t )) be the affine hull of X t (ξ t ). Let S t (ϵ ) = {(x ′ t ,ξ t ) | ξ t ∈ Ξ t ,x t ∈ X t ( ˜ ξ t ),y t ∈ aff( X t (ξ t )),∥y t ∥≤ ϵ,x ′ t =x t +y t } (A4) For t=1,...,T, ¯ X t is compact. 
(A5) Fort=1,...,T,thereexistsϵ t ≥ 0suchthath t (s t )<∞foralls t ∈S t (ϵ ). Furthermore, rint(U t (s t )̸=∅, for all s t ∈S t (ϵ t ). Assumption (A5) implies that the complete recourse assumption is satisfied at all non- root stages. 2.3.3 Stochastic Dual Dynamic Programming SDDPassumesthattheprobability distributionoftheexogenousrandomvariableis known, which is stated in Assumption (B1) below. (B1) For t=1,2,...,T, for any ξ t ∈Ξ t , Pr( ˜ ξ t =ξ t )>0 is known. There are several variations in the SDDP. For instance, DOASA-N ([88]) samples a finite set of N scenarios in advance and traverses each of the N scenarios in the forward pass. When N =Π T t=1 |Ξ t | (i.e., all the possible outcomes are used), they show that DOASA-N produces an optimal solution to the problem with probability one. In this section, we consider that 41 the algorithm samples one scenario per iteration. For simplicity, we let p(ξ t ) = Pr( ˜ ξ t = ξ t ). Based on Assumption (B1), MSLP problem in DP formulation is written as follows. min u 0 f 0 (s 0 ,u 0 ):=⟨c 0 ,x 0 ⟩+⟨d 0 ,u 0 ⟩+ X ξ 1 ∈Ξ 1 p(ξ 1 )h 1 (s 1 ) s.t.D 0 u 0 ≤ b 0 − C 0 x 0 , u 0 ≥ 0 (2.26) The value function in the t th stage is defined below. h t (s t )=⟨c t ,x t ⟩+min ut ⟨d t ,u t ⟩+ X ξ t+1 ∈Ξ t+1 p(ξ t+1 )h t+1 (s t+1 ) s.t. D t u t ≤ b t (ξ t )− C t (ξ t )x t , u t ≥ 0 (2.27) Forward Pass Here, we provide a detailed description of the forward pass of SDDP at the k th (k≥ 2) iter- ation. In the beginning of the k th iteration, an exogenous sample path ξ k [T] = (ξ k 1 ,ξ k 2 ,...,ξ k T ) is generated based on the known distribution. In the root stage, the value function approx- imation updated at the (k− 1) th iteration is defined below. f k 0 (s 0 ,u 0 ):=⟨c 0 ,x 0 ⟩+⟨d 0 ,u 0 ⟩+ X ξ 1 ∈Ξ 1 p(ξ 1 )h k− 1 1 (D 1 (x 0 ,ξ 1 ,u 0 ),ξ 1 ). (2.28) whereD 1 (x 0 ,ξ 1 ,u 0 )=a 1 (ξ 1 )+A 1 (ξ 1 )x 0 +B 1 (ξ 1 )u 0 and h k− 1 1 (D 1 (x 0 ,ξ 1 ,u 0 ),ξ 1 )= max 1≤ i≤ k− 1 {α i 1 (ξ 1 )+⟨β i 1 (ξ 1 ),D 1 (x 0 ,ξ 1 ,u 0 )⟩}. 
Here we state that for given x 0 and ξ 1 , h k− 1 1 (D 1 (x 0 ,ξ 1 ,u 0 ),ξ 1 ) is a polyhedral function of u 0 . To see why, by the definition of D 1 , we have α i 1 (ξ 1 )+⟨β i 1 (ξ 1 ),D 1 (x 0 ,ξ 1 ,u 0 )⟩=α i 1 (ξ 1 )+⟨β i 1 (ξ 1 ),a 1 (ξ 1 )+A 1 (ξ 1 )x 0 +B 1 (ξ 1 )u 0 ⟩ = ˇα i 1 (x 0 ,ξ 1 )+⟨ ˇ β i 1 (ξ 1 ),u 0 ⟩. (2.29) 42 where ˇα i 1 (x 0 ,ξ 1 ) = α i 1 (ξ 1 )+⟨β i 1 (ξ 1 ),a 1 (ξ 1 )+A 1 (ξ 1 )x 0 ⟩ and ˇ β i 1 (ξ 1 ) =⟨B 1 (ξ 1 ),β i 1 (ξ 1 )⟩. Based on (2.29), h k− 1 1 (D 1 (x 0 ,ξ 1 ,u 0 ),ξ 1 ) can be rewritten as h k− 1 1 (D 1 (x 0 ,ξ 1 ,u 0 ),ξ 1 )= ˇh k− 1 1 (u 0 ,x 0 ,ξ 1 )= max 1≤ i≤ k− 1 {ˇα i 1 (x 0 ,ξ 1 )+⟨ ˇ β i 1 (ξ 1 ),u 0 ⟩}. (2.30) Thus, equation (2.30) verifies the assertion. The u k 0 obtained by solving the root-stage problem is used to set x k 1 =D 1 (x 0 ,ξ k 1 ,u k 0 ) and s k 1 =(x k 1 ,ξ k 1 ). min u 0 f k 0 (s 0 ,u 0 ):=⟨c 0 ,x 0 ⟩+⟨d 0 ,u 0 ⟩+ X ξ 1 ∈Ξ 1 p(ξ 1 ) ˇh k− 1 1 (u 0 ,x 0 ,ξ 1 ) s.t. D 0 u 0 ≤ b 0 − C 0 x 0 , u 0 ≥ 0. (2.31) Let Θ 1 = {θ ξ 1 } ξ 1 ∈Ξ 1 , then problem (2.31) can also be converted into the linear program below. min u 0 ,Θ 1 ⟨c 0 ,x 0 ⟩+⟨d 0 ,u 0 ⟩+ X ˜ ξ 1 ∈Ξ 1 p( ˜ ξ 1 )θ ˜ ξ 1 s.t. D 0 u 0 ≤ b 0 − C 0 x 0 , u 0 ≥ 0 θ ξ 1 ≥ ˇα i 1 (x 0 ,ξ 1 )+⟨ ˇ β i 1 (ξ 1 ),u 0 ⟩ ∀ ξ 1 ∈Ξ 1 , i=1,2,..,k− 1. (2.32) For t = 1,2,...,T − 1, given x k t calculated in the previous stage, we solve the following t th stage problem to get u k t and then get x k t+1 =D t+1 (x k t ,ξ k t+1 ,u k t ) and s k t+1 =(x k t+1 ,ξ t+1 ). min ut f k t (s k t ,u t ):=⟨c t ,x k t ⟩+⟨d t ,u t ⟩+ X ˜ ξ t+1 ∈Ξ t+1 p( ˜ ξ t+1 )h k− 1 t+1 (D t+1 (x k t ,ξ t+1 ,u t ),ξ t+1 ) s.t. D t u t ≤ b t − C t x k t , u t ≥ 0. (2.33) Let ˇα i t+1 (x t ,ξ t+1 )=α i t (ξ t+1 )+⟨β i t+1 (ξ t+1 ),a t+1 (ξ t+1 )+A t+1 (ξ t+1 )x t ⟩ and ˇ β i t+1 (ξ t+1 )=⟨B t+1 (ξ t+1 ),β i t+1 (ξ t+1 )⟩, then h k− 1 t+1 can be rewritten as h k− 1 t+1 (D t+1 (x k t ,ξ t+1 ,u t ),ξ t+1 )= ˇh k− 1 t+1 (u t ,x k t ,ξ t+1 )= ˇα i t+1 (x k t ,ξ t+1 )+⟨ ˇ β i t+1 (ξ t+1 ),u t ⟩. 
43 Again, this implies that h k− 1 t+1 (D t+1 (x k t ,ξ t+1 ,u t ),ξ t+1 ) is a polyhedral function of u t for given x k t and ξ t+1 . Let Θ t+1 ={θ ξ t+1 } ξ t+1 ∈Ξ t+1 , problem (2.33) can be converted into the following LP. min ut,Θ t+1 ⟨c t ,x k t ⟩+⟨d t ,u t ⟩+ X ˜ ξ t+1 ∈Ξ t+1 p(ξ t+1 )θ ˜ ξ t+1 s.t. D t u t ≤ b t − C t x t ,u t ≥ 0 θ ξ t+1 ≥ ˇα i t+1 (x k t ,ξ t+1 )+⟨ ˇ β i t+1 (ξ t+1 ),u t ⟩, ∀ ξ t+1 ∈Ξ t+1 , i=1,2,..,k− 1. (2.34) In the terminal stage, given x k T , we solve the following problem to get u k T . min u T ⟨c T ,x k T ⟩+⟨d T ,u T ⟩ D T u T ≤ b T − C T x k T ,u T ≥ 0. Backward Pass In the terminal stage, for each ξ T ∈Ξ T , we set x T (ξ T )=a T (ξ T )+A T (ξ T )x k T− 1 +B t (ξ T )u k T− 1 and get the optimal solution of the T th stage dual problem below. π k T (ξ T )∈argmax⟨π T ,b T (ξ T )− C T (ξ T )x T (ξ T )⟩ s.t.⟨D T ,π T ⟩≤ d T , π T ≤ 0. (2.35) Note that (2.35) is the dual of the minimization part of the following problem. ˆ H k T (s T )=⟨c T ,x T ⟩+min⟨d T ,u T ⟩ s.t. A T u T ≤ b T (ξ T )− C T (ξ T )x T (ξ T ), u T ≥ 0. (2.36) Then we compute the cut coefficients as follows. α k T (ξ T )=⟨π k T (ξ T ),b T (ξ T )⟩, β k T (ξ T )=c T −⟨ C T (ξ T ),π k T (ξ T )⟩ (2.37) 44 The minorant is updated as h k T (s T (ξ T ))=max{h k− 1 T (s T (ξ T )),α k T (ξ T )+⟨β k T (ξ T ),x T ⟩}. We repeat the process of updating the minorant for each ξ T ∈ Ξ T . After that, the value function approximation is updated as f k T− 1 (s T− 1 ,u T− 1 ) = ⟨c T− 1 ,x T− 1 ⟩ +⟨d T− 1 ,u T− 1 ⟩ + P ξ T ∈Ξ T p(ξ T )h k T (s T (ξ T )). In the non-terminal stage (i.e., t = T − 1,T − 2,...,1), for each ξ t ∈ Ξ t , we set x t (ξ t ) = a t (ξ t )+A t (ξ t )x k t− 1 +B t (ξ t )u k t− 1 and we consider the following linear program ˆ H k t (x t (ξ t ),ξ t )=⟨c t ,x t (ξ t )⟩+min⟨d t ,u t ⟩+ X ξ t+1 ∈Ξ t+1 p(ξ t+1 )θ ξ t+1 s.t. 
A t u t ≤ b t (ξ t )− C t (ξ t )x t (ξ t ), u t ≥ 0 θ ξ t+1 ≥ ˇα i t+1 (x t (ξ t ),ξ t+1 )+⟨ ˇ β i t+1 (ξ t+1 ),u t ⟩ ξ t+1 ∈Ξ t+1 ,i=1,2,...k (2.38) We aim to get the subgradient of ˆ H(·,ξ t ) at x t (ξ t ). The dual of minimization part of (2.38) is max⟨π t ,b t (ξ t )− C t (ξ t )x t (ξ t )⟩ − X ξ t+1 ∈Ξ t+1 p(ξ t+1 ) k X i=1 ρ i,ξ t+1 ˇα i t+1 (x t (ξ t ),ξ t+1 ) s.t.⟨A t ,π t ⟩+ X ξ t+1 ∈Ξ t+1 p(ξ t+1 ) k X i=1 ρ i,ξ t+1 ˇ β i t+1 (ξ t+1 )≤ d t 1+ k X i=1 ρ i,ξ t+1 =0,∀ ξ t+1 ∈Ξ t+1 π t ≤ 0, ρ i,ξ t+1 ≤ 0,∀ ξ t+1 ∈Ξ t+1 ,i=1,2,...,k. (2.39) Denote κ i t+1 (ξ t+1 )=α i t+1 (ξ t+1 )+⟨β i t+1 (ξ t+1 ),a t+1 (ξ t+1 )⟩ µ i t+1 (ξ t+1 )=⟨A t+1 (ξ t+1 ),β i t+1 (ξ t+1 )⟩. (2.40) 45 Then ˇα i t+1 (x t (ξ t ),ξ t+1 )=κ i t+1 (ξ t+1 )+⟨µ i t+1 (ξ t+1 ),x t (ξ t )⟩. We let π k t (ξ t ) and ρ k i,ξ t+1 (ξ t ) denote the optimal solution of (2.39) and compute the cut coefficients below. α k t (ξ t )=⟨π k t (ξ t ),b t ⟩− X ξ t+1 ∈Ξ t+1 p(ξ t+1 ) k X i=1 ρ i,ξ t+1 (ξ t )κ i t+1 (ξ t+1 ). (2.41) and β k t (ξ t )=c t −⟨ C t (ξ t ),π k t (ξ t )⟩− X ξ t+1 ∈Ξ t+1 p(ξ t+1 ) k X i=1 ρ i,ξ t+1 (ξ t )µ i t+1 (ξ t+1 ). (2.42) The minorant is updated as h k t (s t (ξ t ))=max{h k− 1 t (s t (ξ t )),α k t (ξ t )+⟨β k t (ξ t ),x t (ξ t )⟩}. We repeat the process of updating the minorant for each ξ t ∈ Ξ t . Then we update the value function approximation as f k t− 1 (s t− 1 ,u t− 1 )=⟨c t− 1 ,x t− 1 ⟩+⟨d t− 1 ,u t− 1 ⟩+ X ξ t∈Ξ t p(ξ t )h k t (s t (ξ t )). We summarize all the steps of SDDP in Algorithm 6. 2.3.4 Literature Review: Regularized SDDP Asamov and Powell [3] consider the problem when x t+1 =D t+1 (x t ,ξ t+1 ,u t ) = a t+1 +B t+1 u t and B t+1 is deterministic. They further introduce a new variable R u t =B t+1 u t . The forward pass in their Regularized SDDP is u k t ∈arg min ut∈Ut(s k t ) {f k− 1 t (s k t ,u t )+ ρ k 2 ⟨R u t − R u,k− 1 t ,Q t (R u t − R u,k− 1 t )⟩} whereQ t ispositivesemidefinte. 
Thepurposeoftheregularizedtermistoprotectthesearch point from jumping too away from the last search point so that the algorithm can stabilize 46 Algorithm 6 SDDP Set k =1, f 0 t =−∞ , t=0,...,T − 1, and x k 0 =x 0 , k≥ 0. Initialization for k = 1,2,... do Obtain (see (2.32) for its LP formulation) Forward Pass u k 0 ∈argmin{f k− 1 0 (s 0 ,u 0 )| D 0 u 0 ≤ b 0 − C 0 x 0 ,u 0 ≥ 0} (2.43) for t = 1,...,T-1 do Sample ξ k t from Ξ t based on the known distribution. Set x k t =a t (ξ k t )+A t (ξ k t )x k t− 1 +B t (ξ k t )u k t− 1 . Obtain u k t ∈argmin{f k− 1 t (s k t ,u t )| A t u t ≤ b t (ξ k t )− C t (ξ k t )x k t ,u t ≥ 0} (2.44) end for Generate ξ k T from Ξ T based on the known distribution. Set x k T =a T (ξ k T )+A T (ξ k T )x k T− 1 +B T (ξ k T )u k T− 1 . Obtain u k T ∈argmin{⟨c T ,x k T ⟩+⟨d T ,u T ⟩| A T u T ≤ b T (ξ k T )− C T (ξ k T )x k T ,u T ≥ 0}. for ξ T ∈Ξ T do Backward Pass Set x T (ξ T )=a T (ξ T )+A T (ξ T )x k T− 1 +B t (ξ T )u k T− 1 . Use (2.35) and (2.37) to get α k T (ξ T ) and β k T (ξ T ). Then update the minorant below. h k T (s T (ξ T ))=max{h k− 1 T (s T (ξ T )),α k T (ξ T )+⟨β k T (ξ T ),x T ⟩} end for Update the value function approximation as follows. f k T− 1 (s T− 1 ,u T− 1 )=⟨c T− 1 ,x T− 1 ⟩+⟨d T− 1 ,u T− 1 ⟩+ X ξ T ∈Ξ T p(ξ T )h k T (s T (ξ T )) for t=T − 1,T − 2,...,1 do for ξ t ∈Ξ t do Set x t (ξ t )=a t (ξ t )+A t (ξ t )x k t− 1 +B t (ξ t )u k t− 1 . Use (2.39) - (2.42) to get α k t (ξ t ) and β k t (ξ t ). Then update the minorant below. h k t (s t (ξ t ))=max{h k− 1 t (s t (ξ t )),α k t (ξ t )+⟨β k t (ξ t ),x t ⟩} end for Update value function approximation as follows. f k t− 1 (s t− 1 ,u t− 1 )=⟨c t− 1 ,x t− 1 ⟩+⟨d t− 1 ,u t− 1 ⟩+ X ξ t∈Ξ t p(ξ t )h k t (s t (ξ t )) end for end for 47 the function approximation. 
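The stabilizing effect of such a regularizing term can be seen in a one-dimensional toy problem. The sketch below assumes f(u) = |u − 3| (a stand-in polyhedral objective, not from this chapter) and applies the exact proximal step, which here has a closed form: the iterate moves toward the minimizer by at most 1/ρ per step, so a large ρ yields small, stable steps, while letting ρ → 0 recovers an unregularized jump.

```python
import numpy as np

def prox_step(u_bar, rho, target=3.0):
    """argmin_u |u - target| + (rho/2) * (u - u_bar)**2.

    Closed form for this piecewise-linear f: move from u_bar toward
    target, but by no more than 1/rho (soft-thresholding).
    """
    return u_bar + float(np.clip(target - u_bar, -1.0 / rho, 1.0 / rho))

u = 0.0
for _ in range(5):
    u = prox_step(u, rho=2.0)  # each step moves at most 1/rho = 0.5
# after five steps u = 2.5; a single step with rho = 0.01 jumps straight to 3.0
```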
Guigues et al. [43] let x_{t+1} = D_{t+1}(x_t, ξ_{t+1}, u_t) = u_t, and hence the feasible region becomes U_t(s_t) = {u_t ∈ U_t | g_t(x_t, u_t, ξ_t) ≤ 0, A_t u_t + B_t(ξ_t) x_t = b_t(ξ_t)}, where U_t is a nonempty compact convex set and each component of g_t(·,·,ξ_t) is convex. Their forward pass is u_t^k ∈ argmin_{u_t ∈ U_t(s_t)} { f_t^{k−1}(s_t^k, u_t) + (ρ_k/2) ∥u_t − u_t^{p,k}∥² }, where u_t^{p,k} is any point in U_t. Under mild assumptions and with lim_{k→∞} ρ_k = 0, they show that the algorithm converges for any sequence {u_t^{p,k}}_{k≥1}. Similar to the proof of optimality of SDDP, the key step in proving the optimality of Regularized SDDP is to show that, after finitely many iterations, the algorithm finds the necessary combination of pieces that cut the original value function. Both methods require that lim_{k→∞} ρ_k = 0, so that the effect of the regularizing term diminishes as the function approximation stabilizes, and thus the optimality of the sequence generated in the forward pass is guaranteed asymptotically. SD-type algorithms go one step further and show that the directional derivative around any accumulation point is no larger than the directional derivative of the original function at the same accumulation point. Finally, the optimality of the SD algorithm is proved by showing that the directional derivative of the true objective function at some accumulation point is non-negative. 2.3.5 SDDP with Proximal Mapping Compared to the SDDP in Algorithm 6, the only difference is that the root-stage problem in SDDP-Prox contains an extra proximal term, (1/2α_k) ∥u_0 − u_0^{k−1}∥²: u_0^k = argmin { f_0^{k−1}(s_0, u_0) + (1/2α_k) ∥u_0 − u_0^{k−1}∥² | D_0 u_0 ≤ b_0 − C_0 x_0, u_0 ≥ 0 }. Lemma 2.15 shows that h_t^k has only finitely many unique cuts. Hence, there exists k′ < ∞ such that h_t^k(s_t) = h_t^{k′}(s_t) for all k ≥ k′ and t = 1, 2, ..., T. 
Denote ¯h 1 =h k ′ 1 (2.45a) ¯ f 0 (s 0 ,u 0 )=⟨c 0 ,x 0 ⟩+⟨d 0 ,u 0 ⟩+ X ξ 1 ∈Ξ 1 p(ξ 1 ) ¯h(D(x 0 ,ξ 1 ,u 0 ),ξ 1 ) (2.45b) and min u 0 ∈U 0 (s 0 ) ¯ f 0 (s 0 ,u 0 ) (2.46) It is worth noting ¯h 1 is polyhedral convex on U 0 (s 0 ). Since a finite sum of polyhedral function is still polyheral, ¯ f 0 (s 0 ,·) is polyhedral convex onU 0 (s 0 ). We shall use the following proposition from [94] to show that SDDP-Prox produces a solution to (2.46) in a finite number of iterations. As a matter of fact, for large enough k, updating u 0 in SDDP-Prox is reduced to a proximal point method with objective function, ˜ f 0 (s 0 ,u 0 ). 49 Algorithm 7 SDDP-Prox Set k = 1, u 0 0 ∈ U 0 (s 0 ), f 0 t = −∞ , t = 0,...,T − 1, and x k 0 = x 0 and α k > 0, for all k ≥ 0. Initialization for k = 1,2,... do Obtain (see (2.32) for its LP formulation) Forward Pass u k 0 ∈argmin{f k− 1 0 (s 0 ,u 0 )+ 1 2α k ∥u 0 − u k− 1 0 ∥ 2 | D 0 u 0 ≤ b 0 − C 0 x 0 ,u 0 ≥ 0} for t = 1,...,T-1 do Sample ξ k t from Ξ t based on the known distribution. Set x k t =a t (ξ k t )+A t (ξ k t )x k t− 1 +B t (ξ k t )u k t− 1 . Obtain u k t ∈argmin{f k− 1 t (s k t ,u t )| A t u t ≤ b t (ξ k t )− C t (ξ k t )x k t ,u t ≥ 0}. end for Generate ξ k T from Ξ T based on the known distribution. Set x k T =a T (ξ k T )+A T (ξ k T )x k T− 1 +B T (ξ k T )u k T− 1 . Obtain u k T ∈argmin{⟨c T ,x k T ⟩+⟨d T ,u T ⟩| A T u T ≤ b T (ξ k T )− C T (ξ k T )x k T ,u T ≥ 0}. for ξ T ∈Ξ T do Backward Pass Set x T (ξ T )=a T (ξ T )+A T (ξ T )x k T− 1 +B t (ξ T )u k T− 1 . Use (2.35) and (2.37) to get α k T (ξ T ) and β k T (ξ T ). Then update the minorant below. h k T (s T (ξ T ))=max{h k− 1 T (s T (ξ T )),α k T (ξ T )+⟨β k T (ξ T ),x T ⟩} end for Update the value function approximation as follows. f k T− 1 (s T− 1 ,u T− 1 )=⟨c T− 1 ,x T− 1 ⟩+⟨d T− 1 ,u T− 1 ⟩+ X ξ T ∈Ξ T p(ξ T )h k T (s T (ξ T )) for t=T − 1,T − 2,...,1 do for ξ t ∈Ξ t do Set x t (ξ t )=a t (ξ t )+A t (ξ t )x k t− 1 +B t (ξ t )u k t− 1 . 
Use (2.39) - (2.42) to get α k t (ξ t ) and β k t (ξ t ). Then update the minorant below. h k t (s t (ξ t ))=max{h k− 1 t (s t (ξ t )),α k t (ξ t )+⟨β k t (ξ t ),x t ⟩} end for Update value function approximation as follows. f k t− 1 (s t− 1 ,u t− 1 )=⟨c t− 1 ,x t− 1 ⟩+⟨d t− 1 ,u t− 1 ⟩+ X ξ t∈Ξ t p(ξ t )h k t (s t (ξ t )) end for end for 50 Proposition 2.9. Let f : H 7→ (−∞ ,+∞] be a polyhedral convex function and H is finite- dimensional. Suppose there exists z ∈ H such that f(z) < +∞. Let {α k } be bounded away from 0. Let z 0 ∈H and P k (z k )=argmin{f(z)+ 1 2α k ∥z− x k ∥}. If f attains its minimum at a unique point ¯z, then 0∈ int ∂f(¯z). Thus, the proximal point algorithm in its exact form (i.e., z k+1 =P k (z k )) gives convergence to ¯z in a finite number of iterations from any starting point z 0 . However, even if f does not attain its minimum at a unique point but merely is bounded below, the proximal point algorithm in its exact form will converge to some minimizer of f in a finite number of iterations. Proof. See Proposition 8 in [94]. Proposition 2.10. With probability one,{u k 0 } k∈N generated by SDDP-Prox converges to an optimal solution to (2.46) in a finite number of iterations. Proof. Accordingly, there exists k ′ such that f k 0 = ¯ f 0 for all k≥ k ′ . Then for all k≥ k ′ , u k+1 0 =arg min u 0 ∈U 0 (s 0 ) {f k 0 (s 0 ,u 0 )+ 1 2α k ∥u 0 − u k 0 ∥ 2 }=arg min u 0 ∈U 0 (s 0 ) { ¯ f 0 (s 0 ,u 0 )+ 1 2α k ∥u 0 − u k 0 ∥ 2 Since ¯ f 0 ispolyhedralconvex,wecanapplyProposition2.9toconcludethe{u k 0 } k∈N converges to an optimal solution of (2.46) in a finite number of iterations. To show the optimality of SDDP-Prox, we can mimic the way from Philpott and Guan [88] to prove by contradiction. Theorem 2.11. With probability one, {u k 0 } k∈N generated by SDDP-Prox converges to an optimal solution to the true problem in (2.26) in a finite number of iterations. Proof. 
By Proposition 2.10, with probability one, {u k 0 } k∈N converge to some ¯u 0 after finite numberofiterations. Hence,thereexists ¯k suchthatu k 0 = ¯u 0 andalsoh k t doesnotchangeaf- ter ¯k for t=1,2,...,T. Let (¯u 0 ,¯u 1 ,...,¯u T ) and corresponding (¯x 0 ,¯x 1 ,...,¯x T ) be generated by SDDP-Prox in iteration k and k > ¯k (i.e., (¯u 0 ,¯u 1 ,...,¯u T ) and (¯x 0 ,¯x 1 ,...,¯x T ) will be two 51 accumulation points since no function approximations will change). Similar to Lemma 3 in [88], we prove by contradiction and suppose that for some ξ T− 1 , we have h k− 1 T− 1 (¯x T− 1 ,ξ T− 1 )<h T− 1 (¯x T− 1 ,ξ T− 1 ). (2.47) h k− 1 T− 1 (¯x T− 1 ,ξ T− 1 )=⟨c T− 1 ,¯x T− 1 ⟩+ min u T− 1 ,Θ T ⟨d T− 1 ,u T− 1 ⟩+ X ξ T ∈Ξ T p(ξ T )θ ξ T s.t. A T− 1 u T− 1 ≤ b T− 1 (ξ T− 1 )− C T− 1 (ξ T− 1 )¯x T− 1 , u T− 1 ≥ 0 θ ξ T ≥ ˇα i T (¯x T− 1 ,ξ T )+⟨ ˇ β i T (ξ T ),u t ⟩ ξ T ∈Ξ T ,i=1,2,...k− 1. (2.48) Let u ∗ T− 1 ,Θ ∗ T denote the optimal solution of (2.48). Equation (2.47) implies that X ξ T ∈Ξ T p(ξ T )θ ∗ ξ T < X ξ T ∈Ξ T p(ξ T )h T (D T (¯x T− 1 ,ξ T ,u ∗ T− 1 ),ξ T ). This implies that there exists some ξ T ∈Ξ T such that h k− 1 T (D T (¯x T− 1 ,ξ T ,u ∗ T− 1 ),ξ T )<h T (D T (¯x T− 1 ,ξ T ,u ∗ T− 1 ),ξ T ). (2.49) Then there exists k ′′ > k ′ such that x k ′′ T− 1 = ¯x T− 1 and ξ k ′′ T− 1 = ξ T− 1 . Then u k ′′ T− 1 = u ∗ T− 1 and equation (2.49) implies that h k ′′ − 1 T (D T (x k ′′ T− 1 ,ξ T ,u k ′′ T− 1 ),ξ T )<h T (D T (x k ′′ T− 1 ,ξ T ,u k ′′ T− 1 ),ξ T ). BythebackwardpassofSDDP-Prox,thisimpliesthatanewcutwillbeaddedto h k ′′ T ,which violatesthetheassumptionthath k T doesnotchangeafteriteration ¯k. Similarly,wecanshow h k− 1 t (¯x t ,ξ t )=h(¯x t ,ξ t ) for all t=1,2,...,T − 2 by induction. 52 2.3.6 SD-SDDP: A Bridge between SD and SDDP In this section, we aim to connect Stochastic Decomposition (SD) to Stochastic Dual Dy- namic Programming (SDDP) by designing a new algorithm to solve MSLP problems with a known probability distribution. 
Inspired by Higle and Sen [47] and Gangammanavar and Sen [39], a proximal term is introduced to the root-stage master problem. The purpose of the proximal term has two folds: (1) It forces the algorithm to search for next candidate solution in the direction of improv- ing solution quality based on the current estimated directional derivative. (2) It helps the algorithm keep a few active minorants to the current candidate root-stage solution. Forward Pass The purpose of the optimization path is to find a solution path which is better than the one simulated in the prediction pass. The optimization pass in the root stage is written below. u k 0 ∈argmin{f k− 1 0 (s 0 ,u 0 )+ σ 2 ∥u 0 − ¯u k− 1 0 ∥ 2 | D 0 u 0 ≤ b 0 − C 0 x 0 ,u 0 ≥ 0} Note that σ is the coefficient before the regularizing term. We require that σ is no less than 1. In general, the forward pass in the t th stage is u k t ∈argmin{f k− 1 t (s k t ,u t )| D t u t ≤ b t (ξ k t )− C t (ξ k t )x k t ,u t ≥ 0} Backward Pass For t = T,T − 1,...,2, the minorant updates and construction are same as the one in Section 2.3. For t=1, we aim to further compute the minorants of u k and ¯u k because x 0 is assumed to be fixed. We follow the process from (2.39) - (2.42) to get {(α k 1 (ξ 1 ),β k 1 (ξ 1 ))} ξ 1 ∈Ξ 1 53 for fixing u 1 = u k 1 and {(α − k 1 (ξ 1 ),β − k 1 (ξ 1 ))} ξ 1 ∈Ξ 1 for fixing u 1 = ¯u k 1 . By definition, x 1 (ξ 1 ) = a 1 (ξ 1 )+A 1 (ξ 1 )x 0 +B 1 (ξ 1 )u 0 . Compute ¯α k 1 = X ξ 1 ∈Ξ 1 p(ξ 1 ) α k 1 (ξ 1 )+⟨β k 1 (ξ 1 ),a 1 (ξ 1 )+A 1 (ξ 1 )x 0 ⟩ , ¯ β k 1 = X ξ 1 ∈Ξ 1 p(ξ 1 )⟨B 1 (ξ 1 ),β k 1 (ξ 1 )⟩, ¯α − k 1 = X ξ 1 ∈Ξ 1 p(ξ 1 ) α − k 1 (ξ 1 )+⟨β − k 1 (ξ 1 ),a 1 (ξ 1 )+A 1 (ξ 1 )x 0 ⟩ , ¯ β − k 1 = X ξ 1 ∈Ξ 1 p(ξ 1 )⟨B 1 (ξ 1 ),β − k 1 (ξ 1 )⟩. Let C k (u 0 ) = ¯α k 1 +⟨ ¯ β k 1 ,u 0 ⟩ and C − k (u 0 ) = ¯α − k 1 +⟨ ¯ β − k 1 ,u 0 ⟩. 
The value function in the first stage is updated as ¯ H k 1 (u 0 ) = max{C i (u 0 ) : i ∈ J k }, where J k is set the active minorants plus the newly generated two minorants. Then the root-stage objective function is updated as f k 0 (s 0 ,u 0 )=⟨c 0 ,x 0 ⟩+⟨d 0 ,u 0 ⟩+ ¯ H k 1 (u 0 ). Lemma 2.12. Suppose that σ ≥ 1. In the root stage, the following holds: f k 0 (s 0 ,u k+1 0 )− f k 0 (s 0 ,¯u k 0 )≤− σ ∥u k+1 0 − ¯u k 0 ∥ 2 ≤−∥ u k+1 0 − ¯u k 0 ∥ 2 ≤ 0. Proof. The forward pass in the root stage is equivalent to the following linear programming problem. min u 0 ,θ 1 σ 2 ∥u 0 − ¯u k 0 ∥ 2 +⟨d 0 ,u 0 ⟩+⟨c 0 ,x 0 ⟩+θ 1 s.t. θ 1 ≥ ¯α i 1 +⟨ ¯ β i 1 ,u 0 ⟩ ∀ i∈J k D 0 u 0 ≤ b 0 − C 0 x 0 , u 0 ≥ 0 (2.50) 54 The necessary condition of (2.50) is D 0 u 0 − b 0 +C 0 x 0 ≤ 0, u 0 ≥ 0 (2.51a) − θ 1 + ¯α i 1 +⟨ ¯ β i 1 ,u 0 ⟩≤ 0, ∀ i∈J k (2.51b) ⟨λ 0 ,D 0 u 0 − b 0 +C 0 x 0 ⟩=0 (2.51c) η i 1 (− θ 1 + ¯α i 1 +⟨ ¯ β i 1 ,u 0 ⟩)=0, ∀ i∈J k (2.51d) ⟨D 0 ,λ ⟩+σ (u 0 − ¯u k 0 )+d 0 + X i∈J k η i 1 ¯ β i 1 =0 (2.51e) X i∈J k η i 1 =1, η i 1 ≥ 0, λ ≥ 0. (2.51f) Let (u k+1 0 ,θ k+1 1 ) and λ 0,k ,{η i 1,k } i∈J k denote the solutions in (2.51). Equations (2.51c) and (2.51e) imply that ⟨λ 0,k ,D 0 ¯u k 0 − b 0 +C 0 x 0 ⟩− 0=⟨λ 0,k ,D 0 ¯u k 0 − b 0 +C 0 x 0 ⟩−⟨ λ 0,k ,D 0 u k+1 0 − b 0 +C 0 x 0 ⟩ =−⟨⟨ D 0 ,λ 0,k ⟩,u k+1 0 − ¯u k 0 ⟩ =σ ∥u k+1 0 − ¯u k 0 ∥ 2 +⟨d 0 ,u k+1 0 − ¯u k 0 ⟩+ X i∈J k η i 1,k ⟨ ¯ β i 1 ,u k+1 0 − ¯u k 0 ⟩ (2.52) Equations (2.51d) and (2.51f) imply that ¯ H k 1 (u k+1 0 )=θ k+1 1 = X i∈J k η i 1,k = X i∈J k η i 1,k (¯α i 1 +⟨ ¯ β i 1 ,u k+1 0 ⟩). (2.53) On the other hand, by the definition of ¯ H k 1 , we have X i∈J k η i 1,k (¯α i 1 +⟨ ¯ β i 1 ,¯u k 0 ⟩)≤ ¯ H k 1 (¯u k 0 ). (2.54) 55 Equations (2.53) and (2.54) imply that θ k+1 1 ≤ P i∈J k η i 1,k ⟨ ¯ β i 1 ,u k+1 0 − ¯u k 0 ⟩+ ¯ H k 1 (¯u k 0 ). 
Equation (2.52) further implies that X i∈J k η i 1,k ⟨ ¯ β i 1 ,u k+1 0 − ¯u k 0 ⟩+ ¯ H k 1 (¯u k 0 ) = ¯ H k 1 (¯u k 0 )− σ ∥u k+1 0 − ¯u k 0 ∥ 2 −⟨ d 0 ,u k+1 0 − ¯u k 0 ⟩−⟨ λ 0,k ,D 0 ¯u k 0 − b 0 +C 0 x 0 ⟩ ≤ ¯ H k 1 (¯u k 0 )− σ ∥u k+1 0 − ¯u k 0 ∥ 2 −⟨ d 0 ,u k+1 0 − ¯u k 0 ⟩. (2.55) The last inequality of the equation above follows because λ 0,k ≥ 0 and D 0 ¯u k 0 − b 0 +C 0 x 0 ⟩≤ 0. By the definition of f k 0 , we have f k 0 (s 0 ,u k+1 0 ) = ⟨d 0 ,u k+1 0 ⟩ +⟨c 0 ,x 0 ⟩ + ¯ H k 1 (u k+1 0 ), and f k 0 (s 0 ,¯u k 0 )=⟨d 0 ,¯u k 0 ⟩+⟨c 0 ,x 0 ⟩+ ¯ H k 1 (¯u k 0 ). Hence, it follows from (2.55) that f k 0 (s 0 ,u k+1 0 )− f k 0 (s 0 ,¯u k 0 )≤− σ ∥u k+1 0 − ¯u k 0 ∥ 2 ≤−∥ u k+1 0 − ¯u k 0 ∥ 2 ≤ 0. Now we formally state the SD-SDDP algorithm in Algorithm 8. 2.3.7 Convergence Analysis Thelogicoftheproofisasfollows. Atfirst,weshowtheuniformconvergenceoftheapproxi- mation function up to the first stage. The uniform convergence result cannot guarantee that the estimated function value converges to the original function value. It is also worth noting that the root-stage objective function approximation will not converge uniformly since only a set of active minorants is used. Based on uniform convergence results and the existence of accumulation points, we show the existence of accumulation points of the incumbents. Then we show that the estimated function value converges to the original function value at any accumulation point. Finally, we prove that SD-SDDP can produce an optimal solution to (2.26) by showing that the directional derivative of the limiting point of the root stage 56 Algorithm 8 SD-SDDP Set σ ≥ 1, q ∈ (0,1), k = 1, f 0 t = −∞ , t = 0,...,T − 1, J 0 = ∅, x k 0 = x 0 , for all k ∈ N. Initialization Let ¯u 0 0 ∈{u 0 | D 0 u 0 ≤ b 0 − C 0 x 0 ,u 0 ≥ 0}. for k = 1,2,... do Generate ξ k [T] based on the known distribution. Obtain Forward Pass u k 0 =argmin{f k− 1 0 (s 0 ,u 0 )+ σ 2 ∥u 0 − ¯u k− 1 0 ∥ 2 | D 0 u 0 ≤ b 0 − C 0 x 0 ,u 0 ≥ 0}. 
(2.56) Update the index setsJ k =J k− 1 ∪{− k,k}\{i∈J k− 1 | η i 1,k =0}. (η i 1,k is in (2.51)) for t = 1,...,T-1 do Set x k t =a t (ξ k t )+A t (ξ k t )x k t− 1 +B t (ξ k t )u k t− 1 and obtain u k t ∈argmin{f k− 1 t (s k t ,u t )| D t u t ≤ b t (ξ k t )− C t (ξ k t )x k t ,u t ≥ 0} (2.57) end for Set x k T =a T (ξ k T )+A T (ξ k T )x k T− 1 +B T (ξ k T )u k T− 1 and obtain u k T ∈argmin{⟨c T ,x k T ⟩+⟨d T ,u T ⟩| D T u T ≤ b T (ξ k T )− C T (ξ k T )x k T ,u T ≥ 0} for ξ T ∈Ξ T do Backward Pass Set x T (ξ T )=a T (ξ T )+A T (ξ T )x k T− 1 +B t (ξ T )u k T− 1 . Use (2.35) and (2.37) to get α k T (ξ T ) and β k T (ξ T ). Then update the minorant below. h k T (s T (ξ T ))=max{h k− 1 T (s T (ξ T )),α k T (ξ T )+⟨β k T (ξ T ),x T ⟩} end for Update the objective function approximation as follows. f k T− 1 (s T− 1 ,u T− 1 )=⟨c T− 1 ,x T− 1 ⟩+⟨d T− 1 ,u T− 1 ⟩+ X ξ T ∈Ξ T p(ξ T )h k T (s T (ξ T )) for t=T − 1,T − 2,...,2 do Run CC2-SD-SDDP in Algorithm 10 to get f k t− 1 . end for Run CC1-SD-SDDP in Algorithm 9 to get C k ,C − k and update the first-stage value function as ¯ H k 1 (u 0 )=max{C i (u 0 ):i∈J k }. Set f k 0 (s 0 ,u 0 )=⟨c 0 ,x 0 ⟩+⟨d 0 ,u 0 ⟩+ ¯ H k 1 (u 0 ). if f k 0 (s 0 ,u k 0 )− f k 0 (s 0 ,¯u k− 1 0 )≤ q h f k− 1 0 (s 0 ,u k 0 )− f k− 1 0 (s 0 ,¯x k− 1 0 ) i thenIncumbent Selection Set ¯u k 0 ← u k 0 . else Set ¯u k 0 ← ¯u k− 1 0 . end if end for 57 Algorithm 9 CC1-SD-SDDP Input: u k 0 ,¯u k− 1 0 ,Ξ 1 ,f k 1 . Output: C k ,C − k . for ξ 1 ∈Ξ 1 do Set x 1 (ξ 1 )=a 1 (ξ 1 )+A 1 (ξ 1 )x 0 +B t (ξ 1 )u k 0 and use (2.39) - (2.42) to get α k t (ξ t ) and β k t (ξ t ). Setx ′ 1 (ξ 1 )=a 1 (ξ 1 )+A 1 (ξ 1 )x 0 +B t (ξ 1 )¯u k− 1 0 anduse(2.39)-(2.42)togetα − k t (ξ t )andβ − k t (ξ t ). 
end for Compute ¯α k 1 = X ξ 1 ∈Ξ 1 p(ξ 1 ) α k 1 (ξ 1 )+⟨β k 1 (ξ 1 ),a 1 (ξ 1 )+A 1 (ξ 1 )x 0 ⟩ , ¯ β k 1 = X ξ 1 ∈Ξ 1 p(ξ 1 )⟨B 1 (ξ 1 ),β k 1 (ξ 1 )⟩, ¯α − k 1 = X ξ 1 ∈Ξ 1 p(ξ 1 ) α − k 1 (ξ 1 )+⟨β − k 1 (ξ 1 ),a 1 (ξ 1 )+A 1 (ξ 1 )x 0 ⟩ , ¯ β − k 1 = X ξ 1 ∈Ξ 1 p(ξ 1 )⟨B 1 (ξ 1 ),β − k 1 (ξ 1 )⟩. SetC k (u 0 )= ¯α k 1 +⟨ ¯ β k 1 ,u 0 ⟩ andC − k (u 0 )= ¯α − k 1 +⟨ ¯ β − k 1 ,u 0 ⟩. Algorithm 10 CC2-SD-SDDP Input: u k t− 1 ,x k t− 1 ,Ξ t ,f k t . Output: f k t− 1 . for ξ t ∈Ξ t do Set x t (ξ t )=a t (ξ t )+A t (ξ t )x k t− 1 +B t (ξ t )u k t− 1 and use (2.39) - (2.42) to get α k t (ξ t ) and β k t (ξ t ). Then update the minorant below. h k t (s t (ξ t ))=max{h k− 1 t (s t (ξ t )),α k t (ξ t )+⟨β k t (ξ t ),x t ⟩} end for Update objective function approximation as follows. f k t− 1 (s t− 1 ,u t− 1 )=⟨c t− 1 ,x t− 1 ⟩+⟨d t− 1 ,u t− 1 ⟩+ X ξ t∈Ξ t p(ξ t )h k t (s t (ξ t )) 58 control variable is non-negative in any feasible direction. We restate several important definitions below. h t (s t )=⟨c t ,x t ⟩+min ut ⟨d t ,u t ⟩+ X ξ t+1 ∈Ξ t+1 p(ξ t+1 )h t+1 (s t+1 ) s.t. D t u t ≤ b t (ξ t )− C t (ξ t )x t , u t ≥ 0 (2.58) h k t (s t )=h k t (D t (x t− 1 ,ξ t ,u t− 1 ),ξ t )= max 1≤ i≤ k {α i t (ξ 1 )+⟨β i t (ξ t ),D t (x t− 1 ,ξ t ,u t )⟩}. Value function approximation: f k t (s t ,u t )=⟨c t ,x t ⟩+⟨d t ,u t ⟩+ X ξ t+1 ∈Ξ t+1 p(ξ t+1 )h k t+1 (s t+1 ) ˆ H k T (s T )=⟨c T ,x T ⟩+min⟨d T ,u T ⟩ s.t. A T u T ≤ b T (ξ T )− C T (ξ T )x T (ξ T ), u T ≥ 0. ˆ H k t (s t )=⟨c t ,x t ⟩+min⟨d t ,u t ⟩+ X ξ t+1 ∈Ξ t+1 p(ξ t+1 )θ ξ t+1 s.t. A t u t ≤ b t (ξ t )− C t (ξ t )x t (ξ t ), u t ≥ 0 θ ξ t+1 ≥ ˇα i t+1 (x t (ξ t ),ξ t+1 )+⟨ ˇ β i t+1 (ξ t+1 ),u t ⟩ ξ t+1 ∈Ξ t+1 ,i=1,2,...k. (2.59) We define that ξ [t] is an exogenous sample path up to stage t. We let K(ξ [t] ) denote the set of iteration indices where the algorithm generates the sample path ξ [t] and uses it in the forward. Definition 2.13. 
For any possible sample path up to stage t, which is denoted by ξ [t] ,K(ξ [t] ) is the set of all iteration indices where the algorithm traverse ξ [t] . Since each sample path is traversed infinitely many times, K( ˜ ξ [t] ) is countably infinite. Thelemmabelowshowsthemonotonicityoftheminorantsand ˆ H k t , whichisthestarting point to show their convergence. Lemma 2.14. Suppose assumptions (A1) - (A4) and (B1) are satisfied. For t = 1,...,T, h k t ≤ ˆ H k t . For t = 1,...,T, {h k t } k∈N is monotonically increasing on S t (ϵ t ) 59 and bounded above by h t . For t=1,...,T,{ ˆ H k t } k∈N is monotonically increasing onS t (ϵ t ) and bounded above by h t . Proof. When t = T, ˆ H k T (s T ) = h T (s T ) for all k ∈ N. On the other hand, h k T (s T ) = max 1≤ i≤ k {α i T (ξ 1 )+⟨β i T (ξ T ),D t (x T− 1 ,ξ T ,u T− 1 )⟩}.Byconstruction,itisobviousthath k T (s T )≤ h k+1 T (s T ). By the construction of (α i T (ξ T ),β i T (ξ T )) and strong duality, we have α i T (ξ T )+⟨β i T (ξ T )),D t (x i T− 1 ,ξ T ,u i T− 1 )⟩= ˆ H k T (D t (x i T− 1 ,ξ T ,u i T− 1 ),ξ T ) =h T (D t (x i T− 1 ,ξ T ,u i T− 1 ),ξ T ) By weak duality and the complete recourse assumption in assumption (A4), we have α i T (ξ T )+⟨β i T (ξ T )),D t (x T− 1 ,ξ T ,u T− 1 )⟩≤ ˆ H k T (D t (x T− 1 ,ξ T ,u T− 1 ),ξ T ) ≤ h T (D t (x T− 1 ,ξ T ,u T− 1 ),ξ T )=h T (s T ). By the construction of (ˆ α i T (ξ T ), ˆ β i T (ξ T )) and strong duality, we have ˆ α i T (ξ T )+⟨ ˆ β i T (ξ T )),D t (ˆ x i− 1 T− 1 ,ξ T ,ˆ u i− 1 T− 1 )⟩=h T (D t (ˆ x i− 1 T− 1 ,ξ T ,ˆ u i− 1 T− 1 ),ξ T ) By weak duality and the complete recourse assumption in assumption (A4), we have ˆ α i T (ξ T )+⟨ ˆ β i T (ξ T )),D t (x T− 1 ,ξ T ,u T− 1 )⟩≤ h T (D t (x T− 1 ,ξ T ,u T− 1 ),ξ T )=h T (s T ). Thisshowsthatanypieceof h k T isboundedabovebyh T . Thus, wehaveh k T (s T )≤ ˆ H k T (s T )= h T (s T ). We shall prove the case in the t th stage by induction. 
Suppose that h k t+1 (s t+1 ) ≤ h t+1 (s t+1 ) and h k t+1 (s t+1 )≤ h k+1 t+1 (s t+1 ) for any k∈N. This implies that X ξ t+1 ∈Ξ t+1 p(ξ t+1 )h k t+1 (D t+1 (x t ,ξ t+1 ,u t ),ξ t+1 )≤ X ξ t+1 ∈Ξ t+1 p(ξ t+1 )h t+1 (D t+1 (x t ,ξ t+1 ,u t ),ξ t+1 ) 60 By (2.59) and (2.58), we have ˆ H k t (s t )≤ h t (s t ). On the other hand, X ξ t+1 ∈Ξ t+1 p(ξ t+1 )h k t+1 (D t+1 (x t ,ξ t+1 ,u t ),ξ t+1 )≤ X ξ t+1 ∈Ξ t+1 p(ξ t+1 )h k+1 t+1 (D t+1 (x t ,ξ t+1 ,u t ),ξ t+1 ) Hence, ˆ H k t (s t )≤ H k+1 t (s t ). By the definition of h k t , it is obvious that h k t (s t )≤ h k+1 t (s t ). On the other hand, by the construction of (α i t (ξ t ),β i t (ξ t )) and strong duality, we have α i t (ξ t )+⟨β i t (ξ t )),D t (x i t− 1 ,ξ t ,u i t− 1 )⟩=H i t (D t (x i t− 1 ,ξ t ,u i t− 1 ),ξ t )≤ h t (D t (x i t− 1 ,ξ t ,u i t− 1 ),ξ t ) By the weak duality, we have α i t (ξ t )+⟨β i t (ξ t )),D t (x t− 1 ,ξ t ,u t− 1 )⟩≤ H i t (D t (x t− 1 ,ξ t ,u t− 1 ),ξ t )≤ h t (D t (x t− 1 ,ξ t ,u t− 1 ),ξ t ) This shows that every piece of h k t is bounded above by h t . Thus, h k t (s t )≤ ˆ H k t (s t )≤ h t (s t ). [88] shows that there are only finitely many unique cuts. We state their results below. Lemma 2.15. Suppose assumptions (A1) - (A4) and (B1) are satisfied. LetG k t (ξ t ):={(α k t (ξ t ),β k t (ξ t ))} k∈N denote the sequence of cut coefficients of h k t (x t , ˜ ξ t ). Then for t=1,2,...,T, there exists m t such that|G k t (ξ t )|≤ m t for any k∈N and ξ t ∈Ξ t . Proof. See Lemma 1 in [88]. With the results of finiteness of the unique cut coefficients in Lemma 2.15, we shall proceed by deriving the uniform Lipschitz continuity of minorants,{h k t (·,ξ t )} k∈N , as well as { ˆ H k t (·,ξ t )} k∈N . Lemma 2.16. Suppose assumptions (A1) - (A4) and (B1) are satisfied. For t = 1,...,T and ξ t ∈ Ξ t , {h k t (·,ξ t )} k∈N is Lipschitz continuous on X t (ξ t ) with a common constant. 
61 For t=1,...,T and ξ t ∈Ξ t ,{ ˆ H k t (·,ξ t )} k∈N is Lipschitz continuous onX t (ξ t ) with a common constant. Proof. ByLemma2.15,ineachstage,therearefinitelymanyuniquecutcoefficientsintheset of coefficients of the approximated function, which further implies the Lipschitz continuity of{h k t (·,ξ t )} k∈N with a common constant. As for ˆ H k T , ˆ H k T (s T )=h T (s T )=⟨c T ,x T ⟩+max⟨π T ,b T (ξ T )− C T (ξ T )x T (ξ T )⟩ s.t.⟨D T ,π T ⟩≤ d T , π T ≤ 0. (2.60) Suppose there are N T dual extreme points in{π T |⟨D T ,π T ⟩≤ d T ,π T ≤ 0}. Then it follows from (2.60) that ˆ H k T (s T )=h T (s T )=⟨c T ,x T ⟩+ max 1≤ i≤ N T ⟨π i T ,b T (ξ T )− C T (ξ T )x T (ξ T )⟩ (2.61) This implies that ˆ H k T (·,ξ T ) is a piecewise linear function on X T (ξ T ). Furthermore, since ˆ H k T = h T for all k ∈ N, it is obvious that { ˆ H k T (·,ξ T )} k∈N is Lipschitz continuous on X t (ξ t ) with a common constant. In general, by Lemma 2.15, in the (t+1) th stage, there exists k(ξ t+1 ) such that |G k t+1 (ξ t+1 )|=m(ξ t+1 ), ∀ k≥ k(ξ t ). Let k t+1 =max ξ t+1 ∈Ξ t+1 {k(ξ t+1 )}. Then after k t+1 , there will not be newly unique cut added to h k t+1 . We define the dual solution in the t th stage and at k th iteration as follows V k t (ξ t )={(π t ,ρ ξ t+1 )|⟨D t ,π t ⟩+ X ξ t+1 ∈Ξ t+1 p(ξ t+1 ) k X i=1 ρ i,ξ t+1 ⟨β i t+1 (ξ t+1 ),B t+1 (ξ t+1 )⟩≤ d t , 1+ k X i=1 ρ i,ξ t+1 =0,∀ ξ t+1 ∈Ξ t+1 , π t ≤ 0, ρ i,ξ t+1 ,∀ ξ t+1 ∈Ξ t+1 , i=1,2,...,k} 62 Consider k ′ > k t+1 , for any (π ′ t ,ρ ′ ξ t+1 )∈V k ′ t (ξ t ), we can find ( π t ,ρ ξ t+1 )∈V k t+1 t (ξ t ) such that π ′ t = π t and the objective value in (2.33) are equal. Since V k t+1 t (ξ t ) only have finitely many extreme points, we assume thatthere are N t extreme pointsand let (π j t ,ρ j ξ t+1 ) denotethe j th extremepointofV k t+1 t (ξ t ). Therefore,foranyk≥ k t+1 , ˆ H k t canberepresentedasapiecewise linear function whose pieces are constructed by the dual extreme points inV k t+1 t (ξ t ). 
ˆ H k t (s t )=⟨c t ,x t (ξ t )⟩+ max 1≤ j≤ Nt {⟨π j t ,b t (ξ t )− C t (ξ t )x t (ξ t )⟩ − X ξ t+1 ∈Ξ t+1 p(ξ t+1 ) k X i=1 ρ i,ξ t+1 (κ i t+1 (ξ t+1 )+⟨µ i t+1 (ξ t+1 ),x t (ξ t )⟩)} where κ i t+1 (ξ t+1 )=α i t+1 (ξ t+1 )+⟨β i t+1 (ξ t+1 ),a t+1 (ξ t+1 )⟩, µ i t+1 (ξ t+1 )=⟨A t+1 (ξ t+1 ),β i t+1 (ξ t+1 )⟩. This implies that ˆ H k t will not change when k≥ k t+1 . Therefore, there are only finitely many unique piecewise linear functions in{ ˆ H k t (·,ξ t )} k∈N , which further implies that{ ˆ H k t (·,ξ t )} k∈N is Lipschitz continuous onX t (ξ t ) with a common constant. Lemma 2.17. For t=1,...,T and ξ t ∈Ξ t ,{h k t (·,ξ t )} k∈N is uniformly convergent onX t (ξ t ). For t=1,...,T and ξ t ∈Ξ t ,{ ˆ H k t (·,ξ t )} k∈N is uniformly convergent onX t (ξ t ). Proof. By Lemma 2.14, {h k t (·, ˜ ξ t )} k∈N is monotonically increasing on X t ( ˜ ξ t ) and bounded abovebyh t . Hence,{h k t (·, ˜ ξ t )} k∈N convergespointwisetosomefunction. Furthermore,Based on Lemma 2.16, {h k t (·,ξ t )} k∈N is Lipschitz continuous on X t (ξ t ) with a common constant. This implies that {h k t (·,ξ t )} k∈N is equicontinuous. This further implies that {h k t (·,ξ t )} k∈N converges uniformly to some continuous function. Similarly, based on Lemma 2.14, { ˆ H k t (·,ξ t )} is monotonically increasing on X t (ξ t ) and bounded above by h t . Furthermore, Lemma 2.16 implies that { ˆ H k t (·,ξ t )} is Lipschitz con- tinuous on X t (ξ t ) with a common constant. Therefore, { ˆ H k t (·,ξ t )} is uniformly convergent onX t (ξ t ). 63 Next, weshowtheminorantconvergestotheoriginalvaluefunctionatanyaccumulation point. Proposition 2.18. Suppose assumptions (A1) - (A4) and (B1) are satisfied. For t = 1,..,T − 1. 
Suppose that for any ξ [t+1] , there existsS ′ (ξ [t+1] ) such that lim k∈S ′ (ξ [t+1] ) s k t+1 = ¯s ξ t+1 and lim k∈S ′ (ξ [t+1] ) h k t+1 (s k t+1 )=h t+1 (¯s ξ t+1 ) , then for any ξ [t] and lim k∈S ′ (ξ [t] ) s k t = ¯s ξ t , it holds that lim k∈S ′ (ξ [t] ) h k t (¯s k t )=h t (¯s ξ t ) Proof. According to the forward pass, we have ˆ H k− 1 t (s k t )=f k− 1 t (s k t ,u k t )=⟨c t ,x k t ⟩+⟨d t ,u k t ⟩+ X ξ t+1 ∈Ξ t+1 p(ξ t+1 )h k− 1 t+1 (D t+1 (x k t ,ξ t+1 ,u k t ),ξ t+1 ) By assumptions and the uniform convergence of h k t+1 , lim k∈S ′ (ξ [t] ) h k− 1 t+1 (D t+1 (x k t ,ξ t+1 ,u k t ), ˜ ξ t+1 )=h t+1 (D t+1 (¯x ξ t ,ξ t+1 ,¯u xit ),ξ t+1 ). This implies that lim k∈S ′ (ξ [t] ) f k− 1 t (s k t ,u k t )=⟨c t ,¯x ξ t ⟩+⟨d t ,¯u ξ t ⟩+ X ξ t+1 ∈Ξ t+1 p(ξ t+1 )h t+1 (D t+1 (¯x ξ t ,ξ t+1 ,¯u ξ t ),ξ t+1 ) ≥ h t (¯s ξ t ) The last inequality holds because ¯u ξ t ∈X t (¯s ξ t ). On the other hand, it holds that ˆ H k− 1 t (s k t )=f k− 1 t (s k t ,u k t ). 64 Therefore, we have lim k∈S ′ (ξ [t] ) ˆ H k− 1 t (s k t )= lim k∈S ′ (ξ [t] ) f k− 1 t (s k t ,u k t )≥ h t (¯s ξ t ) (2.62) Based on the Lemma 2.14,{H k t } k∈N is bounded above by h t . It follows from (2.62) that lim k∈S ′ ( ˜ ξ [t] ) ˆ H k− 1 t (¯s k t )=h t (¯s ˜ ξ t ) (2.63) Based on the cut calculation in the backward pass, we have h k t (s k t ) = ˆ H k− 1 t (s k t ). Hence, (2.63) implies that lim k∈S ′ ( ˜ ξ [t] ) h k t (s k t )=h t (¯s ξ t ). Theorem 2.19. Suppose Assumptions (A1) - (A4) and (B1) are satisfied. With probability one, for any accumulation point, ¯u, of{u k 0 } k∈N , it holds that lim n→∞ (¯α kn 1 +⟨ ¯ β kn 1 ,u kn 0 ⟩)= X ξ 1 ∈Ξ 1 p(ξ 1 )h 1 (D 1 (x 0 ,ξ 1 ,¯u 0 ),ξ 1 ), where{u k 0 } ∞ n=1 is an infinite subsequence of {u k } k∈N and lim n→∞ u kn 0 = ¯u 0 . Proof. By the definition ¯ α k 1 and ¯ β k 1 , we have ¯α k 1 = X ξ 1 ∈Ξ 1 p(ξ 1 ) α k 1 (ξ 1 )+⟨β k 1 (ξ 1 ),a 1 (ξ 1 )+A 1 (ξ 1 )x 0 ⟩ , ¯ β k 1 = X ξ 1 ∈Ξ 1 p(ξ 1 )⟨B 1 (ξ 1 ),β k 1 (ξ 1 )⟩. 
According to the definition of h k 1 , we have ¯α k 1 +⟨ ¯ β k 1 ,u k 0 ⟩= X ξ 1 ∈Ξ 1 p(ξ 1 )h k 1 (D 1 (x 0 ,ξ 1 ,u k 0 ),ξ 1 ). (2.64) 65 For a ξ 1 ∈ Ξ 1 , let {k ′ n } n∈N denote a further subsequence of {k n } n∈N such that ξ k ′ n 1 = ξ 1 . Based on Proposition 2.18 and lim n→∞ u kn 0 = ¯u 0 , it holds that with probability one, lim n→∞ h k ′ n 1 (D 1 (x 0 ,ξ k ′ n 1 ,u k ′ n 0 ),ξ k ′ n 1 )=h 1 (D 1 (x 0 ,ξ 1 ,¯u 0 ),ξ 1 ). Hence, by the uniqueness of uniform convergence of h k 1 , we have lim n→∞ h kn 1 (D 1 (x 0 ,ξ 1 ,u kn 0 ),ξ 1 )=h 1 (D 1 (x 0 ,ξ 1 ,¯u 0 ),ξ 1 ). (2.65) Since argument in (2.65) can apply to any ξ 1 ∈Ξ 1 , we have lim n→∞ X ξ 1 ∈Ξ 1 p(ξ 1 )h kn 1 (D 1 (x 0 ,ξ 1 ,u kn 0 ),ξ 1 )= X ξ 1 ∈Ξ 1 p(ξ 1 )h 1 (D 1 (x 0 ,ξ 1 ,¯u 0 ),ξ 1 ),w.p.1. OncewehaveTheorem2.19, wecanapplytheprooftechniquefromHigleandSen[47]to show that SD-SDDP produce a subsequence that converges to an optimal solution to (2.26) with probability one. Here, we give a sketch of the proof and reader can find more details in [47]. Corollary 2.20. With probability one, for any accumulation point, ¯u, of {¯u k 0 } k∈N , it holds that lim n→∞ f kn 0 (s 0 ,¯u kn 0 )= lim n→∞ f kn+1 0 (s 0 ,¯u kn 0 )=f(s 0 ,¯u 0 ), where{¯u kn 0 } n∈N is the subsequence of{¯x k } k∈N such that lim n→∞ ˆ x kn 0 = ˆ x. Proof. Accordingly, we have ⟨c 0 ,x 0 ⟩+lim n→∞ ⟨d 0 ,u kn 0 ⟩+max{¯α kn 1 +⟨ ¯ β kn 0 ,¯u kn 0 ⟩,¯α − kn 1 +⟨ ¯ β − kn 0 ,¯u kn 0 ⟩}≤ f kn 0 (s 0 ,u kn 0 )≤ f 0 (s 0 ,u kn 0 ). 66 which implies that liminf n→∞ ⟨c 0 ,x 0 ⟩+ lim n→∞ ⟨d 0 ,u kn 0 ⟩+max{¯α kn 1 +⟨ ¯ β kn 0 ,¯u kn 0 ⟩,¯α − kn 1 +⟨ ¯ β − kn 0 ,¯u kn 0 ⟩} ≤ liminf n→∞ f kn 0 (s 0 ,¯u kn 0 )≤ limsup n→∞ f kn 0 (s 0 ,¯u kn 0 )≤ limsup n→∞ f 0 (s 0 ,¯u kn 0 ) (2.66) By the continuity of f 0 (x 0 ,·), we have lim n→∞ f 0 (s 0 ,ˆ u kn 0 )=f 0 (s 0 ,¯u 0 ), w.p.1. (2.67) Hence, equations (2.66) and (2.67) imply that lim n→∞ f kn 0 (s 0 ,¯u kn 0 ) = f 0 (s 0 ,¯u 0 ), w.p.1. 
Since f kn+1 0 contains a minorant at ¯u kn 0 that is generated in iteration k n +1, we can mimic the analysis above and get lim n→∞ f kn+1 0 (s 0 ,¯u kn 0 )=f 0 (s 0 ,¯u 0 ), w.p.1. Lemma 2.21. Suppose σ ≥ 1. Let {u k 0 } k∈N and {¯u l } u∈N be the sequence of master problem and incumbent solutions identified by the SD-SDDP. With probability one, limsup k→∞ f k 0 (s 0 ,u k+1 0 )− f k 0 (s 0 ,¯u k )=0. Proof. Let θ l = ˇ f k 0 (s 0 ,u k+1 )− f k (s 0 ,¯u k )). Let{k n } n∈N be a sequence of iterations that the incumbent changes. Then the proof is the same as the one in Theorem 3 of [47]. Lemma 2.22. Suppose assumptions (A1) - (A4) and (B1) are satisfied and σ ≥ 1. Let u 0 ∈ U 0 (s 0 ). Define δ k (u 0 ) = u 0 − u k 0 ∥u 0 − ¯u k 0 ∥ and ¯δ (u 0 ) = u 0 − ¯u 0 ∥u 0 − ¯u 0 ∥ . Then with probability one, the following holds: (a) lim n→∞ u kn+1 0 = lim n→∞ ¯u kn 0 = ¯u 0 , for some ¯u 0 ∈ U 0 (s 0 ) and {k n } n∈N is a infinite subsequence of k k∈N . (b) limsup n→∞ (f kn 0 )(s 0 ,u kn+1 0 ;δ kn+1 (u))≤ f ′ 0 (s 0 ,¯u 0 ; ¯δ (u 0 )). Proof. The proof is similar to Lemma 4 in [47]. 67 Theorem2.23. Suppose σ ≥ 1. Let{u k 0 } k∈N and{¯u k 0 } k∈N be the sequence of master problem and incumbent roo-stage solutions identified by the SD-SDDP. With probability one, there exist{u kn+1 0 } n∈N and{¯u kn 0 } n∈N suchthatbothsequencesconvergetothesameoptimalsolution to (2.26). Proof. The proof is similar to the proof of Theorem 4.18 in chapter 4 and Theorem 5 in [47]. According to Lemma 2.22, we know that there exists{k n } n∈N ⊆{ k} k∈N such that lim n→∞ f kn 0 (s 0 ,u kn+1 0 )− f kn− 1 0 (s 0 ,¯u kn )+∥u kn+1 0 − ¯u kn 0 ∥ 2 =0, and lim n→∞ u kn+1 0 = lim n→∞ ¯u kn 0 = ¯u 0 . (2.68) Let u 0 ̸= ¯u 0 and denote δ k 0 (u 0 )= u 0 − u k 0 ∥u 0 − u k 0 ∥ , ¯δ 0 (u 0 )= u 0 − ¯u 0 ∥u 0 − ¯u 0 ∥ . Since u 0 ∈U 0 , it holds that u kn+1 0 +∥u 0 − u kn+1 0 ∥δ kn+1 0 (u 0 )=u 0 ∈U 0 (s 0 ). 
(2.69) lim n→∞ u kn+1 0 implies that there exists N such that∥¯u 0 − u kn+1 0 ∥≤ 1 2 ∥¯u 0 − u 0 ∥ for all n≥ N. This further implies that ∥u kn+1 0 − u 0 ∥≥∥ ¯u 0 − u 0 ∥−∥ ¯u 0 − u kn+1 0 ∥≥ 1 2 ∥¯u 0 − u 0 ∥>0, ∀n≥ N. (2.70) Hence, equations (2.69) and (2.70) imply that one can pick 0≤ α ≤ 1 2 ∥¯u 0 − u 0 ∥ so that u kn+1 0 +αδ kn+1 0 (u 0 )∈U 0 , ∀n≥ N. 68 By the optimality regarding the root-stage proximal mapping problem, we can get f kn 0 (s 0 ,u kn+1 0 )+ σ 2 ∥u kn+1 0 − ¯u kn 0 ∥ 2 ≤ f kn 0 (s 0 ,u kn+1 0 +αδ kn+1 0 (u 0 ))+ σ 2 ∥(u kn+1 0 +αδ kn+1 0 (u 0 ))− ¯u kn 0 ∥ 2 , which implies that 0≤ f kn 0 (s 0 ,u kn+1 0 )+αδ kn+1 0 (u 0 ))− f kn 0 (s 0 ,u kn+1 0 ) + σ 2 ∥(u k 0 +αδ kn+1 0 (u 0 ))− ¯u kn 0 ∥ 2 −∥ u kn+1 0 − ¯u kn 0 ∥ 2 . (2.71) Divide both side of (2.71) by α , we have 0≤ [f kn 0 (s 0 ,u kn+1 0 +αδ kn+1 0 (u 0 ))− f kn 0 (s 0 ,u kn+1 0 )] α + σ 2 [∥(u kn+1 0 +αδ kn+1 0 (u 0 ))− ¯u kn 0 ∥ 2 −∥ u kn+1 0 − ¯u kn 0 ∥ 2 ] α . Let α ↓0 and by the definition of directional derivative, we have 0≤ (f kn 0 ) ′ (s 0 ,u kn+1 0 ;δ kn+1 0 (u 0 ))+ σ 2 δ kn+1 (u 0 ) ⊤ (u kn+1 0 − ¯u kn 0 ). (2.72) whereδ kn+1 (u 0 ) ⊤ (u kn+1 0 − ¯u kn 0 )isthedirectionalderivativeofthequadraticfunction∥u− ¯u kn 0 ∥ 2 at the point u kn+1 0 with direction δ kn+1 (u 0 ). It follows from (2.72) that 0≤ limsup n→∞ (f kn 0 ) ′ (s 0 ,u kn+1 0 ;δ kn+1 0 (u 0 )) Based on (2.68) and Lemma 2.22, we have 0≤ limsup n→∞ (f kn 0 ) ′ (s 0 ,u kn+1 0 ;δ kn+1 0 (u 0 ))≤ f ′ 0 (s 0 ,¯u 0 ; ¯δ 0 (u 0 )). (2.73) Since f 0 is convex on U(s 0 ) and ¯δ 0 (u 0 ) is arbitrary, equation (2.73) implies that ¯u 0 is the optimal solution of the original problem in (2.26). 69 2.3.8 Stochastic Dynamic Linear Programming GangammanavarandSen[39]proposeawayasStochasticDynamicLinearProgrammingto solve the MSLP problem in (2.22) with unknown probability distribution while assuming all the cost-to-go functions are bounded above by zero. 
We summarize the critical components of the SDLP algorithm in Algorithm 11.

(B2) Zero provides a lower bound for all cost-to-go functions.

There are several unique features in the algorithmic setting of SDLP. SDLP uses feasible bases to build a Basic Feasible Policy (BFP) to generate a candidate solution in the prediction pass. In the backward recursion, it constructs a Stage-wise Dual Approximation (SDA) by using only the cuts at the current solution point. This allows the dimension of the feasible region in the SDA to remain fixed. Also, the backward recursion only generates new cuts at the latest generated sample. It is worth noting that such a strategy saves computational effort while sacrificing the uniform convergence of the value function approximation. That is, the approximate value function will no longer converge to the true value function on the entire effective feasible region; instead, it converges to the true value function at the limiting points.

We now provide several deeper explanations of SDLP in terms of its mathematical building blocks. One key difference between SDLP and the Regularized SDDP is that SDLP uses sample frequencies to estimate the probability distribution and updates the estimate on the fly, whereas SDDP and Regularized SDDP can only solve the case when the distribution is known. The value function approximation in SDLP is
$$\hat{f}^k_t(s^k_t, u_t) := \langle c_t, x^k_t\rangle + \langle d_t, u_t\rangle + \sum_{\xi_{t+1}\in\Xi^k_{t+1}} \hat{p}^k(\xi_{t+1})\, h^{k-1}_{t+1}(\mathcal{D}_{t+1}(x^k_t,\xi_{t+1},u_t),\xi_{t+1}),$$
where $\Xi^k_{t+1}$ is the set of outcomes that appear in the first $k$ iterations and $\hat{p}^k(\xi_{t+1})$ is the empirical frequency of the outcome $\xi_{t+1}$. Since SDLP uses $\hat{p}^k(\xi_{t+1})$ instead of the true probability $p(\xi_{t+1})$, the sequence of corresponding value function approximations no longer enjoys monotonicity. In the backward pass, SDLP defines a sample average value function by replacing $p(\xi_{t+1})$ with $\hat{p}^k(\xi_{t+1})$ in $\hat{H}^k_t$.
˜ H k t (s t )=⟨c t ,x t ⟩+ min ut∈Ut(st) {⟨d t ,u t ⟩+ X ξ t+1 ∈Ξ k t+1 ˆ p k t h k t+1 (D t+1 (x t ,ξ t+1 ,u t ),ξ t+1 )} It then uses Subgradient Selection to get the subgradient of the minorant as follows. β k t+1 (ξ t+1 )∈∂h k t+1 (D t+1 (x k t ,ξ t+1 ,u k t ),ξ t+1 ) (2.74) Next, it uses those subgradeints to construct a lower bound of the sample average value function, which is called by Stagewise-Dual Approximation (SDA). ˜ H k t (s t )≥⟨ c t ,x t ⟩+ ¯α k t+1 +⟨ ¯ β k t+1 ,x t ⟩+max{⟨π t ,(b t − C t x t )⟩|⟨D t ,π t ⟩≤ ¯ρ k t+1 ,π t ≤ 0} where ¯ β k t+1 = X ξ t+1 ∈Ξ k t+1 ˆ p k (ξ t+1 )⟨β k t+1 (ξ t+1 ),A t+1 (ξ t+1 )⟩, (2.75) ¯ρ k t+1 =d t + X ξ t+1 ∈Ξ k t+1 ˆ p k (ξ t+1 )⟨β k t+1 (ξ t+1 ),B t+1 (ξ t+1 )⟩, (2.76) and ¯α k t+1 = X ξ t+1 ∈Ξ k t+1 ˆ p k (ξ t+1 ) α k t+1 (ξ t+1 )+⟨β k t+1 (ξ t+1 ),a t+1 (ξ t+1 )⟩ . (2.77) One benefit of using SDA is that the dimension of the dual space remains the same so that one can store all the historic dual bases for the incumbent selection in the prediction pass, 71 which is shown below. M k t (s t )=argmin{f k− 1 t (s t ,ˆ u j t )| ˆ u j t ∈ ˆ U k t (s t )} (2.78) where ˆ U k t (s t ) is the set of basic feasible solutions calculated by using the set of dual bases obtained in SDA. For more details how to maintain ˆ U k t (s t ), we refer the reader to the in- cumbent selection of [39]. Since SDLP uses Ξ k t instead of Ξ t , the scenario tree is growing. Therefore, the cut coeffi- cients computed in the previous iteration needs to be multiplied by a factor so that they can be used as the lower bound of the current value function approximation in the new scenario tree. The corresponding mathematical formulation is shown in (2.79). 2.4 Remarks Therandomnessofdecomposition-basedmethodsforMSLPusuallyhastwodifferentsources. The first source is from the random traversal of the scenario tree, while the second source is from the sampling. 
In particular, the SDDP-type algorithm only has the first source of randomness since the probability distribution is known beforehand and the scenario tree is fixed. SD type algorithm has both sources because it grows scenario trees based on the sampling and its traversal is still random. The use of the regularizing term in SDLP and our proposed algorithm (SD-SDDP) can alsobeinterpretedasaproximalmappingthatenforcescontraction,asshownin[69]. Define T X αg (x)=argmin z∈X {αg (z)+ 1 2 ∥z− x∥ 2 }, Π X (x)=argmin z∈X { 1 2 ∥z− x∥ 2 } 72 Algorithm 11 SDLP Choose a proximal parameter σ ≥ 1. For t = 1,2,...,T, set Ξ 0 t =∅, J 0 t =∅, l 0 t = 0. Set Π 0 T =∅. Initialization Let ˆ u 0 0 ∈argmin{ ˆ f 0 0 (s 0 ,u 0 )| D 0 u 0 ≤ b 0 − C 0 x 0 ,u 0 ≥ 0}. Let ˆ x k 0 =x k 0 =x 0 for all k≥ 0. for k =1,2,... do Sample ξ k [T] from an oracle and start with ˆ u k− 1 0 . for t=1,2,...,T do Prediction Pass set s k t =a t (ξ t )+A t (ξ k t )ˆ x k t− 1 +B t (ξ k t )ˆ u k t− 1 If k >1 then use (2.78) to get ˆ u k t ; else, get ˆ u 0 t ∈argmin ut∈U 0 t (s 0 t ) { ˆ f 0 0 (s 0 t ,u 0 t )}. end for Obtain u k 0 =argmin{ ˆ f k− 1 0 (s 0 ,u 0 )+ σ 2 ∥u 0 − ˆ u k− 1 0 ∥ 2 | D 0 u 0 ≤ b 0 − C 0 x 0 ,u 0 ≥ 0} Opti- mization Pass for t=1,...,T − 1 do Set x k t =a t (ξ t )+A t (ξ k t )x k t− 1 +B t (ξ k t )u k t− 1 Obtain u k t =argmin{ ˆ f k− 1 t (s k t ,u t )+ σ 2 ∥u t − ˆ u k− 1 t ∥ 2 | D t u t ≤ b t (ξ k t )− C t (ξ k t ),u t ≥ 0} end for Set x k T =a T (ξ k T )+A T (ξ k T )x k T− 1 +B T (ξ T )u k T− 1 , and obtain u k T ∈argmin{⟨c T ,x k T ⟩+⟨d T ,u T ⟩| D T u T ≤ b T (ξ k T )− C T (ξ k T )x k T ,u T ≥ 0} Use Terminal Stage Backward Pass in Algorithm 12 to obtain h k T . Backward Pass Update the approximated objective function below. ˆ f k T− 1 (s T− 1 ,u T− 1 )=⟨c T− 1 ,x T− 1 ⟩+⟨d T− 1 ,u T− 1 ⟩+ X ξ T ∈Ξ k T ˆ p k (ξ T )h k T (s T (ξ T )) for t=T − 1,...,1 do Use Non-terminal Stage Backward Pass in Algorithm 13 to get the updated mino- rants, h k t . 
The approximated objective function is updated as follows. ˆ f k t− 1 (s t− 1 ,u t− 1 )=⟨c t− 1 ,x t− 1 ⟩+⟨d t− 1 ,u t− 1 ⟩+ X ξ t∈Ξ k t ˆ p k (ξ t )h k t (s t (ξ t )) end for Set ˆ u k 0 =u k 0 . end for 73 Algorithm 12 Terminal Stage Backward Pass in SDLP Input: x k T , ˆ x k T , ξ k T , Ξ k− 1 T , h k− 1 T . Output: Ξ k T , h k− 1 T . Update Ξ k T by adding ξ k T to Ξ k− 1 T (i.e., Ξ k T =Ξ k− 1 T ∪{ξ k T } ). For ξ T =ξ k T , solve π k T (ξ T )∈argmax{⟨π T ,b T (ξ T )− C T (ξ T )x k T |⟨D T ,π T ⟩≤ d T ,π T ≤ 0} and ˆ π k T (ξ T )∈argmax{⟨π T ,b T (ξ T )− C T (ξ T )ˆ x k− 1 T |⟨D T ,π T ⟩≤ d T ,π T ≤ 0}. Compute the cut coefficients for l k T (s T ) = α k T (ξ T )+⟨β k T (ξ T ),x t ⟩ and ˆ l k T (s T ) = ˆ α k T (ξ t )+ ⟨ ˆ β k T (ξ T ),x t ⟩ below. α k T (ξ T )=⟨b T (ξ T ),π k T (ξ T )⟩, β k T (ξ T )=c T −⟨ C T (ξ T ),π k T (ξ T )⟩ ˆ α k T (ξ T )=⟨b T (ξ T ),ˆ π k T (ξ T )⟩, ˆ β k T (ξ T )=c T −⟨ C T (ξ T ),ˆ π k T (ξ T )⟩ Then add π k T (ξ T ) and π k T (ξ T ) to the set of collection of previously discovered dual extreme points (i.e., Π k T =Π k T ∪{π k T (ξ T ),ˆ π k T (ξ T )}). For ξ T ∈Ξ k \{ξ k T }, let π k T (ξ T )∈argmax{⟨π T ,(b T (ξ T )− C T (ξ T )x k T )| π T ∈Π k T } and ˆ π k T (ξ T )∈argmax{⟨π T ,(b T (ξ T )− C T (ξ T )ˆ x k− 1 T )| π T ∈Π k T } Compute the cut coefficients for l k T (s T ) = α k T (ξ T )+⟨β k T (ξ T ),x T ⟩ and ˆ l k T (s T ) = ˆ α k T (ξ T )+ ⟨ ˆ β k T (ξ T ),x t ⟩. α k T (ξ T )=⟨b T (ξ T ),π k T (ξ T )⟩, β k T (ξ T )=c T −⟨ C T (ξ T ),π k T (ξ T )⟩ ˆ α k T (ξ T )=⟨b T (ξ T ),ˆ π k T (ξ T )⟩, ˆ β k T (ξ T )=c T −⟨ C T (ξ T ),ˆ π k T (ξ T )⟩ for ξ T ∈Ξ k T do Update the minorant as h k T (s T (ξ T ))=max 0≤ j≤ k n l j T (s T (ξ T )), ˆ l j T (s T (ξ T )) o end for 74 Algorithm 13 Non-terminal Stage Backward Pass in SDLP Input: h k t+1 , u k t , x k t , ˆ u k− 1 t , ˆ x k− 1 t , ξ k t , Ξ k− 1 t , Ξ k t+1 ,J k− 1 t Output: h k t , Ξ k t ,J k t Update Ξ k t by adding ξ k t to Ξ k t . 
Update the index set,J k t (ξ k t ), where the realizations that equal to ξ k t occur. For ξ t ∈Ξ k t \{ξ k t }, letJ k t (ξ t )=J k− 1 t (ξ t ). Calculate ¯ β k t+1 , ¯ρ k t+1 and ¯α k t+1 based on (2.75) - (2.77) for u k t . SDA k t Get the optimal solution and its corresponding basis of the following problem. π k t (ξ k t )∈argmax{⟨π t ,(b t (ξ k t )− C t (ξ k t )x k t ⟩|⟨D t ,π t ⟩≤ ¯ρ k t+1 ,π t ≤ 0}. Construct new cut, l k t (s t )=α k t (ξ k t )+⟨β k t (ξ k t ),x t ⟩, with the following coefficients. α k t (ξ k t )=⟨π k t (ξ k t ),b t (ξ k t )⟩, β k t (ξ k t )=c t −⟨ C t (ξ k t ),π k t (ξ k t )⟩+ ¯α k t+1 . Calculate ˆ ¯ β k t+1 , ˆ ¯ρ k t+1 and ˆ ¯α k t+1 based on (2.75) - (2.77) for ˆ u k− 1 t . Similartothecalculationabove, wecanconstructanothercut, ˆ l k t (s t )byusing ˆ x k t , ˜ ξ k t , ˆ ¯ β k t+1 , ˆ ¯ρ k t+1 and ˆ ¯α k t+1 . For ξ t =ξ k t , update the minorant as follows. h k t (s t )=max ( ( k− 1 k ) T− t l j t (s t ) j∈J k− 1 t (ξ t) ,l k t (s t ), ˆ l k t (s t ) ) . (2.79) For ξ t ∈Ξ k t \{ξ k t }, update the minorant as follows. h k t (s t )=max ( ( k− 1 k ) T− t l j t (s t ) j∈J k− 1 t (ξ t) ) . 75 Let u ∗ 0 ∈ argmin u 0 ∈U 0 (s 0 ) {f 0 (s 0 ,u 0 )} The candidate solution and the previous incumbent solution from our Regularized SDDP has the following relationship. ∥u k 0 − u ∗ 0 ∥≤∥ T U 0 αf k− 1 0 (¯u k− 1 0 )− T U 0 αf 0 (¯u k− 1 0 )∥+∥T U 0 αf 0 (¯u k− 1 0 )− u ∗ 0 ∥ Furthermore, proximal mapping can be represented as an implicit projection below. T U 0 αf k− 1 0 (¯u k− 1 0 )=Π U 0 (¯u k− 1 0 − αg k (T U 0 αf k− 1 0 (¯u k− 1 0 ))) where g k (T U 0 αf k− 1 0 (¯u k− 1 0 ))) is the subgradient of f k− 1 0 under optimality condition at the point ¯u k− 1 0 . In SDLP, since the true probability is unknown, we further defines the sample average of the value function as follows. H k 1 (s 1 )=⟨c 1 ,x 1 ⟩+min⟨d 1 ,u 1 ⟩+ X ξ [2:T] ∈P k [2:T] p k (ξ [2:T] ) T X t=2 [⟨c t ,x t ( ˜ ξ t )⟩+⟨d t ,u t (ξ t )⟩] s.t. 
u 1 ∈U 1 (s 1 ),{u j t (ξ t )∈U t (s t (ξ t ))} t=2,...,T and non-anticipative, {x t+1 =D t+1 (x t (ξ t ),ξ t ,u t (ξ t ))} t=1,...,T− 1 F k 0 (s 0 ,u 0 )=⟨c 0 ,x 0 ⟩+⟨d 0 ,u 0 ⟩+ X ξ 1 ∈Ξ k 1 p k (ξ 1 )H k 1 (s 1 (ξ 1 )) Then candidate solution, u k 0 , and the previous incumbent solution, ¯u k− 1 0 , from SDLP has the following relation. ∥u k 0 − u ∗ 0 ∥≤∥ T U 0 αf k− 1 0 (¯u k− 1 0 )− T U 0 αF k− 1 0 (¯u k− 1 0 )∥+∥T U 0 αF k− 1 0 (¯u k− 1 0 )− T U 0 αF k− 1 0 (u ∗ 0 )∥+∥T U 0 αF k− 1 0 (u ∗ 0 )− u ∗ 0 ∥. For further discussion on the use of regularizing term and the convergence rate of SD, we recommend readers to [69]. In conclusion, this chapter proposes a new type of MLSP algorithm to build a bridge between SD and SDDP. In detail, it utilizes the proximal mapping, minorant selection, and 76 incumbent selection like SD does to reduce the workload per iteration. Since the value func- tions other than the ones in the first stage are approximated based on the same schema of the SDDP methodology, the monotonicity and finite convergence of the value functions are ensured. Currently, to the best of our knowledge, one limit on all the decomposition-based meth- ods for MSLP is that it requires the assumption of the finite probability support. Such assumption ensures that the scenario tree only contains finitely many paths so that the al- gorithm can traverse each path an infinite number of times. The traversal of each path an infinitenumberoftimesmakessurethesequenceofapproximatedfunctionsconvergestothe original function at least at the accumulation points. To relax the finite probability support assumption, one way is to design an algorithm that has the features of decomposition-based methods as well as the features in the methods for refining the filtration (e.g., [18, 37]). 
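To make the proximal-mapping discussion in these remarks concrete, the following Python sketch illustrates $T^X_{\alpha g}$ and $\Pi_X$ on invented data (a box feasible set and a simple separable quadratic $g$; none of this comes from the dissertation's test problems) and verifies the implicit-projection identity $T^X_{\alpha g}(x)=\Pi_X\!\big(x-\alpha\, g'(T^X_{\alpha g}(x))\big)$ numerically.

```python
import numpy as np

# Illustrative only: a box feasible set X = [0, 1]^n and a separable convex
# quadratic g(z) = 0.5 * ||z - c||^2 (c, x, alpha below are invented data).
c = np.array([2.0, -1.0, 0.3])
grad_g = lambda z: z - c

def project(x):
    """Euclidean projection Pi_X onto the box X = [0, 1]^n."""
    return np.clip(x, 0.0, 1.0)

def prox(x, alpha):
    """Proximal mapping T_X^{alpha g}(x) = argmin_{z in X} alpha*g(z) + 0.5*||z - x||^2.

    For this separable quadratic over a box, the minimizer has a closed form:
    clip the unconstrained minimizer (x + alpha*c) / (1 + alpha) to the box.
    """
    return project((x + alpha * c) / (1.0 + alpha))

x = np.array([0.9, 0.2, 0.5])
alpha = 0.7
z = prox(x, alpha)

# Implicit-projection identity: T(x) = Pi_X(x - alpha * grad_g(T(x))).
z_check = project(x - alpha * grad_g(z))
print(np.allclose(z, z_check))  # expect True
```

Since the regularized subproblem here is separable, the closed form makes the identity exact rather than approximate; with a general convex $g$, the proximal step would be computed by a solver, but the same fixed-point relation holds at the minimizer.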
Chapter 3

Distribution-free Algorithms for Learning Enabled Optimization with Non-parametric Approximation

3.1 Overview

Non-parametric statistics [111] has broad applications in density estimation [52], regression [44], and many engineering applications such as image classification [71] and natural language processing [106]. Due to their distribution-free and predictive characteristics, non-parametric estimators are imbued with predictive power as well as mild requirements on oracles for the probability distribution. Recently, the study of Stochastic Programming (SP) models with predictor/response data has begun to attract attention due to modern applications with evolving/streaming data [12, 24, 30, 55]. Following [24], we refer to such SP models as Predictive SP (PSP), which holds the promise of providing greater agility and adaptivity to SP. While the predictive feature adds agility to decision-making, there is one particularly insidious aspect of such models: simply averaging finitely many functions with equal weights (as in Sample Average Approximation (SAA)) will, in general, produce a constant bias which may not reduce even when the sample size increases indefinitely ([12]). In this chapter, we introduce a variety of non-parametric estimation techniques for data-driven decision making and propose an integrated algorithmic approach in which the density estimation and decision optimization are undertaken simultaneously.

To set the stage for our development, we consider the PSP problem where a vector of random model parameters, $\tilde{\xi}$, depends on a predictor (random variable), $\tilde{\omega}$, and such predictors can be observed before the realization of $\tilde{\xi}$. While the underlying distribution of predictor-response pairs $(\tilde{\omega}, \tilde{\xi})$ will not be assumed to be known, we will assume that a dataset of outcomes $S_N \triangleq \{(\omega^1,\xi^1),(\omega^2,\xi^2),\dots,(\omega^N,\xi^N)\}$ is available, and more observations become available as $N \to \infty$.
Moreover, as N →∞ the regular conditional distribution of ˜ ξ given ˜ ω =ω, denoted µ ˜ ξ |˜ ω=ω , exists 1 and E[F(x, ˜ ξ )|˜ ω =ω]≜ Z S ξ F(x,ξ )µ ˜ ξ |˜ ω=ω (dξ ) (3.1) is well defined and finite valued for any x∈X. With the above setup, we focus on solving the following PSP model. min x∈X f(x,ω)≜E[F(x, ˜ ξ )|˜ ω =ω]. (3.2) A word of caution is in order at the outset: it is important to distinguish between the above conditional expectation, and its variant E[F(x, ˜ ξ |˜ ω)] which calls for a different estimation process. In (3.2) the objective function F (which is scalar valued) is the object to be estimated (as a function of x), whereas, the variant mentioned in the previous sentence refers to a process which first builds a statistical model (i.e., a regression of ˜ ξ |˜ ω), which may then be incorporated within optimization. The works of [24] and [55] are devoted to such a two-step process. In contrast, (3.2) works on estimation and optimization simultaneously. In addition to the global regression, a local regression for the decision-dependent case (i.e., ˜ ξ |x) can be found in [68]. The challenges of solving (3.2) arise from: (a) How should one estimate the conditional distribution of ˜ ξ given ˜ ω = ω without complete knowledge of the “ground truth”? (b) Given a dataset S N suppose that the joint estimation-optimization step produces a decision 1 seesection5.1.3of[29]foradiscussionofregularconditionaldistributionandtheconditionalexpectation 79 x(S N ). Whatpropertiesshouldthejointestimation-optimizationstepimparttothesequence {x(S N )} so that we obtain an optimal solution to (3.2) asN →∞? Inresponsetotheabovechallenges, BertsimasandKallus[12]proposeanapproximation method which utilizes non-parametric estimation to approximate E[F(x, ˜ ξ )|˜ ω = ω] directly via a dataset S N . A non-parametric estimation of (3.2) is formulated as: min x∈X ˆ f N (x,ω;S N )≜ N X i=1 v N,i (ω)F(x,ξ i ). 
(3.3)

Here, f̂_N(x, ω; S_N) denotes an estimated function which is parametrized by x and ω and depends on the dataset S_N. The weight function v_{N,i}(ω) suggested by [12] is calculated by some non-parametric method, such as k nearest neighbors estimation or kernel estimation, which depends only on ω and S_N. In other words, the dataset {(ω_i, F(x, ξ_i))}_{i=1}^{N} is used to construct a point estimate of f(x, ω). Equation (3.3) should be considered a non-parametric analog of Sample Average Approximation (SAA). Just as SAA provides a statistical approximation which needs to be solved using an optimization algorithm, the minimization in (3.3) also requires an optimization algorithm for its solution. For this reason, we refer to (3.3) as a statistical approach. However, unlike SAA, the setup of (3.3) requires supervised machine learning methods, such as the k nearest neighbors method (kNN), kernel estimation, random forests, or others, to calculate the ω-dependent weights v_{N,i}(ω). Bertsimas and Kallus [12] show that under certain assumptions (e.g., equi-continuity of the random objective function), solving (3.3) provides a statistically consistent estimate of an optimal solution to (3.2). The consistency of such non-parametric estimation does not rely on specific assumptions regarding any particular distribution. In this chapter, we develop a first-order method which uses the oracle (A1) (see below), the datasets S_N, and a combined non-parametric estimation and stochastic quasi-gradient optimization algorithm, for which we show asymptotic convergence to an optimal solution of (3.2) as N goes to infinity.

(A1) There exists an oracle which returns a subgradient, G(x, ξ), of F(·, ξ) at a given point x (i.e., G(x, ξ) ∈ ∂_x F(x, ξ)) so that E_ξ̃[G(x, ξ̃)] = g(x), where g(x) ∈ ∂f(x) and f(x) ≜ E_ξ̃[F(x, ξ̃)].²

Before proceeding to the main part of the chapter, we motivate our work by the following counterexample.
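Before turning to the counterexample, the estimator (3.3) is worth making concrete. The sketch below computes kNN weights v_{N,i}(ω) (equal to 1/k on the k nearest neighbors, 0 elsewhere) and the resulting weighted estimate; the profit function F, the dataset, and all function names are illustrative stand-ins, not part of the chapter's formal setup.

```python
def knn_weights(omega, omegas, k):
    """v_{N,i}(omega) = 1/k if omega_i is among the k nearest neighbors, else 0."""
    order = sorted(range(len(omegas)), key=lambda i: abs(omegas[i] - omega))
    weights = [0.0] * len(omegas)
    for i in order[:k]:
        weights[i] = 1.0 / k
    return weights

def f_hat(x, omega, data, k, F):
    """Non-parametric estimate (3.3): sum_i v_{N,i}(omega) * F(x, xi_i)."""
    omegas = [w for (w, _) in data]
    v = knn_weights(omega, omegas, k)
    return sum(v_i * F(x, xi) for v_i, (_, xi) in zip(v, data))

# Illustrative stand-in: newsvendor-style profit F(x, xi) = p*min(x, xi) - c*x
p, c = 7.0, 5.0
F = lambda x, xi: p * min(x, xi) - c * x
data = [(1.0, 10.0), (2.0, 12.0), (5.0, 30.0), (5.5, 32.0)]
est = f_hat(11.0, 5.2, data, k=2, F=F)  # averages the two samples nearest omega=5.2
```

Only the two observations whose predictors are closest to ω enter the estimate, which is exactly how (3.3) localizes the SAA average around the observed predictor.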
In general, a first-order method (e.g., Stochastic Approximation (SA)) adopts iterations (indexed by l) as follows:

x_{l+1} = Π_X(x_l − γ_l G(x_l, ξ_l)),

where Π_X denotes a projection operator. Robust SA with constant stepsize (see [79]) is equivalent to running the SA algorithm with a constant stepsize to generate {x_l}_{l=1}^{N} and output (1/N) ∑_{l=1}^{N} x_l as the estimated solution. As illustrated by the example below, neither SA nor Robust SA produces an asymptotically optimal solution to (3.2), due to a nonzero estimation bias in the limit.

A Counterexample using SA for a Predictive Newsvendor Problem

The newsvendor problem is a classical SP problem [48]. In this example, we consider an extension of the classical problem by introducing covariates. Let p denote the selling price, and let c denote the purchase price, satisfying p > c > 0. Let x and ξ̃ denote the units of product purchased and the demand, respectively. We maximize the total profit given the observed predictor as follows:

max_{x≥0} E[p min(x, ξ̃) | ω̃ = ω] − cx.   (3.4)

² The expectations here are taken with respect to ξ̃. Namely, E_ξ̃[G(x, ξ̃)] = ∫_{ξ∈S_ξ} G(x, ξ) μ_ξ̃(dξ) and E_ξ̃[F(x, ξ̃)] = ∫_{ξ∈S_ξ} F(x, ξ) μ_ξ̃(dξ), where μ_ξ̃ is the marginal distribution of ξ̃.

In recent years, the newsvendor problem with covariates has attracted some attention (e.g., [70] and [4]). Assuming that the conditional density function exists, the conditional expectation in (3.4) is defined as E[p min(x, ξ̃) | ω̃ = ω] = p ∫ min(x, ξ) k(ξ|ω) dξ, where k(ξ|ω) is the conditional density function of ξ̃ given ω̃ = ω. Furthermore, let Φ_{ξ̃|ω̃=ω}(x) = ∫_{−∞}^{x} k(ξ|ω) dξ. It is easily shown that the optimal solution to (3.4) is x* = Φ_{ξ̃|ω̃=ω}^{−1}((p−c)/p),³ which is an extension of the standard newsvendor policy [70]. Given this analytical solution, we can show that neither SA nor Robust SA produces the optimal solution in the limit.
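The SA recursion above can be sketched generically, with Robust SA obtained by averaging the iterates; the sampler, projection, and stepsize rule are supplied by the caller, and all names are illustrative.

```python
def projected_sa(G, sample, x0, steps, gamma, proj):
    """Plain SA: x_{l+1} = Proj_X(x_l - gamma(l) * G(x_l, xi_l)).
    Returns the last iterate and the Robust-SA average of all iterates."""
    x, total = x0, 0.0
    for l in range(1, steps + 1):
        x = proj(x - gamma(l) * G(x, sample()))
        total += x
    return x, total / steps
```

For the newsvendor example one would minimize the negated profit, whose subgradient is −(p−c) when ξ > x and c otherwise; because the sampler draws ξ from its marginal distribution and ignores the observed predictor, the iterates converge to the marginal quantile Φ_ξ̃^{−1}((p−c)/p) rather than the conditional one.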
We let g(ω, ξ) denote the joint density function of (ω̃, ξ̃) and let g_ξ̃(ξ) ≜ ∫_{−∞}^{∞} g(ω, ξ) dω denote the marginal density function of ξ̃. Equation (3.4) implies that

G(x, ξ) = p − c if ξ > x; −c otherwise,   (3.5)

which further implies that ∫_{−∞}^{∞} ∫_{−∞}^{∞} G(x, ξ) g(ω, ξ) dω dξ = ∫_{−∞}^{∞} G(x, ξ) g_ξ̃(ξ) dξ = (p − c) − p Φ_ξ̃(x), where Φ_ξ̃(x) ≜ ∫_{−∞}^{x} g_ξ̃(ξ) dξ. Let x̂ = Φ_ξ̃^{−1}((p−c)/p). Let F_l denote the sigma algebra generated by the history {x_1, x_2, ..., x_l}; then the iterates from SA for solving the predictive newsvendor problem have the following recursive relation:

E[‖x_{l+1} − x̂‖² | F_l] ≤ ‖x_l − x̂‖² − 2γ_l ((E[p min(x̂, ξ̃)] − c x̂) − (E[p min(x_l, ξ̃)] − c x_l)) + γ_l² (p + c)².

Note that x̂ ∈ argmax_{x≥0} {E[p min(x, ξ̃)] − cx}; then (E[p min(x̂, ξ̃)] − c x̂) − (E[p min(x_l, ξ̃)] − c x_l) ≥ 0. By using the diminishing stepsize rule (i.e., γ_l > 0, ∑_{l=1}^{∞} γ_l = ∞, and ∑_{l=1}^{∞} γ_l² < ∞), we have ∑_{l=1}^{∞} γ_l² (p + c)² < ∞. By applying the Supermartingale Convergence Theorem (see Proposition 8.2.10 in [11]), we can show that {x_l}_l converges to x̂ almost surely. However, x̂ is not the optimal solution of max_{x≥0} E[p min(x, ξ̃) | ω̃ = ω] − cx when Φ_ξ̃^{−1}((p−c)/p) ≠ Φ_{ξ̃|ω̃=ω}^{−1}((p−c)/p).

³ For the extreme case when Φ_{ξ̃|ω̃=ω}^{−1}((p−c)/p) < 0, x* = 0, which implies that there is no need to purchase products beforehand.

In particular, if we let (ω̃, ξ̃) follow a bivariate normal distribution:

g(ω, ξ) = (1 / (2π σ_ω σ_ξ √(1 − ρ²))) exp{ −(1 / (2(1 − ρ²))) [ ((ω − μ_ω)/σ_ω)² + ((ξ − μ_ξ)/σ_ξ)² − 2ρ (ω − μ_ω)(ξ − μ_ξ)/(σ_ω σ_ξ) ] },   (3.6)

then the conditional distribution of ξ̃ given ω̃ = ω is a normal distribution with mean μ_ξ + ρ (σ_ξ/σ_ω)(ω − μ_ω) and variance σ_ξ²(1 − ρ²). The associated conditional density function of ξ̃ given ω̃ = ω is written below:

k(ξ|ω) = (1 / √(2π σ_ξ² (1 − ρ²))) exp{ −(ξ − (μ_ξ + ρ (σ_ξ/σ_ω)(ω − μ_ω)))² / (2 σ_ξ² (1 − ρ²)) }.   (3.7)

Let μ_ξ = 50, σ_ξ = 20, μ_ω = 30, σ_ω = 15, and ρ = 0.5.
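With these parameters, both quantiles in the example can be checked directly with Python's standard library; this is a verification sketch (using the observed predictor ω = 24 from the example), not part of the original derivation.

```python
import math
from statistics import NormalDist

p, c = 7.0, 5.0
q = (p - c) / p                       # critical ratio 2/7
mu_xi, sd_xi, mu_w, sd_w, rho = 50.0, 20.0, 30.0, 15.0, 0.5

omega = 24.0                          # observed predictor
cond_mu = mu_xi + rho * (sd_xi / sd_w) * (omega - mu_w)   # conditional mean: 46
cond_sd = sd_xi * math.sqrt(1.0 - rho ** 2)               # conditional sd: sqrt(300)

x_star = NormalDist(cond_mu, cond_sd).inv_cdf(q)  # conditional quantile (optimal)
x_hat = NormalDist(mu_xi, sd_xi).inv_cdf(q)       # marginal quantile (SA limit)
bias = abs(x_hat - x_star)
```

The computed values match the chapter's figures: x* ≈ 36.1975 and the limiting gap |x̂ − x*| ≈ 2.4836.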
The marginal distribution of ξ̃ is a normal distribution with mean 50 and standard deviation 20. By "plugging in" these parameters, ξ̃ given ω̃ = 24 follows a normal distribution with mean 46 and variance 300. For the case when p = 7 and c = 5, the optimal solution of (3.4) is x* = 36.1975, based on x* = Φ_{ξ̃|ω̃=ω}^{−1}((p−c)/p). Furthermore, since k(ξ|ω) > 0 for all ξ ∈ (−∞, ∞), the optimal solution is unique.

Next, suppose we apply both SA and Robust SA to solve (3.4) with the model parameters given above. We set the batch size equal to 5 for varying sample sizes. We replicate each instance 20 times and record the distances to the optimal solution point (i.e., Avg(|d|)). For each algorithm, we plot the average distance to the optimal solution versus sample size in Figure 3.1. According to Figure 3.1, the Avg(|d|) of both SA and Robust SA converges to a point around 2.5 as the sample size increases. As a matter of fact, |Φ_{ξ̃|ω̃=ω}^{−1}((p−c)/p) − Φ_ξ̃^{−1}((p−c)/p)| = 2.4836. Therefore, this simple example demonstrates the need for new SP algorithms which can combine joint estimation and optimization ("estimization"?).

Figure 3.1: Predictive Newsvendor Problem: SA and Robust SA

Towards the end of this paper, we will demonstrate that the new class of algorithms will produce asymptotically optimal solutions using a joint estimation-optimization procedure which combines a first-order method together with a variety of non-parametric estimation methods. ■

The contributions of this chapter are: a) An SP algorithm which goes "beyond using sample frequency" to solve SP problems with conditional expectation given ω̃ = ω. The SP algorithm that we study, namely LEON, is the first to allow non-parametric estimation as an integral part of a solution algorithm for SP. The LEON algorithm provides the possibility of rapid response to streaming data, thus expanding the horizon of SP algorithms to new applications. b) We provide a generalized framework to study and control the errors from the joint estimation-optimization process.
c) We demonstrate the realization of the LEON algorithm by studying a variety of non-parametric methods (e.g., kNN and a variety of kernels). Such subgradient estimation is the key component of the asymptotic convergence provided by the new algorithm. d) We provide preliminary computational evidence for the potential of this new class of optimization algorithms. Note that both the LEON algorithm and the validation approach are distribution-free. This implies that knowledge of the true conditional distribution is not required by the proposed methodologies. The numerical experiments in Section 3.5 illustrate that the solution quality obtained by most versions of the LEON algorithm is reasonably close to optimality.

The rest of the chapter is organized as follows. In Section 3.2.1, we provide some background on non-parametric estimation methods. In Section 3.3.1, we present the LEON algorithm and its convergence results. With the help of the generalized structure of LEON, we also investigate LEON-kNN as well as LEON-kernel in Sections 3.3.3 and 3.3.4, respectively. In Section 3.5.1, LEON is shown to clearly outperform SA when solving the predictive newsvendor problem. In Section 3.5.2, LEON is shown to numerically solve a class of two-stage predictive stochastic linear programming problems. These computations justify the label LEON applied to the methodologies presented in this paper. Overall, this chapter overcomes an important criticism that input distributions are necessary to even get started with SP models.

3.1.1 Notations

Since we are introducing notations in optimization and non-parametric estimation, we start the process of merging concepts via the notations introduced below. To be consistent with the optimization literature, we use x for the decision variable and X for the set to which x belongs.⁴ Let X be a Borel subset of R^{m_1} and let Y be a Borel subset of R^{m_2}.
Let (Ω ,Σ Ω ,P) be the probability space of the correlated continuous random variables ˜ ω and ˜ ξ taking the values in the measurable spaces, (X,Σ X ) and (Y,Σ Y ), respectively. In particular, the tuple (˜ ω, ˜ ξ ) : Ω 7→X ×Y is a random variable taking values in the product space (X ×Y ,Σ X ⊗ Σ Y ). Let the joint distribution of (˜ ω, ˜ ξ ) be µ ˜ ω, ˜ ξ . Let µ ˜ ω and µ ˜ ξ denote the marginal distribution functions of ˜ ω and ˜ ξ , respectively. We let ξ and ω denote the realizations of ˜ ξ and ˜ ω, respectively. Let g(ω,ξ ) be the joint density function at (˜ ω, ˜ ξ ) = (ω,ξ ), and µ ˜ ω, ˜ ξ has the form, µ ˜ ω, ˜ ξ (dω,dξ ) = g(ω,ξ )dωdξ, (ω,ξ )∈X ×Y . Let 4 We use Greek letters with˜on top to denote the random variables and Greek letters for corresponding outcomes. 85 m(ω) = R Y g(ω,ξ )dξ and suppose that the conditional distribution given that ˜ ω = ω is as follows (see [20] for details). k(ξ |ω)= g(ω,ξ ) m(ω) if m(ω)>0, R X g(ω ′ ,ξ )dω ′ if m(ω)=0. (3.8) WeusePandEtodenotetheprobabilityandexpectationoperators,respectively. Specif- ically, theexpectationE[·]takenin thealgorithmiswith respecttothe Cartesian product of infinitely many probability spaces of the i.i.d samples generated by the sampling procedure. The expectation of the bias is defined with respect to the Cartesian product of finitely many probability spaces which correspond to the dataset ( ˜ S N ) used. We let N << N and let ˜ S N ≜ {(˜ ω i , ˜ ξ i )} N i=1 denote a set of N i.i.d. copies of (˜ ω, ˜ ξ ). By the convention followed in this paper, S N denotes a realization of ˜ S N . We use the Euclidean distance to measure the distance between the two predictors. We let ∥·∥ denote the Euclidean norm of the vector and spectral norm of the matrix. Let ˜ ω N [j] (ω) denote the j th closest neighbor of ω from a collection of random variables{˜ ω i } N i=1 . 
3.2 Background

3.2.1 Non-parametric Statistical Estimation

Statistical regression is used to model the relationship between predictors (sometimes referred to as independent variables) and responses (sometimes referred to as dependent variables). Without assuming any specific pre-determined structure of predictors and responses, non-parametric estimation is one branch of statistical regression which sets up models purely based on the provided sampled data. One of the main advantages of non-parametric estimation is that model-fitting is not required prior to the optimization process (see [38] for more details).

One of the non-parametric estimation methods is the k nearest neighbors method. As the name suggests, the mean of the k nearest neighbors of the predictor (here the word "nearest" is sensitive to the metric used to measure distance) is used as the estimate of the corresponding conditional expectation. After Fix and Hodges Jr [36] proposed the nearest-neighbor method in 1951, several articles [25, 26, 44, 109, 110] have contributed to the consistency of kNN regression, among which Walk [109] shows that the kNN estimate converges to its true conditional expectation almost surely under certain regularity conditions. We give the definition of the kNN estimate of f(x, ω) below.

Definition 3.1. Let S(k_N, ω; {ω_j}_{j=1}^{N}) be the k_N nearest neighbors of ω from the sample set {ω_j}_{j=1}^{N}. The kNN estimate of f(x, ω) is defined as follows:

f̂_{k_N,N}(x, ω; S_N) ≜ (1/k_N) ∑_{i=1}^{N} I{ω_i ∈ S(k_N, ω; {ω_j}_{j=1}^{N})} F(x, ξ_i).

Note that f̂_{k_N,N}(x, ω; S_N) is a measurable function of x, ω, and the data S_N (see [44] for more details). We provide the formal statement of the asymptotic convergence of the kNN estimate of the objective function in (3.2) below.

Theorem 3.2. (Theorem 1 in [109]) Let {(ω̃_i, ξ̃_i)}_{i=1}^{N} be independent and identically distributed (i.i.d.) random variables. Let x ∈ X and suppose that E[|F(x, ξ̃)|] < ∞.
Let k N be monotone increasing with N, lim N→∞ k N =∞, lim N→∞ k N N =0 and (N) varies regularly with exponent β ∈(0,1] (e.g., k N =⌊N β ⌋, β ∈(0,1)). Then the following holds: lim N→∞ ˆ f k N ,N (x,ω; ˜ S N )=f(x,ω), almost surely. Figure 3.2 shows that the distribution of the k nearest neighbors become more and more like the true conditional distribution as the sample size increases. Another popular non-parametric regression method falls under the category of kernel methods. Specifically, some predefined kernel function is used to estimate the marginal 87 Figure3.2: IllustrationofthekNNmethodwhere(ω,ξ )followsbivariatenormaldistribution (sample sizes = 100, 1000, 10000). density function. Since Nadaraya [78] and Watson [112] proposed the famous Nadaraya- Watson kernel estimate in 1964, it has been used in biostatistics [19], economics [16], image processing [105] etc. Several papers [25, 41, 42, 44, 57, 110] further refined the theory of kernel estimation. LetK denote a set of following kernel functions: 1. Na¨ ıve kernel : K N (z)≜I {∥z∥≤ 1} . 2. Epanechnikov kernel: K E (z)≜(1−∥ z∥ 2 )I {∥z∥≤ 1} . 3. Quartic kernel: K Q (z)≜(1−∥ z∥ 2 ) 2 I {∥z∥≤ 1} . 4. Gaussian kernel: K G (z)≜e − ∥z∥ 2 2 . Definition 3.3. Let K∈K. The kernel estimation of f(x,ω) is defined as follows. ˆ f kernel N (x,ω;S N )≜ P N i=1 F(x,ξ i )K( ω− ω i h N ) P N i=1 K( ω− ω i h N ) , (3.9) Also note that ˆ f kernel N (x,ω;S N ) is a measurable function of x, ω, and data S N . The strong pointwiseconsistencypropertyofNadaraya-Watsonkernelregressionestimateissummarized below. 88 Theorem 3.4. (Theorem 3 in [110]) Let {(˜ ω i , ˜ ξ i )} N i=1 be identically and independently dis- tributed (i.i.d.) random variables. Let the bandwidth be denoted h N >0 and h N =CN − β , β ∈(0, 1 n ω ), ∀ N ∈N. Let K∈K kernel . Let x∈X and suppose that E h |F(x, ˜ ξ )|max{log|F(x, ˜ ξ )|,0} i <∞. 
If the kernel is used to calculate v N,i (ω), then the following holds: lim N→∞ ˆ f kernel N (x,ω; ˜ S N )=f(x,ω), almost surely. 3.3 LEON 3.3.1 Mathematical Setting In this section, we formally set up the mathematical foundations of LEON. Given ω as an observation of ˜ ω, we introduce the following assumptions for ensuring asymptotic properties of the LEON algorithm. (B0) (Non-negativity of the weight function) Let v N,i : R dz 7→ R. The weight function is non-negative for all i (i.e., v N,i (ω)≥ 0) and P N i=1 v N,i (ω)≤ 1. (B1) (Probabilisticexistenceof ˜ ω =ω)Althoughwedonotrequiretheknowledgeofthedis- tributionunderlyingtheuncertainty,wedoneedsomeassumptionsaboutthestructure of the distribution. We assume that the joint density function p(ω,ξ ) exists and the conditional distribution is defined by (3.8), which obeys P(˜ ω ∈ B ϵ (ω)) > 0, ∀ ϵ > 0, where B ϵ (ω) is a closed ball with radius ϵ centered around the point ω. Note that p(ω,ξ )doesnotdependonthedecisionxandhencefortheclassofproblemswestudy, the distribution underlying the uncertainty is not decision-dependent. 89 (B2) (i.i.d assumption on data pairs) (˜ ω 1 , ˜ ξ 1 ),(˜ ω 2 , ˜ ξ 2 ),... are independent and identically distributed random variables. (B3) (Compact and convex feasible region) X ⊂ R n is a deterministic compact convex set. (B4) (Convex objective function and related boundedness properties) Let F : X×Y 7→ R. For every ξ ∈Y ⊂ R m 2 , F(·,ξ ) is convex on X, and its subdifferential is bounded on X (i.e. there exits M G ∈(0,∞) such that∥G(x,ξ )∥≤ M G ). Additionally, there exists M F ∈(0,∞) such that|F(x,ξ )|≤ M F for all x∈X and ξ ∈Y. We begin by stating a result from Durrett [29], which defines the comparison between the values of two conditional expectations. Lemma 3.5. Let (Ω ,Σ Ω ,P) be a probability space. Let Σ 0 ⊂ Σ Ω and Y 1 ,Y 2 ∈ Σ Ω . Suppose that E[|Y 1 |]<∞ and E[|Y 2 |]<∞. Then Y 1 ≤ Y 2 implies E[Y 1 |Σ 0 ]≤ E[Y 2 |Σ 0 ] a.s.. 
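For instance, assumption (B0) is satisfied by the standard non-parametric weight rules: kNN weights equal 1/k_N on the k_N nearest neighbors and 0 elsewhere, and the Nadaraya-Watson weights of Definition 3.3 are non-negative and sum to one whenever the denominator is positive. A minimal one-dimensional sketch with the Gaussian kernel (helper names are illustrative):

```python
import math

def gaussian_kernel(z):
    """K_G(z) = exp(-z^2 / 2)."""
    return math.exp(-z * z / 2.0)

def nw_weights(omega, omegas, h):
    """Nadaraya-Watson weights: v_{N,i}(omega) = K((omega - omega_i)/h)
    normalized by sum_j K((omega - omega_j)/h)."""
    ks = [gaussian_kernel((omega - w) / h) for w in omegas]
    total = sum(ks)
    return [k / total for k in ks] if total > 0 else [0.0] * len(ks)

w = nw_weights(0.0, [-1.0, 0.0, 2.0], h=1.0)
# weights are non-negative, sum to one, and peak at the nearest predictor
```

The same check applies to the naive, Epanechnikov, and quartic kernels, whose supports are compact, so the denominator can vanish; in that degenerate case the sketch returns all-zero weights, which still satisfies (B0).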
Next, Lemma 3.6 summarizes that the true objective function and its non-parametric estimation, provided in (3.2) and (3.3), are convex under the assumptions given above.

Lemma 3.6 (Convexity of objective functions). Suppose assumptions A1 and B0-B4 are satisfied. Then the following hold: (1) f(·, ω) is convex on X, for μ_ω̃ almost every ω ∈ X. (2) f̂_N(·, ω; S_N) is convex on X.

Lemmas 3.5 and 3.6 establish the structure of the model (convexity), and at a conceptual level it is amenable to large scale optimization. One consequence of convexity is that for points where the value is finite, the standard subdifferential of convex analysis exists. Ermoliev and Gaivoronski [32, 35] define a class of stochastic subgradients with a particular property of bias in classical SP without covariates of predictors and responses.

Definition 3.7. [32] Let Ĝ_N(x; {ξ_i}_{i=1}^{N}) be a measurable function of x and {ξ_i}_{i=1}^{N}. Furthermore, let x* ∈ argmin_{x∈X} f(x) ≜ E_ξ̃[F(x, ξ̃)]. Then Ĝ_N(x; {ξ̃_i}_{i=1}^{N}) is a stochastic quasi-gradient of f(x) if it satisfies the following condition:

f(x*) − f(x) ≥ (x* − x)^⊤ E[Ĝ_N(x; {ξ̃_i}_{i=1}^{N})] + τ_N, where lim_{N→∞} τ_N = 0.   (3.10)

We extend the notion of stochastic quasi-gradient to the case of SP with covariates. As a result, our approach opens a whole area of applying non-parametric estimation in data-driven SP.

Definition 3.8. Let Ĝ_N(x, ω; S_N) be a measurable function of x, ω, and S_N. Furthermore, let x* be an optimal solution of (3.2). Then Ĝ_N(x, ω; S̃_N) is a stochastic quasi-gradient of f(x, ω) if it satisfies the following condition:

f(x*, ω) − f(x, ω) ≥ (x* − x)^⊤ E[Ĝ_N(x, ω; S̃_N)] + τ_N, with lim_{N→∞} τ_N = 0.   (3.11)

Unlike Monte Carlo estimates of subgradients, estimation using a stochastic quasi-gradient is biased; moreover, these estimates are not ε-subgradients in the standard sense because the errors are not required to be positive.
However, ˆ G N (x,ω;{(˜ ω i , ˜ ξ i )} N i=1 ) is the stochastic subgradient of f(x,ω) when τ N ≡ 0. In assumption A1, recall that we let G(x,ξ ) denote a subgradient of F(x,ξ ). By the scaling and addition properties of subdifferentials, we define Non-parametric Stochastic Quasi-gradient (NSQG) below. Definition 3.9. Suppose assumption A1 is satisfied. Let ˆ G N (x,ω;S N ) be a measurable function of x, ω, and data S N . ˆ G N (x,ω;S N ) is a Non-parametric Stochastic Quasi-gradient (NSQG) of f(x,ω) in (3.2), if it can be written as the weighted sum below, ˆ G N (x,ω;S N )≜ N X i=1 v N,i (ω)G(x,ξ i ), (3.12) 91 where G(x,ξ i ) is the subgradient of F(x,ξ i ) and the weight v N,i (ω) is calculated via non- parametric estimation methods. This notion of NSQG in Definition 3.9 is the key component of LEON algorithm and dis- tinguishesitfromthepreviousmethods. Accordingto[1], anon-parametricestimator(kNN estimator and kernel estimators) is often biased. But under mild assumptions, Lemmas 3.15 and 3.20 below show that the bias of both kNN and kernel estimators of true subgradients decrease to zero as the sample size increases to infinity. For now, we focus on a general case in which the estimation bias converges to 0 uniformly on X and show that such a condition will guarantee that ˆ G N (x,ω; ˜ S N ) is a stochastic quasi-gradient of f(x,ω). Lemma 3.10. Let ˆ f N (x,ω; ˜ S N ) be defined in (3.3). Suppose assumptions A1 and B0 - B4 are satisfied. Given ˜ ω =ω, we further suppose that E[ ˆ f N (x,ω; ˜ S N )]− f(x,ω) ≤ δ N (x,ω), ∀x∈X and δ N (·,ω) converges uniformly to 0 on X, as N →∞. Then ˆ G N (x,ω; ˜ S N ) is a stochastic quasi-gradient of f(x,ω) by definition 3.8. Proof. Letx,x ∗ ∈X,wherex ∗ denotesanoptimalsolution. Bytheconvexityof ˆ f N (·,ω;S N )) with respect to x∈X for a given ω in Lemma 3.6, we have ˆ f N (x ∗ ,ω; ˜ S N )− ˆ f N (x,ω; ˜ S N )≥ (x ∗ − x) ⊤ ˆ G N (x,ω; ˜ S N ) a.s. 
By taking the expectation of the both sides of the equation above, we get E[ ˆ f N (x ∗ ,ω; ˜ S N )]− E[ ˆ f N (x,ω; ˜ S N )]≥ (x ∗ − x) ⊤ E[ ˆ G N (x,ω; ˜ S N )] (3.13) 92 Combining the assumption that |E[ ˆ f N (x,ω; ˜ S N )]− f(x,ω)|≤ δ N (x,ω) and equation (3.13) , we obtain f(x ∗ ,ω)− f(x,ω)≥ (x ∗ − x) ⊤ E[ ˆ G N (x,ω; ˜ S N )]+(− δ N (x ∗ ,ω)− δ N (x,ω)) Let τ N =− δ N (x ∗ ,ω)− δ N (x,ω). Since δ N (x,ω) converges uniformly to 0 on X as N →∞, we conclude that τ N → 0 as N →∞. By Definition 3.8, it shows that ˆ G N (x,ω; ˜ S N ) is the stochastic quasi-gradient of f(x,ω). It is worth noting that if the assumption of uniform convergence of δ N (x,ω) is replaced by point-wise convergence, then ˆ G N (x,ω; ˜ S N ) is still a stochastic quasi-gradient of f(x,ω). However, uniform convergence of the non-parametric estimation [44] is essential in proving asymptotic convergence of the algorithm proposed in the next section. 3.3.2 Algorithm Design and Convergence Analysis In the proposed algorithm, we shall also assume assumption A1 holds. With the oracle provided in assumptionA1, we focus on developing a first-order method which incorporates non-parametric estimation of g(x,ω) to solve problem (3.2). In formulating the combined estimation-optimization process, we adopt a k nearest-neighbor (kNN)/ kernel approach for subgradient estimation and a stochastic quasi-gradient method for optimization ([35]). Such aprocesscanbeunderstoodasasimultaneouslearningapproachwhichusesstatisticallearn- ing to improve the decision making in the first-order algorithm. We refer to the update rule in our method as Non-parametric Stochastic Quasi-gradient (NSQG) update and present this approach in this section. Inspired by Polyak’s averaging approach in stochastic approximation [79, 89], we utilize the proposed NSQG Update (in Algorithm 14) to develop a first-order method which guar- antees asymptotic convergence. We refer to this method as a LEON algorithm. 
It is worth noting that since the response is assumed to be parametrized by the predictor, the convergence of the proposed algorithm is consistent for all possible realizations of the predictor. We first show the contraction property of the NSQG Update and then prove the convergence of the LEON algorithm by showing that the expected optimality gap converges to zero.

The details of the LEON algorithm are introduced below. The systematic error of the estimated solutions from the LEON algorithm consists of the error resulting from the step size as well as the bias from the non-parametric estimation method. The LEON algorithm contains a moving averaging window to compute the average of the estimated solutions. It is shown in Theorem 3.12 that we can find a good constant step size to speed up the convergence of the error resulting from the step size in each moving window. Although the step size has no effect on the bias of the non-parametric method, the following sections will further show that, with a certain requirement on the batch size, LEON with the kNN method as well as the kernel method produces a sequence of averaged estimated solutions that converges to the optimal solution of (3.2).

Algorithm 14 NSQG Update
1: Inputs: l, x_l, γ_l > 0, ∆ ≥ 1, N_l.
2: (Random re-sampling) Generate a dataset S_l^{N_l} ≜ {(ω_i^l, ξ_i^l)}_{i=1}^{N_l} (i.e., assume that an oracle generates a dataset S_l^{N_l} with N_l pairs of predictors and responses) and obtain {G(x_l, ξ_i^l)}_{i=1}^{N_l} from the oracle in A1.
3: (NSQG calculation) Calculate Ĝ_{N_l}(x_l, ω; S_l^{N_l}) by using (3.12).
4: (Estimated solution update) Compute the new estimated solution by using the following formula: x_{l+1} = Π_X(x_l − γ_l Ĝ_{N_l}(x_l, ω; S_l^{N_l})), where Π_X(·) denotes the projection operator.
5: (Increment batch size) N_{l+1} ← N_l + ∆.

Based on the data generating process in the NSQG update, the dataset S̃_l^{N_l} is independent of S̃_j^{N_j} for any j ≠ l.
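One pass of the NSQG update (Algorithm 14) can be sketched as follows; the sampler, subgradient oracle, weight rule, and projection are supplied by the caller, and all names are illustrative.

```python
def nsqg_update(x, gamma, N, omega, sample, subgrad, weights, proj, delta=1):
    """One NSQG update (Algorithm 14): draw S_N = {(omega_i, xi_i)}, form
    G_hat = sum_i v_{N,i}(omega) * G(x, xi_i) as in (3.12), then take a
    projected step x <- Proj_X(x - gamma * G_hat) and grow the batch size."""
    data = [sample() for _ in range(N)]               # step 2: resample S_N
    v = weights(omega, [w for (w, _) in data])        # non-parametric weights
    g_hat = sum(v_i * subgrad(x, xi) for v_i, (_, xi) in zip(v, data))  # step 3
    return proj(x - gamma * g_hat), N + delta         # steps 4-5
```

Because the weights are non-negative and sum to at most one (assumption B0), the returned step direction is bounded by the subgradient bound M_G, which is what the contraction argument of Lemma 3.11 uses.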
Algorithm 15 LEON
1: (Initialization) Set the maximum of the norm of the subgradient, M_G. Start with x_0 ∈ X and γ_0 > 0 and set D_X = max_{x∈X} ‖x − x_0‖. Let n and m be some positive integer constants (e.g., n = 1, m = 2). q ← 0. If n > 1, use the NSQG Update in Algorithm 14 to generate x_1, x_2, ..., x_{n−1}. Set i_q ← n and j_q ← n − 1 (the updates in step 3 will make sure that i_q < j_q). Choose the maximum number of outer iterations, q_max.
2: (Constant stepsize setup) Set q ← q + 1 and γ_q ← 2D_X / (M_G √(mq)).
3: (Averaging window for inner loops) Set i_q ← i_{q−1} + m(q−1) and j_q ← j_{q−1} + mq. Starting with x_{i_q − 1}, call the NSQG Update (Algorithm 14) with fixed step size γ_q mq times to generate x_{i_q}, x_{i_q+1}, ..., x_{j_q}.
4: (Averaging) Set x̃_q ← (∑_{l=i_q}^{j_q} x_l) / (mq).
5: (Stopping rule) If q ≥ q_max, stop and output the estimated solution, x̃_q. Otherwise, repeat from Step 2.

When M_G and D_X are unknown, one chooses a step size γ_q = C / √(mq), where C is a finite positive constant (i.e., C ∈ (0, ∞)). In other words, as long as we choose the constant step size (in each moving window) to be of the order of the inverse of the square root of the moving window size (i.e., O(1/√(mq))), the asymptotic convergence of the LEON algorithm will be
Suppose that there exists δ N (·,·) ≥ 0 such that |E[ ˆ f N (x,ω; ˜ S N )]− f(x,ω)| ≤ δ N (x,ω) for all x∈X. Moreover, for any x ′ ∈X the following inequality holds: E[∥x l+1 − x ′ ∥ 2 ]≤ E[∥x l − x ′ ∥ 2 ]− 2γ l E[f(x l ,ω)− f(x ′ ,ω)]+γ 2 l M 2 G +2γ l E[δ N l (x l ,ω)+δ N l (x ′ ,ω)] . Proof. ByLemma3.6, ˆ f N (·,ω;S N )isconvexonX, whichimpliesthatthefollowinginequal- ity holds, ˆ f N l (x ′ ,ω; ˜ S N l l ) ≥ ˆ f N l (x l ,ω; ˜ S N l l )+(x ′ − x l ) ⊤ ˆ G N l (x l ,ω; ˜ S N l l ) a.s. (3.14) 95 By the non-expansiveness property of projection operator, we have ∥Π X (x ′ )− Π X (x)∥ ≤ ∥x ′ − x∥. On the other hand, if x ′ ∈X, then Π X (x ′ )=x ′ . Hence, we have ∥x l+1 − x ′ ∥ 2 =∥Π X (x l − γ l ˆ G N l (x l ,ω; ˜ S N l l ))− Π X (x ′ )∥ 2 ≤∥ x l − x ′ ∥ 2 +γ 2 l ∥ ˆ G N l (x l ,ω; ˜ S N l l )∥ 2 − 2γ l (x l − x ′ ) ⊤ ˆ G N l (x l ,ω; ˜ S N l l ) a.s. Note that there exits M G > 0 such that ∥G(x, ˜ ξ )∥ ≤ M G < ∞ a.s.. By assumption (B0), P N l i=1 v N l ,i (ω)≤ 1 and v N l ,i (ω)≥ 0, which implies that∥ ˆ G N l (x l ,ω; ˜ S N l l )∥≤ M G a.s.. There- fore, by taking the expectation of the both sides of inequality above, we have E[∥x l+1 − x ′ ∥ 2 ]≤ E[∥x l − x ′ ∥ 2 ]+γ 2 l M 2 G − 2γ l E[(x l − x ′ ) ⊤ ˆ G N l (x l ,ω; ˜ S N l l )] (3.15) Note ˜ S N j j = {(˜ ω j i , ˜ ξ j i )} N j i=1 is the dataset generated at the j th iteration by using the NSQG Update. Note that x l is a function of the history ˜ S [l− 1] :=( ˜ S N 0 0 ,..., ˜ S N l− 1 l− 1 ). In other words, x l ismeasurablewithrespecttothe σ -algebrageneratedby ˜ S [l− 1] . Bytakingtheconditional expectation of (3.14) on the history ˜ S [l− 1] and moving ˆ f N (x l ,ω; ˜ S N l ) to the left hand side of (3.14), we get E[ ˆ f N l (x ′ ,ω; ˜ S N l l )− ˆ f N l (x l ,ω; ˜ S N l l )| ˜ S [l− 1] ] ≥ E[(x ′ − x l ) ⊤ ˆ G N l (x l ,ω; ˜ S N l l )| ˜ S [l− 1] ] a.s. 
(3.16) Based on the assumption that|E[ ˆ f N (x,ω; ˜ S N )]− f(x,ω)|≤ δ N (x,ω), we have E[f(x ′ ,ω)− f(x l ,ω)+δ N l (x ′ ,ω)+δ N l (x l ,ω)|S [l− 1] ] ≥ E[ ˆ f N l (x ′ ,ω; ˜ S N l l )− ˆ f N l (x l ,ω; ˜ S N l l )| ˜ S [l− 1] ] a.s. (3.17) 96 By combining (3.16) and (3.17), we obtain E[f(x ′ ,ω)− f(x l ,ω)+δ N l (x ′ ,ω)+δ N l (x l ,ω)|S [l− 1] ] ≥ E[(x ′ − x l ) ⊤ ˆ G N l (x l ,ω; ˜ S N l )| ˜ S [l− 1] ] a.s. We take the expectation of both sides of the inequality above and obtain E[f(x ′ ,ω)− f(x l ,ω)+δ N l (x ′ ,ω)+δ N l (x l ,ω)] ≥ E[(x ′ − x l ) ⊤ ˆ G N l (x l ,ω; ˜ S N l l )]. (3.18) By using (3.18) to replace− E[(x l − x ′ ) ⊤ G N (x l ,ω; ˜ S N l l )] in (3.15) withE[f(x ′ ,ω)− f(x l ,ω)+ δ N l (x ′ ,ω)+δ N l (x l ,ω)], we have E[∥x l+1 − x ′ ∥ 2 ]≤ E[∥x l − x ′ ∥ 2 ]− 2γ l E[f(x l ,ω)− f(x ′ ,ω)] +γ 2 l M 2 G +2γ l E[δ N l (x l ,ω)+δ N l (x ′ ,ω)]. This completes the proof. ItisworthnotingthattherearetwotypesofestimatedsolutionupdatesintheLEONal- gorithm. Werefertotheprocessofusingnon-parametericestimateofsubgradientoff(x l ,ω) and x l to produce x l+1 as a new point of the NSQG update, while we refer to the process of calculating ˆ x q (i.e. running steps 2-5 in LEON) as an iteration of the LEON algorithm. Notwithstanding the bias associated with stochastic quasi-gradients, LEON assures asymp- totic convergence by extending the averaging process in which we use an ever-increasing size of the window. In the following, we use X ∗ to denote the optimal solution set of problem specified in (3.2) and c ∗ (ω)=min x∈X f(x,ω). Theorem 3.12 (Asymptotic convergence of LEON). Let ω be fixed. Let ˜ x q be gener- ated by LEON, where q denotes the iteration number. Suppose assumptions A1 and B0 - B4 are satisfied. Suppose that there exists a non-negative function δ N (·,·) such that 97 |E[ ˆ f N (x,ω; ˜ S N )]− f(x,ω)| ≤ δ N (x,ω) for all x ∈ X. Further assume that δ N (·,ω) con- verges to 0 uniformly on X as N goes to infinity. 
Let l denote the l th iteration of NSQG Update. In each iteration of NSQG update (process of generating x l ), the size of the datasets {N l } l increases monotonically. Then the following holds: lim q→∞ E[f(˜ x q ,ω)− c ∗ (ω)]=0 Proof. TheproofisinspiredbythatofconvergenceofRobustSAin[79]althoughthedetails are different due to the bias. We continue using the notation in the proof of Lemma 3.11. Furthermore, denote x ∗ l =Π X ∗ (x l ), c ∗ (ω)=f(x ∗ ,ω) for x ∗ ∈X ∗ . (3.19) By the Projection Theorem, we have∥x l+1 − x ∗ l+1 ∥≤∥ x l+1 − x ∗ l ∥. According to Lemma 3.11 and let a l =E[∥x l − x ∗ l ∥ 2 ], we have a l+1 ≤ a l − 2γ l E[f(x l ,ω)− c ∗ (ω)]+γ 2 l M 2 G +2γ l E[δ N l (x l ,ω)+δ N l (x ∗ l ,ω)]. (3.20) Let A l = sup{δ N l (x,ω) : x ∈ X}. Since δ N (·,ω) → 0 uniformly on X as N → ∞, we conclude that A l →0 as l→∞. It follows from (3.20) that γ l [E[f(x l ,ω)− c ∗ (ω)]− 2A l ]≤ 1 2 a l − 1 2 a l+1 + 1 2 γ 2 l M 2 G . (3.21) We let 1≤ i≤ j and get j X l=i γ l [E[f(x l ,ω)− c ∗ (ω)]− 2A l ]≤ j X l=i ( 1 2 a l − 1 2 a l+1 )+ j X l=i 1 2 γ 2 l M 2 G (3.22a) ≤ 1 2 a i + 1 2 M 2 G j X l=i γ 2 l . (3.22b) 98 Dividing both sides of the inequality above by P j l=i γ l we obtain P j l=i γ l E[f(x l ,ω)− c ∗ (ω)] P j l=i γ l − 2 P j l=i γ l A l P j l=i γ l ≤ 1 2 a i + 1 2 M 2 G P j l=i γ 2 l P j l=i γ l . Let v l = γ l P j l=i γ l . Since P j l=i v l = 1 and v l > 0, by the convexity of f(·,ω) for a given ω, we have f( P j l=i v l x l ,ω) ≤ P j l=i v l f(x l ,ω). Now we let D X = max x∈X ∥x− x 0 ∥ and thus a i =E[∥x i − x ∗ i ∥ 2 ]≤ E[(∥x i − x 0 ∥+∥x 0 − x ∗ i ∥) 2 ]≤ 4D 2 X . We further let x j i = P j l=i v l x l and get E[f(x j i ,ω)− c ∗ (ω))]≤ 2 j X l=i v l A l + 2D 2 X + 1 2 M 2 G P j l=i γ 2 l P j l=i γ l (3.23) Comparedto[79],wehaveanextraterm,2 P j l=i v l A l ,whichrepresentstheaccumulatednon- parametricestimationbias. Nowchoosing n>0andm>0, weleti 1 =n,j 1 =m+n− 1,..., i q = n+m P q− 1 s=1 s and j q = n+m P q s=1 s− 1. Note that j q − i q = mq− 1. 
This step is critical in creating a moving window with increasing window size to control the accumulated non-parametric estimation bias. For i q ≤ l≤ j q (i.e., q th moving window), we use a constant stepsize, which is ˜ γ q . It follows from (3.23) that E[f(x jq iq ,ω)− c ∗ (ω))]≤ 2 jq X l=iq v l A l + 2D 2 X + 1 2 M 2 G mq˜ γ 2 q mq˜ γ q . (3.24) We minimize the right hand side of (3.24) and get the optimal stepsize, ˜ γ q = 2D X M G √ mq . By substituting the optimal constant stepsize into (3.24), we have E[f(x jq iq ,ω)− c ∗ (ω))]≤ 2 jq X l=iq v l A l + 2D X M G √ mq . (3.25) 99 Let ˜ x q =x jq iq andconstructasequence{˜ x q }withrespecttoq (i.e., ˜ x q istheaverageestimated solution in the q th moving window). Since v l ≥ 0, P jq l=iq v l =1 and lim l→∞ A l =0, we have lim q→∞ 2 jq X l=iq v l A l =0 and lim q→∞ 2D X M G √ mq =0. (3.26) Equation (3.26) implies that lim q→∞ E[f(˜ x q ,ω)− c ∗ (ω))]=0. To see why lim q→∞ 2 P jq l=iq v l A l = 0, we let A lq = max iq≤ l≤ jq {A l }. Then v l ≥ 0 and P jq l=iq v l = 1 implies 0 ≤ 2 P jq l=iq v l A l ≤ 2A lq . Since {A lq } is a subsequence of {A l } and lim l→∞ A l =0, then lim q→∞ A lq =0. Remark 1: We do not need monotonicity of the bias (i.e., δ N (x,ω)≥ δ N+1 (x,ω)); it is suffi- cient that the uniform convergence of the bias ensures the asymptotic convergence of LEON algorithm. Remark 2: In the case when M G and D X are unknown, we can pick a constant stepsize C √ mq , where C is a positive constant (i.e., C ∈ (0,∞)). In this case, it follows from (3.24) that E[f(x jq iq ,ω)− c ∗ (ω))]≤ 2 P jq l=iq v l A l + 2D 2 X + 1 2 C 2 M 2 G C √ mq . Again, since D X , M G , m and C are finite constants, we have lim q→∞ 2D 2 X + 1 2 C 2 M 2 G C √ mq = lim q→∞ q − 1 2 2D 2 X + 1 2 C 2 M 2 G C √ m =0. This implies that lim q→∞ E[f(˜ x q ,ω)− c ∗ (ω))]=0. It is worth noting that although this step size rule is not optimal, the asymptotic convergence of LEON is still valid. 
Remark 3: Since the optimal stepsize given by Theorem 3.12 is based on minimizing the optimization errors, it only differs from the stepsize rule in Robust SA by a constant. In the following, we derive the convergence of NSQG if the limit of the summation of the product of stepsize and bias is finite. It is worth noting that such requirement on the 100 stepsize is hard to verify, which makes the solution update purely based on NSQG update unfavorable. The proof of the convergence of NSQG is based on Theorem 1.3 in Chapter 1.2.1. Proposition 3.13 (Convergence of NSQG update). Suppose assumptions A1 and B0-B4 hold. Suppose that there exists a non-negative function δ N (·,·) such that |E[ ˆ f N (x,ω; ˜ S N )]− f(x,ω)| ≤ δ N (x,ω) for all x ∈ X. Further assume that δ N (·,ω) converges to 0 uniformly on X as N goes to infinity. Let A N = sup x∈X δ N (x,ω). Suppose the stepsize {γ l } and {N l } satisfies following conditions: γ l ≥ 0, ∞ X l=0 γ l =∞, ∞ X l=0 A N l γ l <∞, ∞ X l=0 γ 2 l <∞. Then the sequence{x l } generated by NSQG update converges to some optimal solution with probability 1. Proof. Accordingly, let x ∗ ∈X ∗ (where X ∗ =arg min x∈X {f(x,ω)}) and we have ∥x l+1 − x ∗ ∥ 2 ≤∥ x l − x ∗ ∥ 2 − 2γ l ( ˆ f N l (x ∗ ,ω; ˜ S N l l )− ˆ f N l (x l ,ω; ˜ S N l l )) +γ 2 l M 2 G (3.27) By taking the conditional expectation of (3.27) with respect toF l ={x 0 ,...,x l }, we have E[∥x l+1 − x ∗ ∥ 2 |F l ]≤∥ x l − x ′ ∥ 2 − 2γ l (f(x,ω)− f(x ∗ ,ω)− 2A l )+γ 2 l M 2 G (3.28) Note that E[ ˆ f N l (x l ,ω; ˜ S N l l )|F l ]≥ f(x l ,ω)− A l and E[ ˆ f N l (x ∗ ,ω; ˜ S N l l )|F l ]≤ f(x ∗ ,ω)+A l 101 which implies that E[ ˆ f N l (x l ,ω; ˜ S N l l )− ˆ f N l (x ∗ ,ω; ˜ S N l l )|F l ]≥ f(x,ω)− f(x ∗ ,ω)− 2A l To proceed, it follows from (3.28) that E[∥x l+1 − x ∗ ∥ 2 |F l ]≤∥ x l − x ∗ ∥ 2 − 2γ l (f(x,ω)− f(x ∗ ,ω))+4γ l A l +2γ 2 l M 2 G By assumptions, we have P ∞ l=0 4γ l A l +2γ 2 l M 2 G <∞ and 2γ l (f(x,ω)− f(x ∗ ,ω))≥ 0 for all l. 
Hence, by Theorem 1.3, we have for all sample paths in a set Ω x ∗ of probability 1 ∞ X l=0 2γ l (f(x,ω)− f(x ∗ ,ω))<∞, and the sequence {∥x l − x ∗ ∥ 2 } converges. The rest of the proof follows from the proof in Prop. 8.2.13 of [11]. 3.3.3 LEON with kNN Estimation WebeginbyprovidingthekNNrealizationofNSQGandthenanalyzetheimpactofthebias on the kNN estimate. We will first study LEON- kNN with constant k and then investigate the case for which k N is a function of the sample size N. We will show that LEON with both versions of kNN estimation satisfies the conditions in Theorem 3.12. Given a positive integer k (or k N ), a dataset S N and an observation of the predictor ω, we aim to find the k (or k N ) covariates from this dataset closest (in distance) to ω. We require that ties are broken randomly and k (or k N ) must be smaller than N. Here, we provide the details of how to calculate v N,i (ω), which corresponds to data point ω i , via kNN estimation. If ω i belongs to the set of k (or k N ) nearest neighbors of ω, v N,i (ω) is set to be 1; otherwise, v N,i (ω) is set to be 0. The convergence of LEON-kNN with constant k requires an additional assumption which is provided below. 102 (B5) (H¨oldercontinuityofobjectivefunction)Letf :X×X 7→ R. Thereexistsacontinuous function, C(x) <∞, and p > 0 such that|f(x,ω 1 )− f(x,ω 2 )|≤ C(x)∥ω 1 − ω 2 ∥ p , for every x∈X and µ ω almost all ω 1 , ω 2 ∈X. In definition 3.1, we let S(k,ω;{ω j } N j=1 ) denote the set of k nearest neighbors of ω from a samplesetwithsizeN andformulatethenon-parametricstochasticquasi-gradientoff(x,ω) based on kNN estimation, which we refer to as kNN estimate of NSQG. ˆ G k,N (x,ω;S N )≜ 1 k N X i=1 I ω i ∈S(k,ω;{ω j } N j=1 ) G(x,ξ i ). (3.29a) ˆ f k,N (x,ω;S N )≜ 1 k N X i=1 I ω i ∈S(k,ω;{ω j } N j=1 ) F(x,ξ i ). (3.29b) Such kNN estimate of NSQG in (3.29a) has a corresponding benchmark function in (3.29b) based on the approximation method proposed by [12]. 
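A minimal sketch of the kNN weights $v_{N,i}(\omega)$ and the benchmark function (3.29b). All function names are ours; distances are squared Euclidean, and ties are broken deterministically by index order here, whereas the text requires random tie-breaking:

```python
def knn_weights(omega, covariates, k):
    """v_{N,i}(omega) = (1/k) * 1{omega_i is among the k nearest neighbors}."""
    dists = [sum((a - b) ** 2 for a, b in zip(omega, w)) for w in covariates]
    order = sorted(range(len(covariates)), key=lambda i: dists[i])
    neighbors = set(order[:k])
    return [1.0 / k if i in neighbors else 0.0 for i in range(len(covariates))]

def knn_estimate(omega, data, k, F):
    """Benchmark function (3.29b): average of F(x, xi_i) over the k nearest
    covariates, for a fixed decision x baked into F."""
    ws = knn_weights(omega, [w for w, _ in data], k)
    return sum(v * F(xi) for v, (_, xi) in zip(ws, data))
```

For example, with `data = [((0.0,), 1.0), ((1.0,), 3.0), ((5.0,), 10.0)]`, `omega = (0.4,)` and `k = 2`, the two nearest covariates are the first two, so the weights are `[0.5, 0.5, 0.0]` and these same weights applied to the subgradients $G(x,\xi^i)$ give the kNN estimate of the NSQG in (3.29a).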
Although such a benchmark function will not be used in the LEON-kNN algorithm, it is useful in quantifying the estimation bias and thus proving the convergence of the algorithm. Also note that 1 k I ω i ∈S(k,ω;{ω j } N j=1 ) ≥ 0, N X i=1 1 k I ω i ∈S(k,ω;{ω j } l j=1 ) =1. (3.30) Thus,equation(3.30)impliesthattheweightfunctioncalculatedbykNNestimationmethod satisfies assumption B0. It has been discussed that the optimal solution of approximation problem (3.29b) can be used as an estimate of the true optimal solution (see [12] for more details). However,sincetheruntimeofna¨ ıve kNNalgorithmisproportionaltothedimension of the vector ω and the sample size 5 , finding such a set of k nearest neighbors of a given point can be computationally expensive for large datasets. On the other hand, LEON-kNN uses far fewer data points in each iteration and also keeps improving estimated solutions. In general, LEON-kNN requires a small dataset to update the estimate when it is far from 5 Calculating distances of N data points from a dataset consumes O(N) time and sorting all the distances by using quick sort takes O(NlogN) time 103 the optimal solution set and a large dataset to further improve the estimate when it is very close to the optimal solution set. The bound provided next will be useful for the analysis of specific non-parametric meth- ods such as kNN and kernel estimation. Lemma 3.14. Suppose that A1 and B0 - B5 are satisfied. Then the following holds: E[|f(x,˜ ω 1 )− f(x,ω)|v N,1 (ω)]≤ E[C(x)∥˜ ω 1 − ω∥ p v N,1 (ω)]. Proof. Let H ={z ∈ Ω : |f(x,˜ ω 1 (z))− f(x,ω)|v N,1 (ω)≤ C(x)∥˜ ω 1 (z)− ω∥ p }. Then based on assumptionB5, we haveE[I(˜ ω 1 ∈H)]=P(H)=1. Additionally, assumptionB4 implies thatf(x,˜ ω)=E[F(x, ˜ ξ )|˜ ω]≤ E[|F(x, ˜ ξ )||˜ ω]≤ M F a.s.Similarly,wehave− f(x,˜ ω)≥− M F . This implies that|f(x,˜ ω)| is bounded above by M F almost surely. 
Consequently, E[|f(x,˜ ω 1 )− f(x,ω)|v N,1 (ω)]≤ E[C(x)∥˜ ω 1 − ω∥ p v N,1 (ω)I(˜ ω 1 ∈H)] +2M F E[I(˜ ω 1 / ∈H)] Since E[I(˜ ω 1 / ∈ H)] = P(I(˜ ω 1 / ∈ H)) = 0 and E[C(x)∥˜ ω 1 − ω∥ p v N,1 (ω)I(˜ ω 1 ∈ H)] ≤ E[C(x)∥˜ ω 1 − ω∥ p v N,1 (ω)], it follows from (3.31) that E[|f(x,˜ ω 1 )− f(x,ω)|v N,1 (ω)]≤ E[C(x)∥˜ ω 1 − ω∥ p v N,1 (ω)]. Lemma 3.15 shows that the bias of the kNN estimate satisfies conditions of Theorem 3.12. Lemma3.15. Letω befixed. Let v N,i (ω)= 1 k I ω i ∈S(k,ω;{ω j } N j=1 ) . Supposeassumptions A1 andB1 -B5 are satisfied. Further suppose that ∥˜ ω∥ is bounded almost surely (i.e. there exits T ∈(0,∞) such that∥˜ ω∥≤ T a.s.). Then the following hold: (1) There exists δ N (x,ω) such that|E[ ˆ f k,N (x,ω; ˜ S N )]− f(x,ω)|≤ δ N (x,ω). 104 (2) δ N (x ′ ,ω)→0, as N →∞, for any x ′ ∈X. (3) δ N (·,ω) decreases monotonically with N. (4) As N ↑∞, δ N (·,ω)→0 uniformly on X. Proof of (1). LetS =σ {˜ ω i :i∈1,2,...,N}denotetheσ -algebrageneratedby{˜ ω i } N i=1 . Since {(˜ ω i , ˜ ξ i )} N i=1 are independent and identically distributed, the expectation of k-NN estimate in (3.29b) can be decomposed into the following form [109]: E h ˆ f k,N (x,ω; ˜ S N ) i = 1 k N X i=1 E h E h F(x, ˜ ξ i )I ˜ ω i ∈S(k,ω;{˜ ω j } N j=1 ) |S ii = 1 k N X i=1 E h E[F(x, ˜ ξ i )|˜ ω i ]I ˜ ω i ∈S(k,ω;{˜ ω j } N j=1 ) i = N k E f(x,˜ ω 1 )I ˜ ω 1 ∈S(k,ω;{˜ ω j } N j=1 ) . (3.31) On the other hand, we have E I ˜ ω 1 ∈S(k,ω;{˜ ω j } N j=1 ) =P I ˜ ω 1 ∈S(k,ω;{˜ ω j } N j=1 ) =1− N− 1 k N k = k N , because ˜ ω 1 ,...,˜ ω N are equally likely to be in the set of k nearest neighbors of ω. Hence, it implies that for any constant a∈R, we have a= N k E aI ˜ ω 1 ∈S(k,ω;{˜ ω j } N j=1 ) . By using equations (3.31) and Lemma 3.14, we attain an upper bound of|E[ ˆ f k,N (x,w; ˜ S N )]− f(x,w)| below, E[ ˆ f k,N (x,ω; ˜ S N )]− f(x,ω) ≤ N k C(x)E[∥˜ ω 1 − ω∥ p I ˜ ω 1 ∈S(k,ω;{˜ ω j } N j=1 ) ]. 
(3.32) Let δ N (x,ω)= N k C(x)E ∥˜ ω 1 − ω∥ p I ˜ ω 1 ∈S(k,ω;{˜ ω j } N j=1 ) , equation (3.32) can be simpli- fied to E[ ˆ f k,N (x,ω; ˜ S N )]− f(x,ω) ≤ δ N (x,ω). (3.33) 105 Proof of (2). Let ˜ ω N [k] (ω) be the k th nearest neighbor of ω in the dataset {(˜ ω i , ˜ ξ i )} N i=1 . By the SLLN of kNN (Lemma 6.1 in [44]), we have∥˜ ω N [k] (ω)− ω∥→ 0 a.s., as N →∞, which implies that∥˜ ω N [k] (ω)− ω∥ p →0 a.s. as N →∞. This can be generalized to ∥˜ ω N [i] (ω)− ω∥ p →0 a.s. as N →∞ for i≤ k. (3.34) Since k is finite, we have 1 k P k i=1 ∥˜ ω N [i] (ω)− ω∥ p →0 a.s. as N →∞. Since P k i=1 ∥˜ ω N [i] (ω)− ω∥ p = 1 k P N i=1 ∥˜ ω i − ω∥ p I(˜ ω i ∈S(k,ω;{˜ ω j })) a.s., this implies that 1 k N X i=1 ∥˜ ω i − ω∥ p I(˜ ω i ∈S(k,ω;{˜ ω j }))→0 a.s. as N →∞. (3.35) Since ∥˜ ω∥ ≤ T < ∞ a.s., ∥˜ ω− ω∥ p ≤ (2T) p < ∞ a.s.. By the Dominated Convergence Theorem (or Bounded Convergence Theorem), it follows from (3.34) that E ∥˜ ω N [i] (ω)− ω∥ p →0 as N →∞ for i≤ k. (3.36) Equation (3.36) implies that E " 1 k k X i=1 ∥˜ ω N [i] (ω)− ω∥ p # →0 as N →∞. (3.37) Since C(x ′ ) is finite for x ′ ∈X and 1 k k X i=1 ∥˜ ω N [i] (ω)− ω∥ p = 1 k N X i=1 ∥˜ ω i − ω∥ p I ˜ ω i ∈S(k,ω;{˜ ω j } N j=1 ) a.s., equation (3.37) implies that E " 1 k C(x ′ ) N X i=1 ∥˜ ω i − ω∥ p I ˜ ω i ∈S(k,ω;{˜ ω j } N j=1 ) # →0 as N →∞. (3.38) 106 Since ˜ ω 1 ,˜ ω 2 ,...,˜ ω N are independent and identically distributed, we have E " 1 k C(x ′ ) N X i=1 ∥˜ ω i − ω∥ p I ˜ ω i ∈S(k,ω;{˜ ω j } N j=1 ) # = N k C(x ′ )E ∥˜ ω 1 − ω∥ p I ˜ ω 1 ∈S(k,ω;{˜ ω j } N j=1 ) =δ N (x ′ ,ω). (3.39) The combination of (3.38) and (3.39) implies that for any x ′ ∈ X, δ N (x ′ ,ω) → 0 as N → ∞. Proof of (3). We shall show that δ N ≥ δ N+1 pointwise for N >k. Let{˜ ω 1 ,˜ ω 2 ,..., ˜ ω N ,˜ ω N+1 } be N + 1 independent copies of ˜ ω. Let ˜ ω N+1 [i] (ω) be the i th nearest neighbor of ω from {˜ ω 1 ,˜ ω 2 ,...,˜ ω N ,˜ ω N+1 }. Let ˜ ω N [i] (ω) be the i th nearest neighbor of ω from {˜ ω 1 ,˜ ω 2 ,...,˜ ω N }. 
It is obvious that∥˜ ω N [i] (ω)− ω∥ p ≥∥ ˜ ω N+1 [i] (ω)− ω∥ p a.s., which implies E ∥˜ ω N [i] (ω)− ω∥ p ≥ E h ∥˜ ω N+1 [i] (ω)− ω∥ p i . (3.40) Since (3.40) holds for i∈{1,2,3,...,k}, it implies that k X i=1 E ∥˜ ω N [i] (ω)− ω∥ p ≥ k X i=1 E h ∥˜ ω N+1 [i] (ω)− ω∥ p i . (3.41) We also have δ N (x,ω)=C(x) 1 k P k i=1 E h ∥˜ ω N [i] (ω)− ω∥ p i . Similarly, δ N+1 (x,ω)=C(x) 1 k P k i=1 E[∥˜ ω N+1 [i] (ω)− ω∥ p ]. So (3.41) implies δ N (x,ω)≥ δ N+1 (x,ω). Proof of (4). According to (2) and (3) from Lemma 3.15, we know that δ N (·,ω)→0 point- wise onX andδ N (x,ω) ismonotonically decreasingwith respectto N. SinceC(x) iscontin- uous on X, it is obvious that δ N (·,ω) is continuous on X. By Dini’s Theorem [53], δ N (·,ω) converges uniformly to 0 on X, as N →∞. Note that the assumption B5 is practically hard to verify. Alternatively, we can make the following assumption and design the second version of LEON-kNN. 107 (C1) For any ξ ∈Y, F(·,ξ ) is Lipschitz continuous with a common constant L F on X (i.e., |F(x,ξ )− F(x ′ ,ξ )|≤ L F ∥x− x ′ ∥ for all x,x ′ ∈X). The following lemma shows that the kNN estimate (with k N = ⌊N β ⌋) also satisfies the conditions of Theorem 3.12. Lemma 3.16. Let ω be fixed. Let v N ,i(ω) = 1 k N I(ω i ∈S(k N ,ω;{ω} N j=1 )) and k N =⌊N β ⌋, β ∈(0,1). Supposethat assumptionsA1,B1 -B4 ,andC1 are satisfied. Then thefollowing hold: (1) {E[ ˆ f k N ,N (·,ω; ˜ S N )]} N is equicontinuous on X. (2) lim N→∞ |E[ ˆ f k N ,N (·,ω; ˜ S N ]− f(x,ω)|=0 uniformly on X. Proof of (1). For any x,x ′ ∈X, by the Jensen’s inequality (see Theorem 1.6.2 in [29]), |E[ ˆ f k N ,N (x,ω; ˜ S N )]− E[ ˆ f k N ,N (x ′ ,ω; ˜ S N )]| ≤ E h | ˆ f k N ,N (x,ω; ˜ S N )− ˆ f k N ,N (x ′ ,ω; ˜ S N )| i . (3.42) On the other hand, since v N ,i(ω)≥ 0 and P N i=1 v N ,i(ω)=1, | ˆ f k N ,N (x,ω;S N )− ˆ f k N ,N (x ′ ,ω;S N )|≤ N X i=1 v N,i (ω)|F(x,ξ i )− F(x ′ ,ξ i )| ≤ L F ∥x− x ′ ∥. 
(3.43) Hence, the combination of (3.42) and (3.43) implies that |E[ ˆ f k N ,N (x,ω; ˜ S N )]− E[ ˆ f k N ,N (x ′ ,ω; ˜ S N )]|≤ L F ∥x− x ′ ∥. (3.44) Equation (3.44) implies that {E[ ˆ f k N ,N (·,ω; ˜ S N )]} N is Lipschitz continuous with a common Lipschitz constant, L F , which further implies the equicontinuity. 108 Proof of (2). For any x∈X, by theorem 3.2, we have lim N→∞ ˆ f k N ,N (x,ω; ˜ S N )=f(x,ω) a.s. By assumption B4,|F(x,ξ )|≤ M F for all x∈X and ξ ∈S ξ . This implies that | ˆ f k N ,N (x,ω; ˜ S N )|≤ M F a.s. Hence, by the Bounded Convergence Theorem, we have lim N→∞ E[ ˆ f k N ,N (x,ω; ˜ S N )]=f(x,ω). (3.45) Combining (3.45) and equicontinuity of{E[ ˆ f k N ,N (·,ω; ˜ S N )]} N in (1), we conclude that lim N→∞ |E[ ˆ f k N ,N (x,ω; ˜ S N )]− f(x,ω)|=0 uniformly on X. UsingTheorem3.12,Lemma3.15andLemma3.16,weareabletoshowthattheexpected optimality gap from the sequence of estimated solutions generated by LEON-kNN converges to 0, which is summarized in the following corollary. Corollary 3.17 (Asymptotic convergence of LEON-kNN). Let ω be fixed. Let ˜ x q be gen- erated by LEON-kNN, where q stands for the iteration number. Suppose assumptions A1 and B1 - B4 are satisfied. Let l denote the l th iteration of LEON Update. In each iteration of LEON update (process of generating x l ), a new dataset with larger size is generated (i.e. N l+1 ≥ N l +1 for l∈N). If either one of the following condition holds: (1) Let v N ,i(ω) = 1 k I(ω i ∈S(k,ω;{ω} N j=1 )) for a fixed positive integer k. Further suppose that assumption B5 also holds and∥˜ ω∥ is bounded almost surely. 109 (2) Let v N ,i(ω)= 1 k N I(ω i ∈S(k N ,ω;{ω} N j=1 )) and k N =⌊N β ⌋, β ∈(0,1). Further suppose that assumption C1 holds. Then lim q→∞ E[f(˜ x q ,ω)− c ∗ (ω)]=0. Proof. If condition (1) holds, Lemma 3.15 implies that there exists δ N l (x,ω) such that E[ ˆ f k,N l (x,ω; ˜ S N l l )]− f(x,ω) ≤ δ N l (x,ω). 
Since {N l } l is a monotonically increasing sequence and N l → ∞ as l → ∞, Lemma 3.15 also implies that δ N l (·,ω) → 0 uniformly on X, as l → ∞ and moreover, it also implies that δ N l (·,ω) is also monotonically decreasing with respect to N l . Thus, all the conditions in Theorem 3.12 are satisfied, which finishes the proof in condition (1). If condition (2) holds, Lemma 3.16 implies that lim l→∞ E[ ˆ f k N l ,N l (x,ω; ˜ S N l l )]− f(x,ω) =0 uniformly on X. By Theorem 3.12, we can get the desired results. 3.3.4 LEON with Kernel Estimation In this section, we introduce another variation of LEON algorithm, which utilizes Nadaraya- Watson kernel estimator to estimate the subgradient of f(x,ω). Recall that the kernel function is denoted as K. In the literature on kernel methods, h N is known as bandwidth. To ensure the pointwise convergence of kernel estimate, Theorem 3.4 suggests that h N needs to satisfy h N = CN − β , β ∈ (0, 1 nω ). We also note that h N dereases as N increases. The stochastic quasi-gradient of f(x,ω) based on the Nadaraya-Watson kernel estimator is writ- ten as follows, which we refer to as kernel estimate of SQG. ˆ G kernel N (x,ω;S N )≜ P N i=1 G(x,ξ i )K( ω− ω i h N ) P N i=1 K( w− w i h N ) . (3.46) 110 The benchmark function of kernel estimate of SQG is provided below. ˆ f kernel N (x,ω;S N ))= P N i=1 F(x,ξ i )K( ω− ω i h N ) P N i=1 K( ω− ω i h N ) . (3.47) It is worth noting that assumption B5 is not needed in the asymptotic convergence of LEON-kernel. In other words, LEON-kernel should be regarded as a more flexible approach which can solve a relatively broad class of problems. In the analysis of kernel estimates, it is necessary to define 0 0 = 0 (see [110]), because it is possible that P N i=1 K( ω− ω i h N ) = 0 when a kernel with compact support (e.g., Na¨ ıve kernel) is used. With this definition, for any K∈K, we have K( w− w i h N ) P N i=1 K( ω− ω i h N ) ≥ 0∀ i, and X K( w− w i h N ) P N i=1 K( ω− ω i h N ) ≤ 1. 
(3.48) Therefore, equation (3.48) implies that the requirement in assumption B0 is satisfied. Lemma 3.18. Let K ∈ K. If assumption C1 is satisfied, then for any x,x ′ ∈ X, the following holds: | ˆ f kernel N (x,ω;S N )− ˆ f kernel N (x ′ ,ω;S N )|≤ L F ∥x− x ′ ∥. In other words, ˆ f kernel N (·,ω;S N ) is Lipschitz continuous on X with Lipschitz constant L F . Proof. Note that P N i=1 K( ω− ω i h N )≥ 0 and K( ω− ω i h N )≥ 0. There are two cases that we should consider (i.e., P N i=1 K( ω− ω i h N )>0 and P N i=1 K( ω− ω i h N )=0). Case 1: P N i=1 K( ω− ω i h N )>0. By definition, | ˆ f kernel N (x,ω;S N )− ˆ f kernel N (x ′ ,ω;S N )| = P N i=1 F(x,ξ i )K( ω− ω i h N ) P N i=1 K( ω− ω i h N ) − P N i=1 F(x ′ ,ξ i )K( ω− ω i h N ) P N i=1 K( ω− ω i h N ) ≤ P N i=1 K( ω− ω i h N )|F(x,ξ i )− F(x ′ ,ξ i )| P N i=1 K( ω− ω i h N ) (3.49) 111 The last relation holds because of the triangleinequality. By assumptionC1, it follows from (3.49) that P N i=1 K( ω− ω i h N )|F(x,ξ i )− F(x ′ ,ξ i )| P N i=1 K( ω− ω i h N ) ≤ L F ∥x− x ′ ∥. Case 2: P N i=1 K( ω− ω i h N )=0. Then| ˆ f kernel N (x,ω;S N )− ˆ f kernel N (x ′ ,ω;S N )|=0. Lemma 3.19. Let K∈K. If assumption C1 is satisfied, then the following holds: |E[ ˆ f kernel N (x,ω; ˜ S N )]− E[ ˆ f kernel N (x ′ ,ω; ˜ S N )]|≤ L F ∥x− x ′ ∥. Proof. By the Jensen’s inequality (see Theorem 1.6.2 in [29]) , E[ ˆ f kernel N (x,ω; ˜ S N )− ˆ f kernel N (x ′ ,ω; ˜ S N )] ≤ E h | ˆ f kernel N (x,ω; ˜ S N )− ˆ f kernel N (x ′ ,ω; ˜ S N )| i . (3.50) Lemma 3.18 implies that E[| ˆ f kernel N (x,ω; ˜ S N )− ˆ f kernel N (x ′ ,ω; ˜ S N )|]≤ L F ∥x− x ′ ∥. (3.51) By combining (3.50) and (3.51), we have |E[ ˆ f kernel N (x,ω; ˜ S N )− ˆ f kernel N (x ′ ,ω; ˜ S N )]|≤ L F ∥x− x ′ ∥. (3.52) RecallthatTheorem3.4showspointwiseconvergenceofkernelestimate. 
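A sketch of the Nadaraya-Watson benchmark function (3.47) for a scalar predictor, implementing the $\frac{0}{0} := 0$ convention discussed above. The naive kernel shown is one member of $\mathcal{K}$ with compact support, and all names are ours:

```python
def nw_kernel_estimate(omega, data, h, K, F):
    """Kernel benchmark (3.47): sum_i F(x, xi_i) K((omega - omega_i)/h)
    normalized by sum_i K((omega - omega_i)/h); F has the decision x fixed."""
    num = den = 0.0
    for w_i, xi_i in data:
        k = K((omega - w_i) / h)
        num += k * F(xi_i)
        den += k
    return num / den if den > 0 else 0.0  # convention 0/0 := 0, as in [110]

def naive_kernel(u):
    """Naive (uniform) kernel with compact support [-1, 1]."""
    return 1.0 if abs(u) <= 1 else 0.0
```

Replacing `F(xi_i)` with a subgradient evaluation gives the kernel estimate of the SQG in (3.46); when no data point falls within bandwidth $h_N$ of $\omega$, the convention returns 0 rather than dividing by zero.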
Withthehelpof Bounded Convergence Theorem (see Theorem 1.6.7 in [29]), pointwise convergence of kernel estimate, as illustrated in Theorem 3.4, and the boundedness of kernel estimate implies the pointwise convergence of expected kernel estimate, which is summarized in the following Lemma. 112 Lemma 3.20. Let K ∈ K. Suppose that E h |F(x, ˜ ξ )|max{log|F(x, ˜ ξ )|,0} i < ∞ for all x∈ X and the bandwidth h N satisfies that h N = CN − β , β ∈ (0, 1 nω ). If assumptions B1 - B4 and C1 are satisfied, then E[ ˆ f kernel N (x,ω; ˜ S N )] converges to f(x,ω) uniformly on X. Proof. By assumption B4, we know that|F(x,ξ )|≤ M F <∞ for all x∈ X and ξ ∈Y. So it holds that ˆ f kernel N (x,ω; ˜ S N ) ≤ X i=1 K( ω− ˜ ω i h N ) P N j=1 K( ω− ˜ ω i h N ) F(x, ˜ ξ i ) ≤ M F a.s.. By the pointwise convergence of kernel estimate in Theorem 3.4 and the Bounded Conver- gence Theorem, we have lim N→∞ E[ ˆ f kernel N (x,ω; ˜ S N )]=f(x,ω). Lemma 3.19 shows that E[ ˆ f kernel N (·,ω; ˜ S N )] is Lipschitz continuous with a constant L F on X. Since L F does not depend on N, this further implies that {E[ ˆ f kernel N (·,ω; ˜ S N )]} N is Lipschitz continuous on X with a common constant L F and thus {E[ ˆ f kernel N (·,ω; ˜ S N )]} N is equicontinuous on X. Thus, equicontinuity and pointwise convergence of{E[ ˆ f kernel N (·,ω; ˜ S N )]} N implies that lim N→∞ E[ ˆ f kernel N (x,ω; ˜ S N )]=f(x,ω) uniformly on X. Lemma 3.20 also implies that lim N→∞ E[ ˆ f kernel N (x,ω; ˜ S N )]− f(x,ω) =0 uniformly on X. WeendthissectionwiththefollowingresultwhichusesbothLemma3.20andTheorem3.12 toshowthattheexpectedoptimalitygapfromthesequenceofestimatedsolutionsgenerated by LEON-kernel converges to 0. 113 Corollary 3.21 (Asymptotic convergence of LEON-kernel). Let K ∈K and let ω be fixed. Let ˜ x q be generated by LEON-kernel, where q denotes the iteration number. Suppose that E h |F(x, ˜ ξ )|max{log|F(x, ˜ ξ )|,0} i <∞ for all x∈X. Suppose assumptions A1, B1 - B4 and C1 are satisfied. 
The bandwidth h N satisfies that h N = CN − β , β ∈ (0, 1 m 1 ), where m 1 is the dimension of the random vector ˜ ω. Let l denote the l th iteration of NSQG Update. In each iteration of NSQG update (process of generating x l ), the size of the datasets{N l } l increases monotonically. Then the following holds: lim q→∞ E[f(˜ x q ,ω)− c ∗ (ω)]=0. Proof. Since {N l } l is an increasing sequence and N l → ∞ as l → ∞, Lemma 3.20 implies that lim l→∞ E[ ˜ f kernel N l (x,ω; ˜ S N l l )]− f(x,ω) =0 uniformly on X. Thus, all the conditions in Theorem 3.12 are satisfied, which completes the proof. 3.4 Convergence Rate of LEON Inspired by [44], we shall analyze the convergence rate of the quantity below. Z E[f(˜ x q (ω),ω)− c ∗ (ω)]µ ω (dω). (3.53) The quantity in (3.53) measures the average performance of LEON algorithm with respect to the distribution of the predictor, ˜ ω. We shall make the following assumptions before we proceed. (D1) Var(F(x, ˜ ξ )|˜ ω =ω)≤ σ 2 for all x∈X and ω∈X. (D2) |f(x,ω 1 )− f(x,ω 2 )|≤ C∥ω 1 − ω 2 ∥ for all ω 1 ,ω 2 ∈X. 114 Recall that m 1 is the dimension of the random vector ˜ ω. Lemma 3.22. Let k N = N 2 m 1 +2 . Suppose assumptions D1 and D2 hold. Further suppose that m 1 ≥ 3 and ˜ ω is bounded. Then R E[ R ( ˆ f k N ,N (x,ω; ˜ S N )− f(x,ω)) 2 µ ω (dω)]≤ (σ 2 +c 1 C 2 )N − 2 m 1 +2 for all x∈ X, where c 1 is a constant only depending on ˜ ω. Furthermore, R E[| ˆ f k N ,N (x,ω; ˜ S N )− f(x,ω)|]µ ω (dω)≤ (σ 2 +c 1 C 2 ) 1 2 N − 1 m 1 +2 for all x∈X. Proof. By Theorem 6.2 in [44], we have E[ Z ( ˆ f k N ,N (x,ω;{(˜ ω i , ˜ ξ i )} N i=1 )− f(x,ω)) 2 µ ω (dω)]≤ σ 2 k N +c 1 C 2 ( k N N ) 2/m 1 for all x∈X. (3.54) Plugging k N =N 2 m 1 +2 into (3.54), we get E[ Z ( ˆ f k N ,N (x,ω; ˜ S N )− f(x,ω)) 2 µ ω (dω)]≤ (σ 2 +c 1 C 2 )N − 2 m 1 +2 for all x∈X. 
(3.55) Note that ( Z E[| ˆ f k N ,N (x,ω; ˜ S N )− f(x,ω)|]µ ω (dω)) 2 ≤ Z E[( ˆ f k N ,N (x,ω; ˜ S N )− f(x,ω)) 2 ]µ ω (dω) =E[ Z ( ˆ f k N ,N (x,ω; ˜ S N )− f(x,ω)) 2 µ ω (dω)], (3.56) wherethefirstinequalityof(3.56)holdsbasedonJensen’sinequalityandthesecondequality of (3.56) is based on Fubini’s theorem. Hence, the combination of (3.55) and (3.56) implies that R E[| ˆ f k N ,N (x,ω; ˜ S N )− f(x,ω)|]µ ω (dω)≤ (σ 2 +c 1 C 2 ) 1 2 N − 1 m 1 +2 for all x∈X. Lemma 3.23. Let h N = N − 1 m 1 +2 and let na¨ ıve kernel be used. Suppose assumptions D1 and D2 hold. Further assume that ˜ ω has a compact supportX. Then for all x∈X, we have Z E[( ˆ f kernel N (x,ω; ˜ S N )− f(x,ω)) 2 µ ω (dω)]≤ c 2 N − 2 m 1 +2 , 115 where c 2 = ˆ c(σ 2 +sup x∈X,ω ′ ∈X |f(x,ω ′ )| 2 )+C 2 and ˆ c only depends on the diameter of X and m 1 . Furthermore, for all x∈X, we have Z E[| ˆ f kernel N (x,ω; ˜ S N )− f(x,ω)|]µ ω (dω)≤ c 1 2 2 N − 1 m 1 +2 . Proof. By Theorem 5.2 in [44], we have E[ Z ( ˆ f kernel N (x,ω; ˜ S N )− f(x,ω)) 2 µ ω (dω)]≤ ˆ c σ 2 +sup ω ′ ∈Sω |f(x,ω ′ )| 2 Nh m 1 N +C 2 h 2 N , (3.57) Plugging h N =N − 1 m 1 +2 into (3.57), we have E[ Z ( ˆ f kernel N (x,ω; ˜ S N )− f(x,ω)) 2 µ ω (dω)] ≤ ˆ c σ 2 +sup x∈X,ω ′ ∈Sω |f(x,ω ′ )| 2 Nh m 1 N +C 2 h 2 N = ˆ c(σ 2 + sup x∈X,ω ′ ∈Sω |f(x,ω ′ )| 2 )+C 2 N − 2 m 1 +2 (3.58) Then one can mimic the proof of Lemma 3.22 to finish the rest of the proof. By the Jensen’s inequality, we have |E[ ˆ f kernel N (x,ω; ˜ S N )]− f(x,ω)|≤ E[| ˆ f kernel N (x,ω; ˜ S N )− f(x,ω)|]. For the following analysis, we let δ N (x,ω)=E[| ˆ f kernel N (x,ω; ˜ S N )− f(x,ω)|]. Theorem 3.24. Let k N = N 2 m 1 +2 . Let ˜ x q (ω) be generated by LEON-kNN conditioned on the event that ˜ ω = ω. Suppose that assumptions A1, B1 - B4, C1, and D1 - D2 hold. Further suppose that the bias is homogeneous over X (i.e., δ N (x 1 ,ω) = δ N (x 2 ,ω) for any x 1 ,x 2 ∈X). 
Then the following holds: Z E[f(˜ x q (ω),ω)− c ∗ (ω)]µ ω (dω)≤ (σ 2 +c 1 C 2 ) 1 2 ( (2n+(q− 1)qm) 2 ∆) − 1 m 1 +2 + 2D X M G √ mq . 116 Proof. By the assumption that the bias is homogeneous over X, we have A l = sup x∈X {δ N l (x,ω)}=δ N l (x 0 ,ω), for some x 0 ∈X. By the proof of Theorem 3.12, we have E[f(˜ x q (ω),ω)− c ∗ (ω))]≤ 2 jq X l=iq v l A l + 2D X M G √ mq =2 jq X l=iq v l δ N l (x 0 ,ω)+ 2D X M G √ mq (3.59) By Lemma 3.22, we have Z δ N l (x 0 ,ω)µ ω (dω)≤ (σ 2 +c 1 C 2 ) 1 2 N l − 1 m 1 +2 (3.60) Note that right-hand side of (3.60) is a decreasing function of N l . since P jq l=iq v l = 1, we have Z jq X l=iq v l δ N l (x 0 ,ω)µ ω (dω)≤ (σ 2 +c 1 C 2 ) 1 2 N iq − 1 m 1 +2 . (3.61) By the construction of N iq in algorithm 15, N iq = 2n+(q− 1)qm 2 ∆, where n, m, and ∆ are positive integers. Hence, the combination of 3.59 and 3.61 implies that Z E[f(˜ x q (ω),ω)− c ∗ (ω))]µ ω (dω)≤ (σ 2 +c 1 C 2 ) 1 2 ( (2n+(q− 1)qm) 2 ∆) − 1 m 1 +2 + 2D X M G √ mq . (3.62) Equation (3.62) implies that R E[f(˜ x q (ω),ω)− c ∗ (ω))]µ ω (dω) is in the order of O((2n+(q− 1)qm) − 1 m 1 +2 +q − 1 2 ). Theorem 3.25. Let h N = N − 1 m 1 +2 and let na¨ ıve kernel be used. Let ˜ x q (ω) be generated by LEON-kernel conditioned on the event that ˜ ω =ω. Suppose that assumptionsA1,B1 -B4, C1, and D1 - D2. Assume that ˜ ω has a compact support S ω . Further suppose that the bias 117 is homogeneous over X (i.e., δ N (x 1 ,ω)=δ N (x 2 ,ω) for any x 1 ,x 2 ∈X). Then the following holds: Z E[f(˜ x q (ω),ω)− c ∗ (ω)]µ ω (dω)≤ c 1 2 2 ( 2n+(q− 1)qm 2 ∆) − 1 m 1 +2 + 2D X M G √ mq , where c 2 is defined in Lemma 3.23. Proof. The proof is similar to the proof of Theorem 3.24. By the proof of Thoerem 3.12 and Lemma 3.23, we have Z E[f(˜ x q (ω),ω)− c ∗ (ω)]µ ω (dω)≤ Z jq X l=iq v l δ N l (x 0 ,ω)µ ω (dω)+ 2D X M G √ mq ≤ c 1 2 2 ( 2n+(q− 1)qm 2 ∆) − 1 nω+2 + 2D X M G √ mq . 
(3.63)

3.5 Computational Experiments

In this section, we report computational results utilizing the LEON algorithm to solve various PSPs, namely, a predictive newsvendor problem, a two-stage predictive shipment planning problem from [12], and a two-stage predictive multi-product inventory planning problem (an extension of the problem in [6] by introducing covariates). The last two instances will be presented together under the banner of two-stage predictive SLP. The sample sizes required for reasonable accuracy of non-parametric estimation are known to be notoriously large. It should come as no surprise that the number of samples used in the computational experiments is relatively large compared to ordinary SP, where the distributions are assumed to be known.

3.5.1 Predictive Newsvendor Problem

The mathematical formulation of the predictive newsvendor problem can be seen in (3.4). We use the same model parameters that are used to test SA/Robust SA in the counterexample. Since the feasible region is required to be bounded, we introduce an additional upper-bounding constraint $x \le 100$. This ensures that $D_X = \max_{x \in X} \|x - x_0\|$ is meaningful while the optimal solution remains in the interior of $X$. Because the analytical optimal solution is available, we let Avg(d) denote the average distance to the optimal solution point and let Dev(d) denote the standard deviation of the distances to the optimal solution point. We present two versions of the LEON algorithm to illustrate the impact of non-parametric estimation. In LEON-pure-average, the estimated subgradient for the $l$-th iteration of the NSQG update is $\hat G_N(x_l,\omega;S^{N_l}_l) = \frac{1}{N_l}\sum_{i=1}^{N_l} G(x_l,\xi^l_i)$, whereas LEON-kNN ($k = \lfloor N^\beta \rfloor$) adopts the NSQG calculation recommended by Corollary 3.17(2). To compare the sample efficiency of LEON-kNN ($k = \lfloor N^\beta \rfloor$) with SA-Batch and Robust SA-Batch, we plot the curve of Avg(|d|) versus sample size for each algorithm in Figure 3.3, which includes the plots from Figure 3.1 for the purpose of illustration.
Each algorithm is replicated 20 times. The curve of LEON-kNN ($k = \lfloor N^\beta \rfloor$) shows that it starts outperforming SA-Batch, Robust SA-Batch, and LEON-pure-average after 2,000 samples, and its solution quality keeps improving as the sample size increases. On the other hand, Figure 3.3 suggests that SA-Batch, Robust SA-Batch, and LEON-pure-average converge to a point around 2.5. We find that the sample efficiency of LEON with other variations is similar to LEON-kNN ($k = \lfloor N^\beta \rfloor$), so we choose not to plot the curve of each variation of the LEON algorithm in the following figure. Instead, we summarize the results of each algorithm in Table 3.1. For comparison, we also present the numerics of SA-Batch and Robust SA-Batch which were used to create Figure 3.1. Although the bias of non-parametric estimation still exists for a finite number of samples, Table 3.1 clearly suggests that the LEON algorithm produces solutions that are within one percent of optimality.

Algorithm | N | Avg(|d|) | Avg(|d|)/|x*| | Dev(|d|) | Dev(|d|)/|x*|
SA-Batch | 209,700 | 2.487 | 6.87% | 0.076 | 0.21%
Robust SA-Batch | 209,700 | 2.548 | 7.04% | 0.060 | 0.17%
kNN | 209,700 | 0.576 | 1.59% | 0.480 | 1.33%
kNN (k_N = ⌊N^β⌋) | 209,700 | 0.331 | 0.91% | 0.263 | 0.73%
Naive | 209,700 | 0.550 | 1.52% | 0.435 | 1.20%
Epanechnikov | 209,700 | 0.549 | 1.52% | 0.442 | 1.22%
Quartic | 209,700 | 0.704 | 1.94% | 0.476 | 1.32%
Gaussian | 209,700 | 0.261 | 0.72% | 0.285 | 0.79%

Table 3.1: Computational Results of the LEON Algorithm and (Robust) SA in the Predictive Newsvendor Problem. (x* = 36.1975)

Figure 3.3: Predictive Newsvendor Problem: LEON-kNN (k = ⌊N^β⌋)

3.5.2 Computational Results with Two-Stage Predictive SLP

A two-stage predictive stochastic linear program can be formulated as

$$\min_x \left\{ c_1^\top x + \mathbb{E}[Q(x,\tilde\xi)\,|\,\tilde\omega=\omega] : A_1 x = b_1,\; x \ge 0 \right\},$$

where the conditional expectation depends on the realization of $\tilde\omega$, and $Q(x,\xi)$ is the optimal cost of the second-stage problem: $Q(x,\xi) = \min_y \{ c_2^\top y : A_2 y + B_2 x = b_2(\xi),\; y \ge 0 \}$. Here, $c_1$, $c_2$, and $b_1$ are deterministic column vectors. $A_1$, $A_2$, and $B_2$ are deterministic matrices.
b 2 ( ˜ ξ )isstochasticcolumn vectorwhichdependson ˜ ξ . Todrawaconnectionwiththestandardproblem(3.2),notethat F(x,ξ )=c ⊤ 1 x+Q(x,ξ ), f(x,ω)=c ⊤ 1 x+E[Q(x, ˜ ξ )|˜ ω =ω], and X ={x:A 1 x=b 1 ,x≥ 0}. 3.5.2.1 Predictive Multi-product Inventory Planning Multi-product inventory modeling [6] can be formulated as a two-stage SLP [98]. With a slight modification of the random process to include covariates, we convert the ordinary BAA99 SLP to BAA99-(25,50) which includes 25 predictors and 50 responses. The valida- tion process that we adopt is based on the covariates, which we refer to as distribution-free validation approach. In other words, both the LEON algorithm and the distribution-free validation approach use no knowledge of the distribution. In BAA99-(25,50), the predictor and demand (i.e., response) altogether follows a mul- tivariate normal distribution with mean vector, µ , and covariance matrix, Σ (i.e., ( ω,ξ )∼ N(µ, Σ)). Moreover, the scale of this experiment is much larger than the instance in [98]. In the distribution-free validation approach, we utilize kNN to estimate the true cost of a given solution. Although such an approach will inevitably incur some bias in measuring solution quality,thisapproachisavailableinpracticeanditsperformanceisensuredbythepointwise convergence of kNN estimator [109]. To initiate the distribution-free validation approach for 121 BAA99-(25,50), we generate 10 6 predictor-response data pairs from unconditional distribu- tion and find 10 3 nearest neighbors of the observed predictor. As a result, the associated estimated optimal cost (i.e., θ ∗ c ) is -14,256.5. In BAA99-(25,50), we fix the total number of samples to 134 ,750 for each variation of LEON algorithm. The observed predictor is ω i =108− i 3 , i=1,2,...,25. Again, we test on each non-parametric estimation variation of LEON algorithm and replicate each instance 10 times. 
The results are summarized in Table 3.2, which validates the computational performance of the LEON algorithm. We note that LEON with the Gaussian kernel does not provide a solution within 30 minutes for BAA99-(25,50). Perhaps this is because it needs to solve every sampled problem regardless of its weight. On the other hand, the other LEON variations only solve the sampled problems whose distance to the observed predictor is smaller than a certain bandwidth.

Algorithm | N | Avg(Obj) | |Avg(Obj)-θ*_c|/|θ*_c| | Dev(Obj) | Dev(Obj)/|θ*_c|
kNN | 134,750 | -14250.15 | 0.04% | 2.959 | 0.02%
kNN ($k_N=\lfloor N^\beta\rfloor$) | 134,750 | -14251.25 | 0.04% | 2.972 | 0.02%
Naïve | 134,750 | -14249.69 | 0.05% | 4.167 | 0.03%
Quartic | 134,750 | -14246.46 | 0.07% | 7.039 | 0.05%
Epanechnikov | 134,750 | -14248.29 | 0.06% | 6.011 | 0.04%
Gaussian | — Failed to converge in 30 minutes —

Table 3.2: Computational Results of the LEON Algorithm in BAA99-(25,50) ($\tilde\gamma_q=\check C/\sqrt{m_q}$)

3.5.2.2 Two-Stage Predictive Shipment Planning

Here, we directly utilize a two-stage predictive shipment planning problem from [12] to study the computational performance of the LEON algorithm. The associated model parameters and data generation process can be found in [12]. We follow the overall design of their experiments. However, the realization of the data presented to the LEON algorithm is not necessarily identical to the experiment of [12]. Bertsimas and Kallus in [12] present an idealized instance in which the true conditional distribution is available although it is not used for optimization purposes. Nevertheless, they do use the true conditional distribution to estimate the quality of the decision in an idealized setting. We shall use their validation approach to measure the quality of the solutions obtained by the LEON algorithm, and we refer to this process as an idealized validation approach. In the experiment, the observed predictor is set to $\omega=(-0.3626,\,0.5871,\,-0.2987)$.
In the idealized validation approach, we generate $10^5$ predictor-response pairs from the true conditional distribution to create a validation set. The estimated cost of $\hat x$ is $\theta_v(\hat x)=\frac{1}{N_v}\sum_{i=1}^{N_v}F(\hat x,\xi^\omega_i)$, and $\theta^*=\min_{x\in X}\frac{1}{N_v}\sum_{i=1}^{N_v}F(x,\xi^\omega_i)$ is the estimated "optimal" cost of the perfect-foresight problem. In particular, $\theta^*$ is 295.792. Because the validation data is generated by assuming the known conditional distribution, the value of $\theta^*$ defined above must agree with that obtained by the non-parametric scheme of [12]. In order to give the reader a sense of how sensitive the method is to sample sizes, we test each algorithm with two different sample sizes, as shown in Tables 3.3 (smaller sample size) and 3.4 (larger sample size). For each variation of LEON, we replicate the experiments 10 times and report the average performance below.

Algorithm | N | Avg(Obj) | |Avg(Obj)-θ*|/|θ*| | Dev(Obj) | Dev(Obj)/|θ*|
kNN | 209,700 | 296.6567 | 0.29% | 0.9453 | 0.32%
kNN ($k_N=\lfloor N^\beta\rfloor$) | 209,700 | 296.2448 | 0.15% | 0.4891 | 0.17%
Naïve | 209,700 | 296.5802 | 0.27% | 0.1541 | 0.05%
Epanechnikov | 209,700 | 296.2298 | 0.15% | 0.1086 | 0.04%
Quartic | 209,700 | 296.0987 | 0.10% | 0.0882 | 0.03%
Gaussian | 209,700 | 296.3892 | 0.20% | 0.1702 | 0.06%

Table 3.3: Computational Results for BK19

Algorithm | N | Avg(Obj) | |Avg(Obj)-θ*|/|θ*| | Dev(Obj) | Dev(Obj)/|θ*|
kNN | 603,225 | 296.2767 | 0.16% | 0.5311 | 0.18%
kNN ($k_N=\lfloor N^\beta\rfloor$) | 603,225 | 295.9441 | 0.05% | 0.1537 | 0.05%
Naïve | 603,225 | 296.3386 | 0.18% | 0.0504 | 0.02%
Epanechnikov | 603,225 | 296.0758 | 0.10% | 0.0297 | 0.01%
Quartic | 603,225 | 295.9751 | 0.06% | 0.0393 | 0.01%
Gaussian | 603,225 | 296.0966 | 0.10% | 0.0456 | 0.02%

Table 3.4: Computational Results with Larger Sample Size for BK19

The data in Tables 3.3 and 3.4 present the average solution quality and the stability of the LEON algorithms as estimated by the idealized validation approach. The first column of each table shows the non-parametric estimation method used for weight function calculation.
The column Avg(Obj) shows the average solution quality in terms of the average estimated cost of the solution (the lower the better), and the column Dev(Obj) shows the standard deviation of the solution qualities. Note that as we increase the sample size, Table 3.4 shows that the values in the column Dev(Obj) decrease, implying greater stability. In summary, all of the tables above validate the performance of the LEON algorithm.

3.6 Remarks

In this chapter, we present an assessment of the contributions which were mentioned in the Introduction. a) We design the LEON algorithm, which adopts a simultaneous estimation-optimization process, including streaming data in the form of covariates. Under mild assumptions, we have shown Lemma 3.11 (Contraction in NSQG Update), which is used to prove the asymptotic convergence of LEON (Theorem 3.12) under multiple structures (e.g., LEON-kNN). Since the entire framework is parametrized by $\omega$, it needs to be highlighted that the convergence of LEON is consistent with $\mu_\omega$ for almost every $\omega\in\mathbb{R}^{n_\omega}$. With a further assumption of Hölder continuity of the true objective function $f(x,\omega)$, we prove the asymptotic convergence of LEON-kNN (Corollary 3.17) with fixed $k$. By assuming uniform Lipschitz continuity of $F(x,\xi)$ (assumption C1) instead of assumption B5, we prove the asymptotic convergence of LEON-kernel (Corollary 3.21). b) The convergence result is predicated on diminishing both the statistical estimation bias and the systematic optimality gap. This is done by iteratively producing an average of estimated solutions from NSQG updates with an increasing number of data points (i.e., more estimated solutions and more predictor-response data). This coordinated effort is the key to the asymptotic results presented in this paper. c) Our results have been developed to apply to kNN and a variety of kernels (Naïve, Quartic, Epanechnikov and Gaussian) under a common umbrella.
For each NSQG update of the LEON algorithm, an NSQG of $f(x,\omega)$ is calculated by using kNN or a kernel method, and then the estimated solution is updated based on that NSQG. We show that the finite-time bias of the NSQG can be overcome by successively increasing the sample size, which allows us to establish the optimality (in terms of the expected optimality gap) of solutions produced by the LEON algorithm. To the best of our knowledge, this is the first result for this simultaneous estimation and optimization procedure using non-parametric methods. d) The numerical performance of the LEON algorithm was verified using the well-known two-stage shipment planning problem. Moreover, we presented a distribution-free approach to validate the solutions provided by the algorithm for a case in which the true conditional distribution is unknown. The test instance that we use is an extension of BAA99 [6], in which the randomness is expanded to allow covariates, including a relatively high-dimensional dataset (i.e., 25 predictors and 50 responses).

While kernel approximations have been used to approximate dynamic programming value functions, we are not aware of SP algorithms which have imported kernel and other non-parametric estimation methods as an integral part of the optimization algorithm. We expect this combination of estimation and optimization to be the fundamental step for distribution-free algorithms in SP with covariates. Is this the start of a new era of "estimization" (estimation + optimization)?

Chapter 4

Non-parametric Stochastic Decomposition for Two-Stage Predictive Stochastic Programming

4.1 Overview

This chapter presents a mathematical fusion of non-parametric estimation and Stochastic Decomposition (SD) algorithms, which we refer to as Non-parametric SD, to solve two-stage Predictive Stochastic Programming problems. This permits simultaneous updates of the expected value objective, as well as first-stage decisions, by sequentially computing minorants of the kNN estimate of the objective functions.
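The minorants mentioned above combine into a "max of affine functions" outer approximation; a minimal sketch with made-up cut data (in SD-style methods the cuts would come from second-stage dual solutions):

```python
import numpy as np

def outer_approx(cuts, x):
    """Evaluate a piecewise-linear outer approximation
    max_i { alpha_i + beta_i . x } built from a list of affine
    minorants (cuts), each stored as a pair (alpha_i, beta_i)."""
    return max(alpha + float(np.dot(beta, x)) for alpha, beta in cuts)

# two illustrative cuts: 0 + x and 1 - x
cuts = [(0.0, np.array([1.0])), (1.0, np.array([-1.0]))]
val = outer_approx(cuts, np.array([2.0]))  # max(2, -1) = 2
```

Each new iteration would append a cut to the list, tightening the approximation from below.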
Following Chapter 3, we consider data in which the underlying uncertainty is available as predictors and responses. Mathematically speaking, we consider that the tuple $(\tilde\omega,\tilde\xi):\Omega\mapsto\mathcal{X}\times\mathcal{Y}$ is a random variable taking values in the product space $(\mathcal{X}\times\mathcal{Y},\,\Sigma_{\mathcal{X}}\otimes\Sigma_{\mathcal{Y}})$. Let the joint distribution of $(\tilde\omega,\tilde\xi)$ be $\mu_{\tilde\omega,\tilde\xi}$. Let $\mu_{\tilde\omega}$ and $\mu_{\tilde\xi}$ denote the marginal distribution functions of $\tilde\omega$ and $\tilde\xi$, respectively. Let $\mu_{\tilde\xi|\omega}$ denote the conditional distribution of $\tilde\xi$ given $\tilde\omega=\omega$. The goal here is to design a data-driven decomposition-based algorithm for solving the following two-stage predictive stochastic programming problem:

$\min_{x\in X}\ f(x)+\mathbb{E}_{\tilde\xi\sim\mu_{\tilde\xi|\omega}}[h(x,\tilde\xi)\,|\,\tilde\omega=\omega]$   (4.1)

where $\mu_{\tilde\xi|\omega}$ is the regular conditional distribution of $\tilde\xi$ given $\tilde\omega=\omega$, $f(x)$ is a linear/quadratic function, and $h(x,\xi)$ is the minimum value of the second-stage linear/quadratic programming problem for a given $(x,\xi)$. $\tilde\omega$ and $\tilde\xi$ are two correlated random variables. For instance, $\tilde\omega$ may denote an attribute of firm-generated tweets (e.g., sentiment) and the associated $\tilde\xi$ may denote the stock price ([59]). As another example, $\tilde\omega$ may denote a vector of temperature and humidity, while $\tilde\xi$ may denote the precipitation ([83]). The word "predictive" comes from the conditional expectation with respect to the observation of the predictor. The model formulation in (4.1) can be regarded as a two-stage realization of the generic problem below:

$\min_{x\in X}\ \mathbb{E}_{\tilde\xi\sim\mu_{\tilde\xi|\omega}}[F(x,\tilde\xi)\,|\,\tilde\omega=\omega]$.   (4.2)

Problem (4.2) is known to be effective in capturing the evolution of a data process (see [12] and [55] for details), which turns out to enhance decision-making due to its predictive power.
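As a purely illustrative instance of (4.2) (not one of the test problems used later), a predictive newsvendor with unit order cost $c$ and selling price $p$ reads

```latex
\min_{x \ge 0} \;
\mathbb{E}_{\tilde{\xi} \sim \mu_{\tilde{\xi}\mid\omega}}
\big[\, c\,x - p \min(x, \tilde{\xi}\,) \;\big|\; \tilde{\omega} = \omega \big],
```

so that $F(x,\xi)=c\,x-p\min(x,\xi)$ and the demand $\tilde\xi$ is predicted through the observed covariate $\omega$.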
As opposed to standard distribution-driven stochastic programming algorithms (i.e., those that require full knowledge of the distribution, and the distribution must be fixed), the LEON algorithm is a data-driven first-order stochastic programming algorithm that utilizes non-parametric estimation to adapt to the evolution of the data process. Nonetheless, it has two shortcomings: (1) Since a new dataset with an increasing sample size needs to be generated per iteration, the LEON algorithm requires a tremendously large number of samples to maintain good solution quality. (2) The convergence result of LEON only shows that the expected optimality gap converges to zero, which is not sufficient to imply the convergence of solutions. In this chapter, we aim to design a data-driven decomposition-based algorithm that exploits the structure of the two-stage optimization problem (e.g., the finiteness of the second-stage dual extreme points) while overcoming the challenges raised above. The contributions of this paper are two-fold: (1) We propose a decomposition-based algorithm that merges the power of the SD algorithm and k nearest neighbors estimation. To the best of our knowledge, this is the first decomposition-based algorithm that uses non-parametric estimation to attain predictive capability. (2) The proposed algorithm overcomes the issues of large sample size and weaker convergence results that the LEON algorithm possesses, which is validated both theoretically and numerically. This chapter is organized as follows. In section 4.3.1, we present the algorithm design of Stochastic Decomposition with k nearest neighbors estimation (SD-kNN) to solve two-stage predictive stochastic linear programming problems. The convergence analysis of the SD-kNN algorithm is provided in section 4.3.3. In section 4.4, we extend the SD-kNN algorithm to solve two-stage predictive stochastic quadratic-quadratic programming problems. Finally, we demonstrate the computational performance of the SD-kNN and SD-kNN-QQ algorithms in section 4.5.
4.2 Preliminary

We use the Euclidean distance to measure the distance between two predictors. We let $\|\cdot\|$ denote the Euclidean norm of a vector and the spectral norm of a matrix. Since the true objective function in (4.2) is defined as a function of the decision $x$ and the predictor $\omega$, we define the directional derivative of such a class of functions at $x\in\mathbb{R}^n$ in the direction of $d\in\mathbb{R}^n$ for a given parameter $\omega\in\mathcal{X}$.

Definition 4.1. Let $\zeta(x,\omega): S\times\mathcal{X}\mapsto\mathbb{R}$, where $S$ is a convex subset of $\mathbb{R}^n$ and $\mathcal{X}$ is a Borel subset of $\mathbb{R}^{n_\omega}$. For a given $\omega\in\mathcal{X}$, suppose that the directional derivative of $\zeta(\cdot,\omega)$ at $x\in X$ in the direction of $d\in\mathbb{R}^n$ exists; it is defined as

$\zeta'(x,\omega;d)=\lim_{\tau\downarrow 0}\dfrac{\zeta(x+\tau d,\omega)-\zeta(x,\omega)}{\tau}$.

4.3 Two-Stage Predictive SLP

4.3.1 Problem Formulation

Consider the following two-stage predictive stochastic programming problem:

$\min_x\ f(x,\omega)=c^\top x+\mathbb{E}_{\tilde\xi\sim\mu_{\tilde\xi|\omega}}[h(x,\tilde\xi)\,|\,\tilde\omega=\omega]$ s.t. $x\in X\triangleq\{x\in\mathbb{R}^n: Ax\le b\}$,   (4.3)

where $h(x,\xi)=\min\{d^\top y : Dy=e(\xi)-C(\xi)x,\ y\ge 0\}$.

(A1) $X$ is a compact and convex subset of $\mathbb{R}^n$.
(A2) The second-stage subproblem satisfies the relatively complete recourse property.
(A3) (Regularity of the probability space) The set of all possible outcomes of $\tilde\xi$, $\mathcal{Y}$, is compact. $\mathcal{Y}$ is a Borel subset of a complete separable metric space and $\Sigma_{\mathcal{Y}}$ is the $\sigma$-algebra generated by $\mathcal{Y}$. $\mathcal{X}$ is a Borel subset of a complete separable metric space and $\Sigma_{\mathcal{X}}$ is the $\sigma$-algebra generated by $\mathcal{X}$. Let $S=\mathcal{X}\times\mathcal{Y}$ and let $\Sigma_S$ be the $\sigma$-algebra generated by $S$. $S$ is a Borel subset of a complete separable metric space. Furthermore, the joint distribution function $\mu_{\tilde\omega,\tilde\xi}$ exists and the conditional distribution of $\tilde\xi$ given $\tilde\omega=\omega$ is defined by (3.8).
(A4) For any $\xi\in\mathcal{Y}$, $P(\|\tilde\xi-\xi\|<\delta)>0$ for any $\delta>0$.
(A5) $(\tilde\omega_1,\tilde\xi_1),(\tilde\omega_2,\tilde\xi_2),\ldots$ are i.i.d. random variables.
(A6) There exists $M_h<\infty$ such that $0\le h(x,\xi)\le M_h$ for all $x\in X$ and almost every $\xi\in\mathcal{Y}$.
(A7) There exists $M_C\in(0,\infty)$ such that $\|C(\xi)\|\le M_C$ for almost every $\xi\in\mathcal{Y}$.
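Definition 4.1 can be checked numerically with a one-sided difference quotient; the step size `tau` and the test function below are illustrative assumptions, not part of the text:

```python
def directional_derivative(zeta, x, omega, d, tau=1e-6):
    """One-sided difference approximation of Definition 4.1:
    zeta'(x, omega; d) ~ (zeta(x + tau*d, omega) - zeta(x, omega)) / tau,
    approximating the limit tau -> 0 from above."""
    x_step = [xi + tau * di for xi, di in zip(x, d)]
    return (zeta(x_step, omega) - zeta(x, omega)) / tau

# zeta(x, omega) = omega * |x_1| is nonsmooth at x_1 = 0, yet its
# one-sided directional derivative there in direction d = (1,) is omega.
val = directional_derivative(lambda x, w: w * abs(x[0]), [0.0], 2.0, [1.0])
```

The one-sided limit is what makes the definition usable for the piecewise-linear recourse functions that appear later.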
For a sample size equal to $l$, the $k_l$ nearest neighbor estimate of $f(x,\tilde\omega)$ is

$\hat f^{k_l}_l(x,\omega;S_l)\triangleq c^\top x+\dfrac{1}{k_l}\sum_{i=1}^{l}\mathbb{I}\big(\omega_i\in S(k_l,\omega;\{\omega_j\}_{j=1}^{l})\big)\,h(x,\xi_i)$,

where $S_l=\{(\omega_i,\xi_i)\}_{i=1}^{l}$ is the data and $S(k_l,\omega;\{\omega_j\}_{j=1}^{l})$ is the set of $k_l$ nearest neighbors of $\omega$ from the sample set $\{\omega_j\}_{j=1}^{l}$. An alternative representation of $\hat f^{k_l}_l$ uses order statistics:

$\hat f^{k_l}_l(x,\omega;S_l)=c^\top x+\dfrac{1}{k_l}\sum_{i=1}^{k_l}h(x,\xi^l_{[i]})$,

where the tuple $(\omega^l_{[j]},\xi^l_{[j]})$ has $\omega^l_{[j]}$ as the $j$-th nearest neighbor of $\omega$ from the set $\{\omega_j\}_{j=1}^{l}$. Since this non-parametric extension of the SD algorithm makes use of k nearest neighbors estimation, we shall refer to it as the SD-kNN algorithm.

4.3.2 Algorithm Design

Recall that in the LEON algorithm, we generate a dataset of covariates that is independent of the past to calculate an NSQG. As a result, a previously calculated NSQG is never reused in the future. On the other hand, SD-kNN uses an incremental-sampling scheme in the sense that it uses all the generated data to calculate a new NSQG. Moreover, it uses previously calculated NSQGs together with current NSQGs to construct an outer approximation of the objective function. Similar to SD, SD-kNN also keeps a record of captured second-stage duals to accelerate the computation of NSQGs. Compared to SD, the main differences of SD-kNN lie in the minorant construction and the minorant updates, which are presented in the following algorithm design. Namely, the minorant construction is achieved by the kNN estimation method, and we propose a new minorant update rule to ensure that the updated old minorants remain lower bounds of the new kNN estimate of the objective function. Note that in the $l$-th iteration of the SD-kNN algorithm, $\hat f^{k_l}_l(x,\omega;S_l)$ is used as the kNN estimate of the true objective function in (4.3). The proof of the valid minorant update is provided in the following proposition.
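A minimal sketch of the estimate $\hat f^{k_l}_l$ defined above, with a caller-supplied function `h` standing in for the second-stage LP value:

```python
import numpy as np

def f_hat(x, omega, omegas, xis, h, c, beta=0.5):
    """kNN estimate of the objective: c.x plus the average of
    h(x, xi) over the k_l = floor(l**beta) responses whose
    predictors are nearest to omega.  `h` is a placeholder for
    the second-stage LP value; any callable works here."""
    l = len(omegas)
    k = max(1, int(np.floor(l ** beta)))
    nearest = np.argsort(np.linalg.norm(omegas - omega, axis=1))[:k]
    return float(np.dot(c, x)) + np.mean([h(x, xis[i]) for i in nearest])

omegas = np.array([[0.0], [1.0], [2.0], [3.0]])
xis = np.array([10.0, 20.0, 30.0, 40.0])
# l = 4, beta = 0.5 -> k_l = 2; the two predictors nearest to 0 carry
# responses 10 and 20, so with h(x, xi) = xi and c = 0 the estimate is 15.
val = f_hat(np.zeros(1), np.array([0.0]), omegas, xis,
            h=lambda x, xi: xi, c=np.zeros(1))
```

This is the order-statistics form of the estimator: only the $k_l$ nearest responses receive (equal) weight.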
In SD-kNN, we use a tie-breaking-by-indices policy to resolve ties when finding the $k_l$ nearest neighbors. That is, if $k_l=k_{l+1}$ and $\|\omega^l_{[k_l]}-\omega\|=\|\omega_{l+1}-\omega\|$, then $\omega_{l+1}$ will not be included in the $k_{l+1}$ nearest neighbor set.

Proposition 4.2. Suppose that assumptions A1-A7 hold. Let $\mathcal{C}^l_i(x,\omega)=\alpha^l_i+(\beta^l_i)^\top x$, $i=1,2,\ldots,l$, be the affine function that is first generated at iteration $i$ and updated at iteration $l$. Let $k_l=\lfloor l^\beta\rfloor$ and $\beta\in(0,1)$. Suppose that $\hat f^{k_l}_l(x,\omega;S_l)\ge c^\top x+\mathcal{C}^l_i(x,\omega)$ for all $i=1,2,\ldots,l$. Then the following hold:

a. If $k_{l+1}=k_l$, then $\hat f^{k_{l+1}}_{l+1}(x,\omega;S_{l+1})\ge c^\top x+\mathcal{C}^l_i(x,\omega)-\frac{M_h}{k_l}$, $i=1,2,\ldots,l$.

b. If $k_{l+1}>k_l$, then $\hat f^{k_{l+1}}_{l+1}(x,\omega;S_{l+1})\ge c^\top x+\frac{k_l}{k_{l+1}}\mathcal{C}^l_i(x,\omega)$, $i=1,2,\ldots,l$.

Proof. Case a ($k_{l+1}=k_l$). Let $\omega^l_{[k_l]}$ be the $k_l$-th nearest neighbor of $\omega$ from the set $\{\omega_j\}_{j=1}^{l}$. If $\|\omega^l_{[k_l]}-\omega\|\le\|\omega_{l+1}-\omega\|$, then $\mathbb{I}(\omega_{l+1}\in S(k_{l+1},\omega;\{\omega_j\}_{j=1}^{l+1}))=0$ and $\mathbb{I}(\omega^l_{[i]}\in S(k_{l+1},\omega;\{\omega_j\}_{j=1}^{l+1}))=1$ for $i=1,2,\ldots,k_l$. Hence, $\hat f^{k_{l+1}}_{l+1}(x,\omega;S_{l+1})=\hat f^{k_l}_l(x,\omega;S_l)$. By assumption, $\hat f^{k_l}_l(x,\omega;S_l)\ge c^\top x+\mathcal{C}^l_i(x,\omega)$ for $i\in J_l$. Since $M_h>0$, it is obvious that

$\hat f^{k_{l+1}}_{l+1}(x,\omega;S_{l+1})=\hat f^{k_l}_l(x,\omega;S_l)\ge c^\top x+\mathcal{C}^l_i(x,\omega)-\dfrac{M_h}{k_l}$.

If $\|\omega^l_{[k_l]}-\omega\|>\|\omega_{l+1}-\omega\|$, then $\omega^l_{[k_l]}$ will be removed from the $k_{l+1}$ nearest neighbor set in iteration $l+1$, while $\omega_{l+1}$ will be added to it; that is, $\mathbb{I}(\omega^l_{[k_l]}\in S(k_{l+1},\omega;\{\omega_j\}_{j=1}^{l+1}))=0$ and $\mathbb{I}(\omega_{l+1}\in S(k_{l+1},\omega;\{\omega_j\}_{j=1}^{l+1}))=1$. Since the sample size only increases by 1 per iteration, $\mathbb{I}(\omega^l_{[i]}\in S(k_{l+1},\omega;\{\omega_j\}_{j=1}^{l+1}))=1$ for all $1\le i\le k_l-1$. Hence, $0\le h(\cdot,\cdot)\le M_h$ implies that

$\hat f^{k_{l+1}}_{l+1}(x,\omega;S_{l+1})-\hat f^{k_l}_l(x,\omega;S_l)=\dfrac{1}{k_{l+1}}h(x,\xi_{l+1})-\dfrac{1}{k_l}h(x,\xi^l_{[k_l]})\ \ge\ \dfrac{1}{k_{l+1}}\times 0-\dfrac{1}{k_l}\times M_h=-\dfrac{M_h}{k_l}$.

By the assumption on $\mathcal{C}^l_i(x,\omega)$ in the proposition, it follows from the inequality above that $\hat f^{k_{l+1}}_{l+1}(x,\omega;S_{l+1})\ge c^\top x+\mathcal{C}^l_i(x,\omega)-\frac{M_h}{k_l}$ for all $i\in J_l$.

Case b ($k_{l+1}>k_l$).
Since the sample size only increases by 1 per iteration, we have

$\mathbb{I}\big(\omega^l_{[i]}\in S(k_{l+1},\omega;\{\omega_j\}_{j=1}^{l+1})\big)=1$   (4.4)

for all $1\le i\le k_l$. Equation (4.4) implies that

$\sum_{i=1}^{l}\mathbb{I}\big(\omega_i\in S(k_l,\omega;\{\omega_j\}_{j=1}^{l})\big)h(x,\xi_i)=\sum_{i=1}^{k_l}\mathbb{I}\big(\omega^l_{[i]}\in S(k_{l+1},\omega;\{\omega_j\}_{j=1}^{l+1})\big)h(x,\xi^l_{[i]})$.

Hence, $h(\cdot,\cdot)\ge 0$ implies that

$\hat f^{k_{l+1}}_{l+1}(x,\omega;S_{l+1})-c^\top x\ \ge\ \dfrac{1}{k_{l+1}}\sum_{i=1}^{k_l}\mathbb{I}\big(\omega^l_{[i]}\in S(k_{l+1},\omega;\{\omega_j\}_{j=1}^{l+1})\big)h(x,\xi^l_{[i]})+\dfrac{k_{l+1}-k_l}{k_{l+1}}\times 0\ \ge\ \dfrac{1}{k_{l+1}}\sum_{i=1}^{l}\mathbb{I}\big(\omega_i\in S(k_l,\omega;\{\omega_j\}_{j=1}^{l})\big)h(x,\xi_i)$.

Since $\frac{1}{k_l}\sum_{i=1}^{l}\mathbb{I}(\omega_i\in S(k_l,\omega;\{\omega_j\}_{j=1}^{l}))h(x,\xi_i)=\hat f^{k_l}_l(x,\omega;S_l)-c^\top x$, it follows that $\sum_{i=1}^{l}\mathbb{I}(\omega_i\in S(k_l,\omega;\{\omega_j\}_{j=1}^{l}))h(x,\xi_i)=k_l\big(\hat f^{k_l}_l(x,\omega;S_l)-c^\top x\big)$, which further implies that

$\hat f^{k_{l+1}}_{l+1}(x,\omega;S_{l+1})-c^\top x\ \ge\ \dfrac{k_l}{k_{l+1}}\big(\hat f^{k_l}_l(x,\omega;S_l)-c^\top x\big)+\dfrac{k_{l+1}-k_l}{k_{l+1}}\times 0\ \ge\ \dfrac{k_l}{k_{l+1}}\mathcal{C}^l_i(x,\omega)$ for $i\in J_l$.

Hence, $\hat f^{k_{l+1}}_{l+1}(x,\omega;S_{l+1})\ge c^\top x+\frac{k_l}{k_{l+1}}\mathcal{C}^l_i(x,\omega)$ for all $i\in J_l$.

Algorithm 16 SD-kNN
Set $V_0=\emptyset$, $\hat x_0\in X$, $k_0=0$, $q\in(0,1)$, $\sigma_l\ge 1$, and $\beta\in(0,1)$. Set $J_0=\{0\}$, $\mathcal{C}^0_0(x)=0$, $\hat H_0(x)=\max\{\mathcal{C}^0_0(x)\}$, and $\check f_0(x,\omega)=c^\top x+\hat H_0(x)$.
for $l=1,2,3,\ldots$ do
  (Candidate selection) Solve the proximal mapping problem $x_l=\arg\min_{x\in X}\check f_{l-1}(x,\omega)+\frac{\sigma_l}{2}\|x-\hat x_{l-1}\|^2$.
  Record the Lagrange multipliers $\{\theta^{l-1}_i\}_{i\in J_{l-1}}$ of $\{\mathcal{C}^{l-1}_i(x)\}_{i\in J_{l-1}}$.
  Set $k_l\leftarrow\lfloor l^\beta\rfloor$ and generate a new sample $(\omega_l,\xi_l)\sim\mu_{\tilde\omega,\tilde\xi}$. Find the $k_l$ nearest neighbors of $\omega$ from the sample set $\{\omega_i\}_{i=1}^{l}$, denoted $S(k_l,\omega;\{\omega_j\}_{j=1}^{l})=\{\omega^l_{[i]}\}_{i=1}^{k_l}$.
  Compute a dual extreme point $\pi_l\in\arg\max\{\pi^\top(e(\xi_l)-C(\xi_l)x_l)\,|\,\pi^\top D\le d\}$.
  Update $V_l\leftarrow V_{l-1}\cup\{\pi_l\}$ and $J_l\leftarrow J_{l-1}\cup\{-l,l\}\setminus\{i\in J_{l-1}:\theta^{l-1}_i=0\}$.
  for $i=1,2,\ldots,k_l$ do
    (Dual selection) Compute $\hat\pi^l_{[i]}\in\arg\max\{\pi^\top(e(\xi^l_{[i]})-C(\xi^l_{[i]})x_l):\pi\in V_l\}$.
    Compute $\hat\pi^{-l}_{[i]}\in\arg\max\{\pi^\top(e(\xi^l_{[i]})-C(\xi^l_{[i]})\hat x_{l-1}):\pi\in V_l\}$.
  end for
  Calculate $\alpha^l_l=\frac{1}{k_l}\sum_{i=1}^{k_l}(\hat\pi^l_{[i]})^\top e(\xi^l_{[i]})$.
  (Minorant construction) Calculate $\beta^l_l=-\frac{1}{k_l}\sum_{i=1}^{k_l}C(\xi^l_{[i]})^\top\hat\pi^l_{[i]}$. Construct the minorant at the candidate: $\mathcal{C}^l_l(x,\omega)=\alpha^l_l+(\beta^l_l)^\top x$.
  Calculate $\alpha^l_{-l}=\frac{1}{k_l}\sum_{i=1}^{k_l}(\hat\pi^{-l}_{[i]})^\top e(\xi^l_{[i]})$ and $\beta^l_{-l}=-\frac{1}{k_l}\sum_{i=1}^{k_l}C(\xi^l_{[i]})^\top\hat\pi^{-l}_{[i]}$. Construct the minorant at the incumbent: $\mathcal{C}^l_{-l}(x,\omega)=\alpha^l_{-l}+(\beta^l_{-l})^\top x$.
  if $k_l=k_{l-1}$ then
    Update $\mathcal{C}^l_i(x,\omega)\leftarrow\mathcal{C}^{l-1}_i(x,\omega)-\frac{M_h}{k_{l-1}}$ for all $i\in J_l\setminus\{-l,l\}$.
  else
    Update $\mathcal{C}^l_i(x,\omega)\leftarrow\frac{k_{l-1}}{k_l}\mathcal{C}^{l-1}_i(x,\omega)$ for all $i\in J_l\setminus\{-l,l\}$.
  end if
  Construct $\hat H_l(x,\omega)=\max\{\mathcal{C}^l_i(x,\omega): i\in J_l\}$ and $\check f_l(x,\omega)=c^\top x+\hat H_l(x,\omega)$.
  (Incumbent selection) if $\check f_l(x_l,\omega)-\check f_l(\hat x_{l-1},\omega)\le q\big(\check f_{l-1}(x_l,\omega)-\check f_{l-1}(\hat x_{l-1},\omega)\big)$ then
    Set $\hat x_l\leftarrow x_l$.
  else
    Set $\hat x_l\leftarrow\hat x_{l-1}$.
  end if
end for

Remark 1: In the beginning, $\frac{M_h}{k_{l-1}}$ may be so large that the updated cuts remain below zero for a large portion of $x\in X$. On the other hand, for large enough $k_l$, $\frac{M_h}{k_{l-1}}$ will not be so large as to prevent reusing the previous cuts. Based on this observation, we suggest that $k_0$ should not be too small.

Remark 2: The assumption that $h$ is bounded below by 0 can be relaxed to the assumption that $h$ is bounded below by a constant, $m_h$. In this case, the update rule can be modified to the following form:

a. If $k_{l+1}=k_l$, then $\hat f^{k_{l+1}}_{l+1}(x,\omega;S_{l+1})\ge c^\top x+\mathcal{C}^l_i(x,\omega)-\frac{M_h-m_h}{k_l}$, $i=1,2,\ldots,l$.

b. If $k_{l+1}>k_l$, then $\hat f^{k_{l+1}}_{l+1}(x,\omega;S_{l+1})\ge c^\top x+\frac{k_l}{k_{l+1}}\mathcal{C}^l_i(x,\omega)+\frac{k_{l+1}-k_l}{k_{l+1}}m_h$, $i=1,2,\ldots,l$.

4.3.3 Theoretical Results

Here we provide several preliminary results for the SD-kNN algorithm. We let $V_l$ denote the set of dual vertices explored by SD-kNN in the $l$-th iteration and let $h_l(x,\xi)=\max\{\pi^\top(e(\xi)-C(\xi)x)\,|\,\pi\in V_l\}$. Let $V$ denote the set of all vertices of $\{\pi\,|\,\pi^\top D\le d\}$.

Lemma 4.3. Suppose that assumptions A1-A7 are satisfied. Then the following hold: the subdifferential of $h(\cdot,\xi)$ is contained in a compact set for all $\xi\in\mathcal{Y}$.
Furthermore, $h(\cdot,\xi)$ is Lipschitz continuous on $X$ with a common constant for almost every $\xi\in\mathcal{Y}$.

Proof. Let $V$ denote the set of dual extreme points of $\{\pi: D^\top\pi\le d\}$, which contains finitely many unique elements. The dual of the second-stage problem can be written as $h(x,\xi)=\max\{\pi^\top(e(\xi)-C(\xi)x):\pi\in V\}$. By assumption A7, we have $\|C(\xi)^\top\pi\|\le M_C\max_{\pi\in V}\|\pi\|$. Furthermore, since $V$ is finite, there exists $M_\pi\in(0,\infty)$ such that $\|C(\xi)^\top\pi\|\le M_C M_\pi$. Combining this with the fact that $h(\cdot,\xi)$ is a piecewise linear function on $X$, we conclude that the subdifferential of $h(\cdot,\xi)$ is contained in a compact set for all $\xi\in\mathcal{Y}$, which in turn implies that $h(\cdot,\xi)$ is Lipschitz continuous on $X$ with a common constant for almost every $\xi\in\mathcal{Y}$.

As with regularized SD in [47], the regularizing term yields a similar descent condition.

Lemma 4.4. Given $\sigma\ge 1$, the following holds:

$\check f_l(x_{l+1},\omega)-\check f_l(\hat x_l,\omega)\le-\sigma\|x_{l+1}-\hat x_l\|^2\le-\|x_{l+1}-\hat x_l\|^2\le 0$.

By the minorant construction and updates in the SD-kNN algorithm, it is easy to verify that $\check f_l(\cdot,\omega)$ is an outer approximation of $\hat f^{k_l}_l(\cdot,\omega)$, which is summarized in the lemma below.

Lemma 4.5. Suppose that assumptions A1-A7 are satisfied. Let $k_l=\lfloor l^\beta\rfloor$, $\beta\in(0,1)$. The following hold:
(a) $h_l(x,\xi)\le h(x,\xi)$ for all $x\in X$, $l\in\mathbb{N}$.
(b) $c^\top x+\mathcal{C}^l_l(x,\omega)\le\hat f^{k_l}_l(x,\omega)$ and $c^\top x+\mathcal{C}^l_{-l}(x,\omega)\le\hat f^{k_l}_l(x,\omega)$ for all $x\in X$, $l\in\mathbb{N}$.
(c) $\check f_l(x,\omega)\le\hat f^{k_l}_l(x,\omega)$ for all $x\in X$, $l\in\mathbb{N}$.

Lemma 4.6. Suppose that assumptions A1-A7 are satisfied. The sequence of functions $\{h_l\}_l$ converges uniformly on $X$.

Proof. The dual extreme point exploration in SD-kNN is the same as the one in SD; see Lemma 1 in [46].

The rest of this section is organized as follows. We start with a theorem showing that the kNN estimate of the true objective function converges to the original objective function with probability one.
Then we show that, with probability one, the sequence of minorants from SD-kNN converges to $\mathbb{E}_{\tilde\xi\sim\mu_{\tilde\xi|\omega}}[h(\bar x,\tilde\xi)\,|\,\tilde\omega=\omega]$, where $\bar x$ is any accumulation point of the sequence of candidate/incumbent solutions. Next, we proceed to show that, with probability one, there exists a subsequence of iterations such that the incumbent solutions and the candidate solutions on that subsequence converge to the same point, and the directional derivative of $f(\cdot,\omega)$ at that limiting point is bounded below by the limsup of the directional derivative of the estimated function from SD-kNN evaluated at the candidate solution. Finally, we show that, with probability one, SD-kNN produces a sequence of incumbents that converges to the optimal solution of (4.3).

Theorem 4.7. Suppose that assumptions A1-A7 are satisfied. Let $k_l$ be monotone increasing in $l$, $k_l\to\infty$, $k_l/l\to 0$ (as $l\to\infty$), and let $(k_l)$ vary regularly with exponent $\beta\in(0,1]$ (e.g., $k_l=\lfloor l^\beta\rfloor$, $\beta\in(0,1)$). Then, with probability one,

$\lim_{l\to\infty}\dfrac{1}{k_l}\sum_{i=1}^{l}\mathbb{I}\big(\tilde\omega_i\in S(k_l,\omega;\{\tilde\omega_j\}_{j=1}^{l})\big)h(x,\tilde\xi_i)=\mathbb{E}_{\tilde\xi\sim\mu_{\tilde\xi|\omega}}[h(x,\tilde\xi)\,|\,\tilde\omega=\omega]$.

Proof. See Theorem 1 of [109].

Since $\frac{1}{k_l}\sum_{i=1}^{k_l}h(x,\xi^l_{[i]})=\frac{1}{k_l}\sum_{i=1}^{l}\mathbb{I}(\omega_i\in S(k_l,\omega;\{\omega_j\}_{j=1}^{l}))h(x,\xi_i)$ and $\hat f^{k_l}_l(x,\omega;S_l)=c^\top x+\frac{1}{k_l}\sum_{i=1}^{l}\mathbb{I}(\omega_i\in S(k_l,\omega;\{\omega_j\}_{j=1}^{l}))h(x,\xi_i)$, Theorem 4.7 implies that the following results hold.

Corollary 4.8. Suppose that assumptions A1-A7 are satisfied. Let $k_l$ be monotone increasing in $l$, $k_l\to\infty$, $k_l/l\to 0$ (as $l\to\infty$), and let $(k_l)$ vary regularly with exponent $\beta\in(0,1]$ (e.g., $k_l=\lfloor l^\beta\rfloor$, $\beta\in(0,1)$). Then the following hold with probability one:

a. $\lim_{l\to\infty}\frac{1}{k_l}\sum_{i=1}^{k_l}h(x,\xi^l_{[i]})=\mathbb{E}_{\tilde\xi\sim\mu_{\tilde\xi|\omega}}[h(x,\tilde\xi)\,|\,\tilde\omega=\omega]$.

b. $\lim_{l\to\infty}\hat f^{k_l}_l(x,\omega;S_l)=c^\top x+\mathbb{E}_{\tilde\xi\sim\mu_{\tilde\xi|\omega}}[h(x,\tilde\xi)\,|\,\tilde\omega=\omega]=f(x,\omega)$.

Lemma 4.9. Suppose that assumptions A1-A7 are satisfied. The sequence of functions $\{\hat f^{k_l}_l(x,\omega;S_l)\}_{l=1}^{\infty}$ is equicontinuous on $X$.

Proof.
By Lemma 4.3, the subdifferential of $h(\cdot,\xi)$ is contained in a compact set for all $\xi\in\mathcal{Y}$, which implies that $h(\cdot,\xi)$ is Lipschitz continuous with a common constant $L$ for all $\xi\in\mathcal{Y}$. Let $x,x'\in X$. Since $\frac{1}{k_l}\sum_{i=1}^{l}\mathbb{I}(\omega_i\in S(k_l,\omega;\{\omega_j\}_{j=1}^{l}))=1$ and $\mathbb{I}(\omega_i\in S(k_l,\omega;\{\omega_j\}_{j=1}^{l}))\ge 0$ for all $i$, we have

$|\hat f^{k_l}_l(x,\omega;S_l)-\hat f^{k_l}_l(x',\omega;S_l)|\le\dfrac{1}{k_l}\sum_{i=1}^{l}\mathbb{I}\big(\omega_i\in S(k_l,\omega;\{\omega_j\}_{j=1}^{l})\big)\,|h(x,\xi_i)-h(x',\xi_i)|\le L\|x-x'\|$.   (4.5)

Since (4.5) holds for every $l$, $\{\hat f^{k_l}_l(x,\omega;S_l)\}_{l=1}^{\infty}$ is Lipschitz continuous on $X$ with a common constant $L$, which further implies that $\{\hat f^{k_l}_l(x,\omega;S_l)\}_{l=1}^{\infty}$ is equicontinuous on $X$.

With the equicontinuity of the sequence of kNN benchmark functions $\{\hat f^{k_l}_l(x,\omega;S_l)\}_{l=1}^{\infty}$, we further show that $\{\hat f^{k_l}_l(x,\omega;S_l)\}_{l=1}^{\infty}$ converges to the true objective function $f(x,\omega)$ in (4.3) uniformly on $X$ with probability one.

Theorem 4.10. Suppose that assumptions A1-A7 are satisfied. Let $k_l$ be monotone increasing in $l$, $k_l\to\infty$, $k_l/l\to 0$ (as $l\to\infty$), and let $(k_l)$ vary regularly with exponent $\beta\in(0,1]$ (e.g., $k_l=\lfloor l^\beta\rfloor$, $\beta\in(0,1)$). With probability one, $\lim_{l\to\infty}\hat f^{k_l}_l(x,\omega;S_l)=f(x,\omega)$ uniformly on $X$.

Proof. The result follows from Theorem 4.7 and Lemma 4.9.

Lemma 4.11. Suppose assumption A1 is satisfied. Let $\{x_l\}_{l=1}^{\infty}$ and $\{\hat x_l\}_{l=1}^{\infty}$ denote the sequences of candidate and incumbent solutions generated by SD-kNN.

a. With probability one, an accumulation point of $\{x_l\}_{l}^{\infty}$ exists.

b. With probability one, an accumulation point of $\{\hat x_l\}_{l}^{\infty}$ exists.

Proof. The result follows since $X$ is compact.

Lemma 4.12. Suppose assumptions A3 and A4 hold and $\{\xi_l\}_{l}^{\infty}$ are i.i.d. With probability one, for all $i\in\mathbb{N}$ and $\delta>0$, the sequence $\{\tilde\xi_l\}_{l=i+1}^{\infty}$ enters the ball with center $\tilde\xi_i$ and radius $\delta$ infinitely often.

Proof. Let $\{\xi_i\}_{i\in\mathbb{N}}$ be a countable subset of $\mathcal{Y}$ that is dense in $\mathcal{Y}$. Let $\hat\Phi_{i,k}=\{\|\tilde\xi_l-\xi_i\|<\frac{1}{k}\text{ i.o.}\}$.
For fixed $i$ and $k$, by assumption A4, there exists $\epsilon>0$ such that $P(\|\tilde\xi_l-\xi_i\|<\frac{1}{k})=\epsilon$ for all $l\ge i+1$. By the second Borel-Cantelli Lemma (see Theorem 2.3.6 in [29]), $\sum_{l=i+1}^{\infty}P(\|\tilde\xi_l-\xi_i\|<\frac{1}{k})=\sum_{l=i+1}^{\infty}\epsilon=\infty$ implies that $P(\hat\Phi_{i,k})=P(\|\tilde\xi_l-\xi_i\|<\frac{1}{k}\text{ i.o.})=1$. Let $\bar\Phi_i=\cap_{k\in\mathbb{N}}\hat\Phi_{i,k}$; then $P(\bar\Phi_i)=P(\cap_{k\in\mathbb{N}}\hat\Phi_{i,k})=1$. Furthermore, by considering all $i\in\mathbb{N}$, we have $P(\cap_{i\in\mathbb{N}}\bar\Phi_i)=1$. Note that for any $\xi\in\mathcal{Y}$ and $\delta>0$, there exist $i,k\in\mathbb{N}$ such that $\frac{1}{k}<\frac{1}{2}\delta$ and $\|\xi-\xi_i\|<\frac{1}{2}\delta$. Therefore, $P(\|\tilde\xi_l-\xi\|<\delta\text{ i.o.})=1$. Hence, with probability one, for all $i\in\mathbb{N}$ and $\delta>0$, the sequence $\{\tilde\xi_l\}_{l=i+1}^{\infty}$ enters the ball with center $\tilde\xi_i$ and radius $\delta$ infinitely often.

To prepare for the following analysis, we denote the following sets of events:

$\Psi_1\triangleq\{\zeta\in\Omega:\lim_{l\to\infty}\frac{1}{k_l}\sum_{i=1}^{l}\mathbb{I}(\tilde\omega_i\in S(k_l,\omega;\{\tilde\omega_j\}_{j=1}^{l}))h(x,\tilde\xi_i)=\mathbb{E}[h(x,\tilde\xi)\,|\,\tilde\omega=\omega]\}$,
$\Psi_2\triangleq\{\zeta\in\Omega:\text{an accumulation point of }\{x_l(\zeta)\}\text{ exists}\}$,
$\Psi_3\triangleq\{\zeta\in\Omega:\text{for all }i\in\mathbb{N}\text{ and }\delta>0,\text{ the sequence }\{\tilde\xi_l\}_{l=i+1}^{\infty}\text{ enters the ball with center }\tilde\xi_i\text{ and radius }\delta\text{ infinitely often}\}$.

By Theorem 4.7, Lemma 4.11, and Lemma 4.12, $P(\Psi_1)=1$, $P(\Psi_2)=1$, and $P(\Psi_3)=1$. This further implies that $P(\Psi_1\cap\Psi_2\cap\Psi_3)=1$. Hence, showing $P(A)=1$ is equivalent to showing $P(A\,|\,\Psi_1\cap\Psi_2\cap\Psi_3)=1$.

Theorem 4.13. Suppose assumptions A1-A7 hold. Let $\{x_l\}_{l=1}^{\infty}$ and $\{\hat x_l\}_{l=1}^{\infty}$ be the sequences of candidate and incumbent solutions generated by SD-kNN. Let $\{(\alpha^l_l,\beta^l_l)\}_l$ and $\{(\alpha^l_{-l},\beta^l_{-l})\}_l$ be the sequences of coefficients calculated in the minorant construction of SD-kNN (Algorithm 16). Then the following hold:

a. With probability one, for any accumulation point $\hat x$ of $\{x_l\}_{l=1}^{\infty}$,

$\lim_{n\to\infty}\alpha^{l_n}_{l_n}+(\beta^{l_n}_{l_n})^\top x_{l_n}=\mathbb{E}_{\tilde\xi\sim\mu_{\tilde\xi|\omega}}[h(\hat x,\tilde\xi)\,|\,\tilde\omega=\omega]$,

where $\{x_{l_n}\}_{n=1}^{\infty}$ is an infinite subsequence of $\{x_l\}_{l=1}^{\infty}$ and $\lim_{n\to\infty}x_{l_n}=\hat x$.

b.
With probability one, for any accumulation point $\hat x'$ of $\{\hat x_l\}_{l=1}^{\infty}$,

$\lim_{n'\to\infty}\alpha^{l_{n'}}_{-l_{n'}}+(\beta^{l_{n'}}_{-l_{n'}})^\top\hat x_{l_{n'}}=\mathbb{E}_{\tilde\xi\sim\mu_{\tilde\xi|\omega}}[h(\hat x',\tilde\xi)\,|\,\tilde\omega=\omega]$,

where $\{\hat x_{l_{n'}}\}_{n'=1}^{\infty}$ is an infinite subsequence of $\{\hat x_l\}_{l=1}^{\infty}$ and $\lim_{n'\to\infty}\hat x_{l_{n'}}=\hat x'$.

Proof. Consider $\zeta\in\Psi_1\cap\Psi_2\cap\Psi_3$. By definition,

$h_{l_n}(x_{l_n}(\zeta),\xi^{l_n}_{[i]}(\zeta))=\hat\pi_{l_n}(\zeta)^\top\big(e(\xi^{l_n}_{[i]}(\zeta))-C(\xi^{l_n}_{[i]}(\zeta))\,x_{l_n}(\zeta)\big)$.

Let $\hat x(\zeta)$ be an accumulation point of $\{x_l(\zeta)\}_l$ such that $\lim_{l_n\to\infty}x_{l_n}(\zeta)=\hat x(\zeta)$. For notational simplicity, we shall suppress $\zeta$ when unnecessary. By Lemma 4.6, there exists a function $g$ such that

$\lim_{l_n\to\infty}\dfrac{1}{k_{l_n}}\sum_{i=1}^{l_n}\mathbb{I}\big(\tilde\omega_i\in S(k_{l_n},\omega;\{\tilde\omega_j\}_{j=1}^{l_n})\big)\big(h_{l_n}(x_{l_n},\tilde\xi_i)-g(\hat x,\tilde\xi_i)\big)=0$.

For a given $\tilde\xi_i(\zeta)$ (i.e., the response generated in the $i$-th iteration), since $h(\cdot,\cdot)$ is continuous on $X\times\mathcal{Y}$ and $\{h_l\}_{l}^{\infty}$ is uniformly convergent on $X$, for any $\epsilon>0$ there exist $\delta$ and $L$ such that for all $(x,\xi)\in X\times\mathcal{Y}$ with $\|(x,\xi)-(\hat x,\tilde\xi_i(\zeta))\|\le\delta$,

$|h(x,\xi)-h(\hat x,\tilde\xi_i(\zeta))|\le\dfrac{\epsilon}{3}$,   (4.6)

and, moreover,

$|h_{l_n}(x,\xi)-h_{l_n}(\hat x,\tilde\xi_i(\zeta))|\le\dfrac{\epsilon}{3}$ for all $l_n\ge L$.   (4.7)

Since $P(\|\tilde\xi-\tilde\xi_i(\zeta)\|<\delta)>0$ for any $\delta>0$, we have (by the Borel-Cantelli Lemma) $P(\|\tilde\xi_l-\tilde\xi_i(\zeta)\|<\delta\text{ i.o.})=1$, so there exists a further subsequence of $\{x_{l_n}\}_{n=1}^{\infty}$, denoted $\{x_{l'_n}\}_{n=1}^{\infty}$, such that $\|(x_{l'_n}(\zeta),\tilde\xi_{l'_n}(\zeta))-(\hat x,\tilde\xi_i(\zeta))\|<\delta$ for $l'_n\ge L$. Hence,

$|h(\hat x,\tilde\xi_i(\zeta))-h_{l'_n}(x_{l'_n}(\zeta),\tilde\xi_i(\zeta))|\le|h(\hat x,\tilde\xi_i(\zeta))-h(\hat x,\tilde\xi_{l'_n}(\zeta))|+|h(\hat x,\tilde\xi_{l'_n}(\zeta))-h_{l'_n}(x_{l'_n},\tilde\xi_{l'_n}(\zeta))|+|h_{l'_n}(x_{l'_n},\tilde\xi_{l'_n}(\zeta))-h_{l'_n}(x_{l'_n},\tilde\xi_i(\zeta))|\le\dfrac{2\epsilon}{3}+|h(\hat x,\tilde\xi_{l'_n}(\zeta))-h_{l'_n}(x_{l'_n},\tilde\xi_{l'_n}(\zeta))|$.   (4.8)

Note that $h_{l'_n}(x_{l'_n}(\zeta),\xi_{l'_n}(\zeta))=h(x_{l'_n}(\zeta),\xi_{l'_n}(\zeta))$.
It follows from (4.8) that

$\dfrac{2\epsilon}{3}+|h(\hat x,\tilde\xi_{l'_n}(\zeta))-h_{l'_n}(x_{l'_n}(\zeta),\tilde\xi_{l'_n}(\zeta))|=\dfrac{2\epsilon}{3}+|h(\hat x,\tilde\xi_{l'_n}(\zeta))-h(x_{l'_n}(\zeta),\tilde\xi_{l'_n}(\zeta))|\le\epsilon$.   (4.9)

Hence, by the uniqueness of the limit, it follows that $g(\hat x,\tilde\xi_i(\zeta))=h(\hat x,\tilde\xi_i(\zeta))$. Since $i$ is arbitrary, $g(\hat x,\tilde\xi)=h(\hat x,\tilde\xi)$ for $\mu_{\tilde\xi}$-almost all $\xi$. Thus, we have

$\lim_{n\to\infty}\dfrac{1}{k_{l_n}}\sum_{i=1}^{l_n}\mathbb{I}\big(\tilde\omega_i\in S(k_{l_n},\omega;\{\tilde\omega_j\}_{j=1}^{l_n})\big)\big(h_{l_n}(x_{l_n},\tilde\xi_i)-h(\hat x,\tilde\xi_i)\big)=0$.   (4.10)

Since $\zeta\in\Psi_1$, equation (4.10) further implies that

$\lim_{n\to\infty}\dfrac{1}{k_{l_n}}\sum_{i=1}^{l_n}\mathbb{I}\big(\tilde\omega_i\in S(k_{l_n},\omega;\{\tilde\omega_j\}_{j=1}^{l_n})\big)h_{l_n}(x_{l_n},\tilde\xi_i)=\mathbb{E}_{\tilde\xi\sim\mu_{\tilde\xi|\omega}}[h(\hat x,\tilde\xi)\,|\,\tilde\omega=\omega]$.

Since $\hat x(\zeta)$ is an arbitrary accumulation point, $\zeta\in\Psi_1\cap\Psi_2\cap\Psi_3$ is arbitrary, and $P(\Psi_1\cap\Psi_2\cap\Psi_3)=1$, this proves statement a of the theorem. The proof of a can be mimicked to prove b.

In the SD-kNN algorithm, we do not explicitly compute the Lagrange multiplier for every sample. Instead, we only compute the Lagrange multipliers for the samples that are in the k nearest neighbor set of $\omega$. Theorem 4.13 further implies that:

a. With probability one, for any accumulation point $\hat x$ of $\{x_l\}_{l=1}^{\infty}$,

$\lim_{n\to\infty}\mathcal{C}^{l_n}_{l_n}(x_{l_n},\omega)=\mathbb{E}_{\tilde\xi\sim\mu_{\tilde\xi|\omega}}[h(\hat x,\tilde\xi)\,|\,\tilde\omega=\omega]$,

where $\{x_{l_n}\}_{n=1}^{\infty}$ is an infinite subsequence of $\{x_l\}_{l=1}^{\infty}$ and $\lim_{n\to\infty}x_{l_n}=\hat x$.

b. With probability one,

$\lim_{n'\to\infty}\mathcal{C}^{l_{n'}+1}_{-(l_{n'}+1)}(\hat x_{l_{n'}},\omega)=\mathbb{E}_{\tilde\xi\sim\mu_{\tilde\xi|\omega}}[h(\hat x',\tilde\xi)\,|\,\tilde\omega=\omega]$,

where $\{\hat x_{l_{n'}}\}_{n'=1}^{\infty}$ is an infinite subsequence of $\{\hat x_l\}_{l=1}^{\infty}$ and $\lim_{n'\to\infty}\hat x_{l_{n'}}=\hat x'$.

One direct result of Theorem 4.13 is summarized in the following corollary, which states that, with probability one, the estimated objective obtained by the SD-kNN algorithm converges to the true objective function at any accumulation point of the sequence of incumbent solutions.

Corollary 4.14. Suppose assumptions A1-A7 hold.
With probability one, for any accu- mulation point, ˆ x, of{ˆ x l }, the following holds lim n→∞ ˇ f ln (ˆ x ln ,ω)= lim n→∞ ˇ f ln+1 (ˆ x ln ,ω)=f(ˆ x,ω), where{ˆ x ln } ∞ n=1 is the subsequence of{ˆ x l } ∞ l=1 such that lim n→∞ ˆ x ln = ˆ x. Proof. By the minorant construction in the SD-kNN, we have max{α ln ln +(β ln ln +c) ⊤ ˆ x ln ,α ln − ln +(β ln − ln +c) ⊤ ˆ x ln } ≤ c ⊤ x+ 1 k ln ln X i=1 I ˜ ω i ∈S(k l ,ω;{˜ ω j } ln j=1 ) h(x ln , ˜ ξ i ) 141 which implies that liminf n→∞ max{α ln ln +(β ln ln +c) ⊤ ˆ x ln ,α ln − ln +(β ln − ln +c) ⊤ ˆ x ln }≤ limsup n→∞ ˇ f ln (ˆ x ln ,ω) ≤ limsup n→∞ c ⊤ ˆ x ln + 1 k ln ln X i=1 I ˜ ω i ∈S(k l ,ω;{˜ ω j } ln j=1 ) h(x ln , ˜ ξ i ) (4.11) By Theorem 4.13, with probability one, we have lim n→∞ max{α ln ln +(β ln ln +c) ⊤ ˆ x ln ,α ln − ln +(β ln − ln +c) ⊤ ˆ x ln }=f(ˆ x,ω). (4.12) By Theorem 4.10, with probability one, we have lim n→∞ ˆ f k ln ln (ˆ x ln ,ω;{(˜ ω i , ˜ ξ i )} l i=1 )=f(ˆ x,ω). (4.13) Hence, the combination of equations (4.11), (4.12), and(4.13) implies that lim n→∞ ˇ f ln (ˆ x ln ,ω)= f(ˆ x,ω), w.p.1.Since ˇ f ln+1 (·,ω)containsanaffinefunctiongeneratedat ˆx ln initerationl n +1, we can mimic the analysis above and get lim n→∞ ˇ f ln+1 (ˆ x ln ,ω)=f(ˆ x,ω), w.p.1. Before we prove the convergence of SD-kNN, we shall introduce the following behavior of an infinite sequence. Lemma 4.15. Let{s l } l be an infinite sequence. Then the following holds: limsup n→∞ 1 n n X l=1 s l ≤ limsup n→∞ s n . Proof. 1 n n X l=1 s l = 1 n k X l=1 s l + 1 n n X l=k+1 s l ≤ 1 n k X l=1 s l + n− k n sup l≥ k s l . (4.14) Let n go to infinity, it follows from (4.14) that limsup n→∞ 1 n P n l=1 s l ≤ sup l≥ k s l . Again, let k go to infinity, we have limsup n→∞ 1 n P n l=1 s l ≤ limsup k→∞ s k . 142 With the help of the incumbent selection rule, we can show that limsup l→∞ ˇ f l (x l+1 ,ω)− ˇ f l (ˆ x l ,ω)=0 with probability one. Lemma 4.16. Suppose assumptions A1 - A7 hold and σ ≥ 1. 
Let {x l } ∞ l=1 and {ˆ x l } ∞ l=1 be the sequence of master problem and incumbent solutions identified by the SD-kNN. With probability one, limsup l→∞ ˇ f l (x l+1 ,ω)− ˇ f l (ˆ x l ,ω)=0 Proof. Let θ l = ˇ f l (x l+1 ,ω)− ˇ f l (ˆ x l ,ω)). Let {l n } n∈N be a sequence of iterations that the incumbent changes. Case 1: N is an infinite set. According to the incumbent selection rule and Lemma 4.4, we have ˇ f ln (x ln ,ω)− ˇ f ln (ˆ x ln− 1 ,ω)≤ q ˇ f ln− 1 (x ln ,ω)− ˇ f ln− 1 (ˆ x ln− 1 ,ω) ≤− q∥x ln − ˆ x ln− 1 ∥ 2 (4.15) Also note that θ ln− 1 = ˇ f ln− 1 (x ln ,ω)− ˇ f ln− 1 (ˆ x ln− 1 ,ω) . By averaging the first m terms in (4.16),wehave∆ m = 1 m P m n=1 ˇ f ln (x ln ,ω)− ˇ f ln (ˆ x ln− 1 ,ω)≤ q m P m n=1 θ ln− 1 ≤ 0. Byassumption, ˆ x ln =x ln and ˆ x ln− 1 = ˆ x l n− 1 . Then 1 m P m n=1 ˇ f ln (x ln ,ω)− ˇ f ln (ˆ x ln− 1 ,ω) can be rewritten as 1 m m X n=1 ˇ f ln (x ln ,ω)− ˇ f ln (ˆ x ln− 1 ,ω)= 1 m m X n=1 ˇ f ln (ˆ x ln ,ω)− ˇ f ln (ˆ x l n− 1 ,ω) = 1 m ˇ f lm (ˆ x lm ,ω)− ˇ f l 1 (ˆ x l 1 ,ω) + 1 m m− 1 X n=1 ˇ f ln (ˆ x ln ,ω)− ˇ f l n+1 (ˆ x ln ,m) (4.16) By assumptionsA1,A2 andA6, there exists M f , such that| ˇ f lm (ˆ x lm ,ω)− ˇ f l 1 (ˆ x l 1 ,ω)|<M f , which further implies that 0= lim m→∞ − 1 m M f ≤ lim m→∞ 1 m ˇ f lm (ˆ x lm ,ω)− ˇ f l 1 (ˆ x l 1 ,ω) ≤ 1 m M f =0, w.p.1. (4.17) By Corollary 4.14, we have limsup m→∞ 1 m P m− 1 n=1 ˇ f ln (ˆ x ln ,ω)− ˇ f l n+1 (ˆ x ln ,m) = 0, w.p.1. Therefore,wehave0=limsup m→∞ ∆ m ≤ limsup m→∞ q m P m n=1 θ ln− 1 ≤ 0, w.p.1.Furthermore,Lemma 143 4.15 implies 0=limsup m→∞ 1 m P m n=1 θ ln− 1 ≤ limsup n→∞ θ ln− 1 ≤ 0, w.p.1. which completes the proof for case 1. Case 2: N is finite. We argue that we can apply the proof of Theorem 3 in [47] by pattern- matching the notations. The proximal mapping problem in the (l +1) th iteration can be written as min σ 2 ∥x− ˆ x l ∥ 2 +η s.t. 
Ax≤ b, η ≥ c ⊤ x+α l i +(β l i ) ⊤ x∀ i∈J l (4.18) The necessary condition of (4.18) is σ (x− ˆ x l )+A ⊤ λ + X i∈J l θ i (c+β l i ) =0, X i∈J l θ i =1, λ ⊤ (Ax− b)=0 θ i (c ⊤ x+α l i +(β l i ) ⊤ x− η )=0, i∈J l , λ ≥ 0, θ i ≥ 0, i∈J l (4.19) Similarly, we assume that there exists L<∞, ˆ x l = ˆ x for all l≥ L and provide the following notations: ˇ f l i =α l i +(β l i +c) ⊤ ˆ x, ˇ F l is a vector with components ˇ f l i , i∈J l , ˆ x=Aˆ x− b, B l is a matrix whose rows are given by (c+β l i ) ⊤ , i∈J l , e=(1,...,1) ⊤ , θ is vector of Lagrange multipliers, θ i ,i∈J l , defined in (4.19) , λ is the vector of Lagrange multipliers defined in (4.19) . Then the necessary condition for the proximal mapping problem in the l th iteration can be written as A(x− ˆ x)+ ˆ b≤ 0, − ηe + ˇ F l +B l (x− ˆ x)≤ 0, λ ⊤ (A(x− ˆ x)+ ˆ b)=0, θ ⊤ (− ηe + ˇ F l +B l (x− ˆ x))=0, λA +θ ⊤ B l +(x− ˆ x) ⊤ =0, θ ⊤ e=1, θ ≥ 0, λ ≥ 0. (4.20) Then we can see that the necessary condition in (4.20) follows the same format as the one in [47]. 144 Lemma 4.17. Suppose that assumptions A1 - A7 hold and σ ≥ 1. For a given x∈ X, let d l (x)= x− x l+1 ∥x− x l+1 ∥ . Then with probability one, the following holds: (a) Thereexistsaninfinitesubsequenceofiterations L⊂ Nsuchthatlim l∈L x l+1 =lim l∈L ˆ x l = ˆ x, for some ˆ x∈X. (b) limsup l∈L ˇ f ′ l (x l+1 ,ω;d l (x))≤ f ′ (ˆ x,ω; x− ˆ x ∥x− ˆ x∥ ). Proof. Proof of (a). By Lemma 4.4, we have ˇ f l (x l+1 ,ω)− ˇ f l (ˆ x l ,ω)≤−∥ x l+1 − ˆ x l ∥ 2 ≤ 0. By Lemma 4.16, we have 0=limsup l→∞ ˇ f l (x l+1 ,ω)− ˇ f l (ˆ x l ,ω)≤ limsup l→∞ −∥ x l+1 − ˆ x l ∥ 2 ≤ 0, w.p.1 (4.21) Equation (4.21) implies that with probability one, there exists L⊂ N such that lim l∈L x l+1 = lim l∈L ˆ x l = ˆ x. So this proves Lemma 4.17 (a). Proof of (b). The proof is similar to the proof Lemma 4 in [47]. We let (c+ ˆ β l )∈argmax{(c+β l i ) ⊤ d l (x):α l i +(c+β l i ) ⊤ x l+1 = ˇ f l (x l+1 ,ω),i∈J l }, and ˆ α l = ˇ f l (x l+1 ,ω)− (c+ ˆ β l ) ⊤ x l+1 . 
By the construction of ˇ f l , we have ˇ f ′ l (x l+1 ,ω;d l (x)) = (c + ˆ β l ) ⊤ d l (x). By Lemma 4.17 (a), with probability one, there exists L ⊂ N such that lim l∈L x l+1 = lim l∈L ˆ x l = ˆ x. This implies that lim l∈L x− x l ∥x− x l ∥ = x− ˆ x ∥x− ˆ x∥ . Denote d(x) = x− ˆ x ∥x− ˆ x∥ . On the other hand, note that ˆ β l is a convex combination (i.e., P l i=1 1 k l I(˜ ω i ∈S(k l ,ω;{˜ ω i } l j=1 )) = 1, 1 k I(˜ ω i ∈S(k l ,ω;{˜ ω i } l j=1 ))≥ 0 for all i) of elements of{− C(ξ ) ⊤ π :ξ ∈S ξ ,π ∈V}, which is a compact set. As a result, limsup l∈L ˇ f ′ l (x l+1 ,ω;d l (x)) is finite and there exists L ′ ⊂L such that lim l∈L ′ ˆ β l = ˆ β, and limsup l∈L ˇ f ′ l (x l+1 ,ω;d l (x))= lim l∈L ′ ˇ f ′ l (x l+1 ,ω;d l (x))=(c+ ˆ β ) ⊤ d(x). By Theorem 4.13, we have lim l∈L ′ ˆ α l i +(c+ ˆ β l ) ⊤ x l+1 = f(ˆ x,ω) w.p.1. On the other hand, by Lemma 4.5, ˆ α l i +(c+ ˆ β l ) ⊤ x ≤ ˇ f l (x) ≤ ˆ f k l l (x), for all x ∈ X. By Theorem 4.10, lim l→∞ ˆ f k l l (x) = f(x,ω) 145 uniformly on X with probability one. This implies that (c+ ˆ β )∈∂ x f(x,ω) with probability one. Finally, we have limsup l∈L ˇ f ′ l (x l+1 ,ω;d l (x))=(c+ ˆ β ) ⊤ d(x)≤ max{µ ⊤ d(x):µ ∈∂ x f(x,ω)} =f ′ (x,ω;d(x)), w.p.1. With Lemma 4.17, we can prove that the incumbent solutions and candidate solutions that correspond to a subsequence of iterations converge to the same optimal solution to the problem (4.3) with probability one. Theorem4.18. Suppose thatassumptionsA1 -A7 hold andσ ≥ 1. Let{x l } ∞ l=1 and{ˆ x l } ∞ l=1 be the sequence of master problem and incumbent solutions identified by the SD-kNN. With probability one, there exists an infinite subsequence of iterations L⊂ N such that {x l+1 } l∈L and{ˆ x l } l∈L converges to the same optimal solution to (4.3). Proof. By Lemma 4.17, with probability one, there exists L ⊂ N, such that lim l∈L x l+1 = lim l∈L ˆ x l = ˆ x, for some ˆ x∈X. The following steps are to show that ˆ x is an optimal solution to (4.3). We let x∈X and x̸= ˆ x. 
By the triangle inequality, we have ∥ˆ x l − x∥≥∥ ˆ x− x∥−∥ ˆ x− ˆ x l ∥ (4.22) x̸= ˆ x implies that ∥x− ˆ x∥ > 0. Since lim l∈L ˆ x l = ˆ x, there exists L 0 such that ∥ˆ x− ˆ x l ∥≤ 1 2 ∥x− ˆ x∥, for all l∈L and l≥ L 0 . Let s∈(0, 1 2 ∥ˆ x− x∥], then it follows from (4.22) that ∥ˆ x l − x∥≥ 1 2 ∥ˆ x− x∥≥ s>0, ∀ l∈{l∈L| l≥ L 0 }. (4.23) Denote d l = x− x l+1 ∥x− x l+1 ∥ , d= x− ˆ x ∥x− ˆ x∥ , s l =∥x− x l ∥. 146 On the other hand, since x l+1 is the optimal solution to the proximal mapping problem in the l l+1 iteration, for all l∈{l∈L| l≥ L 0 }, ˇ f l (x l+1 ,ω)+ σ 2 ∥x l+1 − ˆ x l ∥ 2 ≤ ˇ f l (x l+1 +sd l ,ω)+ σ 2 ∥x l+1 +sd l − ˆ x l ∥ 2 . (4.24) Since s>0, equation (4.24) implies that for all l∈{l∈L| l≥ L 0 }, 0≤ ˇ f l (x l+1 +sd l ,ω)− ˇ f l (x l+1 ,ω) s + σ 2 ∥x l+1 +sd l − ˆ x l ∥ 2 − σ 2 ∥x l+1 − ˆ x l ∥ 2 s . (4.25) Note that the directional derivative of g l (x) = σ 2 ∥x− ˆ x l ∥ 2 at x l+1 in the direction d l is σ (x l+1 − ˆ x l ) ⊤ d l . Let s decrease to 0, it follows from (4.25) that 0≤ ˇ f ′ l (x l+1 ,ω;d l )+σ (x l+1 − ˆ x l ) ⊤ d l , ∀ l∈{l∈L| l≥ L 0 }. (4.26) Since lim l∈L x l+1 =lim l∈L ˆ x l = ˆ x, lim l∈L σ (x l+1 − ˆ x l ) ⊤ d l =0. By Lemma 9 (b), we also have limsup l∈L ˇ f ′ l (x l+1 ,ω;d l (x))≤ f ′ (ˆ x,ω;d). Hence, by taking the supremum of both sides of equation (4.26), we have 0≤ limsup l∈L ˇ f ′ l (x l+1 ,ω;d l )+limsup l∈L σ (x l+1 − ˆ x l ) ⊤ d l =limsup l∈L ˇ f ′ l (x l+1 ,ω;d l )+0 ≤ f ′ (ˆ x,ω;d). (4.27) Since x is arbitrary and hence d is arbitrary, equation (4.27) implies that ˆ x is an optimal solution to (4.3). Next, we show that the entire sequence of incumbents converge to an optimal solution. 147 Theorem 4.19. Suppose that assumptionsA1 -A7 hold and σ ≥ 1. Let{ˆ x l } ∞ l=1 denote the sequence of incumbent solutions identified by the SD-kNN. With probability one, the sequence {ˆ x l } ∞ l=1 converges to an optimal solution of the problem (4.3) with probability one. Proof. 
By Theorem 4.13 and Theorem 4.18, we can apply Theorem 1 from [98] to finish the proof.

4.3.4 Non-parametric SD with Batch Sampling

Running SD-kNN without batch sampling is time-consuming and ineffective when the sample size is very large: once the sample size is large, adding one more sample has very little effect on the k nearest neighbor estimate, so generating one sample at a time cannot improve the solution efficiently. Motivated by this challenge, we modify the original SD-kNN algorithm to generate a batch of samples per iteration. The delicate part is how to update the previous minorants when we sample a batch; the answer is to call the minorant update rule in Proposition 4.2 recursively, once for each sample in the batch, in every iteration. We illustrate the details of the modification in Algorithm 17.

Theorem 4.20. Suppose that assumptions A1 - A7 hold and σ ≥ 1. Let {ˆx l } ∞ l=1 denote the sequence of incumbent solutions identified by the SD-kNN-Batch. With probability one, the sequence {ˆx l } ∞ l=1 converges to an optimal solution of the problem (4.3).

Proof. The minorant construction, incumbent selection, and candidate selection are the same as the ones in SD-kNN. All the updated minorants in iteration l are lower bounds of the k l -NN benchmark function with samples {(ω i , ξ i )} N l i=1 . Hence, the proof is the same as the one in SD-kNN (see Theorem 4.19).

Algorithm 17 SD-kNN-Batch
Set V 0 = ∅, N 0 > 0, n b > 0, q ∈ (0,1), σ l ≥ 1 and β ∈ (0,1). Set J 0 = {0}, C 0 0 (x) = 0, ˆH 0 (x) = max{C 0 0 (x)}, and ˇf 0 (x,ω) = c ⊤ x + ˆH 0 (x). Set k 0 ← ⌊N 0 β ⌋, find the k 0 nearest neighbors of ω from the sample set {ω i } N 0 i=1 , which is denoted by S(k,ω;{ω j } N 0 j=1 ) = {ω 0 [i] } k 0 i=1 . (Remark: The tuple (ω 0 [i] , ξ 0 [i] ) is generated at the same time.) Compute ¯ξ 0 = (1/k 0 ) Σ k 0 i=1 ξ 0 [i] , and obtain x 0 ∈ argmin x∈X c ⊤ x + h(x, ¯ξ 0 ). Warm Starting for l = 1,2,3,...
do Solve the following proximal mapping problem Candidate Selection x l =argmin x∈X ˇ f l− 1 (x,ω)+ σ l 2 ∥x− ˆ x l− 1 ∥ 2 . Record the Lagrangian multipliers{θ l− 1 i } i∈J l− 1 of{C l− 1 i (x)} i∈J l− 1 . Set N l ← N l− 1 + n b , k l ← ⌊ N β l ⌋, and generate a batch of new samples {(ω N l− 1 +j ,ξ N l− 1 +j )} n b j=1 i.i.d. ∼ µ ˜ ω, ˜ ξ . Find the k l nearest neighbors of ω from the sample set{ω i } N l i=1 , which is denoted byS(k,ω;{ω j } N l j=1 )={ω l [i] } k l i=1 . UpdateJ l ←J l− 1 ∪{− l,l}\{i∈J l− 1 :θ l− 1 i =0}. Set V l ← V l− 1 , k← k l− 1 , andC l i (x,ω)←C l− 1 i (x,ω), ∀i∈J l \{− l,l}. for j =1,2,...,n b do Obtain dual vertices π l N l− 1 +j ∈argmax{π ⊤ (e(ξ N l− 1 +j )− C(ξ N l− 1 +j )x l )| π ⊤ D≤ d}. Update V l ← V l ∪{π l N l− 1 +j } and set ˆ k =⌊(N l− 1 +j) β ⌋. if ˆ k =k then UpdateC l i (x,ω)←C l i (x,ω)− M h k , ∀ i∈J l \{− l,l}. else UpdateC l i (x,ω)← C l i (x,ω)k/ ˆ k, ∀ i∈J l \{− l,l}. end if Set k← ˆ k. end for *Dual selection* (See Algorithm 16). *Minorant Construction* (See Algorithm 16). if ˇ f l (x l ,ω)− ˇ f l (ˆ x l− 1 ,ω)≤ q ˇ f l− 1 (x l ,ω)− ˇ f l− 1 (ˆ x l− 1 ,ω) then Incumbent Selection Set ˆ x l ← x l . else Set ˆ x l ← ˆ x l− 1 . end if end for 149 4.4 Two-Stage Predictive SQQP Similar to the extension of SD to two-stage stochastic quadratic-quadratic programming (SQQP) problems [69], there also exists a direct extension of SD-kNN algorithm to solve two-stagepredictiveSQQPproblems. Themathematicalformulationoftwo-stagepredictive SQQP problem can be written as follows. min x f QQ (x,ω)= 1 2 x ⊤ Qx+c ⊤ x+E[h QQ (x, ˜ ξ )|˜ ω =ω] s.t. x∈X ={x∈ℜ n :Ax≤ b}, (4.28) where h QQ (x,ξ )=min 1 2 y ⊤ Py+d ⊤ y :Dy =e(ξ )− C(ξ )x, y≥ 0 . (4.29) Given the sample size equal to l, the k l nearest neighbor estimate of f QQ (x,ω) is ˆ f k l QQ,l (x,ω;{ω i ,ξ i } l i=1 )≜ 1 2 x ⊤ Qx+c ⊤ x+ 1 k l l X i=1 I ω i ∈S(k,ω;{ω j } l j=1 ) h QQ (x,ξ i ). (4.30) Throughout this section, we make the following additional assumptions. 
(B1) P and Q are symmetric positive definite matrices. (B2) h QQ (x,ξ )≥ 0 for all x∈ X and almost every ξ ∈Y. There exists M h <∞, such that h QQ (x,ξ )≤ M h for all x∈X and almost every ξ ∈Y. (B3) There exists L h such that ∥h QQ (x,ξ )− h QQ (y,ξ )∥ ≤ L h ∥x− y∥ for all x,y ∈ X and almost every ξ ∈Y. According to Liu and Sen [69], the dual representation of (4.29) is h QQ (x,ξ )=max s,t g QQ (s,t;x,ξ )≜− 1 2 (− d+D ⊤ t+s) ⊤ P − 1 (− d+D ⊤ t+s) +(e(ξ )− C(ξ )x) ⊤ t s.t. s∈ℜ n 2 , s≥ 0, t∈ℜ m 2 . (4.31) 150 For given (s,t,ξ ), g QQ (s,t;·,ξ ) is linear in x. Liu and Sen [69] show that problem (4.31) can be further simplified to the following problem whose feasible region has finitely many faces. h QQ (x,ξ )=max s − 1 2 s ⊤ Hs+ϕ (x,ξ ) ⊤ s s.t. s≥ 0, (4.32) where M =DP − 1 2 , H =P − 1 2 (I− M ⊤ (MM ⊤ ) − 1 M)P − 1 2 , ϕ (x,ξ )=Hd− P − 1 2 (MM ⊤ ) − 1 (e(ξ )− C(ξ )x). Motivated by the cut update in the SD-kNN algorithm in algorithm 16, and the SD- QQ algorithm in [69], we present the extension of SD-kNN algorithm for solving two-stage predictive SQQP problems. Proposition 4.21. Suppose assumptions A1-A5 and B1-B3 hold. Further suppose σ ≥ 1. Let{x l } ∞ l=1 and{ˆ x l } ∞ l=1 be the sequences of candidate solutions and incumbent solutions gen- erated by the SD-kNN-QQ algorithm. Let (s l (x,ω,ξ ),t l (x,ω,ξ )) ∈ argmax{g QQ (s,t;x,ξ ) : u∈ U l ,s (i) = 0,∀ i∈ u,t is free}. Let v ln i (ω) = 1 k l I ˜ ω i ∈S(k l ,ω;{˜ ω j } l j=1 ) , for 1≤ i≤ l. Then the following hold: a) With probability one, for any accumulation point, ˆ x, of{x l } ∞ l=1 , it holds that lim n→∞ k ln X i=1 v ln i (ω)g QQ s l (x ln ,˜ ω i , ˜ ξ i ),t l (x ln ,˜ ω i , ˜ ξ i );x ln , ˜ ξ i =E ˜ ξ ∼ µ ˜ ξ |ω [h QQ (ˆ x, ˜ ξ )|˜ ω =ω], where{x ln } ∞ n=1 is an infinite subsequence of {x l } ∞ l=1 and lim n→∞ x ln = ˆ x. 
b) With probability one, for any accumulation point, ˆ x ′ , of{ˆ x l } ∞ l=1 , it holds that lim n ′ →∞ l n ′ X i=1 v l n ′ i (ω)g QQ s l (ˆ x l n ′ ,˜ ω i , ˜ ξ i ),t l (x l n ′ ,˜ ω i , ˜ ξ i );ˆ x l n ′ , ˜ ξ i =E ˜ ξ ∼ µ ˜ ξ |ω [h QQ (ˆ x ′ , ˜ ξ )|˜ ω =ω], where{ˆ x l n ′ } ∞ n ′ =1 is an infinite subsequence of {ˆ x l } ∞ l=1 and lim n ′ →∞ ˆ x l n ′ = ˆ x ′ . 151 Algorithm 18 SD-kNN-QQ Set U 0 =∅, ˆ x 0 ∈X, k 0 =0, q∈(0,1), β ∈(0,1), and σ l ≥ 1 for l≥ 1. SetJ 0 ={0},C 0 0 (x)=0, ˆ H 0 (x)=max{C 0 0 (x)}, and ˇ f 0 (x,ω)=c ⊤ x+ ˆ H 0 (x). for l =1,2,3,... do Solve the following proximal mapping problem Candidate Selection x l =argmin x∈X ˇ f l− 1 (x,ω)+ σ l 2 ∥x− ˆ x l− 1 ∥ 2 . and record the Lagrangian multipliers{θ l− 1 i } i∈J l− 1 of{C l− 1 i (x)} i∈J l− 1 . Set k l =⌊l β ⌋ and randomly generate a new independent sample (ω l ,ξ l )∼ µ ˜ ω, ˜ ξ . Find the k l nearest neighbors of ω from the sample set {ω i } l i=1 , which is denoted by S(k,ω;{ω j } l j=1 ) = {ω l [i] } k l i=1 . (Remark: The tuple (ω l [i] ,ξ l [i] ) is generated in the same iteration.) Compute (s l ,t l )∈argmax{g QQ (s,t;x,ξ l ):s≥ 0, t is free}. Update u l ={i:s l (i) =0, 1≤ i≤ n 2 }, U l ← V l− 1 ∪{u l }, andJ l =J l− 1 ∪{− l,l}\{i∈ J l− 1 :θ l− 1 i =0}. for j =1,2,...,k l do Dual Selection Compute (s l [j] ,t l [j] )∈argmax{g QQ (s,t;x l ,ξ l [j] ):u∈U l ,s (i) =0, ∀ i∈u, t is free}. Compute (s − l [j] ,t − l [j] ) ∈ argmax{g QQ (s,t;ˆ x l− 1 ,ξ l [j] ) : u ∈ U l ,s (i) = 0, ∀ i ∈ u, t is free}. end for ConstructC l l (x,ω)= 1 k l P k l i=1 g QQ (s l [i] ,t l [i] ;x,ξ l [i] ). ConstructC l − l (x,ω)= 1 k l P k l i=1 g QQ (s − l [i] ,t − l [i] ;x,ξ l [i] ). if k l =k l− 1 then UpdateC l i (x,ω)←C l− 1 i (x,ω)− M h k l− 1 , ∀ i∈J l \{− l,l}. else UpdateC l i (x,ω)← k l− 1 k l C l− 1 i (x,ω), ∀ i∈J l \{− l,l}. end if Construct ˆ H l (x,ω)=max{C l i (x,ω):i∈J l } and ˇ f l (x,ω)= 1 2 x ⊤ Qx+c ⊤ x+ ˆ H l (x,ω). 
if ˇf l (x l ,ω) − ˇf l (ˆx l−1 ,ω) ≤ q[ ˇf l−1 (x l ,ω) − ˇf l−1 (ˆx l−1 ,ω)] then Incumbent Selection
Set ˆx l ← x l .
else
Set ˆx l ← ˆx l−1 .
end if
end for

c) With probability one, there exists an infinite subsequence of iterations L ⊂ N such that {x l+1 } l∈L and {ˆx l } l∈L converge to the same optimal solution to (4.28).

Proof. Since the feasible region of (4.32) has finitely many faces, the set of faces explored by SD-kNN-QQ will converge to some set with finitely many faces with probability one. Then we can mimic the proof of Theorem 4.13 a. to prove a) and b). With the convergence of the minorants in a) and b), we can follow the proof of Theorem 4.18 to prove c).

Similar to the proof of Theorem 4.19, we can use the results from Proposition 4.21 to prove that the entire sequence of incumbent solutions generated by SD-kNN-QQ converges to some optimal solution with probability one.

Theorem 4.22. Suppose assumptions A1-A5 and B1-B3 hold. Further suppose that σ ≥ 1. Let {ˆx l } ∞ l=1 denote the sequence of incumbent solutions identified by the SD-kNN-QQ. With probability one, the sequence {ˆx l } ∞ l=1 converges to an optimal solution of problem (4.28).

Proof. The proof is similar to the proof of Theorem 4.19.

4.5 Numerical Experiments

The algorithms are implemented in C++, with CPLEX v12.8.0 as a base solver to solve convex quadratic programming problems and obtain the second-stage dual multipliers. All programs are run on a MacBook Pro (2017) with a 3.1 GHz Dual-Core Intel Core i5. In SD-kNN, we design a dynamic step size rule based on [99]. Namely, if the incumbent selection test is passed (i.e., ˇf l (x l ,ω) − ˇf l (ˆx l−1 ,ω) ≤ q[ ˇf l−1 (x l ,ω) − ˇf l−1 (ˆx l−1 ,ω)]), then we let σ l = max{(1/2)σ l , σ}; otherwise, we let σ l = min{2σ l , σ̄}, where 1 ≤ σ < σ̄ < ∞. To determine the initial incumbent solution of SD-kNN, we design a presolve process.
In particular, we use a certain number of data points to compute the kNN estimate of the response, ¯ξ | ˜ω=ω . Let ¯N denote the sample size for the presolve process, and let ¯k = ⌊ ¯N β ⌋ denote the corresponding number of nearest neighbors; the mathematical formulation is ¯ξ | ˜ω=ω = (1/ ¯k) Σ ¯N i=1 I(ω i ∈ S( ¯k,ω;{ω j } ¯N j=1 )) ξ i . Then the initial incumbent solution, x 0 , is obtained by solving a deterministic two-stage problem with ¯ξ | ˜ω=ω as the realization of the second-stage scenario (i.e., x 0 ∈ argmin{c ⊤ x + h(x, ¯ξ | ˜ω=ω ) : x ∈ X}).

4.5.1 Two-Stage Predictive SLP

In this section, we demonstrate the computational performance of SD-kNN on a two-stage shipment planning problem, which can be seen in [12], and a two-stage multiproduct inventory problem, which is a modification of the single-period multiproduct inventory model proposed by [6]. Throughout this section, we refer to the two-stage shipment planning problem and the two-stage multiproduct inventory problem as BK19 and BAA99, respectively. The problem complexity is summarized in Table 4.1. We let n ω and n ξ denote the size of the predictor and the response, respectively. Matrix A in the first stage is of size m 1 × n 1 . We let m 2,ineq and m 2,eq denote the number of inequality constraints and equality constraints, respectively. We let n 2,ineq and n 2,eq denote the total number of columns in the inequality constraints and equality constraints, respectively. In particular, BK19 has four first-stage decision variables and four inequality constraints in the first stage, and it has 52 inequality constraints (excluding non-negativity constraints) and 16 decision variables in one realization of the second-stage subproblem. BAA99 has 25 decision variables and 50 inequality constraints in the first stage, and it has 1,375 equality constraints and 100 decision variables in one realization of the second-stage subproblem.
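The presolve estimate above is simply a kNN regression of the response on the predictor. The following sketch illustrates the computation of the presolve estimate; it is in Python rather than the C++/CPLEX implementation used in our experiments, and the data and function names are hypothetical:

```python
import math

def knn_mean_response(omega, data, beta=0.5):
    """kNN presolve estimate: average the responses attached to the
    k = floor(N^beta) predictors nearest to the observed omega."""
    N = len(data)
    k = max(1, math.floor(N ** beta))
    # rank observations by squared Euclidean distance of predictor to omega
    ranked = sorted(data, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], omega)))
    neighbors = ranked[:k]
    dim = len(neighbors[0][1])
    # componentwise average of the k neighboring responses
    return [sum(xi[j] for _, xi in neighbors) / k for j in range(dim)]

# hypothetical data: scalar predictor i, scalar response 2*i
data = [((float(i),), (2.0 * i,)) for i in range(100)]
xi_bar = knn_mean_response((10.0,), data, beta=0.49)  # k = 9, xi_bar == [20.0]
```

The initial incumbent x 0 would then be obtained by solving the deterministic two-stage problem with this estimate in place of the random response.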
For simplicity, we let LEON-E, LEON-Q, LEON-G, LEON-N, and LEON-k denote LEON with the Epanechnikov kernel, quartic kernel, Gaussian kernel, Naïve kernel, and kNN method, respectively. Then, we replicate each algorithm ten times for each instance and summarize the numerical results below.

Instance   n_ω   n_ξ   n_1   m_1   n_2,ineq   m_2,ineq   n_2,eq   m_2,eq
BK19         3    12     4     4         52         16        0        0
BAA99       25    50    50   100          0          0    1,375      100
Table 4.1: Problem Complexity of the Two-Stage Predictive SLPs

Algorithm   Min(Obj)   Max(Obj)   Avg(Obj)   |Avg(Obj)−θ*|/|θ*|   Dev(Obj)   t̄(sec)
SD-kNN       295.961    297.572   296.3628                0.19%      0.536   2673.462
LEON-E       296.715    300.734   298.1979                0.81%      1.540       8.41
LEON-Q       296.328    303.231   298.2611                0.83%      2.033       8.46
LEON-G       296.683    300.848   298.2938                0.85%      1.394      27.81
LEON-N       296.632    302.091   298.6216                0.96%      1.889       8.38
LEON-k       296.538    309.649   299.3122                1.19%      4.407       5.79
Table 4.2: BK19, sample size is 8,645

[Figure 4.1: BK19, SD-kNN versus LEON-kNN]

In BK19, the observed predictor is set to ω = (−0.3626, 0.5871, −0.2987), and the estimated true cost is calculated by using 100,000 samples from the true conditional distribution (see [12] for the data generation process). As a result, the estimated cost of ˆx is θ v (ˆx) = (1/N v ) Σ N v i=1 F(ˆx, ξ ω i ), and θ* = min x∈X (1/N v ) Σ N v i=1 F(x, ξ ω i ) is the estimated "optimal" cost of the perfect-foresight problem. In particular, θ* is 295.792. Table 4.2 summarizes the computational performance of SD-kNN and LEON on the BK19 instance. In BK19, the initial sample size for LEON is set to 50, and the sample increment is set to 1. The maximum number of outer loops is 13, and the scale of the averaging window (m) is set to 1. As a result, the total number of samples used by LEON is 8,645. To make a fair comparison, we also set the total number of samples used by SD-kNN to 8,645. Based on Table 4.2, SD-kNN outperforms LEON in terms of solution quality and stability. This is partially due to the incremental sampling and the use of past NSQG in SD-kNN.
To fully compare the progress of the two algorithms, we draw the box plots of the solution quality of SD-kNN and LEON-kNN for multiple sample sizes in Figure 4.1. One can see that SD-kNN converges faster than LEON-kNN.

Algorithm   Min(Obj)   Max(Obj)   Avg(Obj)   |Avg(Obj)−θ*_c|/|θ*_c|   Dev(Obj)   t̄(sec)
SD-kNN      -14253     -14248.1   -14251.87                  0.03%       1.47   2022.82
LEON-Q      -14251.7   -14248.7   -14250.03                  0.05%       0.96     62.59
LEON-k      -14253.4   -14245.8   -14250.02                  0.05%       2.66     12.06
LEON-E      -14250.8   -14248.9   -14249.68                  0.05%       0.61     62.98
LEON-G      -14246.6   -14244.4   -14245.14                  0.08%       0.69    130.32
Table 4.3: BAA99, Sample Size 32,445

[Figure 4.2: BAA99]

In BAA99, the predictor and demand (i.e., response) jointly follow a multivariate normal distribution with mean vector, µ, and covariance matrix, Σ (i.e., (ω,ξ) ∼ N(µ, Σ)). The observed predictor is ω i = 108 − i/3, i = 1,2,...,25. We utilize kNN to estimate the true cost of a given solution. Although such an approach will inevitably incur some bias in measuring solution quality, it is available in practice, and its performance is ensured by the pointwise convergence of the kNN estimator [109]. To initiate the validation for BAA99, we generate 10 6 predictor-response data pairs from the unconditional distribution and find the 10 3 nearest neighbors of the observed predictor. As a result, the associated estimated optimal cost (i.e., θ*_c ) is -14,256.5. The computational performance of LEON and SD-kNN is summarized in Table 4.3 and Figure 4.2. When the sample size is 32,445, we observe from Table 4.3 that SD-kNN has the best average solution quality, while LEON with the Epanechnikov kernel has the lowest variation in solution quality. On the other hand, we observe from Figure 4.2 that LEON with the quartic kernel has a better overall performance than SD-kNN when the sample size is 6,093. This may be due to the effect of the averaging window of the estimated solutions and the chosen kernels.
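The kNN validation scheme described above — averaging the cost of a fixed candidate solution over the responses attached to the nearest predictors in a large out-of-sample pool — can be sketched as follows. This is a plain-Python illustration on a synthetic one-dimensional example (our actual validation uses 10^6 pairs and 10^3 neighbors on the BAA99 data):

```python
import random

def knn_validation_cost(x_hat, cost, omega, pool, k):
    """Estimate E[cost(x_hat, xi) | omega] by averaging cost over the
    responses of the k predictors in the pool nearest to omega."""
    ranked = sorted(pool, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], omega)))
    return sum(cost(x_hat, xi) for _, xi in ranked[:k]) / k

# synthetic check: xi | omega ~ N(omega, 1) and cost(x, xi) = (x - xi)^2,
# so the true conditional expected cost at x = omega equals 1
random.seed(0)
pool = []
for _ in range(20000):
    w = random.uniform(-5.0, 5.0)
    pool.append(((w,), (w + random.gauss(0.0, 1.0),)))
theta = knn_validation_cost(1.0, lambda x, xi: (x - xi[0]) ** 2, (1.0,), pool, k=500)
```

Because the 500 nearest predictors concentrate near the observed ω, the estimate lands close to the true conditional cost, with the small bias noted in the text.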
Instance   Iterations        N    N_pre   n_b          ˆθ   |ˆθ−θ*|/|θ*|      δ   |δ/θ*|       t̄
BK19              100   85,000   80,000    50      296.32          0.18%   0.33    0.11%   133.23
BAA99             200   16,000   15,000     5   -14250.10          0.04%   1.92    0.01%    25.89
Table 4.4: Summary of Computational Experiments of the SD-kNN-Batch

According to Table 4.4, SD-kNN-Batch uses 16,000 samples to attain average solution quality with (|(ˆθ−θ*)/θ*|, |δ/θ*|) = (0.04%, 0.01%) in BAA99. The SD-kNN-Batch algorithm uses 85,000 samples to attain average solution quality with (|(ˆθ−θ*)/θ*|, |δ/θ*|) = (0.18%, 0.11%) in BK19.

4.5.2 Two-Stage Predictive SQQP

By introducing quadratic terms to the BAA99 problem, we propose an instance of a two-stage SQQP problem to test the capability of the SD-kNN-QQ algorithm. We shall refer to this problem as BAA99-QQ. The mathematical formulation of the BAA99-QQ problem is given below.

min y (1/2) Σ N i=1 Σ N j=1 Q ij y i y j + Σ N k=1 c k y k + E[h(y, ˜ξ) | ˜ω = ω]
s.t. 0 ≤ y k ≤ m y , k = 1,...,N    (4.33)

where

h(y,ξ) = min w,v,u (1/2) Σ N i=1 Σ N j=1 P 1 ij v i v j + (1/2) Σ N i=1 Σ N j=1 P 2 ij u i u j + (1/2) Σ N i 1 =1 Σ i 1 j 1 =1 Σ N i 2 =1 Σ i 2 j 2 =1 P 3 j 1 ,i 1 ,j 2 ,i 2 w j 1 ,i 1 w j 2 ,i 2 − Σ N i=1 Σ i j=1 a ji w ji − Σ N i=1 s i v i + Σ N i=1 π i u i
s.t. u i + Σ i j=1 w ij = e i (ξ), for i = 1,...,N
v i + Σ N j=i w ij = C i (ξ) y i , for i = 1,...,N
w, v, u ≥ 0.    (4.34)

C (which represents the impact of an uncertain factor) is random, and the Q and P matrices store the information on the variance of each decision variable. We let diag(v) denote a diagonal matrix whose entries are the elements of the vector v. The model parameters for BAA99-QQ are given below.

c = (4, 3.8, 3.6), m y = 217, Q = diag(7, 10, 2),
a = (8.0, 7.6, 7.6, 7.2, 7.2, 7.2), s = (−0.2, −0.2, −0.2), π = (10, 10, 10),
P = diag(9, 1, 9, 5, 2, 6, 9, 5, 7, 7, 8, 8).

To ease the notation, we let ω = (ω 1 , ω 2 ), where ω 1 ∈ ℜ and ω 2 ∈ ℜ nω−1 . In the design of the random process, we generate (ω 2 , e(ξ)) from a multivariate normal distribution with mean vector µ and covariance matrix Σ.
µ = (110.0, 105.0, 100.0, 95.0, 90.0)

Σ =
 4.96    0.48    2.52   −0.25   0.72
 0.48    5.53   −0.136   0.60   0.63
 2.52   −0.136   7.87   −0.26   2.255
−0.25    0.60   −0.26    4.81   0.61
 0.72    0.63    2.255   0.61   3.63

ω 1 is generated from the uniform distribution with range [0.6, 1.0). Then we generate N i.i.d. normal random variables (i.e., {ϕ i } N i=1 ) with mean ω 1 and standard deviation σ = 0.2. Finally, if ϕ i < 0.6, we set C i (ξ) = 0.6; if 0.6 ≤ ϕ i ≤ 1.0, we set C i (ξ) = ϕ i ; and if ϕ i > 1, we set C i (ξ) = 1. The problem size of BAA99-QQ is provided below.

Instance    n_ω   n_ξ   n_1   m_1   n_2,ineq   m_2,ineq   n_2,eq   m_2,eq
BAA99-QQ      3     6     3     6          0          0       12        6
Table 4.5: Problem Size of the SQQP Problem

Similar to the validation process in BAA99, the baseline cost, θ*, is calculated by using the 1,000 nearest neighbors from 100,000 samples from the unconditional distribution. Similarly, the solution quality is evaluated by calculating the average objective cost associated with those 1,000 nearest neighbors, plugging in the estimated solution obtained from the SD-kNN-QQ algorithm. We repeat the experiments 10 times and summarize the computational performance of SD-kNN-QQ in the following table.

Instance    Iterations      N    N_pre         ˆθ   |ˆθ−θ*|/|θ*|      δ    |δ/θ*|       t̄
BAA99-QQ           100     200      100   43831.95        0.0191%   10.53   0.0240%    30.99
BAA99-QQ           300     400      100   43827.16        0.0088%    3.65   0.0083%   127.12
BAA99-QQ           100   1,100    1,000   43824.42        0.0026%    0.73   0.0017%    82.35
BAA99-QQ           300   1,300    1,000   43824.67        0.0031%    0.93   0.0022%   269.13
Table 4.6: Summary of the Numerical Experiments of SD-kNN-QQ

Table 4.6 shows that the solution quality increases as the sample size for the presolve increases. When N_pre = 100, the solution quality increases as the number of iterations increases (i.e., ˆθ and δ decrease as N increases). When N_pre = 1,000, N does not play a significant role in the solution quality. One possible explanation is that the bias of the k nearest neighbor estimation is small when N_pre = 1,000.
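The construction of C i (ξ) above is a truncated-Gaussian draw around ω 1. A minimal sketch of this scenario generation (the function and parameter names are ours, not from the C++ implementation):

```python
import random

def sample_C(omega1, N, sigma=0.2, lo=0.6, hi=1.0, rng=random):
    """Draw phi_i ~ N(omega1, sigma) and clamp to [lo, hi], mirroring the
    three-case rule: C_i = lo if phi_i < lo; phi_i if lo <= phi_i <= hi; hi otherwise."""
    return [min(hi, max(lo, rng.gauss(omega1, sigma))) for _ in range(N)]

random.seed(42)
omega1 = random.uniform(0.6, 1.0)  # omega_1 ~ U[0.6, 1.0)
C = sample_C(omega1, N=3)          # every C_i lies in [0.6, 1.0]
```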
Chapter 5
Online Non-parametric Estimation for Nonconvex Stochastic Programming

In this chapter, we address the so-called Predictive Stochastic Programming (PSP) model in which data are available as covariates, (ω, ξ), although the underlying conditional distribution of ˜ξ given ˜ω = ω is unknown. One example of using covariate information is the hedonic pricing model for housing prices (see Owusu-Ansah [84] for an overview). In particular, home prices are often modeled as a regression function of dwelling attributes (e.g., land area, number of bedrooms, and so on) and location attributes (e.g., quality of the school district, crime in the neighborhood, distance from the central business district, and so on); see Kumbhakar and Parmeter [58], Martins-Filho and Bin [76], and Witte et al. [114]. In order to distinguish models using covariate data from traditional SP models (without covariates), authors have used alternative descriptors such as "predictive", "contextual", and "data-driven" SP. In any event, many modern applications involve such covariate data, which has its roots in statistical modeling using parametric as well as non-parametric statistics. The particular form of PSP which we study in this chapter may be stated as follows:

min x∈X ζ ω (x) ≜ E ˜ξ∼µ ˜ξ|ω [ F(x, ˜ξ) + H(x, ˜ξ) | ˜ω = ω ],    (5.1)

where µ ˜ξ|ω denotes the conditional distribution of ˜ξ given ˜ω = ω (an observed/fixed parameter). The various mathematical objects used in (5.1) may be specified by F : S×Y 7→ R, an L-smooth/differentiable concave function; S ⊂ R n , an open superset of the convex set X; and H(x,ξ), the minimum cost of a second-stage linear programming problem given the first-stage decision variable, x, and the realization of the random variable, ξ, for a given ω which will be fixed during the decision-making phase:

H(x,ξ) = min y d ⊤ y
s.t. Dy = e(ξ) − C(ξ)x, y ≥ 0, y ∈ R n 2 .    (5.2)

We observe that in this case, the objective function is parametrized by ω.
To clarify the setup, note that the optimization is undertaken only over the decision variable, x, because the parameter, ω, is merely a "leading indicator" of future uncertainty. Depending on the specific application, such indicators are known to provide valuable information for decision-making. Let us present some concrete examples: a) suppose that ω represents wind speeds at an upstream location, which can be used to predict wind energy production at a downstream wind farm; b) the current location and speed of movement of the "eye of a hurricane" while predicting the path of the hurricane. In the latter example, the random variable ˜ξ is often represented via the so-called "cone of uncertainty" [1]. These types of settings lead to stochastic programming models of the form given in (5.1), where the distribution of ξ depends on the observation ω, and historical covariate data is critical to predictions regarding the progress of the event. The decision model in (5.1) requires simultaneous estimation-and-optimization because the former (estimation) provides an approximation of the possible trajectory of uncertainty embodied by the conditional distribution µ ˜ξ|ω , and the latter (optimization) seeks decisions which respond to the event, using information available in the form of covariates ( ˜ξ, ˜ω = ω). Note that the decisions associated with PSP are based on the observation ω, and in this sense, ω is fixed during the decision (optimization) process.

One additional wrinkle that motivates our work is the presence of concave costs in the first-stage objective F. Within the context of such decision models, costs of the first-stage model are often modeled using concave costs because of decreasing marginal costs with larger levels of deployment. This is true in preparing for such uncertain events, including hurricane response, as well as in other applications, including energy storage for grid applications [2].

[1] https://www.nhc.noaa.gov/aboutcone.shtml
As a result, the first-stage cost associated with such systems exhibits decreasing marginal costs, leading to a concave cost function in the first stage. Similar costs also appear in other infrastructure projects (e.g., Cafaro and Grossmann [17] for pipeline installation) and in modeling transaction costs in portfolio management (Konno and Wijayanayake [56]). It turns out that the overall objective function of such problems includes a combination of concave first-stage costs and the expectation of second-stage costs, leading to a nonconvex SP model.

5.1 Literature Review

The use of feature information (ω) to assist decision making has gained increasing attention in the optimization community. Deng and Sen [24] propose a Learning Enabled Optimization (LEO) framework for solving

    min_{x∈X} f(x) + E_{ξ̃∼μ_{ξ̃|ω}}[ H(x, ξ̃ | ω̃ = ω) ].

Here, ξ̃ | ω̃ = ω means that a "true" regression model exists. Given a training set S_T = {(ω_i, ξ_i)}_{i=1}^N, a set of deterministic models, M, and a loss function, L, they build a regression model by solving the following empirical risk minimization problem:

    m̂ ∈ argmin { (1/N) Σ_{i=1}^N L(ξ_i, m(ω_i)) : m ∈ M }.

Then they use m̂ and a certain known distribution of noise (e.g., white noise; see Appendix I of [24] for more information) to construct a generative oracle that approximately simulates outcomes from the conditional distribution μ_{ξ̃|ω}. Additionally, Ban and Rudin [4] propose a feature-based newsvendor problem, while Elmachtoub and Grigas [31] propose a Smart "Predict, then Optimize" framework to solve what they refer to as a contextual stochastic optimization model. In a similar vein, Kannan et al. [55] analyze various versions of empirical residuals-based sample average approximation under the assumption of homoscedasticity and then analyze the heteroscedastic case in [54]. Hu et al.
[50] study noise-dependent/independent rates of the regret bound for the estimate-then-optimize model and the induced empirical-risk-minimization (i.e., simultaneous estimation and optimization) model for solving contextual linear optimization.

Framework                    | Problem Class                                          | Approach
Smart "Predict,              | min_{x∈X} E_{ξ̃∼μ_{ξ̃|ω}}[ξ̃ | ω̃ = ω]^⊤ x              | Step 1: construct a regression function;
then Optimize"               |                                                        | Step 2: optimize the decision based on the regression function
CLEO [68]                    | min_{x∈X} E_{ξ̃|x}[F(x, ξ̃) + H(x, ξ̃)]                 | Stochastic trust-region method with sequential local regression
From Predictive to           | min_{x∈X} E_{ξ̃∼μ_{ξ̃|ω}}[c(x, ξ̃) | ω̃ = ω]            | Solve an SAA-type problem with non-parametric estimation
Prescriptive Analytics [12]  |                                                        |

Table 5.1: Frameworks for PSP Problems

Generally speaking, all the approaches mentioned above can be regarded as two-step approaches: step 1 builds a regression model based on a training dataset, and step 2 optimizes a decision-making model based on the regression model.

On the other hand, Liu et al. [68] propose a simultaneous learning and optimization framework, which they refer to as Coupled Learning Enabled Optimization (CLEO), for solving decision-dependent PSP problems of the form

    min_{x∈X} E_{ξ̃|x}[ F(x, ξ̃) + H(x, ξ̃) ],

where the randomness, ξ̃, is assumed to be a regression of the decision, x. Given x̂_ℓ at the beginning of the ℓ-th iteration of CLEO, they generate a batch of samples S_ℓ = {(x_i, ξ_i)}_{i=1}^N, where the points {x_i}_{i=1}^N all lie in the trust region of x̂_ℓ with radius δ_ℓ. Then they construct a local regression model, m̂, based on the training set S_ℓ. Next, they construct a trust-region model based on m̂ and compute the trial step for the solution update.

Another framework is proposed by Bertsimas and Kallus [12], who study a fusion of optimization and machine learning for solving

    min_{x∈X} E_{ξ̃∼μ_{ξ̃|ω}}[ c(x, ξ̃) | ω̃ = ω ],

where c(x, ξ) is a cost function of the decision x and response ξ.
They introduce a new type of SAA analog with predictive power and then apply the model to the inventory management problem. In particular, they propose a non-parametric approximation of (5.1) as follows:

    min_{x∈X} Σ_{i=1}^N v_{N,i}(ω) c(x, ξ_i),    (5.3)

where v_{N,i}(ω) is calculated by some non-parametric estimation method (e.g., in the case of kNN estimation, v_{N,i}(ω) = (1/k_N) I(ω_i ∈ S(k_N, ω; {ω_i}_{i=1}^N)), where S(k_N, ω; {ω_i}_{i=1}^N) denotes the kNN set of ω from the set {ω_i}_{i=1}^N). Additionally, Bertsimas and McCord [13] further extend the framework to multistage SP. We summarize three popular PSP frameworks in Table 5.1.

On the other hand, Diao and Sen [27] design an online first-order method with non-parametric estimation that simultaneously refines subgradient estimates and estimated solutions, which they refer to as Learning Enabled Optimization with Non-parametric Approximation (LEON). In this chapter, we focus on extending LEON to nonconvex SP problems in which the first stage allows concave costs, in order to accommodate costs with decreasing returns to scale. Indeed, we shall design an online SP algorithm with "predictive" power and study an efficient way to reuse previous non-parametric estimates to enhance future estimates in a more "dynamic" way. Therefore, we refer to this type of methodology as PSP in this chapter.

To motivate the methodology for solving PSP in (5.1), we introduce its classic stochastic programming counterpart as follows:

    min_{x∈X} ζ(x) ≜ E_{ξ̃}[ F(x, ξ̃) + H(x, ξ̃) ].    (5.4)

Problem (5.4) not only includes the classical two-stage stochastic linear program (SLP) as a special case, but also covers two-stage stochastic linear programs with a deterministic nonconvex first-stage objective function. Two-stage stochastic programming has been applied in many areas, such as network capacity planning (Sen et al. [97]), water management (Huang and Loucks [51]), logistics (Barbarosoğlu and Arda [5]), power transmission (Phan and Gosh [87]), and others.
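The kNN weights v_{N,i}(ω) used in (5.3) are straightforward to compute: each of the k nearest covariates receives weight 1/k and all others receive 0. The following sketch is a minimal NumPy illustration (not the authors' code); the cost function `cost` is a generic placeholder.

```python
import numpy as np

def knn_weights(omega, omegas, k):
    """v_{N,i}(omega) = 1/k if omega_i is among the k nearest neighbors of omega, else 0."""
    dist = np.linalg.norm(omegas - omega, axis=1)
    nn = np.argsort(dist, kind="stable")[:k]   # stable sort = tie-breaking by index
    v = np.zeros(len(omegas))
    v[nn] = 1.0 / k
    return v

def weighted_objective(x, omega, omegas, xis, k, cost):
    """Evaluate sum_i v_{N,i}(omega) * c(x, xi_i), the objective of (5.3)."""
    v = knn_weights(omega, omegas, k)
    return sum(v[i] * cost(x, xis[i]) for i in range(len(xis)))

rng = np.random.default_rng(0)
omegas = rng.uniform(0, 1, size=(50, 2))       # synthetic covariates
xis = rng.uniform(0, 1, size=50)               # synthetic responses
v = knn_weights(np.array([0.5, 0.5]), omegas, k=7)
```

Because the weights are nonnegative and sum to one, (5.3) is simply a reweighted SAA, which is the property exploited throughout this chapter.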
Recently, the majorization-minimization (MM) method [2, 21, 62, 64, 66, 67, 72, 73, 85, 91, 107] has attracted considerable attention in the machine learning (ML) and SP communities. One core step of the MM method can be described as follows: a (usually convex) upper-bounding function of the (often nonconvex) original objective is created at the current iterate, and then the upper-bounding function is minimized over the feasible region to obtain the next iterate. For readers who are interested in MM algorithms for SP, we recommend Chapter 10 of the monograph by Cui and Pang [21].

On the other hand, the Stochastic Decomposition (SD, [46, 47, 69, 98]) algorithm is a successive approximation-decomposition algorithm for solving two-stage stochastic linear programming models. SD successively constructs a local outer (lower-bounding) approximation of the sampled approximation of the second-stage cost-to-go function by using previously calculated stochastic subgradients at the candidate/incumbent solutions.

5.2 Scope and Contributions

The goal of this chapter is to investigate a computational methodology which can deliver a d(irectional)-stationary point of the PSP problem in (5.1); we refer to such a decision as a solution to the PSP problem. Moreover, our approach combines both estimation and optimization, with the former carried out using non-parametric methods (e.g., kNN estimation), and the latter made possible via a combination of a stochastic programming algorithm (SD) and the nonconvex optimization algorithm commonly known as MM (majorization-minimization). The manner in which these pieces (estimation-and-optimization) are integrated within a convergent algorithm is one of the main contributions of this work. In this chapter, we shall answer the following questions: (i) How should one ensure that the locally lower-bounding approximation of the second-stage cost-to-go function is "good" enough per iteration so that the algorithm converges?
(ii) How should one synthesize kNN estimation, the SD algorithm, and the MM algorithm (NSD-MM) to obtain a d-stationary point of the two-stage nonconvex PSP problem in (5.1)? (iii) How can previous kNN estimates of the objective function be reused efficiently for streaming data applications?

As suggested in the preceding paragraph, the contributions of this chapter are two-fold: (1) we propose a unique double-loop algorithm that combines kNN, SD, and MM to solve nonconvex PSP; (2) we extend the boundary of SD and provide a more general methodology and proof.

The chapter is organized as follows. In Section 5.3, we design a fusion of the kNN method, the SD algorithm, and the MM algorithm (NSD-MM) to solve the PSP problem in (5.1). Finally, in Section 5.4, we apply the NSD-MM algorithm to numerically solve a class of two-stage stochastic programs and two-stage predictive stochastic programs.

5.2.1 Technical Preliminaries and Notations

Without further specification, we shall let ∥·∥ denote the Euclidean norm of a vector and the spectral norm of a matrix. Let ξ̃ : Ω → Y ⊂ R^{m₂} be a random vector built upon the probability space (Ω, Σ_Ω, P). We let μ_ξ̃ denote the distribution of the random vector ξ̃ and let ξ denote a realization of ξ̃. When we say "for almost every ξ ∈ Y", we mean for μ_ξ̃-almost every ξ ∈ Y. Throughout, we assume that X is a compact convex set.

Definition 5.1. A function F : R^n → R is L-smooth if it is differentiable and its gradient is L-Lipschitz continuous on R^n (i.e., ∥∇F(x) − ∇F(y)∥ ≤ L∥x − y∥ for all x, y ∈ R^n).

We define a class of surrogate functions of F(x, ξ) near x′ below.

Definition 5.2. Let F : S × Y → R be an L-smooth/differentiable concave function on X, where S is a superset of X. G(x, ξ; x′) is a convex surrogate function of F(x, ξ) near x′ ∈ X if it satisfies the following conditions:
M1  G(x′, ξ; x′) = F(x′, ξ).
M2  G(x, ξ; x′) ≥ F(x, ξ), ∀ x ∈ X.
M3  G(λx₁ + (1−λ)x₂, ξ; x′) ≤ λ G(x₁, ξ; x′) + (1−λ) G(x₂, ξ; x′), ∀ x₁, x₂ ∈ X, ∀ λ ∈ [0, 1].
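One standard surrogate satisfying M1–M3 for a concave, differentiable F is the first-order linearization G(x, ξ; x′) = F(x′, ξ) + ∇_x F(x′, ξ)^⊤(x − x′): it is tight at x′ (M1), lies above F by concavity (M2), and is affine, hence convex, in x (M3). A small numerical check, using a hypothetical concave function F(x, ξ) = −ξ∥x∥² chosen purely for illustration:

```python
import numpy as np

def F(x, xi):
    return -xi * np.dot(x, x)              # concave in x for xi > 0

def grad_F(x, xi):
    return -2.0 * xi * x

def G(x, xi, x_prime):
    """Linearization surrogate: G(x, xi; x') = F(x', xi) + grad F(x', xi)^T (x - x')."""
    return F(x_prime, xi) + grad_F(x_prime, xi) @ (x - x_prime)

xi = 2.0
x_prime = np.array([1.0, -1.0])
# M1: the surrogate is tight at x'
m1_gap = abs(G(x_prime, xi, x_prime) - F(x_prime, xi))
# M2: the surrogate majorizes F (checked on random sample points)
rng = np.random.default_rng(1)
m2_holds = all(G(x, xi, x_prime) >= F(x, xi) - 1e-12 for x in rng.normal(size=(100, 2)))
```

For this F, the gap G − F equals ξ∥x − x′∥² ≥ 0, so M2 holds exactly; M3 is automatic because G is affine in x.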
We define the directional derivative of a function ζ below.

Definition 5.3. Let ζ : S → R, where S is a convex open set in R^n. Suppose that the directional derivative of ζ at x ∈ X in the direction d ∈ R^n exists; it is defined by

    ζ′(x; d) = lim_{τ↓0} [ζ(x + τd) − ζ(x)] / τ.

The d(irectional)-stationary point of problem (5.4) is defined as follows.

Definition 5.4. x̄ ∈ X is a d(irectional)-stationary point of (5.4) if

    ζ′(x̄; x − x̄) ≥ 0, ∀ x ∈ X.

In the PSP model, let X be a Borel subset of R^{m₁} and let Y be a Borel subset of R^{m₂}. Let (Ω, Σ_Ω, P) be the probability space of the correlated continuous random vectors ω̃ and ξ̃ taking values in the measurable spaces (X, Σ_X) and (Y, Σ_Y), respectively. In particular, the tuple (ω̃, ξ̃) : Ω → X × Y is a random vector taking values in the product space (X × Y, Σ_X ⊗ Σ_Y). Let the joint distribution of (ω̃, ξ̃) be μ_{ω̃,ξ̃}. Let μ_ω̃ and μ_ξ̃ denote the marginal distributions of ω̃ and ξ̃, respectively, and let μ_{ξ̃|ω} denote the conditional distribution of ξ̃ given ω̃ = ω. We refer readers who are interested in joint distributions and conditional distributions to Çınlar [20] and Durrett [29]. As a result, for the problem parametrized by ω, we shall define the directional derivative with respect to the first argument, x, as follows.

Definition 5.5. Let ζ_ω(x) : S × S_ω → R, where S is a convex open set in R^n and S_ω is a subset of R^{n_ω}. For a given observation ω ∈ S_ω, suppose that the directional derivative of ζ_ω(·) at x ∈ X in the direction d ∈ R^n exists; it is defined by

    ζ′_ω(x; d) = lim_{τ↓0} [ζ_ω(x + τd) − ζ_ω(x)] / τ.

Definition 5.6. x̄ is a d(irectional)-stationary point of (5.1) if

    ζ′_ω(x̄; x − x̄) ≥ 0, ∀ x ∈ X.

5.3 Nonconvex Predictive Stochastic Programming

Recall that in the PSP model, we consider random vectors which appear as a tuple (ω̃, ξ̃) : Ω → X × Y, where ω̃ correlates with the previously introduced response, ξ̃.
This section first introduces a base MM algorithm for solving nonconvex PSP problems and then combines it with the SD and MM algorithms to design an efficient method for solving the PSP problem in (5.1). We shall refer to the base MM algorithm as N-LEON.

5.3.1 N-LEON

Here, we consider solving a PSP of the form

    min_{x∈X} f_ω(x) = E_{ξ̃∼μ_{ξ̃|ω}}[ F(x, ξ̃) | ω̃ = ω ].    (5.5)

We make the following assumptions on the objective function (F : X × Y → R) and the tuple (ω̃, ξ̃).

A1 X is a compact convex set.
A2 |F(x, ξ̃)| is uniformly bounded with probability one (i.e., there exists M such that |F(x, ξ̃)| ≤ M < ∞ ∀ x ∈ X w.p.1).
A3 For almost every ξ̃, F(x, ξ̃) is Lipschitz continuous on X with a common Lipschitz modulus, Lip_F.
A4 F(·, ξ̃) is directionally differentiable on X for almost every ξ̃.
A5 (ω̃, ξ̃) follows a joint distribution μ_{ω̃,ξ̃}. The regular conditional distribution of ξ̃ given ω exists and is denoted by μ_{ξ̃|ω}.

Assumptions A1–A4 are standard in the SP literature. Note that the smoothness assumption on F(·, ξ) on X is not necessary in this subsection. Assumption A5 ensures that the conditional expectation in (5.5) is well defined. Given a dataset S_n ≜ {(ω_i, ξ_i)}_{i=1}^n, where {(ω_i, ξ_i)}_{i=1}^n are n realizations of i.i.d. copies of (ω̃, ξ̃) ∼ μ_{ω̃,ξ̃}, we define the k-nearest-neighbors estimate of f_ω(x) as

    f^{k_n}_{ω,n}(x) = (1/k_n) Σ_{i=1}^n I(ω_i ∈ S(k_n, ω; {ω_i}_{i=1}^n)) F(x, ξ_i).    (5.6)

First introduced by Fix and Hodges [36], the kNN method has been widely used in hedonic housing pricing (Oladunni and Sharma [82]), traffic flow prediction (Smith and Demetsky [103]), battery capacity prediction (Hu et al. [49]), wind power forecasting (Mangalova and Agafonov [75]), and others. On the other hand, the asymptotic properties of the kNN method have been extensively studied by Stone [104], Devroye et al. [26], Györfi et al. [44], Walk [109], and others.
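The estimator in (5.6) simply averages F(x, ξ_i) over the k_n responses whose covariates are nearest to ω. A minimal sketch on a synthetic model ξ̃ = 2ω̃ + noise (chosen here only to illustrate the pointwise behavior; for this model the true conditional mean at ω = 0.5 is 1):

```python
import numpy as np

def knn_estimate(omega, omegas, xis, k, F, x):
    """f^{k_n}_{omega,n}(x): average of F(x, xi_i) over the k nearest covariates of omega."""
    nn = np.argsort(np.abs(omegas - omega), kind="stable")[:k]
    return np.mean([F(x, xis[i]) for i in nn])

rng = np.random.default_rng(42)
n = 20000
k = int(n ** 0.6)                              # k_n = floor(n^beta) with beta in (0, 1)
omegas = rng.uniform(0.0, 1.0, n)              # synthetic covariates
xis = 2.0 * omegas + rng.normal(0.0, 0.1, n)   # synthetic conditional model
F = lambda x, xi: xi                           # F independent of x, so f_omega(x) = E[xi | omega]
est = knn_estimate(0.5, omegas, xis, k, F, x=None)
```

With k_n → ∞ and k_n/n → 0, the estimate concentrates around E[ξ̃ | ω̃ = 0.5] = 1, as one would expect from the strong consistency results for kNN estimators cited above.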
Now we introduce the pointwise convergence of the kNN estimator, which is fundamental to the convergence of the proposed algorithms for PSP.

Theorem 5.7. Suppose that Assumptions A1–A5 are satisfied. Further suppose {(ω_i, ξ_i)}_{i=1}^n i.i.d. ∼ μ_{ω̃,ξ̃}. Let k_n be monotonically increasing with n, k_n → ∞, k_n/n → 0 (as n → ∞), and let (k_n) vary regularly with exponent β ∈ (0, 1] (e.g., k_n = ⌊n^β⌋, β ∈ (0, 1)). Then the following holds:

    lim_{n→∞} (1/k_n) Σ_{i=1}^n I(ω̃_i ∈ S(k_n, ω; {ω̃_j}_{j=1}^n)) F(x, ξ̃_i) = E_{ξ̃∼μ_{ξ̃|ω}}[ F(x, ξ̃) | ω̃ = ω ], w.p.1.

Proof. See Theorem 1 of Walk [109].

One key benefit of using kNN is that (1/k_n) Σ_{i=1}^n I(ω_i ∈ S(k_n, ω; {ω_i}_{i=1}^n)) = 1 and (1/k_n) I(ω_i ∈ S(k_n, ω; {ω_i}_{i=1}^n)) ≥ 0 for all i ∈ {1, 2, ..., n}, which implies that many results which hold for sample averaging can also be transferred to an algorithm powered by kNN. Under mild assumptions, we can show the uniform convergence of the kNN estimate of the objective function.

Proposition 5.8. Suppose Assumptions A1–A5 hold. Then lim_{n→∞} f^{k_n}_{ω,n}(x) = f_ω(x) uniformly on X with probability one.

Proof. Assumption A3 implies that each f^{k_n}_{ω,n}(·) is Lipschitz continuous with a common modulus, Lip_F, which further implies the equicontinuity of {f^{k_n}_{ω,n}(x)}_n. By Theorem 5.7, {f^{k_n}_{ω,n}(x)}_n converges pointwise to E_{ξ̃∼μ_{ξ̃|ω}}[F(x, ξ̃) | ω̃ = ω] on X with probability one. Finally, equicontinuity together with pointwise convergence of {f^{k_n}_{ω,n}(x)}_n implies uniform convergence.

In the following algorithm, we let {x_ℓ}_ℓ and {x̄_ℓ}_ℓ denote the sequences of candidate and incumbent decisions, respectively. Given the incumbent solution, x̄_ℓ, at the beginning of the ℓ-th iteration, we solve the following regularized problem to get a candidate solution, x_{ℓ+1}:

    x_{ℓ+1} = argmin_{x∈X} (1/k_ℓ) Σ_{i=1}^ℓ I(ω_i ∈ S(k_ℓ, ω; {ω_i}_{i=1}^ℓ)) G(x, ξ_i; x̄_ℓ) + (c/2)∥x − x̄_ℓ∥².    (5.7)

Algorithm 19 N-LEON-kNN
(Initialization) Pick x̄₁ ∈ X, c > 0, r ∈ (0, 1), and β ∈ (0, 1). Let ℓ = 1. Let k_ℓ = ⌊ℓ^β⌋.
Generate (ω₁, ξ₁) ∼ μ_{ω̃,ξ̃}.
(Step 1: Candidate Selection) Solve (5.7) to get x_{ℓ+1}.
(Step 2) Generate (ω_{ℓ+1}, ξ_{ℓ+1}) ∼ μ_{ω̃,ξ̃}, independent of the past samples.
(Step 3: Incumbent Selection)
if f^{k_{ℓ+1}}_{ω,ℓ+1}(x_{ℓ+1}) − f^{k_{ℓ+1}}_{ω,ℓ+1}(x̄_ℓ) < r [ f^{k_ℓ}_{ω,ℓ}(x_{ℓ+1}) − f^{k_ℓ}_{ω,ℓ}(x̄_ℓ) ] then
    Set x̄_{ℓ+1} = x_{ℓ+1}
else
    Set x̄_{ℓ+1} = x̄_ℓ
end if
(Step 4) Set ℓ ← ℓ + 1, k_ℓ ← ⌊ℓ^β⌋, and go to Step 1.

The "relative" descent is ensured by solving (5.7), as summarized in the proposition below.

Proposition 5.9. {x_ℓ}_ℓ and {x̄_ℓ}_ℓ generated by Algorithm 19 satisfy the following relation:

    f^{k_ℓ}_{ω,ℓ}(x_{ℓ+1}) − f^{k_ℓ}_{ω,ℓ}(x̄_ℓ) ≤ −(c/2)∥x_{ℓ+1} − x̄_ℓ∥².    (5.8)

Proof. By the optimality of Candidate Selection in Algorithm 19, we have

    (1/k_ℓ) Σ_{i=1}^ℓ I(ω_i ∈ S(k_ℓ, ω; {ω_i}_{i=1}^ℓ)) G(x_{ℓ+1}, ξ_i; x̄_ℓ) + (c/2)∥x_{ℓ+1} − x̄_ℓ∥²
      ≤ (1/k_ℓ) Σ_{i=1}^ℓ I(ω_i ∈ S(k_ℓ, ω; {ω_i}_{i=1}^ℓ)) G(x̄_ℓ, ξ_i; x̄_ℓ).    (5.9)

The final result follows from the definition of the surrogate function in Definition 5.2.

Note that f^{k_n}_{ω,n}(x) is a biased estimate of f_ω(x), and it is unclear whether E_{S̃_n}[∥∇_x f^{k_n}_{ω,n}(x) − ∇_x f_ω(x)∥²] can be bounded by some function that depends on the sample size n. Although Györfi et al. [44, Theorem 6.2] derive the rate of convergence of kNN estimation in terms of E_{S̃_n}[∫ (f^{k_n}_{ω,n}(x) − E_{ξ̃∼μ_{ξ̃|ω}}[F(x, ξ̃) | ω̃ = ω])² μ_ω̃(dω)] for a given x, the convergence rate of |f^{k_n}_{ω,n}(x) − E_{ξ̃∼μ_{ξ̃|ω}}[F(x, ξ̃) | ω̃ = ω]| is still unknown. To overcome this challenge, the algorithm generates a sequence of candidate decisions and a sequence of incumbent decisions. We use the incumbent selection rule based on the regularized SD in Higle and Sen [47] to recover a sequence of convergent incumbents. Mathematically speaking, we construct a new kNN estimate of the objective after Candidate Selection, say f^{k_{ℓ+1}}_{ω,ℓ+1} in the ℓ-th iteration. Then we check whether f^{k_{ℓ+1}}_{ω,ℓ+1}(x_{ℓ+1}) − f^{k_{ℓ+1}}_{ω,ℓ+1}(x̄_ℓ) < r [ f^{k_ℓ}_{ω,ℓ}(x_{ℓ+1}) − f^{k_ℓ}_{ω,ℓ}(x̄_ℓ) ] (r ∈ (0, 1)).
If so, we update x̄_{ℓ+1} = x_{ℓ+1}; otherwise, we set x̄_{ℓ+1} = x̄_ℓ. Powered by this incumbent selection rule, we show an important limiting property of the estimated function values at the candidate and incumbent decisions (i.e., {x_ℓ}_ℓ and {x̄_ℓ}_ℓ, respectively). The proof of the following proposition is inspired by Lemma 6 in [46].

Proposition 5.10. Suppose Assumptions A1–A5 hold. Let {x̄_ℓ}_ℓ and {x_ℓ}_ℓ denote the sequences of incumbent and candidate decisions generated by N-LEON-kNN in Algorithm 19, respectively. Then

    limsup_{ℓ→∞} f^{k_ℓ}_{ω,ℓ}(x_{ℓ+1}) − f^{k_ℓ}_{ω,ℓ}(x̄_ℓ) = 0 w.p.1.    (5.10)

Proof. Case 1: {x̄_ℓ} changes finitely often. Then there exist L > 0 and x̄ ∈ X such that x̄_ℓ = x̄ for all ℓ ≥ L. The incumbent selection step implies that

    f^{k_ℓ}_{ω,ℓ}(x_ℓ) − f^{k_ℓ}_{ω,ℓ}(x̄) ≥ r [ f^{k_{ℓ−1}}_{ω,ℓ−1}(x_ℓ) − f^{k_{ℓ−1}}_{ω,ℓ−1}(x̄) ], ∀ ℓ ≥ L + 1.    (5.11)

Since X is compact, an accumulation point of {x_ℓ} exists. Let {x_ℓ}_{ℓ∈L} be a subsequence whose accumulation point is x̂. By Proposition 5.8, we have lim_{ℓ(∈L)→∞} f^{k_ℓ}_{ω,ℓ}(x_ℓ) = lim_{ℓ(∈L)→∞} f^{k_{ℓ−1}}_{ω,ℓ−1}(x_ℓ) = f_ω(x̂) and lim_{ℓ(∈L)→∞} f^{k_{ℓ−1}}_{ω,ℓ−1}(x̄) = lim_{ℓ(∈L)→∞} f^{k_ℓ}_{ω,ℓ}(x̄) = f_ω(x̄). Letting ℓ(∈L) → ∞, (5.11) implies that

    f_ω(x̂) − f_ω(x̄) ≥ r [ f_ω(x̂) − f_ω(x̄) ].    (5.12)

Since f^{k_{ℓ−1}}_{ω,ℓ−1}(x_ℓ) − f^{k_{ℓ−1}}_{ω,ℓ−1}(x̄) ≤ 0 by Proposition 5.9, we have limsup_{ℓ→∞} [ f^{k_{ℓ−1}}_{ω,ℓ−1}(x_ℓ) − f^{k_{ℓ−1}}_{ω,ℓ−1}(x̄) ] ≤ 0, which implies that f_ω(x̂) − f_ω(x̄) = lim_{ℓ(∈L)→∞} [ f^{k_{ℓ−1}}_{ω,ℓ−1}(x_ℓ) − f^{k_{ℓ−1}}_{ω,ℓ−1}(x̄) ] ≤ 0. Thus, combining with (5.12), we have f_ω(x̂) − f_ω(x̄) = 0. This completes the proof in Case 1.

Case 2: {x̄_ℓ} changes infinitely often. Let {ℓ_n} denote the subsequence of iterations at which the incumbent changes. By the incumbent selection rule, we have

    f^{k_{ℓ_n}}_{ω,ℓ_n}(x_{ℓ_n}) − f^{k_{ℓ_n}}_{ω,ℓ_n}(x̄_{ℓ_n−1}) ≤ r [ f^{k_{ℓ_n−1}}_{ω,ℓ_n−1}(x_{ℓ_n}) − f^{k_{ℓ_n−1}}_{ω,ℓ_n−1}(x̄_{ℓ_n−1}) ] ≤ 0.    (5.13)

Let θ_{ℓ_n} = f^{k_{ℓ_n−1}}_{ω,ℓ_n−1}(x_{ℓ_n}) − f^{k_{ℓ_n−1}}_{ω,ℓ_n−1}(x̄_{ℓ_n−1}).
Then for m > 0,

    (1/m) Σ_{n=1}^m [ f^{k_{ℓ_n}}_{ω,ℓ_n}(x̄_{ℓ_n}) − f^{k_{ℓ_n}}_{ω,ℓ_n}(x̄_{ℓ_n−1}) ] ≤ (r/m) Σ_{n=1}^m θ_{ℓ_n} ≤ 0.    (5.14)

Also note that

    (1/m) Σ_{n=1}^m [ f^{k_{ℓ_n}}_{ω,ℓ_n}(x̄_{ℓ_n}) − f^{k_{ℓ_n}}_{ω,ℓ_n}(x̄_{ℓ_n−1}) ]
      = (1/m) ( Σ_{n=1}^{m−1} [ f^{k_{ℓ_n}}_{ω,ℓ_n}(x̄_{ℓ_n}) − f^{k_{ℓ_{n+1}}}_{ω,ℓ_{n+1}}(x̄_{ℓ_n}) ] + f^{k_{ℓ_m}}_{ω,ℓ_m}(x̄_{ℓ_m}) − f^{k_{ℓ_1}}_{ω,ℓ_1}(x̄_{ℓ_0}) )
      ≥ (1/m) Σ_{n=1}^{m−1} [ f^{k_{ℓ_n}}_{ω,ℓ_n}(x̄_{ℓ_n}) − f^{k_{ℓ_{n+1}}}_{ω,ℓ_{n+1}}(x̄_{ℓ_n}) ] − 2M/m.    (5.15)

The last inequality holds by Assumption A2. By Proposition 5.8, lim_{n→∞} ( f^{k_{ℓ_n}}_{ω,ℓ_n}(x̄_{ℓ_n}) − f^{k_{ℓ_{n+1}}}_{ω,ℓ_{n+1}}(x̄_{ℓ_n}) ) = 0 w.p.1. Thus, letting m → ∞, it follows from (5.15) that

    lim_{m→∞} (1/m) Σ_{n=1}^m [ f^{k_{ℓ_n}}_{ω,ℓ_n}(x̄_{ℓ_n}) − f^{k_{ℓ_n}}_{ω,ℓ_n}(x̄_{ℓ_n−1}) ] = 0 w.p.1.

Thus, (5.14) implies that

    0 = lim_{m→∞} (1/m) Σ_{n=1}^m [ f^{k_{ℓ_n}}_{ω,ℓ_n}(x̄_{ℓ_n}) − f^{k_{ℓ_n}}_{ω,ℓ_n}(x̄_{ℓ_n−1}) ] ≤ lim_{m→∞} (r/m) Σ_{n=1}^m θ_{ℓ_n} ≤ 0 w.p.1.    (5.16)

Equation (5.16) implies that 0 = limsup_{m→∞} (1/m) Σ_{n=1}^m θ_{ℓ_n} ≤ limsup_{m→∞} θ_{ℓ_m} ≤ 0 w.p.1, which completes the proof.

5.3.1.1 Convergence Analysis

Analogous to Theorem 7.44 in [102], we obtain the kNN version of the results by using the "strong law of large numbers" for kNN estimators (Theorem 5.7).

Theorem 5.11. If Assumptions A1–A5 hold, then for any x̂ ∈ X, f_ω(·) is directionally differentiable at x̂ and

    f′_ω(x̂; x − x̂) = E_{ξ̃∼μ_{ξ̃|ω}}[ F′_{ξ̃}(x̂; x − x̂) | ω̃ = ω ], ∀ x ∈ X.    (5.17)

Furthermore, if Assumption A4 is replaced by "F(·, ξ̃) is differentiable on X for almost every ξ̃", then the following alternative conclusion holds: for any x̂ ∈ X, f_ω(·) is differentiable at x̂ and

    ∇_x f_ω(x̂) = E_{ξ̃∼μ_{ξ̃|ω}}[ ∇_x F(x̂, ξ̃) | ω̃ = ω ].    (5.18)

Proof. The proof is similar to the proof of Theorem 7.44 (b) and (c) of [102]. The only difference is that we use the regular conditional distribution to take the integral when applying the Dominated Convergence Theorem.
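To make the mechanics of Algorithm 19 concrete, the sketch below performs one candidate-selection step (5.7) for the concave case F(x, ξ) = −ξ∥x∥² with its linearization surrogate, over a box X = [0, 1]^n. Since the surrogate is affine in x, the regularized subproblem reduces to the closed-form proximal step proj_X(x̄ − g/c), where g is the kNN-weighted gradient at x̄. All problem data here are hypothetical and chosen only for illustration.

```python
import numpy as np

def knn_idx(omega, omegas, k):
    return np.argsort(np.linalg.norm(omegas - omega, axis=1), kind="stable")[:k]

def candidate_step(x_bar, omega, omegas, xis, k, c, grad_F):
    """argmin_x (1/k) sum_{i in kNN} G(x, xi_i; x_bar) + (c/2)||x - x_bar||^2 over X = [0,1]^n.
    With an affine surrogate G, this is a projected proximal-gradient step."""
    g = np.mean([grad_F(x_bar, xis[i]) for i in knn_idx(omega, omegas, k)], axis=0)
    return np.clip(x_bar - g / c, 0.0, 1.0)

def knn_objective(x, omega, omegas, xis, k, F):
    return np.mean([F(x, xis[i]) for i in knn_idx(omega, omegas, k)])

rng = np.random.default_rng(7)
omegas = rng.uniform(size=(200, 2))
xis = 1.0 + 0.1 * rng.normal(size=200)         # positive xi keeps F concave
F = lambda x, xi: -xi * np.dot(x, x)
gF = lambda x, xi: -2.0 * xi * x
omega0 = np.array([0.5, 0.5])
x_bar = np.full(2, 0.5)
x_cand = candidate_step(x_bar, omega0, omegas, xis, k=14, c=1.0, grad_F=gF)
descent = (knn_objective(x_cand, omega0, omegas, xis, 14, F)
           - knn_objective(x_bar, omega0, omegas, xis, 14, F))
```

In a full iteration, a fresh sample would then be drawn and the incumbent test of Step 3 applied to decide whether x_cand replaces the incumbent; here `descent < 0` reflects the guaranteed relative descent of Proposition 5.9.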
With Propositions 5.9 and 5.10 providing a subsequence along which the distance between candidates and incumbents reduces to zero, we show the convergence of N-LEON-kNN in the following proposition under a structural assumption on the objective function.

Proposition 5.12. Suppose F(x, ξ) = ψ₁(x, ξ) − ψ₂(x, ξ), where ψ₁ : S × Y → R and ψ₂ : S × Y → R are convex in x on an open set S ⊆ R^n and S is a superset of X. Further assume that Assumptions A1–A5 hold and that ψ₂(·, ξ) is differentiable on S for almost every ξ ∈ Y. Then, with probability one, Algorithm 19 produces a subsequence of {x̄_ℓ} such that lim_{ℓ(∈L)→∞} ∥x_{ℓ+1} − x̄_ℓ∥ = 0, and any accumulation point of such a subsequence is a d-stationary point of (5.5).

Proof. The surrogate function is

    G(x, ξ; x′) = ψ₁(x, ξ) − [ ψ₂(x′, ξ) + ∇_x ψ₂(x′, ξ)^⊤ (x − x′) ].    (5.19)

Let φ_{ω,1}(x) = E_{ξ̃∼μ_{ξ̃|ω}}[ψ₁(x, ξ̃) | ω̃ = ω] and φ_{ω,2}(x) = E_{ξ̃∼μ_{ξ̃|ω}}[ψ₂(x, ξ̃) | ω̃ = ω]. By Proposition 5.10, there exists L such that lim_{ℓ(∈L)→∞} ∥x_{ℓ+1} − x̄_ℓ∥ = 0. In the ℓ-th iteration, the optimality condition of Candidate Selection in Algorithm 19 implies that for any x ∈ X, we have

    0 ≤ c(x_{ℓ+1} − x̄_ℓ)^⊤ (x − x_{ℓ+1})
        + (1/k_ℓ) Σ_{i=1}^ℓ I(ω_i ∈ S(k_ℓ, ω; {ω_i}_{i=1}^ℓ)) [ ψ′_{ξ_i,1}(x_{ℓ+1}; x − x_{ℓ+1}) − ∇_x ψ₂(x̄_ℓ, ξ_i)^⊤ (x − x_{ℓ+1}) ].    (5.20)

For L′ ⊂ L such that lim_{ℓ(∈L′)→∞} x̄_ℓ = x̄_∞, letting ℓ(∈L′) → ∞, the pointwise convergence of the kNN estimator (Theorem 5.7) gives

    lim_{ℓ(∈L′)→∞} (1/k_ℓ) Σ_{i=1}^ℓ I(ω_i ∈ S(k_ℓ, ω; {ω_i}_{i=1}^ℓ)) ψ′_{ξ_i,1}(x_{ℓ+1}; x − x_{ℓ+1}) = E_{ξ̃∼μ_{ξ̃|ω}}[ ψ′_{ξ̃,1}(x̄_∞; x − x̄_∞) | ω̃ = ω ],

and lim_{ℓ(∈L′)→∞} (1/k_ℓ) Σ_{i=1}^ℓ I(ω_i ∈ S(k_ℓ, ω; {ω_i}_{i=1}^ℓ)) ∇_x ψ₂(x̄_ℓ, ξ_i) = E_{ξ̃∼μ_{ξ̃|ω}}[ ∇_x ψ₂(x̄_∞, ξ̃) | ω̃ = ω ]. It follows from (5.20) that

    0 ≤ E_{ξ̃∼μ_{ξ̃|ω}}[ ψ′_{ξ̃,1}(x̄_∞; x − x̄_∞) | ω̃ = ω ] − E_{ξ̃∼μ_{ξ̃|ω}}[ ∇_x ψ₂(x̄_∞, ξ̃) | ω̃ = ω ]^⊤ (x − x̄_∞).
(5.21)

By Theorem 5.11, we have

    E_{ξ̃∼μ_{ξ̃|ω}}[ ψ′_{ξ̃,1}(x̄_∞; x − x̄_∞) | ω̃ = ω ] = φ′_{ω,1}(x̄_∞; x − x̄_∞),    (5.22a)
    E_{ξ̃∼μ_{ξ̃|ω}}[ ∇_x ψ₂(x̄_∞, ξ̃) | ω̃ = ω ] = ∇_x φ_{ω,2}(x̄_∞).    (5.22b)

Thus, the combination of (5.21) and (5.22) implies that

    0 ≤ φ′_{ω,1}(x̄_∞; x − x̄_∞) − ∇_x φ_{ω,2}(x̄_∞)^⊤ (x − x̄_∞).    (5.23)

This shows that x̄_∞ is a d-stationary point of (5.5).

Remark: Note that the same results hold under other structural assumptions, such as "H(x, ξ; x′) ≜ G(x, ξ; x′) − F(x, ξ), where H(x, ξ; x_ℓ) is L-smooth and ∇_x H(x′, ξ; x′) = 0 for all x′ ∈ X and ξ ∈ Y".

5.3.2 NSD-MM

In this subsection, we transition from the N-LEON-kNN algorithm to a decomposition-based algorithm, enhanced by the MM and kNN methods, for solving the PSP problem in (5.1):

    min_{x∈X} E_{ξ̃∼μ_{ξ̃|ω}}[ F(x, ξ̃) + H(x, ξ̃) | ω̃ = ω ].    (5.24)

We let F : S × Y → R be an L-smooth/differentiable concave function, where S is a superset of X, and let H(x, ξ) be a pointwise maximum of finitely many linear functions of x. To the end of this subsection, we shall assume that Assumption A5 holds. Additionally, we assume that:

B1 X is a convex compact (possibly polyhedral) set.
B2 For almost every ξ ∈ Y, F(·, ξ) is differentiable on S and there exists a finite κ_f > 0 such that ∥∇F(x, ξ)∥ ≤ κ_f for all x ∈ X.
B3 There exists f̄ ∈ (0, ∞) such that |F(x, ξ)| ≤ f̄ on X for almost every ξ ∈ Y.
B4 For almost every ξ ∈ Y, the set {y : Dy = e(ξ) − C(ξ)x, y ≥ 0} ≠ ∅ for all x ∈ X. Furthermore, H(x, ξ) ≥ 0 for all x ∈ X and almost every ξ ∈ Y.
B5 The subdifferential of H with respect to x is nonempty on X for almost every ξ ∈ Y (i.e., ∂_x H(x, ξ) ≠ ∅).
B6 There exist κ_e, κ_C ∈ (0, ∞) such that ∥e(ξ)∥ ≤ κ_e and ∥C(ξ)∥ ≤ κ_C for almost every ξ ∈ Y. The dual feasible region of the second-stage problem, {π : π^⊤ D ≤ d^⊤}, is bounded.
B7 There exists M_h ∈ (0, ∞) such that H(x, ξ) < M_h for all x ∈ X and almost every ξ ∈ Y.

Assumptions B1–B3 are necessary for the uniform convergence of the kNN estimate of the objective function.
Assumptions B4–B6 are common in the two-stage stochastic linear programming literature: Assumption B4 corresponds to the relatively complete recourse assumption, and Assumptions B5 and B6 together ensure that the subdifferential of the recourse function exists and is bounded for all x ∈ X. Assumption B7 is important in updating the piecewise linear approximation of the convex component when kNN is used. Throughout this subsection, we adopt the notations below.

ℓ: iteration number of the outer loop;  ν: iteration number of the inner loop
ω: observed predictor;  N_ℓ: sample size in the ℓ-th outer loop
ν_ℓ: number of inner loops in outer iteration ℓ
{x̄_ℓ}: sequence of incumbents generated by the outer loop
{x^{1/2,ℓ}_ν}: sequence of candidates generated by the inner loops
{x_ℓ}: sequence of candidates generated by the outer loops
f^{k_ℓ}_{ω,ℓ}(x) = (1/k_ℓ) Σ_{i=1}^{N_ℓ} I(ω_i ∈ S(k_ℓ, ω; {ω_i}_{i=1}^{N_ℓ})) F(x, ξ_i)
h^{k_ℓ}_{ω,ℓ}(x) = (1/k_ℓ) Σ_{i=1}^{N_ℓ} I(ω_i ∈ S(k_ℓ, ω; {ω_i}_{i=1}^{N_ℓ})) H(x, ξ_i)
ζ^{k_ℓ}_{ω,ℓ}(x) = f^{k_ℓ}_{ω,ℓ}(x) + h^{k_ℓ}_{ω,ℓ}(x)
f_ω(x) = E_{ξ̃∼μ_{ξ̃|ω}}[F(x, ξ̃) | ω̃ = ω];  h_ω(x) = E_{ξ̃∼μ_{ξ̃|ω}}[H(x, ξ̃) | ω̃ = ω];  ζ_ω(x) = f_ω(x) + h_ω(x)
ĥ^{k_ℓ}_{ω,ℓ,ν}(x): piecewise linear approximation of h^{k_ℓ}_{ω,ℓ} in the ν-th inner iteration of the ℓ-th outer loop
g^{k_ℓ}_{ω,ℓ}(x; x′) = (1/k_ℓ) Σ_{i=1}^{N_ℓ} I(ω_i ∈ S(k_ℓ, ω; {ω_i}_{i=1}^{N_ℓ})) G(x, ξ_i; x′): surrogate function of f^{k_ℓ}_{ω,ℓ}(x) near x′

Table 5.2: Notations in the NSD-MM Algorithm

The NSD-MM algorithm can be regarded as a combination of SD, MM, and N-LEON-kNN. We elaborate on the procedure of NSD-MM in the ℓ-th iteration. Given the incumbent solution, x̄_ℓ, we solve the regularized problem below to obtain a candidate solution:

    x^{1/2,ℓ+1}_ν = argmin_{x∈X} g^{k_ℓ}_{ω,ℓ}(x; x̄_ℓ) + ĥ^{k_ℓ}_{ω,ℓ,ν}(x) + (c/2)∥x − x̄_ℓ∥².    (5.25)

Then we check whether the lower-bounding approximation is accurate enough, in the sense that

    h^{k_ℓ}_{ω,ℓ}(x^{1/2,ℓ+1}_ν) − ĥ^{k_ℓ}_{ω,ℓ,ν}(x^{1/2,ℓ+1}_ν) ≤ (c/4)∥x^{1/2,ℓ+1}_ν − x̄_ℓ∥².
If not, we construct a new piecewise linear approximation as follows:

    ĥ^{k_ℓ}_{ω,ℓ,ν+1}(x) = max{ ĥ^{k_ℓ}_{ω,ℓ,ν}(x), h^{k_ℓ}_{ω,ℓ}(x^{1/2,ℓ+1}_ν) + u_{ω,ℓ}(x^{1/2,ℓ+1}_ν)^⊤ (x − x^{1/2,ℓ+1}_ν) },    (5.26)

where u_{ω,ℓ}(x^{1/2,ℓ+1}_ν) is a subgradient of h^{k_ℓ}_{ω,ℓ}(x) at the candidate solution x^{1/2,ℓ+1}_ν. In other words, we iteratively compute the dual extreme points of the sampled second-stage problems to construct the piecewise linear approximation of h_ω(x). The design of the inner loop is inspired by Higle and Sen [46, 47] and Philpott and Guan [88], since the kNN estimate of E_{ξ̃}[H(x, ξ̃)] in iteration ℓ is equivalent to a finite sum of piecewise linear functions of x, each with finitely many pieces. Therefore, this allows us to iteratively sample the active pieces of the kNN estimate of h_ω(x) and include them in the lower-bounding approximation of h^{k_ℓ}_{ω,ℓ}(x). The finite convergence of the inner loop is stated in the following proposition.

Proposition 5.13. Suppose that Assumptions A5 and B1–B7 hold. With probability one, the inner loop described in Algorithm 22 terminates in a finite number of iterations.

Proof. This is immediate, since h^{k_ℓ}_{ω,ℓ}(x) is a piecewise linear function with finitely many pieces, and each iteration of the inner loop adds a new piece whenever the stopping criterion is not satisfied.

Minorant Construction in NSD-MM. For all i ∈ {1, ..., N_{ℓ+1}}, compute π̄^ℓ_i ∈ argmax{π^⊤(e(ξ_i) − C(ξ_i)x̄_ℓ) : π^⊤ D ≤ d^⊤}
and π^{ℓ+1}_i ∈ argmax{π^⊤(e(ξ_i) − C(ξ_i)x_{ℓ+1}) : π^⊤ D ≤ d^⊤} for i = 1, 2, ..., N_ℓ. The new minorants for the (ℓ+1)-st iteration are constructed using the following formulas:

    u^{k_{ℓ+1}}_{ω,ℓ+1}(x̄_ℓ) = (1/k_{ℓ+1}) Σ_{i=1}^{N_{ℓ+1}} I(ω_i ∈ S(k_{ℓ+1}, ω; {ω_i}_{i=1}^{N_{ℓ+1}})) C(ξ_i)^⊤ π̄^ℓ_i,
    C¹_{ω,ℓ+1}(x) = h^{k_{ℓ+1}}_{ω,ℓ+1}(x̄_ℓ) + u^{k_{ℓ+1}}_{ω,ℓ+1}(x̄_ℓ)^⊤ (x − x̄_ℓ),    (5.27)

    u^{k_{ℓ+1}}_{ω,ℓ+1}(x_{ℓ+1}) = (1/k_{ℓ+1}) Σ_{i=1}^{N_{ℓ+1}} I(ω_i ∈ S(k_{ℓ+1}, ω; {ω_i}_{i=1}^{N_{ℓ+1}})) C(ξ_i)^⊤ π^{ℓ+1}_i,
    C²_{ω,ℓ+1}(x) = h^{k_{ℓ+1}}_{ω,ℓ+1}(x_{ℓ+1}) + u^{k_{ℓ+1}}_{ω,ℓ+1}(x_{ℓ+1})^⊤ (x − x_{ℓ+1}).    (5.28)

Algorithm 20 Minorant Pruning I
Input: ĥ_{ω,ℓ,ν+1}(x)
if the number of minorants in ĥ_{ω,ℓ,ν+1}(x) is greater than N then
    Remove the earliest minorants so that the number of minorants in ĥ_{ω,ℓ,ν+1}(x) equals N
end if
Output: ĥ_{ω,ℓ,ν+1}(x)

Algorithm 21 Minorant Pruning II
Input: ĥ_{ω,ℓ,ν+1}(x)
Remove the minorants from ĥ_{ℓ−1,ν_{ℓ−1}}(x) that are not active at x^{1/2,ℓ}_{ν_{ℓ−1}}. (This can be achieved by letting T_{ℓ,ν} denote the index set of minorants of ĥ_{ω,ℓ,ν+1}(x) (i.e., ĥ_{ω,ℓ,ν+1}(x) = max{α^ℓ_{ω,t} + (β^ℓ_{ω,t})^⊤ x : t ∈ T_{ℓ,ν}}), solving the Candidate Update in its alternative formulation,

    min_x  g^{k_ℓ}_{ω,ℓ}(x; x_ℓ) + η + (c/2)∥x − x_ℓ∥²
    s.t.   α^ℓ_{ω,t} + (β^ℓ_{ω,t})^⊤ x ≤ η,  t ∈ T_{ℓ,ν},
           x ∈ X,    (5.29)

recording the Lagrange multipliers of the minorants of ĥ_{ℓ−1,ν_{ℓ−1}+1}(x) after processing the Candidate Update, and then deleting the minorants whose Lagrange multipliers are 0.)
Output: ĥ_{ω,ℓ,ν+1}(x)

Algorithm 22 NSD-MM
(Initialization) Pick x̄₁ ∈ X, N₁ > 0, c > 0, β ∈ (0, 1), and L > 0 (number of iterations). Set ν = 0 and k₁ = ⌊N₁^β⌋. Generate i.i.d. (ω_i, ξ_i) ∼ μ_{ω̃,ξ̃} for all i ∈ {1, 2, ..., N₁}. Calculate π̄¹_i ∈ argmax{π^⊤(e(ξ_i) − C(ξ_i)x̄₁) : π^⊤ D ≤ d^⊤} for all i ∈ {1, ..., N₁} and u^{k₁}_{ω,1}(x̄₁) = (1/k₁) Σ_{i=1}^{N₁} I(ω_i ∈ S(k₁, ω; {ω_i}_{i=1}^{N₁})) C(ξ_i)^⊤ π̄¹_i. Set ĥ^{k₁}_{ω,1,1}(x) = h_{ω,1}(x̄₁) + u^{k₁}_{ω,1}(x̄₁)^⊤ (x − x̄₁).
for ℓ = 1, 2, ..., L do  (Outer Loop)
    do  (Inner Loop)
        Set ν ← ν + 1.
        (Candidate Update)
            x^{1/2,ℓ+1}_ν = argmin_{x∈X} g^{k_ℓ}_{ω,ℓ}(x; x̄_ℓ) + ĥ^{k_ℓ}_{ω,ℓ,ν}(x) + (c/2)∥x − x̄_ℓ∥²
        Compute π^{1/2,ℓ+1}_{ν,i} ∈ argmax{π^⊤(e(ξ_i) − C(ξ_i)x^{1/2,ℓ+1}_ν) : π^⊤ D ≤ d^⊤} for i = 1, 2, ..., N_ℓ
        Compute u_{ω,ℓ}(x^{1/2,ℓ+1}_ν) = (1/k_ℓ) Σ_{i=1}^{N_ℓ} I(ω_i ∈ S(k_ℓ, ω; {ω_i}_{i=1}^{N_ℓ})) C(ξ_i)^⊤ π^{1/2,ℓ+1}_{ν,i}
        Set ĥ^{k_ℓ}_{ω,ℓ,ν+1}(x) ← max{ ĥ^{k_ℓ}_{ω,ℓ,ν}(x), h_{ω,ℓ}(x^{1/2,ℓ+1}_ν) + u_{ω,ℓ}(x^{1/2,ℓ+1}_ν)^⊤ (x − x^{1/2,ℓ+1}_ν) }  (Piecewise Linear Approximation Update III)
    while h^{k_ℓ}_{ω,ℓ,ν}(x^{1/2,ℓ+1}_ν) − ĥ^{k_ℓ}_{ω,ℓ,ν}(x^{1/2,ℓ+1}_ν) > (c/4)∥x^{1/2,ℓ+1}_ν − x̄_ℓ∥²  (End Inner Loop)
    Set x_{ℓ+1} ← x^{1/2,ℓ+1}_ν, N_{ℓ+1} ← N_ℓ + 1
    Generate (ω_{N_{ℓ+1}}, ξ_{N_{ℓ+1}}) ∼ μ_{ω̃,ξ̃}, independent of the past samples
    Perform Minorant Pruning I/II on ĥ^{k_ℓ}_{ω,ℓ,ν+1}(x)
    Calculate k_{ℓ+1} = ⌊N_{ℓ+1}^β⌋, and C¹_{ω,ℓ+1}(x) and C²_{ω,ℓ+1}(x) using (5.27) and (5.28)
    if k_{ℓ+1} = k_ℓ then
        Set ĥ^{k_{ℓ+1}}_{ω,ℓ+1,1}(x) ← max{ ĥ^{k_ℓ}_{ω,ℓ,ν+1}(x) − M_h/k_ℓ, C¹_{ω,ℓ+1}(x), C²_{ω,ℓ+1}(x) }  (Piecewise Linear Approximation Update I)
    else
        Set ĥ^{k_{ℓ+1}}_{ω,ℓ+1,1}(x) ← max{ (k_ℓ/k_{ℓ+1}) ĥ^{k_ℓ}_{ω,ℓ,ν+1}(x), C¹_{ω,ℓ+1}(x), C²_{ω,ℓ+1}(x) }  (Piecewise Linear Approximation Update II)
    end if
    if ζ^{k_{ℓ+1}}_{ω,ℓ+1}(x_{ℓ+1}) − ζ^{k_{ℓ+1}}_{ω,ℓ+1}(x̄_ℓ) < r [ ζ^{k_ℓ}_{ω,ℓ}(x_{ℓ+1}) − ζ^{k_ℓ}_{ω,ℓ}(x̄_ℓ) ] then  (Incumbent Selection)
        Set x̄_{ℓ+1} ← x_{ℓ+1}
    else
        Set x̄_{ℓ+1} ← x̄_ℓ
    end if
    Set ν_ℓ ← ν and ν ← 0
end for  (End Outer Loop)

The "relative" descent (in terms of ζ^{k_ℓ}_{ω,ℓ}) is ensured at the end of each inner loop of the NSD-MM algorithm, as stated in the proposition below.

Proposition 5.14. Suppose Assumptions A5 and B1–B7 hold. {x_ℓ}_ℓ and {x̄_ℓ}_ℓ generated by Algorithm 22 satisfy the following relation:

    ζ^{k_ℓ}_{ω,ℓ}(x_{ℓ+1}) − ζ^{k_ℓ}_{ω,ℓ}(x̄_ℓ) ≤ −(c/4)∥x_{ℓ+1} − x̄_ℓ∥².    (5.30)

Proof. By the optimality of the Candidate Update, we have

    g^{k_ℓ}_{ω,ℓ}(x_{ℓ+1}; x̄_ℓ) + ĥ^{k_ℓ}_{ω,ℓ}(x_{ℓ+1}) + (c/2)∥x_{ℓ+1} − x̄_ℓ∥²
      ≤ g^{k_ℓ}_{ω,ℓ}(x̄_ℓ; x̄_ℓ) + ĥ^{k_ℓ}_{ω,ℓ}(x̄_ℓ) = f^{k_ℓ}_{ω,ℓ}(x̄_ℓ) + h^{k_ℓ}_{ω,ℓ}(x̄_ℓ) = ζ^{k_ℓ}_{ω,ℓ}(x̄_ℓ).
(5.31) The equality in (5.31) holds because of Property M1 of the majorization function and Piecewise Linear Approximation Updates I and II. By Incumbent Selection I, we have

h^{k_ℓ}_{ω,ℓ}(x^{ℓ+1}) ≤ ĥ^{k_ℓ}_{ω,ℓ}(x^{ℓ+1}) + (c/4)∥x^{ℓ+1} − x̄^ℓ∥². (5.32)

By Property M2 of the majorization function, we have

f^{k_ℓ}_{ω,ℓ}(x^{ℓ+1}) ≤ g^{k_ℓ}_{ω,ℓ}(x^{ℓ+1}; x̄^ℓ). (5.33)

Hence, inequality (5.30) follows from the combination of (5.31), (5.32), and (5.33).

After the inner loop, we generate two new minorants, at x^{ℓ+1} and x^ℓ respectively. We also introduce two minorant pruning strategies to maintain a finite set of minorants. In addition, we design a unique update rule in NSD-MM so that we can reuse the minorants in ĥ^{k_ℓ}_{ω,ℓ,ν+1}(x) for the construction of the lower-bounding approximation of h^{k_{ℓ+1}}_{ω,ℓ+1}(x), under a certain rescaling, based on the following observations.

Case 1. If k_{ℓ+1} = k_ℓ, then ĥ^{k_ℓ}_{ω,ℓ,ν+1}(x) − M_h/k_ℓ ≤ h^{k_{ℓ+1}}_{ω,ℓ+1}(x) for all x ∈ X.
Case 2. If k_{ℓ+1} > k_ℓ, then (k_ℓ/k_{ℓ+1}) ĥ^{k_ℓ}_{ω,ℓ,ν+1}(x) ≤ h^{k_{ℓ+1}}_{ω,ℓ+1}(x) for all x ∈ X.

The proposition below shows that the piecewise linear approximation functions generated by NSD-MM provide lower-bounding approximations of h^{k_ℓ}_{ω,ℓ} for all ℓ.

Proposition 5.15. Suppose that Assumptions A5 and B1 - B7 hold. Tie-breaking by indices is used (i.e., if ∥ω_i − ω∥ = ∥ω_j − ω∥ and i < j, we order ω_i before ω_j in the ascending order of the distance to ω). With probability one, ĥ^{k_ℓ}_{ω,ℓ,ν}(x) ≤ h^{k_ℓ}_{ω,ℓ}(x) for all x ∈ X, for all ℓ ≥ 1 and ν ≥ 1.

Proof. Let ω^ℓ_{[k_ℓ]} be the k_ℓ-th nearest neighbor of ω in the set {ω_j}_{j=1}^{N_ℓ} and let ξ^ℓ_{[k_ℓ]} be the associated response of ω^ℓ_{[k_ℓ]}. We let ĥ^{k_1}_{ω,1,ν} denote the updated piecewise linear approximation function after performing the ν-th Piecewise Linear Approximation Update III in the ℓ-th outer loop. By the initialization of Algorithm 22, ĥ^{k_1}_{ω,1,1}(x) = C^1_{ω,1}(x) ≤ h^{k_1}_{ω,1}(x), ∀ x ∈ X. Case I: ℓ = 1 and the inner loop is terminated.
By the convexity of h k 2 ω,2 (·) and (5.27), h k 2 ω,2 (x)≥C 1 ω,2 (x)andh k 2 ω,2 (x)≥C 2 ω,2 (x)forallx∈X. Ifk 2 =k 1 and∥ω 1 [k 1 ] − ω∥≤∥ ω N 2 − ω∥, then I(ω N 2 ∈ S(k 2 ,ω;{ω j } N 2 j=1 )) = 0 and I(ω 1 [i] ∈ S(k 2 ,ω;{ω j } N 2 j=1 )) = 1 for i = 1,2,...,k 1 . Hence, by the definition of h k ℓ ℓ , h k 2 ω,2 (x) = h k 2 ω,1 (x), which is obvious that h k 2 ω,2 (x)≥ h k 2 ω,1 (x)− M h k 1 ∀ x ∈ X. On other hand, if k 2 = k 1 and ∥ω 1 [k 1 ] − ω∥ > ∥ω N 2 − ω∥, then I(ω N 2 ∈ S(k 2 ,ω;{ω j } N 2 j=1 )) = 1, I(ω 1 [k 1 ] ∈S(k 2 ,ω;{ω j } N 2 j=1 )) = 0, and I(ω 1 [i] ∈S(k 2 ,ω;{ω j } N 2 j=1 ))) = 1 for i=1,2,...,k 1 − 1. By Assumptions B4 and B7, h k 2 ω,2 (x)− h k 1 ω,1 (x)= 1 k 2 H(x,ξ N 2 )− 1 k 1 H(x,ξ 1 [k 1 ] )≥ 0− M h k 1 , ∀ x∈X. (5.34) 184 Hence, it follows from (5.34) that h k 2 ω,2 (x) ≥ ˆ h k 1 ω,1 (x)− M h k 1 , ∀ x ∈ X, which further implies that h k 2 ω,2 (x)≥ max{ ˆ h k 2 ω,1,ν +1 (x)− M h k 1 ,C 1 ω,2 (x),C 2 ω,2 (x)}= ˆ h k 2 ω,2,1 (x)∀ x∈X. Finally, let us consider the case when k 2 >k 1 . Let ¯ S 2 ={1≤ i≤ N 2 :ω i / ∈S(k 1 ,ω;{ω j } N 1 j=1 ),ω i ∈S(k 2 ,ω;{ω j } N 2 j=1 )} denote a set of indices of covariates which are not in the first kNN set, S(k 1 ,ω;{ω j } N 1 j=1 ) but in the second kNN set,S(k 2 ,ω;{ω j } N 2 j=1 ). N 2 =N 1 +1 implies I(ω 1 [i] ∈S(k 2 ,ω;{ω j } N 2 j=1 )))=1, i=1,2,...,k 1 , which further implies that h k 2 ω,2 (x)= 1 k 2 X i∈ ¯S 2 H(x,ξ i )+ 1 k 2 k 1 X i=1 H(x,ξ 1 [i] ) ≥ 0+ k 1 k 2 1 k 1 k 1 X i=1 H(x,ξ 1 [i] )≥ k 1 k 2 ˆ h k 2 ω,1,ν +1 (x). (5.35) Hence, this shows that h k 2 ω,2 (x)≥ max{ k 1 k 2 ˆ h k 2 ω,1,ν +1 (x),C 1 ω,2 (x),C 2 ω,2 (x)}= ˆ h k 2 ω,2,1 (x),∀ x∈X. Case II: ℓ=1 and the inner loop is not terminated. By the convexity of h k 1 ω,1 (x), we have h k 1 ω,1 (x 1 2 ,2 1 )+u ω,1 (x 1 2 ,2 1 ) ⊤ (x− x 1 2 ,2 1 )≤ h k 1 ω,1 (x), ∀ x∈X. Hence, it is obvious that ˆ h k 1 ω,1,2 (x)=max{ ˆ h k 1 ω,1,1 (x),h ω,1 (x 1 2 ,2 1 )+u ω,1 (x 1 2 ,2 1 ) ⊤ (x− x 1 2 ,2 1 )}≤ h k 1 ω,1 (x), ∀ x∈X. 
Therefore, it is obvious that ĥ^{k_1}_{ω,1,ν}(x) ≤ h^{k_1}_{ω,1}(x) for all ν ≥ 1 until the inner loop is terminated.

Case III: ℓ > 1 and the inner loop is terminated. By induction, suppose that ĥ^{k_ℓ}_{ω,ℓ,ν+1}(x) ≤ h^{k_ℓ}_{ω,ℓ}(x) after the inner loop. The proof in this case is similar to Case I.

Case IV: ℓ > 1 and the inner loop is not terminated. The proof is similar to Case II.

At the end of the outer loop, we introduce an incumbent selection similar to the one in Algorithm 19 to ensure that the algorithm produces a subsequence of incumbents such that any accumulation point of the subsequence is a d-stationary point of (5.1).

Proposition 5.16. Suppose Assumptions A5 and B1 - B7 hold. Then

limsup_{ℓ→∞} ζ^{k_ℓ}_{ω,ℓ}(x^{ℓ+1}) − ζ^{k_ℓ}_{ω,ℓ}(x̄^ℓ) = 0 w.p.1. (5.36)

Proof. The proof strategy is the same as the proof of Proposition 5.10.

Proposition 5.17. Suppose Assumptions A5 and B1 - B7 hold. Then with probability one, NSD-MM (Algorithm 22) produces a subsequence, {x̄^ℓ}_{ℓ∈L}, such that any accumulation point of {x^ℓ}_{ℓ∈L} (i.e., L′ ⊆ L, lim_{ℓ(∈L′)→∞} x^ℓ = x^∞) satisfies the following relations:

lim_{ℓ(∈L′)→∞} x^{ℓ+1} = lim_{ℓ(∈L′)→∞} x^ℓ = x^∞ ∈ X, (5.37)

and

limsup_{ℓ(∈L′)→∞} (ĥ^{k_ℓ}_{ω,ℓ,ν_ℓ})′(x^{ℓ+1}; x − x^{ℓ+1}) ≤ h′_ω(x^∞; x − x^∞), ∀ x ∈ X. (5.38)

Proof. The proof strategy is similar to the proof of Lemma 4 in Higle and Sen [47]. The only differences are that (1) we apply Proposition 5.14 and Proposition 5.16 to obtain a subsequence, {x̄^ℓ}_{ℓ∈L}, such that lim_{ℓ(∈L′)→∞} ∥x^{ℓ+1} − x̄^ℓ∥ = 0 w.p.1, and (2) we utilize the pointwise convergence of the kNN estimator (Theorem 4.7) and the equicontinuity of {h^{k_ℓ}_ℓ}_ℓ to derive the uniform convergence of {h^{k_ℓ}_ℓ}_ℓ on X.

Next, we show that the NSD-MM algorithm solves the PSP problem in (5.1) with probability one.

Theorem 5.18. Suppose Assumptions A5 and B1 - B7 hold. Suppose either of the following two conditions holds:

B8 F(·,ξ) is differentiable and concave on X for almost every ξ.
B9 F(·,ξ) is L-smooth on X for almost every ξ.
Then, with probability one, NSD-MM (Algorithm 22) produces a subsequence, {x̄^ℓ}_{ℓ∈L}, such that any accumulation point of {x̄^ℓ}_{ℓ∈L} is a d-stationary point of (5.24).

Proof. Case 1: Assumption B8 holds. By Proposition 5.14 and Proposition 5.16, there exists a subsequence, {x̄^ℓ}_{ℓ∈L}, such that lim_{ℓ(∈L′)→∞} ∥x^{ℓ+1} − x̄^ℓ∥ = 0 w.p.1. Since {x̄^ℓ}_{ℓ∈L} ⊂ X and X is compact, an accumulation point of {x̄^ℓ}_{ℓ∈L} exists, say L′ ⊆ L with lim_{ℓ(∈L′)→∞} x̄^ℓ = lim_{ℓ(∈L′)→∞} x^{ℓ+1} = x^∞. By the Candidate Update of Algorithm 22, we have the following optimality condition: for any x ∈ X,

0 ≤ (1/k_ℓ) Σ_{i=1}^{N_ℓ} I(ω_i ∈ S(k_ℓ, ω; {ω_i}_{i=1}^{N_ℓ})) [∇_x F(x̄^ℓ, ξ_i)^⊤ (x − x^{ℓ+1})] + (ĥ^{k_ℓ}_{ω,ℓ,ν_ℓ})′(x^{ℓ+1}; x − x^{ℓ+1}) + c (x^{ℓ+1} − x̄^ℓ)^⊤ (x − x^{ℓ+1}). (5.39)

Letting ℓ(∈ L′) → ∞, by the pointwise convergence of the kNN estimator (Theorem 5.7) and Theorem 5.11, we have

lim_{ℓ(∈L′)→∞} (1/k_ℓ) Σ_{i=1}^{N_ℓ} I(ω_i ∈ S(k_ℓ, ω; {ω_i}_{i=1}^{N_ℓ})) ∇F(x̄^ℓ, ξ_i)^⊤ (x − x^∞)
= E_{ξ̃∼μ_{ξ̃|ω}}[∇F(x^∞, ξ̃) | ω̃ = ω]^⊤ (x − x^∞)
= ∇E_{ξ̃∼μ_{ξ̃|ω}}[F(x^∞, ξ̃) | ω̃ = ω]^⊤ (x − x^∞)
= ∇f_ω(x^∞)^⊤ (x − x^∞).
(5.40) Then the rest of the proof follows the same procedure as Case 1.

5.4 Computational Results

In this subsection, we apply the NSD-MM algorithm to solve a class of two-stage PSP problems with linear or concave first-stage cost. The algorithms are implemented in the C++ environment, and CPLEX v12.8.0, as a base solver, is used to solve convex quadratic programming problems and obtain the second-stage dual multipliers. All the programs are run on a MacBook Pro 2017 with a 3.1 GHz Dual-Core Intel Core i5.

5.4.1 Two-Stage PSP

We use a shipment planning problem, PBK19(sto klogbx), and a freight scheduling problem, 4NODE(det klogbx), to illustrate the computational performance of NSD-MM. The mathematical formulation of the first instance, which we refer to as PBK19(sto klogbx), is: min_{x∈X} E_{ξ̃∼μ_{ξ̃|ω}}[Σ_{i=1}^n k_i(ξ̃) log(b_i(ξ̃) x_i + 1) + H(x, ξ̃) | ω̃ = ω]. The parameters in the constraints of PBK19(sto klogbx) are due to the shipment planning problem in [12]. Through the end of this section, we let v^{(i)} denote the i-th component of the vector v. As for the data generation process, ω^{(i)} follows Uniform(0.5, 1.5) for i = 1, 2, 3, and the i-th component of the demand is

e_i(ξ) = max{0, A_i^T (ω + δ_i/4) + (B_i^T ω) ϵ_i}, i = 1, 2, ..., 12,

where δ_i and ϵ_i are white noises, and A_i, B_i are constant vectors. k(ξ) follows k^{(i)}(ξ) = C_i^⊤ ω + τ_1 + k̄, where τ_1 is a truncated standard normal random variable with (lower bound, upper bound) = (-0.5, 0.5) and C_i is a constant vector, for i = 1, 2, 3, 4. b(ξ) follows b(ξ) = ω_1 + τ_2 + b̄, i = 1, 2, 3, 4, where τ_2 is a truncated standard normal random variable with (lb, ub) = (-0.1, 0.1) and b̄ is a constant.

Figure 5.1: Computational Results of the NSD-MM

We input the dataset {(ω_i, e(ξ_i), k(ξ_i), b(ξ_i))}_{i=1}^N and the observation of the predictor ω = (1,1,1) to the NSD-MM solver.
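Since NSD-MM weights scenarios by how close their covariates ω_i are to the observed predictor ω, the underlying kNN estimate can be sketched as follows. This is a minimal illustration in our own notation (the function name, the toy data, and the stand-in scenario costs are ours, not part of the dissertation's C++ solver); ties are broken by the smaller index, as in Proposition 5.15.

```python
import numpy as np

def knn_estimate(omega, omegas, costs, k):
    """kNN conditional-mean estimate: average the costs of the k covariates
    nearest to omega; ties are broken by smaller index (stable sort)."""
    d = np.linalg.norm(omegas - omega, axis=1)
    nn = np.argsort(d, kind="stable")[:k]   # stable sort keeps index order on ties
    return costs[nn].mean()

rng = np.random.default_rng(0)
omegas = rng.uniform(0.5, 1.5, size=(50, 3))          # predictors, as in PBK19
costs = omegas.sum(axis=1) + rng.normal(0, 0.1, 50)   # stand-in scenario costs
est = knn_estimate(np.array([1.0, 1.0, 1.0]), omegas, costs, k=int(50 ** 0.6))
```

With k = ⌊N^β⌋ for β ∈ (0, 1) (here β = 0.6), the estimate concentrates around the conditional mean at ω = (1, 1, 1) as N grows.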
We replicate the data generation and solving processes 10 times and report the average and distribution of the solution qualities, evaluated on the out-of-sample validation set, in the left subgraph of Figure 5.1. It shows that the estimated solution of NSD-MM manages to converge in this case. The mathematical formulation of the second instance, which we refer to as P4NODE(det klogbx), is min_{x∈X} Σ_{i=1}^n k_i log(b_i x_i + 1) + E_{ξ̃∼μ_{ξ̃|ω}}[H(x, ξ̃) | ω̃ = ω]. ω_i follows Uniform(0, 1), and the vector e on the right-hand side of the second-stage problem follows the data generation process

e_i(ξ) = Z_i^⊤ ω + ē_i + ϵ_i, i = 1, 2, ..., 12,

where ϵ_i is a truncated standard normal random variable with (lower bound, upper bound) = (-0.5, 0.5), Z_i is a constant vector, and ē_i is a constant scalar. The constraint parameters are due to the instance 4NODE, which can be seen in [98]. We then feed the dataset {(ω_i, e(ξ_i))}_{i=1}^N and the observation of the predictor ω = (0.4, 0.6) to the NSD-MM solver. Again, we replicate the data generation and solving processes ten times and plot the average solution qualities of the ten replications in the right subgraph of Figure 5.1. The large shaded area in the case of P4NODE(det klogbx) is possibly due to the higher problem complexity and the weak convergence result of NSD-MM as shown in Theorem 5.18.

5.5 Remarks

In this chapter, we have designed a fusion of the kNN method, the SD algorithm, and the MM algorithm to solve a class of non-smooth nonconvex predictive stochastic programming problems. The proposed NSD-MM algorithm has a unique property: it successively approximates the second-stage cost-to-go function (the convex component) from below and approximates the concave or L-smooth first-stage component from above. This technique exploits the historic approximations of the sample-based convex component to estimate the current sample-based convex component.
As a result, NSD-MM algorithm can store the memory of the functional estimates in the previous iterations and thus use the memory to attain the next iterate. 190 Chapter 6 On the Combination of Stochastic Decomposition and Majorization Minimization for Nonconvex Stochastic Programming In this chapter, we show that NSD-MM algorithm under several minor modification can be reducedtoadecomposition-basedalgorithmwithpiecewiselinearapproximations(whichwe refer to as SD-MM algorithm) for solving the following nonconvex SP problem: min x∈X E ˜ ξ h F(x, ˜ ξ )+H(x, ˜ ξ ) i (6.1) where F : S×Y 7→ R is a L-smooth/differentiable concave function on X, where S is a superset of X, and H(x,ξ ) is the minimum cost of the second-stage linear programming problem as defined as minimum cost of the second stage problem: H(x,ξ )=min y d ⊤ y s.t. Dy =C(ξ )x− e(ξ ), y≥ 0. (6.2) The majorization-minimization (MM) method can be described as follows: a (usually convex) upper-bounding function of the (often nonconvex) original objective is created near 191 the current iterate, and then the next iterate is obtained by minimizing the upper-bounding functionoverthefeasibleregion. Suchmethodshavebeenwidelyusedindifference-of-convex (dc) programs (Le and Pham [62], Pang et al. [85], Tao and An [107]). Mairal [73] proposes aMinimizationbyIncrementalSurrogateOptimization(MISO)methodtosolvealargesum ofcontinuousfunctions(possiblywithcompositionandusuallyrelatedtothelossfunctionor log-likelihoodfunctioninmachinelearning)whereheusesasequenceofpreviouslygenerated surrogate functions to create an upper bound of the objective function. Razaviyayn et al. [91]designaStochasticSuccessiveUpper-boundMinimization(SSUM)methodtosolvemore general SP models which go beyond finite-sum problems. Similar to the MISO method from Mairal [73], SSUM successively uses a sequence of previously generated surrogate functions to construct an upper bound of sample approximation functions. 
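The MM recipe described above — build a convex upper-bounding (majorizing) function near the current iterate, then minimize it — can be made concrete on a one-dimensional toy problem. The instance below is our own, chosen only so the surrogate minimizer has a closed form; it is not one of the problems studied in this dissertation.

```python
import math

# Toy problem (ours): minimize log(x+1) + (x-2)^2 over x >= 0.
# log(x+1) is concave, so its tangent at x0 majorizes it; the surrogate
#   s(x) = log(x0+1) + (x - x0)/(x0+1) + (x-2)^2
# is convex, upper-bounds the objective, and matches it at x0.
def mm_step(x0):
    # minimize s(x): s'(x) = 1/(x0+1) + 2(x-2) = 0, projected onto x >= 0
    return max(0.0, 2.0 - 0.5 / (x0 + 1.0))

def objective(x):
    return math.log(x + 1.0) + (x - 2.0) ** 2

x = 0.0
vals = [objective(x)]
for _ in range(50):
    x = mm_step(x)
    vals.append(objective(x))
```

Starting from x = 0, the iterates converge to the stationary point x* = (1 + √7)/2 of the objective, and the recorded objective values are non-increasing — the defining descent property of MM.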
It is worth noting that, in the SSUM algorithm, sample approximation functions are iteratively updated by generating i.i.d. samples, while the upper-bound surrogate functions are updated accordingly. Mairal [72] proposes Stochastic MM (SMM) to solve single-stage nonconvex SP, and Liu et al. [66] extend the SMM algorithm to compound SP, in which the isotone outer objective function is nested with the expectations of continuous random functions. More recently, An et al. [2] propose the Stochastic dc algorithm (SDCA) to solve SP problems where the objective function is dc. All in all, those methods can be regarded as a family of sampled surrogation algorithms. For readers interested in sampled surrogation algorithms, we recommend Chapter ten of the monograph by Cui and Pang [21]. As for two-stage SP with a nonconvex second-stage recourse, Liu et al. [67] propose the Regularization-Convexification-Sampling (RCS) approach to solve two-stage SP with bi-parameterized quadratic recourse, and Li and Cui [64] propose a decomposition algorithm which uses an upper approximating function of the partial Moreau envelope of the recourse function for each scenario to construct the surrogate function of the objective function. Note that the RCS approach generates a batch of samples that are independent of the past and only uses the currently generated samples to construct an upper bound of the sample approximation objective function. The novelty of the RCS is that it introduces a quadratic regularizer to make the second-stage objective function strongly convex, so that obtaining the dc decomposition of the recourse function via its dual becomes viable. The Stochastic Decomposition (SD; Higle and Sen [46, 47], and Sen and Liu [98]) algorithm is a successive approximation-decomposition algorithm for solving two-stage stochastic linear programming models.
SD successively constructs a local outer (lower-bounding) approximation of the sampled approximation of the second-stage cost-to-go function by using previously calculated stochastic subgradients (or stochastic quasi-gradients) at the candidate/incumbent solutions. It is worth noting that SD assumes that the objective function consists of a deterministic linear or convex quadratic first-stage component (see Liu and Sen [69] for the quadratic case) and an expectation of the stochastic second-stage cost-to-go function. The value of the second-stage cost-to-go function at a given feasible first-stage decision is the expectation of the minimum value of a linear or quadratic programming problem. One of the main factors motivating this chapter is that, while the SD algorithm and the MM algorithm share some similarities, they also provide complementary strengths within their focal points: SD uses a convex lower-bound function (i.e., a piecewise linear function) to approximate the sample average approximation of the objective function from below, while the MM algorithm uses a convex upper-bound function to approximate the nonconvex objective function from above. Inspired by these complementary strengths, we aim to design a fusion of SD and MM (SD-MM) that approximates the convex component of the objective function from below and approximates the rest of the nonconvex objective from above in each iteration. It is worth noting that the constructed surrogate function is no longer a global upper bound of the sample average approximation of the objective function.

6.1 Notations

We let ∥·∥ denote the Euclidean norm of a vector and the spectral norm of a matrix. Let ξ̃ : Ω → Y ⊂ R^{m_2} be a random vector built upon the probability space (Ω, Σ_Ω, P). We let μ_ξ̃ denote the distribution of the random vector ξ̃ and let ξ denote a realization of ξ̃. When we say "for almost every ξ ∈ Y", it means for μ_ξ̃-almost every ξ ∈ Y. Throughout, we assume that X is a compact convex set.
We define the L-smooth function, convex surrogate function, and d(irectional)-stationary point in Definitions 5.1, 5.2, and 5.4, respectively.

6.2 Algorithm Design

We shall make the following assumptions:

B1 X is a convex compact (possibly polyhedral) set.
B2 For almost every ξ ∈ Y, F(·,ξ) is differentiable on S and there exists a finite κ_f > 0 such that ∥∇F(x,ξ)∥ ≤ κ_f for all x ∈ X.
B3 There exists f̄ ∈ (0,∞) such that |F(x,ξ)| ≤ f̄ on X for almost every ξ ∈ Y.
B4 For almost every ξ ∈ Y, the set {y : Dy = e(ξ) − C(ξ)x, y ≥ 0} ≠ ∅ for all x ∈ X. Furthermore, H(x,ξ) ≥ 0 for all x ∈ X and almost every ξ ∈ Y.
B5 The subdifferential of H with respect to x is nonempty on X for almost every ξ ∈ Y (i.e., ∂_x H(x,ξ) ≠ ∅).
B6 There exist κ_e, κ_C ∈ (0,∞) such that ∥e(ξ)∥ ≤ κ_e and ∥C(ξ)∥ ≤ κ_C for almost all ξ ∈ Y. The dual feasible region of the second-stage problem, {π : π^⊤ D ≤ d^⊤}, is bounded.

Compared to the NSD-MM algorithm for solving PSP problems, the upper bound of the second-stage linear recourse function is not necessary in classic SP. For the remainder of this section, we use the notation in Table 6.1. The SD-MM algorithm consists of inner loops for successively refining the lower-bound approximation of the sample average of E_ξ̃[H(x, ξ̃)] and outer loops for finding a sequence of incumbent solutions, {x^ℓ}. We provide a detailed explanation of the SD-MM algorithm below.

Table 6.1: Notations in the SD-MM Algorithm
ℓ: iteration number of the outer loop; ν: iteration number of the inner loop
ν_ℓ: number of inner loops in outer iteration ℓ
{x^ℓ}: sequence of incumbents generated by the outer loop
{x^{1/2,ℓ}_ν}: sequence of candidates generated by the inner loops
f_ℓ(x) = (1/ℓ) Σ_{i=1}^ℓ F(x,ξ_i), h_ℓ(x) = (1/ℓ) Σ_{i=1}^ℓ H(x,ξ_i), ζ_ℓ(x) = f_ℓ(x) + h_ℓ(x)
f(x) = E_ξ̃[F(x,ξ̃)], h(x) = E_ξ̃[H(x,ξ̃)]
ζ(x) = f(x) + h(x), ∥ζ − ζ_ℓ∥_∞ = sup_{x∈X} |ζ(x) − ζ_ℓ(x)|
ĥ_{ℓ,ν}(x): piecewise linear approximation of h_ℓ(x) in the ν-th inner iteration of the ℓ-th outer loop
g_ℓ(x; x′) = (1/ℓ) Σ_{i=1}^ℓ G(x,ξ_i; x′): surrogate function of f_ℓ(x) near x′
f_0 ≡ 0, h_0 ≡ 0, ζ_0 ≡ 0

At the beginning of the ℓ-th outer loop, SD-MM constructs a linear lower bound of h_ℓ(x) at the current incumbent solution, x^ℓ: C_ℓ(x; x^ℓ) ≜ h_ℓ(x^ℓ) + u_ℓ(x^ℓ)^⊤ (x − x^ℓ), where u_ℓ(x^ℓ) is a subgradient of h_ℓ at x^ℓ. u_ℓ(x^ℓ) is computed by solving the dual of the linear program in (5.2) (i.e., solve max{π^⊤ (e(ξ_i) − C(ξ_i)x) : π^⊤ D ≤ d^⊤} for i = 1, 2, ..., ℓ). To make the description concise and intuitive, we shall refer to a linear lower bound of h_ℓ(x) as a minorant of h_ℓ(x). The SD-MM algorithm prunes the piecewise linear approximation of h_{ℓ−1}(x) from iteration ℓ − 1 to avoid storing redundant and inactive minorants (the process will be introduced after presenting the main algorithm). We let ĥ_{ℓ−1,ν_{ℓ−1}+1}(x) denote the piecewise linear approximation after the minorant pruning. Next, we construct the new piecewise linear approximation of h_ℓ(x) as follows: ĥ_{ℓ,1}(x) ← max{((ℓ−1)/ℓ) ĥ_{ℓ−1,ν_{ℓ−1}+1}(x), h_ℓ(x^ℓ) + u_ℓ(x^ℓ)^⊤ (x − x^ℓ)}, where the rescaling (ℓ−1)/ℓ is used to ensure that ((ℓ−1)/ℓ) ĥ_{ℓ−1,ν_{ℓ−1}+1}(x) is a lower bound of h_ℓ(x).
In the ν-th inner loop of the ℓ-th outer loop, we solve the following regularized problem to get a candidate solution:

x^{1/2,ℓ+1}_ν = argmin_{x∈X} g_ℓ(x; x^ℓ) + ĥ_{ℓ,ν}(x) + (c/2)∥x − x^ℓ∥².

The optimization problem in the candidate update can be reformulated as

min_x g_ℓ(x; x^ℓ) + η + (c/2)∥x − x^ℓ∥²
s.t. α^ℓ_t + (β^ℓ_t)^⊤ x ≤ η, t ∈ T_{ℓ,ν}, x ∈ X (6.3)

where ĥ_{ℓ,ν}(x) = max{α^ℓ_t + (β^ℓ_t)^⊤ x : t ∈ T_{ℓ,ν}}. Then we check whether

h_ℓ(x^{1/2,ℓ+1}_ν) − ĥ_{ℓ,ν}(x^{1/2,ℓ+1}_ν) ≤ (c/4)∥x^{1/2,ℓ+1}_ν − x^ℓ∥². (6.4)

If (6.4) is satisfied, we terminate the inner loop and update the incumbent solution. If (6.4) is not satisfied, we construct a minorant at x^{1/2,ℓ+1}_ν and add the new minorant to the piecewise linear approximation of h_ℓ(x) (i.e., ĥ_{ℓ,ν+1}(x)). The stopping criterion of the inner loop ensures that the difference between the lower-bounding approximation function and the sample average of the second-stage recourse function at the next incumbent solution (also referred to as a trial point, or an outcome of a serious step) is bounded by a certain portion of the proximal term. Therefore, it ensures a relative descent (based on the value of the sample approximation of the objective function). SD-MM is summarized in Algorithm 23.

Figure 6.1: Illustration of the SD-MM methodology. The first figure (starting from the left-hand side) illustrates that SD-MM approximates the concave function f(x) from above. The second figure illustrates that SD-MM approximates the convex function h(x) from below. The last figure shows that the sum of the two function approximations becomes a local approximation of f(x) + h(x).

Algorithm 23 SD-MM
(Initialization) Pick x^1 ∈ X, c > 0, and L > 0 (number of iterations). Let ν = 0 and ν_0 = 0. Set ĥ_{0,1}(x) ≡ 1.
for ℓ = 1, 2, ..., L do Outer Loop
  Generate ξ_ℓ ∼ μ_ξ̃, which is independent of the past samples.
  Compute π^ℓ_i ∈ argmax{π^⊤ (e(ξ_i) − C(ξ_i)x^ℓ) : π^⊤ D ≤ d^⊤} for i = 1, 2, ..., ℓ.
Compute u_ℓ(x^ℓ) = (1/ℓ) Σ_{i=1}^ℓ C(ξ_i)^⊤ π^ℓ_i.
  if ℓ > 1 then
    Remove inactive minorants of ĥ_{ℓ−1,ν_{ℓ−1}+1}(x) at x^ℓ by solving (6.3).
  end if
  Set ĥ_{ℓ,1}(x) ← max{((ℓ−1)/ℓ) ĥ_{ℓ−1,ν_{ℓ−1}+1}(x), h_ℓ(x^ℓ) + u_ℓ(x^ℓ)^⊤ (x − x^ℓ)}  [Piecewise Linear Approximation Update I].
  Set ν ← 0.
  do Inner Loop
    Set ν ← ν + 1.
    Candidate Update: x^{1/2,ℓ+1}_ν = argmin_{x∈X} g_ℓ(x; x^ℓ) + ĥ_{ℓ,ν}(x) + (c/2)∥x − x^ℓ∥².
    Compute π^{1/2,ℓ+1}_{ν,i} ∈ argmax{π^⊤ (e(ξ_i) − C(ξ_i) x^{1/2,ℓ+1}_ν) : π^⊤ D ≤ d^⊤} for i = 1, 2, ..., ℓ.
    Compute u_ℓ(x^{1/2,ℓ+1}_ν) = (1/ℓ) Σ_{i=1}^ℓ C(ξ_i)^⊤ π^{1/2,ℓ+1}_{ν,i}.
    Set ĥ_{ℓ,ν+1}(x) ← max{ĥ_{ℓ,ν}(x), h_ℓ(x^{1/2,ℓ+1}_ν) + u_ℓ(x^{1/2,ℓ+1}_ν)^⊤ (x − x^{1/2,ℓ+1}_ν)}  [Piecewise Linear Approximation Update II].
  while h_ℓ(x^{1/2,ℓ+1}_ν) − ĥ_{ℓ,ν}(x^{1/2,ℓ+1}_ν) > (c/4)∥x^{1/2,ℓ+1}_ν − x^ℓ∥²  End Inner Loop
  Set x^{ℓ+1} ← x^{1/2,ℓ+1}_ν and ν_ℓ ← ν.
end for End Outer Loop

We summarize the lower-bounding property of the piecewise linear approximation of h_ℓ(x) and the finite convergence of the inner loop as follows.

Proposition 6.1. Suppose that Assumptions B1 - B6 hold. With probability one, ĥ_{ℓ,ν}(x) ≤ h_ℓ(x) for all x ∈ X and for all ℓ ≥ 1 and ν ≥ 1.

Proof. By the lower bound on H(x,ξ) in Assumption B4 and the direct use of subgradients to construct the minorants, the result follows.

Proposition 6.2. Suppose that Assumptions B1 - B6 hold. With probability one, the inner loop described in Algorithm 23 terminates in a finite number of iterations.

Proof. The argument is the same as the one in Proposition 5.13.

It is worth noting that g_ℓ(x; x^ℓ) + ĥ_{ℓ,ν}(x) constructed by SD-MM is neither an upper bound nor a lower bound of the function ζ_ℓ(x), which is vastly different from most of the MM literature. With the design of the inner loop, the descent in terms of the function values of ζ_ℓ at x^ℓ and x^{ℓ+1} is ensured, which is formally stated in the following proposition.

Proposition 6.3. Suppose Assumptions B1 - B6 hold. Let {x^ℓ}_ℓ be the sequence generated by Algorithm 23.
Then the following holds:

f_ℓ(x^{ℓ+1}) + h_ℓ(x^{ℓ+1}) + (c/4)∥x^{ℓ+1} − x^ℓ∥² ≤ f_ℓ(x^ℓ) + h_ℓ(x^ℓ). (6.5)

Proof. The proof is similar to Proposition 5.14.

6.3 Convergence Analysis

In this section, we aim to show the asymptotic convergence of the SD-MM algorithm. First, we show that {∥x^{ℓ+1} − x^ℓ∥}_ℓ converges to 0 with probability one. Second, we show that any accumulation point of {x^ℓ}_ℓ generated by Algorithm 23 is a d-stationary point of (6.1). The proposition below establishes the key convergence property of the SD-MM algorithm.

Proposition 6.4. Suppose Assumptions B1 - B6 hold. Then, with probability one, {f_{ℓ−1}(x^ℓ) + h_{ℓ−1}(x^ℓ)}_ℓ converges and Σ_{ℓ=1}^∞ ∥x^{ℓ+1} − x^ℓ∥² < ∞.

Proof. Note that x^1 is given in the initialization of the algorithm, f_0(x^1) + h_0(x^1) = 0, and ∥x^2 − x^1∥² < ∞ w.p.1, by the definition of f_0 and h_0 and the boundedness of X. So it is equivalent to show that {f_{ℓ−1}(x^ℓ) + h_{ℓ−1}(x^ℓ)}_{ℓ≥2} converges and Σ_{ℓ=2}^∞ ∥x^{ℓ+1} − x^ℓ∥² < ∞ w.p.1. Through the end of this proof, we consider ℓ ≥ 2.

f_ℓ(x^{ℓ+1}) + h_ℓ(x^{ℓ+1}) − [f_{ℓ−1}(x^ℓ) + h_{ℓ−1}(x^ℓ)] (6.6a)
= f_ℓ(x^{ℓ+1}) + h_ℓ(x^{ℓ+1}) − [f_ℓ(x^ℓ) + h_ℓ(x^ℓ)] + [f_ℓ(x^ℓ) + h_ℓ(x^ℓ)] − [f_{ℓ−1}(x^ℓ) + h_{ℓ−1}(x^ℓ)]
≤ −(c/4)∥x^{ℓ+1} − x^ℓ∥² + [f_ℓ(x^ℓ) + h_ℓ(x^ℓ)] − [f_{ℓ−1}(x^ℓ) + h_{ℓ−1}(x^ℓ)] (6.6b)
= −(c/4)∥x^{ℓ+1} − x^ℓ∥² − (1/(ℓ(ℓ−1))) Σ_{i=1}^{ℓ−1} [F(x^ℓ,ξ_i) + H(x^ℓ,ξ_i)] + (1/ℓ)[F(x^ℓ,ξ_ℓ) + H(x^ℓ,ξ_ℓ)]
= −(c/4)∥x^{ℓ+1} − x^ℓ∥² − (1/ℓ) ζ_{ℓ−1}(x^ℓ) + (1/ℓ)[F(x^ℓ,ξ_ℓ) + H(x^ℓ,ξ_ℓ)]. (6.6c)

The inequality in (6.6b) holds by Proposition 6.3. Let F_ℓ = σ(x^1, x^2, ..., x^ℓ, ξ_1, ξ_2, ..., ξ_{ℓ−1}) denote the natural history (filtration) before outer iteration ℓ.
By taking the conditional expectation of (6.6a) and (6.6c) with respect to F_ℓ, we have

E[f_ℓ(x^{ℓ+1}) + h_ℓ(x^{ℓ+1}) | F_ℓ]
≤ f_{ℓ−1}(x^ℓ) + h_{ℓ−1}(x^ℓ) − (c/4) E[∥x^{ℓ+1} − x^ℓ∥² | F_ℓ] − (1/ℓ) ζ_{ℓ−1}(x^ℓ) + (1/ℓ) ζ(x^ℓ)
≤ f_{ℓ−1}(x^ℓ) + h_{ℓ−1}(x^ℓ) − (c/4) E[∥x^{ℓ+1} − x^ℓ∥² | F_ℓ] + ∥ζ − ζ_{ℓ−1}∥_∞ / ℓ. (6.7)

The first inequality of (6.7) holds because E[F(x^ℓ,ξ_ℓ) | F_ℓ] = E_ξ̃[F(x^ℓ,ξ̃)] = f(x^ℓ) and E[H(x^ℓ,ξ_ℓ) | F_ℓ] = E_ξ̃[H(x^ℓ,ξ̃)] = h(x^ℓ). The second inequality of (6.7) holds because (ζ(x^ℓ) − ζ_{ℓ−1}(x^ℓ))/ℓ ≤ ∥ζ − ζ_{ℓ−1}∥_∞/ℓ. By the Donsker Theorem (see Lemma 7 in [74]), there exists K ∈ (0,∞) such that

E[∥ζ − ζ_{ℓ−1}∥_∞] ≤ K/(ℓ−1)^{1/2}, for ℓ ≥ 2.

Hence,

Σ_{ℓ=2}^∞ E[∥ζ − ζ_{ℓ−1}∥_∞ / ℓ] ≤ Σ_{ℓ=2}^∞ K/(ℓ−1)^{3/2} < ∞, (6.8)

which implies that Σ_{ℓ=2}^∞ ∥ζ − ζ_{ℓ−1}∥_∞ / ℓ < ∞ w.p.1. Adding f̄ to both sides of (6.7), we have f_{ℓ−1}(x^ℓ) + h_{ℓ−1}(x^ℓ) + f̄ ≥ 0 for all ℓ ≥ 2. Hence, by the Supermartingale Convergence Theorem (Proposition 2 in Bertsekas [10]), since Σ_{ℓ=2}^∞ ∥ζ − ζ_{ℓ−1}∥_∞ / ℓ < ∞ w.p.1, we have that {f_{ℓ−1}(x^ℓ) + h_{ℓ−1}(x^ℓ) + f̄}_{ℓ≥2} converges, and hence {f_{ℓ−1}(x^ℓ) + h_{ℓ−1}(x^ℓ)}_{ℓ≥2} converges with probability one. Furthermore, the Supermartingale Convergence Theorem implies that Σ_{ℓ=2}^∞ E[∥x^{ℓ+1} − x^ℓ∥² | F_ℓ] < ∞ w.p.1, which further implies that Σ_{ℓ=2}^∞ E[∥x^{ℓ+1} − x^ℓ∥²] < ∞ and hence Σ_{ℓ=2}^∞ ∥x^{ℓ+1} − x^ℓ∥² < ∞ w.p.1.

Remark: Identifying the "descent" relation in (6.6) of Proposition 6.4 is inspired by the proof of Lemma 1 in [91]. It is worth noting that the analysis of almost all sampled surrogation algorithms with incremental sampling involves using the supermartingale convergence theorem (or a variation of it) and the rate of convergence of the sampled objective function. The novel part of the SD-MM algorithm is that it designs a double-loop structure to incorporate decomposition-based methods into this unifying framework.
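The pieces above — incremental sampling, cut rescaling, exact minorants from subgradients, the proximal candidate update, and the (c/4)-gap test that closes each inner loop — can be assembled into a compact sketch of the double-loop structure on a toy one-dimensional instance. Everything below is our own stand-in (the concave first stage 2·log(x+1), the simple recourse 3·max(ξ − x, 0) with ξ ~ Uniform(0, 4), and a grid search in place of the QP solver for (6.3)); it mimics the structure of SD-MM, not the dissertation's C++/CPLEX implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy instance (ours): minimize 2*log(x+1) + E[3*max(xi - x, 0)] over X = [0, 4]
c, outer_iters = 2.0, 60
grid = np.linspace(0.0, 4.0, 2001)       # stand-in for the QP solver in (6.3)
incumbent = 0.5
alphas, betas = np.empty(0), np.empty(0)  # stored cuts (alpha_t, beta_t)
samples = np.empty(0)

def h_saa(x):                             # sample average of the recourse
    return np.mean(3.0 * np.maximum(samples - x, 0.0))

def subgrad_h(x):                         # a subgradient of h_saa at x
    return -3.0 * np.mean(samples > x)

def cut_max(x):                           # evaluate the cut model on an array
    return np.max(alphas[:, None] + betas[:, None] * x[None, :], axis=0)

for l in range(1, outer_iters + 1):
    samples = np.append(samples, rng.uniform(0.0, 4.0))
    alphas, betas = alphas * (l - 1) / l, betas * (l - 1) / l   # rescale cuts
    g = subgrad_h(incumbent)              # exact minorant at the incumbent
    alphas = np.append(alphas, h_saa(incumbent) - g * incumbent)
    betas = np.append(betas, g)
    slope_f = 2.0 / (incumbent + 1.0)     # tangent slope of the concave part
    while True:                           # inner loop
        vals = (slope_f * (grid - incumbent) + cut_max(grid)
                + 0.5 * c * (grid - incumbent) ** 2)
        cand = grid[np.argmin(vals)]      # candidate update
        gap = h_saa(cand) - np.max(alphas + betas * cand)
        if gap <= 0.25 * c * (cand - incumbent) ** 2 + 1e-9:
            break                         # test (6.4) passes: serious step
        g = subgrad_h(cand)               # otherwise add a cut at the candidate
        alphas = np.append(alphas, h_saa(cand) - g * cand)
        betas = np.append(betas, g)
    incumbent = cand
```

On this instance the true objective 2·log(x+1) + 0.375·(4−x)² has a unique stationary point in [0, 4], at x = (3 + √(43/3))/2 ≈ 3.39, and the incumbents drift toward it while the sampled objective decreases at each serious step.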
Compared to NSD-MM, SD-MM does not need the Incumbent Selection step (see Algorithm 22) to ensure the convergence of the incumbent solutions. The main reason is that the convergence rate of the SAA objective function is known in classic SP. The proof of Proposition 6.4 extends straightforwardly to prove the convergence of the mini-batch version of the SD-MM algorithm. For the rest of this section, we stick with the base version of the SD-MM algorithm (i.e., the batch size is 1), but it is straightforward to extend the proof to the mini-batch version of the SD-MM algorithm.

The following proposition is inspired by Lemma 4 in Higle and Sen [47], but the conclusion is stronger (i.e., the argument is on the entire sequence versus the argument on a subsequence).

Proposition 6.5. Suppose Assumptions B1 - B6 hold. Then, with probability one, any accumulation point of {x^ℓ}_ℓ (i.e., lim_{ℓ(∈L)→∞} x^ℓ = x^∞) satisfies the following relations:

lim_{ℓ(∈L)→∞} x^{ℓ+1} = lim_{ℓ(∈L)→∞} x^ℓ = x^∞ ∈ X, (6.9)

and

limsup_{ℓ(∈L)→∞} ĥ′_{ℓ,ν_ℓ}(x^{ℓ+1}; x − x^{ℓ+1}) ≤ h′(x^∞; x − x^∞), ∀ x ∈ X. (6.10)

Proof. Since X is compact and {x^ℓ}_ℓ ⊂ X, an accumulation point of {x^ℓ}_ℓ exists, say lim_{ℓ(∈L)→∞} x^ℓ = x^∞ ∈ X. By Proposition 6.4, we have Σ_{ℓ=1}^∞ ∥x^{ℓ+1} − x^ℓ∥² < ∞ w.p.1, which implies that lim_{ℓ→∞} ∥x^{ℓ+1} − x^ℓ∥ = 0 and thus lim_{ℓ(∈L)→∞} x^{ℓ+1} = lim_{ℓ(∈L)→∞} x^ℓ = x^∞. Pick x ∈ X. Let {(α^ℓ_t, β^ℓ_t)}_{t∈T_{ℓ,ν_ℓ}} denote the collection of minorant coefficients of ĥ_{ℓ,ν_ℓ}(x) at the end of the ℓ-th outer loop. That is, ĥ_{ℓ,ν_ℓ}(x) = max_{t∈T_{ℓ,ν_ℓ}} {α^ℓ_t + (β^ℓ_t)^⊤ x}. Let

(α^ℓ_{t_ℓ}, β^ℓ_{t_ℓ}) ∈ argmax{(β^ℓ_t)^⊤ (x − x^{ℓ+1}) | α^ℓ_t + (β^ℓ_t)^⊤ x^{ℓ+1} = ĥ_{ℓ,ν_ℓ}(x^{ℓ+1}), t ∈ T_{ℓ,ν_ℓ}}.

By Danskin's theorem, ĥ′_{ℓ,ν_ℓ}(x^{ℓ+1}; x − x^{ℓ+1}) = (β^ℓ_{t_ℓ})^⊤ (x − x^{ℓ+1}). By Assumption B6, {β^ℓ_{t_ℓ}}_{ℓ∈L} is bounded.
Hence, limsup_{ℓ(∈L)→∞} ĥ′_{ℓ,ν_ℓ}(x^{ℓ+1}; x − x^{ℓ+1}) < ∞ and there exists a further subsequence of L, say L_0, such that lim_{ℓ(∈L_0)→∞} (α^ℓ_{t_ℓ}, β^ℓ_{t_ℓ}) = (α^∞, β^∞) and

limsup_{ℓ(∈L)→∞} ĥ′_{ℓ,ν_ℓ}(x^{ℓ+1}; x − x^{ℓ+1}) = lim_{ℓ(∈L_0)→∞} (β^ℓ_{t_ℓ})^⊤ (x − x^{ℓ+1}) = (β^∞)^⊤ (x − x^∞).

By Proposition 6.1 and the terminating criterion of the inner loop, we have

h_ℓ(x^{ℓ+1}) − (c/4)∥x^{ℓ+1} − x^ℓ∥² ≤ ĥ_{ℓ,ν_ℓ}(x^{ℓ+1}) = α^ℓ_{t_ℓ} + (β^ℓ_{t_ℓ})^⊤ x^{ℓ+1} ≤ h_ℓ(x^{ℓ+1}). (6.11)

It is easy to verify that {h_ℓ}_ℓ is equicontinuous on X and hence, by the Strong Law of Large Numbers, {h_ℓ}_ℓ converges to h uniformly on X. Hence, it follows from (6.11) that

h(x^∞) = lim_{ℓ(∈L)→∞} [h_ℓ(x^{ℓ+1}) − (c/4)∥x^{ℓ+1} − x^ℓ∥²] ≤ lim_{ℓ(∈L)→∞} [α^ℓ_{t_ℓ} + (β^ℓ_{t_ℓ})^⊤ x^{ℓ+1}] ≤ h(x^∞). (6.12)

Equation (6.12) implies that lim_{ℓ(∈L)→∞} α^ℓ_{t_ℓ} + (β^ℓ_{t_ℓ})^⊤ x^{ℓ+1} = h(x^∞). Since L_0 ⊆ L, we have

h(x^∞) = lim_{ℓ(∈L_0)→∞} α^ℓ_{t_ℓ} + (β^ℓ_{t_ℓ})^⊤ x^{ℓ+1} = α^∞ + (β^∞)^⊤ x^∞. (6.13)

On the other hand, for any x ∈ X, α^ℓ_{t_ℓ} + (β^ℓ_{t_ℓ})^⊤ x ≤ ĥ_ℓ(x) ≤ h_ℓ(x), which implies that

lim_{ℓ(∈L_0)→∞} α^ℓ_{t_ℓ} + (β^ℓ_{t_ℓ})^⊤ x = α^∞ + (β^∞)^⊤ x ≤ h(x). (6.14)

Thus, it follows from the combination of equations (6.12), (6.13), and (6.14) that β^∞ ∈ ∂h(x^∞). Therefore,

limsup_{ℓ(∈L)→∞} ĥ′_{ℓ,ν_ℓ}(x^{ℓ+1}; x − x^{ℓ+1}) = (β^∞)^⊤ (x − x^∞) ≤ max{u^⊤ (x − x^∞) | u ∈ ∂h(x^∞)} = h′(x^∞; x − x^∞). (6.15)

We first state an important theorem that will be used in the proof of the convergence of the SD-MM algorithm. The following theorem states that swapping the expectation operator and the directional derivative operator is allowed under mild assumptions.

Theorem 6.6. If Assumptions B1 - B4 hold, then for any x̂ ∈ X, f(x) is directionally differentiable at x̂ and

f′(x̂; x − x̂) = E_ξ̃[F′_ξ̃(x̂; x − x̂)], ∀ x ∈ X.

Furthermore, if Assumption A4 is replaced by "F(·, ξ̃) is differentiable on X for almost every ξ̃", then the following alternative conclusion holds: For any x̂ ∈ X, f(x) is differentiable at x̂ and ∇f(x̂) = E_ξ̃[∇_x F(x̂, ξ̃)].

Proof.
See Theorem 7.44 (b) and (c) of Shapiro et al. [102]. The theorem below states that, with probability one, any accumulation point produced by the SD-MM algorithm is a d-stationary point to the problem in (6.1). Theorem 6.7. Suppose Assumptions B1 - B6 hold. Suppose either of the following two conditions holds: C1 F(·,ξ ) is differentiable and concave on X for almost every ξ . C2 F(·,ξ ) is L-smooth on X for almost every ξ . Then, with probability one, any accumulation point of{x ℓ } ℓ is a d-stationary point of (6.1). Proof. Case 1: Assumption C1 holds. In this case, the surrogate function of F(x,ξ ) near x ′ ∈ X is written as: G(x,ξ ;x ′ ) = F(x ′ ,ξ )+∇ x F(x ′ ,ξ ) ⊤ (x− x ′ ) for x ∈ X. Since X is compact and {x ℓ } ℓ ⊂ X, the accu- mulation point of{x ℓ } ℓ exists, say lim ℓ(∈L)→∞ x ℓ = x ∞ ∈ X. By Proposition 6.4, we also have lim ℓ(∈L)→∞ x ℓ+1 = x ∞ . By the update rule in Algorithm 23, we have the optimality condition in the ℓ th iteration as follows: ∀ x∈X, 0≤ ( 1 ℓ ℓ X i=1 ∇ x F(x ℓ ,ξ i )) ⊤ (x− x ℓ+1 )+ ˆ h ′ ℓ,ν ℓ (x ℓ+1 ;x− x ℓ+1 )+c(x ℓ+1 − x ℓ ) ⊤ (x− x ℓ+1 ). (6.16) By Assumption B3 and Strong Law of Large Number (theorem 2.5.6 in Durrett [29]), lim ℓ(∈L)→∞ 1 ℓ P ℓ i=1 ∇ x F(x ℓ ,ξ i ) =E ˜ ξ [∇ x F(x ℓ , ˜ ξ )]. By Theorem 6.6, it further implies that lim ℓ(∈L)→∞ 1 ℓ ℓ X i=1 ∇ x F(x ∞ ,ξ i )=∇E ˜ ξ [F(x ∞ , ˜ ξ )]=∇f(x ∞ ). 203 By Proposition 6.5, it follows from (6.16) that 0≤ liminf ℓ(∈L)→∞ {( 1 ℓ ℓ X i=1 ∇ x F(x ℓ ,ξ i )) ⊤ (x− x ℓ+1 )+ ˆ h ′ ℓ,ν ℓ (x ℓ+1 ;x− x ℓ+1 ) +c(x ℓ+1 − x ℓ ) ⊤ (x− x ℓ+1 )} ≤∇ f(x ∞ ) ⊤ (x− x ∞ )+ limsup ℓ(∈L)→∞ ˆ h ′ ℓ,ν ℓ (x ℓ+1 ;x− x ℓ+1 ) ≤∇ f(x ∞ ) ⊤ (x− x ∞ )+h ′ (x ∞ ;x− x ∞ ), ∀ x∈X. (6.17) Case2: AssumptionC2holds. Inthiscase,thesurrogatefunctionofF(x,ξ )nearx ′ ∈X is written as (by the descent lemma in [7]): G(x,ξ ;x ′ ) = F(x,ξ )+∇ x F(x,ξ ) ⊤ (x− x ′ )+ L 2 ∥x− x ′ ∥ 2 . 
Again, the optimality condition in the $\ell$th iteration is as follows: for all $x \in X$,
\[
0 \le \Big(\frac{1}{\ell}\sum_{i=1}^{\ell} \nabla_x F(x^\ell, \xi^i)\Big)^\top (x - x^{\ell+1}) + \hat{h}'_{\ell,\nu_\ell}(x^{\ell+1}; x - x^{\ell+1}) + (c+L)(x^{\ell+1} - x^\ell)^\top (x - x^{\ell+1}). \tag{6.18}
\]
Following the same procedure as in Case 1, we conclude that any accumulation point of $\{x^\ell\}_\ell$ is a d-stationary point of (6.1).

6.4 Computational Results

In this section, we apply the SD-MM algorithm to a class of two-stage shipment problems with linear or concave first-stage cost. The algorithms are implemented in C++, and CPLEX v12.8.0 is used as a base solver to solve convex quadratic programs and to obtain the second-stage dual multipliers. All programs are run on a MacBook Pro 2017 with a 3.1 GHz Dual-Core Intel Core i5. The results below show that the SD-MM algorithm solves two-stage stochastic linear programming problems to near-optimality, maintains a stable descent in the objective value, and converges in the nonconvex case.

6.4.1 Two-Stage SLP

The two-stage stochastic linear programming problem is formulated as
\[
\min_{x \in X}\; c^\top x + \mathbb{E}_{\tilde{\xi}}[H(x, \tilde{\xi})],
\]
where $x$ denotes the amount of production at each factory and $H(x,\xi)$ denotes the second-stage recourse function. We apply the SD-MM algorithm to eight instances: LandS, LandS2, and PGP2 are power generation problems; BK19 and RETAIL are shipment planning problems; 4NODE and 20TERM are freight fleet scheduling problems; and SSN is a network expansion problem. The cost coefficients and the transportation network of BK19 follow the two-stage shipment planning example in Bertsimas and Kallus [12], although the data generation process is different.
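To make the two-stage objective above concrete, here is a minimal sample-average (SAA) illustration with a simple shortage-penalty recourse. The unit cost, penalty, and demand sample are toy values of our own choosing, not data from any of the benchmark instances, and the grid search below stands in for the LP machinery an actual solver would use.

```python
# Toy two-stage problem with simple recourse: order x at unit cost c in the
# first stage; pay penalty q per unit of unmet demand in the second stage.
c, q = 1.0, 3.0
demands = list(range(1, 101))                     # empirical sample of xi

def saa_objective(x):
    # c*x plus the sample average of the recourse cost q*max(xi - x, 0)
    recourse = sum(q * max(d - x, 0.0) for d in demands) / len(demands)
    return c * x + recourse

# Minimize over a grid of candidate first-stage decisions.
x_star = min(range(0, 101), key=saa_objective)
print(x_star)   # the (q - c)/q = 2/3 sample quantile of demand, here 67
```

The minimizer lands at the (q - c)/q quantile of the demand sample, the classical newsvendor solution, which is a useful sanity check for any SAA implementation.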
The demand random variables follow a truncated normal distribution (i.e., the realizations are truncated to be non-negative) with means (5, 6, 7, 8, 5, 6, 7, 8, 5, 6, 7, 8), standard deviations all equal to 1, lower bounds all equal to 0, and upper bounds equal to (10, 12, 14, 16, 10, 12, 14, 16, 10, 12, 14, 16). The model parameters and data generation processes of the other instances are available in Sen and Liu [98].

Instance   Outer       Average inner   Avg(Obj)    Std(Obj)   Avg(95% CI     Time
name       iterations  iterations                             half margin)   elapsed (secs)
LandS      200         369.3           382.12      0.01       1.33           6.77
LandS2     200         317.7           226.95      0.32       1.54           7.12
PGP2       200         573.1           447.89      0.66       1.34           9.90
BK19       100         246.1           729.29      1.32       0.77           2.68
RETAIL     500         671.9           154.14      0.16       0.73           45.29
4NODE      200         1741.3          447.07      0.32       0.05           54.95
20TERM     300         2279            254491.10   112.43     156.25         296.19
SSN        1100        2253.9          10.21       0.11       0.12           961.70

Table 6.2: Computational Results of the SD-MM in Two-Stage SLPs

Table 6.2 summarizes the computational results of the SD-MM algorithm for solving the eight two-stage SLP instances. Each instance is replicated ten times. The column "Average inner iterations" shows the average total number of inner iterations of the SD-MM algorithm for each instance; 4NODE, at more than eight inner loops per outer iteration, requires the most. The column "Avg(Obj)" summarizes the average solution quality in terms of the out-of-sample validated costs. To ensure a sufficient size of the out-of-sample validation set, we report the average 95% CI half margin in the column "Avg(95% CI half margin)". The column "Std(Obj)" shows the oscillation of the estimated solutions in terms of the standard deviation of the out-of-sample validated costs. Comparing with Table 5 in Sen and Liu [98], we conclude that SD-MM can efficiently solve all eight instances to near-optimality.

6.4.2 Two-Stage SP with Concave First-Stage Cost

In the second class of problems, we apply the SD-MM algorithm to solve two instances of two-stage SP with concave first-stage cost.
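The surrogate used for a concave first-stage cost is its linearization (Case 1 of Theorem 6.7), which majorizes the cost from above. For a k log(bx + 1) cost of the kind used in these instances, this is easy to verify numerically; the coefficients and linearization point below are toy values of our own, not parameters of the test instances.

```python
import math

k, b = 2.0, 3.0
phi = lambda x: k * math.log(b * x + 1.0)         # concave first-stage cost term
grad = lambda x: k * b / (b * x + 1.0)            # its derivative

x0 = 1.5                                          # linearization point
surrogate = lambda x: phi(x0) + grad(x0) * (x - x0)

# For a concave function, the tangent line lies above the function everywhere.
grid = [i / 100.0 for i in range(0, 1001)]        # x in [0, 10]
ok = all(surrogate(x) >= phi(x) - 1e-12 for x in grid)
print(ok)   # True
```

The same check fails for a convex cost, which is why the convex component must instead be approximated from below by minorants.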
The mathematical formulation of the first instance is
\[
\min_{x \in X}\; \mathbb{E}_{\tilde{\xi}}\Big[\sum_{i=1}^{n} k_i(\tilde{\xi})\log\big(b_i(\tilde{\xi})x_i + 1\big) + H(x, \tilde{\xi})\Big],
\]
a modification of BK19 which we refer to as BK19(sto klogbx). The second instance is based on 4NODE, where the second-stage subproblem remains the same and the objective function becomes
\[
\min_{x \in X}\; \sum_{i=1}^{n} k_i \log(b_i x_i + 1) + \mathbb{E}_{\tilde{\xi}}[H(x, \tilde{\xi})].
\]
We shall refer to this instance as 4NODE(det klogbx). For each instance, we compare SD-MM to Stochastic Approximation (SA) and replicate the entire process ten times.

Figure 6.2: Computational Results of the SD-MM

In both graphs of Figure 6.2, the solid black line with "*" and the red area show the average performance and the distribution of the ten replications of the SD-MM algorithm, respectively. Similarly, the dotted black line with "+" and the blue area show the average performance and the distribution of the ten replications of the SA algorithm. In the case of BK19(sto klogbx) (shown in the left subgraph of Figure 6.2), both the SD-MM and SA algorithms converge, while SD-MM converges faster and its solution quality is more stable (i.e., the red area is smaller than the blue area). As for 4NODE(det klogbx) (shown in the right subgraph of Figure 6.2), SA fails to converge. This is because SA has no theoretical convergence guarantee for non-smooth nonconvex SP, and 4NODE(det klogbx) is considerably more complex than BK19(sto klogbx): the dimension of the second-stage dual in 4NODE(det klogbx) is 74 versus 16 for BK19(sto klogbx), and the dimension of the first-stage decision is much larger. In contrast, the SD-MM algorithm converges after 40 samples are generated in 4NODE(det klogbx).

Chapter 7

Conclusions

In classic stochastic programming, the underlying distribution is usually a given fact.
Moreover, the risk-neutral approach, while good at capturing the average loss/risk/cost, lacks the predictive power to grasp the predicted loss/risk/cost. In this research, we have combined stochastic programming algorithms with non-parametric statistical learning approaches so that the proposed hybrid algorithms have the power of simultaneous learning, prediction, and optimization. In particular, we proposed a first-order methodology (LEON) that uses the so-called non-parametric stochastic quasi-subgradients to update estimated solutions. In addition, we have made a direct non-parametric extension of the Stochastic Decomposition algorithm (NSD) by using k nearest neighbors estimation to compute the minorants. In all, LEON can solve general convex predictive stochastic programming problems, while NSD can be used to solve two-stage predictive stochastic linear programming and two-stage predictive stochastic quadratic-quadratic programming problems.

One challenge of stochastic programming is nonconvexity, especially when non-smoothness is also present. This thesis has solved (i.e., designed an algorithm to obtain a directional-stationary point of the true problem) a class of nonconvex predictive stochastic programming problems in which the objective functions are difference-of-convex functions. In the algorithm design, we propose a novel way to approximate the convex component from below and the concave component from above. Another novelty of the algorithm is that it provides a controllable way to make a local estimation of the objective function. This feature allows the algorithm to efficiently assess any improvement in solution quality without exploring the entire landscape of the objective function per iteration. We utilize the k nearest neighbors method to answer the predictive requirement of predictive stochastic programming problems.
The newly proposed algorithm can be regarded as a combination of stochastic decomposition, majorization minimization, and k nearest neighbors estimation (NSD-MM).

There are many open problems for future research. In particular, one interesting direction is to study the convergence rate of the proposed algorithms while accounting for the bias of the non-parametric estimates. Another direction is to study the finite-sample convergence of the algorithms. In addition to objectives in the form of conditional expectations, a third direction is to study the PSP problem with constraints in the form of conditional expectations given the realization of the predictors.

References

1. Altman, N. S. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician 46, 175-185 (1992).
2. An, L. T. H., Van Ngai, H., Tao, P. D. & Hau, L. H. P. Stochastic difference-of-convex algorithms for solving nonconvex optimization problems. arXiv preprint arXiv:1911.04334 (2019).
3. Asamov, T. & Powell, W. B. Regularized decomposition of high-dimensional multistage stochastic programs with Markov uncertainty. SIAM Journal on Optimization 28, 575-595 (2018).
4. Ban, G.-Y. & Rudin, C. The big data newsvendor: Practical insights from machine learning. Operations Research 67, 90-108 (2019).
5. Barbarosoğlu, G. & Arda, Y. A two-stage stochastic programming framework for transportation planning in disaster response. Journal of the Operational Research Society 55, 43-53 (2004).
6. Bassok, Y., Anupindi, R. & Akella, R. Single-period multiproduct inventory models with substitution. Operations Research 47, 632-642 (1999).
7. Beck, A. First-order methods in optimization (SIAM, 2017).
8. Beck, A. & Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters 31, 167-175 (2003).
9. Benders, J. F. Partitioning procedures for solving mixed-variables programming problems. Numerische Mathematik 4, 238-252 (1962).
10.
Bertsekas, D. P. Incremental proximal methods for large scale convex optimization. Mathematical Programming 129, 163-195 (2011).
11. Bertsekas, D. P., Nedić, A., Ozdaglar, A. E., et al. Convex analysis and optimization (2003).
12. Bertsimas, D. & Kallus, N. From predictive to prescriptive analytics. Management Science (2019).
13. Bertsimas, D. & McCord, C. From predictions to prescriptions in multistage optimization problems. arXiv preprint arXiv:1904.11637 (2019).
14. Birge, J. R. Decomposition and partitioning methods for multistage stochastic linear programs. Operations Research 33, 989-1007 (1985).
15. Birge, J. R. & Qi, L. Subdifferential convergence in stochastic programs. SIAM Journal on Optimization 5, 436-453 (1995).
16. Blundell, R. & Duncan, A. Kernel regression in empirical microeconomics. Journal of Human Resources, 62-87 (1998).
17. Cafaro, D. C. & Grossmann, I. E. Alternate approximation of concave cost functions for process design and supply chain optimization problems. Computers & Chemical Engineering 60, 376-380 (2014).
18. Casey, M. S. & Sen, S. The scenario generation algorithm for multistage stochastic linear programming. Mathematics of Operations Research 30, 615-631 (2005).
19. Cheng, P. E. Applications of kernel regression estimation: survey. Communications in Statistics - Theory and Methods 19, 4103-4134 (1990).
20. Çınlar, E. Probability and stochastics (Springer Science & Business Media, 2011).
21. Cui, Y. & Pang, J.-S. Modern Nonconvex Nondifferentiable Optimization (SIAM, 2021).
22. Dantzig, G. B. & Infanger, G. Multi-stage stochastic linear programs for portfolio optimization. Annals of Operations Research 45, 59-76 (1993).
23. Dempster, M. A. & Thompson, R. Parallelization and aggregation of nested Benders decomposition. Annals of Operations Research 81, 163-188 (1998).
24. Deng, Y. & Sen, S. Predictive stochastic programming. Computational Management Science 19, 65-98 (2022).
25. Devroye, L.
The uniform convergence of nearest neighbor regression function estimators and their application in optimization. IEEE Transactions on Information Theory 24, 142-151 (1978).
26. Devroye, L., Györfi, L., Krzyzak, A., Lugosi, G., et al. On the strong universal consistency of nearest neighbor regression function estimates. The Annals of Statistics 22, 1371-1385 (1994).
27. Diao, S. & Sen, S. Distribution-free algorithms for learning enabled optimization with non-parametric estimation. Optimization Online (2020).
28. Diao, S. & Sen, S. Online non-parametric estimation for nonconvex stochastic programming. Optimization Online (2022).
29. Durrett, R. Probability: theory and examples (Cambridge University Press, 2019).
30. Elmachtoub, A. N. & Grigas, P. Smart "predict, then optimize". Management Science (2021).
31. Elmachtoub, A. N. & Grigas, P. Smart "predict, then optimize". Management Science 68, 9-26 (2022).
32. Ermoliev, Y. Stochastic quasigradient methods and their application to system optimization. Stochastics: An International Journal of Probability and Stochastic Processes 9, 1-36 (1983).
33. Ermoliev, Y. M. & Norkin, V. I. Normalized convergence in stochastic optimization. Annals of Operations Research 30, 187-198 (1991).
34. Ermoliev, Y. M. & Norkin, V. I. Sample average approximation method for compound stochastic optimization problems. SIAM Journal on Optimization 23, 2231-2263 (2013).
35. Ermoliev, Y. M. & Gaivoronski, A. A. Stochastic quasigradient methods for optimization of discrete event systems. Annals of Operations Research 39, 1-39 (1992).
36. Fix, E. & Hodges Jr, J. L. Discriminatory analysis - nonparametric discrimination: consistency properties. Tech. rep. (California Univ Berkeley, 1951).
37. Frauendorfer, K. Barycentric scenario trees in convex multistage stochastic programming. Mathematical Programming 75, 277-293 (1996).
38. Friedman, J., Hastie, T. & Tibshirani, R. The elements of statistical learning (Springer Series in Statistics, New York, 2001).
39.
Gangammanavar, H. & Sen, S. Stochastic dynamic linear programming: A sequential sampling algorithm for multistage stochastic linear programming. SIAM Journal on Optimization 31, 2111-2140 (2021).
40. Gassmann, H. I. MSLiP: A computer code for the multistage stochastic linear programming problem. Mathematical Programming 47, 407-423 (1990).
41. Greblicki, W. & Krzyzak, A. Asymptotic properties of kernel estimates of a regression function. Journal of Statistical Planning and Inference 4, 81-90 (1980).
42. Greblicki, W., Krzyzak, A., Pawlak, M., et al. Distribution-free pointwise consistency of kernel regression estimate. The Annals of Statistics 12, 1570-1575 (1984).
43. Guigues, V., Lejeune, M. A. & Tekaya, W. Regularized stochastic dual dynamic programming for convex nonlinear optimization problems. Optimization and Engineering 21, 1133-1165 (2020).
44. Györfi, L., Kohler, M., Krzyzak, A. & Walk, H. A distribution-free theory of nonparametric regression (Springer Science & Business Media, 2002).
45. Hafiz, F., de Queiroz, A. R., Fajri, P. & Husain, I. Energy management and optimal storage sizing for a shared community: A multi-stage stochastic programming approach. Applied Energy 236, 42-54 (2019).
46. Higle, J. L. & Sen, S. Stochastic decomposition: An algorithm for two-stage linear programs with recourse. Mathematics of Operations Research 16, 650-669 (1991).
47. Higle, J. L. & Sen, S. Finite master programs in regularized stochastic decomposition. Mathematical Programming 67, 143-168 (1994).
48. Hillier, F. S. Introduction to operations research (Tata McGraw-Hill Education, 2012).
49. Hu, C., Jain, G., Zhang, P., Schmidt, C., Gomadam, P. & Gorka, T. Data-driven method based on particle swarm optimization and k-nearest neighbor regression for estimating capacity of lithium-ion battery. Applied Energy 129, 49-55 (2014).
50. Hu, Y., Kallus, N. & Mao, X. Fast rates for contextual linear optimization. arXiv preprint arXiv:2011.03030 (2020).
51. Huang, G.
& Loucks, D. P. An inexact two-stage stochastic programming model for water resources management under uncertainty. Civil Engineering Systems 17, 95-118 (2000).
52. Izenman, A. J. Review papers: Recent developments in nonparametric density estimation. Journal of the American Statistical Association 86, 205-224 (1991).
53. Jost, J. Postmodern analysis (Springer Science & Business Media, 2006).
54. Kannan, R., Bayraksan, G. & Luedtke, J. Heteroscedasticity-aware residuals-based contextual stochastic optimization. arXiv preprint arXiv:2101.03139 (2021).
55. Kannan, R., Bayraksan, G. & Luedtke, J. R. Data-driven sample average approximation with covariate information. Optimization Online (2020).
56. Konno, H. & Wijayanayake, A. Portfolio optimization problem under concave transaction costs and minimal transaction unit constraints. Mathematical Programming 89, 233-250 (2001).
57. Kozek, A. S., Leslie, J. R., Schuster, E. F., et al. On a universal strong law of large numbers for conditional expectations. Bernoulli 4, 143-165 (1998).
58. Kumbhakar, S. C. & Parmeter, C. F. Estimation of hedonic price functions with incomplete information. Empirical Economics 39, 1-25 (2010).
59. Lacka, E., Boyd, D. E., Ibikunle, G. & Kannan, P. Measuring the real-time stock market impact of firm-generated content. Journal of Marketing, 00222429211042848 (2021).
60. Lai, T. L. et al. Stochastic approximation. Annals of Statistics 31, 391-406 (2003).
61. Lan, G. Complexity of stochastic dual dynamic programming. Mathematical Programming, 1-38 (2020).
62. Le Thi, H. A. & Pham Dinh, T. DC programming and DCA: thirty years of developments. Mathematical Programming 169, 5-68 (2018).
63. Lee, Y., Kim, S.-K. & Ko, I. H. Multistage stochastic linear programming model for daily coordinated multi-reservoir operation. Journal of Hydroinformatics 10, 23-41 (2008).
64. Li, H. & Cui, Y. A decomposition algorithm for two-stage stochastic programs with nonconvex recourse. arXiv preprint arXiv:2204.01269 (2022).
65. Linowsky, K.
& Philpott, A. B. On the convergence of sampling-based decomposition algorithms for multistage stochastic programs. Journal of Optimization Theory and Applications 125, 349-366 (2005).
66. Liu, J., Cui, Y. & Pang, J.-S. Solving nonsmooth and nonconvex compound stochastic programs with applications to risk measure minimization. Mathematics of Operations Research (2022).
67. Liu, J., Cui, Y., Pang, J.-S. & Sen, S. Two-stage stochastic programming with linearly bi-parameterized quadratic recourse. SIAM Journal on Optimization 30, 2530-2558 (2020).
68. Liu, J., Li, G. & Sen, S. Coupled learning enabled stochastic programming with endogenous uncertainty. Mathematics of Operations Research (2021).
69. Liu, J. & Sen, S. Asymptotic results of stochastic decomposition for two-stage stochastic quadratic programming. SIAM Journal on Optimization 30, 823-852 (2020).
70. Liyanage, L. H. & Shanthikumar, J. G. A practical inventory control policy using operational statistics. Operations Research Letters 33, 341-348 (2005).
71. Ma, L., Crawford, M. M. & Tian, J. Local manifold learning-based k-nearest-neighbor for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing 48, 4099-4109 (2010).
72. Mairal, J. Stochastic majorization-minimization algorithms for large-scale optimization. Advances in Neural Information Processing Systems 26 (2013).
73. Mairal, J. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization 25, 829-855 (2015).
74. Mairal, J., Bach, F., Ponce, J. & Sapiro, G. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research 11 (2010).
75. Mangalova, E. & Agafonov, E. Wind power forecasting using the k-nearest neighbors algorithm. International Journal of Forecasting 30, 402-406 (2014).
76. Martins-Filho, C. & Bin, O. Estimation of hedonic price functions via additive nonparametric regression. Empirical Economics 30, 93-114 (2005).
77. Morton, D. P.
An enhanced decomposition algorithm for multistage stochastic hydroelectric scheduling. Annals of Operations Research 64, 211-235 (1996).
78. Nadaraya, E. A. On estimating regression. Theory of Probability & Its Applications 9, 141-142 (1964).
79. Nemirovski, A., Juditsky, A., Lan, G. & Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization 19, 1574-1609 (2009).
80. Nemirovski, A. S. & Yudin, D. B. Cesari convergence of the gradient method of approximating saddle points of convex-concave functions. In Doklady Akademii Nauk 239, 1056-1059 (1978).
81. Nemirovskij, A. S. & Yudin, D. B. Problem complexity and method efficiency in optimization (1983).
82. Oladunni, T. & Sharma, S. Hedonic housing theory: a machine learning investigation. In 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), 522-527 (2016).
83. Ortiz-García, E., Salcedo-Sanz, S. & Casanova-Mateo, C. Accurate precipitation prediction with support vector classifiers: A study including novel predictive variables and observational data. Atmospheric Research 139, 128-136 (2014).
84. Owusu-Ansah, A. A review of hedonic pricing models in housing research. Journal of International Real Estate and Construction Studies 1, 19 (2011).
85. Pang, J.-S., Razaviyayn, M. & Alvarado, A. Computing B-stationary points of nonsmooth DC programs. Mathematics of Operations Research 42, 95-118 (2017).
86. Pereira, M. V. & Pinto, L. M. Multi-stage stochastic optimization applied to energy planning. Mathematical Programming 52, 359-375 (1991).
87. Phan, D. & Ghosh, S. Two-stage stochastic optimization for optimal power flow under renewable generation uncertainty. ACM Transactions on Modeling and Computer Simulation (TOMACS) 24, 1-22 (2014).
88. Philpott, A. B. & Guan, Z. On the convergence of stochastic dual dynamic programming and related methods. Operations Research Letters 36, 450-455 (2008).
89.
Polyak, B. T. & Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30, 838-855 (1992).
90. Powell, W. B., George, A., Simao, H., Scott, W., Lamont, A. & Stewart, J. SMART: a stochastic multiscale model for the analysis of energy resources, technology, and policy. INFORMS Journal on Computing 24, 665-682 (2012).
91. Razaviyayn, M., Sanjabi, M. & Luo, Z.-Q. A stochastic successive minimization method for nonsmooth nonconvex optimization with applications to transceiver design in wireless communication networks. Mathematical Programming 157, 515-545 (2016).
92. Robbins, H. & Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 400-407 (1951).
93. Robbins, H. & Siegmund, D. In Optimizing Methods in Statistics, 233-257 (Elsevier, 1971).
94. Rockafellar, R. T. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization 14, 877-898 (1976).
95. Rockafellar, R. T. & Uryasev, S. Optimization of conditional value-at-risk. Journal of Risk 2, 21-42 (2000).
96. Rotting, T. & Gjelsvik, A. Stochastic dual dynamic programming for seasonal scheduling in the Norwegian power system. IEEE Transactions on Power Systems 7, 273-279 (1992).
97. Sen, S., Doverspike, R. D. & Cosares, S. Network planning with random demand. Telecommunication Systems 3, 11-30 (1994).
98. Sen, S. & Liu, Y. Mitigating uncertainty via compromise decisions in two-stage stochastic linear programming: Variance reduction. Operations Research 64, 1422-1437 (2016).
99. Sen, S. & Zhou, Z. Multistage stochastic decomposition: a bridge between stochastic programming and approximate dynamic programming. SIAM Journal on Optimization 24, 127-153 (2014).
100. Shapiro, A. On complexity of multistage stochastic programs. Operations Research Letters 34, 1-8 (2006).
101. Shapiro, A. Stochastic programming approach to optimization under uncertainty. Mathematical Programming 112, 183-220 (2008).
102. Shapiro, A., Dentcheva, D. & Ruszczynski, A.
Lectures on stochastic programming: modeling and theory (SIAM, 2021).
103. Smith, B. L. & Demetsky, M. J. Short-term traffic flow prediction models: a comparison of neural network and nonparametric regression approaches. In Proceedings of IEEE International Conference on Systems, Man and Cybernetics 2, 1706-1709 (1994).
104. Stone, C. J. Consistent nonparametric regression. The Annals of Statistics, 595-620 (1977).
105. Takeda, H., Farsiu, S. & Milanfar, P. Kernel regression for image processing and reconstruction. IEEE Transactions on Image Processing 16, 349-366 (2007).
106. Tan, S. Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Systems with Applications 28, 667-671 (2005).
107. Tao, P. D. & An, L. T. H. Convex analysis approach to DC programming: theory, algorithms and applications. Acta Mathematica Vietnamica 22, 289-355 (1997).
108. Van Slyke, R. M. & Wets, R. J.-B. A duality theory for abstract mathematical programs with applications to optimal control theory. Tech. rep. (Boeing Scientific Research Labs, Seattle, WA, Mathematics Research Lab, 1967).
109. Walk, H. A universal strong law of large numbers for conditional expectations via nearest neighbors. Journal of Multivariate Analysis 99, 1035-1050 (2008).
110. Walk, H. In Recent Developments in Applied Probability and Statistics, 183-214 (Springer, 2010).
111. Wasserman, L. All of nonparametric statistics (Springer Science & Business Media, 2006).
112. Watson, G. S. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, 359-372 (1964).
113. Wets, R. J.-B. Chapter VIII: Stochastic programming. Handbooks in Operations Research and Management Science 1, 573-629 (1989).
114. Witte, A. D., Sumka, H. J. & Erekson, H. An estimate of a structural hedonic price model of the housing market: an application of Rosen's theory of implicit markets. Econometrica: Journal of the Econometric Society, 1151-1173 (1979).
115. Wright, S.
Primal-dual aggregation and disaggregation for stochastic linear programs. Mathematics of Operations Research 19, 893-908 (1994).
116. Zhou, Z., Mertikopoulos, P., Bambos, N., Boyd, S. P. & Glynn, P. W. On the convergence of mirror descent beyond stochastic convex programming. SIAM Journal on Optimization 30, 687-716 (2020).
Abstract
In this collection of research, we focus on the study of first-order and decomposition-based algorithms for solving predictive stochastic programming problems, multistage stochastic programming problems, and nonconvex stochastic programming problems. Chapter 1 presents the generic formulation of SP problems, real-life examples of stochastic programming problems, types of convergence often used in the analysis of stochastic programming algorithms, and a brief introduction to several popular stochastic programming algorithms.
Chapter 2 studies sample complexity and decomposition-based algorithms for multistage stochastic linear programming problems. First, we provide a different view of the sample complexity of multistage stochastic linear programs based on concepts from compound sample average approximation. Second, we present a unifying framework for studying various decomposition-based methods for solving multistage stochastic linear programming problems.
Chapter 3 demonstrates a fusion of concepts from stochastic optimization and non-parametric statistical learning in which data is available in the form of covariates interpreted as predictors and responses. In this new class of stochastic programming problems with covariates (which we refer to as predictive stochastic programming problems), the objective function is a conditional expectation of the random cost function with respect to the observation of the predictor. The objective function can predict the outcome given the observed characteristic, such as the hedonic pricing model for housing price prediction. To solve the predictive stochastic programming problems, we use non-parametric estimation methods (e.g., kNN and kernel) to compute a weighted average of the subgradients of the cost function for each data pair. This new type of subgradient estimation is defined as the non-parametric stochastic quasi-gradient of the “true” objective function. Then we use the notion of the non-parametric stochastic quasi-gradients to design a double-loop first-order algorithm to ensure asymptotic convergence.
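The kNN-weighted subgradient averaging described above can be sketched as follows. This is a simplified illustration of the idea with function names of our own; it is not the LEON implementation.

```python
def knn_quasi_gradient(z, data, subgrad, x, k):
    """Average the per-pair cost subgradients over the k data pairs whose
    predictors are nearest (in Euclidean distance) to the observation z."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(data, key=lambda pair: dist(pair[0], z))[:k]
    g = [0.0] * len(x)
    for _, response in nearest:
        s = subgrad(x, response)            # subgradient for one data pair
        g = [gi + si / k for gi, si in zip(g, s)]
    return g

# Toy cost F(x, xi) = ||x - xi||^2, with responses attached to 1-d predictors.
data = [((float(i),), (float(i), 0.0)) for i in range(10)]  # (predictor, response)
subgrad = lambda x, xi: [2.0 * (xj - cj) for xj, cj in zip(x, xi)]
g = knn_quasi_gradient((4.2,), data, subgrad, x=(0.0, 0.0), k=3)
```

With uniform 1/k weights this is the kNN special case; a kernel estimator would replace the uniform weights with normalized kernel weights over all pairs.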
Chapter 4 follows the problem formulation of the previous chapter and makes a non-parametric extension of the Stochastic Decomposition (SD) algorithm (Non-parametric SD) to solve a class of two-stage predictive stochastic programming problems, which includes two-stage predictive stochastic linear programming problems as well as two-stage predictive stochastic quadratic-quadratic programming problems. Compared to the original SD, the minorant is constructed via k nearest neighbors estimation instead of sample averaging. Therefore, the minorant update (i.e., rescaling of the previously generated minorants) in the original SD cannot be used, and a new minorant update is designed to overcome this challenge.
Chapter 5 concerns algorithm design for (predictive) stochastic programming beyond convex optimization. In particular, we present a fusion of stochastic decomposition, majorization minimization, and the k nearest neighbors method to solve a class of nonconvex predictive stochastic programs (PSP), where the randomness is presented as predictor and response. The objective function is a conditional expectation, given the observation of the predictor, of a smooth concave function plus a second-stage linear recourse function. This extension not only admits new stochastic difference-of-convex (dc) objectives but also enables new applications in which these two crucial paradigms (PSP and dc) can be integrated to provide a more powerful setting for modern applications. The combination also provides an opportunity to study convergence results in a more general setting. In Chapter 6, we show that, with minor modifications, the proposed methodology can be transitioned into a decomposition-based algorithm for solving classic nonconvex stochastic programming problems. Finally, computational results on various instances demonstrate the efficiency of the methodology.
Chapter 3 is based on the joint working paper with Dr. Suvrajeet Sen and we thank Dr. Phebe Vayanos for the early discussion about the predictive stochastic programming. Chapters 5 and 6 are based on the joint working paper with Dr. Suvrajeet Sen.
Asset Metadata

Creator: Diao, Shuotao (author)
Core Title: On the interplay between stochastic programming, non-parametric statistics, and nonconvex optimization
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Industrial and Systems Engineering
Degree Conferral Date: 2022-08
Publication Date: 07/15/2022
Defense Date: 05/16/2022
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tags: decomposition algorithm; non-parametric statistical estimation; OAI-PMH Harvest; stochastic programming
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Sen, Suvrajeet (committee chair); Carlsson, John (committee member); Jain, Rahul (committee member); Vayanos, Phebe (committee member)
Creator Email: sdiao@usc.edu, sonnydiao430@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC111371449
Unique Identifier: UC111371449
Legacy Identifier: etd-DiaoShuota-10832
Document Type: Dissertation
Rights: Diao, Shuotao
Type: texts
Source: 20220715-usctheses-batch-953 (batch); University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu