The Fusion of Predictive and Prescriptive Analytics via Stochastic Programming

by Junyi Liu

A Dissertation Submitted to the FACULTY OF THE USC GRADUATE SCHOOL at the UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Industrial and Systems Engineering)

August 2019

Copyright 2019 Junyi Liu

Acknowledgements

First and foremost, I would like to thank my thesis advisor, Dr. Suvrajeet Sen. I greatly appreciate his considerable time and continuous support during my studies at USC. He is always patient and inspiring when sharing ideas and knowledge with me, and he always encourages me to pursue my own ideas. I am deeply influenced by his devotion to research, his profound understanding of the area of optimization, and his persistence in exploring new research directions.

Second, I would like to thank Dr. Jong-Shi Pang, who is an incredible scholar at USC. I was first impressed by his broad and profound knowledge of mathematics and optimization through his intriguing lectures. During our collaborations, he always has endless energy and pure passion for exploring untouched research areas. I am very fortunate to have experienced the joy, as well as the challenges, of working with Dr. Pang in my last two years at USC.

I also learned a lot from many other faculty members at USC. I would like to thank my committee member, Dr. Vishal Gupta, for being supportive and inspiring through his work on combining optimization with statistics. I would like to thank Dr. Meisam Razaviyayn, whose logical and intriguing lectures greatly helped me understand optimization algorithms for machine learning. I would like to thank Dr. Julia L. Higle, Dr. Sheldon Ross, Dr. Phebe Vayanos, and Dr. John Gunnar Carlsson for much valuable advice and many lectures.

Finally, I would like to thank my family for their support and endless love. My parents, Jie Hong and Gang Liu, have always supported me in pursuing academic achievements. My sister, Xiayin Hong, is my best friend, with whom I have shared the most joy. My grandmother, Zongyu Liu, devoted years to taking care of me; I could not have made such achievements without her unconditional love and her unique life wisdom. I also would like to thank my partner, Guangyu Li, for making my life full of excitement, fulfillment, and joy. I dedicate this dissertation to my family.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Two-stage Stochastic Quadratic Programming
  2.1 Introduction
    2.1.1 Problem setting
    2.1.2 Literature review
    2.1.3 Notations
  2.2 Stochastic Decomposition (SD) Algorithm
    2.2.1 Motivation underlying SD: sample average approximation
    2.2.2 The SD algorithm for two-stage SQLP and SQQP
  2.3 Convergence Rate of SD
  2.4 SD Stopping Rule: Consistent Bootstrap Estimators
  2.5 Summary
Chapter 3: Two-stage stochastic programming with the linearly bi-parameterized quadratic recourse
  3.1 Introduction
    3.1.1 Problem setting
    3.1.2 Examples
    3.1.3 Literature review
  3.2 Known Concepts of Stationarity
  3.3 The PD Case: via the dc Program (3.5)
  3.4 Two Simple Positive Semidefinite Cases
    3.4.1 The first simple case
    3.4.2 The second simple case
  3.5 The General Positive Semidefinite Case
    3.5.1 A detour: generalized critical points
    3.5.2 Convergence of regularization, convexification, and sampling
      3.5.2.1 The sequence $\{x^\nu\}$
      3.5.2.2 The sequence $\{\hat{x}^\nu\}$
      3.5.2.3 The sequence $\{\tilde{x}^\nu\}$
  3.6 Summary
Chapter 4: Stochastic Programming with Implicitly Endogenous Uncertainty
  4.1 Introduction
    4.1.1 Problem setting
    4.1.2 Examples
    4.1.3 Literature review
    4.1.4 Notations
  4.2 Coupled Learning Enabled Optimization (CLEO) algorithm
  4.3 Convergence analysis of the CLEO algorithm
    4.3.1 Probabilistically fully linear property of TR model
    4.3.2 Convergence analysis to a d-stationary point
  4.4 Numerical experiments and results
    4.4.1 A synthetic nonconvex problem
    4.4.2 Joint production and pricing problem
    4.4.3 A pricing problem for hotel rooms
  4.5 Summary
Chapter 5: Conclusion and ongoing work
Reference List
Appendix A: Supplement to Chapter 2
Appendix B: Supplement to Chapter 3
Appendix C: Supplement to Chapter 4

List of Tables
  2.1 Convergence results of the stochastic methods in solving SP problems
  2.2 SAA Process
  2.3 "In-sample" Stopping Rule

List of Figures
  4.1 Comparisons of the CLEO algorithm and the PO schemes for a synthetic nonconvex problem
  4.2 Convergence curve of CLEO for the sample size chosen as 2, 3, 5, 10, 20
  4.3 Convergence curve of the CLEO algorithm and the PO schemes for the joint production and pricing problem
  4.4 Convergence curve for the price-sensitive problem of hotel rooms

Abstract

To transform data into insights for decision-making, one needs to go through a scientific process of predictive and prescriptive analytics. However, the lack of interaction between these two phases can hurt the achievements in each domain of knowledge. In this dissertation, we study both sequential and coupled approaches that integrate predictive and prescriptive analytics, and we develop convergent algorithms that provide a mathematically justifiable decision through the lenses of optimization, statistics, and prediction models.

This dissertation begins with an introduction to stochastic optimization, which is followed by three chapters of new research results. These three chapters can be classified into two parts: one focuses on methodologies which assume that a distribution describing the uncertainty can be obtained prior to decision optimization, whereas the second part is devoted to a method where the distribution of uncertain quantities is unknown, although data regarding these variables can be collected as necessary. Further details of these parts are provided below.

In the first part, we assume that we are able to infer a distribution associated with the uncertainty using statistical methodology. These types of problems lead to stochastic programming (SP) models, for which we study sampling-based algorithms. In Chapter 2, we first investigate a class of two-stage stochastic quadratic programming problems and develop a stochastic decomposition algorithm with a sublinear convergence rate. We also design a stopping rule using bootstrap estimators to efficiently deliver an asymptotically optimal decision. In Chapter 3, we then address a new class of SP problems with a distinguishing feature: the objective function in the second stage is linearly parameterized by the first-stage decision variable, in addition to the standard linear parameterization in the constraints. For a convex second-stage objective function, the SP is a difference-of-convex program, although the difference-of-convex decomposition is not practically suitable for computation. This feature leads to the introduction of the notion of a generalized critical point, to which the almost-sure subsequential convergence of the combined sample average approximation and difference-of-convex approximation is established via regularization.

In the second part of the dissertation, we develop a coupled scheme of estimation and optimization for stochastic programming problems with latently endogenous uncertainties. Without assuming any prior knowledge of the dependency, we develop a novel approach which couples prediction models for estimating the local dependency with an optimization method for seeking a better decision. We prove that the sequence of decisions produced by the coupled scheme converges to a first-order stationary solution with probability 1.
Numerical experiments show that our coupled approach improves significantly upon the sequential scheme. By harnessing algorithmic advances in stochastic programming and the power of prediction, we can provide data-driven decisions with higher performance in terms of statistical optimality and computational efficiency.

Chapter 1: Introduction

In the last 20 years, with the explosion in the amount of data and computing power, there has been substantial development of analytical tools for description, prediction, and decision. According to a 2014 KPMG report, one of the top questions concerning officers in the "C" suite is how predictive analytics should drive future decision-making. To answer this question from an Operations Research (OR) viewpoint, we develop algorithmic schemes that integrate predictive and prescriptive analytics in this dissertation.

In real-world applications, decision makers usually seek decisions in the presence of uncertainties which cannot be truly quantified. One approach is to replace the uncertainty with the estimated mean value of historical observations in the optimization problem. Then the decision-maker seeks a solution of the corresponding deterministic optimization problem using numerical algorithms. However, the deterministic approach has several disadvantages. The most critical one is that the derived solution looks much more optimistic in the deterministic formulation than in the real problem under uncertainty. The second disadvantage is that, in a dynamic setting, the static formulation does not allow the option of taking recourse decisions after the uncertainty is observed.

To deal with the issues of the deterministic approach, another approach takes a stochastic perspective, using an estimated probability distribution of the uncertainty. The decision-making problem is then formulated as a stochastic programming (SP) problem. In traditional SP, the estimated probability distribution is constructed using historical observations, which are also referred to as scenarios. Because the number of scenarios can be large, the complexity of solving SP models is much greater than that of a model using the deterministic approach. Nevertheless, Deng, Liu, and Sen [23] show that with efficient stochastic algorithms, solutions obtained from SP models are more reliable than those obtained using deterministic approaches, according to the validation process, in various kinds of application problems such as network flow, inventory control, and so on.

The most critical feature that we address in this dissertation is that the uncertainty in SP is latently affected by external or internal factors. This motivates us to integrate predictive and prescriptive analytics via stochastic programming. In combining predictive and prescriptive analytics, there are two major questions that we want to address. First, for the various classes of optimization problems formulated by the combination of prediction and prescription, what specific algorithms can be used, and what type of solutions will be derived? Second, when integrating prediction and prescription, how should one balance the trade-off between accuracy and complexity?
In this dissertation, we address these questions with a focus on: (i) gaining a better understanding of stochastic programming modeling and algorithms; (ii) for the sequential scheme of prediction and stochastic optimization, developing sample-based algorithms, respectively, for convex problems with convergence to global optimality and for nonconvex problems with convergence to stationarity; and (iii) developing an algorithm coupling prediction and stochastic programming which automatically balances the trade-off between complexity and accuracy. In the remainder of this chapter, we provide a brief summary of each of the remaining chapters of this dissertation.

In Chapter 2, we focus on two classes of stochastic programming problems: 1) two-stage stochastic quadratic-linear programming (SQLP), in which the objective function is defined by the sum of a quadratic function and the value function of a linear program in the second stage; and 2) two-stage stochastic quadratic-quadratic programming (SQQP), which has quadratic programming problems in both stages. We design stochastic decomposition (SD) algorithms in which the iterative schemes approximate the objective function using piecewise linear/quadratic minorants and the next iterate is obtained through a stochastic proximal mapping. We show that, under some assumptions, the proximal mapping applied in SD obeys a contraction property even though the approximations are based on sequential random samples. Following that, we demonstrate that, under certain assumptions, SD can provide a sequence of solutions converging to the optimal solution at a sublinear convergence rate for both SQLP and SQQP problems. Finally, we present an "in-sample" stopping rule to assess the optimality gap by constructing consistent bootstrap estimators.

In Chapter 3, motivated by sequential prediction and optimization, we extend our scope to a more general two-stage SP problem with both the cost vector and the constraint vector of the second-stage quadratic problem parameterized by the first-stage decision variables. This class of two-stage SP problems is shown to be a difference-of-convex (dc) problem when the quadratic matrix in the second-stage problem is positive semidefinite, though the dc decomposition is not applicable for computation. For this special class of SP problems, we define a new notion of stationarity and develop algorithms that combine sampling, linearization, and regularization, with subsequential convergence to stationarity with probability 1.

In Chapter 4, for SP with latently endogenous uncertainty, we extend the above sequential prediction-optimization paradigm to a coupled scheme, such that the parameter estimation model can guide the optimization model to achieve higher levels of performance. For SP with latently endogenous uncertainty, we design the Coupled Learning Enabled Optimization (CLEO) algorithm without assuming any parametric knowledge of the latent dependence. The CLEO algorithm iteratively undertakes a learning step using local linear regression (LLR) and an optimization step by solving a trust region (TR) subproblem. We prove that the solution sequence obtained by CLEO converges almost surely to a stationary point of a nonconvex and nonsmooth SP problem. Furthermore, in numerical experiments on synthetic and realistic problems, we show that CLEO can provide a decision with smaller objective value and lower variance than the sequential schemes.
Chapter 2: Two-stage Stochastic Quadratic Programming

2.1 Introduction

One of the standard SP models is the two-stage stochastic program, in which the first-stage decision is made prior to observing the uncertainty, and the second-stage recourse decision is undertaken by adapting to the observation in an optimal manner. In this chapter, we consider a class of two-stage SP problems whose objective is to minimize the sum of a quadratic function and the expectation of a random value function of the second-stage (future) quadratic program. An example of a two-stage stochastic quadratic programming problem is portfolio optimization with two time periods.

When the random variable has a very large or even continuous probability space, it is impossible to identify a deterministic optimal solution. To address this issue, there are at least two classical methods based on Monte Carlo sampling techniques, namely, Sample Average Approximation (SAA) and Stochastic Approximation (SA). Both SAA and SA work under the condition that the uncertainty is independent of the decision variable and that its scenarios can be generated independently and identically. In practice, when it is expensive or even impossible to obtain real scenarios, a practical approach is to first predict the probability distribution of the uncertainty and then solve the SP problem equipped with the predicted probability distribution using Monte Carlo sampling methods. For instance, in portfolio optimization, scenarios of the uncertain investment return in the first period can be simulated from the predicted probability distribution and then used in the Monte Carlo sampling methods to derive an investment decision. This method is referred to as the "predict, then optimize" (PO) approach (see [29]). In this chapter, we focus on the Monte Carlo sampling methods, assuming the availability of scenarios regardless of the sampling process. A detailed discussion of the PO approach is covered in Chapter 3 and Chapter 4.

Unlike the SA and SAA methods for general convex problems, the SD algorithm was first developed in [36] by taking advantage of the piecewise structure of two-stage SP. SD inherits convergence to the optimum while maintaining an iteration complexity that is only slightly greater than that of SA. As reported in the empirical experiments by Sen and Liu [72], there has been significant progress with SD algorithms in terms of computational efficiency and statistical accuracy on a fairly large battery of test instances. This chapter is intended to provide the mathematical support via a study of the convergence rates of SD algorithms.

2.1.1 Problem setting

We consider two classes of two-stage stochastic quadratic programming problems: one is referred to as a two-stage SQLP problem, in which the first stage contains a quadratic program together with the value function defined by a collection of (scenario) linear programs in the second stage; the other is referred to as a two-stage SQQP problem, which has quadratic programming structures in both stages. In the mathematical formulations of these two classes of problems, we use the subscripts QL and QQ to identify SQLP and SQQP problems, respectively. The mathematical formulation of a two-stage SQLP is given below:

\[
\begin{array}{ll}
\min & f_{QL}(x) \,:=\, \tfrac{1}{2}\,x^\top Q x + c^\top x + \mathbb{E}\left[h_{QL}(x,\tilde\omega)\right] \\[2pt]
\text{s.t.} & x \in X = \{x : Ax \le b\} \subseteq \mathbb{R}^{n_1},
\end{array}
\tag{2.1}
\]

where the recourse function $h_{QL}$ is defined as

\[
\begin{array}{rl}
h_{QL}(x,\omega) \,:=\, \min & d^\top y \\[2pt]
\text{s.t.} & Dy = \xi(\omega) - C(\omega)\,x, \\[2pt]
& y \ge 0,\; y \in \mathbb{R}^{n_2}.
\end{array}
\tag{2.2}
\]
Here $A \in \mathbb{R}^{m_1 \times n_1}$ is a deterministic matrix with row vectors denoted by $\{a_i\}_{i=1}^{m_1}$, $D \in \mathbb{R}^{m_2 \times n_2}$ is a deterministic matrix, and $\tilde\omega$ denotes a (vector) random variable on a probability space $(\Omega, \mathcal{F}, P)$, with $\Omega$ being the sample space, $\mathcal{F}$ the $\sigma$-algebra generated by subsets of $\Omega$, and $P$ a probability measure defined on $\mathcal{F}$. Thus $\mathbb{E}[\cdot]$ denotes the expectation with respect to the probability measure of $\tilde\omega$. Moreover, $\omega$ denotes an observation of the random variable $\tilde\omega$. To be consistent with previous stochastic decomposition algorithms, we assume that the second-stage cost vector $d$ is fixed. If $Q = 0$, then (2.1) becomes the general two-stage stochastic linear programming (SLP) problem; here, however, we assume $Q$ to be a positive definite matrix in SQLP/SQQP problems.

The definition of the recourse function in (2.2) shows that once the first-stage decision $x$ is determined and an outcome of the random variable $\omega$ is observed, $h_{QL}$ is the optimal value of a linear optimization problem. This optimal value reflects the cost associated with adapting to the information revealed through an outcome $\omega$. Nevertheless, the first-stage decision $x$ must be chosen before the randomness $\tilde\omega$ is realized, so the "cost-to-go" function is evaluated by its expectation in the first-stage objective. It is also interesting to note that ordinary support vector machines (SVM) in ML are simple two-stage SQLP problems. Here, the first-stage decision ($x$) gives the weights of the SVM, the quadratic term is derived from the regularizer of the SVM, and the second stage denotes the penalty term for being on the "wrong" side of the separating hyperplane. In the terminology of SP, this penalty formulation is the so-called "simple recourse" model of SP.

As for a two-stage SQQP, we introduce a quadratic objective function in the recourse program, and the SQQP problem is defined below:

\[
\begin{array}{ll}
\min & f_{QQ}(x) \,:=\, \tfrac{1}{2}\,x^\top Q x + c^\top x + \mathbb{E}\left[h_{QQ}(x,\tilde\omega)\right] \\[2pt]
\text{s.t.} & x \in X = \{x : Ax \le b\} \subseteq \mathbb{R}^{n_1},
\end{array}
\tag{2.3}
\]

where $h_{QQ}$ is the value function of a quadratic program defined as follows:

\[
\begin{array}{rl}
h_{QQ}(x,\omega) \,:=\, \min & \tfrac{1}{2}\,y^\top P y + d^\top y \\[2pt]
\text{s.t.} & Dy = \xi(\omega) - C(\omega)\,x, \\[2pt]
& y \ge 0,\; y \in \mathbb{R}^{n_2}.
\end{array}
\tag{2.4}
\]

The notations are the same as in the two-stage SQLP problem, except that the matrix $P \in \mathbb{R}^{n_2 \times n_2}$ is a positive definite matrix. We make the following assumptions for two-stage SQLP and SQQP problems.

(A1) $Q$ and $P$ are symmetric and positive definite matrices.

(A2) The set $X$ is convex and compact. The outcome set $\Omega$ is compact.

(A3) The second-stage problem satisfies the relatively complete recourse property, i.e., the recourse functions $h_{QL}(x,\omega)$ and $h_{QQ}(x,\omega)$ are finite for all $x \in X$ and almost every $\omega \in \Omega$.

It is appropriate to comment on the nature of these assumptions. In (A1), since the square matrix $Q$ is assumed to be positive definite, the SQLP/SQQP problems (2.1) and (2.3) are convex problems. In addition, the matrix $P$ is assumed to be positive definite because of the dual constructions in the SD algorithm; however, this assumption could be relaxed to positive semidefiniteness using a hybrid scheme between the SD algorithms for SQLP and SQQP. Assumption (A2) follows the previous work, since we focus on convex problems here. Moreover, assumption (A3) means that the recourse function achieves its optimum at one of the extreme points or faces for any $(x,\omega)$ pair almost surely.

One of the more demanding aspects of an SP model is the need to convey the impact of uncertainty in the second-stage recourse function on decisions of the first stage. In order to do so, it is best to take advantage of the structure of the recourse function, i.e., the value function of a linear or quadratic program.
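Before turning to the dual constructions, a small computational aside may help fix ideas. The sketch below (our own illustration, not part of the dissertation's code) evaluates the recourse value $h_{QL}(x,\omega)$ of (2.2) for a single scenario with SciPy's LP solver; all data are synthetic, and the choice $D = [I, -I]$ yields simple recourse, so any right-hand side is attainable and assumption (A3) holds by construction.

```python
import numpy as np
from scipy.optimize import linprog

def recourse_value_ql(x, xi, C, D, d):
    """Evaluate h_QL(x, omega) = min { d'y : D y = xi - C x, y >= 0 }
    for a single scenario omega = (xi, C)."""
    res = linprog(c=d, A_eq=D, b_eq=xi - C @ x)  # default bounds enforce y >= 0
    return res.fun if res.success else np.inf

# Synthetic data; D = [I, -I] gives simple (hence relatively complete) recourse.
rng = np.random.default_rng(0)
m2, n1 = 3, 2
D = np.hstack([np.eye(m2), -np.eye(m2)])
d = np.ones(2 * m2)                      # per-unit cost of the deviation
C = rng.standard_normal((m2, n1))
x = np.array([0.5, -0.2])
xi = rng.standard_normal(m2)
print(recourse_value_ql(x, xi, C, D, d))
```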
Based on the dual form of the recourse function, SD approximates the sample average of the recourse functions by a collection of piecewise linear or quadratic functions. These families of functions are updated sequentially, so that SD is able to discover the structure of the recourse function as the sequential Monte Carlo sampling scheme proceeds. This particular way of using dual approximations of the recourse functions allows SD to create more accurate approximations in areas of the feasible set that the algorithm tends to visit.

We first present the recourse function $h_{QL}(x,\omega)$ as the optimal value of the dual problem:

\[
\begin{array}{rl}
h_{QL}(x,\omega) \,=\, \max & \pi^\top \left(\xi(\omega) - C(\omega)\,x\right) \\[2pt]
\text{s.t.} & \pi \in \Pi = \{\pi : D^\top \pi \le d\}.
\end{array}
\tag{2.5}
\]

Let $\{\pi^1, \pi^2, \ldots, \pi^l\}$ denote the extreme points of the polyhedron $\Pi$. Since $h_{QL}$ is finite almost surely for any solution $x$ and realization $\omega$, the optimal value of the dual problem is achieved at one of the extreme points almost surely. Therefore,

\[
h_{QL}(x,\omega) \,=\, \max_{i=1,\ldots,l} \; (\pi^i)^\top \left(\xi(\omega) - C(\omega)\,x\right).
\tag{2.6}
\]

Consequently, given any observation $\omega$, $h_{QL}(\cdot,\omega)$ is the maximum of $l$ affine functions, and thus a piecewise linear function with respect to $x$.

For the recourse function $h_{QQ}$, we can derive the dual representation ([24]) as follows:

\[
h_{QQ}(x,\omega) \,=\, \max_{s \in \mathbb{R}^{n_2},\, s \ge 0,\, t \in \mathbb{R}^{m_2}} \; g_{QQ}(t,s;x,\omega) := -\tfrac{1}{2}\,\big(D^\top t + s - d\big)^\top P^{-1} \big(D^\top t + s - d\big) + \left(\xi(\omega) - C(\omega)\,x\right)^\top t.
\tag{2.7}
\]

When $D$ has full row rank, $DP^{-1}D^\top$ is invertible. Then the dual problem of $h_{QQ}$ can be written as the non-negative quadratic program (NNQP) in (2.8). Both dual formulations (2.7) and (2.8) are convex quadratic programs with the cost vector linearly parameterized in $x$. Since the constraint sets are polyhedra with finitely many faces, the value function $h_{QQ}(\cdot,\omega)$ is a convex piecewise quadratic function in $x$:

\[
h_{QQ}(x,\omega) \,=\, \max_{s \in \mathbb{R}^{n_2},\; s \ge 0} \; -\tfrac{1}{2}\, s^\top H s + e(x,\omega)^\top s,
\tag{2.8}
\]

where $M = DP^{-1/2}$, $H = P^{-1/2}\left(I - M^\top (MM^\top)^{-1} M\right) P^{-1/2}$, and $e(x,\omega) = Hd - P^{-1/2} M^\top (MM^\top)^{-1} \left(\xi(\omega) - C(\omega)\,x\right)$.

In the SD approach, the approximations are built from the duals of the second-stage linear or quadratic optimization problems. As a result, the asymptotic convergence analysis of SD relies on learning the piecewise linear or quadratic approximations of the recourse functions. Because of the polyhedral structure, in which randomness appears only on the right-hand side of the second-stage constraints, the pertinent faces of the dual problems all remain fixed and finite in both SQLP and SQQP. As these dual polyhedra are shared by all scenarios, dual approximation schemes can be used in SD in a manner that exploits such commonality across scenarios and maintains a list of dual faces visited by the algorithm.
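To illustrate how dual information can be reused across scenarios, the following sketch (a hypothetical helper with our own data layout, where the rows of V store dual extreme points of $\Pi$) assembles a sample-average affine minorant in the spirit of (2.6) from a stored set of dual vertices:

```python
import numpy as np

def sasa_cut(V, xis, Cs, x_ref):
    """Build an affine minorant (cut) of the sample mean of h_QL(., omega^i),
    in the spirit of (2.6): for each scenario, pick the best stored dual vertex
    at x_ref, then average the resulting affine pieces.  V is an array whose
    rows are dual extreme points of Pi; xis/Cs hold the scenario data.
    Returns (a, beta) with (1/k) sum_i h_QL(x, omega^i) >= a + beta @ x."""
    a, beta = 0.0, np.zeros_like(x_ref)
    k = len(xis)
    for xi, C in zip(xis, Cs):
        pi = V[np.argmax(V @ (xi - C @ x_ref))]  # best stored vertex at x_ref
        a += (pi @ xi) / k                       # constant part of pi'(xi - C x)
        beta -= (C.T @ pi) / k                   # linear part
    return a, beta
```

Each stored vertex is dual feasible, so each affine piece underestimates $h_{QL}(\cdot,\omega^i)$ for every $x$; selecting the best piece at a reference point and averaging yields a cut that is tight near $x_{\mathrm{ref}}$.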
2.1.2 Literature review

There is a vast literature on convergence rate analysis pertaining to SAA and SA methods, and we refer readers to the survey [43] by Homem-de-Mello and Bayraksan for a thorough understanding of convergence analysis and state-of-the-art sampling methods used in stochastic programming. The SAA approach approximates the expected objective by the sample average over finitely many scenarios and then requires a numerical algorithm to solve the resulting finite-dimensional optimization problem. Theoretical studies of Shapiro and Homem-de-Mello [77] and computational studies of Linderoth et al. [51] showed the convergence quality of an SAA solution based on the statistical validation of Mak et al. [52]. With the uniform convergence property of the SAA model, it exhibits exponential convergence to the global optimum for convex and nonconvex problems under certain conditions. With more general sampling, Xu [87] studied the convergence rate of SAA for solving nonsmooth stochastic optimization problems, stochastic Nash equilibrium problems, and stochastic generalized equations. However, for large-scale problems, the SAA approach can be potentially expensive, because a sequence of approximate optimization problems with different sample sizes needs to be explored, with a restart of the numerical algorithm for each sample size.

On the other hand, an SA method is computationally implementable and tractable (see Chung [16]). It has a long history, tracing back to the pioneering work of Robbins and Monro [64]. In the context of a non-smooth optimization problem, SA is a stochastic subgradient method which updates a solution in a direction opposite a subgradient at each iteration (see N. Shor [81]). These algorithms are not only simple, they are also executable on ordinary computers in terms of memory and speed. However, early versions of the SA method required a fair amount of parameter tuning, and even so were not particularly reliable. Modern extensions of SA methods are based on incorporating a "distance generating" function, as in the mirror-descent method (see Nemirovskii et al. [55]). More recently, the work of A. Nemirovski et al. [54] presents a robust SA method which is less sensitive to parameter choices.

For a standard stochastic programming problem, given a sample set of size $N$ corresponding to i.i.d. observations of the random variable, the SAA method provides an $\varepsilon$-optimal solution with probability greater than $1 - O(C_\varepsilon e^{-\beta_\varepsilon N})$, where $\varepsilon > 0$ (see [79]). Moreover, for convex stochastic programming, under the assumption of sharpness of the solution set, the solution provided by SAA converges to an optimal solution with probability $1 - O(C_0 e^{-\beta_0 N})$ (see [76] and [77]). In addition, for convex stochastic programming, the modified robust SA method [54] provides a sequence of solutions whose objective values converge in expectation at rate $O(N^{-1/2})$. With the assumptions of strong convexity and Lipschitz gradient, the classical SA method provides a solution sequence which converges in expectation at $O(N^{-1/2})$, while the objective values of the solution sequence converge in expectation at $O(N^{-1})$.

As for SD, it has been shown that for two-stage stochastic linear programming (SLP), SD converges to an optimal solution with statistically verifiable bounds almost surely (see [36], [37], [72]). Through straightforward analysis, this convergence result can be extended to two-stage SQLP and SQQP problems. However, the lack of study of the asymptotic convergence rate of SD motivates our work in this chapter. Specifically, we analyze the contraction property of the stochastic proximal mapping with constraints and demonstrate that for two-stage SQLP/SQQP problems, SD with a sequence of diminishing step sizes provides a solution sequence which converges to the optimum in expectation at a rate of $O(N^{-1})$. One should note that the convergence rates of these stochastic methods are analyzed on different, non-equivalent scales, such as probability, or expectation of the distance to the optimal solution or to the optimal value. Because convergence in expectation on the scale of solutions implies convergence in expectation of the objective value, the former type of convergence is stronger and more stable than the latter.
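The solution-scale rates just quoted can be observed empirically even in a one-dimensional toy problem. The snippet below (our own illustration, not from the dissertation) solves $\min_x \mathbb{E}[(x-\omega)^2]$ with $\omega \sim N(0,1)$ by SAA, whose minimizer is simply the sample mean while $x^* = 0$, and tabulates $\mathbb{E}\|x_N - x^*\|$, which shrinks like $N^{-1/2}$:

```python
import numpy as np

# Toy check of the O(N^{-1/2}) solution-scale rate: the SAA minimizer of
# min_x E[(x - omega)^2] is the sample mean of N draws, and x* = 0.
rng = np.random.default_rng(1)
reps = 500
for N in (10, 100, 1000, 10000):
    err = np.abs(rng.standard_normal((reps, N)).mean(axis=1)).mean()
    print(f"N={N:6d}  E|x_N - x*| ~ {err:.4f}   N**-0.5 = {N**-0.5:.4f}")
```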
We summarize the currently known results on the convergence rates of the SAA and SA methods in Table 2.1, where $x^*$, $S^*$, $f^*$ denote the optimal solution, the optimal solution set, and the optimal value, respectively, and $\varepsilon$ is a positive constant. We present the results for SD in the last row of Table 2.1.

Table 2.1: Convergence results of the stochastic methods in solving SP problems

  Method          | Structure            | Convergence type                  | Rate
  SAA [79]        | non-convex           | $P(f(x_N) - f^* > \varepsilon)$   | $O(C_\varepsilon e^{-\beta_\varepsilon N})$
  SAA [76]        | convex               | $P(x_N \notin S^*)$               | $O(C_0 e^{-\beta_0 N})$
  robust SA [54]  | convex               | $\mathbb{E}\,|f(x_N) - f^*|$      | $O(N^{-1/2})$
  SA [54]         | strongly convex      | $\mathbb{E}\,\|x_N - x^*\|$       | $O(N^{-1/2})$
                  | + Lipschitz gradient | $\mathbb{E}\,|f(x_N) - f^*|$      | $O(N^{-1})$
  SD              | strongly convex      | $\mathbb{E}\,\|x_N - x^*\|$       | $O(N^{-1})$

2.1.3 Notations

We clarify the notations used in this chapter. Because we are in the convex optimization world, there is no loss of generality in assuming that all subdifferentials coincide; for any convex function $g(x)$, we thus write $\partial g(x)$ to denote the subdifferential of $g$ at $x$. Moreover, let $|a|_+ := \max\{a, 0\}$ for any $a \in \mathbb{R}$. For a positive integer $m$, let $[m]$ denote the set $\{1,\ldots,m\}$. For any vector $d \in \mathbb{R}^m$, let $d_{(i)}$ denote the $i$th component of $d$ for $i \in [m]$. Let $\|\cdot\|$ denote the $\ell_2$ norm of a vector, written without a subscript, since we only consider the $\ell_2$ norm in this chapter. For a square matrix $B$, $\theta_{\min}(B)$ and $\theta_{\max}(B)$ denote the smallest and largest eigenvalues of $B$, respectively. In addition, the projection operator onto a set $X \subseteq \mathbb{R}^n$ is defined by

\[
P_X(x) \,:=\, \operatorname*{argmin}_{z \in X} \; \|z - x\|^2,
\tag{2.9}
\]

and the proximal mapping point (see Rockafellar [67]) with respect to a convex function $g$ is defined by

\[
T^X_{\alpha g}(x) \,:=\, \operatorname*{argmin}_{z \in X} \; \Big\{ \alpha\, g(z) + \tfrac{1}{2}\,\|z - x\|^2 \Big\}.
\tag{2.10}
\]

In this chapter, we may drop the superscript $X$ of the proximal mapping for simplicity in some cases without causing confusion.

The rest of this chapter is organized as follows. In section 2.2, we discuss a standard SAA process and introduce the SD algorithm for SQLP and SQQP with its statistical convergence result. In section 2.3, we present the convergence rate analysis of SD for two-stage SQLP/SQQP problems. In section 2.4, we design the "in-sample" stopping rule to bound the optimality gap using consistent bootstrap estimators.

2.2 Stochastic Decomposition (SD) Algorithm

If we had access to the distribution of the random vectors, we could solve two-stage SQLP and SQQP as deterministic quadratic programming problems. However, in most practical cases, when the scenario space is potentially very large or the random variable has a continuous but unknown distribution, it is impossible to find the exact optimum. Thus, sample-based methods are common ways to estimate the true objective by sample-based approximations. In this section, we first present the Sample Average Approximation (SAA) approach with its convergence result. Then, motivated by SAA, we design the Stochastic Decomposition (SD) algorithms for two-stage SQLP and SQQP problems.

2.2.1 Motivation underlying SD: sample average approximation

An SAA instance with $N$ samples $\{\omega^i\}_{i=1}^N$ of a two-stage SQLP/SQQP problem can be formulated as follows:

\[
\begin{array}{ll}
\min & F_N(x) \,:=\, \tfrac{1}{2}\,x^\top Q x + c^\top x + \dfrac{1}{N} \displaystyle\sum_{i=1}^N h(x,\omega^i) \\[4pt]
\text{s.t.} & x \in X = \{x : Ax \le b\} \subseteq \mathbb{R}^{n_1}.
\end{array}
\tag{2.11}
\]

In (2.11), the SAA function, denoted $F_N(x)$, is defined to be the sum of the quadratic term and the sample average of the recourse functions; the recourse function is $h_{QL}(x,\omega^i)$ or $h_{QQ}(x,\omega^i)$ for two-stage SQLP and SQQP problems, respectively.
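For small scenario counts, one straightforward way to solve an instance of (2.11) is the extensive form, in which a copy of the second-stage variables is created for each scenario. The sketch below is a minimal illustration assuming the cvxpy package; all data are synthetic, and $D = [I, -I]$ again guarantees relatively complete recourse.

```python
import cvxpy as cp
import numpy as np

def solve_saa_sqlp(Q, c, A, b, d, D, scenarios):
    """Solve one SAA instance (2.11) of the two-stage SQLP in extensive form:
    min 0.5 x'Qx + c'x + (1/N) sum_i d'y_i
    s.t. Ax <= b,  D y_i = xi_i - C_i x,  y_i >= 0  for each scenario i."""
    N = len(scenarios)
    x = cp.Variable(Q.shape[0])
    ys = [cp.Variable(D.shape[1], nonneg=True) for _ in range(N)]
    cons = [A @ x <= b]
    cons += [D @ y == xi - C @ x for y, (xi, C) in zip(ys, scenarios)]
    obj = 0.5 * cp.quad_form(x, Q) + c @ x + sum(d @ y for y in ys) / N
    cp.Problem(cp.Minimize(obj), cons).solve()
    return x.value

# Synthetic instance; D = [I, -I] gives relatively complete recourse.
rng = np.random.default_rng(2)
n1, m2 = 2, 3
Q, c = np.eye(n1), np.zeros(n1)
A = np.vstack([np.eye(n1), -np.eye(n1)]); b = np.ones(2 * n1)  # box |x| <= 1
D = np.hstack([np.eye(m2), -np.eye(m2)]); d = np.ones(2 * m2)
scenarios = [(rng.standard_normal(m2), rng.standard_normal((m2, n1)))
             for _ in range(50)]
print(solve_saa_sqlp(Q, c, A, b, d, D, scenarios))
```

The number of second-stage variable copies grows linearly with $N$, which is one source of the computational burden of SAA discussed below.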
We present the SAA process in Table 2.2, following the setup of Sen and Liu [72]. In step 2, the estimated confidence in the statistical validation step follows the work of Mak et al. [52]. Following that, in step 3, a Pessimistic Gap, the worst-case gap between the estimated confidence intervals, evaluates the quality of the sampling-based approximation approach.

Table 2.2: SAA Process

For a fixed number of replications M, do the following.
1. Optimization: For each replication $m \in [M]$ with a sample set of size $N$, create an SAA function $F^m_N(x)$, where the superscript represents the replication number. Solve the SAA instance (2.11) by a numerical optimization algorithm, obtaining an optimal solution $\hat{x}^m_N$ and the corresponding optimal value $\hat{v}^m_N$.
2. Statistical Validation: Estimate the lower-bound confidence interval using the optimal values $\{\hat{v}^m_N\}_{m=1}^M$ and the upper-bound confidence interval using the potential solutions $\{\hat{x}^m_N\}_{m=1}^M$ at a specified level of accuracy.
3. Pessimistic Gap: If the upper end of the estimated upper-bound confidence interval is close to the lower end of the estimated lower-bound confidence interval at an acceptable level of accuracy, then stop. Else, increase the sample size $N$ and repeat from Step 1.

Since the solution $\hat{x}_N$ of the SAA instance (2.11) is a random variable with respect to the samples $\{\omega^i\}_{i=1}^N$, from the theoretical viewpoint we consider the probability of $\hat{x}_N$ being in the $\varepsilon$-optimal solution set, in terms of the number of samples $N$. Under some assumptions, the following is an extended result on the convergence rate of SAA by A. Shapiro et al. in [76] and [77].

Proposition 2.1 (Convergence rate of SAA). For the SQLP and SQQP problems (2.1) and (2.3), besides the assumptions (A1)-(A3), assume also that the optimal solution set $S$ is nonempty. Let $S^\varepsilon$ denote the set of $\varepsilon$-optimal solutions of the problem considered. Then for each problem (SQLP/SQQP) and any $\varepsilon > 0$, there exist constants $C_\varepsilon > 0$ and $\beta_\varepsilon > 0$ such that the inequality

\[
1 - P(\hat{x}_N \in S^\varepsilon) \;\le\; C_\varepsilon\, e^{-\beta_\varepsilon N}
\tag{2.12}
\]

holds for all $N \in \mathbb{N}$. Moreover, if the optimal solution set $S$ is sharp, then we have

\[
1 - P(\hat{x}_N \in S) \;\le\; C_0\, e^{-\beta_0 N}.
\tag{2.13}
\]

The probability here is with respect to the $N$ i.i.d. samples, and the sharpness of the solution set is defined in Definition 5.22 of [76].
For two-stage stochastic linear programming (SLP) problems, Higle and Sen in [36] designed the Stochastic Decomposition (SD) algorithm to accommodate the structure of SLP. In SD, the objective function is approximated by a bundle of stochastic minorants using approximate sub- gradients of the sample average of recourse functions. In order to make use of the minorants during the process for subsequent runs, a unified lower bound of recourse functions is required for all x2X and almost every !2⌦. Without loss of generality, we borrow the non-negativity assumption for the recourse functions which was originally made in [36]. 16 (A4) The recourse function h(x,!) is non-negative for all x2X and almost every !2⌦. Because the context (SQLP or SQQP) should be clear from the problem statement, we use one notation h for the recourse function in the assumption (A4) without the subscripts. Incidentally, it is interesting to note that the assumption of non-negative recourse functions holds for most stochastic optimization models in machine learning since loss functions are usually non-negative. We present the SD algorithm for two-stage SQLP problems in Algorithm 1 with minor mod- ifications of its predecessor, the SD algorithm for two-stage SLP problems. The idea is that we successively create a sequence of value function approximations{f k }, each of which is a piecewise lower bound of the sample average function. The earliest version of the SD algorithm in [36] obtains the solution sequence by minimizing the value function approximations{f k } successively. Subsequently, theproximitycontrolwasaddedwiththestepsizeasaregularizationtermbyHigle and Sen in [37]. Its interpretation as a stochastic proximal mapping, similar to its deterministic version (Boyd and Parikh [60]) leads to a more succinct algorithmic statement. Therefore, at iteration k, the updated solution x k is computed as a proximal mapping point T ↵ k 1 f k 1 (ˆ x k 1 ) where ↵ k 1 is the step size, f k 1 is a value function approximation and ˆ x k 1 is an incumbent solution all of which are constructed in the previous iteration. It is worth noticing that one might view SA as a stochastic forward Euler method, whereas such the stochastic proximal algorithm can be viewed as a stochastic backward Euler method. However, it is di↵erent from the stochastic proximal iteration (SPI) algorithm proposed by Ryu and Boyd in [70] because the value function approximation is created using the sample information of the entire history. With the newly updated solution x k , we now explain the construction of the value function approximation f k in detail. We first generate a new sample ! k independently with samples {! i } k 1 i=1 in the history. Because of the fixed recourse assumption, all realizations ! 2⌦ share the same collection of dual extreme points. We solve (2.5) at (x k ,! k ) and include any new dual optimal vector supporting h(·,! k ) at x k in the set V k ✓ ⇧, so the extreme points will be 17 discovered during the course of SD. We then create two minorants of the SAA function which are defined in line 8 in Algorithm 1. In the notations of these two minorants, h k k (x) and h k k (x), the superscipts refer to the iteration number and the subscripts refer to the index number of the minorants. Notice that these two minorants are constructed using the true subgradient for the sample ! k and the subgradient approximations for the samples {! i } k 1 i=1 by restricting the dual variablesinthesecond-stageprograminthesetV k . 
Moreover, wekeepthosepreviouslygenerated minorants which have nonzero Lagrangian multipliers at T ↵ k 1 f k 1 (ˆ x k 1 ) and reweigh them such that they are the lower bounds of the SAA function F k as well. From the optimality condition of T ↵ k 1 f k 1 (ˆ x k 1 ), Higle and Sen in [37] have shown that at each iteration the number of nonzero Lagrangian multipliers of the minorants is no greater thann 1 +1wheren 1 is the dimension of the first-stage decision variable. Therefore, with two newly constructed minorants, the total number of minorants kept in the index setJ k is no greater thann 1 +3. These minorants are called sample average subgradient approximation (SASA) functions in SQLP problems. Consequently, a value function approximation f k is created to be the sum of the quadratic function in the first stage and the maximum of the SASA functions recorded in J k which is formally defined at line 9 in Algorithm 1. Besides the construction of value function approximations in the SD algorithm, there are three more improvements on eciency compared to the SAA method. First, to achieve the estimated improvements in objective, we introduce a competition between the current incumbent solution ˆ x k 1 and the newly updated solution x k , which is called a candidate solution. The ratio between the reductions f k (x k )f k (ˆ x k ) and f k 1 (x k )f k 1 (ˆ x k ) decides whether we should accept the candidate solution, i.e., ˆ x k = x k or proceed to the next iteration with the unchanged incumbent solution, i.e. ˆ x k =ˆ x k 1 . Note that the estimates f k (x k ) and f k (ˆ x k ) are positively correlated, so as in the simulation-optimization literature, this correlation reduces variance in estimation of the di↵erence F k (x k )F k (ˆ x k ). Moreover, this update rule is also closely related to the traditional update rule in trust region methods (see Conn et al. [18]) except that the functions used in 18 measuring reductions are value function approximations instead of the true function values. Such interplay between optimization and statistical features makes SD unique in its design. Another improvement is that instead of having a fixed sample size N as in SAA, an in-sample stoppingrulewithbootstrapestimatorsispresentedinsection5sothatthesamplingprocessstops automatically during the optimization procedure. Finally, we mention that SD also facilitates an additional step of variance reduction via replications. This replication mechanism, which was presented in Sen and Liu in [72], recommends optimizing a “grand mean” value function based on the terminal value functions from each replication. Since the results of this chapter are restricted to only one replication, we refer the readers to [72] for further details. Algorithm 1 SD-QL Require: ⌧> 0,↵ 0 =⌧, r 2(0,1),k=0,V 0 =; and J 0 ={0}. a feasible solution ˆ x 0 2X. h j,0 (x) = 0,8x2X and8j2J 0 . f 0 (x)= 1 2 x > Qx+c > x+max{h j,0 (x),j2J 0 },8x2X. Ensure: 1: while the stopping rule (see section 5) is not satisfied do 2: k k+1 3: computex k =T X ↵ k 1 f k 1 (ˆ x k 1 )followingthedefinitionin(2.10)andrecordtheLagrangian multipliers {µ k 1 j } j2J k 1 of {h k 1 j (x)} j2J k 1 respectively 4: compute e ⇡ k = argmax ⇡ 2 ⇧ ⇡ > (⇠ (! k )C(! k )x k ) with a new sample ! k 5: update V k =V k 1 S {e ⇡ k } and J k =J k 1 S {k,k} ✏ {j2J k 1 :µ k 1 j =0} 6: compute ⇡ k,i 2argmax ⇡ 2 V k ⇡ > (⇠ (! i )C(! i )x k ) for i2[k] 7: compute ˆ ⇡ k,i 2argmax ⇡ 2 V k ⇡ > (⇠ (! i )C(! i )ˆ x k 1 ) for i2[k] 8: construct h k k (x)= 1 k P k i=1 ⇡ > k,i (⇠ (! i )C(! 
i )x), h k k (x)= 1 k P k i=1 ˆ ⇡ > k,i (⇠ (! i )C(! i )x), h k j (x)= |j| k h |j| j (x) for j2J k \{k,k} 9: construct f k (x)= 1 2 x T Qx+c T x+max{h k j (x),j2J k } 10: if f k (x k )f k (ˆ x k 1 ) r(f k 1 (x k )f k 1 (ˆ x k 1 )) then 11: ˆ x k =x k , ↵ k = ⌧ k+1 12: else 13: ˆ x k =ˆ x k 1 , ↵ k = ↵ k 1 14: end if 15: end while As for two-stage SQQP problems, if the matrix P is positive semidefinite, the SD algorithm for SQLP remains valid since SASA functions can be constructed by solving the second-stage 19 quadratic programming (2.4) with the ecent solvers. However, if the matrix P is positive definite, such curvature of the second-stage problem could be incorporated into the SD algorithm forbetterperformance. WhenP ispositivedefinite,thedualform(2.8)isanon-negativequadratic programming (NNQP) problem and it reveals that the recourse function h QQ is a piecewise quadratic function with respect tox. The methods to solve the dual problem (2.8) include active- set methods, iterative algorithms and also some state-of-art algorithms with faster convergence. We refer the interested readers to [10], [12], [48] and [73] for details. Given the wide variety of ecient algorithms currently available, we assume that such a method can be used for solving the NNQPproblem(2.8). Therefore,wedesignamethodoftrackingthedualfacesofthesecond-stage problem in SQQP, instead of the dual extreme points in SQLP. Specifically, in the SD algorithm for SQQP, we construct a set U k to record the index sets {˜ u t } k t=1 of zero dual variables in the dual problem (2.8) discovered in k iterations. Hence the sample average quadratic approximation (SAQA) functions are constructed to approximate the sample average of the recourse functions. Then the value function approximation is the sum of the quadratic function in the first stage and the maximum of the SAQA functions recorded in J k . With such modifications, we present the SD algorithm for two-stage SQQP problem in Algorithm 2. Because of the tremendous recent growth in the application of non-smooth optimization in the area of SP as well as ML, we should mention connections between SD and several methods in SP and ML. We begin by observing that SD can be seen as one of the inexact bundle methods (see Oliveira et al. in [57]) in which SD at each iteration utilizes an exact subgradient of a random sample function and inexact subgradient approximations of all other sample functions in the construction of value function approximations. Similarly, the proximal Stochastic Variance Reduction Gradient (SVRG) algorithm developed by Xiao and Zhang in [86] is also one of the inexact gradient methods for solving the ERM problems. Thus, both SD and SVRG share a similar idea which is the inclusion of the exact and also the inexact gradients/subgradients for all the samples in history in order to reduce the variance along the stochastic path with the 20 Algorithm 2 SD-QQ Require: ⌧> 0,↵ 0 =⌧, r 2(0,1),k=0,U 0 =; and J 0 ={0}. a feasible solution ˆ x 0 2X. h j,0 (x) = 0,8x2X and8j2J 0 . f 0 (x)= 1 2 x > Qx+c > x+max{h j,0 (x),j2J 0 },8x2X. Ensure: 1: while the stopping rule (see section 5) is not satisfied do 2: k k+1 3: computex k =T ↵ k 1 f k 1 (ˆ x k 1 )followingthedefinitionin(2.10)andrecordtheLagrangian multipliers {µ k 1 j } j2J k 1 of {h k 1 j (x)} j2J k 1 respectively 4: compute dual variables (s k ,t k ) in the second stage by solving (2.7) at (x k ,! k ). 
5: update ˜ u k ={i :s k (l) =0,1 l n 2 }, U k =U k 1 S {˜ u k }, J k =J k 1 S {k,k} ✏ {j2J k 1 :µ k 1 j =0} 6: compute (t k,i ,s k,i ) 2 argmax {(t,s,u):u2 U k ,s 0,s (l) =0,8 l2 u} g QQ (t,s;x k ,! i ) for i2[k] 7: compute ( ˆ t k,i ,ˆ s k,i )2 argmax {(t,s,u):u2 U k ,s 0,s (l) =0,8 l2 u} g QQ (t,s;ˆ x k 1 ,! i ) for i2[k] 8: construct h k k (x)= 1 k P k i=1 g QQ (t k,i ,s k,i ;x,! i ), h k k (x)= 1 k P k i=1 g QQ ( ˆ t k,i ,ˆ s k,i ;x,! i ), h k j (x)= |j| k h |j| j (x) for j2J k \{k,k} 9: construct f k (x)= 1 2 x T Qx+c T x+max{h k j (x),j2J k } 10: if f k (x k )f k (ˆ x k 1 ) r(f k 1 (x k )f k 1 (ˆ x k 1 )) then 11: ˆ x k =x k , ↵ k = ⌧ k+1 12: else 13: ˆ x k =ˆ x k 1 ,↵ k = ↵ k 1 14: end if 15: end while 21 small computational cost. However, because of the structural di↵erences between SQLP/SQQP problems and ERM problems, we are able to develop subgradient approximations in the SD algorithm, which are much di↵erent from the inexact oracles in proximal SVRG. Moreover, in ERM problems the performance is measured by empirical risk, while in SQLP/SQQP problems one is measured by the expectation for the generalizability. HigleandSenin[36]and[37]studiedtheconvergencepropertiesofSDinsolvingtwo-stageSLP problems, which can be viewed as a special version of two-stage SQLP when Q = 0. Thus the SD algorithm for the SLP problem is almost the same as Algorithm 1 except that the value function approximations do not include the quadratic term. For the sequence of solutions generated by the SD algorithm of two-stage SLP problems, we summarize the convergence result (see Higle & Sen [36]) in the following lemma. Lemma 2.1. Suppose the assumptions (A2)(A4) hold for SQLP problem (2.1) with Q = 0. Let {f k (x)},{x k },{ˆ x k } respectively be the sequence of value function approximations, candidate solutions and incumbent solutions, generated by the SD algorithm with the diminishing step size in two-stage SLP. Let K denote the set of iteration numbers where the incumbent solution updates. (a) If there exists x ⇤ 2X such that {ˆ x k } k2 K ! x ⇤ ,then {f k (ˆ x k )} k2 K ! f(x ⇤ ) with proba- bility 1. (b) With probability 1, there exists a subsequence of iterations, indexed by a set K ⇤,such that lim k2 K⇤ f k 1 (x k )f k 1 (ˆ x k 1 )+ 1 2 x k ˆ x k 1 2 =0. Moreover, every accumulation point of {ˆ x k } k2 K⇤ is an optimal solution of the two-stage SLP problem. ⇤ Proof. (a) See Theorem 2 by Higle and Sen in [36] . (b) See Theorem 5 by Higle and Sen in [37]. 22 Since in two-stage SQLP and SQQP, there is a finite number of extreme points or faces in the second-stage problem, following the proofs in [36] and [37], Lemma 2.1 can be easily shown to hold for two-stage SQLP and SQQP with the assumption (A1). Accordingly, the entire sequence convergence result shown by Sen and Liu in [72] based on Lemma 2.1 could then be extended to the two-stage SQLP and SQQP as well. We present the convergence result in the following proposition showing that the entire sequence of incumbent solutions converges to an optimal solution with probability 1. Proposition2.2 (ConvergenceresultofSD). Supposetheassumptions(A1)(A4)hold. Thenin two-stage SQLP/SQQP, the sequence of incumbent solutions{ˆ x k } generated by the SD algorithm with the diminishing step size converges to the unique optimal solution x ⇤ with probability one. ⇤ Proof. See Theorem 1 by Sen and Liu in [72]. 
2.3 Convergence Rate of SD From Proposition 2.2 , the SD algorithm is able to produce the sequence of incumbent solutions convergingtotheuniqueoptimalsolutionwithprobabilityone. Thereupontheconvergencerateis the next issue of focus. In SD, the vital step dominating its convergence is the proximal mapping of the value function approximation. In this section by analyzing the contraction property of the stochastic proximal mapping, we derive the sublinear convergence rate of SD for two-stage SQLP and SQQP problems. It will become clear subsequently that the two classes of problems share the same convergence analysis,thereforeinthefollowinganalysisweunifythenotation(x ⇤ , ⇤ )tobetheoptimalsolution pair of SQLP or SQQP, F k to be the SAA function defined in (2.11) with the sample set of size k without identifying its specific structures, i.e. the recourse functions h QL or h QQ . Similarly, let f k represent the value function approximation created at iteration k with the definition at 23 line 9 in Algorithm 1 or line 9 in Algorithm 2 without recognizing its specific structures, i.e., the SASA or the SAQA functions. Moreover, since we consider the convergence result of the sequence of incumbent solutions, without loss of generality we filter out all candidate solutions which are rejected as incumbent solutions. Hence for the sake of propositions and theorems in this section, we simplify the notation by denoting the sequence of incumbent solutions using {x k }. In studying the convergence rate of a randomized algorithm such as SD, it is customary to treat x k+1 as a random variable which is governed by the randomness of Monte Carlo sampling {! i } k i=1 generated in the entire history of the algorithm. Let{F k } k2 N+ denote the filtration with F k = {e ! 1 ,...,e ! k }. For any k 2N + ,let ˆ E k denotes the expectation taken with respect to product of the probability measures of {e ! i } k i=1 in the sense that ˆ E k [·]= E [ ·|F k ]. Therefore, the idea is to analyze ˆ E k x k+1 x ⇤ in terms of the number of iterations. Besides assumptions (A1)(A4), we make the following assumptions for the convergence rate analysis of the SD algorithm in this section. (B1) Linear independence constraint qualification (LICQ) holds at the optimum x ⇤ for the feasible solution set X ={x :Ax b}. (B2) There exists a nonnegative-valued measurable function L(!) : ⌦ ! [0,1) such that E[L(!)]<1 and |h(x,!)h(x 0 ,!)| L(!) xx 0 , for all x,x 0 2X and almost every !2⌦. (B3) There exists a neighborhood B(x ⇤ , )with> 0 such that recourse function h(x,!)is di↵erentiable for all x2X\ B(x ⇤ , ) and almost every !2⌦. (B4) The unique optimumx ⇤ is sharp, i.e., there exists a constant ⇢ such thatf(x)f(x ⇤ )+ ⇢ kxx ⇤ k for any x2X. (B5) Strict complementarity holds at x ⇤ , which means b r a > r x ⇤ + ⇤ r > 0 for all r 2[m] where a > r is rth row vector of the matrix A and b r is rth component of the vector b. At this point it is appropriate to comment about the above assumptions. The LICQ assump- tion in (B1) is defined such that the active constraints at x ⇤ are linearly independent. This 24 local condition at x ⇤ will be used for analyzing the convergence rate, wheareas in nonlinear pro- gramming, LICQ is intended to manage diculties which might arise due to the linearization of nonlinear constraints. Assumption (B2) actually can be directly derived from the assumption (A3) that the recourse function is fixed and relatively complete. 
However, due to the usage of the Lipchitz constant in the convergence rate analysis, we list it as an assumption here. Moreover, it is worth noticing that a sucient condition for the local Lipschitz continuity of the optimal value function was presented in Lemma 5.1 in [62]. Assumption (B3) can be derived from the almost surely di↵erentiability of h(x,!) at x ⇤ with an additional condition that the radius of the neigh- borhoodscenteredatx ⇤ withinwhichh(x,!)isdi↵erentiablehaveauniformpositivelowerbound for almost every !2⌦. A stronger version of the di↵erentiability is assumed in Assumption 3 in [69] in which the recourse function is di↵erentiable for all x2X and almost every !2⌦. The assumption (B4) is critical to ensure the finite exponential convergence rate of the SAA method according to Proposition 2.1. Moreover, this assumption is equivalent to f 0 (x ⇤ ,q)rkqk for any q2T X (x ⇤ ), where T X (x ⇤ ) denotes the tangent cone of set X at point x ⇤ (see (5.135) in [76]). Hence as a local condition, it suces to impose the assumption ( B4) in a neighborhood of x ⇤ . Finally, the assumption (B5) of strict complementarity is needed for showing the stability of the active constraints of the stochastic proximal mappings even for the case of linearly constrained problems. Recall that ˆ E k denotes the expectation taken with respect to product of the probability mea- sures of {! i } k i=1 . By using the triangle inequality, we are able to separate the distance between the current incumbent solution ˆ E k x k+1 x ⇤ into three terms. ˆ E k x k+1 x ⇤ ˆ E k T X ↵ k f k (x k )T X ↵ k F k (x k ) + ˆ E k T X ↵ k F k (x k )T X ↵ k F k (x ⇤ ) + ˆ E k T X ↵ k F k (x ⇤ )x ⇤ . (2.14) 25 Ontherighthandsideof (2.14),thefirsttermisthedistanceoftheproximalmappingpointofthe value function approximation and the SAA function. We will analyze the two cases in Proposition 2.3 such that this distance is bounded in O(↵ k ) and O(↵ 2 k ) respectively. For the second term, we will show the contraction property in Proposition 2.4 for the stochastic proximal mapping of the SAA function with constraints. With the quadratic programming in the first stage and the sequential sampling process in SD, this contraction property can be seen as an improvement of the well-known non-expansive property of deterministic proximal methods. Moreover, the third term will be shown to exponentially converge to zero in Proposition 2.5 from the exponential con- vergence rate of the SAA method. By combining these three propositions together with a lemma on limiting properties, in Theorem 2.1 we derive a recursion showing that the distance between the incumbent solution and the optimal solution is diminishing in the sublinear convergence rate Lemma 2.2. Suppose assumptions (A1)(A4) and (B3) hold for the two-stage SQLP (2.1) or SQQP (2.3) considered. There exists a finite number ˆ k 1 , such that for any k ˆ k 1 , points T X ↵ k f k (x k ) and x k are on the same piece of f k with probability 1. ⇤ Proof. According to the algorithm, for any k2Z + , V k ✓ V k+1 ✓ V with probability 1. Moreover the set V has only finitely many elements, so by the Monotone Convergence Theorem, we have lim k!1 V k = ¯ V ✓ V with probability 1 (see Lemma 1 in [36]). Similarly, for the two-stage SQQP, we have lim k!1 U k = ¯ U ✓ U with probability 1 where U is the set of the indexes of all the dual faces in the second stage program of SQQP. Then there exist finite numbers k 0 1 and k 0 2 such that when k max{k 0 1 ,k 0 2 },wehave V k = ¯ V and U k = ¯ U. 
This means that after finitely many iterations, $V_k$ stabilizes to include all the necessary extreme points ($\bar{V}$) with probability 1; similarly, after finitely many iterations, $U_k$ stabilizes to include all the necessary indexes of zero dual variables ($\bar{U}$) with probability 1. Therefore the sample average of the subgradient approximation $h_k^k(x^k)$ equals the sample average of the recourse function at $x^k$, i.e., $h_k^k(x^k) = \frac{1}{k}\sum_{i=1}^k h(x^k,\omega^i)$ with probability 1. It follows that $f_k^k(x^k) = F_k(x^k)$ for both SQLP and SQQP problems.

From Proposition 2.2 and assumption (B3), there exists a finite number $k_1$ such that for any $k \ge k_1$, $h(x,\omega)$ is differentiable with respect to $x$ at $x^k$ for almost every $\omega\in\Omega$. Hence, when $k \ge \max\{k_1, k'_1, k'_2\}$, both $F_k$ and $f_k$ are differentiable at $x^k$ with the same gradient, i.e., $\nabla f_k(x)\big|_{x=x^k} = \nabla F_k(x)\big|_{x=x^k}$. Moreover,

$$\tfrac{1}{2}\big\|T_X^{\alpha_k f_k}(x^k) - x^k\big\|^2 \le \alpha_k\Big(f_k(x^k) - f_k\big(T_X^{\alpha_k f_k}(x^k)\big)\Big) \le C\alpha_k$$

for some constant $C\in\mathbb{R}_+$, so $\lim_{k\to\infty}\|T_X^{\alpha_k f_k}(x^k)-x^k\| = 0$ with probability 1. Hence there exists a finite number $\hat{k}_1$ such that for any $k\ge \hat{k}_1$, the points $T_X^{\alpha_k f_k}(x^k)$ and $x^k$ lie on the same linear or quadratic piece of $f_k$ with probability 1, and $f_k$ is differentiable at both $x^k$ and $T_X^{\alpha_k f_k}(x^k)$ with probability 1.

Proposition 2.3. Suppose assumptions (A1)-(A4), (B2) and (B3) hold for the two-stage SQLP (2.1) or SQQP (2.3) considered. Then there exist finite numbers $k_a \le k_b$ such that at iteration $k$ the proximal mapping over the feasible solution set $X$ satisfies:

(a) $\hat{E}_k\big\|T_X^{\alpha_k f_k}(x^k) - T_X^{\alpha_k F_k}(x^k)\big\| \le 4M_0\alpha_k$, for any $k\ge k_a$;

(b) $\hat{E}_k\big\|T_X^{\alpha_k f_k}(x^k) - T_X^{\alpha_k F_k}(x^k)\big\| \le 4M_0 M_1\alpha_k^2$, for any $k\ge k_b$,

where $M_0 \triangleq \max_{x\in X}\|c+Qx\| + E[L(\tilde\omega)]$ and $M_1$ is a positive constant independent of $k$. □

Proof. First we note that the proximal mapping can be represented as an implicit projection step, i.e., $T_X^{\alpha_k f_k}(x^k) = P_X\big(x^k - \alpha_k\,\xi_{f_k}(T_X^{\alpha_k f_k}x^k)\big)$, where $P_X(\cdot)$ is the projection mapping defined in (2.9) and $\xi_{f_k}(T_X^{\alpha_k f_k}x^k)$ denotes the subgradient of $f_k$ appearing in the optimality conditions at the proximal mapping point $T_X^{\alpha_k f_k}x^k$. Similarly, let $\xi_{F_k}(T_X^{\alpha_k F_k}x^k)$ denote the subgradient of $F_k$ in the optimality conditions at the proximal mapping point $T_X^{\alpha_k F_k}x^k$. By the reasoning of Lemma 2.2, there exists a finite number $k_a$ such that when $k \ge k_a$, the sets $V_k$ and $U_k$ include all the dual extreme points and faces, respectively, of the second-stage dual problems of SQLP and SQQP. Therefore, for $k\ge k_a$, $F_k$ and $f_k$ have the same sample average value at $x^k$ with at least one common subgradient; let $\xi_{F_k}(x^k) = \xi_{f_k}(x^k)$ be one such common subgradient. With the implicit projection representations, we derive

$$\begin{aligned} \big\|T_X^{\alpha_k f_k}(x^k) - T_X^{\alpha_k F_k}(x^k)\big\| &\le \big\|P_X\big(x^k-\alpha_k\xi_{f_k}(T_X^{\alpha_k f_k}(x^k))\big) - P_X\big(x^k-\alpha_k\xi_{f_k}(x^k)\big)\big\| \\ &\quad + \big\|P_X\big(x^k-\alpha_k\xi_{F_k}(x^k)\big) - P_X\big(x^k-\alpha_k\xi_{F_k}(T_X^{\alpha_k F_k}(x^k))\big)\big\| \\ &\le \alpha_k\big\|\xi_{f_k}(T_X^{\alpha_k f_k}(x^k)) - \xi_{f_k}(x^k)\big\| + \alpha_k\big\|\xi_{F_k}(x^k) - \xi_{F_k}(T_X^{\alpha_k F_k}(x^k))\big\|. \end{aligned} \quad (2.15)$$

From assumption (B2), the recourse function $h(\cdot,\omega)$ is Lipschitz continuous with constant $L(\omega)$ with probability 1, and $X$ is a compact set by assumption (A2). Hence the subgradients of the value function approximation are bounded uniformly for all $x\in X$, i.e., $\|\xi_{f_k}(x)\| \le \max_{x\in X}\|c+Qx\| + \frac{1}{k}\sum_{i=1}^k L(\omega^i)$. Consequently, the expected subgradient with respect to the product of the probability measures of the samples up to iteration $k$ is bounded by $\max_{x\in X}\|c+Qx\| + E[L(\tilde\omega)]$.
The same bound also holds for the sample average function $F_k$. Let $M_0 \triangleq \max_{x\in X}\|c+Qx\| + E[L(\tilde\omega)]$. From (2.15) we obtain statement (a): for any $k\ge k_a$,

$$\hat{E}_k\big\|T_X^{\alpha_k f_k}(x^k) - T_X^{\alpha_k F_k}(x^k)\big\| \le 4M_0\alpha_k.$$

Moreover, from Lemma 2.2 there exists a finite number $k_b$ such that for any $k\ge k_b$, $T_X^{\alpha_k f_k}(x^k)$ and $x^k$ lie on the same linear/quadratic piece of the piecewise function $f_k$ with probability 1. Therefore there exists a constant $M_1$ such that

$$\hat{E}_k\big\|\xi_{f_k}(T_X^{\alpha_k f_k}(x^k)) - \xi_{f_k}(x^k)\big\| \le M_1\,\hat{E}_k\big\|T_X^{\alpha_k f_k}(x^k) - x^k\big\|. \quad (2.16)$$

The argument of Lemma 2.2 also holds for $F_k$ by almost the same analysis; without loss of generality, with the same constant $M_1$,

$$\hat{E}_k\big\|\xi_{F_k}(T_X^{\alpha_k F_k}(x^k)) - \xi_{F_k}(x^k)\big\| \le M_1\,\hat{E}_k\big\|T_X^{\alpha_k F_k}(x^k) - x^k\big\|. \quad (2.17)$$

Then, from the optimality condition of $T_X^{\alpha_k f_k}(x^k)$ and the boundedness of the subgradients of $f_k$,

$$\tfrac{1}{2}\big\|T_X^{\alpha_k f_k}(x^k) - x^k\big\|^2 \le \alpha_k\Big(f_k(x^k) - f_k\big(T_X^{\alpha_k f_k}(x^k)\big)\Big) \le \alpha_k\Big(\max_{x\in X}\|c+Qx\| + \tfrac{1}{k}\textstyle\sum_{i=1}^k L(\omega^i)\Big)\big\|T_X^{\alpha_k f_k}(x^k) - x^k\big\|.$$

Since $\|T_X^{\alpha_k f_k}(x^k)-x^k\| \ne 0$ with probability 1, dividing through by it and taking expectations on both sides yields $\hat{E}_k\|T_X^{\alpha_k f_k}(x^k)-x^k\| \le 2M_0\alpha_k$. Similarly, for the SAA function $F_k$, we have $\hat{E}_k\|T_X^{\alpha_k F_k}(x^k)-x^k\| \le 2M_0\alpha_k$. Combining these two inequalities with (2.15), (2.16), and (2.17) yields statement (b), i.e., $\hat{E}_k\|T_X^{\alpha_k f_k}(x^k) - T_X^{\alpha_k F_k}(x^k)\| \le 4M_1 M_0\alpha_k^2$.

It is known that the proximal mapping obeys only a non-expansive property in general (see [60] and [67]). For the second term of the triangle inequality (2.14), however, we will improve this result by showing that the proximal mapping of the SAA function with constraints has a contraction property. A similar contraction property of the stochastic proximal iteration has been analyzed by Ryu and Boyd [70] under an assumption of M-restricted strong convexity and without constraints. To proceed, we first show that, under suitable assumptions, the active constraints of the proximal mapping $T_X^{\alpha F_k}(x^k)$ stabilize after finitely many iterations with probability 1.

Lemma 2.3. Suppose assumptions (A1)-(A4), (B1), (B3) and (B5) hold for the two-stage SQLP (2.1) or SQQP (2.3). Let $I_k \triangleq \{r : a_r^\top T_X^{\alpha F_k}(x^k) = b_r\}$ and $I^* \triangleq \{r : a_r^\top x^* = b_r\}$. Then there exists a finite number $\hat{k}_2$ such that for any $k\ge\hat{k}_2$, $I_k = I^*$ with probability 1. □

Proof. See Appendix A.

Let $\bar{X} \triangleq \{x : a_r^\top x = b_r,\ r\in I^*\}$ be the feasible solution set constructed from the active constraints at $x^*$. From Lemma 2.3, $T_X^{\alpha F_k}(x^k) = T_{\bar{X}}^{\alpha F_k}(x^k)$ when $k\ge\hat{k}_2$. Therefore, to bound the second term in (2.14) for the asymptotic convergence rate, we need only consider $T_{\bar{X}}^{\alpha F_k}(x^k)$, restricting the feasible solution set to $\bar{X}$. In doing so, we first prove that the proximal mapping of a positive definite quadratic function obeys a contraction property. Following that, we prove that the proximal mapping of the sum of a quadratic function and a convex random function obeys the contraction property as well. Interpreting the sample average function as the sum of a quadratic function and a sample average of convex functions then gives the contraction property for the second term of the triangle inequality (2.14). For simplicity, we omit the superscript $\bar{X}$ in the proximal mapping notation in the following two lemmas and instead state that the proximal mapping is taken over the feasible solution set $\bar{X}$.
Lemma 2.4. Suppose $\check{g}(z) = \tfrac{1}{2}z^\top Qz$, where $Q$ is a positive definite matrix, and let $\bar{A}$ be a matrix with linearly independent rows. Then the proximal mapping $T^{\alpha\check{g}}$ over the feasible solution set $\bar{X} = \{x : \bar{A}x = \bar{b}\}$ satisfies

$$T^{\alpha\check{g}}x - T^{\alpha\check{g}}x' = K_\alpha(x - x'), \qquad \forall\, x, x' \in \bar{X}, \quad (2.18)$$

where $G_\alpha = (I+\alpha Q)^{-1}$ and $K_\alpha = G_\alpha - G_\alpha\bar{A}^\top(\bar{A}G_\alpha\bar{A}^\top)^{-1}\bar{A}G_\alpha$. Moreover,

$$\|T^{\alpha\check{g}}x - T^{\alpha\check{g}}x'\| \le \theta_{\max}(G_\alpha)\cdot\|x-x'\|, \quad (2.19)$$

where $\theta_{\max}(\cdot)$ denotes the largest eigenvalue of the matrix. □

Proof. See Appendix A.

With Lemma 2.4 we can then prove that the proximal mapping of the sum of a quadratic function and a random convex function also obeys a contraction property with the same contraction factor.

Lemma 2.5. Suppose $\check{g}(z) = \tfrac{1}{2}z^\top Qz$, where $Q$ is a positive definite matrix, $r(z)$ is a random convex function, and $\bar{A}$ is a matrix with linearly independent rows. Let $g(z) = \check{g}(z) + r(z)$. Then with probability 1 the proximal mapping $T^{\alpha g}$ over the feasible solution set $\bar{X} = \{x : \bar{A}x = \bar{b}\}$ satisfies

$$\|T^{\alpha g}x - T^{\alpha g}x'\| \le \theta_{\max}(G_\alpha)\cdot\|x - x'\|, \qquad \forall\, x, x' \in \bar{X}, \quad (2.20)$$

where $G_\alpha = (I+\alpha Q)^{-1}$ and $\theta_{\max}(\cdot)$ denotes the largest eigenvalue of the matrix. □

Proof. Assume $x\ne x'$ and $T^{\alpha g}x \ne T^{\alpha g}x'$; otherwise there is nothing to show. Let $\mu_x \in \partial r(T^{\alpha g}x)$ be a subgradient for which the following optimality conditions hold for the solution pair $(T^{\alpha g}x, \lambda_x)$:

$$\begin{pmatrix} I+\alpha Q & \bar{A}^\top \\ \bar{A} & 0 \end{pmatrix}\begin{pmatrix} T^{\alpha g}x \\ \lambda_x \end{pmatrix} = \begin{pmatrix} x - \alpha\mu_x \\ \bar{b} \end{pmatrix}. \quad (2.21)$$

Similarly, let $\mu_{x'} \in \partial r(T^{\alpha g}x')$ be such that the optimality conditions hold at $T^{\alpha g}x'$. Then let $\tilde{r}_\epsilon$ be a quadratic function, for some $\epsilon>0$, defined by

$$\tilde{r}_\epsilon(z) \triangleq \mu_x^\top z + \tfrac{1}{2}(z - T^{\alpha g}x)^\top \frac{v_\epsilon v_\epsilon^\top}{a_\epsilon}(z - T^{\alpha g}x),$$

where $v_\epsilon \triangleq -\mu_x + \mu_{x'} + \epsilon(T^{\alpha g}x' - T^{\alpha g}x)$ and $a_\epsilon \triangleq v_\epsilon^\top(T^{\alpha g}x' - T^{\alpha g}x)$. In this construction, $v_\epsilon \in -\partial r(T^{\alpha g}x) + \partial r(T^{\alpha g}x') + \epsilon(T^{\alpha g}x' - T^{\alpha g}x)$. Moreover, since $\partial r$ is a monotone operator, we derive

$$a_\epsilon = v_\epsilon^\top(T^{\alpha g}x' - T^{\alpha g}x) \ge \epsilon\,\|T^{\alpha g}x' - T^{\alpha g}x\|^2 > 0.$$

Thus by design $\tilde{r}_\epsilon$ is a convex quadratic function determined by $x$ and $x'$. In addition, simple calculations show that

$$\nabla\tilde{r}_\epsilon(T^{\alpha g}x) = \mu_x, \quad (2.22)$$
$$\nabla\tilde{r}_\epsilon(T^{\alpha g}x') = \mu_{x'} + \epsilon(T^{\alpha g}x' - T^{\alpha g}x), \quad (2.23)$$
$$\nabla\tilde{r}_\epsilon(z) = \mu_x + \frac{v_\epsilon v_\epsilon^\top}{a_\epsilon}(z - T^{\alpha g}x). \quad (2.24)$$

Substituting (2.22) into the optimality conditions (2.21) gives

$$\begin{pmatrix} I+\alpha Q & \bar{A}^\top \\ \bar{A} & 0 \end{pmatrix}\begin{pmatrix} T^{\alpha g}x \\ \lambda_x \end{pmatrix} = \begin{pmatrix} x - \alpha\nabla\tilde{r}_\epsilon(T^{\alpha g}x) \\ \bar{b} \end{pmatrix},$$

which is the optimality condition of the proximal mapping $T^{\alpha(\check{g}+\tilde{r}_\epsilon)}(x)$. Therefore

$$T^{\alpha(\check{g}+\tilde{r}_\epsilon)}x = T^{\alpha g}x. \quad (2.25)$$

Similarly, substituting (2.23) into the optimality conditions at $(T^{\alpha g}x', \lambda_{x'})$ gives

$$\begin{pmatrix} I+\alpha Q & \bar{A}^\top \\ \bar{A} & 0 \end{pmatrix}\begin{pmatrix} T^{\alpha g}x' \\ \lambda_{x'} \end{pmatrix} = \begin{pmatrix} x' - \alpha\big(\nabla\tilde{r}_\epsilon(T^{\alpha g}x') - \epsilon(T^{\alpha g}x' - T^{\alpha g}x)\big) \\ \bar{b} \end{pmatrix} = \begin{pmatrix} x' - \alpha\nabla\tilde{r}_\epsilon(T^{\alpha g}x') \\ \bar{b} \end{pmatrix} + \begin{pmatrix} \alpha\epsilon(T^{\alpha g}x' - T^{\alpha g}x) \\ 0 \end{pmatrix}. \quad (2.26)$$

Moreover, the optimality conditions hold at the solution pair $(T^{\alpha(\check{g}+\tilde{r}_\epsilon)}x', \lambda_{\epsilon,x'})$:

$$\begin{pmatrix} I+\alpha Q & \bar{A}^\top \\ \bar{A} & 0 \end{pmatrix}\begin{pmatrix} T^{\alpha(\check{g}+\tilde{r}_\epsilon)}x' \\ \lambda_{\epsilon,x'} \end{pmatrix} = \begin{pmatrix} x' - \alpha\nabla\tilde{r}_\epsilon(T^{\alpha(\check{g}+\tilde{r}_\epsilon)}x') \\ \bar{b} \end{pmatrix}. \quad (2.27)$$

From (2.24), we have $\nabla\tilde{r}_\epsilon(T^{\alpha g}x') - \nabla\tilde{r}_\epsilon(T^{\alpha(\check{g}+\tilde{r}_\epsilon)}x') = \frac{v_\epsilon v_\epsilon^\top}{a_\epsilon}\big(T^{\alpha g}x' - T^{\alpha(\check{g}+\tilde{r}_\epsilon)}x'\big)$. Thus, subtracting the two optimality conditions (2.27) and (2.26), we derive

$$\begin{pmatrix} I+\alpha\Big(Q + \dfrac{v_\epsilon v_\epsilon^\top}{a_\epsilon}\Big) & \bar{A}^\top \\ \bar{A} & 0 \end{pmatrix}\begin{pmatrix} T^{\alpha g}x' - T^{\alpha(\check{g}+\tilde{r}_\epsilon)}x' \\ \lambda_{x'} - \lambda_{\epsilon,x'} \end{pmatrix} = \begin{pmatrix} \alpha\epsilon\,(T^{\alpha g}x' - T^{\alpha g}x) \\ 0 \end{pmatrix}.$$

Since $\bar{A}$ has linearly independent rows and $Q$ is positive definite, we have

$$\lim_{\epsilon\to 0} T^{\alpha(\check{g}+\tilde{r}_\epsilon)}x' = T^{\alpha g}x' = T^{\alpha(\check{g}+r)}x'. \quad (2.28)$$

Let $G_{\alpha,\epsilon} \triangleq \Big(I + \alpha\Big(Q + \dfrac{v_\epsilon v_\epsilon^\top}{a_\epsilon}\Big)\Big)^{-1}$.
Because $\frac{v_\epsilon v_\epsilon^\top}{a_\epsilon}$ is positive semidefinite, the positive semidefinite ordering gives $G_{\alpha,\epsilon} \preceq G_\alpha$. Then from Lemma 2.4 we derive

$$\frac{\|T^{\alpha(\check{g}+\tilde{r}_\epsilon)}x - T^{\alpha(\check{g}+\tilde{r}_\epsilon)}x'\|}{\|x-x'\|} \le \theta_{\max}(G_{\alpha,\epsilon}) \le \theta_{\max}(G_\alpha) < 1. \quad (2.29)$$

With the limiting equations (2.25) and (2.28), letting $\epsilon\to 0$ in (2.29) yields $\|T^{\alpha g}x - T^{\alpha g}x'\| \le \theta_{\max}(G_\alpha)\,\|x-x'\|$ with probability 1.

Now, equipped with Lemma 2.3 and Lemma 2.5, we are able to prove the contraction property of the proximal mapping of the SAA function in expectation.

Proposition 2.4. In two-stage SQLP/SQQP problems under assumptions (A1)-(A4), (B1), (B3) and (B5), for $k\ge\hat{k}_2$ with $\hat{k}_2$ defined in Lemma 2.3, the proximal mapping $T_X^{\alpha_k F_k}(\cdot)$ satisfies

$$\hat{E}_k\big\|T_X^{\alpha_k F_k}x^k - T_X^{\alpha_k F_k}x^*\big\| \le \gamma(\alpha_k)\,\hat{E}_{k-1}\|x^k - x^*\|, \quad (2.30)$$

where $G_\alpha = (I+\alpha Q)^{-1}$ and $\gamma(\alpha) = \theta_{\max}(G_\alpha)$. □

Proof. From Lemma 2.3, when $k\ge\hat{k}_2$ we have $T_X^{\alpha_k F_k}x^k = T_{\bar{X}}^{\alpha_k F_k}x^k$. By almost the same analysis, the argument of Lemma 2.3 also holds for $T_X^{\alpha_k F_k}x^*$, so that $T_X^{\alpha_k F_k}x^* = T_{\bar{X}}^{\alpha_k F_k}x^*$ when $k\ge\hat{k}_2$. To apply the result of Lemma 2.5, we take $\check{g}(x)$ to be the quadratic function $\tfrac{1}{2}x^\top Qx$ and $r(x)$ to be a random function of the form $c^\top x + \frac{1}{k}\sum_{i=1}^k h(x,\omega^i)$, where $\{\omega^i\}_{i=1}^k$ are iid random variables; then $\check{g}+r$ is the sample average function $F_k$. From assumption (B1), $\bar{A}$ has linearly independent rows. Applying Lemma 2.5 to the SAA function $F_k$, with probability 1 we have

$$\|T^{\alpha_k F_k}x^k - T^{\alpha_k F_k}x^*\| \le \theta_{\max}(G_{\alpha_k})\,\|x^k - x^*\|. \quad (2.31)$$

Then, by the law of iterated expectations, we have the contraction property in expectation, where $E$ is taken with respect to the probability measure of $\omega^k$ and $\hat{E}_k$ with respect to the product of the probability measures of $\{\omega^i\}_{i=1}^k$:

$$\hat{E}_k\|T^{\alpha_k F_k}x^k - T^{\alpha_k F_k}x^*\| = \hat{E}_{k-1}\Big[E\big[\|T^{\alpha_k F_k}x^k - T^{\alpha_k F_k}x^*\|\;\big|\;\{\omega^i\}_{i=1}^{k-1}\big]\Big] \le \hat{E}_{k-1}\Big[E\big[\theta_{\max}(G_{\alpha_k})\,\|x^k - x^*\|\;\big|\;\{\omega^i\}_{i=1}^{k-1}\big]\Big] = \theta_{\max}(G_{\alpha_k})\,\hat{E}_{k-1}\|x^k - x^*\|.$$

Proposition 2.4 presents a valuable contraction factor $\gamma(\alpha)$ for the proximal mapping with respect to the sample average function $F_k(x)$. However, the stochastic proximal mapping still may not have a fixed point. Here, in the case of the stochastic proximal mapping of sample average approximation functions, we show that the expectation of the proximal mapping gap at $x^*$ converges to zero exponentially.

Proposition 2.5 (Convergence of the fixed-point gap). Let $F_k$ be the SAA function defined in (2.11). Suppose assumptions (A1)-(A3) and (B4) hold for the two-stage SQLP/SQQP problems considered. Then there exist constants $C>0$ and $\delta>0$ such that, for the optimal solution $x^*$,

$$\hat{E}_k\|T_X^{\alpha F_k}x^* - x^*\| < Ce^{-\delta k}, \quad (2.32)$$

where the expectation $\hat{E}_k$ is taken with respect to the product of the probability measures of $\{\omega^i\}_{i=1}^k$. □

Proof. Let $\tilde{x}^k \triangleq \operatorname{argmin}_{x\in X} F_k(x)$, and let $x^*$ be the unique and sharp optimal solution of the two-stage SQLP/SQQP. From Proposition 2.1, $P\big(\|\tilde{x}^k - x^*\| > 0\big) \le C_0 e^{-\delta_0 k}$. Moreover, from the optimality conditions of $T_X^{\alpha F_k}(x^*)$, we have

$$F_k(T_X^{\alpha F_k}x^*) + \tfrac{1}{2\alpha}\|T_X^{\alpha F_k}x^* - x^*\|^2 < F_k(\tilde{x}^k) + \tfrac{1}{2\alpha}\|\tilde{x}^k - x^*\|^2. \quad (2.33)$$

Since $F_k(T_X^{\alpha F_k}x^*) > F_k(\tilde{x}^k)$, from (2.33) we have $\|T^{\alpha F_k}x^* - x^*\| < \|\tilde{x}^k - x^*\|$. From assumption (A2), there exists a constant $C_1$ such that $\|x - x'\| \le C_1$ for any $x, x'$ in the set $X$. Therefore

$$\hat{E}_k\|T_X^{\alpha F_k}x^* - x^*\| < \hat{E}_k\|\tilde{x}^k - x^*\| \le C_1\,P\big(\|\tilde{x}^k - x^*\| > 0\big) \le C_1 C_0 e^{-\delta_0 k},$$

which proves the statement.
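As a quick numeric sanity check of the contraction factor in Lemmas 2.4 and 2.5 (an illustration of ours with arbitrary data, not part of the analysis), the equality-constrained proximal step of a positive definite quadratic can be computed from its KKT system and compared against the bound $\theta_{\max}(G_\alpha)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, alpha = 6, 2, 0.5

B = rng.standard_normal((n, n))
Q = B @ B.T + n * np.eye(n)            # positive definite, as in Lemma 2.4
A = rng.standard_normal((m, n))        # rows linearly independent (a.s.)
b = rng.standard_normal(m)

def prox(x):
    # T^{alpha g}(x) for g(z) = 0.5 z'Qz over {z : Az = b}: solve the KKT system
    # [[I + alpha Q, A'], [A, 0]] [z; mu] = [x; b].
    KKT = np.block([[np.eye(n) + alpha * Q, A.T], [A, np.zeros((m, m))]])
    return np.linalg.solve(KKT, np.concatenate([x, b]))[:n]

theta = np.linalg.eigvalsh(np.linalg.inv(np.eye(n) + alpha * Q)).max()

# two feasible points x, x' in {z : Az = b}: least-norm solution plus null-space steps
x_part = np.linalg.lstsq(A, b, rcond=None)[0]
N = np.linalg.svd(A)[2][m:].T          # columns span the null space of A
x = x_part + N @ rng.standard_normal(n - m)
xp = x_part + N @ rng.standard_normal(n - m)

lhs = np.linalg.norm(prox(x) - prox(xp))
print(f"||Tx - Tx'|| = {lhs:.4f} <= {theta * np.linalg.norm(x - xp):.4f}"
      f"  (theta_max(G_alpha) = {theta:.4f} < 1)")
```

The printed inequality holds for any such draw, reflecting that the strict contraction comes entirely from the positive definite first-stage matrix $Q$.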
The following lemma will be used in deriving the convergence rate in Theorem 2.1.

Lemma 2.6. For $\Lambda > 0$ and any two large integers $k$ and $k_0$ satisfying $k_0 \ll k$, let $Z_\Lambda(k_0,k) \triangleq \prod_{i=k_0+1}^{k}\big(1 - \tfrac{\Lambda}{i}\big)$. We have the following limiting properties:

1. $Z_\Lambda(k_0,k) \lesssim \Big(\dfrac{k_0}{k}\Big)^{\Lambda}$.

2. $\displaystyle\sum_{i=\hat{k}+1}^{k} e^{-\delta i}\,Z_\Lambda(i,k) \;\lesssim\; \begin{cases} \dfrac{1}{k}\,, & \Lambda \ge 1, \\[4pt] \dfrac{1}{k^{\Lambda}}\,, & \Lambda < 1. \end{cases}$

3. $\displaystyle\sum_{i=\hat{k}+1}^{k} \frac{1}{i}\,Z_\Lambda(i,k) \;\lesssim\; \frac{1}{\Lambda}$.

4. $\displaystyle\sum_{i=\hat{k}+1}^{k} \frac{1}{i^2}\,Z_\Lambda(i,k) \;\lesssim\; \begin{cases} \dfrac{\log(k)}{k}\,, & \Lambda = 1, \\[4pt] \dfrac{1}{(\Lambda-1)\,k}\,, & \Lambda > 1, \\[4pt] \dfrac{1}{(1-\Lambda)\,k^{\Lambda}}\,, & \Lambda < 1. \end{cases}$

Proof. See Appendix A.

Notice that the properties in Lemmas 2.2 and 2.3 hold beyond certain finite numbers of iterations, such as $k_a$, $k_b$ and $\hat{k}_2$. In the following theorem, we inherit these finite numbers in deriving the sublinear convergence rate of the SD algorithm.

Theorem 2.1 (Convergence rate of SD in SQLP). Suppose assumptions (A1)-(A4) and (B1)-(B5) hold for the two-stage SQLP (2.1) or SQQP (2.3) considered. Let $x^*$ denote the optimal solution and $\{x^k\}$ the sequence of incumbent solutions produced by SD with step sizes $\{\frac{\tau}{k+1}\}$ for a given constant $\tau>0$. Let $\Lambda \triangleq \tau\cdot\theta_{\min}(Q)$. Then:

(a) for any $k \ge \hat{k}_a \triangleq \max\{k_a, \hat{k}_2\}$,

$$\hat{E}_k\|x^{k+1}-x^*\| \;\lesssim\; \hat{E}_{\hat{k}_a-1}\|x^{\hat{k}_a}-x^*\|\,\Big(\frac{\hat{k}_a}{k}\Big)^{\Lambda} + C\Big(\frac{1}{k^{\min\{1,\Lambda\}}} + e^{-\delta k}\Big) + \frac{4M_0\tau}{\Lambda}\,;$$

(b) for any $k \ge \hat{k}_b \triangleq \max\{k_b, \hat{k}_2\}$,

$$\hat{E}_k\|x^{k+1}-x^*\| \;\lesssim\; \hat{E}_{\hat{k}_b-1}\|x^{\hat{k}_b}-x^*\|\,\Big(\frac{\hat{k}_b}{k}\Big)^{\Lambda} + C\Big(\frac{1}{k^{\min\{1,\Lambda\}}} + e^{-\delta k}\Big) + 4M_0M_1\tau^2\cdot\begin{cases} \dfrac{\log(k)}{k}\,, & \Lambda = 1, \\[4pt] \dfrac{1}{(\Lambda-1)\,k}\,, & \Lambda > 1, \\[4pt] \dfrac{1}{(1-\Lambda)\,k^{\Lambda}}\,, & \Lambda < 1, \end{cases}$$

where $C$ and $\delta$ are the constants defined in Proposition 2.5, and $M_0$ and $M_1$ are the positive constants defined in Proposition 2.3. □

Proof. Since we consider only the sequence of incumbent solutions in SD, every $x^k$ is a proximal mapping point of the value function approximation. The idea behind the convergence rate is to obtain a recurrence by bounding the expected distance between $x^{k+1}$ and $x^*$ through the three terms of the triangle inequality (2.14). From Proposition 2.3, the bound on the first term is $O(\alpha_k)$ when $k\ge k_a$ and $O(\alpha_k^2)$ when $k\ge k_b$; for the convergence analysis, these two cases are considered separately.

Case (a): when $k \ge \max\{k_a, \hat{k}_2\}$, we combine the results of Proposition 2.3(a), Proposition 2.4, and Proposition 2.5 with the triangle inequality (2.14) to derive the recurrence

$$\hat{E}_k\|x^{k+1}-x^*\| \le \gamma(\alpha_k)\,\hat{E}_{k-1}\|x^k-x^*\| + Ce^{-\delta k} + 4M_0\alpha_k. \quad (2.34)$$

From assumption (A1), $\theta_{\min}(Q)>0$; moreover, with $\alpha_k = \frac{\tau}{k+1}$,

$$\gamma(\alpha_k) = \theta_{\max}\big((I+\alpha_k Q)^{-1}\big) = 1 - \frac{\tau}{k+1}\,\theta_{\min}(Q) + o\big(\tfrac{\tau}{k}\big). \quad (2.35)$$

Adopting the convention that a product $\prod_{i=i_0}^{k}$ equals 1 when $i_0>k$, recursively applying (2.34) down to iteration $\hat{k}_a$ together with (2.35) yields

$$\hat{E}_k\|x^{k+1}-x^*\| \le \hat{E}_{\hat{k}_a-1}\|x^{\hat{k}_a}-x^*\|\prod_{i=\hat{k}_a+1}^{k}\gamma(\alpha_i) + \sum_{i=\hat{k}_a+1}^{k}\Big(Ce^{-\delta i} + 4M_0\alpha_i\Big)\prod_{j=i+1}^{k}\gamma(\alpha_j) \;\lesssim\; \hat{E}_{\hat{k}_a-1}\|x^{\hat{k}_a}-x^*\|\prod_{i=\hat{k}_a+1}^{k}\Big(1-\frac{\tau\,\theta_{\min}(Q)}{i+1}\Big) + \sum_{i=\hat{k}_a+1}^{k}\Big(Ce^{-\delta i} + \frac{4M_0\tau}{i+1}\Big)\prod_{j=i+1}^{k}\Big(1-\frac{\tau\,\theta_{\min}(Q)}{j+1}\Big).$$

With $\Lambda \triangleq \tau\cdot\theta_{\min}(Q)$ and $Z_\Lambda(k_0,k) \triangleq \prod_{i=k_0}^{k}\big(1-\frac{\Lambda}{i+1}\big)$ for positive integers $k_0<k$, the above inequality can be rewritten as

$$\hat{E}_k\|x^{k+1}-x^*\| \;\lesssim\; \hat{E}_{\hat{k}_a-1}\|x^{\hat{k}_a}-x^*\|\,Z_\Lambda(\hat{k}_a+1,k) + \sum_{i=\hat{k}_a+1}^{k}\Big(Ce^{-\delta i} + \frac{4M_0\tau}{i+1}\Big)Z_\Lambda(i+1,k). \quad (2.36)$$

When $k\ge\hat{k}_a = \max\{k_a,\hat{k}_2\}$, applying Lemma 2.6 to inequality (2.36) thus gives

$$\hat{E}_k\|x^{k+1}-x^*\| \;\lesssim\; \hat{E}_{\hat{k}_a-1}\|x^{\hat{k}_a}-x^*\|\,\Big(\frac{\hat{k}_a}{k}\Big)^{\Lambda} + C\Big(\frac{1}{k^{\min\{1,\Lambda\}}} + e^{-\delta k}\Big) + \frac{4M_0\tau}{\Lambda}.$$
Case (b): when $k \ge \hat{k}_b = \max\{k_b, \hat{k}_2\}$, we combine the results of Proposition 2.3(b), Proposition 2.4, and Proposition 2.5 with the triangle inequality (2.14) and derive

$$\hat{E}_k\|x^{k+1}-x^*\| \le \gamma(\alpha_k)\,\hat{E}_{k-1}\|x^k-x^*\| + Ce^{-\delta k} + 4M_0M_1\alpha_k^2.$$

With the first-order approximation (2.35) of $\gamma(\alpha_k)$, we follow the same derivation of the recursion as in case (a) and obtain

$$\hat{E}_k\|x^{k+1}-x^*\| \;\lesssim\; \hat{E}_{\hat{k}_b-1}\|x^{\hat{k}_b}-x^*\|\,Z_\Lambda(\hat{k}_b+1,k) + \sum_{i=\hat{k}_b+1}^{k}\Big(Ce^{-\delta i} + \frac{4M_0M_1\tau^2}{(i+1)^2}\Big)Z_\Lambda(i+1,k).$$

The limiting properties in Lemma 2.6 then yield the asymptotic result stated in (b).

In Theorem 2.1, statement (a) gives a non-asymptotic convergence result with a constant error term when $k$ is small, while statement (b) shows that once $k$ exceeds the finite number $\hat{k}_b$, the expected distance between the incumbent solution and the optimum is controlled by three diminishing terms. We also observe that when $\Lambda > 1$ the overall rate of convergence is $O(\frac{1}{k})$. Therefore, we conclude that by appropriately choosing an initial step size and a sequence of diminishing step sizes, the sequence of incumbent solutions obtained by SD converges to the optimum at a sublinear rate. Moreover, Theorem 2.1 indicates that when $k\ge\hat{k}_b$, the larger $\Lambda$ is, the faster the solution sequence converges. In fact, $\Lambda = \tau\cdot\theta_{\min}(Q)$ implies that $\Lambda$ can be made large by choosing a large initial step size $\tau$. However, if the initial step size is too large, the number of iterations required to generate a stable set of extreme points or faces may become very large. This kind of stability was discussed experimentally by Sen and Liu in [72], where it was interpreted as shadow price stability and taken into consideration in the design of the in-sample stopping rule. Our analysis therefore reveals a trade-off between the convergence rate within a neighborhood of the optimal solution and the rate of entering that neighborhood; the effect of the initial step size on this trade-off needs more study to be fully understood.
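The role of $\Lambda$ in this rate can be seen by simulating the scalar recursion that underlies case (b) (an illustrative computation of ours, with arbitrary constants):

```python
import math

def tail_constant(Lam, K=200000, C=1.0, delta=0.1, c=1.0):
    """Iterate a_{k+1} = (1 - Lam/(k+1)) a_k + C e^{-delta k} + c/(k+1)^2,
    the inexact contraction behind Theorem 2.1(b), and report k * a_k."""
    a = 1.0
    for k in range(1, K + 1):
        a = max(0.0, 1.0 - Lam / (k + 1)) * a + C * math.exp(-delta * k) + c / (k + 1) ** 2
    return K * a

for Lam in (0.5, 1.0, 2.0):
    print(f"Lambda = {Lam}: k * a_k ~= {tail_constant(Lam):.2f}")
# Lambda > 1 keeps k * a_k bounded (an O(1/k) rate); Lambda < 1 does not.
```

Running this shows $k\,a_k$ settling near a constant only for $\Lambda>1$, growing like $\log k$ at $\Lambda=1$, and growing polynomially for $\Lambda<1$, exactly the three regimes of Lemma 2.6, property 4.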
2.4 SD Stopping Rule: Consistent Bootstrap Estimators

For the SD algorithm applied to two-stage SQLP and SQQP problems, the practical goal is to end with a good incumbent solution at some finite iteration. In other words, we need a stopping rule that recognizes whether the solution produced by the SD algorithm can be accepted with a predetermined accuracy. In this section we design a stopping rule that uses bootstrap estimators to test whether the optimality gap between the optimal value and the objective value at $\hat{x}^k$ is statistically acceptable.

Higle and Sen in [36] first applied the bootstrap directly to the primal and dual linear programs in a stopping rule. But the dual multiplier in the solution pair might be infeasible in resampled linear programs, so this approach requires solving the dual problem for each resampled instance. Higle and Sen in [38] then proposed proximal mapping updates in SLP for better convergence and showed that the dual multiplier remains feasible in all resampled problems. Accordingly, using the bootstrap, they took the duality gap at the incumbent solution in the proximal mapping of the value function approximation as the measure of optimality. However, this measure does not fully capture the optimality gap at the incumbent solution, because the duality gap is associated with the value function approximation, not the true objective function. In this section, for two-stage SQLP and SQQP, we propose an approximate optimality gap and design an "in-sample" stopping rule using consistent bootstrap estimators of this approximate gap. The scheme should be applicable to two-stage SLP problems as well.

At iteration $k$, suppose the SD algorithm has generated a set of $k$ samples $\{\omega^i\}_{i=1}^k$ and a sequence of incumbent solutions $\{\hat{x}^i\}_{i=1}^k$. To test the optimality of the incumbent solution $\hat{x}^k$, we define the optimality gap $d_k$ as the difference between the current objective value $f(\hat{x}^k)$ and the optimal value $f(x^*)$, i.e., $d_k \triangleq f(\hat{x}^k) - f(x^*)$. Since we have no knowledge of the mathematical form of the distribution of the random variable $\tilde\omega$, nor of the optimal solution $x^*$, we must construct statistical estimates of $f(\hat{x}^k)$ and $f(x^*)$ in order to estimate the optimality gap $d_k$. We first note that $f(\hat{x}^k)$ has an unbiased estimator: the sample average value $F'_k(\hat{x}^k)$ computed from a sample set $\{\omega'^i\}_{i=1}^k$ of size $k$ independent of $\{\omega^i\}_{i=1}^k$. As for $f(x^*)$, however, the sample average minimal value $\min_{x\in X}F_k(x)$ is not only biased but also computationally expensive when $k$ is large. We therefore construct a statistical lower bound of $f(x^*)$ using the value function approximation, which satisfies $\hat{E}_k[\min_{x\in X} f_k(x)] \le f(x^*)$. This is because, under assumption (A4), $f_k$ is a lower bound of the sample average function $F_k$, i.e., $f_k(x) \le F_k(x)$ for all $x\in X$; taking the minimum and then the expectation, Jensen's inequality gives

$$\hat{E}_k\big[\min_{x\in X} f_k(x)\big] \le \min_{x\in X}\hat{E}_k[f_k(x)] \le f(x^*).$$

Let $E'_k$ denote the expectation with respect to the product of the probability measures of the sample set $\{\omega'^i\}_{i=1}^k$ used in the construction of the unbiased estimator of $f(\hat{x}^k)$. Then the optimality gap can be bounded as

$$d_k = f(\hat{x}^k) - f(x^*) \le E'_k\big[F'_k(\hat{x}^k)\big] - \hat{E}_k\big[\min_{x\in X} f_k(x)\big]. \quad (2.37)$$

Motivated by (2.37), we aim to obtain a conservative confidence interval for the optimality gap by using consistent bootstrap estimators of the statistics $F'_k(\hat{x}^k)$ and $\min_{x\in X} f_k(x)$ with multiple replications. The idea of the bootstrap is to estimate the distribution of a statistic of independent observations by the distribution of the same statistic computed from resamples; this follows from the intuition that the sample set should be a good representation of the population of the true data. Under mild conditions, the bootstrap yields a consistent estimator of a statistic's distribution, meaning that the distribution of the bootstrap estimator is uniformly close to the asymptotic distribution of the original statistic when the sample size is large. A definition of consistency, following Definition 2.1 in [44], is provided here.

Definition 2.1. Let $\{X_i\}_{i=1}^n$ be a random sample generated from a probability distribution $F_0$ and $\{Z_i\}_{i=1}^n$ a random sample generated from the empirical distribution $F_n$ of $\{X_i\}_{i=1}^n$. Consider the statistic $T_n = T(X_1,X_2,\dots,X_n)$ and its bootstrap estimator $\tilde{T}_n = T(Z_1,Z_2,\dots,Z_n)$. Define $G_n(\tau, F_n) \triangleq P(\tilde{T}_n \le \tau)$ and $G_\infty(\tau, F_0) \triangleq \lim_{n\to\infty}P(T_n \le \tau)$. Let $P_n$ be the joint probability measure of the sample $\{Z_i\}_{i=1}^n$. Then the bootstrap estimator $\tilde{T}_n$ is a consistent estimator of $T_n$ if for each $\epsilon>0$,

$$\lim_{n\to\infty} P_n\Big(\sup_{\tau\in\mathbb{R}}\big|G_n(\tau,F_n) - G_\infty(\tau,F_0)\big| > \epsilon\Big) = 0. \quad (2.38)$$ □

In the literature, Singh [82] proved that pivotal and asymptotically pivotal statistics have consistent bootstrap estimators as the sample size increases to infinity.
Moreover, Mammen [53] showed that a sample average of functions has consistent bootstrap estimators if and only if the statistic satisfies an asymptotic normality condition. Guided by these consistency results, we construct the bootstrap estimators of $F_k(\hat{x}^k)$ and $\min_{x\in X} f_k(x)$, respectively. First, we independently generate two resample sets from the empirical distribution of the sample set $\{\omega^i\}_{i=1}^k$: $\{\bar\omega_1^i\}_{i=1}^k$ for the bootstrap estimator of $F_k(\hat{x}^k)$ and $\{\bar\omega_2^i\}_{i=1}^k$ for the bootstrap estimator of $\min_x f_k(x)$. The bootstrap estimator of $F_k(\hat{x}^k)$ is constructed as

$$\tilde{F}_k(\hat{x}^k) \triangleq \tfrac{1}{2}(\hat{x}^k)^\top Q\hat{x}^k + c^\top\hat{x}^k + \frac{1}{k}\sum_{i=1}^{k} h(\hat{x}^k, \bar\omega_1^i). \quad (2.39)$$

Since $F_k(\hat{x}^k)$ is a sample average of objective values, it is asymptotically pivotal and $\tilde{F}_k(\hat{x}^k)$ is a consistent bootstrap estimator.

Next we consider the bootstrap estimator of $\min_{x\in X} f_k(x)$. Recall that the value function approximation has the form $f_k(x) = \tfrac{1}{2}x^\top Qx + c^\top x + \max\{h_j^k(x),\ j\in J_k\}$, where the SASA or SAQA functions $\{h_j^k\}_{j\in J_k}$ are sample averages of the functions defined at line 8 of Algorithm 1 and line 8 of Algorithm 2. Thus, by Mammen [53], the SASA and SAQA functions have consistent bootstrap estimators. Formally, a resampled SASA function for two-stage SQLP is constructed so that for each $j\in J_k$,

$$\tilde{h}_j^k(x) \triangleq \begin{cases} \dfrac{1}{k}\displaystyle\sum_{i=1}^{|j|}(\pi'_{j,i})^\top\big[\xi(\bar\omega_2^i) - C(\bar\omega_2^i)\,x\big], & j \ne 0, \\[4pt] 0, & j = 0, \end{cases} \qquad \pi'_{j,i} = \begin{cases} \pi_{j,l}, & j>0, \\ \hat\pi_{j,l}, & j<0, \end{cases} \quad (2.40)$$

where $l$ is the index such that $\bar\omega_2^i = \omega^l$, and $\pi_{j,l}$ and $\hat\pi_{j,l}$ are computed following line 6 and line 7 of Algorithm 1, respectively. In addition, a resampled SAQA function for two-stage SQQP is constructed as

$$\tilde{h}_j^k(x) \triangleq \begin{cases} \dfrac{1}{k}\displaystyle\sum_{i=1}^{|j|} g_{QQ}\big(t'_{j,i}, s'_{j,i};\, x, \bar\omega_2^i\big), & j \ne 0, \\[4pt] 0, & j = 0, \end{cases} \qquad s'_{j,i} = \begin{cases} s_{j,l}, & j>0, \\ \hat{s}_{j,l}, & j<0, \end{cases} \qquad t'_{j,i} = \begin{cases} t_{j,l}, & j>0, \\ \hat{t}_{j,l}, & j<0, \end{cases} \quad (2.41)$$

where $l$ is the index such that $\bar\omega_2^i = \omega^l$, and $s_{j,l}, \hat{s}_{j,l}, t_{j,l}, \hat{t}_{j,l}$ are computed following line 6 and line 7 of Algorithm 2, respectively. Accordingly, the resampled lower-bound estimate is

$$\min_{x\in X}\tilde{f}_k(x) := \min_{x\in X}\Big\{\tfrac{1}{2}x^\top Qx + c^\top x + \max\{\tilde{h}_j^k(x),\ j\in J_k\}\Big\}. \quad (2.42)$$

Hence, based on the optimality gap bound (2.37) and the bootstrap estimators constructed in (2.39) and (2.42), we design the "in-sample" stopping rule in Table 2.3 to test the statistical performance of the optimality gap.

Table 2.3: "In-sample" stopping rule

At iteration $k$, with the sample set $\{\omega^i\}_{i=1}^k$ and the incumbent solution $\hat{x}^k$:

1. Initialization: let $\epsilon$ be the predetermined parameter and $\tilde{F}$ the empirical distribution of the sample data $\{\omega^i\}_{i=1}^k$. For each replication $m = 1,\dots,M$ ($M\ge 30$), do the following.

2. Bootstrap estimators:
- generate a resample $\{\bar\omega_1^i\}_{i=1}^k$ of size $k$ from $\tilde{F}$; compute the bootstrap estimator of the sample average value $\tilde{F}_k^m(\hat{x}^k)$ according to (2.39);
- generate a new resample $\{\bar\omega_2^i\}_{i=1}^k$ of size $k$ from $\tilde{F}$; compute the bootstrap estimator $\min_x\tilde{f}_k^m(x)$ according to (2.42);
- compute the optimality gap estimate $\hat{d}_k^m \triangleq \tilde{F}_k^m(\hat{x}^k) - \min_x\tilde{f}_k^m(x)$.

3. Optimality gap: compute the mean of the optimality gap estimates $\bar{d}_k \triangleq \sum_{m=1}^M \hat{d}_k^m / M$ and the sample variance $\mathrm{Var}_k \triangleq \frac{1}{M-1}\sum_{m=1}^M(\hat{d}_k^m - \bar{d}_k)^2$. If $\bar{d}_k \le t^{M-1}_{0.01}\sqrt{\mathrm{Var}_k/M} + \epsilon$, then the incumbent solution $\hat{x}^k$ is accepted as the approximation of the optimum; otherwise, we move forward to the next iteration of the SD algorithm.
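A schematic rendering of Table 2.3 follows (a sketch of ours; the callables `sample_avg_value` and `lower_bound_value` stand in for the estimators (2.39) and (2.42), which depend on the cut data of the particular SD run, and $t^{M-1}_{0.01}$ is read as the upper $0.01$ quantile of the Student $t$ distribution):

```python
import numpy as np
from scipy.stats import t as student_t

def in_sample_stop(samples, x_inc, sample_avg_value, lower_bound_value,
                   eps=1e-2, M=30, seed=0):
    """One pass of the 'in-sample' stopping rule of Table 2.3.
    samples:           np.ndarray of the k observations generated so far
    sample_avg_value:  callable (x, resample) -> F~_k(x), as in (2.39)
    lower_bound_value: callable (resample) -> min_x f~_k(x), as in (2.42)
    Returns True when the incumbent x_inc should be accepted."""
    rng = np.random.default_rng(seed)
    k = len(samples)
    gaps = np.empty(M)
    for m in range(M):
        r1 = samples[rng.integers(0, k, size=k)]   # resample for F~_k(x_inc)
        r2 = samples[rng.integers(0, k, size=k)]   # independent resample for min f~_k
        gaps[m] = sample_avg_value(x_inc, r1) - lower_bound_value(r2)
    d_bar = gaps.mean()
    half_width = student_t.ppf(0.99, df=M - 1) * np.sqrt(gaps.var(ddof=1) / M)
    return d_bar <= half_width + eps
```

The two resamples per replication are drawn independently, mirroring the independence of the estimators of $F_k(\hat{x}^k)$ and of the lower bound in (2.37).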
2.5 Summary

This chapter studied the convergence rate of the SD algorithm for two-stage SQLP and SQQP problems, which have a quadratic program in the first stage and a linear/quadratic program in the second stage. Under the assumption of positive definiteness of the first-stage quadratic matrix $Q$, we presented a contraction property of the stochastic proximal mapping with constraints, and we then proved a sublinear convergence rate of SD for SQLP and SQQP problems. The effect of the curvature of the second-stage problem in SQQP has not been incorporated into the convergence rate analysis; it has the potential, however, to improve the rate with modifications of the analysis. A deeper look at the convergence analysis reveals an interesting trade-off: by appropriately choosing a large initial step size $\tau$ such that $\Lambda = \tau\cdot\theta_{\min}(Q)$ is greater than one, we can increase the asymptotic convergence rate; however, $\tau$ should not be too large because of the trade-off between the convergence rate within a neighborhood of the optimum and the rate of entering that neighborhood.

Many previous works give convergence rates of algorithms for the SAA scenario problem; for example, Chen et al. [14] give the convergence rate of Newton's method for two-stage SQLP toward the optimal point of the SAA problem. To the best of our knowledge, however, there is no prior theoretical analysis giving a convergence rate of SD for solving two-stage stochastic programs in the decision space. Our work thus provides theoretical support for the convergence of SD algorithms in two-stage stochastic quadratic programming. More importantly, our work proves a faster convergence rate, $O(N^{-1})$, of SD in the distance $\|x^{N+1}-x^*\|$, whereas the convergence rate of the classical SA method is $O(N^{-1/2})$ under equivalent conditions.

Chapter 3

Two-stage stochastic programming with the linearly bi-parameterized quadratic recourse

3.1 Introduction

To date, among the voluminous advances in two-stage stochastic programming models, one overwhelming feature is that the first-stage decisions affect only the constraints, and not the objective, of the second-stage programs. Perhaps a good justification for this restriction is that the resulting recourse function is convex and piecewise affine, thereby enabling the ready employment of powerful linear programming advances for solving two-stage SPs. Without the restriction to right-hand-side parameterization alone, the possible loss of convexity is a serious handicap when one is committed to computing a globally optimal solution of the problem. However, if one is willing to trade global optimality for model fidelity, then one may be interested in an extended paradigm of two-stage SPs.

In this chapter, we consider a class of two-stage SPs with a quadratic program in the second stage in which both the cost vector and the right-hand constraint vector depend on the first-stage decision and the uncertainty. The dependence could be unknown; for instance, demand is always latently affected by price in price-sensitive problems. In this case of unknown dependence, we use the PO paradigm to combine the prediction and optimization; generally, the PO scheme yields a two-stage SP with bi-parameterized recourse. To the best of our knowledge, this is a class of SPs that has not been studied yet and is intrinsically challenging. This chapter develops algorithms for a class of two-stage SPs with linearly bi-parameterized recourse.
We hope this work will stimulate further theoretical and computational study of the class of SPs with bi-parameterized recourse.

3.1.1 Problem setting

We consider the problem defined in (3.1) and (3.2) below:

$$\underset{x\,\in\, X\,\subseteq\,\mathbb{R}^{n_1}}{\text{minimize}}\quad \zeta(x) \,\triangleq\, \varphi(x) + E_{\tilde\omega}\big[\psi(x,\tilde\omega)\big], \quad (3.1)$$

where the recourse function $\psi(x,\omega)$ is given by

$$\psi(x,\omega) \,\triangleq\, \underset{y}{\text{minimum}}\ \big[f(\omega)+G(\omega)x\big]^\top y + \tfrac{1}{2}y^\top Qy \quad \text{subject to}\quad y \in Y(x,\omega) \triangleq \big\{y\in\mathbb{R}^{n_2} \,\big|\, A(\omega)x + Dy \ge \xi(\omega)\big\}. \quad (3.2)$$

In this setting, $\tilde\omega$ is a random vector defined on a probability space $(\Omega,\mathcal{A},P)$, with $\Omega$ the sample space, $\mathcal{A}$ the $\sigma$-algebra generated by subsets of $\Omega$, and $P$ a probability measure defined on $\mathcal{A}$. The tilde on $\tilde\omega$ signifies a random variable, whereas $\omega$ without the tilde refers to a realization of the random variable. The first-stage objective function $\varphi$ is assumed to be convex and continuously differentiable; the set $X$ is compact and convex; the random data are as follows: $f(\omega)\in\mathbb{R}^{n_2}$, $G(\omega)\in\mathbb{R}^{n_2\times n_1}$, $A(\omega)\in\mathbb{R}^{\ell\times n_1}$, and $\xi(\omega)\in\mathbb{R}^{\ell}$; the deterministic constants are: $Q\in\mathbb{R}^{n_2\times n_2}$, symmetric positive semidefinite, and $D\in\mathbb{R}^{\ell\times n_2}$. As we will see, the case where $Q$ is positive definite is considerably easier to treat than the case of a positive semidefinite $Q$. We should note that the convexity and differentiability of $\varphi$ and the compactness and convexity of $X$ are assumed to simplify some technical details, so that we can focus on the treatment of the second-stage quadratic recourse function $\psi$, which is complicated by the bi-parameterization by $x$ in both the objective function and the constraint of (3.2). Moreover, we let $Q$ and $D$ be deterministic matrices as an extension of the standard recourse function in traditional two-stage SP, which has $Q$ equal to zero (and $G(\omega)$ identically equal to zero). For ease of reference, we summarize this blanket setup in the assumption below:

(A) The set $X$ is a compact convex set and $\varphi$ is a convex and continuously differentiable function on an open set $\Xi$ containing $X$; moreover, $Q$ is a symmetric positive semidefinite matrix. □

It should be pointed out that since the first-stage objective $\varphi$ is assumed convex and the set $X$ is compact, there exists a constant $\mathrm{Lip}_\varphi>0$ such that

$$|\varphi(x)-\varphi(x')| \le \mathrm{Lip}_\varphi\,\|x-x'\| \quad \text{for all } x,x'\in X, \quad (3.3)$$

where $\|\cdot\|$ denotes (throughout the dissertation) the $\ell_2$-norm of vectors (and matrices). This constant $\mathrm{Lip}_\varphi$ is not employed explicitly in the algorithms introduced later; it will, however, be used in the proof of the last Theorem 3.4. To avoid some technical complications, we assume that

(B) the second-stage problem satisfies the relatively complete recourse property on $\Xi$, i.e., the recourse function $\psi(x,\omega)$ is finite for all $x\in\Xi$ and almost all $\omega\in\Omega$. □

We refer to the monograph [49] for a contemporary comprehensive study of the qualitative properties of solutions to (convex) quadratic programs and to [59, Lemma 1] for a supplement to that reference; the cited lemma summarizes many of the solution properties of a quadratic program that we will freely use in this chapter. In particular, with $Q$ being symmetric positive semidefinite, the relatively complete recourse assumption stipulates the validity of the following two conditions for all $x\in\Xi$ and almost all $\omega\in\Omega$, i.e., for all $\omega$ in a subset $\hat\Omega$ of $\Omega$ satisfying $P(\omega\in\hat\Omega)=1$:

(B1) Feasibility: the set $Y(x,\omega)\ne\emptyset$; i.e., $\xi(\omega) - A(\omega)x \in \mathrm{Range}(D) - \mathbb{R}^{\ell}_+$;

(B2) Finiteness: the objective function of the recourse function is bounded below on its feasible set; i.e., $[\,Dv \ge 0,\ Qv = 0\,] \Rightarrow \big[f(\omega)+G(\omega)x\big]^\top v \ge 0$. □
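For concreteness, the recourse value $\psi(x,\omega)$ in (3.2) can be evaluated for a given realization by solving one convex QP; the sketch below (ours, assuming the modeling package cvxpy, with the random data supplied as callables, a notation of this sketch rather than of the chapter) makes the bi-parameterization explicit:

```python
import cvxpy as cp

def recourse_value(x, omega, Q, D, f, G, A, xi):
    """psi(x, omega) of (3.2): a convex QP in y whose linear cost
    f(omega) + G(omega) x  and right-hand side  xi(omega) - A(omega) x
    both depend on the first-stage decision x."""
    y = cp.Variable(Q.shape[0])
    objective = cp.Minimize((f(omega) + G(omega) @ x) @ y
                            + 0.5 * cp.quad_form(y, Q))
    constraints = [A(omega) @ x + D @ y >= xi(omega)]   # the set Y(x, omega)
    prob = cp.Problem(objective, constraints)
    prob.solve()
    return prob.value
```

Under (B1)-(B2) the program is feasible and bounded for the relevant $(x,\omega)$, so `prob.value` is finite; the two appearances of $x$, in the cost and on the right-hand side, are precisely the bi-parameterization that makes $\psi(\cdot,\omega)$ nonconvex in $x$.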
⇤ We say that a random variable Z is essentially bounded if its essential supremum essup(Z) , inf ↵ | P{Z ||Z|>↵ }=0 is finite. We further assume that (C) the given random functionsf(e !),G(e !),A(e !), and ⇠ (e !) are all essentially bounded; i.e., their norms are essentially bounded. ⇤ A question that needs to be noted at the outset is the well-definedness of the expected re- course function E e ! ⇥ (x,e !) ⇤ under the above setting, particularly because of the unusual bi- parameterization therein. For a general treatment of this issue, the reader can consult [76, Section 2.3], where properties of such an expected-value function are addressed for an abstract optimization based recourse function, see in particular Theorem 7.37 of the cited reference. For our bi-parameterized quadratic recourse function, we can apply [59, Proposition 2] to deduce sev- eral properties of the resource function (•,!) and its expectation. Let M (x,!) be the argmin associated with (x,!). It then holds: the sets QM (x,!) and ⇥ f(!)+G(!)x ⇤ M (x,!) are singletons for allx2⌅ under the relatively completely recourse assumption; moreover, for almost all ! 2⌦, the unique elements in these sets are Lipschitz continuous functions on X which is compact by assumption; moreover the expectation of the !-dependent Lipschitz constant is finite by assumption (C). The same Lipschitz property holds for (•,!). Therefore, throughout the rest of the chapter, it is justified to take the expected recourse function E e ! ⇥ (x,e !) ⇤ to be well defined, finite, and continuous [76, Theorem 7.43] for all x2X. 3.1.2 Examples Example 1: Two-stage shipment planning with pricing 49 ThisexamplewasdiscussedinBertsimasandKallus[6]. Weconsideroneproductinanetwork of M facility locations and N retailer stores. The decision-making problem has two stages. In the first stage, we determine the product’s price p and choose the amount x i to produce and store at each facility location i for i=1,...,M at the cost of c 1i per unit produced. After the demand d j at each location j for j=1,...,N realizes, in the second stage, we have an option of having the last-minute production y i it the factory i at a cost c 2i >c 1i per unit produced and we decide the units z ij to be shipped from facility location i to location j with the cost s ij per unit shipped. The lost demands su↵er a penalty c 3 per unit and the leftover products have a storage cost c 4 per unit. Suppose that the customer demand is random, and is denoted d j at the customer location j. The demand quantity is approximately linearly dependent on the price p, that is d j = ↵ j (e !)p+ j (e !)where {↵ j (e !)} and { j (e !)} are random coecients independent of p. The corresponding two-stage stochastic programming problem is constructed as follows. minimize x,p c > 1 x+E e ! [h(x,p,e !)] subject to x0,p0, where h(x,p,e !) is a recourse function satisfying h(x,p,e !)= minimize y,z c > 2 y+ P M i=1 P N j=1 (s ij p)z ij subject to P M i=1 z ij ↵ j (!)p+ j (!),j=1,...,N P N j=1 z ij x i +y i ,i=1,...,M y0, z0 Example 2: Economic dispatch problem Another example arises in power systems planning, where there is a significant proportion of renewable energy in the system. 
Example 2: Economic dispatch problem

Another example arises in power systems planning where there is a significant proportion of renewable energy in the system. For such systems, the two-stage paradigm of stochastic programming reflects a planning process in which electricity production from traditional thermal plants is pre-planned in the first stage, whereas backup fast-ramping generators supplement production after the output of the renewable sources (wind and solar) is observed. Production costs in the second stage pertain to the fuel costs of the fast-ramping generators, and these unit costs depend on the rate of production needed to meet demand. In power system planning, it is customary to go beyond constant (or even random) unit costs to affine unit costs, which reflect the ramp rates required to meet demand. Because ramping rates depend not only on the observed renewable production but also on the shortfall due to the first-stage thermal power decisions, the second-stage unit costs are approximated by parameters that depend on the random renewable generation as well as on the first-stage thermal generation. With the growth of renewable energy in many jurisdictions, such modeling tools will become ever more appropriate in future generations of power system planning problems.

3.1.3 Literature review

The present research is built on the recent work [56], where it was shown that the value function $\psi(\cdot,\omega)$ is a difference-of-convex (dc) function for fixed $\omega$. This class of nonconvex functions has a long history; an early work is the unpublished manuscript [74], and a brief history is documented in the former reference. Thus the recourse function $E_{\tilde\omega}[\psi(\cdot,\tilde\omega)]$ and the combined objective function $\zeta$ are also dc. Hence, in principle, the difference-of-convex algorithm (DCA) [83, 1] could be applied to solve the two-stage SP (3.1). Nevertheless, there are several major concerns with a direct application of this DCA approach. One: the dc decomposition of the value function $\psi(\cdot,\omega)$ given in [56] is only of conceptual value and is practically unsuitable for computation. Two: some kind of discretization is needed to approximate the expectation operator. In this vein, advances in SP methodology can handle the latter task; techniques such as sample average approximation (SAA), stochastic decomposition, and stochastic approximation can all be applied. However, these techniques should be combined with suitable convex approximations of (3.1) so that the resulting solution procedure becomes practically implementable and not merely of theoretical interest. Another concern is that the convergence of the DCA pertains to a "critical point" of the dc function being minimized; in turn, the definition of such a point depends on a given dc decomposition of the function, in the absence of which it is not clear what kind of limit one can expect from an iterative method. The goal of this work is to develop implementable successive-SAA-based procedures to address these issues. In doing so, we acknowledge that the proposed SAA-based procedure may not be the most efficient way to solve the problem (3.1) in practice; nevertheless, in light of the lack of previous attempts to rigorously study a bi-parameterized second-stage recourse, it is hoped that our proposal will stimulate further research on this problem, which is significantly different from much of the existing research in computational two-stage SP.

3.2 Known Concepts of Stationarity

In the literature on dc programming, three basic types of stationary solutions have received the most attention: directional-derivative-based stationarity [58], convex-analysis-based Clarke stationarity [17], and the dc-decomposition-based critical-point concept [83].
Definition 3.1. Let $g : Z \to \mathbb{R}$ be a function defined on an open set $Z\subseteq\mathbb{R}^n$.

(a) The one-sided directional derivative of $g$ at $\bar{z}\in Z$ along the direction $d\in\mathbb{R}^n$ is

$$g'(\bar{z};d) \triangleq \lim_{\tau\downarrow 0}\frac{g(\bar{z}+\tau d)-g(\bar{z})}{\tau},$$

if the limit exists; $g$ is said to be directionally differentiable at $\bar{z}\in Z$ if $g'(\bar{z};d)$ exists for all $d\in\mathbb{R}^n$.

(b) The Clarke directional derivative of $g$ at $\bar{z}\in Z$ along the direction $d\in\mathbb{R}^n$ is

$$g^{\circ}(\bar{z};d) \triangleq \limsup_{z\to\bar{z},\ \tau\downarrow 0}\frac{g(z+\tau d)-g(z)}{\tau},$$

which is finite when $g$ is Lipschitz continuous near $\bar{z}$.

(c) We say that the function $g$ is Clarke regular at $\bar{z}\in Z$ if $g$ is directionally differentiable at $\bar{z}$ and $g'(\bar{z};d) = g^{\circ}(\bar{z};d)$ for all $d\in\mathbb{R}^n$. □

Clearly, $g'(\bar{z};d) \le g^{\circ}(\bar{z};d)$ for all $d\in\mathbb{R}^n$. The Clarke subdifferential of $g$ at $\bar{z}$ is the set $\partial_C g(\bar{z}) \triangleq \{v \mid g^{\circ}(\bar{z};d) \ge v^\top d\ \text{for all}\ d\in\mathbb{R}^n\}$. For a convex function $g$, $\partial_C g$ coincides with the subdifferential $\partial g$ of convex analysis. Based on Definition 3.1, we define d(irectional)-stationarity and C(larke)-stationarity as follows, and we also include the concept of a critical point of a dc function.

Definition 3.2. Let $f : Z \to \mathbb{R}$ be a locally Lipschitz and directionally differentiable function defined on an open set $Z\subseteq\mathbb{R}^n$ containing the closed convex set $X$. A vector $\bar{x}\in X$ is said to be a

(a) d(irectional)-stationary point of $f$ on $X$ if $f'(\bar{x};x-\bar{x}) \ge 0$ for all $x\in X$;

(b) C(larke)-stationary point of $f$ on $X$ if $f^{\circ}(\bar{x};x-\bar{x}) \ge 0$ for all $x\in X$, or equivalently, if $0\in\partial_C f(\bar{x}) + \mathcal{N}(\bar{x};X)$, where $\mathcal{N}(\bar{x};X)$ is the normal cone of $X$ at $\bar{x}$.

If $f = g - h$ is a dc function with $g$ and $h$ convex, then $\bar{x}\in X$ is said to be a critical point of $f$ on $X$ if $0 \in \big[\partial g(\bar{x}) - \partial h(\bar{x})\big] + \mathcal{N}(\bar{x};X)$. □

It is clear that d-stationarity $\Rightarrow$ C-stationarity $\Rightarrow$ criticality, the latter for a dc function. For a dc function, we have the following equivalent conditions for d-stationarity.

Proposition 3.1. Let $f = g - h$ with $g$ and $h$ both convex functions defined on an open set containing the closed convex set $X$. The following statements are equivalent for a vector $\bar{x}\in X$:

(a) $\bar{x}$ is a d-stationary point of $f$ on $X$;

(b) $g'(\bar{x};x-\bar{x}) \ge v^\top(x-\bar{x})$ for all $v\in\partial h(\bar{x})$ and all $x\in X$;

(c) $\bar{x} \in \operatorname{argmin}_{x\in X}\big(g(x)-v^\top x\big)$ for all $v\in\partial h(\bar{x})$;

(d) $\partial h(\bar{x}) \subseteq \partial g(\bar{x}) + \mathcal{N}(\bar{x};X)$. □

The above equivalent statements are helpful (albeit not easy to apply) for verifying d-stationarity of a convex-constrained dc program. There is a similar set of equivalent statements for a critical point, in which we replace "for all $v$" by "there exists $v$" in (b) and (c), and the inclusion in (d) by a nonempty intersection between the left- and right-hand sets. The practical value of statement (c) is that it reduces the verification of d-stationarity (or criticality) of a given feasible vector to the solution of one (for criticality) or possibly infinitely many (for d-stationarity) convex programs; the latter feature makes the computation of a certifiably d-stationary solution a challenging task in general. We note that if $\partial h(\bar{x})$ is a singleton, then each of the four statements (a)-(d) in Proposition 3.1 is equivalent to $\bar{x}$ being a critical point of $f$ on $X$. The difficult case, in terms of computing a d- or C-stationary point of a dc function, is when the function $h$ is not differentiable at the point in question. This difficulty persists in the problem (3.1) when $Q$ is positive semidefinite.
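The gap between criticality and d-stationarity is visible already in one dimension (a standard textbook-style example of ours, not from the chapter):

```latex
% f = g - h with g(x) = x^2, h(x) = |x|, on X = [-1, 1].
% At xbar = 0: \partial g(0) = {0}, \partial h(0) = [-1, 1], N(0; X) = {0}.
\[
  0 \in \partial g(0) - \partial h(0) + \mathcal{N}(0; X) = [-1, 1]
  \quad\Longrightarrow\quad \bar{x} = 0 \text{ is a critical point;}
\]
\[
  f'(0; d) = g'(0; d) - h'(0; d) = 0 - |d| < 0 \ \ (d \neq 0)
  \quad\Longrightarrow\quad \bar{x} = 0 \text{ is not d-stationary.}
\]
```

In this example the only d-stationary points are the two global minimizers $\bar{x}=\pm\tfrac12$, so d-stationarity filters out the spurious kink at the origin that criticality accepts.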
While the reference [56] has demonstrated the dc property of the value function of a bi-parameterized convex quadratic program, a dc decomposition of this function is quite involved except in the case where $Q$ is positive definite. Indeed, in the latter case, completing the square and passing to the dual gives

$$\psi(x,\omega) = \underset{y\in Y(x,\omega)}{\text{minimum}}\left[\tfrac{1}{2}\big(y+Q^{-1}(f(\omega)+G(\omega)x)\big)^\top Q\big(y+Q^{-1}(f(\omega)+G(\omega)x)\big) - \tfrac{1}{2}\big(f(\omega)+G(\omega)x\big)^\top Q^{-1}\big(f(\omega)+G(\omega)x\big)\right]$$
$$= \underset{\lambda\ge 0}{\text{maximum}}\left[-\tfrac{1}{2}\lambda^\top DQ^{-1}D^\top\lambda - \tfrac{1}{2}\big(f(\omega)+G(\omega)x\big)^\top Q^{-1}\big(f(\omega)+G(\omega)x\big) + \lambda^\top\Big(\xi(\omega)-A(\omega)x+DQ^{-1}\big(f(\omega)+G(\omega)x\big)\Big)\right],$$

where the last equality is by duality. Defining

$$\psi_1(x,\omega) \triangleq \underset{\lambda\ge 0}{\text{maximum}}\left[-\tfrac{1}{2}\lambda^\top DQ^{-1}D^\top\lambda + \lambda^\top\Big(\xi(\omega)-A(\omega)x+DQ^{-1}\big(f(\omega)+G(\omega)x\big)\Big)\right],$$
$$\psi_2(x,\omega) \triangleq \tfrac{1}{2}\big(f(\omega)+G(\omega)x\big)^\top Q^{-1}\big(f(\omega)+G(\omega)x\big), \quad (3.4)$$

both of which are convex functions in $x$ for fixed $\omega$, we deduce $\psi(x,\omega) = \psi_1(x,\omega) - \psi_2(x,\omega)$, obtaining the dc decomposition of the value function $\psi(\cdot,\omega)$. Note that $\psi_1(\cdot,\omega)$ is piecewise linear-quadratic, thus generally not differentiable, while $\psi_2(\cdot,\omega)$ is quadratic and thus differentiable. The two-stage stochastic program (3.1) thus becomes the following dc program with an expected-value objective function:

$$\underset{x\in X}{\text{minimize}}\quad \zeta(x) = \varphi(x) + E_{\tilde\omega}\big[\psi_1(x,\tilde\omega)\big] - E_{\tilde\omega}\big[\psi_2(x,\tilde\omega)\big]. \quad (3.5)$$

Under the essential boundedness assumption (C) on the random functions and the positive definiteness of $Q$, it follows that $E_{\tilde\omega}[\psi_2(\cdot,\tilde\omega)]$ is a convex quadratic function. In general, the objective function $\zeta(\cdot)$ in (3.5) is nonconvex and nondifferentiable.
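The decomposition (3.4) is easy to verify numerically for the positive definite case: solve the primal QP (3.2) and compare with $\psi_1-\psi_2$ computed from (3.4). The check below is a sketch of ours using cvxpy with arbitrary data constructed so that $Y(x,\omega)$ is nonempty:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
n1, n2, ell = 3, 4, 5
Q = np.diag(rng.uniform(1.0, 2.0, n2))            # positive definite
D = rng.standard_normal((ell, n2))
f, G = rng.standard_normal(n2), rng.standard_normal((n2, n1))
A = rng.standard_normal((ell, n1))
x = rng.standard_normal(n1)
xi = A @ x + D @ rng.standard_normal(n2) - rng.uniform(0.5, 1.5, ell)  # feasible

# primal value psi(x, omega) from (3.2)
y = cp.Variable(n2)
psi = cp.Problem(cp.Minimize((f + G @ x) @ y + 0.5 * cp.quad_form(y, Q)),
                 [A @ x + D @ y >= xi]).solve()

# psi_1 from the dual program in (3.4), psi_2 in closed form
Qinv = np.linalg.inv(Q)
P = D @ Qinv @ D.T
P = 0.5 * (P + P.T)                               # symmetrize for quad_form
lam = cp.Variable(ell, nonneg=True)
q = xi - A @ x + D @ (Qinv @ (f + G @ x))
psi1 = cp.Problem(cp.Maximize(-0.5 * cp.quad_form(lam, P) + q @ lam)).solve()
psi2 = 0.5 * (f + G @ x) @ Qinv @ (f + G @ x)
print(f"psi = {psi:.6f}   psi1 - psi2 = {psi1 - psi2:.6f}")  # should agree
```

Strong duality for the convex QP (Slater's condition holds by construction of $\xi$) is what makes the two printed values coincide up to solver tolerance.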
3.3 The PD Case: via the dc Program (3.5)

Based on the dc formulation (3.5), we can introduce an iterative algorithm for computing a d-stationary point of the two-stage SP (3.1) in the case when the symmetric matrix $Q$ is positive definite. The algorithm is an adaptation of the DCA, but with both expectations in the objective function $\zeta(x)$ approximated by sample averages; thus the overall algorithm combines the underlying principle of the DCA with the SAA scheme. Almost-sure subsequential convergence to a d-stationary point will be established.

Before presenting the algorithm, we make some general remarks about the proposed solution approach. As a first proposal of an implementable approach supported by a convergence analysis for solving the two-stage SP (3.1) with a bi-parameterized quadratic recourse function, we reckon that much further research is needed to render the proposed algorithm practically effective. For example, the question of how the sampled subproblems can be solved efficiently should be addressed; since each of these SAA subproblems has a strongly convex objective that is the sum of a smooth function and a piecewise linear-quadratic function, one possible solution avenue is a semismooth Newton method applied to the dual programs. Furthermore, we expect that other approximations of the expectation operator may be possible, such as stochastic approximation and/or stochastic decomposition. All of these refinements and alternatives need to be carefully investigated in a future study.

In order to broaden our discussion beyond the two-stage SP with a linearly bi-parameterized strictly convex recourse function, we focus on the abstract dc formulation (3.5) and introduce two assumptions, generalized from the specialized conditions (A), (B), and (C) for the SP (3.1), which are satisfied in the special case of a positive definite $Q$. Specifically, we assume:

(A') $X$ is a compact convex set and $\varphi$ is a convex and continuously differentiable function on an open set $\Xi$ containing $X$; moreover, the functions $\psi_i(\cdot,\omega)$ for $i=1,2$ are convex (and thus directionally differentiable) on $\Xi$ for almost all $\omega\in\Omega$; moreover, $\psi_2(\cdot,\omega)$ is continuously differentiable on $\Xi$ for almost all $\omega\in\Omega$;

(B') the two expectation functions $E_{\tilde\omega}[\psi_i(x,\tilde\omega)]$ for $i=1,2$ are finite for all $x\in X$; moreover, the functions $\psi_i(\cdot,\omega)$ for $i=1,2$ are Lipschitz continuous on $\Xi$ with a uniform Lipschitz constant $\mathrm{Lip}_\psi$; i.e., for $i=1,2$ and almost all $\omega\in\Omega$,

$$|\psi_i(x,\omega) - \psi_i(x',\omega)| \le \mathrm{Lip}_\psi\,\|x-x'\| \quad \text{for all } x,x'\in\Xi. \quad (3.6)$$

The continuous differentiability of the function $\psi_2(\cdot,\omega)$ is assumed to simplify the analysis somewhat. Although we expect an extended analysis can be made, in particular by using the results from [80] concerning the limits of subdifferentials of random functions, we refrain from this relaxed setting for two reasons. One: the differentiability assumption is satisfied in the case of a positive definite $Q$ and in the two special positive semidefinite cases to be described in the next section. Two: the framework (3.5), with such an explicit computationally viable dc decomposition, is not applicable to the general case of a positive semidefinite $Q$; treated in full in Section 3.5, the latter case requires a new stationarity concept and deeper analysis.

The global uniform Lipschitz assumption (B') implies that for almost all $\omega\in\Omega$,

$$\sup_{x\in X}\ \max\Big(\max_{a\in\partial_x\psi_1(x,\omega)}\|a\|,\ \|\nabla_x\psi_2(x,\omega)\|\Big) \;\le\; \mathrm{Lip}_\psi. \quad (3.7)$$

Under assumptions (A') and (B'), we can readily employ all the standard probabilistic results in our subsequent analysis, including the measurability of primal and dual optimal solutions to the relevant convex programs in the development, and the interchangeability of expectation with limits, subgradients, derivatives, and the subdifferential multifunction of convex functions. These subjects are covered extensively in the classic papers [27, 46] and form the foundation of much of the convergence analysis and statistical inference of the SAA scheme. For ease of reference, we summarize these technical prerequisites in the lemma below and freely employ them in the subsequent analysis. For a proof of the lemma, see [76, Theorems 7.43, 7.44 and 7.47] and the previous two references for more details.

Lemma 3.1. Suppose that $\tilde\omega$ is a random variable defined on a probability space $(\Omega,\mathcal{F},P)$ and $H(\cdot,\omega)$ is a random lower semicontinuous function. We define a finite-valued expectation function $h(\cdot) \triangleq E_{\tilde\omega}[H(\cdot,\tilde\omega)]$ on an open set $S\subseteq\mathbb{R}^n$. The following statements hold at any $\bar{x}\in S$:

(a) If there exists a positive-valued random variable $C(\omega)$ with $E_{\tilde\omega}[C(\tilde\omega)]<+\infty$ such that for any two points $x^1$ and $x^2$ in a neighborhood of $\bar{x}\in S$ and almost every $\omega\in\Omega$ the inequality $|H(x^1,\omega)-H(x^2,\omega)| \le C(\omega)\|x^1-x^2\|$ holds, then $h$ is Lipschitz continuous in a neighborhood of $\bar{x}$.

(b) If, in addition to the assumption in (a), $H(\cdot,\omega)$ is directionally differentiable at $\bar{x}$ for almost all $\omega\in\Omega$, then the expected-value function $h$ is directionally differentiable at $\bar{x}$ and the directional derivative satisfies $h'(\bar{x};d) = E_{\tilde\omega}\big[H'_x(\bar{x},\tilde\omega;d)\big]$ for all $d\in\mathbb{R}^n$, where $H'_x(\bar{x},\omega;d)$ denotes the directional derivative of $H(\cdot,\omega)$ at $\bar{x}$ along $d$.

(c) If, in addition to the assumption in (a), $H(\cdot,\omega)$ is differentiable at $\bar{x}$ for almost all $\omega\in\Omega$, then the expected-value function $h$ is differentiable at $\bar{x}$ and the gradient satisfies $\nabla h(\bar{x}) = E_{\tilde\omega}\big[\nabla_x H(\bar{x},\tilde\omega)\big]$.

(d) If $H(\cdot,\omega)$ is convex for almost every $\omega\in\Omega$, then $\partial h(\bar{x}) = E_{\tilde\omega}\big[\partial_x H(\bar{x},\tilde\omega)\big]$. □
Before presenting the combined DCA-and-SAA scheme for solving the stochastic dc program (3.5), we introduce some subgradients whose existence is guaranteed by part (d) of Lemma 3.1 under assumptions (A') and (B'). For an arbitrary vector $x\in X$ and any subgradient $v(x)$ in $\partial E_{\tilde\omega}[\psi_1(x,\tilde\omega)]$, there exist subgradients $u(x,\omega)$ in $\partial_x\psi_1(x,\omega)$ for almost all $\omega\in\Omega$ such that $v(x) = E_{\tilde\omega}[u(x,\tilde\omega)]$. By the uniform bound (3.7), these subgradients $u(\cdot,\omega)$ satisfy

$$V_1 \triangleq \max_{x\in X}\ E_{\tilde\omega'}\Big\|E_{\tilde\omega}\big[u(x,\tilde\omega)\big] - u(x,\tilde\omega')\Big\|^2 < \infty. \quad (3.8)$$

Similarly, we have

$$V_2 \triangleq \max_{x\in X}\ E_{\tilde\omega'}\Big\|E_{\tilde\omega}\big[\nabla_x\psi_2(x,\tilde\omega)\big] - \nabla_x\psi_2(x,\tilde\omega')\Big\|^2 < \infty. \quad (3.9)$$

These constants $V_1$ and $V_2$ are used in the proof of Theorem 3.1. For a given vector $x^0\in X$, using the gradient $\nabla_x\psi_2(x^0,\omega)$ to define the "linearization" of the function $\psi_2(\cdot,\omega)$ at the vector $x^0$, we define the semi-linearization of the function $\zeta$ at $x^0$ by

$$\hat\zeta(x;x^0) \triangleq \varphi(x) + E_{\tilde\omega}\big[\psi_1(x,\tilde\omega)\big] - E_{\tilde\omega}\big[\hat\psi_2(x,\tilde\omega;x^0)\big], \quad \text{with}\quad \hat\psi_2(x,\omega;x^0) \triangleq \psi_2(x^0,\omega) + \nabla_x\psi_2(x^0,\omega)^\top(x-x^0). \quad (3.10)$$

Note that $\hat\zeta(\cdot;x^0)$ is a convex function on $X$; moreover, we have $\zeta(x^0) = \hat\zeta(x^0;x^0)$ and $\zeta(x) \le \hat\zeta(x;x^0)$ for all $x\in X$.

For solving (3.1), we employ independent and identically distributed (iid) samples of the random variable $\tilde\omega$ to approximate the expectations in the above convex majorization $\hat\zeta(x;x^0)$ by their sample averages. We present the algorithm as follows.

Algorithm 1

- (Initialization) Let $\{L_\nu\}_{\nu=0}^{\infty}$ be a sequence of positive integers. Let $\gamma>0$ be a given scalar. Let an initial feasible vector $\tilde{x}^0\in X$ be given. Set $\nu=0$.

- At iteration $\nu$, generate iid samples $\{\omega^{\nu,i}\}_{i=1}^{L_\nu}$ that are also independent of those in past iterations. Generate the next iterate

$$\tilde{x}^{\nu+1} = \operatorname*{argmin}_{x\in X}\Bigg[\underbrace{\varphi(x) + \frac{1}{L_\nu}\sum_{i=1}^{L_\nu}\psi_1(x,\omega^{\nu,i}) - \frac{1}{L_\nu}\sum_{i=1}^{L_\nu}\hat\psi_2(x,\omega^{\nu,i};\tilde{x}^\nu)}_{\text{denoted }\hat\zeta_\nu(x;\tilde{x}^\nu)} \;+\; \frac{1}{2\gamma}\,\|x - \tilde{x}^\nu\|^2\Bigg]. \quad (3.11)$$

In classical stochastic gradient algorithms for convex optimization problems, one may either choose a fixed batch size (i.e., the integer $L_\nu$ at each iteration) with decreasing step sizes (given by iteration-dependent $\gamma_\nu$), in the spirit of Robbins-Monro stochastic approximation [64], or an increasing batch size with a constant step size. For a nonconvex problem, however, the choice is not so clear-cut. According to our subsequent analysis, diminishing step sizes could allow the accumulated error to become unbounded, making it difficult to establish convergence of the algorithm. Thus, here and in the subsequent algorithms, the step size $\gamma$ is assumed to be a fixed positive constant and the sample sizes $\{L_\nu\}$ are chosen as an unbounded sequence satisfying a summability condition. The last term $\frac{1}{2\gamma}\|x-\tilde{x}^\nu\|^2$ in (3.11) is a proximal term added to strongly convexify the convex function $\hat\zeta_\nu(\cdot;\tilde{x}^\nu)$. Dependent on the selected samples and the immediate past iterate, each new iterate $\tilde{x}^{\nu+1}$ is a measurable random vector [27].
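Schematically, the sampling loop of Algorithm 1 looks as follows (a sketch of ours; `solve_subproblem` is a user-supplied routine returning the unique minimizer of the strongly convex program (3.11), and the cubic growth of $L_\nu$ is one convenient choice meeting the summability condition $\sum_\nu 1/\sqrt{L_\nu}<\infty$ required in Theorem 3.1 below):

```python
import numpy as np

def dca_saa(x0, solve_subproblem, sampler, gamma=1.0, num_iters=50, L0=10):
    """Sketch of Algorithm 1: at iteration nu, draw L_nu fresh iid samples,
    linearize psi_2 at the incumbent, and solve the strongly convex SAA
    subproblem (3.11).  With L_nu = L0 * (nu+1)^3, the series
    sum_nu 1/sqrt(L_nu) = sum_nu (nu+1)^{-3/2} / sqrt(L0) is finite."""
    x = np.asarray(x0, dtype=float)
    for nu in range(num_iters):
        L_nu = L0 * (nu + 1) ** 3
        samples = sampler(L_nu)          # iid, independent of past iterations
        x = solve_subproblem(x, samples, gamma)
    return x
```

The structure makes the two approximations of the proof visible: the DCA linearization enters through the subproblem built at the incumbent, and the SAA error enters through the per-iteration batch whose size grows fast enough for the errors to be summable.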
Adopting the set-up in [41], we let ⌦ L k denote the L k -fold Cartesian product of the sample space⌦; let P ⌫ be a probability measure on b ⌦ ⌫ , ⌫ Y k=1 ⌦ L k and E ⌫ be the expectationoperatorinducedbyP ⌫ . LetF ⌫ denotethesigma-algebrageneratedbysubsetsof b ⌦ ⌫ so that the family{F ⌫ } ⌫ is a filtration on the probability space (⌦ ,A,P). Let b ⌦ 1 , 1 Y k=1 ⌦ L k and P 1 denotethecorrespondingprobabilitydistributionon b ⌦ 1 . LetE 1 betheexpectationoperator induced byP 1 . Notice ifZ is a random variable dependent only on events in the iterations up to ⌫ , i.e., defined on the sigma-algebra F ⌫ ,then E 1 [Z]= E ⌫ [Z]. Finally, let E ⌫ [•|F ⌫ 1 ] denote the conditional expectation on the probability space ⇣ b ⌦ ⌫ ,F ⌫ ,P ⌫ ⌘ given the -algebra F ⌫ 1 . The convergence analysis of Algorithm 1 requires several lemmas. The first one deals with the prox-operator of a (finite-valued) convex function g defined as: Prox X ,g (x) , argmin z2 X g(z)+ 1 2 kxzk 2 . (3.12) The lemma below asserts a basic property of this operator. 60 Lemma 3.2. [5, Section 27.1] Let g be a convex function defined on an open set Z containing the closed convex set X.Then, p=Prox X ,g (x) () g(p)g(x 0 ) 1 (px 0 ) T (px) for any x2Z, x 0 2X, and> 0. ⇤ The next two lemmas are elementary and standard in this kind of convergence analysis. The first one is about the convergence of quasi-F´ ejer monotone sequences; the second one is about the continuous convergence of real-valued functions. Lemma 3.3. [5, Lemma 5.31] Let {a k } and {b k } be two nonnegative sequences of scalars and {⇠ k } be a sequence such that 1 X k=1 ⇠ k exists. If for any k0, a k+1 a k b k+1 +⇠ k+1 , then 1 X k=1 b k <1 and lim k!1 a k exists. ⇤ Lemma 3.4. [76, Proposition 5.1] Let g and {g N } 1 N=1 be (deterministic) real-valued functions defined on a closed set Z. The following two properties are equivalent: (i)forany ¯ x2Z andanysequence{x N }⇢ Z convergingto ¯ x, itholdsthat lim N!1 g N (x N )= g(¯ x); (ii) g is continuous on Z and g N converges uniformly to g on any compact subset of Z. ⇤ Employing the above lemmas, we now state and prove the almost-sure subsequential conver- gence of a sequence produced by Algorithm 1. Theorem 3.1. Suppose that assumptions (A 0 ) and (B 0 ) hold. If the sequence of sample sizes {L ⌫ } satisfies 1 X ⌫ =1 1 p L ⌫ < 1, then starting at an (arbitrary) feasible point e x 0 2X, Algorithm 1 generates a well-defined sequence {e x ⌫ } of iterates which has at least one accumulation point; morever, any such point is a d-stationary solution of the stochastic dc program (3.5) with P 1 - probability 1. ⇤ 61 Proof. In Algorithm 1, each subproblem uses two types of approximations for the original ob- jective function: one is the linear approximation of the concave summand and the other is the sample average approximation for both expectations. Based on this observation, we analyze the connection of two consecutive iterates e x ⌫ +1 and e x ⌫ in two steps by constructing an intermediate iterate e x ⌫ +1/2 . Specifically, we construct the latter iterate by only using the linear approximation of the concave summand while maintaining the expectation functionals: e x ⌫ +1/2 , argmin x2 X b ⇣ (x;e x ⌫ )+ 1 2 kxe x ⌫ k 2 , (3.13) where b ⇣ (x;x 0 ), '(x)+E e ! ⇥ 1 (x,e !) ⇤ E e ! h b 2 (x,e !;x 0 ) i . On one hand, e x ⌫ +1/2 is connected with e x ⌫ in two ways: the linearization at e x ⌫ and regularization via the proximal map. 
On the other hand, $\tilde{x}^{\nu+1}$ is the unique optimal solution of the strongly convex program (3.11), which is the SAA approximation of the convex stochastic program (3.13) that defines $\tilde{x}^{\nu+1/2}$. With the above preparation, the rest of the proof is organized as follows.

1. By Lemma 3.2, we give the relation between the two iterates $\tilde{x}^\nu$ and $\tilde{x}^{\nu+1/2}$.
2. We use the constants $V_1$ and $V_2$ defined in (3.8) and (3.9), respectively, to relate the two iterates $\tilde{x}^{\nu+1}$ and $\tilde{x}^{\nu+1/2}$.
3. Combining the above two relations, we derive an (inexact) descent inequality for the objective value $\zeta(\tilde{x}^{\nu+1})$, with the error dependent on the sample sizes. We then apply Lemma 3.3 to deduce that the sequence of objective values $\{\zeta(\tilde{x}^{\nu+1})\}$ converges and also that $\sum_{\nu=1}^{\infty} \|\tilde{x}^{\nu+1}-\tilde{x}^\nu\|^2$ is finite with $\mathbb{P}_\infty$-probability 1.
4. Finally, we prove the desired convergence asserted by the theorem by analyzing the limit property of the optimality condition in the update (3.11).

Before carrying out the above steps, we note that, under the assumptions, the functions $\psi_1(\bullet,\omega)$, $\psi_2(\bullet,\omega)$, and $\nabla_x\psi_2(\bullet,\omega)$ are all continuous and bounded by some integrable random variables. By the uniform law of large numbers, the sample average functions converge uniformly to the expectation functions with $\mathbb{P}_\infty$-probability 1 as follows:
$$\lim_{\nu\to\infty} \max_{x\in X}\Big|\tfrac{1}{L_\nu}\textstyle\sum_{i=1}^{L_\nu}\psi_1(x,\omega^{\nu,i}) - \mathbb{E}_{\tilde\omega}[\psi_1(x,\tilde\omega)]\Big| = 0, \quad \lim_{\nu\to\infty} \max_{x\in X}\Big|\tfrac{1}{L_\nu}\textstyle\sum_{i=1}^{L_\nu}\psi_2(x,\omega^{\nu,i}) - \mathbb{E}_{\tilde\omega}[\psi_2(x,\tilde\omega)]\Big| = 0, \quad \lim_{\nu\to\infty} \max_{x\in X}\Big\|\tfrac{1}{L_\nu}\textstyle\sum_{i=1}^{L_\nu}\nabla_x\psi_2(x,\omega^{\nu,i}) - \mathbb{E}_{\tilde\omega}[\nabla_x\psi_2(x,\tilde\omega)]\Big\| = 0, \qquad (3.14)$$
where, from here on, it is understood that the convergence probability is with respect to $\mathbb{P}_\infty$ unless otherwise specified.

Step 1. With $g = \varphi + \mathbb{E}_{\tilde\omega}[\psi_1(\bullet,\tilde\omega)]$ and $x = \tilde{x}^\nu + \gamma\,\mathbb{E}_{\tilde\omega}[\nabla_x\psi_2(\tilde{x}^\nu,\tilde\omega)]$, we have $\tilde{x}^{\nu+1/2} = \operatorname{Prox}_{X,\gamma g}(x)$. Thus, with $x' = \tilde{x}^\nu$, Lemma 3.2 yields
$$\Big(\varphi(\tilde{x}^{\nu+1/2}) + \mathbb{E}_{\tilde\omega}[\psi_1(\tilde{x}^{\nu+1/2},\tilde\omega)]\Big) - \Big(\varphi(\tilde{x}^\nu) + \mathbb{E}_{\tilde\omega}[\psi_1(\tilde{x}^\nu,\tilde\omega)]\Big) \;\le\; -\frac{1}{\gamma}\,\|\tilde{x}^{\nu+1/2}-\tilde{x}^\nu\|^2 + \big(\tilde{x}^{\nu+1/2}-\tilde{x}^\nu\big)^\top \mathbb{E}_{\tilde\omega}[\nabla_x\psi_2(\tilde{x}^\nu,\tilde\omega)], \qquad (3.15)$$
which establishes a connection between the two iterates $\tilde{x}^{\nu+1/2}$ and $\tilde{x}^\nu$.

Step 2. Given $\tilde{x}^\nu$ and the samples $\{\{\omega^{k,i}\}_{i=1}^{L_k}\}_{k=1}^{\nu-1}$, $\tilde{x}^{\nu+1}$ is an SAA approximation of $\tilde{x}^{\nu+1/2}$ using the samples $\{\omega^{\nu,i}\}_{i=1}^{L_\nu}$ at iteration $\nu$. By the optimality conditions of these two iterates, $\tilde{x}^{\nu+1}$ and $\tilde{x}^{\nu+1/2}$, there exists $\tilde{v}^{\nu+1/2} \in \partial\,\mathbb{E}_{\tilde\omega}[\psi_1(\tilde{x}^{\nu+1/2},\tilde\omega)]$ so that, using the interchangeability of (sub)gradient and expectation, we have
$$(\tilde{x}^{\nu+1}-\tilde{x}^{\nu+1/2})^\top\Big[\nabla\varphi(\tilde{x}^{\nu+1/2}) + \tilde{v}^{\nu+1/2} - \mathbb{E}_{\tilde\omega}[\nabla_x\psi_2(\tilde{x}^\nu,\tilde\omega)] + \tfrac{1}{\gamma}(\tilde{x}^{\nu+1/2}-\tilde{x}^\nu)\Big] \;\ge\; 0,$$
and, for some $u(\tilde{x}^{\nu+1},\omega^{\nu,i}) \in \partial_x\psi_1(\tilde{x}^{\nu+1},\omega^{\nu,i})$,
$$(\tilde{x}^{\nu+1/2}-\tilde{x}^{\nu+1})^\top\Big[\nabla\varphi(\tilde{x}^{\nu+1}) + \tfrac{1}{L_\nu}\textstyle\sum_{i=1}^{L_\nu} u(\tilde{x}^{\nu+1},\omega^{\nu,i}) - \tfrac{1}{L_\nu}\textstyle\sum_{i=1}^{L_\nu}\nabla_x\psi_2(\tilde{x}^\nu,\omega^{\nu,i}) + \tfrac{1}{\gamma}(\tilde{x}^{\nu+1}-\tilde{x}^\nu)\Big] \;\ge\; 0.$$
Define
$$B_\nu \,\triangleq\, \mathbb{E}_{\tilde\omega}[\nabla_x\psi_2(\tilde{x}^\nu,\tilde\omega)] - \tfrac{1}{L_\nu}\textstyle\sum_{i=1}^{L_\nu}\nabla_x\psi_2(\tilde{x}^\nu,\omega^{\nu,i}), \qquad C_\nu \,\triangleq\, \tfrac{1}{L_\nu}\textstyle\sum_{i=1}^{L_\nu} u(\tilde{x}^{\nu+1/2},\omega^{\nu,i}) - \mathbb{E}_{\tilde\omega}[u(\tilde{x}^{\nu+1/2},\tilde\omega)].$$
Adding the two inequalities, using the monotonicity of the gradient $\nabla\varphi$ of the convex function $\varphi$, i.e., $(\tilde{x}^{\nu+1}-\tilde{x}^{\nu+1/2})^\top[\nabla\varphi(\tilde{x}^{\nu+1})-\nabla\varphi(\tilde{x}^{\nu+1/2})] \ge 0$, and using the fact that there exists $u(\tilde{x}^{\nu+1/2},\omega)\in\partial_x\psi_1(\tilde{x}^{\nu+1/2},\omega)$ for almost all $\omega\in\Omega$ such that $\tilde{v}^{\nu+1/2} = \mathbb{E}_{\tilde\omega}[u(\tilde{x}^{\nu+1/2},\tilde\omega)]$, we deduce:
$$0 \;\le\; (\tilde{x}^{\nu+1/2}-\tilde{x}^{\nu+1})^\top\Big[\tfrac{1}{L_\nu}\textstyle\sum_{i=1}^{L_\nu}\big(u(\tilde{x}^{\nu+1},\omega^{\nu,i}) - u(\tilde{x}^{\nu+1/2},\omega^{\nu,i})\big) + C_\nu + B_\nu\Big] - \tfrac{1}{\gamma}\|\tilde{x}^{\nu+1}-\tilde{x}^{\nu+1/2}\|^2 \;\le\; (\tilde{x}^{\nu+1/2}-\tilde{x}^{\nu+1})^\top\big[C_\nu + B_\nu\big] - \tfrac{1}{\gamma}\|\tilde{x}^{\nu+1}-\tilde{x}^{\nu+1/2}\|^2,$$
where the last inequality holds because, by the convexity of $\psi_1(\bullet,\omega^{\nu,i})$, $(\tilde{x}^{\nu+1/2}-\tilde{x}^{\nu+1})^\top[u(\tilde{x}^{\nu+1},\omega^{\nu,i}) - u(\tilde{x}^{\nu+1/2},\omega^{\nu,i})] \le 0$. Hence we deduce $\|\tilde{x}^{\nu+1/2}-\tilde{x}^{\nu+1}\| \le \gamma\,[\,\|C_\nu\| + \|B_\nu\|\,]$.

We bound the two summands on the right-hand side. We have
$$\mathbb{E}_\nu\big[\|C_\nu\|\big] = \mathbb{E}_{\nu-1}\Big[\mathbb{E}_\nu\Big[\big\|\tfrac{1}{L_\nu}\textstyle\sum_{i=1}^{L_\nu} u(\tilde{x}^{\nu+1/2},\omega^{\nu,i}) - \tilde{v}^{\nu+1/2}\big\| \,\Big|\, \mathcal{F}_{\nu-1}\Big]\Big] \le \mathbb{E}_{\nu-1}\Big[\Big(\mathbb{E}_\nu\Big[\big\|\tfrac{1}{L_\nu}\textstyle\sum_{i=1}^{L_\nu} u(\tilde{x}^{\nu+1/2},\omega^{\nu,i}) - \tilde{v}^{\nu+1/2}\big\|^2 \,\Big|\, \mathcal{F}_{\nu-1}\Big]\Big)^{1/2}\Big] = \mathbb{E}_{\nu-1}\Big[\Big(\tfrac{1}{L_\nu}\,\mathbb{E}_{\tilde\omega}\big\|u(\tilde{x}^{\nu+1/2},\tilde\omega) - \tilde{v}^{\nu+1/2}\big\|^2\Big)^{1/2}\Big] \le \frac{V_1^{1/2}}{L_\nu^{1/2}}.$$
Similarly, we obtain $\mathbb{E}_\nu[\|B_\nu\|] \le V_2^{1/2}/L_\nu^{1/2}$. Consequently,
$$\mathbb{E}_\nu\,\|\tilde{x}^{\nu+1/2}-\tilde{x}^{\nu+1}\| \;\le\; \frac{b}{\sqrt{L_\nu}}, \quad \text{where } b \,\triangleq\, \gamma\big(V_1^{1/2}+V_2^{1/2}\big). \qquad (3.16)$$
Hence $\mathbb{E}_\infty\big[\sum_{\nu=1}^{\infty}\|\tilde{x}^{\nu+1/2}-\tilde{x}^{\nu+1}\|\big] \le \sum_{\nu=1}^{\infty}\mathbb{E}_\nu\|\tilde{x}^{\nu+1/2}-\tilde{x}^{\nu+1}\| \le \sum_{\nu=1}^{\infty} b/\sqrt{L_\nu} < \infty$. In particular, $\sum_{\nu=1}^{\infty}\|\tilde{x}^{\nu+1/2}-\tilde{x}^{\nu+1}\|$ is finite with probability 1.

We have
$$\widehat{\zeta}(\tilde{x}^{\nu+1};\tilde{x}^\nu) - \widehat{\zeta}(\tilde{x}^{\nu+1/2};\tilde{x}^\nu) + \tfrac{1}{2\gamma}\|\tilde{x}^{\nu+1}-\tilde{x}^\nu\|^2 - \tfrac{1}{2\gamma}\|\tilde{x}^{\nu+1/2}-\tilde{x}^\nu\|^2 = \Big[\varphi(\tilde{x}^{\nu+1}) + \mathbb{E}_{\tilde\omega}[\psi_1(\tilde{x}^{\nu+1},\tilde\omega)] - \varphi(\tilde{x}^{\nu+1/2}) - \mathbb{E}_{\tilde\omega}[\psi_1(\tilde{x}^{\nu+1/2},\tilde\omega)]\Big] - \mathbb{E}_{\tilde\omega}\big[\nabla_x\psi_2(\tilde{x}^\nu,\tilde\omega)\big]^\top\big(\tilde{x}^{\nu+1}-\tilde{x}^{\nu+1/2}\big) + \tfrac{1}{2\gamma}\big(\tilde{x}^{\nu+1}-\tilde{x}^{\nu+1/2}\big)^\top\big(\tilde{x}^{\nu+1}+\tilde{x}^{\nu+1/2}-2\tilde{x}^\nu\big) \;\le\; \Upsilon\,\|\tilde{x}^{\nu+1}-\tilde{x}^{\nu+1/2}\|$$
for some constant $\Upsilon > 0$, since every term on the right is Lipschitz in $\tilde{x}^{\nu+1}-\tilde{x}^{\nu+1/2}$ on the compact set $X$. This establishes a connection between the two iterates $\tilde{x}^{\nu+1}$ and $\tilde{x}^{\nu+1/2}$.

Step 3. Applying (3.15) and the gradient inequality $\psi_2(\tilde{x}^{\nu+1},\tilde\omega) - \psi_2(\tilde{x}^\nu,\tilde\omega) \ge (\tilde{x}^{\nu+1}-\tilde{x}^\nu)^\top\nabla_x\psi_2(\tilde{x}^\nu,\tilde\omega)$ of the convex function $\psi_2(\bullet,\tilde\omega)$ (which yields $\zeta(\tilde{x}^{\nu+1}) \le \widehat{\zeta}(\tilde{x}^{\nu+1};\tilde{x}^\nu)$ and $\zeta(\tilde{x}^\nu) = \widehat{\zeta}(\tilde{x}^\nu;\tilde{x}^\nu)$), we derive, with probability 1,
$$\zeta(\tilde{x}^{\nu+1}) - \zeta(\tilde{x}^\nu) \;\le\; \widehat{\zeta}(\tilde{x}^{\nu+1};\tilde{x}^\nu) - \widehat{\zeta}(\tilde{x}^\nu;\tilde{x}^\nu) \;\le\; \Big[\widehat{\zeta}(\tilde{x}^{\nu+1/2};\tilde{x}^\nu) - \widehat{\zeta}(\tilde{x}^\nu;\tilde{x}^\nu)\Big] + \Upsilon\,\|\tilde{x}^{\nu+1}-\tilde{x}^{\nu+1/2}\| - \tfrac{1}{2\gamma}\|\tilde{x}^{\nu+1}-\tilde{x}^\nu\|^2 + \tfrac{1}{2\gamma}\|\tilde{x}^{\nu+1/2}-\tilde{x}^\nu\|^2,$$
where the last inequality is the bound established in Step 2. Moreover, by (3.15),
$$\widehat{\zeta}(\tilde{x}^{\nu+1/2};\tilde{x}^\nu) - \widehat{\zeta}(\tilde{x}^\nu;\tilde{x}^\nu) = \Big[\varphi(\tilde{x}^{\nu+1/2}) + \mathbb{E}_{\tilde\omega}[\psi_1(\tilde{x}^{\nu+1/2},\tilde\omega)]\Big] - \Big[\varphi(\tilde{x}^\nu) + \mathbb{E}_{\tilde\omega}[\psi_1(\tilde{x}^\nu,\tilde\omega)]\Big] - \big(\tilde{x}^{\nu+1/2}-\tilde{x}^\nu\big)^\top\mathbb{E}_{\tilde\omega}[\nabla_x\psi_2(\tilde{x}^\nu,\tilde\omega)] \;\le\; -\tfrac{1}{\gamma}\,\|\tilde{x}^{\nu+1/2}-\tilde{x}^\nu\|^2.$$
Combining the last two strings of inequalities yields
$$\zeta(\tilde{x}^{\nu+1}) - \zeta(\tilde{x}^\nu) \;\le\; \Upsilon\,\|\tilde{x}^{\nu+1}-\tilde{x}^{\nu+1/2}\| - \tfrac{1}{2\gamma}\|\tilde{x}^{\nu+1}-\tilde{x}^\nu\|^2 - \tfrac{1}{2\gamma}\|\tilde{x}^{\nu+1/2}-\tilde{x}^\nu\|^2.$$
Since $X$ is bounded and $\tilde{x}^\nu\in X$ for all $\nu$, the sequence of objective values $\{\zeta(\tilde{x}^\nu)\}$ is bounded. Since $\sum_{\nu=1}^{\infty}\|\tilde{x}^{\nu+1}-\tilde{x}^{\nu+1/2}\|$ is finite almost surely, by Lemma 3.3 it follows that the sequence $\{\zeta(\tilde{x}^\nu)\}$ converges and the two sums $\sum_{\nu=1}^{\infty}\|\tilde{x}^{\nu+1}-\tilde{x}^\nu\|^2$ and $\sum_{\nu=1}^{\infty}\|\tilde{x}^{\nu+1/2}-\tilde{x}^\nu\|^2$ are finite with probability 1.

Step 4. Since the feasible set $X$ is compact, the sequence of iterates $\{\tilde{x}^\nu\}$ has at least one accumulation point. Let $\{\tilde{x}^\nu\}_{\nu\in\kappa}$ be a subsequence converging to a limit $\tilde{x}^\infty$, which must belong to $X$. We will show that $\tilde{x}^\infty$ is a d-stationary point of the two-stage stochastic program (3.1) with probability 1. First, it follows that $\{\tilde{x}^{\nu+1}\}_{\nu\in\kappa}$ must also converge to $\tilde{x}^\infty$ with probability 1. Notice
$$\widehat{\zeta}(x;\tilde{x}^\nu) = \varphi(x) + \mathbb{E}_{\tilde\omega}[\psi_1(x,\tilde\omega)] - \mathbb{E}_{\tilde\omega}\big[\psi_2(\tilde{x}^\nu,\tilde\omega) + \nabla_x\psi_2(\tilde{x}^\nu,\tilde\omega)^\top(x-\tilde{x}^\nu)\big].$$
Moreover,
$$\widehat{\zeta}_\nu(x;\tilde{x}^\nu) = \varphi(x) + \tfrac{1}{L_\nu}\textstyle\sum_{i=1}^{L_\nu}\big[\psi_1(x,\omega^{\nu,i}) - \psi_2(\tilde{x}^\nu,\omega^{\nu,i})\big] - \tfrac{1}{L_\nu}\textstyle\sum_{i=1}^{L_\nu}\big[\nabla_x\psi_2(\tilde{x}^\nu,\omega^{\nu,i})^\top(x-\tilde{x}^\nu)\big].$$
Thus we have
$$\widehat{\zeta}(\tilde{x}^{\nu+1};\tilde{x}^\nu) - \widehat{\zeta}_\nu(\tilde{x}^{\nu+1};\tilde{x}^\nu) = \Big(\mathbb{E}_{\tilde\omega}[\psi_1(\tilde{x}^{\nu+1},\tilde\omega)] - \tfrac{1}{L_\nu}\textstyle\sum_{i=1}^{L_\nu}\psi_1(\tilde{x}^{\nu+1},\omega^{\nu,i})\Big) - \Big(\mathbb{E}_{\tilde\omega}[\psi_2(\tilde{x}^\nu,\tilde\omega)] - \tfrac{1}{L_\nu}\textstyle\sum_{i=1}^{L_\nu}\psi_2(\tilde{x}^\nu,\omega^{\nu,i})\Big) - \Big(\mathbb{E}_{\tilde\omega}[\nabla_x\psi_2(\tilde{x}^\nu,\tilde\omega)] - \tfrac{1}{L_\nu}\textstyle\sum_{i=1}^{L_\nu}\nabla_x\psi_2(\tilde{x}^\nu,\omega^{\nu,i})\Big)^\top(\tilde{x}^{\nu+1}-\tilde{x}^\nu) = T_{1,\nu} - T_{2,\nu} - T_{3,\nu},$$
where we have employed the short-hands $T_{1,\nu}$ and $T_{2,\nu}$ for the terms within the first two parentheses, respectively, and $T_{3,\nu}$ for the last inner product. It follows that
$$\zeta(\tilde{x}^{\nu+1}) = \varphi(\tilde{x}^{\nu+1}) + \mathbb{E}_{\tilde\omega}[\psi_1(\tilde{x}^{\nu+1},\tilde\omega)] - \mathbb{E}_{\tilde\omega}[\psi_2(\tilde{x}^{\nu+1},\tilde\omega)] \le \varphi(\tilde{x}^{\nu+1}) + \mathbb{E}_{\tilde\omega}[\psi_1(\tilde{x}^{\nu+1},\tilde\omega)] - \mathbb{E}_{\tilde\omega}[\widehat{\psi}_2(\tilde{x}^{\nu+1},\tilde\omega;\tilde{x}^\nu)] = \widehat{\zeta}(\tilde{x}^{\nu+1};\tilde{x}^\nu) = \widehat{\zeta}_\nu(\tilde{x}^{\nu+1};\tilde{x}^\nu) + T_{1,\nu} - T_{2,\nu} - T_{3,\nu} \le \varphi(x) + \tfrac{1}{L_\nu}\textstyle\sum_{i=1}^{L_\nu}\psi_1(x,\omega^{\nu,i}) - \tfrac{1}{L_\nu}\textstyle\sum_{i=1}^{L_\nu}\widehat{\psi}_2(x,\omega^{\nu,i};\tilde{x}^\nu) + \tfrac{1}{2\gamma}\|x-\tilde{x}^\nu\|^2 - \tfrac{1}{2\gamma}\|\tilde{x}^{\nu+1}-\tilde{x}^\nu\|^2 + T_{1,\nu} - T_{2,\nu} - T_{3,\nu}$$
for any $x\in X$. Since $\lim_{\nu(\in\kappa)\to\infty}\tilde{x}^\nu = \tilde{x}^\infty$, by invoking the limits (3.14) and Lemma 3.4 and letting $\nu(\in\kappa)\to\infty$ in the last string of inequalities, we derive that, with probability 1, for any $x\in X$,
$$\zeta(\tilde{x}^\infty) \;\le\; \varphi(x) + \mathbb{E}_{\tilde\omega}[\psi_1(x,\tilde\omega)] - \mathbb{E}_{\tilde\omega}[\psi_2(\tilde{x}^\infty,\tilde\omega)] - (x-\tilde{x}^\infty)^\top\nabla\,\mathbb{E}_{\tilde\omega}[\psi_2(\tilde{x}^\infty,\tilde\omega)] + \tfrac{1}{2\gamma}\,\|x-\tilde{x}^\infty\|^2.$$
This is enough to apply Proposition 3.1 to conclude that $\tilde{x}^\infty$ is a d-stationary point of the two-stage stochastic program (3.1) with probability 1. □

3.4 Two Simple Positive Semidefinite Cases

We next extend our previous analysis to the case where $Q$ is a positive semidefinite (psd) matrix. As mentioned before, while (3.1) remains a dc program in this case, the available dc decomposition of the recourse function $\psi(\bullet,\omega)$ is only conceptual and does not lend itself to practical computation. We start with two relatively simple psd cases in this section and show that both can be transformed into the stochastic dc program (3.5) with explicit convex functions $\psi_1(\bullet,\omega)$ and $\psi_2(\bullet,\omega)$, the latter being differentiable.

3.4.1 The first simple case

Consider the SP (3.1) assuming:

(D$_1$) the vector $f$ and matrix $G$ in the objective of the recourse function are deterministic and satisfy $f + GX \subseteq \operatorname{Range} Q$; i.e., for any $x\in X$ there exists $z\in\mathbb{R}^{n_2}$ such that $f + Gx = Qz$. □

Given assumption (A), which stipulates that the set $X$ is bounded, it follows from assumption (D$_1$) that the vector $z$ satisfying $f+Gx = Qz$ may be restricted to be bounded. In this case, the recourse function can be formulated as follows:
$$\psi(x,\omega) \,\triangleq\, \operatorname*{minimum}_{y}\ [f+Gx]^\top y + \tfrac{1}{2}y^\top Qy \quad \text{subject to} \quad y \in Y(x,\omega) \triangleq \{\,y\in\mathbb{R}^{n_2} \mid A(\omega)x + Dy \ge \xi(\omega)\,\}.$$
Introducing an auxiliary variable $z$ satisfying $f+Gx = Qz$, and with a change of variable $w \triangleq y + z$, we can write the above recourse function equivalently as:
$$\psi(x,\omega) = \underbrace{\Big[\ \operatorname*{minimum}_{w}\ \tfrac{1}{2}w^\top Qw \quad \text{subject to} \quad A(\omega)x - Dz + Dw \ge \xi(\omega)\ \Big]}_{\text{denoted } \theta(x,z,\omega)} \; - \; \tfrac{1}{2}z^\top Qz.$$
Consequently, the given two-stage stochastic program (P) is equivalent to
$$\operatorname*{minimize}_{x,z}\ \zeta(x,z) \,\triangleq\, \varphi(x) + \mathbb{E}_{\tilde\omega}[\theta(x,z,\tilde\omega)] - \tfrac{1}{2}z^\top Qz \quad \text{subject to} \quad \underbrace{x\in X \ \text{and}\ f+Gx = Qz}_{\text{lifted feasible set of pairs } (x,z), \text{ denoted } Z}. \qquad (3.17)$$
Since the right-hand side of the constraints in the transformed recourse function $\theta(x,z,\omega)$ is affine in $x$ and $z$, the expected recourse $\mathbb{E}_{\tilde\omega}[\theta(x,z,\tilde\omega)]$ is convex with respect to $(x,z)$. Thus problem (3.17) is a dc program with a deterministic and smooth concave summand. Moreover, by the above-mentioned remark, the lifted set $Z$ may be assumed to be compact and convex, i.e., it has the same properties as the given set $X$.
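The equivalence $\psi(x,\omega) = \theta(x,z,\omega) - \tfrac{1}{2}z^\top Qz$ behind (3.17) is easy to check numerically. The sketch below does so on a synthetic instance with cvxpy; all problem data are illustrative, $G$ and $f$ are built so that $f + Gx \in \operatorname{Range} Q$, and $z$ is obtained from the pseudoinverse so that $Qz = f + Gx$.

```python
# Check numerically that psi(x, w) = theta(x, z, w) - 0.5 z'Qz when Qz = f + Gx
# (the lifting of Section 3.4.1).  All data below are synthetic.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
n1, n2, ell = 3, 4, 6
R = rng.standard_normal((n2, 2))
Q = R @ R.T                                  # psd with rank 2
G = Q @ rng.standard_normal((n2, n1))        # guarantees f + Gx in Range(Q)
f = Q @ rng.standard_normal(n2)
A, D = rng.standard_normal((ell, n1)), rng.standard_normal((ell, n2))
xi = rng.standard_normal(ell) - 5.0          # loose constraints (feasibility)
x = rng.uniform(-1, 1, n1)

def psi(x):
    y = cp.Variable(n2)
    prob = cp.Problem(cp.Minimize((f + G @ x) @ y + 0.5 * cp.quad_form(y, Q)),
                      [A @ x + D @ y >= xi])
    return prob.solve()

def theta_lifted(x):
    z = np.linalg.pinv(Q) @ (f + G @ x)      # any z with Qz = f + Gx
    w = cp.Variable(n2)
    prob = cp.Problem(cp.Minimize(0.5 * cp.quad_form(w, Q)),
                      [A @ x - D @ z + D @ w >= xi])
    return prob.solve() - 0.5 * z @ Q @ z

print(psi(x), theta_lifted(x))               # the two values should agree
```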
Hence, problem (3.17) has the formulation (3.5) and satisfies the assumptions (A$'$) and (B$'$). Moreover, if computationally needed, a bound on the $z$-variable can be induced from that on the $x$-variable.

3.4.2 The second simple case

Let the psd matrix $Q$ have the eigenvalue decomposition
$$Q = \big[\,P_+ \ \ P_0\,\big]\begin{bmatrix}\Lambda_+ & 0\\ 0 & 0\end{bmatrix}\big[\,P_+ \ \ P_0\,\big]^\top,$$
where $\Lambda_+$ is a positive diagonal matrix whose diagonal entries are the positive eigenvalues of $Q$, the columns of $P_+$ are the orthonormal eigenvectors corresponding to the eigenvalues in $\Lambda_+$, and the columns of $P_0$ are the orthonormal eigenvectors corresponding to the zero eigenvalues of $Q$.

Letting $\widehat{y} \triangleq [\,P_+ \ \ P_0\,]^\top y = \begin{pmatrix}\widehat{y}^{\,1}\\ \widehat{y}^{\,2}\end{pmatrix}$, so that $y = P_+\widehat{y}^{\,1} + P_0\widehat{y}^{\,2}$, we may write the recourse function as
$$\psi(x,\omega) \,\triangleq\, \operatorname*{minimum}_{\widehat{y}}\ \big[f(\omega)+G(\omega)x\big]^\top P_+\widehat{y}^{\,1} + \tfrac{1}{2}\big(\widehat{y}^{\,1}\big)^\top\Lambda_+\widehat{y}^{\,1} + \big[f(\omega)+G(\omega)x\big]^\top P_0\widehat{y}^{\,2} \quad \text{subject to} \quad A(\omega)x + DP_+\widehat{y}^{\,1} + DP_0\widehat{y}^{\,2} \ge \xi(\omega).$$
Let $\begin{pmatrix}f_+(\omega)\\ f_0(\omega)\end{pmatrix} \triangleq \begin{bmatrix}P_+^\top\\ P_0^\top\end{bmatrix} f(\omega)$ and $\begin{bmatrix}G_+(\omega) & D_+^\top\\ G_0(\omega) & D_0^\top\end{bmatrix} \triangleq \begin{bmatrix}P_+^\top\\ P_0^\top\end{bmatrix}\big[\,G(\omega) \ \ D^\top\,\big]$. Considering the dual problem, we deduce that $\psi(x,\omega) = \psi_1(x,\omega) - \psi_2(x,\omega)$, where
$$\psi_1(x,\omega) = -\Bigg[\ \operatorname*{minimum}_{u\ge 0}\ \tfrac{1}{2}u^\top D_+\Lambda_+^{-1}D_+^\top u + u^\top\Big[A(\omega)x - \xi(\omega) - D_+\Lambda_+^{-1}\big(f_+(\omega)+G_+(\omega)x\big)\Big] \quad \text{subject to} \quad f_0(\omega)+G_0(\omega)x - D_0^\top u = 0\ \Bigg],$$
$$\psi_2(x,\omega) = \tfrac{1}{2}\big[f_+(\omega)+G_+(\omega)x\big]^\top\Lambda_+^{-1}\big[f_+(\omega)+G_+(\omega)x\big].$$

Proposition 3.2. The following two matrix-theoretic statements are equivalent:
(D$_2$) for almost all $\omega\in\Omega$, the matrix $G_0(\omega) = 0$;
(D$_3$) for almost all $\omega\in\Omega$, $\operatorname{Range} G(\omega) \subseteq \operatorname{Range} Q$. □

Proof. (D$_3$) $\Rightarrow$ (D$_2$). Condition (D$_3$) is equivalent to the existence of a matrix $\bar{G}(\omega)\in\mathbb{R}^{n_2\times n_1}$ such that $G(\omega) = Q\bar{G}(\omega)$. Hence $G_0(\omega) = P_0^\top G(\omega) = P_0^\top Q\bar{G}(\omega) = 0$, because $P_0^\top Q = 0$.

(D$_2$) $\Rightarrow$ (D$_3$). Let $P \triangleq [\,P_+ \ \ P_0\,]$ and let $y \in \operatorname{Range} G(\omega)$ be arbitrary. There exists $x$ such that $y = G(\omega)x$. Therefore,
$$y = \big[\,P_+ \ \ P_0\,\big]\begin{bmatrix}P_+^\top\\ P_0^\top\end{bmatrix} G(\omega)x = \big[\,P_+ \ \ P_0\,\big]\begin{bmatrix}P_+^\top G(\omega)x\\ 0\end{bmatrix} = P_+P_+^\top G(\omega)x.$$
With the eigenvalue decomposition of $Q$, we have
$$y = P\begin{bmatrix}\Lambda_+ & 0\\ 0 & 0\end{bmatrix}P^\top\, P\begin{bmatrix}\Lambda_+^{-1} & 0\\ 0 & 0\end{bmatrix}P^\top G(\omega)x = Q\Bigg(P\begin{bmatrix}\Lambda_+^{-1} & 0\\ 0 & 0\end{bmatrix}P^\top G(\omega)x\Bigg).$$
Thus $y\in\operatorname{Range} Q$. □

Under either condition (D$_2$) or (D$_3$), the feasible set defining $\psi_1(\bullet,\omega)$ does not depend on $x$; hence $\psi_1(\bullet,\omega)$ is a convex function. Since $\psi_2(\bullet,\omega)$ is a convex quadratic function, it follows that we again have an explicit dc decomposition of the recourse function $\psi(\bullet,\omega)$ with a smooth concave part. Note that in this case $\psi_1(\bullet,\omega)$ is convex, but not necessarily strictly convex, and the feasible set defining this recourse function is dependent on $\omega$, albeit without the first-stage variable $x$; this is in contrast to the case of positive definite $Q$, where the $\psi_1(\bullet,\omega)$ function in its dual form (see (3.4)) has a constant feasible set (the nonnegative orthant in the $\lambda$-space).

The upshot of this subsection is that, under condition (D$_3$), the two-stage linearly bi-parameterized SP (3.1) with a positive semidefinite $Q$ matrix in the recourse function is amenable to the combined SAA and DCA scheme for the computation of a d-stationary point, via the explicit dc decomposition of the recourse function described above. Although condition (D$_3$) will appear again in Proposition 3.4(b) in the next section, as a sufficient condition for the recourse function $\psi(\bullet,\omega)$ to be Clarke regular, the convergence results, Theorems 3.2, 3.3, and 3.4, do not assume this condition.
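The ingredients of the second simple case are directly computable from an eigendecomposition. The snippet below builds $\Lambda_+$, $P_+$, $P_0$ and evaluates the smooth quadratic part $\psi_2(\cdot,\omega)$ for a single realization; all matrices are synthetic stand-ins, and numpy.linalg.eigh supplies the spectral factorization.

```python
# Build the spectral objects of Section 3.4.2 and evaluate the smooth
# quadratic psi_2(x, omega) = 0.5 (f_+ + G_+ x)' Lambda_+^{-1} (f_+ + G_+ x).
# All matrices below are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(3)
n1, n2 = 3, 5
R = rng.standard_normal((n2, 3))
Q = R @ R.T                                   # psd, rank 3
f = rng.standard_normal(n2)
Gw = rng.standard_normal((n2, n1))            # a realization G(omega)

lam, P = np.linalg.eigh(Q)                    # ascending eigenvalues
pos = lam > 1e-10 * lam.max()
Lam_plus, P_plus, P_zero = lam[pos], P[:, pos], P[:, ~pos]

f_plus, G_plus = P_plus.T @ f, P_plus.T @ Gw
G_zero = P_zero.T @ Gw                        # (D_2) asks for G_zero == 0 a.e.

def psi2(x):
    r = f_plus + G_plus @ x
    return 0.5 * r @ (r / Lam_plus)           # diagonal Lambda_+^{-1}

x = rng.uniform(-1, 1, n1)
print("psi2(x) =", psi2(x), "| ||G_zero|| =", np.linalg.norm(G_zero))
```

A nonzero G_zero here signals that condition (D$_2$), equivalently (D$_3$), fails for this realization, which is precisely the situation treated in the general psd case below.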
3.5 The General Positive Semidefinite Case

The general psd case is much more challenging than the simple cases discussed in Section 3.4. In fact, there are two factors that contribute to the applicability of the combined SAA-DCA method for solving the previous cases: namely, the availability of an explicit dc decomposition, and the differentiability of the concave part of the dc decomposition. The latter also provides a main reason for the convergence to a directional stationary solution. This feature disappears in the general psd case. In fact, with the lack of a computationally exploitable dc decomposition of the recourse function, even a critical point, let alone a Clarke stationary point, is difficult to obtain with sampling and convexification combined in a one-loop iterative scheme. Therefore, the first order of business in treating the general psd case is to understand what kind of stationary solution one can aim to compute and what sort of algorithm one can design to approximate such a solution.

Regularization plays a key role in this consideration. Indeed, we may consider the following Tikhonov regularization of the recourse function: for a given scalar $\alpha > 0$,
$$\psi_\alpha(x,\omega) \,\triangleq\, \operatorname*{minimum}_{y}\ \big[f(\omega)+G(\omega)x\big]^\top y + \tfrac{1}{2}y^\top[Q+\alpha I]y \quad \text{subject to} \quad y\in Y(x,\omega) \triangleq \{\,y\in\mathbb{R}^{n_2} \mid A(\omega)x+Dy \ge \xi(\omega)\,\}, \qquad (3.18)$$
where $I$ is the identity matrix. Then a question is: by considering the two-stage SP with the above regularized recourse function, and letting each $x^{\alpha_\nu}$ denote a d-stationary point of the SP:
$$\operatorname*{minimize}_{x\in X}\ \zeta_{\alpha_\nu}(x) \,\triangleq\, \varphi(x) + \mathbb{E}_{\tilde\omega}\big[\psi_{\alpha_\nu}(x,\tilde\omega)\big],$$
if $\bar{x}$ is a limit point of a sequence $\{x^\nu \triangleq x^{\alpha_\nu}\}$ for some sequence $\{\alpha_\nu\}\downarrow 0$, what kind of a stationary solution is $\bar{x}$? Subsequently, we will introduce a sampled regularization of the recourse function and analyze its convergence. For now, we introduce the key concept of a generalized critical point for the problem (3.1) and study its properties.

3.5.1 A detour: generalized critical points

A difference-of-convex function is the sum of a convex and a concave function. It turns out that the recourse function $\psi(x,\omega)$ also has such a convexity-concavity feature, although not explicitly. Specifically, we say that a function $g: X\subseteq\mathbb{R}^n \to \mathbb{R}$, where $X$ is a convex set, satisfies the convexity-concavity property if there exists a bivariate function $h: X\times X\to\mathbb{R}$ such that $g(x) = h(x,x)$ for any $x\in X$, and $h(\bullet,x)$ is convex and $h(x,\bullet)$ is concave. To see that the second-stage recourse function $\psi(\bullet,\omega)$ satisfies the convexity-concavity property for fixed $\omega$, define a lifted recourse function
$$\bar{\psi}(x,z,\omega) \,\triangleq\, \operatorname*{minimum}_{y}\ \big[f(\omega)+G(\omega)z\big]^\top y + \tfrac{1}{2}y^\top Qy \quad \text{subject to} \quad y\in Y(x,\omega) \triangleq \{\,y\in\mathbb{R}^{n_2} \mid A(\omega)x+Dy \ge \xi(\omega)\,\}, \qquad (3.19)$$
whose optimal solution set we denote $\bar{M}(x,z,\omega)$. Clearly, $\psi(x,\omega) = \bar{\psi}(x,x,\omega)$, and $\bar{\psi}(\bullet,\bullet,\omega)$ has the desired convexity-concavity property; by [59, Lemma 1], this lifted recourse function $\bar{\psi}(\bullet,\bullet,\omega)$ is continuous on its domain of finiteness. The so-defined convexity-concavity property can be thought of as an implicit dc property. As such, we may define a generalized critical point for a univariate function satisfying the convexity-concavity property.

Definition 3.3. Let $S\subseteq\mathbb{R}^n$ be a convex set and let a function $g: S\to\mathbb{R}$ satisfy the convexity-concavity property with the associated bivariate convex-concave function $h$. We say that $\bar{x}\in S$ is a generalized critical point of $g$ on $S$ if
$$0 \,\in\, \partial_x h(\bar{x},\bar{x}) - \partial_z(-h)(\bar{x},\bar{x}) + \mathcal{N}(\bar{x};S), \qquad (3.20)$$
where $\partial_x h(\bar{x},\bar{x})$ is the subdifferential of the convex function $h(\bullet,\bar{x})$ at $\bar{x}$, and similarly for $\partial_z(-h)(\bar{x},\bar{x})$, the subdifferential of the convex function $-h(\bar{x},\bullet)$ at $\bar{x}$. □

The term "generalized critical point" stems from the special case of a dc function $f = g_1 - g_2$, where $g_1$ and $g_2$ are both convex. For this function $f$, we can associate the bivariate function $h(x,z) = g_1(x) - g_2(z)$. With this association, it is clear that the expression (3.20) becomes $0 \in \partial g_1(\bar{x}) - \partial g_2(\bar{x}) + \mathcal{N}(\bar{x};X)$, which is precisely the definition of a critical point in dc programming. Note that, like the latter, a generalized critical point of the function $g$ depends on the bivariate convex-concave function $h$.

We say that $\bar{x}\in X$ is a generalized critical point of (3.1) if
$$0 \,\in\, \nabla\varphi(\bar{x}) + \partial_x\,\mathbb{E}_{\tilde\omega}\big[\bar{\psi}(\bar{x},\bar{x},\tilde\omega)\big] - \partial_z\,\mathbb{E}_{\tilde\omega}\big[(-\bar{\psi})(\bar{x},\bar{x},\tilde\omega)\big] + \mathcal{N}(\bar{x};X), \qquad (3.21)$$
where $\partial_x\,\mathbb{E}_{\tilde\omega}[\bar{\psi}(\bar{x},\bar{x},\tilde\omega)]$ is the subdifferential, in the sense of convex analysis, of the function $\mathbb{E}_{\tilde\omega}[\bar{\psi}(\bullet,\bar{x},\tilde\omega)]$ at $\bar{x}$; similarly for $\partial_z\,\mathbb{E}_{\tilde\omega}[(-\bar{\psi})(\bar{x},\bar{x},\tilde\omega)]$. Equivalently, condition (3.21) says that there exists $\bar{v} \in \partial_z\,\mathbb{E}_{\tilde\omega}[(-\bar{\psi})(\bar{x},\bar{x},\tilde\omega)]$ such that $0 \in \nabla\varphi(\bar{x}) + \partial_x\,\mathbb{E}_{\tilde\omega}[\bar{\psi}(\bar{x},\bar{x},\tilde\omega)] - \bar{v} + \mathcal{N}(\bar{x};X)$. This is further equivalent to
$$\bar{x} \,\in\, \operatorname*{argmin}_{x\in X}\ \underbrace{\varphi(x) + \mathbb{E}_{\tilde\omega}\big[\bar{\psi}(x,\bar{x},\tilde\omega)\big] - \bar{v}^\top(x-\bar{x})}_{\text{convex function in } x}. \qquad (3.22)$$

In order to relate the generalized criticality definition with directional stationarity and Clarke stationarity, we utilize the directional derivative formula for the recourse function $\psi(\bullet,\omega)$. While such formulas have been obtained for the value function of linear programs (see the early papers [85, 40]) and of general convex programs under some regularity conditions, such as solution boundedness and constraint qualifications, a comprehensive perturbation (including directional) analysis of parametric nonlinear programs can be found in [9]; the most relevant to a bi-parameterized convex quadratic program is [45, Corollary 3.3], which gives a directional derivative formula for the value function of a parametric nonlinear program under the constant rank constraint qualification (CRCQ) and certain abstract stability conditions. While the CRCQ is immediately satisfied by linear constraints and the stability conditions can be verified to hold under the relatively complete recourse assumption, we find it interesting to derive the directional derivative of $\psi(\bullet,\omega)$ using a somewhat obscure sum property of this derivative. Although not widely known, this property is nevertheless key to the validity of an implicit function theorem for a class of nonsmooth equations [65]; see also [30, Exercise 3.7.4]. This property stipulates, under a certain limit condition, the equality between the total directional derivative of the bivariate function $\bar{\psi}(\bullet,\bullet,\omega)$ and the sum of the partial directional derivatives of the two partial functions $\bar{\psi}(\bullet,z,\omega)$ and $\bar{\psi}(x,\bullet,\omega)$.

To prepare for this derivation, we let $M(x,\omega)$ and $\Lambda(x,\omega)$ denote, respectively, the sets of optimal primal and optimal dual solutions of the second-stage value function $\psi(x,\omega)$ given by (3.2). Using the fact that $QM(x,\omega)$ and $[f(\omega)+G(\omega)x]^\top M(x,\omega)$ are singletons, we can use any $\bar{y}\in M(x,\omega)$ to represent the dual optimal set as follows:
$$\Lambda(x,\omega) = \Big\{\,\lambda \ge 0 \ \Big| \ D^\top\lambda = f(\omega)+G(\omega)x + Q\bar{y}, \quad \lambda^\top\big[\xi(\omega)-A(\omega)x\big] = 2\psi(x,\omega) - \bar{y}^\top\big[f(\omega)+G(\omega)x\big]\,\Big\},$$
which shows in particular that $\Lambda(x,\omega)$ is a polyhedral set dependent on the pair $(x,\omega)$ only and independent of the primal optimal solution $\bar{y}$. The primal optimal set has the following polyhedral representation: define the index set $\mathcal{I}(x,\omega) \triangleq \{\,i \mid \exists\,\lambda\in\Lambda(x,\omega) \text{ with } \lambda_i > 0\,\}$; we then have, for any $\bar{\lambda}\in\Lambda(x,\omega)$,
$$M(x,\omega) = \Big\{\,y\in Y(x,\omega) \ \Big| \ f(\omega)+G(\omega)x+Qy = D^\top\bar{\lambda}, \quad \big[A(\omega)x+Dy = \xi(\omega)\big]_i \ \ \forall\, i\in\mathcal{I}(x,\omega)\,\Big\}. \qquad (3.23)$$
While neither $M(x,\omega)$ nor $\Lambda(x,\omega)$ is necessarily bounded, we have the following finiteness result, which is known in the literature and is included here for ease of reference. Let
$$L(x,\omega;y,\lambda) \,\triangleq\, \big[f(\omega)+G(\omega)x\big]^\top y + \tfrac{1}{2}y^\top Qy + \lambda^\top\big[\xi(\omega)-A(\omega)x-Dy\big]$$
denote the Lagrangian function of the quadratic program associated with $\psi(x,\omega)$, with $\lambda$ being the constraint multiplier. In the following lemma and all remaining results (including the theorems), the proofs are given in Appendix B.

Lemma 3.5. Under assumptions (A) and (B), both optimal values below,
$$\min_{y\in M(x,\omega)}\ \big[G(\omega)d\big]^\top y \qquad \text{and} \qquad \max_{\lambda\in\Lambda(x,\omega)}\ \big[-A(\omega)d\big]^\top\lambda, \qquad (3.24)$$
are finite for all $x\in X$, all $d\in\mathbb{R}^{n_1}$, and almost all $\omega\in\Omega$. □

Proof. We prove only the finiteness of the former min value. The recession cone of $Y(x,\omega)$ is equal to $Y(x,\omega)_\infty = \{\,v \mid Dv \ge 0\,\}$. Since the relatively complete recourse condition holds on an open set $\Xi$ containing $X$, it follows that for every $d\in\mathbb{R}^{n_1}$ the optimal set $M(x\pm\tau d,\omega)$ is nonempty for all sufficiently small scalars $\tau > 0$; i.e., for all such pairs $(d,\tau)$, it holds that (cf. condition (B2) in Section 3.1.1)
$$\big[\,Dv \ge 0,\ Qv = 0\,\big] \ \Rightarrow\ \big[f(\omega)+G(\omega)(x\pm\tau d)\big]^\top v \ge 0.$$
Since $f(\omega)+G(\omega)x = D^\top\bar{\lambda} - Qy$ for any $\bar{\lambda}\in\Lambda(x,\omega)$ and any $y\in M(x,\omega)$, the above implication yields, for $\mathcal{I}(x,\omega) = \{\,i \mid \exists\,\lambda\in\Lambda(x,\omega) \text{ with } \lambda_i>0\,\}$,
$$\big[\,Dv \ge 0,\ Qv = 0,\ (Dv)_i = 0\ \forall\, i\in\mathcal{I}(x,\omega)\,\big] \ \Rightarrow\ \pm\tau\,\big[G(\omega)^\top v\big]^\top d \ge 0.$$
Since this holds for all $d\in\mathbb{R}^{n_1}$ and all $\tau>0$ sufficiently small, it follows that
$$\big[\,Dv \ge 0,\ Qv = 0,\ (Dv)_i = 0\ \forall\, i\in\mathcal{I}(x,\omega)\,\big] \ \Rightarrow\ G(\omega)^\top v = 0. \qquad (3.25)$$
Since $\min_{y\in M(x,\omega)}[G(\omega)d]^\top y$ is the optimal objective value of a linear program in $y$ for fixed $(x,d,\omega)$, and since $M(x,\omega)$ is nonempty, we have, by the representation (3.23) of the latter set,
$$\min_{y\in M(x,\omega)}\big[G(\omega)d\big]^\top y > -\infty \iff \Big[\ \big(\,Dv \ge 0,\ Qv = 0,\ (Dv)_i = 0\ \forall\, i\in\mathcal{I}(x,\omega)\,\big) \ \Rightarrow\ \big[G(\omega)d\big]^\top v \ge 0\ \Big].$$
Hence $\min_{y\in M(x,\omega)}[G(\omega)d]^\top y$ is finite for all $d\in\mathbb{R}^{n_1}$, by (3.25). □

The two optimal values in (3.24) equal $\bar{\psi}(x,\bullet,\omega)'(x;d)$ and $\bar{\psi}(\bullet,x,\omega)'(x;d)$, respectively. Based on this observation, we can establish the desired total directional derivative formula for $\psi(\bullet,\omega)'(x;d)$ by verifying a limit condition (3.27) due to [65].

Proposition 3.3. Under assumptions (A) and (B), the directional derivative of $\psi(\bullet,\omega)$ exists on $\Xi$, and for all $\bar{x}\in X$ and $d\in\mathbb{R}^{n_1}$,
$$\psi(\bullet,\omega)'(\bar{x};d) \,=\, \min_{y\in M(\bar{x},\omega)}\ \max_{\lambda\in\Lambda(\bar{x},\omega)}\ \big[\underbrace{G(\omega)^\top y - A(\omega)^\top\lambda}_{=\,\nabla_x L(\bar{x},\omega;y,\lambda)}\big]^\top d. \qquad (3.26) \ \ \Box$$
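Before turning to the proof, a quick numerical illustration may help. In the generic case where the second-stage QP has unique primal and dual solutions, the min-max in (3.26) collapses to $\psi'(\bar{x};d) = \nabla_x L(\bar{x},\omega;y^*,\lambda^*)^\top d$. The sketch below checks this against a forward difference on a synthetic, strictly feasible instance (all data are illustrative, and $Q$ is taken positive definite to force uniqueness).

```python
# Finite-difference check of the directional-derivative formula (3.26) in the
# generic case of unique primal/dual second-stage solutions, where it reads
# psi'(x; d) = (G'y* - A'lam*)' d.  All problem data are synthetic.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
n1, n2, ell = 3, 4, 5
Q = np.diag(rng.uniform(0.5, 2.0, n2))          # positive definite for uniqueness
f, G = rng.standard_normal(n2), rng.standard_normal((n2, n1))
A, D = rng.standard_normal((ell, n1)), rng.standard_normal((ell, n2))
xi = rng.standard_normal(ell) - 3.0

def solve_qp(x):
    y = cp.Variable(n2)
    cons = [A @ x + D @ y >= xi]
    val = cp.Problem(cp.Minimize((f + G @ x) @ y + 0.5 * cp.quad_form(y, Q)),
                     cons).solve()
    return val, y.value, cons[0].dual_value     # psi(x,w), y*, lambda*

x, d, t = rng.uniform(-1, 1, n1), rng.standard_normal(n1), 1e-6
val0, y_star, lam = solve_qp(x)
val1, _, _ = solve_qp(x + t * d)
formula = (G.T @ y_star - A.T @ lam) @ d        # grad_x L(x, w; y*, lam*) ' d
print("finite difference:", (val1 - val0) / t, " formula:", formula)
```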
Proof. Clearly,
$$\min_{y\in M(x,\omega)}\ \max_{\lambda\in\Lambda(x,\omega)}\ \big[G(\omega)^\top y - A(\omega)^\top\lambda\big]^\top d \,=\, \min_{y\in M(x,\omega)}\big[G(\omega)d\big]^\top y + \max_{\lambda\in\Lambda(x,\omega)}\big[-A(\omega)d\big]^\top\lambda \,=\, \bar{\psi}(x,\bullet,\omega)'(x;d) + \bar{\psi}(\bullet,x,\omega)'(x;d).$$
Hence the formula (3.26) reduces to $\bar{\psi}(\bullet,\bullet,\omega)'((\bar{x},\bar{x});(d,d)) = \bar{\psi}(\bullet,\bar{x},\omega)'(\bar{x};d) + \bar{\psi}(\bar{x},\bullet,\omega)'(\bar{x};d)$. By [65, Proposition 2.7], the above holds if for any $\bar{x}\in X$ and $\bar{z}\in X$,
$$\lim_{(x,z)\to(\bar{x},\bar{z})}\ \frac{\bar{\psi}(x,z,\omega) - \bar{\psi}(x,\bar{z},\omega) - \bar{\psi}(\bar{x},\bullet,\omega)'(\bar{z};z-\bar{z})}{\|z-\bar{z}\|} \,=\, 0. \qquad (3.27)$$
In the rest of the proof, we use $\bar{M}(x,z,\omega)$ and $\bar{\Lambda}(x,z,\omega)$ to denote the sets of primal and dual optimal solutions associated with $\bar{\psi}(x,z,\omega)$. By [59, Lemma 1], for the lifted recourse function $\bar{\psi}$, there exists a constant $c_{Q,D}$, dependent only on the matrices $Q$ and $D$, such that for any $x$ and $z$ in $X$ there exists a pair of solutions $y(x,z,\omega)\in\bar{M}(x,z,\omega)$ and $\lambda(x,z,\omega)\in\bar{\Lambda}(x,z,\omega)$ satisfying
$$\Big\|\begin{pmatrix} y(x,z,\omega)\\ \lambda(x,z,\omega)\end{pmatrix}\Big\| \,\le\, c_{Q,D}\,\Big[\,\|f(\omega)+G(\omega)z\| + \|\xi(\omega)-A(\omega)x\|\,\Big].$$
Let $\{(x^k,z^k)\in X\times X\}$ be any sequence converging to $(\bar{x},\bar{z})$ and let $\{y(x^k,z^k,\omega)\in\bar{M}(x^k,z^k,\omega)\}$ be a corresponding sequence of primal optimal solutions. Due to the uniform bound on this solution sequence, there exists a subsequence $\{y(x^k,z^k,\omega)\}_{k\in K}$ converging to a limit, say $y^\infty$, which is in the optimal solution set $\bar{M}(\bar{x},\bar{z},\omega)$ by the KKT conditions. Moreover, for the subsequence $\{y(x^k,z^k,\omega)\}_{k\in K}$, we have
$$\bar{\psi}(x^k,z^k,\omega) = \big[f(\omega)+G(\omega)z^k\big]^\top y(x^k,z^k,\omega) + \tfrac{1}{2}\,y(x^k,z^k,\omega)^\top Q\,y(x^k,z^k,\omega), \qquad \bar{\psi}(x^k,\bar{z},\omega) \le \big[f(\omega)+G(\omega)\bar{z}\big]^\top y(x^k,z^k,\omega) + \tfrac{1}{2}\,y(x^k,z^k,\omega)^\top Q\,y(x^k,z^k,\omega),$$
and $\bar{\psi}(\bar{x},\bullet,\omega)'(\bar{z};z^k-\bar{z}) = \min_{y\in\bar{M}(\bar{x},\bar{z},\omega)}\big[G(\omega)^\top y\big]^\top(z^k-\bar{z})$. Hence,
$$\frac{\bar{\psi}(x^k,z^k,\omega) - \bar{\psi}(x^k,\bar{z},\omega) - \bar{\psi}(\bar{x},\bullet,\omega)'(\bar{z};z^k-\bar{z})}{\|z^k-\bar{z}\|} \;\ge\; \frac{\big[G(\omega)(z^k-\bar{z})\big]^\top\big(y(x^k,z^k,\omega) - y^\infty\big)}{\|z^k-\bar{z}\|} \;-\; \min_{y\in\bar{M}(\bar{x},\bar{z},\omega)}\ \frac{(y-y^\infty)^\top G(\omega)(z^k-\bar{z})}{\|z^k-\bar{z}\|}.$$
Taking the limit $k(\in K)\to\infty$, we derive
$$\liminf_{k(\in K)\to\infty}\ \frac{\bar{\psi}(x^k,z^k,\omega) - \bar{\psi}(x^k,\bar{z},\omega) - \bar{\psi}(\bar{x},\bullet,\omega)'(\bar{z};z^k-\bar{z})}{\|z^k-\bar{z}\|} \;\ge\; 0. \qquad (3.28)$$
It remains to show that the limsup of the quotients is nonpositive. Notice that the primal optimal solution set $\bar{M}(x,z,\omega)$ is a polyhedron with the following representation: for any $y(x,z,\omega)\in\bar{M}(x,z,\omega)$,
$$\bar{M}(x,z,\omega) = \Big\{\,y\in Y(x,\omega) \ \Big| \ Qy = Q\,y(x,z,\omega), \quad \big[f(\omega)+G(\omega)z\big]^\top y = \big[f(\omega)+G(\omega)z\big]^\top y(x,z,\omega)\,\Big\}.$$
By [59, Lemma 1], the mapping $(x,z)\in X\times X \mapsto \big(Q\bar{M}(x,z,\omega),\ [f(\omega)+G(\omega)z]^\top\bar{M}(x,z,\omega)\big)$ is Lipschitz continuous. By Hoffman's error bound for polyhedral sets (see [30, Corollary 3.2.5]), there exists a constant $c_A > 0$, dependent only on $Q$, $D$, and the pair $(\bar{z},\omega)$, such that for all $x\in X$ and every $y(\bar{x},\bar{z},\omega)\in\bar{M}(\bar{x},\bar{z},\omega)$, $\operatorname{dist}(y(\bar{x},\bar{z},\omega),\bar{M}(x,\bar{z},\omega)) \le c_A\|x-\bar{x}\|$. For each $k$, let $y(\bar{x},\bar{z},\omega;z^k) \in \operatorname*{argmin}_{y\in\bar{M}(\bar{x},\bar{z},\omega)}[G(\omega)^\top y]^\top(z^k-\bar{z})$ and let $y(x^k,\bar{z},\omega)\in\bar{M}(x^k,\bar{z},\omega)$ be such that
$$\big\|y(x^k,\bar{z},\omega) - y(\bar{x},\bar{z},\omega;z^k)\big\| \,\le\, c_A\,\|\bar{x}-x^k\|. \qquad (3.29)$$
By the boundedness property, there exists a subsequence $\{y(x^k,\bar{z},\omega)\}_{k\in K}$ converging to a point which is in the optimal solution set $\bar{M}(\bar{x},\bar{z},\omega)$ by the Karush-Kuhn-Tucker optimality conditions. Moreover, for the subsequence $\{y(x^k,\bar{z},\omega)\}_{k\in K}$, we have
$$\bar{\psi}(x^k,z^k,\omega) \le \big[f(\omega)+G(\omega)z^k\big]^\top y(x^k,\bar{z},\omega) + \tfrac{1}{2}\,y(x^k,\bar{z},\omega)^\top Q\,y(x^k,\bar{z},\omega), \qquad \bar{\psi}(x^k,\bar{z},\omega) = \big[f(\omega)+G(\omega)\bar{z}\big]^\top y(x^k,\bar{z},\omega) + \tfrac{1}{2}\,y(x^k,\bar{z},\omega)^\top Q\,y(x^k,\bar{z},\omega),$$
and $\bar{\psi}(\bar{x},\bullet,\omega)'(\bar{z};z^k-\bar{z}) = \big[G(\omega)^\top y(\bar{x},\bar{z},\omega;z^k)\big]^\top(z^k-\bar{z})$, which yields
$$\frac{\bar{\psi}(x^k,z^k,\omega) - \bar{\psi}(x^k,\bar{z},\omega) - \bar{\psi}(\bar{x},\bullet,\omega)'(\bar{z};z^k-\bar{z})}{\|z^k-\bar{z}\|} \;\le\; \frac{\big[G(\omega)(z^k-\bar{z})\big]^\top\big[y(x^k,\bar{z},\omega) - y(\bar{x},\bar{z},\omega;z^k)\big]}{\|z^k-\bar{z}\|}.$$
Hence, by (3.29),
$$\limsup_{k(\in K)\to\infty}\ \frac{\bar{\psi}(x^k,z^k,\omega) - \bar{\psi}(x^k,\bar{z},\omega) - \bar{\psi}(\bar{x},\bullet,\omega)'(\bar{z};z^k-\bar{z})}{\|z^k-\bar{z}\|} \;\le\; 0.$$
Combining the above limsup with the liminf (3.28), we deduce that the limit (3.27), and thus (3.26), holds. □

Based on the formula (3.26), we can relate the set $S_{gc}$ of generalized critical points, the set $S_C$ of Clarke stationary points, and the set $S_d$ of directional stationary points.

Proposition 3.4. The following statements hold for the two-stage stochastic program (3.1) under assumptions (A) and (B).
(a) $S_d \subseteq S_C \subseteq S_{gc}$;
(b) if in addition assumption (D$_3$) holds, i.e., for almost all $\omega\in\Omega$, $\operatorname{Range} G(\omega)\subseteq\operatorname{Range} Q$, then for almost all $\omega\in\Omega$ the function $\psi(\bullet,\omega)$ is Clarke regular on $X$; i.e., $\psi(\bullet,\omega)^{\circ}(x;d) = \psi(\bullet,\omega)'(x;d)$ for all $x\in X$ and all $d\in\mathbb{R}^{n_1}$;
(c) if in addition to (D$_3$) assumption (C) also holds, then $S_d = S_C = S_{gc}$. □

Proof. (a) From Definition 3.2, we clearly have $S_d\subseteq S_C$. To prove $S_C\subseteq S_{gc}$, we have, from [84, Proposition 2.12], for any $\bar{x}\in S_C$ and all $x\in X$,
$$\nabla\varphi(\bar{x})^\top(x-\bar{x}) + \mathbb{E}_{\tilde\omega}\big[\psi(\bullet,\tilde\omega)^{\circ}(\bar{x};x-\bar{x})\big] \,\ge\, \varphi^{\circ}(\bar{x};x-\bar{x}) + \big(\mathbb{E}_{\tilde\omega}[\psi(\bullet,\tilde\omega)]\big)^{\circ}(\bar{x};x-\bar{x}) \,\ge\, 0. \qquad (3.30)$$
By using a proof similar to that of Theorem 2 in [40], which assumes only the closedness and convexity of the feasible set (see also [68, Theorem 3], which assumes in addition the tameness of the parametric problem), we can deduce an upper bound on the Clarke directional derivative of the value function $\psi(\bullet,\omega)$:
$$\psi(\bullet,\omega)^{\circ}(\bar{x};x-\bar{x}) \,\le\, \max_{y\in M(\bar{x},\omega)}\ \max_{\lambda\in\Lambda(\bar{x},\omega)}\ \nabla_x L(\bar{x},\omega;y,\lambda)^\top(x-\bar{x}) \,=\, \max_{y\in M(\bar{x},\omega)}\big[G(\omega)^\top y\big]^\top(x-\bar{x}) + \max_{\lambda\in\Lambda(\bar{x},\omega)}\big[-A(\omega)^\top\lambda\big]^\top(x-\bar{x}).$$
From (3.30), we have that for any $x\in X$,
$$\nabla\varphi(\bar{x})^\top(x-\bar{x}) + \mathbb{E}_{\tilde\omega}\Big[\max_{y\in M(\bar{x},\tilde\omega)}\big[G(\tilde\omega)^\top y\big]^\top(x-\bar{x})\Big] + \mathbb{E}_{\tilde\omega}\Big[\max_{\lambda\in\Lambda(\bar{x},\tilde\omega)}\big[-A(\tilde\omega)^\top\lambda\big]^\top(x-\bar{x})\Big] \,\ge\, 0. \qquad (3.31)$$
Since the lifted recourse function $\bar{\psi}$ defined in (3.19) is convex-concave, we have
$$-G(\omega)^\top y \in \partial_z(-\bar{\psi})(\bar{x},\bar{x},\omega)\ \ \forall\, y\in M(\bar{x},\omega), \qquad -A(\omega)^\top\lambda \in \partial_x\bar{\psi}(\bar{x},\bar{x},\omega)\ \ \forall\,\lambda\in\Lambda(\bar{x},\omega).$$
The inequality (3.31) yields
$$0 \,\in\, \nabla\varphi(\bar{x}) - \mathbb{E}_{\tilde\omega}\big[\partial_z(-\bar{\psi})(\bar{x},\bar{x},\tilde\omega)\big] + \mathbb{E}_{\tilde\omega}\big[\partial_x\bar{\psi}(\bar{x},\bar{x},\tilde\omega)\big] + \mathcal{N}(\bar{x};X).$$
By Lemma 3.1, we can interchange the expectation and the subdifferentials; hence $\bar{x}$ is a generalized critical point, and part (a) follows.

(b) Under assumption (D$_3$), $G(\omega)^\top M(x,\omega)$ is a singleton. By Proposition 3.3, for any $d\in\mathbb{R}^{n_1}$ we have
$$\psi(\bullet,\omega)'(\bar{x};d) = \min_{y\in M(\bar{x},\omega)}\big[G(\omega)d\big]^\top y + \max_{\lambda\in\Lambda(\bar{x},\omega)}\big[-A(\omega)d\big]^\top\lambda = \big[G(\omega)d\big]^\top y + \max_{\lambda\in\Lambda(\bar{x},\omega)}\big[-A(\omega)d\big]^\top\lambda \quad \text{for any } y\in M(\bar{x},\omega).$$
As before, we have
$$\psi(\bullet,\omega)^{\circ}(\bar{x};d) \,\le\, \max_{y\in M(\bar{x},\omega)}\big[G(\omega)d\big]^\top y + \max_{\lambda\in\Lambda(\bar{x},\omega)}\big[-A(\omega)d\big]^\top\lambda \,=\, \big[G(\omega)d\big]^\top y + \max_{\lambda\in\Lambda(\bar{x},\omega)}\big[-A(\omega)d\big]^\top\lambda \quad \text{for any } y\in M(\bar{x},\omega), \,=\, \psi(\bullet,\omega)'(\bar{x};d).$$
This is enough to establish statement (b).

(c) For this part, it suffices to show that $S_{gc}\subseteq S_d$ under the additional condition (C). This follows readily from part (b) under the interchangeability of expectation with the directional derivatives and subdifferentials. Details are omitted. □

Proposition 3.3 can be applied to the regularized recourse function $\psi_\alpha(x,\omega)$ in (3.18). In particular, for fixed $\alpha>0$, letting $y_\alpha(x,\omega)$ denote the unique optimal solution of the QP defining this regularized recourse and $\Lambda_\alpha(x,\omega)$ denote the corresponding optimal dual solution set, we have for all $x\in X$ and $d\in\mathbb{R}^{n_1}$,
$$\psi_\alpha(\bullet,\omega)'(x;d) \,=\, \big[G(\omega)^\top y_\alpha(x,\omega)\big]^\top d + \max_{\lambda\in\Lambda_\alpha(x,\omega)}\big[-A(\omega)^\top\lambda\big]^\top d \,=\, \big[G(\omega)^\top y_\alpha(x,\omega)\big]^\top d + \bar{\psi}_\alpha(\bullet,x,\omega)'(x;d), \qquad (3.32)$$
where
$$\bar{\psi}_\alpha(x,z,\omega) \,\triangleq\, \operatorname*{minimum}_{y}\ \big[f(\omega)+G(\omega)z\big]^\top y + \tfrac{1}{2}y^\top[Q+\alpha I]y \quad \text{subject to} \quad y\in Y(x,\omega) \triangleq \{\,y\in\mathbb{R}^{n_2} \mid A(\omega)x+Dy\ge\xi(\omega)\,\} \qquad (3.33)$$
is the lifted regularized recourse function corresponding to the regularized recourse $\psi_\alpha(x,\omega)$. The formula (3.32) shows that the latter function $\psi_\alpha(\bullet,\omega)$ is not necessarily differentiable.

3.5.2 Convergence of regularization, convexification, and sampling

We return to the basic approximation scheme defined by the regularized recourse function $\psi_\alpha(x,\omega)$ in (3.18). In this subsection, we analyze the convergence of three sequences of iterates corresponding to a given sequence of scalars $\{\alpha_\nu\}\downarrow 0$:

• (sequential regularization only) the basic iterates $\{x^\nu\}$, where each $x^\nu$ is a d-stationary solution of the dc stochastic program:
$$\operatorname*{minimize}_{x\in X}\ \zeta_{\alpha_\nu}(x) \,\triangleq\, \varphi(x) + \mathbb{E}_{\tilde\omega}\big[\psi_{\alpha_\nu}(x,\tilde\omega)\big]; \qquad (3.34)$$

• (simultaneous regularization and convexification) the convexified iterates $\{\widehat{x}^\nu\}$, where each $\widehat{x}^{\nu+1}$, given $\widehat{x}^\nu$, is the unique minimizer of the strongly convex SP:
$$\operatorname*{minimize}_{x\in X}\ \widehat{\zeta}_{\alpha_\nu}(x) \,\triangleq\, \varphi(x) + \mathbb{E}_{\tilde\omega}\big[\psi_{\alpha_\nu,1}(x,\tilde\omega)\big] - \mathbb{E}_{\tilde\omega}\big[\widehat{\psi}_{\alpha_\nu,2}(x,\tilde\omega;\widehat{x}^\nu)\big] + \tfrac{1}{2\gamma}\,\|x-\widehat{x}^\nu\|^2,$$
where $\psi_\alpha(x,\omega) = \psi_{\alpha,1}(x,\omega) - \psi_{\alpha,2}(x,\omega)$ is the dc decomposition of the function $\psi_\alpha(\bullet,\omega)$ as introduced in Section 3.3; specifically,
$$\psi_{\alpha,1}(x,\omega) \,\triangleq\, \max_{\lambda\ge 0}\ \Big\{ -\tfrac{1}{2}\lambda^\top D(Q+\alpha I)^{-1}D^\top\lambda + \lambda^\top\Big[\xi(\omega)-A(\omega)x+D(Q+\alpha I)^{-1}\big(f(\omega)+G(\omega)x\big)\Big] \Big\}, \qquad \psi_{\alpha,2}(x,\omega) \,\triangleq\, \tfrac{1}{2}\big[f(\omega)+G(\omega)x\big]^\top(Q+\alpha I)^{-1}\big[f(\omega)+G(\omega)x\big], \qquad (3.35)$$
and $\widehat{\psi}_{\alpha,2}(x,\omega;x') \triangleq \psi_{\alpha,2}(x',\omega) + \nabla_x\psi_{\alpha,2}(x',\omega)^\top(x-x')$ is the linearization of $\psi_{\alpha,2}(\bullet,\omega)$ at $x'$;

• (all together) the combined sampled, regularized, and convexified iterates $\{\tilde{x}^\nu\}$, where, given a sequence of sample sizes $\{L_\nu\}_{\nu=1}^{\infty}$, random samples $\{\omega^{\nu,i}\}_{i=1}^{L_\nu}$ specified as in Algorithm 1, and $\tilde{x}^\nu$, each $\tilde{x}^{\nu+1}$ is the unique minimizer of the discretized strongly convex (sampled) program:
$$\operatorname*{minimize}_{x\in X}\ \underbrace{\varphi(x) + \tfrac{1}{L_\nu}\textstyle\sum_{i=1}^{L_\nu}\psi_{\alpha_\nu,1}(x,\omega^{\nu,i}) - \tfrac{1}{L_\nu}\textstyle\sum_{i=1}^{L_\nu}\widehat{\psi}_{\alpha_\nu,2}(x,\omega^{\nu,i};\tilde{x}^\nu)}_{\text{convex in } x; \text{ denoted } \widehat{\zeta}_{\alpha_\nu}(x;\tilde{x}^\nu)} \,+\, \tfrac{1}{2\gamma}\,\|x-\tilde{x}^\nu\|^2.$$

As before, we are not concerned with how these sequences of optimal solutions are obtained; in particular, we take the two sequences $\{x^\nu\}$ and $\{\widehat{x}^\nu\}$ as deterministic and understand that the last sequence $\{\tilde{x}^\nu\}$ is random, dependent on the samples. We divide the analysis into three separate cases because the assumptions required for convergence are progressively more demanding, yet remain reasonable. Common to the three cases is the fact that, at the very minimum, we need the sequence of perturbed recourse values $\{\psi_\alpha(x,\omega)\}$ to converge uniformly to the original unperturbed recourse function $\psi(x,\omega)$ for almost all $x\in X$ and almost all $\omega$. This seems difficult without a boundedness assumption on the respective sequence of optimal solutions $\{y_\alpha(x,\omega)\}$. This and other considerations motivate the introduction of the following assumption:

(E) $[\,Dv \ge 0 \ \text{and}\ Qv = 0\,] \ \Rightarrow\ v = 0$. □

We state an important boundedness consequence of the solutions of a regularized QP under this condition.

Lemma 3.6. Let $Q\in\mathbb{R}^{n_2\times n_2}$ be a symmetric positive semidefinite matrix and $D\in\mathbb{R}^{\ell\times n_2}$. If assumption (E) holds, then there exist positive constants $B$ and $\bar{\alpha}$ such that for all $\alpha\in[0,\bar{\alpha}]$ and all pairs of vectors $(q,b)$ for which the QP:
$$\operatorname*{minimum}_{y}\ q^\top y + \tfrac{1}{2}y^\top[Q+\alpha I]y \quad \text{subject to} \quad Dy \ge b \qquad (3.36)$$
has an optimal solution, say $y_\alpha(q,b)$, it holds that $\|y_\alpha(q,b)\| \le B\,\big\|\begin{pmatrix}q\\ b\end{pmatrix}\big\|$. □

Proof. The proof is by contradiction. Suppose that no such pair of scalars $(B,\bar{\alpha})$ exists. Then there exist two sequences of vectors, $\{q^k\}$ and $\{b^k\}$, and a sequence of nonnegative scalars $\{\alpha_k\}$ converging to zero such that the sequence of optimal solutions $\{y^k \triangleq y_{\alpha_k}(q^k,b^k)\}$ (for $\alpha_k = 0$, the corresponding solution $y_{\alpha_k}(q^k,b^k)$ is arbitrary) satisfies $\|y^k\| > k\,\big\|\begin{pmatrix}q^k\\ b^k\end{pmatrix}\big\|$. For each $k$, there exists a multiplier $\lambda^k$ such that the KKT conditions hold:
$$q^k + [Q+\alpha_k I]y^k - D^\top\lambda^k = 0, \qquad 0 \le \lambda^k \perp Dy^k - b^k \ge 0.$$
By considering the normalized sequence $\{y^k/\|y^k\|\}$, which must have a nonzero accumulation point, say $\bar{y}$, dividing the above complementarity conditions by $\|y^k\|$, and letting $k\to\infty$ along the subsequence that yields $\bar{y}$, we can deduce that $D\bar{y}\ge 0$ and $Q\bar{y} = 0$. This is a contradiction. □

3.5.2.1 The sequence $\{x^\nu\}$

We recall the lifted regularized recourse function $\bar{\psi}_\alpha(x,z,\omega)$ given by (3.33) and the directional derivative formula (3.32) for the regularized recourse function $\psi_\alpha(x,\omega)$. Let $y_\alpha(x,z,\omega)$ be the unique minimizer of the QP associated with the former recourse function for a given triplet $(x,z,\omega)$. The following lemma pertains to the limit of this lifted regularized recourse function as $\alpha\downarrow 0$.
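The dc pair (3.35) is directly computable once $(Q+\alpha I)^{-1}$ is formed. The sketch below evaluates $\psi_{\alpha,1}$ (via its dual QP over $\lambda\ge 0$) and $\psi_{\alpha,2}$ on synthetic data and checks that $\psi_{\alpha,1}-\psi_{\alpha,2}$ agrees with the primal value $\psi_\alpha$ of (3.18); it uses cvxpy, and cp.psd_wrap guards the numerically psd dual Hessian.

```python
# Evaluate the dc pair (3.35) for the regularized recourse and check that
# psi_{alpha,1} - psi_{alpha,2} equals the primal value psi_alpha of (3.18).
# All problem data below are synthetic.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(8)
n1, n2, ell, alpha = 3, 4, 5, 0.1
R = rng.standard_normal((n2, 2)); Q = R @ R.T           # psd
f, G = rng.standard_normal(n2), rng.standard_normal((n2, n1))
A, D = rng.standard_normal((ell, n1)), rng.standard_normal((ell, n2))
xi, x = rng.standard_normal(ell) - 3.0, rng.uniform(-1, 1, n1)

Qa = Q + alpha * np.eye(n2)
Qa_inv = np.linalg.inv(Qa)
q_vec = f + G @ x

def psi_alpha_primal():
    y = cp.Variable(n2)
    return cp.Problem(cp.Minimize(q_vec @ y + 0.5 * cp.quad_form(y, Qa)),
                      [A @ x + D @ y >= xi]).solve()

def psi_alpha_1():                                       # max over lam >= 0, cf. (3.35)
    lam = cp.Variable(ell, nonneg=True)
    M = D @ Qa_inv @ D.T                                 # psd up to round-off
    c = xi - A @ x + D @ Qa_inv @ q_vec
    return cp.Problem(cp.Maximize(-0.5 * cp.quad_form(lam, cp.psd_wrap(M))
                                  + c @ lam)).solve()

psi_alpha_2 = 0.5 * q_vec @ Qa_inv @ q_vec
print(psi_alpha_primal(), psi_alpha_1() - psi_alpha_2)   # should coincide
```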
Lemma 3.7. Under assumptions (A), (B), (C), and (E), if $\{(x^k,z^k)\}\subset X\times X$ is a sequence converging to $(\bar{x},\bar{z})$ and $\{\alpha_k\}$ is a sequence of positive scalars converging to zero, the following three statements hold:

(a) the sequence $\{y_{\alpha_k}(x^k,z^k,\omega)\}$ is bounded, and every accumulation point belongs to $\bar{M}(\bar{x},\bar{z},\omega)$, i.e., is an optimal solution of the problem: $\operatorname*{minimize}_{y\in Y(\bar{x},\omega)}\ (f(\omega)+G(\omega)\bar{z})^\top y + \tfrac{1}{2}y^\top Qy$;

(b) the following limits hold uniformly for all $(x,z)\in X\times X$:
$$\lim_{k\to\infty}\bar{\psi}_{\alpha_k}(x,z,\omega) = \bar{\psi}(x,z,\omega), \qquad \lim_{k\to\infty}\mathbb{E}_{\tilde\omega}\big[\bar{\psi}_{\alpha_k}(x,z,\tilde\omega)\big] = \mathbb{E}_{\tilde\omega}\big[\bar{\psi}(x,z,\tilde\omega)\big]; \qquad (3.37)$$
moreover,
$$\lim_{k\to\infty}\bar{\psi}_{\alpha_k}(x^k,z^k,\omega) = \bar{\psi}(\bar{x},\bar{z},\omega), \qquad \lim_{k\to\infty}\mathbb{E}_{\tilde\omega}\big[\bar{\psi}_{\alpha_k}(x^k,z^k,\tilde\omega)\big] = \mathbb{E}_{\tilde\omega}\big[\bar{\psi}(\bar{x},\bar{z},\tilde\omega)\big]; \qquad (3.38)$$

(c) the sequence $\{\mathbb{E}_{\tilde\omega}[G(\tilde\omega)^\top y_{\alpha_k}(x^k,\tilde\omega)]\}$ is bounded; moreover, every accumulation point $\bar{a}$ is a supergradient of the concave function $\mathbb{E}_{\tilde\omega}[\bar{\psi}(\bar{x},\bullet,\tilde\omega)]$ at $\bar{x}$; equivalently, $-\bar{a} \in \partial_z\,\mathbb{E}_{\tilde\omega}[(-\bar{\psi})(\bar{x},\bar{x},\tilde\omega)]$. □

Proof. For (a), without loss of generality, assume that the sequence $\{y^k \triangleq y_{\alpha_k}(x^k,z^k,\omega)\}$ converges to a limit $y^\infty$, which clearly belongs to $Y(\bar{x},\omega)$. To show that $y^\infty$ has the claimed minimizing property, consider the KKT conditions: for some $\lambda^k$,
$$f(\omega)+G(\omega)z^k + [Q+\alpha_k I]y^k - D^\top\lambda^k = 0, \qquad 0 \le \lambda^k \perp A(\omega)x^k + Dy^k - \xi(\omega) \ge 0.$$
Passing to the limit $k\to\infty$, we easily deduce that $y^\infty\in\bar{M}(\bar{x},\bar{z},\omega)$.

For (b), it suffices to prove the two limits in (3.37); those in (3.38) follow from Lemma 3.4 and the fact that, for fixed $\omega$, the function $\bar{\psi}(\bullet,\bullet,\omega)$ is continuous on $X\times X$. We have, for any $\bar{y}\in\bar{M}(x,z,\omega)$, which by Lemma 3.6 we can choose to be uniformly bounded for all $(x,z)\in X\times X$ and almost all $\omega\in\Omega$,
$$\bar{\psi}_{\alpha_k}(x,z,\omega) = \big[f(\omega)+G(\omega)z\big]^\top y_{\alpha_k}(x,z,\omega) + \tfrac{1}{2}\,y_{\alpha_k}(x,z,\omega)^\top[Q+\alpha_k I]\,y_{\alpha_k}(x,z,\omega) \le \big[f(\omega)+G(\omega)z\big]^\top\bar{y} + \tfrac{1}{2}\,\bar{y}^\top[Q+\alpha_k I]\,\bar{y} = \bar{\psi}(x,z,\omega) + \tfrac{\alpha_k}{2}\,\|\bar{y}\|^2.$$
Conversely, we have
$$\bar{\psi}(x,z,\omega) = \big[f(\omega)+G(\omega)z\big]^\top\bar{y} + \tfrac{1}{2}\,\bar{y}^\top Q\,\bar{y} \le \big[f(\omega)+G(\omega)z\big]^\top y_{\alpha_k}(x,z,\omega) + \tfrac{1}{2}\,y_{\alpha_k}(x,z,\omega)^\top Q\,y_{\alpha_k}(x,z,\omega) \le \bar{\psi}_{\alpha_k}(x,z,\omega).$$
Consequently, the first limit in (3.37) holds uniformly in $(x,z)$. The second limit follows by the dominated convergence theorem, as $\{\bar{\psi}_{\alpha_k}(x,z,\omega)\}$ is uniformly bounded by a constant independently of almost all $\omega\in\Omega$.

To prove (c), for every $\alpha_k > 0$, let $a^k \triangleq \mathbb{E}_{\tilde\omega}[G(\tilde\omega)^\top y_{\alpha_k}(x^k,\tilde\omega)] = \nabla_z\,\mathbb{E}_{\tilde\omega}[\bar{\psi}_{\alpha_k}(x^k,x^k,\tilde\omega)]$. We have, for any $z$, $\mathbb{E}_{\tilde\omega}[\bar{\psi}_{\alpha_k}(x^k,z,\tilde\omega)] \le \mathbb{E}_{\tilde\omega}[\bar{\psi}_{\alpha_k}(x^k,x^k,\tilde\omega)] + (a^k)^\top(z-x^k)$, by the concavity of $\bar{\psi}_{\alpha_k}(x^k,\bullet,\omega)$. The sequence $\{a^k\}$ is bounded; if $\bar{a}$ is the limit of a convergent subsequence $\{a^k\}_{k\in\kappa}$, then, passing to the limit $k(\in\kappa)\to\infty$, the above inequality yields, using the second limit in (3.38), $\mathbb{E}_{\tilde\omega}[\bar{\psi}(\bar{x},z,\tilde\omega)] \le \mathbb{E}_{\tilde\omega}[\bar{\psi}(\bar{x},\bar{x},\tilde\omega)] + \bar{a}^\top(z-\bar{x})$. Hence $\bar{a}$ is a supergradient of the concave function $\mathbb{E}_{\tilde\omega}[\bar{\psi}(\bar{x},\bullet,\tilde\omega)]$ at $\bar{x}$. □

We are now ready to state the following convergence result for the sequence $\{x^\nu\}$, where each $x^\nu$ is a d-stationary solution of (3.34).

Theorem 3.2. Under assumptions (A), (B), (C), and (E), every accumulation point of the sequence $\{x^\nu\}$ is a generalized critical point of (3.1). If in addition (D$_3$) holds, then such a point is d-stationary or, equivalently, C-stationary. □

Proof. Let $\bar{x}$ be the limit of a convergent subsequence $\{x^\nu\}_{\nu\in\kappa}$. By the directional derivative formula (3.32), we have, for all $x\in X$,
$$0 \,\le\, \nabla\varphi(x^\nu)^\top(x-x^\nu) + \mathbb{E}_{\tilde\omega}\big[\psi_{\alpha_\nu}'(x^\nu;x-x^\nu)\big] = \nabla\varphi(x^\nu)^\top(x-x^\nu) + \Big\{\mathbb{E}_{\tilde\omega}\big[G(\tilde\omega)^\top y_{\alpha_\nu}(x^\nu,\tilde\omega)\big]\Big\}^\top(x-x^\nu) + \mathbb{E}_{\tilde\omega}\big[\bar{\psi}_{\alpha_\nu}(\bullet,x^\nu,\tilde\omega)'(x^\nu;x-x^\nu)\big].$$
This is equivalent to saying that $x^\nu$ is an optimal solution of the convex program:
$$\operatorname*{minimize}_{x\in X}\ \varphi(x) + \Big\{\mathbb{E}_{\tilde\omega}\big[G(\tilde\omega)^\top y_{\alpha_\nu}(x^\nu,\tilde\omega)\big]\Big\}^\top(x-x^\nu) + \mathbb{E}_{\tilde\omega}\big[\bar{\psi}_{\alpha_\nu}(x,x^\nu,\tilde\omega)\big].$$
If $\bar{a}$ is the limit point of the bounded subsequence $\{\mathbb{E}_{\tilde\omega}[G(\tilde\omega)^\top y_{\alpha_\nu}(x^\nu,\tilde\omega)]\}_{\nu\in\kappa}$, then by Lemma 3.7 and a simple limiting argument, we can deduce that $\bar{x}$ is an optimal solution of the convex program:
$$\operatorname*{minimize}_{x\in X}\ \varphi(x) + \bar{a}^\top(x-\bar{x}) + \mathbb{E}_{\tilde\omega}\big[\bar{\psi}(x,\bar{x},\tilde\omega)\big].$$
Since $\bar{a}$ is a supergradient of $\mathbb{E}_{\tilde\omega}[\bar{\psi}(\bar{x},\bullet,\tilde\omega)]$ at $\bar{x}$ by part (c) of Lemma 3.7, we deduce that $\bar{x}$ is a generalized critical point of the SP (3.1), by the equivalent condition (3.22) of such a point (with $\bar{v} = -\bar{a}$). Under the additional assumption (D$_3$), the equivalence of "generalized critical," "Clarke," and "directional" follows from part (c) of Proposition 3.4. □

We should recall that if condition (D$_3$) holds, then a combined SAA and DCA scheme will also compute a d-stationary solution of the SP (3.1) without the $\alpha$-regularization of the recourse function. These are two different approaches reaching the same goal. In the absence of condition (D$_3$), the former scheme is not applicable and the latter method computes only a generalized critical point. These conclusions persist in the simultaneous regularization and convexification scheme to be analyzed next; see Theorem 3.3.

3.5.2.2 The sequence $\{\widehat{x}^\nu\}$

Under a mild restriction on the regularization parameter $\alpha_\nu$, we have the following convergence result for the deterministic sequence $\{\widehat{x}^\nu\}$.

Theorem 3.3. Under assumptions (A), (B), (C), and (E), let $\{\alpha_\nu\}\downarrow 0$ be a nonincreasing sequence of positive scalars satisfying $\liminf_{\nu\to\infty}\alpha_\nu\sqrt{\nu} > 0$ (written as $\alpha_\nu = \Omega(1/\sqrt{\nu})$). Then every accumulation point of the sequence $\{\widehat{x}^\nu\}$ is a generalized critical point of the SP (3.1). If in addition (D$_3$) holds, then such a point is a d-stationary point or, equivalently, a C-stationary point. □

Proof. By the definition of $\widehat{x}^{\nu+1}$ and the obvious nonincreasing property of the regularized recourse function, $\psi_{\alpha_{\nu+1}}(x,\omega) \le \psi_{\alpha_\nu}(x,\omega)$, we have
$$\zeta_{\alpha_\nu}(\widehat{x}^\nu) = \varphi(\widehat{x}^\nu) + \mathbb{E}_{\tilde\omega}\big[\psi_{\alpha_\nu,1}(\widehat{x}^\nu,\tilde\omega)\big] - \mathbb{E}_{\tilde\omega}\big[\widehat{\psi}_{\alpha_\nu,2}(\widehat{x}^\nu,\tilde\omega;\widehat{x}^\nu)\big] \ge \varphi(\widehat{x}^{\nu+1}) + \mathbb{E}_{\tilde\omega}\big[\psi_{\alpha_\nu,1}(\widehat{x}^{\nu+1},\tilde\omega)\big] - \mathbb{E}_{\tilde\omega}\big[\widehat{\psi}_{\alpha_\nu,2}(\widehat{x}^{\nu+1},\tilde\omega;\widehat{x}^\nu)\big] + \tfrac{1}{2\gamma}\|\widehat{x}^{\nu+1}-\widehat{x}^\nu\|^2 \ge \varphi(\widehat{x}^{\nu+1}) + \mathbb{E}_{\tilde\omega}\big[\psi_{\alpha_\nu,1}(\widehat{x}^{\nu+1},\tilde\omega)\big] - \mathbb{E}_{\tilde\omega}\big[\psi_{\alpha_\nu,2}(\widehat{x}^{\nu+1},\tilde\omega)\big] + \tfrac{1}{2\gamma}\|\widehat{x}^{\nu+1}-\widehat{x}^\nu\|^2 \ge \varphi(\widehat{x}^{\nu+1}) + \mathbb{E}_{\tilde\omega}\big[\psi_{\alpha_{\nu+1}}(\widehat{x}^{\nu+1},\tilde\omega)\big] + \tfrac{1}{2\gamma}\|\widehat{x}^{\nu+1}-\widehat{x}^\nu\|^2 = \zeta_{\alpha_{\nu+1}}(\widehat{x}^{\nu+1}) + \tfrac{1}{2\gamma}\|\widehat{x}^{\nu+1}-\widehat{x}^\nu\|^2.$$
Since the maximum eigenvalue of (Q + ↵ ⌫ I) 1 is bounded above by ↵ 1 ⌫ , it follows that there exists a constantC> 0 such that for all ⌫ and all !, 91 r x ↵ ⌫, 2 (b x ⌫ +1 ,!)r x ↵ ⌫, 2 (b x ⌫ ,!) C ↵ ⌫ b x ⌫ +1 b x ⌫ . (3.40) Sincethesum 1 X ⌫ =1 b x ⌫ b x ⌫ +1 2 isfinite,itfollowsthat lim ⌫ !1 p ⌫ b x ⌫ +1 b x ⌫ = 0. Consequently, if liminf ⌫ !1 ↵ ⌫ p ⌫> 0 as assumed, then lim ⌫ !1 E e ! h r x ↵ ⌫, 2 (b x ⌫ +1 ,e !)r x ↵ ⌫, 2 (b x ⌫ ,e !) i =0. Finally, if ¯ a is an accumulation of the sequence n E e ! ⇥ G(e !) > y ↵ ⌫ (b x ⌫ +1 ,e !) ⇤ o ⌫ 2 , then by part (c) of Lemma 3.7 , we deduce that ¯ a is a subgradient of the expected lifted recourse function with respect to z at b x 1 . Hence, by passing to the limit in the problem (3.39), it follows that b x 1 is an optimal solution of the convex program: minimize x2 X '(x)(¯ a) > (xb x 1 )+E e ! ⇥ (x,b x 1 ,e !) ⇤ + 1 2 kxb x 1 k 2 , or equivalently, an optimal solution of the same convex program without the last proximal term. Hence, the conclusions of the theorem follow as in Theorem 3.2. ⇤ 3.5.2.3 The sequence {e x ⌫ } Whilethemostinvolved,theanalysisofthissequenceismadeeasybyalltheanalysiswehavedone so far. Since we need to work with the directional derivative of the convex regularized function ↵ ⌫, 1 (•,!) at the iterate e x ⌫ +1 , which would involve the optimal solution(s) ↵ ⌫, 1 (e x ⌫ +1 ,!) of this function in the dual form (cf. (3.35)), we need some boundedness property of such solutions. However,wedonotneedthedualsolutionsthemselvestobebounded. Instead,asweshallsee,the key is to obtain a uniform bound on A(!) > ↵ ⌫, 1 (b x ⌫ +1 ,!). This turns out to require a non-trivial argument which employs the following lemma. 92 Lemma 3.8. Let ⌥ , n A2R `⇥ n1 | ⇥ D T =0, 0 ⇤ ) A T =0 o . The following two statements hold for this convex family of matrices: (a) for every A2⌥, the scalar (A), max kA T k 1 |kD T k 1 =1, 0 <1; (b) for every compact subset b ⌥ of ⌥, the scalar max A2 b ⌥ (A) < 1. Thus for any such subset b ⌥, there exists a constant b > 0 such that for every A2 b ⌥, it holds that kA > k b kD T k for all 0. ⇤ Proof. By the property of any matrix A2⌥, the quantity kA T k 1 is bounded above on the set |kD T k 1 =1, 0 which is the union of finitely many polyhedra. By linear program- ming theory, this observation is enough to establish statement (a). It is not dicult to show that the scalar function (•) is continuous on the set ⌥. Thus the finiteness of the scalar max A2 b ⌥ (A) follows. The last statement in part (b) follows by a simple normalization argument. ⇤ Asaconsequenceoftheabovelemma,weestablishaboundforaconstantV ↵, 1 analogoustothe onein(3.8). Thesetupfortheresultisasfollows. Recallthat⇤ ↵ (x,!)istheoptimaldualsolution set of the regularized recourse function ↵ (x,!). For each pair (x,!), let u ↵ (x,!)2@ x ↵, 1 (x,!) such that E e ! ⇥ u ↵ (x,e !) ⇤ 2@ E e ! ⇥ ↵, 1 (x,e !) ⇤ . The subgradient u ↵ (x,!) is a convex combination of finitely many vectors each equal to e u ↵ (x,,! ) , h A(!)x+D(Q+↵ I) 1 G(!)x i > , (3.41) where 2⇤ ↵ (x,!). In the next lemma, we derive a bound ofkD > k. Lemma 3.9. Under assumptions (A), (B) and (C), there exist positive constants ✓ i fori=1,2,3 such that for all x2X,↵> 0, and almost all !2⌦, max ✓ (Q+↵ I) 1 (D > ) , D > ◆ ✓ 1 ↵ +✓ 2 +✓ 3 ↵, 8 2⇤ ↵ (x,!). (3.42) ⇤ 93 Proof. By Ho↵man’s error bound for polyhedral sets, there exists a constant c D >0dependent on the matrix D only such that for all x2X and almost all !2⌦, there exists b y(x,!) satisfying A(!)x+Db y(x,!)⇠ (!) 
and kb y(x,!)k c D ⇥ ⇠ (!)A(!)x ⇤ + , (3.43) where [•] + , max(•,0) is the plus operator of vectors. For any 2⇤ ↵ (x,!), it satisfies the complementarity conditions: 0 ? 2 6 6 6 6 4 D(Q+↵ I) 1 D > h ⇠ (!)A(!)x+D(Q+↵ I) 1 f(!)+G(!)x i 3 7 7 7 7 5 0. Letting s(x,!),A(!)x+Db y(x,!)⇠ (!)0 be the slack variable associated with the feasible vector b y(x,!), we deduce from the above complementarity conditions, ↵ (Q+↵ I) 1 (D > ) 2 h (Q+↵ I) 1 (D > ) i > [Q+↵ I] h (Q+↵ I) 1 (D > ) i = > D(Q+↵ I) 1 D > = > h ⇠ (!)A(!)x+D(Q+↵ I) 1 f(!)+G(!)x i = > ⇥ Db y(x,!)s(x,!) ⇤ + h (Q+↵ I) 1 (D > ) i > f(!)+G(!)x h (Q+↵ I) 1 (D > ) i >⇥ (Q+↵ I)b y(x,!)+(f(!)+G(!)x) ⇤ which yields (Q+↵ I) 1 (D > ) 1 ↵ n kQ+↵ Ik b y(x,!) + f(!)+G(!)x o . 94 Using the bound (3.43), the inequality: D > k Q+↵ Ik (Q+↵ I) 1 (D > ) , and the essential boundedness assumption (C), we easily deduce from the above inequalities the desired bound (3.42) for some constants ✓ i , i=1,2,3. We will bound E e ! h u ↵ (x,!) i via E e ! h e u ↵ (x,,! ) i for any 2⇤ ↵ (x,!). For this purpose, we need one last assumption which makes it possible to apply Lemma 3.8. Namely, (F) For almost all !2⌦, h D > = 0 and 0 i ) A(!) > =0. (3.44) There are some simple sucient conditions for (F) to hold: (i) the Slater condition holds for the sets Y(x,!); i.e., there exists D¯ y> 0; and (ii) RangeA(!) ✓ RangeD for almost all ! 2⌦. Both sucient conditions are fairly obvious. Note that the second condition (ii) is reminiscent of condition (D 3 ). Specifically, the latter condition pertains to the pair (G(!),Q) which appears in the objective function of the recourse function (x,!), whereas the former condition pertains to the pair (A(!),D) which appears in the constraint of the same recourse function (x,!). Combining assumption (F) with the essential boundedness of kA(!)k, we have the following result. Lemma 3.10. Under assumptions (A), (B), (C) and (F), there exist constants ✓ 0 i for i=1,2,3 such that E e ! h u ↵ (x,e !) i ✓ 0 1 ↵ +✓ 0 2 +✓ 0 3 ↵, 8x 2X. (3.45) ⇤ 95 Proof. BytheessentialboundednessofkA(!)k,thereexistsaconstantM> 0suchthatP[ b ⌦] = 1, where b ⌦ , {!|kA(!)k M}. Without loss of generality, we may assume that A(!) satisfies the implication (3.44) for all !2 b ⌦. Let b ⌥ be the convex hull of these matrices A(!) for !2 b ⌦. Since the family of matrices⌥ in Lemma 3.8 is convex, it follows that b ⌥ is a compact subset of ⌥. Hence, there exists a constant b > 0 such that for all ! 2 b ⌦, kA(!) > k b kD > k for all 0. By the definition (3.41), we obtain, for any 2⇤ ↵ (x,!), ke u ↵ (x,,! )kk A(!) > kkxk+kG(!)xk (Q+↵ I) 1 (D > ) . Hence by Lemma 3.9 and the bound onkA(!) > k in terms ofkD > k, the bound (3.45) follows readily for some constants ✓ 0 i , i=1,2,3. An immediate consequence of the bound (3.45) of the subgradients in @ x ↵, 1 (x,!)isthe Lipschitz continuity of the regularized recourse function ↵ (x,!) with a Lipschitz constant of the order 1/↵ for↵> 0 suciently small. Corollary 3.1. There exists a constant Lip > 0 such that for all↵> 0 suciently small E e ! ⇥ ↵, 1 (x,e !) ⇤ E e ! ⇥ ↵, 1 (x 0 ,e !) ⇤ Lip ↵ xx 0 , for all x and x 0 in X and E e ! h r x ↵, 2 (x,e !) i Lip ↵ . ⇤ Proof. Since ↵, 1 (•,!) is convex, we have ↵, 1 (x,!) ↵, 1 (x 0 ,!) u ↵ (x 0 ,!) > (xx 0 )ku ↵ (x 0 ,!)kkxx 0 k, interchanging x and x 0 yields ↵, 1 (x,!) ↵, 1 (x 0 ,!) sup z2 X ku ↵ (z,!)kkxx 0 k. 96 Thus (3.1) follows from (3.45). Since r x ↵, 2 (x,!)= [Q+↵ I] 1 G(!)x, we easily derive (cf. 
We can state the following convergence result for the sequence $\{\tilde{x}^\nu\}$ corresponding to a sequence of positive scalars $\{\alpha_\nu\}\downarrow 0$, with the proof given in Appendix B.

Theorem 3.4. Under assumptions (A), (B), (C), (E), and (F), let $\{\alpha_\nu\}\downarrow 0$ be a nonincreasing sequence of positive scalars satisfying $\liminf_{\nu\to\infty}\alpha_\nu\sqrt{\nu} > 0$. Let $\{L_\nu\}$ be chosen such that $\sum_{\nu=1}^{\infty}\frac{1}{\alpha_\nu^2\sqrt{L_\nu}} < \infty$. Then the conclusions of Theorem 3.3 hold for any accumulation point of the sequence $\{\tilde{x}^\nu\}$ with probability 1. □

3.6 Summary

In this chapter, we studied the two-stage stochastic program with a convex quadratic recourse that is linearly bi-parameterized by the first-stage decision variable. We established the almost-sure subsequential convergence of a combined SAA and DCA algorithmic framework for computing a directional stationary point (in the case of a positive definite second-stage objective) and a generalized critical point (in the case of a positive semidefinite second-stage objective). The newly introduced concepts of implicitly convex-concave functions and generalized critical points enrich the foundations of modern nonconvex nondifferentiable optimization, and we anticipate more of their applications in other deterministic and stochastic programs. The work done here on this class of linearly bi-parameterized stochastic programs is by no means complete: the sequential convergence and the convergence rate of the proposed algorithm, as well as the efficient computation of the sample-based DCA subproblems, remain to be explored.

Chapter 4
Stochastic Programming with Implicitly Endogenous Uncertainty

4.1 Introduction

In stochastic programming, endogenous uncertainty means that the uncertainty depends on the decision variable. With decision-dependent uncertainty, traditional Monte Carlo sampling approaches, including SAA, SA, and SD, are no longer applicable to this class of SP problems. The sequential PO paradigm can be applied: one first predicts the probabilistic dependence of the uncertainty on the decision, and then solves the SP problem equipped with the predicted probability distribution. In the PO paradigm, the interdependence between the decision and the uncertainty is at the center of the trade-off between predictive accuracy and the complexity of decision-making.

In this chapter, we study stochastic optimization problems with latently decision-dependent uncertainty. We develop a coupled framework that attempts to balance the trade-off between the accuracy of the predictive models and the complexity of the composite optimization problem. We refer to this extension of Learning Enabled Optimization (LEO, see [71]) as "Coupled Learning Enabled Optimization" (CLEO). What is necessary for this early foray into coupled estimation and optimization models is the control of overall complexity by making judicious choices in both arenas.

4.1.1 Problem setting

We consider a constrained stochastic optimization problem as follows:
$$\text{(P)} \qquad \operatorname*{minimize}\ f(x) \,\triangleq\, c(x) + \mathbb{E}_{\tilde\omega|x}\big[h(x,\tilde\omega) + g(x,\tilde\omega)\big] \quad \text{subject to} \quad x\in X\subseteq\mathbb{R}^p, \qquad (4.1)$$
where $\tilde\omega\in\mathbb{R}^m$ is a random vector defined on a probability space $(\Omega,\mathcal{F},\mathbb{P})$, with $\Omega$ being the sample space, $\mathcal{F}$ being the $\sigma$-algebra generated by subsets of $\Omega$, and $\mathbb{P}$ being a probability measure defined on $\mathcal{F}$. Under the assumption that the conditional probability distribution of $\tilde\omega$ given $x$ exists, $\mathbb{E}_{\tilde\omega|x}$ refers to the expectation taken with respect to the conditional probability distribution of $\tilde\omega$ given $x$. However, the conditional probability distribution is unknown. As a result, we estimate the conditional expectation using observations of data pairs $(x,\omega)$. In order to make the connection with standard statistical/machine-learning terminology, we treat $x$ as the predictor and $\omega$ as the response.

We make several assumptions about the problem (P) so that we may focus on the treatment of latent dependence.

(A1) The feasible set $X\subseteq\mathbb{R}^p$ is a deterministic, convex, and compact set.

(A2) The function $c(\cdot): \mathbb{R}^p\to\mathbb{R}$ is $C^1$-smooth on $X$, and the function $g(\cdot,\cdot)$ is $C^1$-smooth on $X\times\Omega$.

(A3) The function $h(x,\omega) = \max_{j\in J} h_j(x,\omega)$. The set $J$ is a finite index set, and for each $j\in J$, $h_j(\cdot,\cdot): Z_j\times\Xi_j\to\mathbb{R}$ is a convex and smooth function, where $Z_j\subseteq\mathbb{R}^p$ is a convex open set containing $X$ and $\Xi_j\subseteq\mathbb{R}^m$ is a convex open set containing $\Omega$.
However, the conditional probability distribution is unknown. As a result, we will estimate conditional expectation by using the observations of data pair (x,!). In order to make theconnectionwithstandardstatistical/machinelearningterminology,wetreatxasthepredictor, and ! as the response. We make several assumptions about the problem (P) so that we focus on the treatment of latent dependence. (A1) The feasible set X✓ R p is a deterministic, convex and compact set (A2) The function c(·) :R p ! R is C 1 smooth on X and the function g(·,·)is C 1 smooth on X⇥ ⌦. (A3) The function h(x,!) = max j2J h j (x,!). The set J is a finite index set and for each j2J, h j (·,·) : Z j ⇥ ⌅ j ! R is a convex and smooth function where Z j ✓ R p is a convex open set containing X and⌅ j ✓ R m is a convex open set containing⌦. 99 (A4) The random variable e ! = m ⇤ (x,e ")= ⇤ (x)+ e " where: ⇤ (·) : Z ✓ R p ! R m is a twice-di↵erentiable function, Z is a convex open set containing X, and e " is a bounded random vector with finite variance⌃ ⇤ . Moreover, the regression model is homoscedastic which means that the random error term e " is independent of x. ⇤ We provide some explanations of the above assumptions. The assumption (A1) imposes a deterministic and convex feasible solution set. The objective functionf(x) is characterized as the sum of three functions with the properties given in the assumptions (A2) and (A3). In particular, c(·) and g(·,·) are smooth but potentially nonconvex functions, whereas, h(·,·) is a convex and nonsmooth function due to the structure of finite maximization. Such functions arise in some applications with the simple recourse function or in two-stage models such as two-stage stochastic LPs (SLP). In two-stage SLP, a function h(·) denotes the second-stage value function of a linear program whose right-hand-side is an ane function of ( x,!), and each function piece h j (·) is an ane function associated with some dual feasible extreme point. In the assumption (A4), we assume that the random variable e ! is generated from a homoscedastic regression model with the finite variance so that we are able to estimate the variance in a global manner. Under the above assumptions, we reformulate the problem (P) as minimize x2 X✓ R p f(x)= c(x)+E e " [h(x, ⇤ (x)+e ")]+E e " [g(x, ⇤ (x)+e ")]. (4.2) 4.1.2 Examples The type of decision-dependent uncertainty under the assumption (A4) appears in marketing wherex may denote an advertising plan [71], revenue management wherex may denote a booking policy [19], disaster preplanning where x may denote a preparatory plan [61] and so on. In the following, we discuss two price-sensitive examples which illustrate the need of decision-dependent uncertainty in the stochastic optimization problems. Example 1: Joint production and pricing problem 100 Suppose we have a set of K products to be launched. For each product i,weneedtodecide its price p i and the amount q i to produce at the cost of c 1i per unit produced. The demand d i of each productiisnotonly a↵ected by itsownpricebut alsobythepriceofother products. Weuse bold vectorsp,q,c 1 andd to represent the price, production units, production cost, and demand for K products. Suppose that the demand d is a regression model of the price p in the form that d = ⇤ (p)+e " where e " is a random variable with bounded variance and is independent ofp andq. Due to the variability of the random demand, we have the option of last-minute production at the cost ofc 2 >c 1 per unit to satisfy the demand. 
Moreover, we su↵er the penalty cost of c 3 per unit if the production exceeds the demand. The goal is to decide the price and production quantity which minimize the expected total cost. This problem is modeled as a stochastic programming problem as follows satisfying the assumptions (A1)-(A4). minimize q,p 0 c > 1 q+E d|p [p > d+c > 2 max{dq,0}+c > 3 max{qd,0}] (4.3) Example 2: Joint shipment and pricing problem This example follows the setting in [6] with the minor changes of notations. We consider only one product in a network of M factories to satisfy the demand at N locations. The decision- making problem has two stages. In the first stage, we determine the product’s price p and choose the amount q i to produce and store at each factory i for i=1,...,M at the cost of c 1i per unit produced. In the second stage, the demand d j at each location j realizes for j=1,...,N and we must ship the units of the product to satisfy the demands. Moreover, in the second stage, we have the option of having the last-minute production y i at the factory i at a cost c 2i >c 1i per unit produced and we decide the units z ij to be shipped from factory i to the location j with the cost s ij per unit shipped. Suppose that the random vector d represents the demands at N locations, and it is a regression model of the price p in the form that d = ⇤ (p)+e " where 101 ⇤ (·) : R! R N and e " is a random variable with bounded variance and is independent of p and q. The corresponding two-stage stochastic programming problem is constructed as follows. minimize q,p 0 c > 1 q+E d|p [h(q,d)]E d|p [p P N i=1 d j ] (4.4) where h(q,d) is a recourse function satisfying h(q,d)= minimize y,z 0 c > 2 y+ P M i=1 P N j=1 s ij z ij subject to P M i=1 z ij d j ,j=1,...,N P N j=1 z ij q i +y i ,i=1,...,M Therecoursefunctionh(q,d)isthevaluefunctionofalinearprogram, whichcanbeexpressed as the maximum of finite number of linear functions ofq andd using its dual formulation. Hence the assumptions (A1) - (A4) hold for the joint shipment and pricing problem. Aswecanseefromtheabovetwoexamples, structureofhandg arederivedfromthedecision- making problem of interest and thus can be assumed as known functions. However, we do not havethepriorknowledgeofthedependenceoftherandomvariableonthedecisionvariableexcept via a regression structure, which means that the function ⇤ and the probability distribution of e " are unknown. Note that even though h is a convex function, non-convexity may creep in because the ⇤ (·) could be a nonconvex function. Indeed, ⇤ can be non-convex even if it is obtained by linear regression. For example, if the linear regression uses any interaction terms (sayx 1 x 2 ), then, the resulting ⇤ could be non-convex. This results in a task of solving an implicitly-composite, nonsmooth and potentially nonconvex optimization problem which is much more challenging to obtain an approximate optimal solution than the well-defined composite optimization problem (see [25]). To seek an approximate solution of (4.2), we will estimate the conditional expectation using a data-driven approach. 102 4.1.3 Literature review In stochastic programming, the uncertainty is categorized as exogenous uncertainty which is not a↵ected by the decision, and endogenous uncertainty which is equivalent to the term of decision- dependent uncertainty. 
As we can see from the above two examples, the structures of $h$ and $g$ are derived from the decision-making problem of interest and thus can be assumed to be known functions. However, we have no prior knowledge of the dependence of the random variable on the decision variable except via a regression structure; that is, the function $\lambda^*$ and the probability distribution of $\tilde\varepsilon$ are unknown. Note that even though $h$ is a convex function, nonconvexity may creep in because $\lambda^*(\cdot)$ could be a nonconvex function. Indeed, $\lambda^*$ can be nonconvex even if it is obtained by linear regression: for example, if the linear regression uses any interaction terms (say $x_1 x_2$), then the resulting $\lambda^*$ could be nonconvex. This leaves the task of solving an implicitly composite, nonsmooth and potentially nonconvex optimization problem, for which obtaining an approximately optimal solution is much more challenging than for a well-defined composite optimization problem (see [25]). To seek an approximate solution of (4.2), we estimate the conditional expectation using a data-driven approach.

4.1.3 Literature review

In stochastic programming, uncertainty is categorized as exogenous uncertainty, which is not affected by the decision, and endogenous uncertainty, which is synonymous with decision-dependent uncertainty. Furthermore, Dupačová [26] categorized endogenous uncertainty into two sub-classes, in which the decision can either affect the probability distribution of the uncertainty or affect the time when the information about the uncertainty is revealed.

Within the literature on stochastic programming with endogenous uncertainty, decision-dependent information revelation has received the most attention. Goel and Grossmann [32] formulate this class of decision-making problems as a mixed-integer disjunctive programming problem using a scenario tree and non-anticipativity constraints that depend on binary decision variables. For an extensive review of decision-dependent information revelation, we refer the interested reader to [35]. For decision-dependent probability distributions, the literature is quite sparse, since nonconvexity can easily creep in. Dupačová [26] addresses two classes of problems, where the decision either affects the parameters of a parametric probability distribution or influences the choice of a probability distribution from a finite set of probability distributions. Hellemo et al. [35] add another class of problems, in which the probabilities of the uncertainty are distorted by the decision variables. To the best of our knowledge, current studies of SP with endogenous uncertainty all take the dependence of the uncertainty on the decision variable as prior knowledge.

Latently decision-dependent uncertainty receives more attention in operations management (OM) models. In revenue management, Cooper et al. [19] show that if the dependence of demand on price is ignored in the decision maker's model, this modeling error can lead to a systematic deterioration of decisions made from predictions fitted to the data observed over time. In [20], Cooper et al. further investigate pricing games between two sellers in which each seller estimates the monopoly demand for pricing; they show that the estimated monopoly models implicitly incorporate the effects of competition in some settings. In dynamic pricing, to deal with implicitly price-dependent demand, there is a growing literature ([2], [15], [22]) that combines demand learning and pricing to control the regret; we refer to [21] for a comprehensive review of research in dynamic pricing. Most of the work in revenue management assumes known or parametric demand functions, while a few works ([7], [11]) address nonparametric demand models. In newsvendor-type problems, the optimal decision for the decision maker's model is given by a particular quantile of the estimated distribution. Utilizing this quantile structure, Lee et al. [50] design an iterative decision process with a forecasting model, and Harsha et al. [34] develop a practical framework for the price-setting newsvendor problem including statistical estimation and price optimization. In these works, the integration of learning and optimization relies on the special structure of the operations management models.

For a general class of continuous stochastic programming problems with latently decision-dependent uncertainty, several methods could be used to account for the latent dependence.

Predict-then-Optimize (PO). To account for the latent dependence, we take the decision variable as the predictor and the uncertainty as the response. In the PO paradigm, one sequentially makes the prediction and then solves the SP problem equipped with the predicted probability model.
For instance, in revenue management it is common to approximate the demand curve by a linear function of price (see [15]) and then feed the approximate linear demand curve into the pricing problem. Because the PO paradigm combines prediction and optimization, the optimality of the derived solution depends not only on the numerical algorithm for optimization but also on the prediction model. This leads to a trade-off in the PO approach: a) the dependence may not be accurately estimated by parametric regression models, which could result in an ineffective solution; b) with more accurate nonparametric models, the composite SP problems can become nonsmooth and highly nonconvex, and thus challenging to solve. Hence, we need a "smart" way to balance the trade-off between the accuracy of predictive models and the complexity of composite optimization problems. Elmachtoub and Grigas [29] develop the "smart" predict-then-optimize (SPO) approach, which modifies the loss function of prediction models for exogenous uncertainty using the objective function of the optimization problem. However, the SPO approach applies only to a special class of linear programs with uncertain cost, where the cost vector is a response variable of exogenous predictors.

Derivative-free methods can be used without knowledge of the function structure. In [31], a zeroth-order method is developed to compute a critical point of a smooth stochastic programming problem with decision-dependent uncertainty. Another applicable derivative-free method is the trust region method based on derivative-free models. The classical trust region method iteratively solves a subproblem using a second-order approximation in a trust region centered at the proposed solution. As the algorithm proceeds, the trust region radius is either contracted or expanded depending on the ratio of actual-to-predicted reduction in the objective value; we refer to Conn et al. [18] for a comprehensive study of trust region methods. Without derivative or second-order information about the objective function, one can construct a random trust region model using either interpolation or regression models over the decision variables. Convergence of such TR methods depends on the statistical accuracy of the random trust region model. Recent studies ([3], [8], [13], [33], [47]) give conditions on random trust region functions for the first-order and second-order convergence of various trust region methods for smooth, unconstrained optimization problems. Other derivative-free approaches for the problem of interest include Bertsimas and Kallus [6], in which the conditional expectation of a random function is approximated using ML methods including LOESS, regression trees and random forests. In the case of decision-dependent uncertainty, [6] proposes to discretize the space of continuous decision variables and aggregate the optimal values of the subproblems over all discretized decisions. Discretization overcomes the issue of nonconvexity, but results in exponential growth in the size of the decision-making problem.

For the SP problem with latently decision-dependent uncertainty, we develop the CLEO algorithm, which is presented in the following sections. It can be viewed as a derivative-free trust region method with local linear regression.
The difference from the current literature on derivative-free trust region methods, such as [13], is that the proposed TR-based method is adapted to a nonsmooth, composite, constrained stochastic programming problem. The other work relevant to our approach is [47], which develops a trust region framework with regression models built from data. However, the setting considered there differs from ours because it has no decision-dependent uncertainty and thus no latently composite structure.

CLEO has an iterative scheme of learning steps, which predict the latent dependence in a local region; optimization steps, which generate the next candidate solution; and update steps, which decide the center and the radius of the next local region. Specifically, in the learning step, a local linear regression (LLR) model is fitted for prediction. In the optimization step, the trust region method with a random LLR model is used for optimization. In the update step, the radius of the next local region, which is also the bandwidth of the next local regression model, follows the trust region radius rule. Through the update steps, the radius for regression automatically adjusts to the ratio of actual-to-predicted reduction in optimization, thus providing a bridge between the local learning models and the optimization subproblems.

Local regression could use higher-order functions and kernel-based weights to attain higher accuracy, but this would result in a more demanding nonconvex composite subproblem. In an effort to control the complexity of each iteration, we use simple local linear regression without much loss of accuracy. This novel interaction between learning and optimization in the CLEO algorithm leads to almost sure convergence to a stationary solution of what may be a highly nonlinear, nonconvex and nonsmooth constrained optimization problem. To the best of our knowledge, the CLEO algorithm is the first provably convergent algorithm for coupled learning and optimization that controls the overall complexity in seeking a near-optimal decision.

4.1.4 Notations

In the following analysis, we use $1_{p \times q}$ to denote the matrix of size $p \times q$ with all-one entries, $1_p$ the vector of size $p$ with all-one entries, $0_{p \times q}$ the matrix of size $p \times q$ with all-zero entries, and $I_p$ the identity matrix of size $p$. For a vector $v \in \mathbb{R}^m$, $\|v\|$ denotes the Euclidean norm. For a matrix $M \in \mathbb{R}^{m \times n}$, $\|M\|$ denotes the matrix norm induced by the Euclidean vector norm. We follow the classical $O(\cdot)$ and $o(\cdot)$ notation for asymptotic behavior. When addressing the probability distributions of random variables, we use $\mathcal{MN}_{m,n}(M,U,V)$ to denote the matrix normal distribution and $\mathcal{U}(S)$ to denote the uniform distribution on the set $S$.

The rest of this chapter is organized as follows. In section 4.2 we present the CLEO algorithm, and in section 4.3 we analyze its sequential convergence to a stationary point. In section 4.4 we present several numerical experiments with the CLEO algorithm.

4.2 Coupled Learning Enabled Optimization (CLEO) algorithm

Notice that problem (4.2) is potentially a nonconvex optimization problem. In this chapter, we limit our scope to seeking a stationary point, thereby relieving the computational burden of searching for a global or local optimum. Since the composition $h(x, \lambda^*(x)+\tilde\varepsilon)$ of a convex function with a differentiable map is Clarke regular and locally Lipschitz continuous, we can compute the directional derivative of $f$ by [25] and Theorem 7.44 in [76].
For any $x \in X$ and any $d \in T(x;X)$, where $T(x;X)$ denotes the tangent cone of the set $X$ at $x$, the directional derivative of $f$ at $x$ in the direction $d$ is

\[
f'(x; d) = \nabla c(x)^\top d + \mathbb{E}_{\tilde\varepsilon}\big[h'\big((x, \lambda^*(x)+\tilde\varepsilon); (d, \nabla\lambda^*(x)^\top d)\big)\big] + \mathbb{E}_{\tilde\varepsilon}\big[\nabla g(x, \lambda^*(x)+\tilde\varepsilon)^\top (d, \nabla\lambda^*(x)^\top d)\big]. \tag{4.5}
\]

Accordingly, the directional stationarity of (4.2) is defined as follows.

Definition 4.1. We say that $\bar x$ is a directional stationary point of the problem (4.2) if $f'(\bar x;\, x - \bar x) \ge 0$ for any $x \in X$. Moreover, the directional stationarity of $\bar x$ is equivalent to $\psi(\bar x) = 0$, where

\[
\psi(x) := \Big|\; \underset{x+d \,\in\, X,\; \|d\| \,\le\, 1}{\text{minimize}} \; f'(x; d) \;\Big|. \tag{4.6}
\]

Since $\lambda^*(\cdot)$ and the probability distribution of $\tilde\varepsilon$ are not available a priori, our strategy couples the estimation of a regression function from data pairs $(x,\omega)$ with the optimization in (4.2). In the CLEO algorithm, we iteratively conduct a learning step, which estimates the local dependence using local linear regression, and an optimization step, which seeks a candidate solution of the composite trust region subproblem using the TR method. In the rest of this section, we discuss the LLR model and the composite trust region subproblem in turn. We assume that we are able to obtain data pairs in the local region as the algorithm proceeds.

In the learning step, we consider a hypothesis class of affine functions, i.e.,

\[
\mathcal{H} := \big\{\lambda(\cdot): \mathbb{R}^p \to \mathbb{R}^m \;\big|\; \lambda(x) = (B^1)^\top x + (B^0)^\top, \text{ with } B^1 \in \mathbb{R}^{p \times m},\, B^0 \in \mathbb{R}^{1 \times m}\big\}.
\]

At the $k$th iteration, given the current point $\hat x^k$ and the current trust region radius $\delta_k$, suppose we draw the data set $T_k = \{(x^i, \omega^i)\}$ of size $N_k$ in $B(\hat x^k, \delta_k) \cap X$, the intersection of the ball $B(\hat x^k, \delta_k)$ centered at $\hat x^k$ with radius $\delta_k$ and the feasible set $X$. Among the functions in the hypothesis class $\mathcal{H}$, we seek the one which minimizes the sum of squared errors on the data set $T_k$. While one could use a kernel to weight the data points according to their distance from the current point, the simplest variant is to use a hard constraint requiring the data to belong to $T_k$. The estimates of the parameters $\hat B^{k,1}$, $\hat B^{k,0}$ and the residuals $\{e^{k,i}\}$ are constructed as follows:

\[
\{\hat B^{k,0}, \hat B^{k,1}\} \in \underset{B^0,\, B^1}{\text{argmin}} \sum_{(x^i, \omega^i) \in T_k} \big\|\omega^i - (B^1)^\top x^i - (B^0)^\top\big\|^2, \tag{4.7}
\]
\[
e^{k,i} = \omega^i - (\hat B^{k,1})^\top x^i - (\hat B^{k,0})^\top, \quad i = 1, 2, \dots, N_k. \tag{4.8}
\]

Let $\hat\varepsilon_k$ denote the random variable with the empirical probability distribution of $\{e^{k,i}\}_{i=1}^{N_k}$. The LLR model $m_k$ is constructed as

\[
m_k(\hat x^k + s,\, \hat\varepsilon_k) := (\hat B^{k,1})^\top (\hat x^k + s) + (\hat B^{k,0})^\top + \hat\varepsilon_k. \tag{4.9}
\]

The TR model $f_k$ is then constructed as

\[
f_k(\hat x^k + s) := c(\hat x^k + s) + \mathbb{E}_{\hat\varepsilon_k}\big[h(\hat x^k + s,\, m_k(\hat x^k + s, \hat\varepsilon_k)) + g(\hat x^k + s,\, m_k(\hat x^k + s, \hat\varepsilon_k))\big]. \tag{4.10}
\]

From an intuitive perspective, the LLR model $m_k$ is a first-order probabilistically accurate approximation of the latent dependence given a sufficient number of samples. By the composite structure, the trust region model $f_k$ is then a probabilistically accurate approximation of the original objective function. This accuracy criterion will be formally discussed in the convergence analysis of section 4.3. We formulate the process of seeking a solution that minimizes the TR model $f_k$ in a neighborhood of the current point $\hat x^k$ as the following trust region subproblem:

\[
(P_k) \qquad \underset{s}{\text{minimize}} \;\; f_k(\hat x^k + s) \quad \text{subject to} \;\; \hat x^k + s \in X, \;\; \|s\| \le \delta_k. \tag{4.11}
\]

Note that under assumptions (A2) and (A3), $f_k$ is the sum of two smooth functions and a convex nonsmooth function.
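The learning step (4.7)-(4.10) reduces to an ordinary least-squares fit followed by an empirical average over the residuals. The following sketch shows one way to implement it; the placeholder callables `c`, `h` and `g` stand for the known problem functions and are our own naming, not part of the original text.

```python
import numpy as np

def fit_llr(X, W):
    """Least-squares fit of the local linear model, as in (4.7)-(4.8).

    X: (N, p) local decisions, W: (N, m) observed responses.
    Returns (B1, B0, E) with W approx. X @ B1 + B0, and residuals E."""
    Xhat = np.hstack([np.ones((len(X), 1)), X])       # rows [1, (x^i)^T]
    coef, *_ = np.linalg.lstsq(Xhat, W, rcond=None)
    B0, B1 = coef[0], coef[1:]                        # B0: (m,), B1: (p, m)
    E = W - Xhat @ coef                               # residuals e^{k,i}
    return B1, B0, E

def tr_model(x, B1, B0, E, c, h, g):
    """Empirical TR model f_k(x) of (4.10): the expectation over the
    empirical residual distribution is a plain average."""
    preds = x @ B1 + B0 + E                           # draws of m_k(x, eps_hat)
    return c(x) + np.mean([h(x, w) + g(x, w) for w in preds])
```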
If the smooth part is convex, the trust region subproblem (4.11) is a convex problem, and a global optimum can be found by many available numerical algorithms, such as the stochastic approximation algorithm [54], the stochastic decomposition algorithm [36], and others. If the smooth part is nonconvex, we can compute a suitable point along the steepest descent direction that satisfies a "sufficient" decrease condition. Specifically, following the construction in [18], for any given point $\hat x^k \in X$, we say $s^k$ is a suitable step of $(P_k)$ if $\hat x^k + s^k \in X$ and $\|s^k\| \le \delta_k$; moreover, there exist positive constants $\kappa_{dcp} \in (0,1)$, $\delta_0 > 0$ and $\epsilon > 0$ such that

\[
f_k(\bar x) - f_k(\bar x + s^k) \ge \kappa_{dcp}\, \psi_k(\bar x)\, \min\{\delta_k, \delta_0\}, \quad \text{for any } \bar x \in B(\hat x^k, \epsilon) \cap X, \tag{4.12}
\]

where

\[
\psi_k(x) := \Big|\; \underset{x+d \,\in\, X,\; \|d\| \,\le\, 1}{\text{minimize}} \; f_k'(x; d) \;\Big|. \tag{4.13}
\]

Lemma 11.4 in [18] implies that the "sufficient" decrease condition (4.12) can be satisfied by a step to the boundary along the steepest descent direction, and it is also satisfied by any other step that produces a sufficient reduction in the model. Condition (4.12) is all we need for a trial step of the trust region subproblem $(P_k)$. The method to derive such a point is beyond the scope of this work; we refer the interested reader to section 12.2 in Conn et al. [18].

Under assumptions (A1)-(A4), we present the CLEO scheme in Algorithm 3. At each iteration, the CLEO algorithm successively constructs an LLR model $m_k$ and a trust region model $f_k$ following lines 3-5 of Algorithm 3. Because of modeling error, the suitable step $s^k$ computed in line 6 needs to be verified according to the trust region update rule. Specifically, in lines 7-9, we compute the value $\rho_k$ to estimate the ratio of actual-to-predicted reduction through two local regression models $m_{k,1}$ and $m_{k,2}$. Depending on the values of $\rho_k$ and $\psi_k$, the suitable step is accepted with an expanded trust region radius or rejected with a shrunken one, according to the update rule in lines 10-13. In the next iterate, the bandwidth of the LLR model is taken to be the updated trust region radius. This means that if the current local linear regression model does not fit the local dependence well, the resulting ratio $\rho_k$ leads to a smaller bandwidth, so that the next LLR model is likely to be more appropriate in a smaller region. In this way, the CLEO algorithm connects the predictive model with the optimization problem through a "smart" bandwidth choice. This connection automatically balances the trade-off between the exploration of LLR models and the exploitation of TR subproblems.

Algorithm 3 CLEO
1: Initialization: $\hat x^0 \in X$, $\delta_0 \in (0, \delta_{\max})$ with $\delta_{\max} > 0$, $\gamma > 1$, $\eta_1 \in (0,1)$, $\eta_2 > 0$.
2: for $k = 0, 1, 2, \dots$ do
3: draw samples $\{u^i\}$ from the uniform distribution $\mathcal{U}(B(0_p, 1))$. Let $\{x^i\} = \{\delta_k u^i + \hat x^k\} \cap X$ and let $T_k := \{(x^i, \omega^i)\}$ be the associated data set.
4: construct a local linear regression model $m_k(\hat x^k + s, \tilde\varepsilon)$ by (4.9) with the data set $T_k$.
5: construct the trust region model $f_k(\hat x^k + s)$ by (4.10).
6: compute a suitable step $s^k$ of the TR subproblem (4.11) satisfying condition (4.12).
7: draw sets of data pairs $S_k$ and $S_{k+1/2}$ uniformly in the regions $B(\hat x^k, \|s^k\|/2) \cap X$ and $B(\hat x^k + s^k, \|s^k\|/2) \cap X$, respectively.
8: construct a local regression model $m_{k,1}(x, \varepsilon)$ by fitting the data set $S_k$ and compute $v_k = c(\hat x^k) + \mathbb{E}_{\tilde\varepsilon}[h(\hat x^k, m_{k,1}(\hat x^k, \tilde\varepsilon))]$. Similarly, construct the model $m_{k,2}(x, \varepsilon)$ using the data set $S_{k+1/2}$ and compute $v_{k+1/2} = c(\hat x^k + s^k) + \mathbb{E}_{\tilde\varepsilon}[h(\hat x^k + s^k, m_{k,2}(\hat x^k + s^k, \tilde\varepsilon))]$.
9: compute $\rho_k = (v_k - v_{k+1/2}) \big/ \big(f_k(\hat x^k) - f_k(\hat x^k + s^k)\big)$ and compute $\psi_k(\hat x^k)$ by (4.13).
10: if $\rho_k \ge \eta_1$ and $\psi_k(\hat x^k) \ge \eta_2\, \delta_k$ then
11: $\hat x^{k+1} = \hat x^k + s^k$, $\delta_{k+1} = \min\{\gamma\, \delta_k, \delta_{\max}\}$
12: else
13: $\hat x^{k+1} = \hat x^k$, $\delta_{k+1} = \gamma^{-1}\, \delta_k$
14: end if
15: end for
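The following skeleton mirrors the control flow of Algorithm 3, reusing `fit_llr` and `tr_model` from the previous sketch. The callables `sample_pairs`, `suitable_step`, `psi_k` and `value_estimate` are placeholders for the sampling oracle, the subproblem solver of (4.11)-(4.12), the stationarity measure (4.13) and the value-estimation step of lines 7-8; this is a schematic outline under these assumptions, not a definitive implementation.

```python
import numpy as np

def cleo(x0, delta0, delta_max, gamma, eta1, eta2, n_iters,
         sample_pairs, suitable_step, psi_k, value_estimate, c, h, g):
    x, delta = x0, delta0
    for k in range(n_iters):
        X, W = sample_pairs(center=x, radius=delta)           # line 3
        B1, B0, E = fit_llr(X, W)                             # line 4, (4.9)
        fk = lambda z: tr_model(z, B1, B0, E, c, h, g)        # line 5, (4.10)
        s = suitable_step(fk, x, delta)                       # line 6, (4.12)
        r = np.linalg.norm(s) / 2.0
        vk, vk_half = value_estimate(x, r), value_estimate(x + s, r)  # 7-8
        rho = (vk - vk_half) / (fk(x) - fk(x + s))            # line 9
        if rho >= eta1 and psi_k(fk, x) >= eta2 * delta:      # line 10
            x, delta = x + s, min(gamma * delta, delta_max)   # line 11
        else:
            delta = delta / gamma                             # line 13
    return x
```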
4.3 Convergence analysis of the CLEO algorithm

In this section, we show that the sequence $\{\hat x^k\}$ produced by the CLEO algorithm converges to a stationary point of (4.2) with probability 1. Intuitively, convergence can hold only if the trust region models $\{f_k\}$ are probabilistically accurate within the trust region. The accuracy condition required for convergence is formalized as probabilistic full linearity, following the literature such as [3] and [13]. The convergence analysis comprises two parts. In section 4.3.1, we first show that, under a size requirement on the trust region sample data, the random TR model $f_k$ is probabilistically fully linear. Using this property, we then show in section 4.3.2 the sequential convergence of the CLEO algorithm to a stationary point with probability 1. We observe that the extension of the convergence theory to the CLEO setting is nontrivial: in contrast with [13], which deals with an unconstrained smooth optimization problem, we carry out the analysis for stochastic programming with decision-dependent uncertainty, which is a constrained, composite, and nonsmooth optimization problem.

4.3.1 Probabilistically fully linear property of TR model

In the following analysis, when describing a random process in the CLEO algorithm, we use uppercase letters, such as the $k$th iterate $\hat X^k$, to denote random variables, and lowercase letters to denote realizations of the random variables, such as $\hat x^k$, the $k$th iterate in a particular realization of the algorithm. The same convention applies to the LLR models $\{m_k\}, \{M_k\}$, the TR random models $\{f_k\}, \{F_k\}$ and the random value estimates $\{v_k, v_{k+1/2}\}, \{V_k, V_{k+1/2}\}$. Hence, our algorithm generates a stochastic process $\{M_k, F_k, X^k, S^k, \delta_k, V_k, V_{k+1/2}\}$. Our goal is to show that, under certain conditions on the sequences $\{F_k\}$ and $\{V_k, V_{k+1/2}\}$, the resulting stochastic process has the desired convergence properties almost surely. The concept of an $\alpha$-probabilistically $\kappa$-fully linear approximation is needed for the convergence analysis.

Definition 4.2. Let $f: Z \to \mathbb{R}$ and $f_k: Z \to \mathbb{R}$ be locally Lipschitz continuous and directionally differentiable functions defined on an open set $Z \subseteq \mathbb{R}^p$ containing the convex compact set $X$. We say that the function $f_k$ is a $\kappa$-fully linear model of $f$ on $B(\hat x^k, \delta_k)$, with $\kappa = (\kappa_{ef}, \kappa_{eg})$ and $\kappa_{ef}, \kappa_{eg} \ge 0$, if for any $x \in B(\hat x^k, \delta_k) \cap X$,

\[
|\psi(x) - \psi_k(x)| \le \kappa_{eg}\, \delta_k, \qquad |f(x) - f_k(x)| \le \kappa_{ef}\, \delta_k^2,
\]

where $\psi(x)$ is defined in (4.6) and $\psi_k(x)$ is defined in (4.13).

Definition 4.3. Let $f: Z \to \mathbb{R}$ be a locally Lipschitz continuous and directionally differentiable function defined on an open set $Z \subseteq \mathbb{R}^p$ containing the convex compact set $X$. Let $\{F_k: Z \to \mathbb{R}\}$ be a sequence of random functions, each of which is locally Lipschitz continuous and directionally differentiable almost surely. The sequence $\{F_k\}$ is said to be $\alpha$-probabilistically $\kappa$-fully linear with respect to the corresponding sequence $\{B(\hat X^k, \delta_k)\}$ if and only if, for a scalar $\alpha \in (0,1)$, a constant vector $\kappa \in \mathbb{R}^2_+$ and any $k \ge 1$, the event

\[
I_k = \{F_k \text{ is a } \kappa\text{-fully linear model of } f \text{ on } B(\hat X^k, \delta_k)\}
\]

satisfies the condition $\mathbb{P}(I_k \mid \mathcal{F}_{k-1}) \ge \alpha$, where $\mathcal{F}_{k-1}$ denotes the $\sigma$-algebra generated by $\{F_i, V_i, V_{i+1/2}\}_{i=0}^{k-1}$.
In addition to the probabilistic accuracy condition for the TR random models, we require sufficiently accurate value estimates.

Definition 4.4. The value estimates $v_k$ and $v_{k+1/2}$ are $\epsilon_F$-accurate estimates of $f(\hat x^k)$ and $f(\hat x^k + s^k)$, respectively, for a given $\delta_k$ if and only if

\[
|v_k - f(\hat x^k)| \le \epsilon_F\, \delta_k^2, \qquad |v_{k+1/2} - f(\hat x^k + s^k)| \le \epsilon_F\, \delta_k^2.
\]

Definition 4.5. A sequence of random value estimates $\{(V_k, V_{k+1/2})\}$ is said to be $\beta$-probabilistically $\epsilon_F$-accurate with respect to the corresponding sequence $\{X^k, \delta_k, S^k\}$ if, for nonnegative constants $\beta \in [0,1]$ and $\epsilon_F$ and any $k \ge 1$, the event

\[
J_k = \{V_k, V_{k+1/2} \text{ are } \epsilon_F\text{-accurate estimates of } f(\hat x^k) \text{ and } f(\hat x^k + s^k), \text{ respectively, for } \delta_k\}
\]

satisfies the condition $\mathbb{P}(J_k \mid \mathcal{F}_{k-1/2}) \ge \beta$, where $\mathcal{F}_{k-1/2}$ denotes the $\sigma$-algebra generated by $\{F_i\}_{i=1}^{k}$ and $\{V_i, V_{i+1/2}\}_{i=0}^{k-1}$.

With Definitions 4.3 and 4.5 in hand, we aim to show that the trust region models $\{F_k\}$ constructed in (4.11) satisfy the $\alpha$-probabilistically $\kappa$-fully linear property for $\alpha \in (0,1)$ and $\kappa \in \mathbb{R}^2_+$, and that the value estimates $\{(V_k, V_{k+1/2})\}$ satisfy the $\beta$-probabilistic accuracy property for $\beta \in (0,1)$. In [47], the authors analyze the probabilistically fully linear property of regression models under the strongly $\Lambda$-poised condition and the Markov inequality. Our analysis differs from theirs in that we establish the probabilistically fully linear property of the LLR estimators using the Law of Large Numbers and the Berry-Esseen Theorem under the same strongly $\Lambda$-poised condition.

For the linear regression model on the set $T_k = \{x^i\} \subseteq \mathbb{R}^p$ of size $N_k$, we introduce the definitions of poisedness and strong $\Lambda$-poisedness. Let

\[
\hat X_k := \begin{pmatrix} 1 & (x^1)^\top \\ \vdots & \vdots \\ 1 & (x^{N_k})^\top \end{pmatrix}.
\]

Definition 4.6. The set $T_k = \{x^i\} \subseteq \mathbb{R}^p$ of size $N_k$ is poised for the linear regression model if the matrix $\hat X_k$ has full column rank.

If $T_k$ is poised, then the least-squares linear regression is unique. However, poisedness alone is not sufficient to derive the "uniform" error bounds needed in the convergence analysis. Since the number of samples is allowed to grow arbitrarily large, we introduce strong $\Lambda$-poisedness, which imposes an upper bound on the Lagrangian polynomials $\ell(z)$ in the $\ell_2$ norm.

Definition 4.7. Let a scalar $\Lambda > 0$ and a set $B \subseteq \mathbb{R}^p$ be given. A poised set $T_k = \{x^i\} \subseteq \mathbb{R}^p$ of size $N_k$ is said to be strongly $\Lambda$-poised for the linear regression model in $B$ if

\[
\sqrt{N_k}\, \max_{z \in B} \|\ell(z)\| \le \Lambda, \quad \text{where the Lagrangian polynomials are} \quad \ell(z) := \hat X_k \big((\hat X_k)^\top \hat X_k\big)^{-1} \begin{pmatrix} 1 \\ z \end{pmatrix}.
\]

We further assume the following bounds on the first- and second-order derivatives to obtain the probabilistic accuracy of the local linear regression models.

(A5) The smooth functions $g(\cdot)$ and $\{h_j(\cdot)\}$ have Lipschitz continuous gradients on $X$. Moreover, there exists a constant $K_1$ such that $\max_{x \in X} \|\nabla_{xx} \lambda^*(x)\| \le K_1$.

Proposition 4.1. Under assumptions (A4) and (A5), given the feasible point $\hat x^k \in X$ and the trust region radius $\delta_k > 0$, suppose we have the data set $T_k = \{(x^i, \omega^i)\}$ of size $N_k$, where $\{x^i\}$ are independently and uniformly generated in the trust region $B(\hat x^k, \delta_k) \cap X$ and the set $\{x^i\}$ is strongly $\Lambda$-poised in $B(\hat x^k, \delta_k) \cap X$. Let $\hat B^{k,1}$ and $\hat B^{k,0}$ be the least-squares estimators of the local linear regression model following (4.7) and (4.8). Then the following results hold.

(a) For any $0 < \alpha < 1$, there exists a constant $\kappa_{eg} > 0$ such that when $N_k \ge \max\{O(\delta_k^{-4} \kappa_{eg}^{-2}),\, O(\alpha^{-2})\}$,

\[
\mathbb{P}\big(\|\nabla \lambda^*(x) - \hat B^{k,1}\| \le \kappa_{eg}\, \delta_k, \;\forall x \in B(\hat X^k, \delta_k) \cap X \,\big|\, \mathcal{F}_{k-1}\big) \ge 1 - \alpha.
\]

(b) For any $0 < \alpha < 1$, there exists a constant $\kappa_{ef} > 0$ such that when $N_k \ge O(\delta_k^{-4} \kappa_{ef}^{-2} \alpha^{-1})$,

\[
\mathbb{P}\big(\|\lambda^*(x) - (\hat B^{k,1})^\top x - (\hat B^{k,0})^\top\| \le \kappa_{ef}\, \delta_k^2, \;\forall x \in B(\hat X^k, \delta_k) \cap X \,\big|\, \mathcal{F}_{k-1}\big) \ge 1 - \alpha.
\]

Proof. See Appendix.
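The strong $\Lambda$-poisedness constant of Definition 4.7 can be estimated numerically. The sketch below approximates $\sqrt{N_k}\max_{z\in B}\|\ell(z)\|$ by maximizing over a finite set of test points rather than the whole ball, which is a simplification introduced here purely for illustration.

```python
import numpy as np

def poisedness_constant(X, test_points):
    """Approximate sqrt(N) * max_z ||l(z)|| from Definition 4.7.

    X: (N, p) sample sites; the max over the ball is approximated
    on a finite set of test points."""
    N = len(X)
    Xhat = np.hstack([np.ones((N, 1)), X])
    G = np.linalg.inv(Xhat.T @ Xhat)          # requires a poised sample set
    worst = 0.0
    for z in test_points:
        l_z = Xhat @ (G @ np.concatenate([[1.0], z]))  # Lagrangian polynomials
        worst = max(worst, np.linalg.norm(l_z))
    return np.sqrt(N) * worst

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(30, 2))      # sample sites around the center
Z = rng.uniform(-1.0, 1.0, size=(200, 2))     # test points
print(poisedness_constant(X, Z))              # moderate value => well poised
```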
Due to the composite structure, the estimation error of the TR model and of its directional stationarity measure can be related to the estimation error of the LLR parameters.

Proposition 4.2. Suppose that assumptions (A1)-(A5) hold for the composite SP problem (4.2). At the $k$th iteration of the CLEO algorithm, given the feasible point $\hat x^k \in X$ and the trust region radius $\delta_k > 0$, suppose we have the data set $T_k = \{(x^i, \omega^i)\}$ of size $N_k$, where $\{x^i\}$ are independently and uniformly generated in the trust region $B(\hat x^k, \delta_k) \cap X$ and the set $\{x^i\}$ is strongly $\Lambda$-poised in $B(\hat x^k, \delta_k) \cap X$.

(a) There exists a constant $\kappa_{ef} > 0$ such that when $N_k \ge \max\{O(\delta_k^{-4} \kappa_{ef}^{-4}),\, O(\alpha^{-2})\}$,

\[
\mathbb{P}\Big(\sup_{x \in B(\hat x^k, \delta_k) \cap X} |f(x) - f_k(x)| \ge \kappa_{ef}\, \delta_k^2 \,\Big|\, \mathcal{F}_{k-1}\Big) \le \frac{\alpha}{2}.
\]

(b) There exists a constant $\kappa_{eg} > 0$ such that when $N_k \ge \max\{O(\delta_k^{-2} \kappa_{eg}^{-2}),\, O(\alpha^{-2})\}$,

\[
\mathbb{P}\Big(\sup_{x \in B(\hat x^k, \delta_k) \cap X} |\psi(x) - \psi_k(x)| \ge \kappa_{eg}\, \delta_k \,\Big|\, \mathcal{F}_{k-1}\Big) \le \frac{\alpha}{2}.
\]

From (a) and (b), we can derive that there exists a constant vector $\kappa = (\kappa_{ef}, \kappa_{eg})$ such that when $N_k \ge O(\max\{\delta_k^{-4},\, \alpha^{-2}\})$,

\[
\mathbb{P}\big(|f(x) - F_k(x)| \le \kappa_{ef}\, \delta_k^2,\; |\psi(x) - \psi_k(x)| \le \kappa_{eg}\, \delta_k, \;\forall x \in B(\hat x^k, \delta_k) \cap X \,\big|\, \mathcal{F}_{k-1}\big) \ge 1 - \alpha.
\]

Proof. By the definition of $f(x)$ in (4.2),

\[
f(x) - f_k(x) = \Big(\mathbb{E}_{\tilde\varepsilon}[h(x, \lambda^*(x)+\tilde\varepsilon)] - \frac{1}{N_k}\sum_{i=1}^{N_k} h(x, m_k(x, e^{k,i}))\Big) + \Big(\mathbb{E}_{\tilde\varepsilon}[g(x, \lambda^*(x)+\tilde\varepsilon)] - \frac{1}{N_k}\sum_{i=1}^{N_k} g(x, m_k(x, e^{k,i}))\Big).
\]

Because of the Lipschitz continuity of $h(z, \cdot)$ and $g(z, \cdot)$, there exists a constant $L_1$ such that

\[
|f(x) - f_k(x)| \le |\tau_k| + |\xi_k| + L_1\, \|\lambda^*(x) - (\hat B^{k,1})^\top x - (\hat B^{k,0})^\top\|,
\]

where

\[
\tau_k = \mathbb{E}_{\tilde\varepsilon}[h(x, \lambda^*(x)+\tilde\varepsilon)] - \frac{1}{N_k}\sum_{i=1}^{N_k} h(x, \lambda^*(x)+e^{k,i}), \qquad
\xi_k = \mathbb{E}_{\tilde\varepsilon}[g(x, \lambda^*(x)+\tilde\varepsilon)] - \frac{1}{N_k}\sum_{i=1}^{N_k} g(x, \lambda^*(x)+e^{k,i}).
\]

Since $\{e^{k,i}\}$ are i.i.d. samples of the random variable $\tilde\varepsilon$, the Berry-Esseen Theorem implies that, at rate $O(1/\sqrt{N_k})$, $\sqrt{N_k}\, \tau_k \xrightarrow{d} \mathcal{N}(0, V_h)$, where $V_h = \mathrm{Var}_{\tilde\varepsilon}\big(h(x, \lambda^*(x)+\tilde\varepsilon)\big)$. Similarly, at rate $O(1/\sqrt{N_k})$, $\sqrt{N_k}\, \xi_k \xrightarrow{d} \mathcal{N}(0, V_g)$, where $V_g = \mathrm{Var}_{\tilde\varepsilon}\big(g(x, \lambda^*(x)+\tilde\varepsilon)\big)$.

Consider the stochastic process $F_k$ given $\mathcal{F}_{k-1}$. By Proposition 4.1, there exists a constant $\kappa_{ef} > 0$ such that when $N_k \ge \max\{O(\delta_k^{-4} \kappa_{ef}^{-4}),\, O(\alpha^{-2})\}$,

\[
\begin{aligned}
&\mathbb{P}\big(|f(x) - f_k(x)| \ge \kappa_{ef}\, \delta_k^2 \text{ for some } x \in B(\hat x^k, \delta_k) \cap X \,\big|\, \mathcal{F}_{k-1}\big) \\
&\le \mathbb{P}\big(|\tau_k| + |\xi_k| + L_1 \|\lambda^*(x) - (\hat B^{k,1})^\top x - (\hat B^{k,0})^\top\| \ge \kappa_{ef}\, \delta_k^2 \text{ for some } x \,\big|\, \mathcal{F}_{k-1}\big) \\
&\le \mathbb{P}\big(|\tau_k| \ge \tfrac{1}{3} \kappa_{ef}\, \delta_k^2 \,\big|\, \mathcal{F}_{k-1}\big) + \mathbb{P}\big(|\xi_k| \ge \tfrac{1}{3} \kappa_{ef}\, \delta_k^2 \,\big|\, \mathcal{F}_{k-1}\big) \\
&\quad + \mathbb{P}\big(L_1 \|\lambda^*(x) - (\hat B^{k,1})^\top x - (\hat B^{k,0})^\top\| \ge \tfrac{1}{3} \kappa_{ef}\, \delta_k^2 \text{ for some } x \,\big|\, \mathcal{F}_{k-1}\big) \\
&\le \frac{\alpha}{6} + \frac{\alpha}{6} + \frac{\alpha}{6} = \frac{\alpha}{2}.
\end{aligned}
\]

For part (b), we first note that under assumption (A3), $h(x,\omega) = \max_{j \in J} h_j(x,\omega)$. Writing its dual form, we have

\[
h(x,\omega) = \max_{\mu \,\ge\, 0,\; \sum_{j \in J} \mu_j = 1} \; \sum_{j \in J} \mu_j\, h_j(x,\omega),
\]

with the corresponding optimal dual solution set denoted $\Lambda(x,\omega)$. By the directional derivative of the linear program in [9], given $x \in X$ and $\omega \in \Omega$, for any $d_1 \in \mathbb{R}^p$ and $d_2 \in \mathbb{R}^m$,

\[
h'\big((x,\omega); (d_1, d_2)\big) = \max_{\mu \in \Lambda(x,\omega)} \sum_{j \in J} \mu_j \big[\nabla_1 h_j(x,\omega)^\top d_1 + \nabla_2 h_j(x,\omega)^\top d_2\big], \tag{4.14}
\]

where $\nabla_1 h_j$ denotes the partial gradient of $h_j$ with respect to the first argument in the $\mathbb{R}^p$ space, and similarly for $\nabla_2 h_j$. Let $\hat\varepsilon_k$ denote the random variable with the empirical probability distribution of $\{e^{k,i}\}$.
By (4.5), given $x \in X$, for any $d \in T(x; X)$,

\[
\begin{aligned}
f'(x; d) &= \nabla c(x)^\top d + \mathbb{E}_{\tilde\varepsilon}\big[h'((x, \lambda^*(x)+\tilde\varepsilon); (d, \nabla\lambda^*(x)^\top d))\big] + \mathbb{E}_{\tilde\varepsilon}\big[\nabla g(x, \lambda^*(x)+\tilde\varepsilon)^\top (d, \nabla\lambda^*(x)^\top d)\big], \\
f_k'(x; d) &= \nabla c(x)^\top d + \mathbb{E}_{\hat\varepsilon_k}\big[h'((x, m_k(x, \hat\varepsilon_k)); (d, (\hat B^{k,1})^\top d))\big] + \mathbb{E}_{\hat\varepsilon_k}\big[\nabla g(x, m_k(x, \hat\varepsilon_k))^\top (d, (\hat B^{k,1})^\top d)\big].
\end{aligned}
\]

Let $\bar d \in \operatorname{argmin}_{x+d \in X,\, \|d\| \le 1} f'(x; d)$. Then

\[
\psi(x) - \psi_k(x) = -\min_{x+d \in X,\, \|d\| \le 1} f'(x; d) + \min_{x+d \in X,\, \|d\| \le 1} f_k'(x; d) \le -f'(x; \bar d) + f_k'(x; \bar d).
\]

Following this, we have $\psi(x) - \psi_k(x) \le (\eta_{k1} + \eta_{k2}) + (\zeta_{k1} + \zeta_{k2})$, where

\[
\begin{aligned}
\eta_{k1} &= -\mathbb{E}_{\tilde\varepsilon}\big[h'((x, \lambda^*(x)+\tilde\varepsilon); (\bar d, \nabla\lambda^*(x)^\top \bar d))\big] + \mathbb{E}_{\hat\varepsilon_k}\big[h'((x, \lambda^*(x)+\hat\varepsilon_k); (\bar d, \nabla\lambda^*(x)^\top \bar d))\big], \\
\eta_{k2} &= -\mathbb{E}_{\hat\varepsilon_k}\big[h'((x, \lambda^*(x)+\hat\varepsilon_k); (\bar d, \nabla\lambda^*(x)^\top \bar d))\big] + \mathbb{E}_{\hat\varepsilon_k}\big[h'((x, m_k(x, \hat\varepsilon_k)); (\bar d, (\hat B^{k,1})^\top \bar d))\big], \\
\zeta_{k1} &= -\mathbb{E}_{\tilde\varepsilon}\big[\nabla g(x, \lambda^*(x)+\tilde\varepsilon)^\top (\bar d, \nabla\lambda^*(x)^\top \bar d)\big] + \mathbb{E}_{\hat\varepsilon_k}\big[\nabla g(x, \lambda^*(x)+\hat\varepsilon_k)^\top (\bar d, \nabla\lambda^*(x)^\top \bar d)\big], \\
\zeta_{k2} &= -\mathbb{E}_{\hat\varepsilon_k}\big[\nabla g(x, \lambda^*(x)+\hat\varepsilon_k)^\top (\bar d, \nabla\lambda^*(x)^\top \bar d)\big] + \mathbb{E}_{\hat\varepsilon_k}\big[\nabla g(x, m_k(x, \hat\varepsilon_k))^\top (\bar d, (\hat B^{k,1})^\top \bar d)\big].
\end{aligned}
\]

By the Berry-Esseen Theorem, $\sqrt{N_k}\, \eta_{k1}$ and $\sqrt{N_k}\, \zeta_{k1}$ both converge in distribution to normal distributions at rate $O(1/\sqrt{N_k})$, uniformly for all $x \in X$. We then bound $\eta_{k2}$ by $\|\lambda^*(x) - (\hat B^{k,1})^\top x - (\hat B^{k,0})^\top\| + \|\nabla\lambda^*(x) - \hat B^{k,1}\|$ up to constant factors. Specifically, given $(x, \varepsilon)$, let $\mu^k(x,\varepsilon) \in \Lambda(x, m_k(x,\varepsilon))$ be the multipliers achieving the maximum in the formula for $h'((x, m_k(x,\varepsilon)); (\bar d, (\hat B^{k,1})^\top \bar d))$. By Corollary 3.2.5 in [30] and the fact that $\Lambda(x,\omega)$ is essentially bounded, there exist constants $C_0$, $C_1$ and multipliers $\mu^*(x,\varepsilon) \in \Lambda(x, \lambda^*(x)+\varepsilon)$ satisfying

\[
\|\mu^k(x,\varepsilon) - \mu^*(x,\varepsilon)\| \le C_0 \max_{j \in J} |h_j(x, \lambda^*(x)+\varepsilon) - h_j(x, m_k(x,\varepsilon))| \le C_0 C_1\, \|\lambda^*(x) - (\hat B^{k,1})^\top x - (\hat B^{k,0})^\top\|.
\]

By decomposition, we have

\[
\begin{aligned}
\eta_{k2} \le\; & -\mathbb{E}_{\hat\varepsilon_k}\Big[\sum_{j \in J} \mu_j^*(x, \hat\varepsilon_k)\big(\nabla_1 h_j(x, \lambda^*(x)+\hat\varepsilon_k)^\top \bar d + \nabla_2 h_j(x, \lambda^*(x)+\hat\varepsilon_k)^\top (\nabla\lambda^*(x)^\top \bar d)\big)\Big] \\
& + \mathbb{E}_{\hat\varepsilon_k}\Big[\sum_{j \in J} \mu_j^k(x, \hat\varepsilon_k)\big(\nabla_1 h_j(x, m_k(x, \hat\varepsilon_k))^\top \bar d + \nabla_2 h_j(x, m_k(x, \hat\varepsilon_k))^\top (\hat B^{k,1})^\top \bar d\big)\Big].
\end{aligned}
\]

By (4.14), there exist constants $C_2, C_3, C_4, C_5$ and a Lipschitz gradient constant $L$ of the function $g$ such that

\[
\begin{aligned}
\eta_{k2} &\le C_2\, \mathbb{E}_{\hat\varepsilon_k}\big[\|\mu^*(x, \hat\varepsilon_k) - \mu^k(x, \hat\varepsilon_k)\|\big] + \max_{j \in J} \mathbb{E}_{\hat\varepsilon_k}\big[\|\nabla_1 h_j(x, \lambda^*(x)+\hat\varepsilon_k) - \nabla_1 h_j(x, m_k(x, \hat\varepsilon_k))\|\big] \\
&\quad + \max_{j \in J} \mathbb{E}_{\hat\varepsilon_k}\big[\|\nabla\lambda^*(x)\, \nabla_2 h_j(x, \lambda^*(x)+\hat\varepsilon_k) - \hat B^{k,1}\, \nabla_2 h_j(x, m_k(x, \hat\varepsilon_k))\|\big] \\
&\le C_3\, \|\lambda^*(x) - (\hat B^{k,1})^\top x - (\hat B^{k,0})^\top\| + C_4\, \|\nabla\lambda^*(x) - \hat B^{k,1}\|,
\end{aligned}
\]

and

\[
\begin{aligned}
\zeta_{k2} &= -\mathbb{E}_{\hat\varepsilon_k}\big[\nabla g(x, \lambda^*(x)+\hat\varepsilon_k)^\top (\bar d, \nabla\lambda^*(x)^\top \bar d)\big] + \mathbb{E}_{\hat\varepsilon_k}\big[\nabla g(x, \lambda^*(x)+\hat\varepsilon_k)^\top (\bar d, (\hat B^{k,1})^\top \bar d)\big] \\
&\quad - \mathbb{E}_{\hat\varepsilon_k}\big[\nabla g(x, \lambda^*(x)+\hat\varepsilon_k)^\top (\bar d, (\hat B^{k,1})^\top \bar d)\big] + \mathbb{E}_{\hat\varepsilon_k}\big[\nabla g(x, m_k(x, \hat\varepsilon_k))^\top (\bar d, (\hat B^{k,1})^\top \bar d)\big] \\
&\le C_5\, \|\nabla\lambda^*(x) - \hat B^{k,1}\| + L\, \|\lambda^*(x) - (\hat B^{k,1})^\top x - (\hat B^{k,0})^\top\|.
\end{aligned}
\]

Consider the stochastic process $F_k$ given $\mathcal{F}_{k-1}$. By Proposition 4.1, there exists a constant $\kappa_{eg} > 0$ such that, for any $N_k \ge \max\{O(\delta_k^{-4} \kappa_{eg}^{-2}),\, O(\alpha^{-2})\}$,

\[
\begin{aligned}
&\mathbb{P}\big(|\psi(x) - \psi_k(x)| \ge \kappa_{eg}\, \delta_k \text{ for some } x \in B(\hat x^k, \delta_k) \cap X \,\big|\, \mathcal{F}_{k-1}\big) \\
&\le \mathbb{P}\big(\eta_{k1} + \eta_{k2} + \zeta_{k1} + \zeta_{k2} \ge \kappa_{eg}\, \delta_k \text{ for some } x \,\big|\, \mathcal{F}_{k-1}\big) \\
&\le \mathbb{P}\big(\eta_{k1} \ge \tfrac{1}{4}\kappa_{eg}\delta_k \,\big|\, \mathcal{F}_{k-1}\big) + \mathbb{P}\big(\eta_{k2} \ge \tfrac{1}{4}\kappa_{eg}\delta_k \,\big|\, \mathcal{F}_{k-1}\big) + \mathbb{P}\big(\zeta_{k1} \ge \tfrac{1}{4}\kappa_{eg}\delta_k \,\big|\, \mathcal{F}_{k-1}\big) + \mathbb{P}\big(\zeta_{k2} \ge \tfrac{1}{4}\kappa_{eg}\delta_k \,\big|\, \mathcal{F}_{k-1}\big) \\
&\le 4 \cdot \frac{\alpha}{8} = \frac{\alpha}{2},
\end{aligned}
\]

which establishes part (b). Combining the results in (a) and (b), for any $0 < \alpha < 1$ there exists a constant vector $\kappa = (\kappa_{ef}, \kappa_{eg})$ such that for any $N_k \ge \max\{O(\delta_k^{-4} \kappa_{eg}^{-2}),\, O(\delta_k^{-4} \kappa_{ef}^{-2}),\, O(\alpha^{-2})\}$,

\[
\mathbb{P}\big(|f(x) - F_k(x)| \le \kappa_{ef}\, \delta_k^2, \; |\psi(x) - \psi_k(x)| \le \kappa_{eg}\, \delta_k, \;\forall x \in B(\hat x^k, \delta_k) \cap X \,\big|\, \mathcal{F}_{k-1}\big) \ge 1 - \alpha,
\]

which establishes the statement.
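The $O(1/\sqrt{N_k})$ behavior of the sample-average errors $\tau_k$, $\xi_k$, $\eta_{k1}$ and $\zeta_{k1}$ invoked above via the Berry-Esseen Theorem is easy to observe empirically. The toy simulation below, with standard normal summands chosen purely for illustration, shows that the mean absolute error of a sample average, rescaled by $\sqrt{N}$, stays roughly constant.

```python
import numpy as np

rng = np.random.default_rng(2)
for N in (10, 100, 1000, 10000):
    reps = 500
    errs = np.abs(rng.standard_normal((reps, N)).mean(axis=1))
    # errs.mean() shrinks like 1/sqrt(N); the rescaled value is ~constant
    print(N, errs.mean(), errs.mean() * np.sqrt(N))
```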
We can derive the probabilistic accuracy of the random estimates $(V_k, V_{k+1/2})$ by almost the same analysis as above; we therefore omit the proof and only state the result.

Proposition 4.3. For the composite SP problem (4.2), under assumptions (A1)-(A5), at the $k$th iteration of the CLEO algorithm, for any $0 < \beta < 1$, there exist a positive constant $\epsilon_F$ and a threshold $M_k(\beta)$ of order $O(\delta_k^{-4})$ such that if $|S_k|, |S_{k+1/2}| \ge M_k(\beta)$, then the random estimates $(V_k, V_{k+1/2})$ are $\beta$-probabilistically $\epsilon_F$-accurate with respect to $B(\hat X^k, \delta_k)$.

4.3.2 Convergence analysis to a d-stationary point

We are now ready to prove the asymptotic convergence of the CLEO algorithm. The convergence analysis extends the trust region method with random models for smooth unconstrained optimization to a nonsmooth, composite, constrained stochastic programming problem. It is based on the probabilistic accuracy of trust region models established in Propositions 4.2 and 4.3, which hold under the sample size requirement in the trust region, with constants depending on the smooth functions $\{h_j(\cdot)\}_{j \in J}$ and $g(\cdot)$. With a fair estimate of these constants, we assume that during the CLEO algorithm the TR data sets $\{T_k\}$ satisfy the sample size requirement. While such an assumption may be demanding, the actual sample sizes required for convergence can be much smaller, as seen in the experiments of section 4.4. We propose the following sample size assumption for the sake of the theoretical convergence analysis.

(B) Suppose that, for some chosen probability parameters $\alpha$ and $\beta$, for any given iteration index $k$, feasible point $\hat x^k$ and trust region radius $\delta_k$, we are able to generate samples of size of order $O(\delta_k^{-4})$ in the trust region $B(\hat x^k, \delta_k) \cap X$, with which the TR random model $F_k$ is $\alpha$-probabilistically $\kappa$-fully linear and the random value estimates $(V_k, V_{k+1/2})$ are $\beta$-probabilistically $\epsilon_F$-accurate.

The following lemmas hold under assumptions (A1)-(A5); for the sake of brevity, we do not restate these assumptions in the lemmas. We first establish a guaranteed decrease of the objective value of the composite SP problem (4.2).

Lemma 4.1. Suppose that the TR model $f_k$ is $\kappa = (\kappa_{ef}, \kappa_{eg})$-fully linear on $B(\hat x^k, \delta_k)$. If $\delta_k \le \delta_0$ and

\[
\delta_k \le \min\Big\{\frac{1}{3\kappa_{eg}},\; \frac{\kappa_{dcp}}{4\kappa_{ef}}\Big\}\, \psi_k(\hat x^k), \tag{4.15}
\]

then a suitable step $s^k$ under (4.12) leads to an improvement in $f(\hat x^k + s^k)$ such that

\[
f(\hat x^k + s^k) - f(\hat x^k) \le -\frac{\kappa_{dcp}}{4}\, \psi(\hat x^k)\, \delta_k.
\]

Proof. From Definition 4.2, since $f_k$ is a $\kappa$-fully linear model on $B(\hat x^k, \delta_k)$, we have

\[
\psi_k(\hat x^k) \ge \psi(\hat x^k) - \kappa_{eg}\, \delta_k. \tag{4.16}
\]

With a suitable step $s^k$ satisfying (4.12) and the fact that $\delta_k \le \delta_0$, we have

\[
f_k(\hat x^k) - f_k(\hat x^k + s^k) \ge \kappa_{dcp}\, \psi_k(\hat x^k)\, \delta_k.
\]

From the definition of a $\kappa$-fully linear model and the fact that $\delta_k \le \frac{\kappa_{dcp}}{4\kappa_{ef}}\, \psi_k(\hat x^k)$, we have
\[
\begin{aligned}
f(\hat x^k + s^k) - f(\hat x^k) &= f(\hat x^k + s^k) - f_k(\hat x^k + s^k) + f_k(\hat x^k + s^k) - f_k(\hat x^k) + f_k(\hat x^k) - f(\hat x^k) \\
&\le 2\kappa_{ef}\, \delta_k^2 - \kappa_{dcp}\, \psi_k(\hat x^k)\, \delta_k \;\le\; -\frac{\kappa_{dcp}}{2}\, \psi_k(\hat x^k)\, \delta_k. \tag{4.17}
\end{aligned}
\]

On the other hand, since $\psi_k(\hat x^k) \le \psi(\hat x^k) + \kappa_{eg}\, \delta_k$ and $\delta_k \le \frac{\psi_k(\hat x^k)}{3\kappa_{eg}}$, we derive that $\delta_k \le \frac{\psi(\hat x^k)}{2\kappa_{eg}}$. Then, combining the inequalities (4.16) and (4.17), we have

\[
f(\hat x^k + s^k) - f(\hat x^k) \le -\frac{\kappa_{dcp}}{2}\big[\psi(\hat x^k) - \kappa_{eg}\, \delta_k\big]\, \delta_k \le -\frac{\kappa_{dcp}}{4}\, \psi(\hat x^k)\, \delta_k.
\]

This proves the statement for the suitable step $s^k$.

Lemma 4.2. Suppose that the TR model $f_k$ is $\kappa$-fully linear on $B(\hat x^k, \delta_k)$ and the estimates $(v_k, v_{k+1/2})$ are $\epsilon_F$-accurate, with $\kappa = (\kappa_{ef}, \kappa_{eg})$ and $\epsilon_F \le \kappa_{ef}$. If $\delta_k \le \delta_0$ and

\[
\delta_k \le \min\Big\{\frac{1}{\eta_2},\; \frac{\kappa_{dcp}(1 - \eta_1)}{4\kappa_{ef}}\Big\}\, \psi_k(\hat x^k),
\]

where the constant $\eta_2$ is the step acceptance parameter in line 10 of the CLEO algorithm, then the $k$th iteration is successful, i.e., the sufficient decrease is achieved.

Proof. The trust region random function $f_k$ being $(\kappa_{ef}, \kappa_{eg})$-fully linear implies that

\[
|f(\hat x^k) - f_k(\hat x^k)| \le \kappa_{ef}\, \delta_k^2, \qquad |f(\hat x^k + s^k) - f_k(\hat x^k + s^k)| \le \kappa_{ef}\, \delta_k^2. \tag{4.18}
\]

Since the value estimates $(v_k, v_{k+1/2})$ are $\epsilon_F$-accurate with $\epsilon_F \le \kappa_{ef}$, we have

\[
|v_k - f(\hat x^k)| \le \kappa_{ef}\, \delta_k^2, \qquad |v_{k+1/2} - f(\hat x^k + s^k)| \le \kappa_{ef}\, \delta_k^2. \tag{4.19}
\]

Consider the parameter

\[
\begin{aligned}
\rho_k = \frac{v_k - v_{k+1/2}}{f_k(\hat x^k) - f_k(\hat x^k + s^k)}
&= \frac{v_k - f(\hat x^k)}{f_k(\hat x^k) - f_k(\hat x^k + s^k)} + \frac{f(\hat x^k) - f_k(\hat x^k)}{f_k(\hat x^k) - f_k(\hat x^k + s^k)} + \frac{f_k(\hat x^k) - f_k(\hat x^k + s^k)}{f_k(\hat x^k) - f_k(\hat x^k + s^k)} \\
&\quad + \frac{f_k(\hat x^k + s^k) - f(\hat x^k + s^k)}{f_k(\hat x^k) - f_k(\hat x^k + s^k)} + \frac{f(\hat x^k + s^k) - v_{k+1/2}}{f_k(\hat x^k) - f_k(\hat x^k + s^k)}.
\end{aligned}
\]

Then

\[
|\rho_k - 1| \le \frac{|v_k - f(\hat x^k)| + |f(\hat x^k) - f_k(\hat x^k)| + |f_k(\hat x^k + s^k) - f(\hat x^k + s^k)| + |f(\hat x^k + s^k) - v_{k+1/2}|}{f_k(\hat x^k) - f_k(\hat x^k + s^k)}.
\]

By the condition (4.12) on a suitable step and the inequalities (4.18) and (4.19), we have

\[
|\rho_k - 1| \le \frac{4\kappa_{ef}\, \delta_k}{\kappa_{dcp}\, \psi_k(\hat x^k)} \le 1 - \eta_1,
\]

where the last inequality comes from the fact that $\delta_k \le \frac{\kappa_{dcp}(1-\eta_1)}{4\kappa_{ef}}\, \psi_k(\hat x^k)$. Hence $\rho_k \ge \eta_1$; moreover, $\psi_k(\hat x^k) \ge \eta_2\, \delta_k$, so the $k$th iteration is successful.

Lemma 4.3. Suppose the function value estimates $\{(v_k, v_{k+1/2})\}$ are $\epsilon_F$-accurate, with

\[
\epsilon_F < \frac{1}{2}\, \eta_1 \eta_2\, \kappa_{dcp}\, \min\{1, \delta_0/\delta_{\max}\}.
\]

If the $k$th iteration is successful, i.e., the trial step $s^k$ is accepted, then the improvement in $f$ is bounded such that $f(\hat x^{k+1}) \le f(\hat x^k) - \tau\, \delta_k^2$, where $\tau = \eta_1 \eta_2\, \kappa_{dcp}\, \min\{1, \delta_0/\delta_{\max}\} - 2\epsilon_F$.

Proof. When the $k$th iteration is successful, $\rho_k \ge \eta_1$ and $\psi_k(\hat x^k) \ge \eta_2\, \delta_k$; then

\[
v_k - v_{k+1/2} \ge \eta_1\big(f_k(\hat x^k) - f_k(\hat x^k + s^k)\big) \ge \eta_1 \kappa_{dcp}\, \psi_k(\hat x^k)\, \min\{\delta_k, \delta_0\} \ge \eta_1 \eta_2\, \kappa_{dcp}\, \min\{1, \delta_0/\delta_{\max}\}\, \delta_k^2.
\]

Since the function value estimates are $\epsilon_F$-accurate, the improvement in $f$ can be bounded such that

\[
f(\hat x^k) - f(\hat x^k + s^k) = \big(f(\hat x^k) - v_k\big) + \big(v_k - v_{k+1/2}\big) + \big(v_{k+1/2} - f(\hat x^k + s^k)\big) \ge \big(\eta_1 \eta_2\, \kappa_{dcp}\, \min\{1, \delta_0/\delta_{\max}\} - 2\epsilon_F\big)\, \delta_k^2.
\]

With Lemmas 4.1, 4.2 and 4.3, we can choose the probability parameters $\alpha$ and $\beta$ to be fixed constants $\alpha^*$ and $\beta^*$ close to 1, based on the conditions in Corollary 4.12 in [13]. With this choice of parameters, the sum of squared trust region radii in the CLEO algorithm is finite almost surely by Theorem 4.11 in [13]. We state the result as Proposition 4.4.

Proposition 4.4. For the composite SP (4.2), under assumptions (A1)-(A5) and (B), if the step acceptance parameter $\eta_2$ and the accuracy parameter $\epsilon_F$ are chosen so that

\[
\eta_2 \ge \frac{4\kappa_{ef}}{\kappa_{dcp}(1 - \eta_1)}, \qquad \epsilon_F \le \min\Big\{\kappa_{ef},\; \frac{1}{4}\eta_1 \eta_2\, \kappa_{dcp}\Big\},
\]

then the probability parameters $(\alpha^*, \beta^*)$ can be chosen such that the sequence of trust region radii $\{\delta_k\}$ generated by the CLEO algorithm satisfies $\sum_{k=0}^{\infty} \delta_k^2 < \infty$ almost surely.

Theorem 4.1. For the composite SP (4.2), under assumptions (A1)-(A5) and (B), with probability parameters $(\alpha^*, \beta^*)$, the step acceptance parameter $\eta_2$ and the accuracy parameter $\epsilon_F$ can be chosen such that the sequence of random iterates $\{X^k\}$ generated by the CLEO algorithm converges to a directional stationary point almost surely.

Proof. With Proposition 4.4 and Theorem 4.18 in [13], we have $\lim_{k \to \infty} \psi(\hat x^k) = 0$ almost surely.
By Lemma 11.1.2 in [18], this implies that the sequence produced by the CLEO algorithm converges to a directional stationary point almost surely.

4.4 Numerical experiments and results

In this section, we discuss the performance of the CLEO algorithm compared with the PO scheme. In the PO scheme, we first estimate the dependence of the uncertainty on the decision variables using prediction models, and then solve the SP problem composed with the estimated dependence. To make a fair comparison of the PO scheme and the CLEO scheme, we use the L-BFGS solver to solve both the approximate optimization problems in the PO scheme and the trust region subproblems in the CLEO scheme. The comparisons are conducted on three decision-making problems: a nonconvex problem on synthetic data, a joint production and pricing problem on synthetic data, and a price-sensitive problem of hotel rooms on real-world data.

4.4.1 A synthetic nonconvex problem

We consider the following SP problem:

\[
\underset{x}{\text{minimize}} \;\; -4x_1 - \frac{4}{5}\, x_2^3 - \frac{1}{5}\, \|x\|^2 + \mathbb{E}_{\tilde\omega|x}\big[\|\tilde\omega\|^2\big]. \tag{4.20}
\]

The random variable $\tilde\omega$ follows a regression model in the decision variable $x$, i.e., $\tilde\omega = \sin(x_1) + \sin(x_2) + \tilde\varepsilon$, with $\tilde\varepsilon \sim \mathcal{N}(0, I_{2\times 2})$. The SP problem (4.20) is thus a nonconvex and smooth problem. Without knowledge of the actual dependence, we use randomly generated data to seek a near-optimal decision of problem (4.20) with the CLEO algorithm as well as the PO scheme with Ridge Regression (RR), Support Vector Regression (SVR), and Gaussian Process Regression (GPR). When implementing the CLEO algorithm, the data in each neighborhood are actively generated and the data size is set to a constant (e.g., 10). The performance of the four algorithms is evaluated with respect to two criteria: 1) a snapshot of the convergence process in Figure 4.1(a), where the red, green, blue and yellow surfaces respectively show the ground-truth objective function and the objective values with the uncertainty fitted by SVR, GPR and RR; 2) the objective value versus the number of samples in Figure 4.1(b), under the same stopping criterion. These plots are produced after fair and careful parameter tuning, averaged over 50 random replications. As expected, CLEO converges to a stationary point of the true objective surface with the smallest number of iterations. As indicated by Figure 4.1(a), the yellow surface constructed by the RR model does not fit the true one well; therefore, the PO scheme with the RR model does not converge to any stationary point. When the sample size is large, the surface constructed by GPR is close to the true one. As indicated by Figure 4.1(b), the solution produced by the PO scheme with GPR has an objective value slightly higher than the one produced by CLEO. However, the objective values obtained over replications in the PO scheme with GPR are more volatile than those of the CLEO scheme. Moreover, the time for training a GPR model can be very consuming when the sample size grows large.

Figure 4.1: Comparisons of the CLEO algorithm and the PO schemes for a synthetic nonconvex problem. (a) Process snapshot; (b) Objective value.
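To reproduce this kind of experiment, one only needs a sampling oracle for $\omega \mid x$ and a Monte Carlo estimate of the objective. The sketch below implements both for problem (4.20); the reading of $\tilde\omega$ as a two-dimensional response whose components share the mean $\sin(x_1) + \sin(x_2)$ is our interpretation of the model statement.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_response(x, n):
    """Draws of omega | x for the synthetic model of section 4.4.1."""
    base = np.sin(x[0]) + np.sin(x[1])
    return base + rng.standard_normal((n, 2))     # eps ~ N(0, I_2)

def objective_estimate(x, n=20000):
    """Monte Carlo estimate of the objective in (4.20)."""
    w = sample_response(x, n)
    return (-4.0 * x[0] - 0.8 * x[1] ** 3 - 0.2 * np.dot(x, x)
            + np.mean(np.sum(w ** 2, axis=1)))

print(objective_estimate(np.array([0.5, 1.0])))
```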
4.4.2 Joint production and pricing problem

We follow the setting in Example 1 with $K = 1$, the demand curve

\[
\lambda(p) = a\, \frac{e^{\,b - cp}}{1 + e^{\,b - cp}} + d,
\]

and the noise $\tilde\varepsilon \sim \mathcal{U}(-\bar\varepsilon, \bar\varepsilon)$. In the experiment, the parameters are chosen as $a = 470$, $b = 6$, $c = 0.3$, $d = 0.1$, $\bar\varepsilon = 0.01$. We evaluate the performance of the CLEO algorithm for different neighborhood sample sizes in Figure 4.2. As we can see, except for the sample size of 2, all choices of sample size yield similar performance in terms of the convergence of the objective value. This indicates that the sample size requirement in assumption (B) is needed only for the theoretical analysis, whereas in practice the sample size in the neighborhoods can be set to a constant in the CLEO algorithm.

Figure 4.2: Convergence curve of CLEO for the sample size chosen as 2, 3, 5, 10, 20.

We further compare the performance of the CLEO algorithm with two PO schemes, one with a GPR model and one with a Quadratic Regression (QR) model. The blue line at the bottom of Figure 4.3 represents the globally minimal objective value of the true composite SP problem (4.3); it is taken as the ground truth. The red curve represents the objective function value at the solution derived by the CLEO algorithm as a function of the number of samples; it indicates that the CLEO algorithm converges to a value slightly above the ground truth as the sample size increases.

The performance of the PO schemes relies on the accuracy of the prediction models and on the optimality of the solution derived by the solver. To illustrate how these two parts contribute to the overall performance of the PO schemes, we compute two types of convergence curves for each PO scheme. For the approximate optimization problem composed with the GPR model, the green curve represents the globally optimal objective value and the orange curve represents the objective value at the solution derived by the L-BFGS solver. The green curve indicates that its global minimum converges to the ground truth as the sample size increases, whereas the solution derived by the L-BFGS solver for the PO scheme is suboptimal, with the large variance indicated by the orange curve. Since this instance has only two decision variables, the global minimum of the highly nonconvex composite problem can be found graphically. When the dimension is higher than 2, the gap between the orange curve and the ground truth (the suboptimality of the PO scheme) results from the highly nonconvex structure of the approximate optimization problem composed with the GPR model. For the approximate optimization problem composed with the QR model, the gap between the light and dark purple curves is small and relatively stable, whereas the gap between the purple curves and the ground truth is large. This indicates that the suboptimality of the PO scheme with the QR model results from the poor estimation of the latent dependence by QR models. Taken together, the two PO schemes illustrate the trade-off between the accuracy of the prediction model and the complexity of the composite optimization problem.

Figure 4.3: Convergence curves of the CLEO algorithm and the PO schemes for the joint production and pricing problem.

4.4.3 A pricing problem for hotel rooms

We consider a pricing problem for hotel rooms. The hotel booking dataset contains 17,838 booking records with check-in dates from March 12, 2007 to April 15, 2007, at one of five continental U.S. hotels. The decision variable $x \in \mathbb{R}^3_+$ is the price vector corresponding to three room types (King, Queen, Standard), and the demand vector $\tilde\omega$ is the number of booking records for each room type. The decision-making problem is formulated as

\[
\underset{x \in \mathbb{R}^3_+}{\text{maximize}} \;\; \mathbb{E}_{\tilde\omega|x}\big[x^\top \tilde\omega\big] - \gamma\, \|x\|^2, \tag{4.21}
\]

where $\gamma$ is a parameter for the regularization of the price.
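Because the true demand response is unknown, the objective value of a candidate price vector can only be estimated from booking records observed at nearby prices, as described in the following paragraph. The sketch shows this neighborhood sample average; the data layout (arrays of posted prices and booking counts per room type) is an assumption about the dataset, not its actual schema.

```python
import numpy as np

def local_objective_estimate(x, prices, bookings, radius, gamma):
    """Neighborhood sample-average estimate of the objective in (4.21).

    prices: (N, 3) posted price vectors; bookings: (N, 3) booking counts
    per room type (assumed column order: King, Queen, Standard)."""
    near = np.linalg.norm(prices - x, axis=1) <= radius
    if not near.any():
        return None                     # no observations near this price
    expected_revenue = (bookings[near] @ x).mean()
    return expected_revenue - gamma * np.dot(x, x)
```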
Regardless of other market factors, the demand $\tilde\omega$ is implicitly affected by the price decision $x$. We run the CLEO algorithm and three PO schemes with SVR, GPR and RR to compute the price decision. Since the true optimization problem is unknown, the objective value at a solution is estimated by the sample average over samples within a neighborhood of the solution. In Figure 4.4, because of the randomness of the TR model constructed in CLEO, the performance curve fluctuates mildly, but overall it dominates the other three approaches in terms of the rate of progress and the objective value on this data set.

Figure 4.4: Convergence curve for the price-sensitive problem of hotel rooms.

4.5 Summary

We propose the CLEO algorithm, which couples local regression models and trust region methods for solving stochastic programming problems with decision-dependent uncertainty. The CLEO algorithm iteratively handles local behavior in both the estimation and the optimization problems so as to control the overall complexity, with asymptotic convergence to a stationary point with probability 1. The CLEO algorithm is further supported by computational results on a simulation experiment and a real-world pricing problem, which show that it outperforms the uncoupled "Predict then Optimize" approach in terms of the final objective value and the associated variance.

Chapter 5

Conclusion and ongoing work

This thesis provides algorithmic schemes for combining predictive and prescriptive analytics to facilitate better decision-making. In Chapter 2, we revisit the classical SD algorithm for convex two-stage SPs and develop a new convergence rate analysis for it, which provides not only theoretical support for the SD algorithm but also guidance for choosing parameter values in SD. In Chapter 3, motivated by the PO scheme, we address a new class of nonconvex two-stage SPs with linearly bi-parameterized recourse. We develop an algorithm that combines sampling, regularization, and linearization, and prove its subsequential convergence to a stationary point with probability 1. In Chapter 4, for SPs with endogenous uncertainty, we develop the CLEO algorithm, which couples prediction and optimization in order to automatically balance the trade-off between accuracy and complexity. The solution sequence produced by CLEO is shown to converge, with probability 1, to a stationary point of a potentially nonconvex and nonsmooth SP problem. Finally, we discuss several future and ongoing directions pertaining to these topics.

• Stochastic Decomposition: First, we have so far used only one sample at each iteration, so it remains an open question whether adaptive sampling can be used for faster convergence. Second, when the recourse function is bi-parameterized by the first stage, a possible approach with higher efficiency is to combine the DCA with the SD algorithm.

• Bi-parameterized recourse: First, it is worth conducting numerical experiments in various applications to compare the performance of our sample-based algorithm in Chapter 3 with several global optimization solvers. Second, motivated by application problems in electricity planning, it is critical to study convergent algorithms for two-stage SPs with piecewise linearly bi-parameterized recourse. Third, it remains an interesting question to develop block-coordinate type algorithms based on the implicit convex-concave structure of the linearly bi-parameterized recourse.

• Coupled scheme: First, in CLEO we assume that data pairs of the uncertainty and the decision can be generated as the iterations proceed.
Instead, when the sample set is fixed, we suspect that a combination of proximal mapping with weighted linear regression can provide convergence to stationarity as the sample size increases to infinity. Second, if we have prior knowledge of the dependence, such as parametric functions and the distribution of parameters, we may consider coupling Bayesian learning with optimization for better decision making.

Reference List

[1] Le Thi Hoai An and Pham Dinh Tao. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research, 133(1-4):23-46, 2005.
[2] Victor F Araman and René Caldentey. Dynamic pricing for nonperishable products with demand learning. Operations Research, 57(5):1169-1188, 2009.
[3] Afonso S Bandeira, Katya Scheinberg, and Luís N Vicente. Convergence of trust-region methods based on probabilistic models. SIAM Journal on Optimization, 24(3):1238-1264, 2014.
[4] Dirk Banholzer, Jörg Fliege, and Ralf Werner. On almost-sure rates of convergence for sample average approximations. SIAM Journal on Optimization, 2017.
[5] Heinz H Bauschke, Patrick L Combettes, et al. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2017.
[6] Dimitris Bertsimas and Nathan Kallus. From predictive to prescriptive analytics. arXiv preprint arXiv:1402.5481, 2014.
[7] Omar Besbes and Assaf Zeevi. Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations Research, 57(6):1407-1420, 2009.
[8] Jose Blanchet, Coralia Cartis, Matt Menickelly, and Katya Scheinberg. Convergence rate analysis of a stochastic trust region method for nonconvex optimization. arXiv preprint arXiv:1609.07428, 2016.
[9] J Frédéric Bonnans and Alexander Shapiro. Perturbation Analysis of Optimization Problems. Springer Science & Business Media, 2013.
[10] Rasmus Bro and Sijmen De Jong. A fast non-negativity-constrained least squares algorithm. Journal of Chemometrics, 11(5):393-401, 1997.
[11] Boxiao Chen, Xiuli Chao, and Hyun-Soo Ahn. Coordinating pricing and inventory replenishment with nonparametric demand learning. Ross School of Business Paper, (1294), 2015.
[12] Donghui Chen and Robert J Plemmons. Nonnegativity constraints in numerical analysis. The Birth of Numerical Analysis, 10:109-140, 2009.
[13] Ruobing Chen, Matt Menickelly, and Katya Scheinberg. Stochastic optimization using a trust-region method and random models. Mathematical Programming, 169(2):447-487, 2018.
[14] Xiaojun Chen, Liqun Qi, and Robert S Womersley. Newton's method for quadratic stochastic programs with recourse. Journal of Computational and Applied Mathematics, 60(1):29-46, 1995.
[15] Wang Chi Cheung, David Simchi-Levi, and He Wang. Dynamic pricing and demand learning with limited price experimentation. Operations Research, 65(6):1722-1731, 2017.
[16] Kai Lai Chung. On a stochastic approximation method. The Annals of Mathematical Statistics, pages 463-483, 1954.
[17] Frank H Clarke. Optimization and Nonsmooth Analysis. Classics in Applied Mathematics, SIAM, 5, 1990.
[18] Andrew R Conn, Nicholas I M Gould, and Ph L Toint. Trust Region Methods, volume 1. SIAM, 2000.
[19] William L Cooper, Tito Homem-de-Mello, and Anton J Kleywegt. Models of the spiral-down effect in revenue management. Operations Research, 54(5):968-987, 2006.
[20] William L Cooper, Tito Homem-de-Mello, and Anton J Kleywegt. Learning and pricing with models that do not explicitly incorporate competition. Operations Research, 63(1):86-103, 2015.
[21] Arnoud den Boer. Dynamic pricing and learning: historical origins, current research, and new directions. Surveys in Operations Research and Management Science, 20(1):1-18, 2015.
[22] Arnoud den Boer and N Bora Keskin. Dynamic pricing with demand learning and reference effects. Available at SSRN 3092745, 2017.
[23] Yunxiao Deng, Junyi Liu, and Suvrajeet Sen. Coalescing data and decision sciences for analytics. INFORMS Tutorials in Operations Research: Recent Advances in Optimization and Modeling of Contemporary Problems, pages 20-49, 2018.
[24] William S Dorn. Duality in quadratic programming. Quarterly of Applied Mathematics, 18(2):155-162, 1960.
[25] John C Duchi and Feng Ruan. Stochastic methods for composite and weakly convex optimization problems. SIAM Journal on Optimization, 28(4):3229-3259, 2018.
[26] Jitka Dupačová. Optimization under exogenous and endogenous uncertainty. University of West Bohemia in Pilsen, 2006.
[27] Jitka Dupačová and Roger J-B Wets. Asymptotic behavior of statistical estimators and of optimal solutions of stochastic optimization problems. The Annals of Statistics, 16(4):1517-1549, 1988.
[28] Rick Durrett. Probability: Theory and Examples. Cambridge University Press, 2010.
[29] Adam N Elmachtoub and Paul Grigas. Smart "predict, then optimize". arXiv preprint arXiv:1710.08005, 2017.
[30] Francisco Facchinei and Jong-Shi Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Science & Business Media, 2007.
[31] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.
[32] Vikas Goel and Ignacio E Grossmann. A class of stochastic programs with decision dependent uncertainty. Mathematical Programming, 108(2-3):355-394, 2006.
[33] Serge Gratton, Clément W Royer, Luís N Vicente, and Zaikun Zhang. Complexity and global rates of trust-region methods based on probabilistic models. IMA Journal of Numerical Analysis, 2017.
[34] Pavithra Harsha, Ramesh Natarajan, and Dharmashankar Subramanian. A data-driven, distribution-free, multivariate approach to the price-setting newsvendor problem. Optimization Online, 2015.
[35] Lars Hellemo, Paul I Barton, and Asgeir Tomasgard. Decision-dependent probabilities in stochastic programs with recourse. Computational Management Science, 15(3-4):369-395, 2018.
[36] Julia L Higle and Suvrajeet Sen. Stochastic decomposition: An algorithm for two-stage linear programs with recourse. Mathematics of Operations Research, 16(3):650-669, 1991.
[37] Julia L Higle and Suvrajeet Sen. Finite master programs in regularized stochastic decomposition. Mathematical Programming, 67(1-3):143-168, 1994.
[38] Julia L Higle and Suvrajeet Sen. Statistical approximations for stochastic linear programming problems. Annals of Operations Research, 85:173-193, 1999.
[39] David C Hoaglin and Roy E Welsch. The hat matrix in regression and ANOVA. The American Statistician, 32(1):17-22, 1978.
[40] William Hogan. Directional derivatives for extremal-value functions with applications to the completely convex case. Operations Research, 21(1):188-209, 1973.
[41] Tito Homem-de-Mello. Variable-sample methods for stochastic optimization. ACM Transactions on Modeling and Computer Simulation, 13(2):108-133, 2003.
[42] Tito Homem-de-Mello. On rates of convergence for stochastic optimization problems under non-independent and identically distributed sampling. SIAM Journal on Optimization, 19(2):524-551, 2008.
[43] Tito Homem-de-Mello and Güzin Bayraksan. Monte Carlo sampling-based methods for stochastic optimization. Surveys in Operations Research and Management Science, 19(1):56-85, 2014.
[44] Joel L Horowitz. The bootstrap. In Handbook of Econometrics, volume 5, pages 3159-3228. Elsevier, 2001.
[45] Robert Janin. Directional derivative of the marginal function in nonlinear programming. Mathematical Programming Study, 21:110-126, 1984.
[46] Alan King and R Tyrrell Rockafellar. Asymptotic theory for solutions in statistical estimation and stochastic programming. Mathematics of Operations Research, 18(1):148-162, 1993.
[47] Jeffrey Larson and Stephen C Billups. Stochastic derivative-free optimization using a trust region framework. Computational Optimization and Applications, 64(3):619-645, 2016.
[48] Charles L Lawson and Richard J Hanson. Solving Least Squares Problems, volume 15. SIAM, 1995.
[49] Gue Myung Lee, Nguyen Nan Tam, and Nguyen Dong Yen. Quadratic Programming and Affine Variational Inequalities: A Qualitative Study. Springer, New York, 2005.
[50] Soonhui Lee, Tito Homem-de-Mello, and Anton J Kleywegt. Newsvendor-type models with decision-dependent uncertainty. Mathematical Methods of Operations Research, 76(2):189-221, 2012.
[51] Jeff Linderoth, Alexander Shapiro, and Stephen Wright. The empirical behavior of sampling methods for stochastic programming. Annals of Operations Research, 142(1):215-241, 2006.
[52] Wai-Kei Mak, David P Morton, and R Kevin Wood. Monte Carlo bounding techniques for determining solution quality in stochastic programs. Operations Research Letters, 24(1):47-56, 1999.
[53] Enno Mammen. When Does Bootstrap Work?: Asymptotic Results and Simulations, volume 77. Springer Science & Business Media, 2012.
[54] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.
[55] Arkadii Nemirovskii, David Borisovich Yudin, and E R Dawson. Problem Complexity and Method Efficiency in Optimization. 1983.
[56] Maher Nouiehed, Jong-Shi Pang, and Meisam Razaviyayn. On the pervasiveness of difference-convexity in optimization and statistics. arXiv preprint arXiv:1704.03535, 2017.
[57] Welington Oliveira, Claudia Sagastizábal, and Susana Scheimberg. Inexact bundle methods for two-stage stochastic programming. SIAM Journal on Optimization, 21(2):517-544, 2011.
[58] Jong-Shi Pang, Meisam Razaviyayn, and Alberth Alvarado. Computing B-stationary points of nonsmooth DC programs. Mathematics of Operations Research, 42(1):95-118, 2016.
[59] Jong-Shi Pang, Suvrajeet Sen, and Uday V Shanbhag. Two-stage non-cooperative games with risk-averse players. Mathematical Programming, 165(1):235-290, 2017.
[60] Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127-239, 2014.
[61] Srinivas Peeta, F Sibel Salman, Dilek Gunnec, and Kannan Viswanath. Pre-disaster investment decisions for strengthening a highway network. Computers & Operations Research, 37(10):1708-1719, 2010.
[62] Daniel Ralph and Huifu Xu. Convergence of stationary points of sample average two-stage stochastic programs: A generalized equation approach. Mathematics of Operations Research, 36(3):568-592, 2011.
[63] Daniel Ralph and Huifu Xu. Convergence of stationary points of sample average two-stage stochastic programs: A generalized equation approach. Mathematics of Operations Research, 36(3):568-592, 2011.
Convergence of stationary points of sample average two-stage stochastic programs: A generalized equation approach. Mathematics of Operations Research, 36(3):568–592, 2011.
[64] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[65] Stephen M. Robinson. An implicit-function theorem for a class of nonsmooth functions. Mathematics of Operations Research, 16(2):292–309, 1991.
[66] Stephen M. Robinson. Analysis of sample-path optimization. Mathematics of Operations Research, 21(3):513–528, 1996.
[67] R. Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.
[68] R. Tyrrell Rockafellar. Lagrange multipliers and subderivatives of optimal value functions in nonlinear programming. In Nondifferential and Variational Techniques in Optimization, pages 28–66. Springer, 1982.
[69] Johannes O. Royset. Optimality functions in stochastic programming. Mathematical Programming, 135(1-2):293–321, 2012.
[70] Ernest K. Ryu and Stephen Boyd. Stochastic proximal iteration: A non-asymptotic improvement upon stochastic gradient descent. Author website, early draft, 2014.
[71] Suvrajeet Sen and Yunxiao Deng. Learning enabled optimization: Towards a fusion of statistical learning and stochastic optimization, 2017.
[72] Suvrajeet Sen and Yifan Liu. Mitigating uncertainty via compromise decisions in two-stage stochastic linear programming: Variance reduction. Operations Research, 64(6):1422–1437, 2016.
[73] Fei Sha, Lawrence K. Saul, and Daniel D. Lee. Multiplicative updates for nonnegative quadratic programming in support vector machines. In Advances in Neural Information Processing Systems, pages 1041–1048, 2002.
[74] A. Shapiro and Y. Yomdin. On functions, representable as a difference of two convex functions, and necessary conditions in a constrained optimization. Preprint, Beer-Sheva, 1981.
[75] Alexander Shapiro. Monte Carlo sampling methods. Handbooks in Operations Research and Management Science, 10:353–425, 2003.
[76] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2009.
[77] Alexander Shapiro and Tito Homem-de-Mello. On the rate of convergence of optimal solutions of Monte Carlo approximations of stochastic programs. SIAM Journal on Optimization, 11(1):70–86, 2000.
[78] Alexander Shapiro and Tito Homem-de-Mello. On the rate of convergence of optimal solutions of Monte Carlo approximations of stochastic programs. SIAM Journal on Optimization, 11(1):70–86, 2000.
[79] Alexander Shapiro and Arkadi Nemirovski. On complexity of stochastic programming problems. In Continuous Optimization, pages 111–146. Springer, 2005.
[80] Alexander Shapiro and Huifu Xu. Uniform laws of large numbers for set-valued mappings and subdifferentials of random functions. Journal of Mathematical Analysis and Applications, 325(2):1390–1399, 2007.
[81] Naum Zuselevich Shor. Minimization Methods for Non-Differentiable Functions, volume 3. Springer Science & Business Media, 2012.
[82] Kesar Singh. On the asymptotic accuracy of Efron's bootstrap. The Annals of Statistics, pages 1187–1195, 1981.
[83] Pham Dinh Tao and Le Thi Hoai An. Convex analysis approach to DC programming: Theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1):289–355, 1997.
[84] Roger J-B Wets. Chapter VIII: Stochastic programming. Handbooks in Operations Research and Management Science, 1:573–629, 1989.
[85] A.C. Williams.
Marginal values in linear programming. Journal of the Society for Industrial and Applied Mathematics, 11(1):82–94, 1963.
[86] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
[87] Huifu Xu. Uniform exponential convergence of sample average random functions under general sampling with applications in stochastic programming. Journal of Mathematical Analysis and Applications, 368(2), 2010.
[88] Huifu Xu and Dali Zhang. Smooth sample average approximation of stationary points in nonsmooth stochastic optimization and applications. Mathematical Programming, 119(2):371–401, 2009.

Appendix A
Supplement to Chapter 2

Proof of Proposition 2.1. For the trust region data set $\mathcal{T}_k$, let
\[
X^k \triangleq \begin{pmatrix} (x^1)^\top \\ \vdots \\ (x^{N_k})^\top \end{pmatrix}, \qquad
\widehat{X}^k \triangleq \begin{pmatrix} 1 & (x^1)^\top \\ \vdots & \vdots \\ 1 & (x^{N_k})^\top \end{pmatrix}, \qquad
W^k \triangleq \begin{pmatrix} (\omega^1)^\top \\ \vdots \\ (\omega^{N_k})^\top \end{pmatrix}, \qquad
\lambda^*(X^k) \triangleq \begin{pmatrix} \lambda^*(x^1)^\top \\ \vdots \\ \lambda^*(x^{N_k})^\top \end{pmatrix}.
\]
Then the matrix of errors is defined to be $\Xi^k \triangleq W^k - \lambda^*(X^k)$. Under the poisedness condition, the least squares estimators of the LLR model are unique and can be written in matrix form:
\[
\begin{pmatrix} \widehat{B}^{k,0} \\ \widehat{B}^{k,1} \end{pmatrix}
= \big( (\widehat{X}^k)^\top \widehat{X}^k \big)^{-1} (\widehat{X}^k)^\top W^k. \tag{A.1}
\]
Under assumptions (A4) and (A5), by the Taylor expansion of $\lambda^*$ at $z \in B(\widehat{x}^k, \Delta_k) \cap X$, we derive
\[
\lambda^*(x^i) = \lambda^*(z) + \nabla\lambda^*(z)^\top (x^i - z) + O(K_1 \Delta_k^2)\, 1_m, \qquad \forall\, i = 1, \dots, |\mathcal{T}_k|.
\]
We rewrite the Taylor expansion in matrix form:
\[
\lambda^*(X^k) = 1_{N_k}\, \lambda^*(z)^\top + (X^k - 1_{N_k} z^\top)\, \nabla\lambda^*(z) + O(K_1 \Delta_k^2)\, 1_{N_k \times m}
= \widehat{X}^k \begin{pmatrix} \lambda^*(z)^\top - z^\top \nabla\lambda^*(z) \\ \nabla\lambda^*(z) \end{pmatrix} + O(K_1 \Delta_k^2)\, 1_{N_k \times m}.
\]
Under the poisedness condition, we derive
\[
\begin{pmatrix} \lambda^*(z)^\top - z^\top \nabla\lambda^*(z) \\ \nabla\lambda^*(z) \end{pmatrix}
= \big( (\widehat{X}^k)^\top \widehat{X}^k \big)^{-1} (\widehat{X}^k)^\top \big( \lambda^*(X^k) + O(K_1 \Delta_k^2)\, 1_{N_k \times m} \big). \tag{A.2}
\]
By subtracting (A.2) from the LLR estimation (A.1), we derive
\[
\begin{pmatrix} \widehat{B}^{k,0} \\ \widehat{B}^{k,1} \end{pmatrix} - \begin{pmatrix} \lambda^*(z)^\top - z^\top \nabla\lambda^*(z) \\ \nabla\lambda^*(z) \end{pmatrix}
= \big( (\widehat{X}^k)^\top \widehat{X}^k \big)^{-1} (\widehat{X}^k)^\top \big( \Xi^k + O(K_1 \Delta_k^2)\, 1_{N_k \times m} \big). \tag{A.3}
\]
Now we consider the asymptotic convergence of the LLR estimation errors, namely $\nabla\lambda^*(z) - \widehat{B}^{k,1}$ and $\lambda^*(z) - (\widehat{B}^{k,0})^\top - (\widehat{B}^{k,1})^\top z$, on $B(\widehat{x}^k, \Delta_k)$.

(a) Let $\bar{x} = \sum_{i=1}^{N_k} x^i / N_k$ and
\[
\bar{X}^k \triangleq \Big( I_{N_k} - \frac{1}{N_k} 1_{N_k \times N_k} \Big) X^k = \begin{pmatrix} (x^1 - \bar{x})^\top \\ \vdots \\ (x^{N_k} - \bar{x})^\top \end{pmatrix}.
\]
From the inverse of the block matrix $(\widehat{X}^k)^\top \widehat{X}^k$, we derive that
\[
\big( (\widehat{X}^k)^\top \widehat{X}^k \big)^{-1} = \begin{pmatrix} * & * \\ C^k & D^k \end{pmatrix},
\qquad\text{where}\qquad
C^k \triangleq -\frac{1}{N_k}\, D^k (X^k)^\top 1_{N_k} = -\frac{1}{N_k}\, \big( (\bar{X}^k)^\top \bar{X}^k \big)^{-1} (X^k)^\top 1_{N_k},
\]
\[
D^k \triangleq \Big( (X^k)^\top X^k - \frac{1}{N_k} (X^k)^\top 1_{N_k} 1_{N_k}^\top X^k \Big)^{-1} = \big( (\bar{X}^k)^\top \bar{X}^k \big)^{-1}.
\]
Then
\[
(0_{p\times 1}, I_p)\, \big( (\widehat{X}^k)^\top \widehat{X}^k \big)^{-1} (\widehat{X}^k)^\top
= (0_{p\times 1}, I_p) \begin{pmatrix} * & * \\ C^k & D^k \end{pmatrix} \begin{pmatrix} 1_{N_k}^\top \\ (X^k)^\top \end{pmatrix}
= C^k 1_{N_k}^\top + D^k (X^k)^\top
= \big( (\bar{X}^k)^\top \bar{X}^k \big)^{-1} (\bar{X}^k)^\top.
\]
Let $\widehat{I}_p \triangleq (0_{p\times 1}, I_p)$. We derive
\begin{align*}
\nabla\lambda^*(z) - \widehat{B}^{k,1}
&= -\widehat{I}_p \big( (\widehat{X}^k)^\top \widehat{X}^k \big)^{-1} (\widehat{X}^k)^\top \Xi^k + O(K_1\Delta_k^2)\, \widehat{I}_p \big( (\widehat{X}^k)^\top \widehat{X}^k \big)^{-1} (\widehat{X}^k)^\top 1_{N_k\times m} \\
&= -\big( (\bar{X}^k)^\top \bar{X}^k \big)^{-1} (\bar{X}^k)^\top \Xi^k + O(K_1\Delta_k^2)\, \big( (\bar{X}^k)^\top \bar{X}^k \big)^{-1} (\bar{X}^k)^\top 1_{N_k\times m} \\
&= -\frac{1}{\Delta_k}\, \big( (\bar{U}^k)^\top \bar{U}^k \big)^{-1} (\bar{U}^k)^\top \Xi^k + O(K_1\Delta_k)\, \big( (\bar{U}^k)^\top \bar{U}^k \big)^{-1} (\bar{U}^k)^\top 1_{N_k\times m},
\end{align*}
where $\bar{U}^k \triangleq \bar{X}^k / \Delta_k$. According to the construction of $\{x^i\}$ at line 3 of the CLEO algorithm,
\[
\bar{U}^k = \begin{pmatrix} (u^1 - \bar{u})^\top \\ \vdots \\ (u^{N_k} - \bar{u})^\top \end{pmatrix},
\]
where $u^i \sim \mathcal{U}(B(0_p,1))$ for $i = 1,\dots,N_k$ and $\bar{u} = \frac{1}{N_k}\sum_{i=1}^{N_k} u^i$. We now show that the matrix $\big( (\bar{U}^k)^\top \bar{U}^k \big)^{-1} (\bar{U}^k)^\top \Xi^k$ converges asymptotically in distribution, following which the matrix $\big( (\bar{U}^k)^\top \bar{U}^k \big)^{-1} (\bar{U}^k)^\top 1_{N_k\times m}$ has a similar asymptotic convergence. By the law of large numbers, $\frac{1}{N_k} (\bar{U}^k)^\top \bar{U}^k \to V_u$ almost surely, where $V_u \triangleq E_{\tilde{u}}[\tilde{u}\tilde{u}^\top]$ and $\tilde{u} \sim \mathcal{U}(B(0_p,1))$. By Theorem 2.57 in [28], $\frac{1}{N_k}(\bar{U}^k)^\top \bar{U}^k$ converges with the rate $o(N_k^{-1/2})$. By the Berry–Esseen theorem, there exists a constant $C$ such that, with the rate $C N_k^{-1/2}$,
\[
\frac{1}{\sqrt{N_k}}\, (\bar{U}^k)^\top \Xi^k \xrightarrow{d} \mathcal{MN}_{p\times m}(0, V_u, \Sigma^*), \qquad \text{as } N_k \to \infty.
\]
By Slutsky's theorem, with the rate $O(N_k^{-1/2})$,
\[
\sqrt{N_k}\, \big( (\bar{U}^k)^\top \bar{U}^k \big)^{-1} (\bar{U}^k)^\top \Xi^k \xrightarrow{d} \mathcal{MN}_{p\times m}(0, V_u^{-1}, \Sigma^*), \qquad \text{as } N_k \to \infty. \tag{A.4}
\]
Let $H_1 \sim \mathcal{MN}_{p\times m}(0, V_u^{-1}, \Sigma^*)$ and $H_2 \sim \mathcal{MN}_{p\times m}(0, V_u^{-1}, 0_m)$ be random matrices, and let $\Phi_1$ and $\Phi_2$ be the cumulative probability distributions of $\|H_1\|$ and $\|H_2\|$, respectively. Then, from the convergence in distribution in (A.4), for any $0 < \alpha < 1$ there exists a constant $\kappa_{eg} > 0$ such that, when $N_k \geq \max\{ O(\Delta_k^{-4}\kappa_{eg}^{-2}),\, O(\alpha^{-2}) \}$, we have
\begin{align*}
& P\big( \|\nabla\lambda^*(z) - \widehat{B}^{k,1}\| \geq \kappa_{eg}\Delta_k,\ \forall z \in B(\widehat{x}^k, \Delta_k)\cap X \mid \mathcal{F}_{k-1} \big) \\
&= P\Big( \Big\| \tfrac{1}{\Delta_k}\big( (\bar{U}^k)^\top \bar{U}^k \big)^{-1}(\bar{U}^k)^\top \Xi^k - O(K_1\Delta_k)\big( (\bar{U}^k)^\top \bar{U}^k \big)^{-1}(\bar{U}^k)^\top 1_{N_k\times m} \Big\| \geq \kappa_{eg}\Delta_k \,\Big|\, \mathcal{F}_{k-1} \Big) \\
&\leq P\Big( \tfrac{1}{\Delta_k} \big\| \big( (\bar{U}^k)^\top \bar{U}^k \big)^{-1}(\bar{U}^k)^\top \Xi^k \big\| \geq \tfrac12 \kappa_{eg}\Delta_k \,\Big|\, \mathcal{F}_{k-1} \Big)
+ P\Big( O(K_1\Delta_k) \big\| \big( (\bar{U}^k)^\top \bar{U}^k \big)^{-1}(\bar{U}^k)^\top 1_{N_k\times m} \big\| \geq \tfrac12 \kappa_{eg}\Delta_k \,\Big|\, \mathcal{F}_{k-1} \Big) \\
&\leq P\Big( \tfrac{1}{\sqrt{N_k}} \|H_1\| \geq \tfrac12 \kappa_{eg}\Delta_k^2 \,\Big|\, \mathcal{F}_{k-1} \Big)
+ P\Big( \tfrac{1}{\sqrt{N_k}} \|H_2\| \geq \tfrac12 \big( \kappa_{eg} + O(K_1) \big)\Delta_k \,\Big|\, \mathcal{F}_{k-1} \Big) + C N_k^{-1/2} \\
&\leq \frac{\alpha}{3} + \frac{\alpha}{3} + \frac{\alpha}{3} = \alpha.
\end{align*}

(b) We next consider the error $\lambda^*(z) - (\widehat{B}^{k,1})^\top z - (\widehat{B}^{k,0})^\top$:
\[
\lambda^*(z) - (\widehat{B}^{k,1})^\top z - (\widehat{B}^{k,0})^\top
= \Big[ \lambda^*(z) - \nabla\lambda^*(z)^\top z - (\widehat{B}^{k,0})^\top \Big] + \big( \nabla\lambda^*(z) - \widehat{B}^{k,1} \big)^\top z
= \begin{pmatrix} \lambda^*(z)^\top - z^\top \nabla\lambda^*(z) - \widehat{B}^{k,0} \\ \nabla\lambda^*(z) - \widehat{B}^{k,1} \end{pmatrix}^{\!\top} \begin{pmatrix} 1 \\ z \end{pmatrix}.
\]
By (A.3), we derive
\[
\lambda^*(z) - (\widehat{B}^{k,1})^\top z - (\widehat{B}^{k,0})^\top
= \underbrace{-(\Xi^k)^\top \widehat{X}^k \big( (\widehat{X}^k)^\top \widehat{X}^k \big)^{-1} \begin{pmatrix} 1 \\ z \end{pmatrix}}_{T_1}
+ O(\Delta_k^2)\, \underbrace{1_{m\times N_k}\, \widehat{X}^k \big( (\widehat{X}^k)^\top \widehat{X}^k \big)^{-1} \begin{pmatrix} 1 \\ z \end{pmatrix}}_{T_2}.
\]
By Definition 4.7, $\ell(z) = \widehat{X}^k \big( (\widehat{X}^k)^\top \widehat{X}^k \big)^{-1} \binom{1}{z}$. Under the strong poisedness condition, we have $\sqrt{N_k}\, \max_{z\in B} \|\ell(z)\| \leq \Lambda$. Therefore,
\[
E\big[ \|T_1\|^2 \big] = E\big[ \|(\Xi^k)^\top \ell(z)\|^2 \big] = \sum_{i=1}^{N_k} E\big[ \|e^{k,i}\|^2 \big]\, E\big[ |\ell_i(z)|^2 \big] \leq \frac{\Lambda^2}{N_k}\, E\big[ \|\tilde{\varepsilon}\|^2 \big],
\qquad
\|T_2\| = \|1_{m\times N_k}\, \ell(z)\| \leq \sqrt{m}\, \sum_{i=1}^{N_k} |\ell_i(z)| \leq \sqrt{m N_k}\, \|\ell(z)\| \leq \Lambda\sqrt{m}.
\]
From Markov's inequality, there exists a constant $C$ such that
\begin{align*}
& P\big( \|\lambda^*(z) - (\widehat{B}^{k,1})^\top z - (\widehat{B}^{k,0})^\top\| \geq \kappa_{ef}\Delta_k^2,\ \forall z \in B(\widehat{x}^k, \Delta_k)\cap X \mid \mathcal{F}_{k-1} \big) \\
&\leq P\big( \|T_1\| + C\Delta_k^2 \|T_2\| \geq \kappa_{ef}\Delta_k^2,\ \forall z \in B(\widehat{x}^k, \Delta_k)\cap X \mid \mathcal{F}_{k-1} \big)
\leq P\big( \|T_1\| \geq (\kappa_{ef} - C\Lambda\sqrt{m})\Delta_k^2,\ \forall z \in B(\widehat{x}^k, \Delta_k)\cap X \mid \mathcal{F}_{k-1} \big) \\
&\leq \frac{E[\|T_1\|^2]}{(\kappa_{ef} - C\Lambda\sqrt{m})^2 \Delta_k^4}
\leq \frac{\Lambda^2\, E[\|\tilde{\varepsilon}\|^2]}{N_k (\kappa_{ef} - C\Lambda\sqrt{m})^2 \Delta_k^4}.
\end{align*}
Then, for any $0 < \alpha < 1$, there exists a constant $\kappa_{ef} > 0$ such that when $N_k \geq O(\Delta_k^{-4}\kappa_{ef}^{-2}\alpha^{-1})$ we have
\[
\frac{\Lambda^2\, E[\|\tilde{\varepsilon}\|^2]}{N_k (\kappa_{ef} - C\Lambda\sqrt{m})^2 \Delta_k^4} \leq \alpha,
\]
which proves the statement. □
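The estimator (A.1) is an ordinary least squares fit on the trust-region data set and is straightforward to realize numerically. The following sketch is purely illustrative: the map lam_star, the noise level, and all sizes are hypothetical stand-ins for $\lambda^*$, $\tilde{\varepsilon}$, and $(N_k, \Delta_k)$, while the sampling step mirrors the construction of $\bar{U}^k$ above.

import numpy as np

def fit_llr(center, radius, n_samples, lam_star, noise_sd, rng):
    # Sample points uniformly from the trust-region ball B(center, radius).
    p = center.size
    u = rng.normal(size=(n_samples, p))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    u *= rng.uniform(size=(n_samples, 1)) ** (1.0 / p)
    X = center + radius * u
    # Noisy observations W^k = lambda*(X^k) + Xi^k.
    L = lam_star(X)
    W = L + noise_sd * rng.normal(size=L.shape)
    # Least squares fit of the LLR model, the computation in (A.1).
    Xhat = np.hstack([np.ones((n_samples, 1)), X])
    coef, *_ = np.linalg.lstsq(Xhat, W, rcond=None)
    return coef[0], coef[1:]          # estimates of B^{k,0} and B^{k,1}

rng = np.random.default_rng(0)
lam_star = lambda X: np.column_stack([np.sin(X[:, 0]), X[:, 0] * X[:, 1]])
B0, B1 = fit_llr(np.zeros(2), 0.1, 200, lam_star, 0.01, rng)

Repeating the fit with larger n_samples and a smaller radius exhibits the two error rates, $\kappa_{eg}\Delta_k$ for the gradient estimate and $\kappa_{ef}\Delta_k^2$ for the function estimate, quantified in the proposition.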
Proof of Proposition 2.2. By the definition of $f(x)$ and $f_k(x)$ in (4.2) and (4.10), we have
\[
f(x) - f_k(x)
= \Big( E_{\tilde{\varepsilon}}\big[ h(x, \lambda^*(x)+\tilde{\varepsilon}) \big] - \frac{1}{N_k}\sum_{i=1}^{N_k} h\big(x, m_k(x, e^{k,i})\big) \Big)
+ \Big( E_{\tilde{\varepsilon}}\big[ g(x, \lambda^*(x)+\tilde{\varepsilon}) \big] - \frac{1}{N_k}\sum_{i=1}^{N_k} g\big(x, m_k(x, e^{k,i})\big) \Big).
\]
By the Lipschitz continuity of $h(z,\cdot)$ and $g(z,\cdot)$, there exists a constant $L_1$ such that
\begin{align*}
|f(x) - f_k(x)|
&\leq \Big| E_{\tilde{\varepsilon}}\big[ h(x,\lambda^*(x)+\tilde{\varepsilon}) \big] - \frac{1}{N_k}\sum_{i=1}^{N_k} h(x,\lambda^*(x)+e^{k,i}) \Big|
+ \Big| \frac{1}{N_k}\sum_{i=1}^{N_k} h(x,\lambda^*(x)+e^{k,i}) - \frac{1}{N_k}\sum_{i=1}^{N_k} h(x, m_k(x,e^{k,i})) \Big| \\
&\quad + \Big| E_{\tilde{\varepsilon}}\big[ g(x,\lambda^*(x)+\tilde{\varepsilon}) \big] - \frac{1}{N_k}\sum_{i=1}^{N_k} g(x,\lambda^*(x)+e^{k,i}) \Big|
+ \Big| \frac{1}{N_k}\sum_{i=1}^{N_k} g(x,\lambda^*(x)+e^{k,i}) - \frac{1}{N_k}\sum_{i=1}^{N_k} g(x, m_k(x,e^{k,i})) \Big| \\
&\leq |\tau_k| + |\xi_k| + L_1 \big\| \lambda^*(x) - (\widehat{B}^{k,1})^\top x - (\widehat{B}^{k,0})^\top \big\|,
\end{align*}
where
\[
\tau_k = E_{\tilde{\varepsilon}}\big[ h(x,\lambda^*(x)+\tilde{\varepsilon}) \big] - \frac{1}{N_k}\sum_{i=1}^{N_k} h(x,\lambda^*(x)+e^{k,i}),
\qquad
\xi_k = E_{\tilde{\varepsilon}}\big[ g(x,\lambda^*(x)+\tilde{\varepsilon}) \big] - \frac{1}{N_k}\sum_{i=1}^{N_k} g(x,\lambda^*(x)+e^{k,i}).
\]
Since $\{e^{k,i}\}$ are i.i.d. samples of the random variable $\tilde{\varepsilon}$, by the Berry–Esseen theorem, with the rate $O(1/\sqrt{N_k})$ we have $\sqrt{N_k}\,\tau_k \xrightarrow{d} \mathcal{N}(0, V_h)$, where $V_h = \mathrm{Var}_{\tilde{\varepsilon}}\big( h(x,\lambda^*(x)+\tilde{\varepsilon}) \big)$. Similarly, with the rate $O(1/\sqrt{N_k})$, we have $\sqrt{N_k}\,\xi_k \xrightarrow{d} \mathcal{N}(0, V_g)$, where $V_g = \mathrm{Var}_{\tilde{\varepsilon}}\big( g(x,\lambda^*(x)+\tilde{\varepsilon}) \big)$.

Consider the stochastic process $\mathcal{F}_k$ given $\mathcal{F}_{k-1}$. By Proposition 4.1, there exists a constant $\kappa_{ef} > 0$ such that, when $N_k \geq \max\{ O(\Delta_k^{-4}\kappa_{ef}^{-2}),\, O(\alpha^{-2}) \}$,
\begin{align*}
& P\big( |f(x) - f_k(x)| \geq \kappa_{ef}\Delta_k^2,\ \forall x \in B(\widehat{x}^k,\Delta_k)\cap X \mid \mathcal{F}_{k-1} \big) \\
&\leq P\big( |\tau_k| + |\xi_k| + L_1 \|\lambda^*(x) - (\widehat{B}^{k,1})^\top x - (\widehat{B}^{k,0})^\top\| \geq \kappa_{ef}\Delta_k^2,\ \forall x \in B(\widehat{x}^k,\Delta_k)\cap X \mid \mathcal{F}_{k-1} \big) \\
&\leq P\big( |\tau_k| \geq \tfrac13 \kappa_{ef}\Delta_k^2 \mid \mathcal{F}_{k-1} \big)
+ P\big( |\xi_k| \geq \tfrac13 \kappa_{ef}\Delta_k^2 \mid \mathcal{F}_{k-1} \big)
+ P\big( L_1 \|\lambda^*(x) - (\widehat{B}^{k,1})^\top x - (\widehat{B}^{k,0})^\top\| \geq \tfrac13 \kappa_{ef}\Delta_k^2 \mid \mathcal{F}_{k-1} \big) \\
&\leq \frac{\alpha}{6} + \frac{\alpha}{6} + \frac{\alpha}{6} = \frac{\alpha}{2}.
\end{align*}
This establishes part (a). For part (b), we first note that under assumption (A3), $h(x,\omega) = \max_{j\in J} h_j(x,\omega)$. Writing this maximum in its dual linear programming form, we have
\[
h(x,\omega) = \max_{\lambda \geq 0,\ \sum_j \lambda_j = 1}\ \sum_{j\in J} \lambda_j\, h_j(x,\omega),
\]
with the corresponding optimal dual solution set denoted $\Lambda(x,\omega)$. By the directional derivative of the linear program in [9], given $x\in X$ and $\omega\in\Omega$, for any $d_1\in\mathbb{R}^p$ and $d_2\in\mathbb{R}^m$,
\[
h'((x,\omega); (d_1,d_2)) = \max_{\lambda\in\Lambda(x,\omega)}\ \sum_{j\in J} \lambda_j \big[ \nabla_1 h_j(x,\omega)^\top d_1 + \nabla_2 h_j(x,\omega)^\top d_2 \big], \tag{A.5}
\]
where $\nabla_1 h_j$ denotes the partial gradient of $h_j$ with respect to the first argument in the $\mathbb{R}^p$ space, and similarly for $\nabla_2 h_j$. Let $\widehat{\varepsilon}^k$ denote the random variable with the empirical probability distribution of $\{e^{k,i}\}$. By (4.5), given $x\in X$, for any $d\in T(x;X)$,
\begin{align*}
f'(x;d) &= \nabla c(x)^\top d + E_{\tilde{\varepsilon}}\big[ h'((x,\lambda^*(x)+\tilde{\varepsilon}); (d, \nabla\lambda^*(x)^\top d)) \big] + E_{\tilde{\varepsilon}}\big[ \nabla g(x,\lambda^*(x)+\tilde{\varepsilon})^\top (d, \nabla\lambda^*(x)^\top d) \big], \\
f_k'(x;d) &= \nabla c(x)^\top d + E_{\widehat{\varepsilon}^k}\big[ h'((x, m_k(x,\widehat{\varepsilon}^k)); (d, (\widehat{B}^{k,1})^\top d)) \big] + E_{\widehat{\varepsilon}^k}\big[ \nabla g(x, m_k(x,\widehat{\varepsilon}^k))^\top (d, (\widehat{B}^{k,1})^\top d) \big].
\end{align*}
Let $\bar{d} \in \operatorname{argmin}_{x+d\in X,\ \|d\|\leq 1} f'(x;d)$. Then we have
\begin{align*}
\psi(x) - \psi_k(x)
&= -\min_{x+d\in X,\ \|d\|\leq 1} f'(x;d) + \min_{x+d\in X,\ \|d\|\leq 1} f_k'(x;d)
\leq -f'(x;\bar{d}) + f_k'(x;\bar{d}) \\
&\leq \Big( -E_{\tilde{\varepsilon}}\big[ h'((x,\lambda^*(x)+\tilde{\varepsilon}); (\bar{d}, \nabla\lambda^*(x)^\top \bar{d})) \big] + E_{\widehat{\varepsilon}^k}\big[ h'((x, m_k(x,\widehat{\varepsilon}^k)); (\bar{d}, (\widehat{B}^{k,1})^\top \bar{d})) \big] \Big) \\
&\quad + \Big( -E_{\tilde{\varepsilon}}\big[ \nabla g(x,\lambda^*(x)+\tilde{\varepsilon})^\top (\bar{d}, \nabla\lambda^*(x)^\top \bar{d}) \big] + E_{\widehat{\varepsilon}^k}\big[ \nabla g(x, m_k(x,\widehat{\varepsilon}^k))^\top (\bar{d}, (\widehat{B}^{k,1})^\top \bar{d}) \big] \Big) \\
&= (\eta_{k1} + \eta_{k2}) + (\gamma_{k1} + \gamma_{k2}),
\end{align*}
where
\begin{align*}
\eta_{k1} &= -E_{\tilde{\varepsilon}}\big[ h'((x,\lambda^*(x)+\tilde{\varepsilon}); (\bar{d}, \nabla\lambda^*(x)^\top \bar{d})) \big] + E_{\widehat{\varepsilon}^k}\big[ h'((x,\lambda^*(x)+\widehat{\varepsilon}^k); (\bar{d}, \nabla\lambda^*(x)^\top \bar{d})) \big], \\
\eta_{k2} &= -E_{\widehat{\varepsilon}^k}\big[ h'((x,\lambda^*(x)+\widehat{\varepsilon}^k); (\bar{d}, \nabla\lambda^*(x)^\top \bar{d})) \big] + E_{\widehat{\varepsilon}^k}\big[ h'((x, m_k(x,\widehat{\varepsilon}^k)); (\bar{d}, (\widehat{B}^{k,1})^\top \bar{d})) \big], \\
\gamma_{k1} &= -E_{\tilde{\varepsilon}}\big[ \nabla g(x,\lambda^*(x)+\tilde{\varepsilon})^\top (\bar{d}, \nabla\lambda^*(x)^\top \bar{d}) \big] + E_{\widehat{\varepsilon}^k}\big[ \nabla g(x,\lambda^*(x)+\widehat{\varepsilon}^k)^\top (\bar{d}, \nabla\lambda^*(x)^\top \bar{d}) \big], \\
\gamma_{k2} &= -E_{\widehat{\varepsilon}^k}\big[ \nabla g(x,\lambda^*(x)+\widehat{\varepsilon}^k)^\top (\bar{d}, \nabla\lambda^*(x)^\top \bar{d}) \big] + E_{\widehat{\varepsilon}^k}\big[ \nabla g(x, m_k(x,\widehat{\varepsilon}^k))^\top (\bar{d}, (\widehat{B}^{k,1})^\top \bar{d}) \big].
\end{align*}
For $\eta_{k1}$ and $\gamma_{k1}$, we can apply the Berry–Esseen theorem to derive that $\sqrt{N_k}\,\eta_{k1}$ and $\sqrt{N_k}\,\gamma_{k1}$ both converge in distribution to a normal distribution with the rate $O(1/\sqrt{N_k})$, uniformly for all $x\in X$. Under assumption (A5), using a simple decomposition technique, the term $\gamma_{k2}$ can be bounded, up to a constant scale, by $\|\lambda^*(x) - (\widehat{B}^{k,1})^\top x - (\widehat{B}^{k,0})^\top\| + \|\nabla\lambda^*(x) - \widehat{B}^{k,1}\|$. It then only remains to bound $\eta_{k2}$. Given $(x,\varepsilon)$, let $\lambda^k(x,\varepsilon) \in \Lambda(x, m_k(x,\varepsilon))$ be the multiplier achieving the maximum in the formula of $h'((x, m_k(x,\varepsilon)); (\bar{d}, (\widehat{B}^{k,1})^\top \bar{d}))$. By Corollary 3.2.5 in [30] and the fact that $\Lambda(x,\omega)$ is essentially bounded, there exist constants $C_0$ and $C_1$ such that there exists $\lambda^*(x,\varepsilon) \in \Lambda(x, \lambda^*(x)+\varepsilon)$ satisfying
\[
\|\lambda^k(x,\varepsilon) - \lambda^*(x,\varepsilon)\|
\leq C_0 \max_{j\in J} |h_j(x, \lambda^*(x)+\varepsilon) - h_j(x, m_k(x,\varepsilon))|
\leq C_0 C_1 \|\lambda^*(x) - (\widehat{B}^{k,1})^\top x - (\widehat{B}^{k,0})^\top\|.
\]
Moreover, with $\lambda^*(x,\varepsilon)$ and $\lambda^k(x,\varepsilon)$, by (4.14) there exist constants $C_2$, $C_3$ and $C_4$ such that
\begin{align*}
\eta_{k2}
&\leq \Bigg| E_{\widehat{\varepsilon}^k}\Bigg[ \sum_{j\in J} \lambda_j^*(x,\widehat{\varepsilon}^k) \Big( \nabla_1 h_j(x,\lambda^*(x)+\widehat{\varepsilon}^k)^\top \bar{d} + \nabla_2 h_j(x,\lambda^*(x)+\widehat{\varepsilon}^k)^\top (\nabla\lambda^*(x)^\top \bar{d}) \Big) \Bigg] \\
&\qquad - E_{\widehat{\varepsilon}^k}\Bigg[ \sum_{j\in J} \lambda_j^k(x,\widehat{\varepsilon}^k) \Big( \nabla_1 h_j(x, m_k(x,\widehat{\varepsilon}^k))^\top \bar{d} + \nabla_2 h_j(x, m_k(x,\widehat{\varepsilon}^k))^\top (\widehat{B}^{k,1})^\top \bar{d} \Big) \Bigg] \Bigg| \\
&\leq C_2\, E_{\widehat{\varepsilon}^k}\big[ \|\lambda^*(x,\widehat{\varepsilon}^k) - \lambda^k(x,\widehat{\varepsilon}^k)\| \big]
+ \max_{j\in J} E_{\widehat{\varepsilon}^k}\big[ \|\nabla_1 h_j(x,\lambda^*(x)+\widehat{\varepsilon}^k) - \nabla_1 h_j(x, m_k(x,\widehat{\varepsilon}^k))\| \big] \\
&\qquad + \max_{j\in J} E_{\widehat{\varepsilon}^k}\big[ \|\nabla\lambda^*(x)\, \nabla_2 h_j(x,\lambda^*(x)+\widehat{\varepsilon}^k) - \widehat{B}^{k,1}\, \nabla_2 h_j(x, m_k(x,\widehat{\varepsilon}^k))\| \big] \\
&\leq C_3 \|\lambda^*(x) - (\widehat{B}^{k,1})^\top x - (\widehat{B}^{k,0})^\top\| + C_4 \|\nabla\lambda^*(x) - \widehat{B}^{k,1}\|.
\end{align*}
Consider the stochastic process $\mathcal{F}_k$ given $\mathcal{F}_{k-1}$. By Proposition 4.1, there exists a constant $\kappa_{eg} > 0$ such that, when $N_k \geq \max\{ O(\Delta_k^{-4}\kappa_{eg}^{-2}),\, O(\alpha^{-2}) \}$,
\begin{align*}
& P\big( |\psi(x) - \psi_k(x)| \geq \kappa_{eg}\Delta_k,\ \forall x \in B(\widehat{x}^k,\Delta_k)\cap X \mid \mathcal{F}_{k-1} \big)
\leq P\big( \eta_{k1} + \eta_{k2} + \gamma_{k1} + \gamma_{k2} \geq \kappa_{eg}\Delta_k,\ \forall x \in B(\widehat{x}^k,\Delta_k)\cap X \mid \mathcal{F}_{k-1} \big) \\
&\leq P\big( \eta_{k1} \geq \tfrac14\kappa_{eg}\Delta_k \mid \mathcal{F}_{k-1} \big)
+ P\big( \eta_{k2} \geq \tfrac14\kappa_{eg}\Delta_k \mid \mathcal{F}_{k-1} \big)
+ P\big( \gamma_{k1} \geq \tfrac14\kappa_{eg}\Delta_k \mid \mathcal{F}_{k-1} \big)
+ P\big( \gamma_{k2} \geq \tfrac14\kappa_{eg}\Delta_k \mid \mathcal{F}_{k-1} \big)
\leq 4 \cdot \frac{\alpha}{8} = \frac{\alpha}{2},
\end{align*}
which establishes part (b). Combining the results in (a) and (b): for any $0 < \alpha < 1$ there exists a constant vector $\kappa = (\kappa_{ef}, \kappa_{eg})$ such that, when $N_k \geq \max\{ O(\Delta_k^{-4}\kappa_{eg}^{-2}),\, O(\Delta_k^{-4}\kappa_{ef}^{-2}),\, O(\alpha^{-2}) \}$,
\[
P\big( |f(x) - f_k(x)| \leq \kappa_{ef}\Delta_k^2,\ |\psi(x) - \psi_k(x)| \leq \kappa_{eg}\Delta_k,\ \forall x \in B(\widehat{x}^k,\Delta_k)\cap X \mid \mathcal{F}_{k-1} \big) \geq 1 - \alpha,
\]
which establishes the statement. □
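Part (a) above rests on the Berry–Esseen behavior of the sample-average errors $\tau_k$ and $\xi_k$, i.e., on their magnitude scaling like $N_k^{-1/2}$. A quick Monte Carlo sanity check of this scaling, with a toy piecewise-linear function standing in for the recourse function $h$ and every name below being a hypothetical choice of ours, can be written as:

import numpy as np

rng = np.random.default_rng(1)
h = lambda w: np.maximum(1.0 - w, 0.3 * w)        # toy piecewise-linear recourse stand-in
true_mean = h(rng.normal(size=2_000_000)).mean()  # high-accuracy reference for E[h]

for n in (100, 1_000, 10_000):
    # tau_k = E[h] - sample average; report sqrt(n)-scaled RMS over 500 replications.
    errs = [true_mean - h(rng.normal(size=n)).mean() for _ in range(500)]
    print(n, np.sqrt(np.mean(np.square(errs)) * n))   # stays roughly constant

The printed quantity is approximately constant across the three sample sizes, which is exactly the $\sqrt{N_k}\,\tau_k \xrightarrow{d} \mathcal{N}(0, V_h)$ behavior invoked in the proof.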
Appendix B
Supplement to Chapter 3

Proof of Theorem 3.4. We follow similar steps as in the proof of Theorem 3.1. First define an intermediate iterate:
\[
\tilde{x}^{\nu+1/2} \triangleq \operatorname*{argmin}_{x\in X}\ \varphi(x) + E_{\tilde{\omega}}\Big[ \psi_{\alpha_\nu,1}(x,\tilde{\omega}) - \widehat{\psi}_{\alpha_\nu,2}(x,\tilde{\omega}; \tilde{x}^\nu) \Big] + \tfrac12 \|x - \tilde{x}^\nu\|^2.
\]
Step 1. By Lemma 3.2, we derive
\[
\Big( \varphi(\tilde{x}^{\nu+1/2}) + E_{\tilde{\omega}}\big[ \psi_{\alpha_\nu,1}(\tilde{x}^{\nu+1/2},\tilde{\omega}) \big] \Big) - \Big( \varphi(\tilde{x}^\nu) + E_{\tilde{\omega}}\big[ \psi_{\alpha_\nu,1}(\tilde{x}^\nu,\tilde{\omega}) \big] \Big)
\leq -\big( \tilde{x}^{\nu+1/2} - \tilde{x}^\nu \big)^\top \Big( \tilde{x}^{\nu+1/2} - \tilde{x}^\nu - E_{\tilde{\omega}}\big[ \nabla_x \psi_{\alpha_\nu,2}(\tilde{x}^\nu,\tilde{\omega}) \big] \Big)
= -\|\tilde{x}^{\nu+1/2} - \tilde{x}^\nu\|^2 + \big( \tilde{x}^{\nu+1/2} - \tilde{x}^\nu \big)^\top E_{\tilde{\omega}}\big[ \nabla_x \psi_{\alpha_\nu,2}(\tilde{x}^\nu,\tilde{\omega}) \big]. \tag{B.1}
\]
Step 2. By the optimality conditions of $\tilde{x}^{\nu+1}$ and $\tilde{x}^{\nu+1/2}$, we deduce $\|\tilde{x}^{\nu+1/2} - \tilde{x}^{\nu+1}\| \leq \|\widetilde{C}^\nu\| + \|\widetilde{B}^\nu\|$, where
\[
\widetilde{B}^\nu \triangleq \frac{1}{L_\nu}\sum_{i=1}^{L_\nu} \nabla_x \psi_{\alpha_\nu,2}(\tilde{x}^\nu, \omega^{\nu,i}) - E_{\tilde{\omega}}\big[ \nabla_x \psi_{\alpha_\nu,2}(\tilde{x}^\nu,\tilde{\omega}) \big],
\qquad
\widetilde{C}^\nu \triangleq \frac{1}{L_\nu}\sum_{i=1}^{L_\nu} u_{\alpha_\nu}(\tilde{x}^{\nu+1/2}, \omega^{\nu,i}) - v_{\alpha_\nu}(\tilde{x}^{\nu+1/2}),
\]
with $u_{\alpha_\nu}(\tilde{x}^{\nu+1/2}, \omega) \in \partial \psi_{\alpha_\nu,1}(\tilde{x}^{\nu+1/2}, \omega)$ and $v_{\alpha_\nu}(\tilde{x}^{\nu+1/2}) = E_{\tilde{\omega}}\big[ u_{\alpha_\nu}(\tilde{x}^{\nu+1/2}, \tilde{\omega}) \big]$. In what follows, we derive bounds for $E_\nu\big[ \|\widetilde{B}^\nu\| \big]$ and $E_\nu\big[ \|\widetilde{C}^\nu\| \big]$; the derivation is similar to the corresponding step in the proof of Theorem 3.1. We have
\[
E_\nu\big[ \|\widetilde{B}^\nu\| \big]
= E_{\nu-1}\Bigg[ E\bigg[ \Big\| \frac{1}{L_\nu}\sum_{i=1}^{L_\nu} \nabla_x \psi_{\alpha_\nu,2}(\tilde{x}^{\nu+1/2}, \omega^{\nu,i}) - E_{\tilde{\omega}}\big[ \nabla_x \psi_{\alpha_\nu,2}(\tilde{x}^{\nu+1/2},\tilde{\omega}) \big] \Big\| \,\Big|\, \mathcal{F}_{\nu-1} \bigg] \Bigg]
\leq E_{\nu-1}\Bigg[ \frac{1}{\sqrt{L_\nu}} \Big( E_{\tilde{\omega}} \big\| \nabla_x \psi_{\alpha_\nu,2}(\tilde{x}^{\nu+1/2},\tilde{\omega}) - E_{\tilde{\omega}}\big[ \nabla_x \psi_{\alpha_\nu,2}(\tilde{x}^{\nu+1/2},\tilde{\omega}) \big] \big\|^2 \Big)^{1/2} \Bigg].
\]
Since $\nabla_x \psi_{\alpha,2}(x,\omega) = G(\omega)^\top (Q + \alpha I)^{-1} \big[ f(\omega) + G(\omega)x \big]$, we have $\|\nabla_x \psi_{\alpha_\nu,2}(x,\omega)\| \leq \alpha_\nu^{-1} \|G(\omega)\|\, \|f(\omega) + G(\omega)x\|$. Hence there exists a constant $\widetilde{V}_2 > 0$ such that
\[
E_\nu\big[ \|\widetilde{B}^\nu\| \big] \leq \frac{\widetilde{V}_2^{1/2}}{\alpha_\nu L_\nu^{1/2}}. \tag{B.2}
\]
By Lemma 3.10, its proof, and a similar string of inequalities, we can derive the existence of a constant $\widetilde{V}_1 > 0$ such that, for all $\nu$ with $\alpha_\nu$ sufficiently small,
\[
E_\nu\big[ \|\widetilde{C}^\nu\| \big] \leq \frac{\widetilde{V}_1^{1/2}}{\alpha_\nu L_\nu^{1/2}}. \tag{B.3}
\]
Recalling the Lipschitz constant $\mathrm{Lip}_\varphi$ from (3.3) and $\mathrm{Lip}_\psi$ from (3.1), we have
\begin{align*}
& \widehat{\zeta}_{\alpha_\nu}(\tilde{x}^{\nu+1}; \tilde{x}^\nu) - \widehat{\zeta}_{\alpha_\nu}(\tilde{x}^{\nu+1/2}; \tilde{x}^\nu) \\
&= \varphi(\tilde{x}^{\nu+1}) + E_{\tilde{\omega}}\big[ \psi_{\alpha_\nu,1}(\tilde{x}^{\nu+1},\tilde{\omega}) \big] - \varphi(\tilde{x}^{\nu+1/2}) - E_{\tilde{\omega}}\big[ \psi_{\alpha_\nu,1}(\tilde{x}^{\nu+1/2},\tilde{\omega}) \big] \\
&\quad - E_{\tilde{\omega}}\big[ \nabla_x \psi_{\alpha_\nu,2}(\tilde{x}^\nu,\tilde{\omega}) \big]^\top \big( \tilde{x}^{\nu+1} - \tilde{x}^{\nu+1/2} \big)
+ \tfrac12 \big( \tilde{x}^{\nu+1} - \tilde{x}^{\nu+1/2} \big)^\top \big( \tilde{x}^{\nu+1} + \tilde{x}^{\nu+1/2} - 2\tilde{x}^\nu \big) \\
&\leq \Big( \mathrm{Lip}_\varphi + \frac{\mathrm{Lip}_\psi}{\alpha_\nu} + \widetilde{\Upsilon} \Big) \|\tilde{x}^{\nu+1} - \tilde{x}^{\nu+1/2}\|, \quad \text{for some constant } \widetilde{\Upsilon} > 0 \\
&\leq \Big( \mathrm{Lip}_\varphi + \frac{\mathrm{Lip}_\psi}{\alpha_\nu} + \widetilde{\Upsilon} \Big) \big[ \|\widetilde{C}^\nu\| + \|\widetilde{B}^\nu\| \big].
\end{align*}
Step 3. Following the same derivation as in Step 3 of Theorem 3.1 and using the fact that $\zeta_{\alpha_{\nu+1}}(\tilde{x}^{\nu+1}) \leq \zeta_{\alpha_\nu}(\tilde{x}^{\nu+1})$, we deduce that
\[
\zeta_{\alpha_{\nu+1}}(\tilde{x}^{\nu+1}) - \zeta_{\alpha_\nu}(\tilde{x}^\nu)
\leq \Big( \mathrm{Lip}_\varphi + \frac{\mathrm{Lip}_\psi}{\alpha_\nu} + \widetilde{\Upsilon} \Big) \big[ \|\widetilde{C}^\nu\| + \|\widetilde{B}^\nu\| \big]
- \tfrac12 \|\tilde{x}^{\nu+1} - \tilde{x}^\nu\|^2 - \tfrac12 \|\tilde{x}^{\nu+1/2} - \tilde{x}^\nu\|^2.
\]
By (B.2) and (B.3), we have
\[
\sum_{\nu=1}^{\infty} \Big( \mathrm{Lip}_\varphi + \frac{\mathrm{Lip}_\psi}{\alpha_\nu} + \widetilde{\Upsilon} \Big)\, E_\nu\big[ \|\widetilde{C}^\nu\| + \|\widetilde{B}^\nu\| \big]
\leq \sum_{\nu=1}^{\infty} \Big( \mathrm{Lip}_\varphi + \frac{\mathrm{Lip}_\psi}{\alpha_\nu} + \widetilde{\Upsilon} \Big) \Bigg[ \frac{\widetilde{V}_1^{1/2}}{\alpha_\nu L_\nu^{1/2}} + \frac{\widetilde{V}_2^{1/2}}{\alpha_\nu L_\nu^{1/2}} \Bigg],
\]
with the right-hand sum being finite by assumption. Hence, with probability 1, $\sum_{\nu=1}^{\infty} \big( \mathrm{Lip}_\varphi + \mathrm{Lip}_\psi/\alpha_\nu + \widetilde{\Upsilon} \big) \big[ \|\widetilde{C}^\nu\| + \|\widetilde{B}^\nu\| \big]$ is finite. By Lemma 3.3, the sequence $\{\zeta_{\alpha_\nu}(\tilde{x}^\nu)\}$ converges. Moreover, the two sums $\sum_{\nu=1}^{\infty} \|\tilde{x}^{\nu+1/2} - \tilde{x}^\nu\|^2$ and $\sum_{\nu=1}^{\infty} \|\tilde{x}^{\nu+1} - \tilde{x}^\nu\|^2$ are finite with probability 1.

Step 4. Similar to the proof of (3.39), we can show that $\tilde{x}^{\nu+1}$ is optimal for
\[
\operatorname*{minimize}_{x\in X}\ \varphi(x)
+ \Bigg[ \frac{1}{L_\nu}\sum_{i=1}^{L_\nu} G(\omega^{\nu,i})^\top y_{\alpha_\nu}(\tilde{x}^{\nu+1}, \omega^{\nu,i}) \Bigg]^\top (x - \tilde{x}^{\nu+1})
+ \frac{1}{L_\nu}\sum_{i=1}^{L_\nu} \bar{\psi}_{\alpha_\nu}(x, \tilde{x}^{\nu+1}, \omega^{\nu,i})
+ \tfrac12 \|x - \tilde{x}^\nu\|^2
+ \Bigg\{ \frac{1}{L_\nu}\sum_{i=1}^{L_\nu} \Big[ \nabla_x \psi_{\alpha_\nu,2}(\tilde{x}^{\nu+1}, \omega^{\nu,i}) - \nabla_x \psi_{\alpha_\nu,2}(\tilde{x}^\nu, \omega^{\nu,i}) \Big] \Bigg\}^\top (x - \tilde{x}^{\nu+1}).
\]
Let $\tilde{x}^\infty$ be the limit of a convergent subsequence $\{\tilde{x}^{\nu+1}\}_{\nu\in\kappa}$. To complete the proof, it remains to show the following limits:
• every limit point of the sequence $\Big\{ \frac{1}{L_\nu}\sum_{i=1}^{L_\nu} G(\omega^{\nu,i})^\top y_{\alpha_\nu}(\tilde{x}^{\nu+1}, \omega^{\nu,i}) \Big\}_{\nu\in\kappa}$, one of which must exist, belongs to $\partial_z E_{\tilde{\omega}}\big[ \bar{\psi}(\tilde{x}^\infty, \tilde{x}^\infty, \tilde{\omega}) \big]$;
• the sequence $\Big\{ \frac{1}{L_\nu}\sum_{i=1}^{L_\nu} \bar{\psi}_{\alpha_\nu}(x, \tilde{x}^{\nu+1}, \omega^{\nu,i}) \Big\}$ converges uniformly to $E_{\tilde{\omega}}\big[ \bar{\psi}(x, \tilde{x}^\infty, \tilde{\omega}) \big]$ for $x\in X$;
• $\lim_{\nu(\in\kappa)\to\infty} \frac{1}{L_\nu}\sum_{i=1}^{L_\nu} \Big[ \nabla_x \psi_{\alpha_\nu,2}(\tilde{x}^{\nu+1}, \omega^{\nu,i}) - \nabla_x \psi_{\alpha_\nu,2}(\tilde{x}^\nu, \omega^{\nu,i}) \Big] = 0$.
All these limits can be proved by invoking Lemma 3.7 and suitable bounds on the respective summands in the above limits. Specifically, the convergence of the second sequence follows easily from (3.37); that of the third sequence follows from the bound (3.40) and the argument that follows it. Finally, for the convergence of the first sequence, we first note the following two probability-one limits:
\[
\lim_{\nu(\in\kappa)\to\infty} \frac{1}{L_\nu}\sum_{i=1}^{L_\nu} \bar{\psi}_{\alpha_\nu}(\tilde{x}^{\nu+1}, x, \omega^{\nu,i}) = E_{\tilde{\omega}}\big[ \bar{\psi}(\tilde{x}^\infty, x, \tilde{\omega}) \big] \ \text{ for all } x\in X,
\qquad
\lim_{\nu(\in\kappa)\to\infty} \frac{1}{L_\nu}\sum_{i=1}^{L_\nu} \bar{\psi}_{\alpha_\nu}(\tilde{x}^{\nu+1}, \tilde{x}^{\nu+1}, \omega^{\nu,i}) = E_{\tilde{\omega}}\big[ \bar{\psi}(\tilde{x}^\infty, \tilde{x}^\infty, \tilde{\omega}) \big]. \tag{B.4}
\]
Since $\bar{\psi}_{\alpha_\nu}(\tilde{x}^{\nu+1}, \bullet, \omega^{\nu,i})$ is concave and differentiable, with gradient $\nabla_z \bar{\psi}_{\alpha_\nu}(\tilde{x}^{\nu+1}, \tilde{x}^{\nu+1}, \omega^{\nu,i}) = G(\omega^{\nu,i})^\top y_{\alpha_\nu}(\tilde{x}^{\nu+1}, \omega^{\nu,i})$, we have
\[
\frac{1}{L_\nu}\sum_{i=1}^{L_\nu} \bar{\psi}_{\alpha_\nu}(\tilde{x}^{\nu+1}, x, \omega^{\nu,i})
\leq \frac{1}{L_\nu}\sum_{i=1}^{L_\nu} \bar{\psi}_{\alpha_\nu}(\tilde{x}^{\nu+1}, \tilde{x}^{\nu+1}, \omega^{\nu,i})
+ \Bigg[ \frac{1}{L_\nu}\sum_{i=1}^{L_\nu} G(\omega^{\nu,i})^\top y_{\alpha_\nu}(\tilde{x}^{\nu+1}, \omega^{\nu,i}) \Bigg]^\top (x - \tilde{x}^{\nu+1}).
\]
For the subsequence $\Big\{ \frac{1}{L_\nu}\sum_{i=1}^{L_\nu} G(\omega^{\nu,i})^\top y_{\alpha_\nu}(\tilde{x}^{\nu+1}, \omega^{\nu,i}) \Big\}_{\nu\in\kappa}$, let $a^\infty$ be an accumulation point. Then by the limits (B.4), we deduce that, with probability 1,
\[
E_{\tilde{\omega}}\big[ \bar{\psi}(\tilde{x}^\infty, x, \tilde{\omega}) \big] \leq E_{\tilde{\omega}}\big[ \bar{\psi}(\tilde{x}^\infty, \tilde{x}^\infty, \tilde{\omega}) \big] + (a^\infty)^\top (x - \tilde{x}^\infty).
\]
Hence $a^\infty \in \partial_z E_{\tilde{\omega}}\big[ \bar{\psi}(\tilde{x}^\infty, \tilde{x}^\infty, \tilde{\omega}) \big]$ with probability 1. □
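The computed step underlying the proof keeps the convex part $\psi_{\alpha,1}$ exactly, linearizes the smooth part $\psi_{\alpha,2}$ at the anchor $\tilde{x}^\nu$, and adds a proximal term. A minimal sketch of that sampled step follows; it is not the dissertation's implementation, and the test functions, the choice of scipy's bounded quasi-Newton solver, and all parameters are our own assumptions (with $\psi_{\alpha,1}$ taken smooth so the solver applies).

import numpy as np
from scipy.optimize import minimize

def dc_prox_step(x_nu, omegas, phi, psi1, psi2_grad, bounds):
    # Sample-average gradient of the subtracted smooth part at the anchor x_nu.
    g = np.mean([psi2_grad(x_nu, w) for w in omegas], axis=0)
    def surrogate(x):
        # Convex part kept; smooth part linearized; proximal term added.
        samp = np.mean([psi1(x, w) for w in omegas])
        return phi(x) + samp - g @ (x - x_nu) + 0.5 * np.sum((x - x_nu) ** 2)
    return minimize(surrogate, x_nu, bounds=bounds).x

rng = np.random.default_rng(2)
omegas = list(rng.normal(size=(64, 2)))
phi = lambda x: 0.5 * x @ x
psi1 = lambda x, w: np.log1p(np.exp(w @ x))    # smooth convex stand-in for psi_{alpha,1}
psi2_grad = lambda x, w: 0.1 * w               # gradient of a linear stand-in for psi_{alpha,2}
x = np.zeros(2)
for _ in range(20):
    x = dc_prox_step(x, omegas, phi, psi1, psi2_grad, bounds=[(-1.0, 1.0)] * 2)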
Appendix C
Supplement to Chapter 4

Proof of Lemma 2.3. Under assumptions (A1), (A2) and (A3), as $k$ increases to infinity the sample average function $F_k(x)$ converges uniformly to $f(x)$ on $X$ (see Theorem 7.48 in [76]). Moreover, $x_k$ converges to $x^*$ with probability 1 from 2.2. Thus $T^X_{\alpha F_k}(x_k)$ converges to $x^*$ with probability 1 as $k$ increases to infinity.

With assumption (B3), there exists a constant $k_2$ such that, for any $k \geq k_2$, $h(x,\omega)$ is differentiable at $T^X_{\alpha F_k}(x_k)$ for almost every $\omega\in\Omega$. We then present the optimality conditions of the original optimization problem at the solution pair $(x^*, \lambda^*)$ and of the proximal map at the solution pair $(T^X_{\alpha F_k}(x_k), \lambda_k)$, respectively:
\begin{align*}
Q x^* + E\big[ \nabla h(x^*, \tilde{\omega}) \big] &= A^\top \lambda^*, \\
Q\, T^X_{\alpha F_k}(x_k) + \frac{1}{k}\sum_{i=1}^{k} \nabla h\big( T^X_{\alpha F_k}(x_k), \omega^i \big) + \frac{1}{\alpha}\big( T^X_{\alpha F_k}(x_k) - x_k \big) &= A^\top \lambda_k.
\end{align*}
As $k$ increases to infinity, by the law of large numbers the difference of the left-hand sides of the above two optimality conditions converges to zero, and hence so does the difference of the right-hand sides. Therefore,
\[
\lim_{k\to\infty} A^\top (\lambda_k - \lambda^*) = 0. \tag{C.1}
\]
Recall that $\lim_{k\to\infty} T^X_{\alpha F_k}(x_k) = x^*$. Then there exists a finite number $k_2'$ such that, for any $k \geq k_2'$, if $r \in (\mathcal{I}^*)^c$ then $a_r^\top T^X_{\alpha F_k}(x_k) < b_r$. This yields $(\mathcal{I}^*)^c \subseteq (\mathcal{I}_k)^c$, which is equivalent to
\[
\mathcal{I}_k \subseteq \mathcal{I}^* \quad \text{for any } k \geq k_2'. \tag{C.2}
\]
We denote the index set $\mathcal{R}_k \triangleq \{ r : \lambda_k^{(r)} > 0 \}$. Because of complementary slackness at $x_k$, we have $\mathcal{R}_k \subseteq \mathcal{I}_k \subseteq \mathcal{I}^*$ for any $k \geq k_2'$. In other words,
\[
\lambda^{*(r)} = \lambda_k^{(r)} = 0 \quad \text{for any } r \in (\mathcal{I}^*)^c \text{ and } k \geq k_2'. \tag{C.3}
\]
Then we show that, for any $r \in \mathcal{I}^*$, $\lim_{k\to\infty} \lambda_k^{(r)} = \lambda^{*(r)}$. First, the limit (C.1) together with (C.3) can be rewritten as
\[
\lim_{k\to\infty}\ \sum_{r=1}^{m} \big( \lambda_k^{(r)} - \lambda^{*(r)} \big) a_r = \lim_{k\to\infty}\ \sum_{r\in\mathcal{I}^*} \big( \lambda_k^{(r)} - \lambda^{*(r)} \big) a_r = 0.
\]
From assumption (B1), namely that the active constraints are linearly independent at the optimal solution $x^*$, we must have $\lim_{k\to\infty} \lambda_k^{(r)} = \lambda^{*(r)}$ for any $r \in \mathcal{I}^*$. Because of the strict complementarity in assumption (B5) at $x^*$, $\lambda^{*(r)} > 0$ for any $r \in \mathcal{I}^*$. Therefore, there exists a finite number $k_2''$ such that, for any $k > k_2''$, $\lambda_k^{(r)} > 0$ for any $r \in \mathcal{I}^*$. It follows that, for any $k \geq k_2''$, $\mathcal{I}^* \subseteq \mathcal{I}_k$. Combining with (C.2), we thus have $\mathcal{I}_k = \mathcal{I}^*$ with probability 1 for any $k \geq \hat{k}_2 = \max\{ k_2, k_2', k_2'' \}$. □
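Lemma 2.3's conclusion, that the active set $\mathcal{I}_k$ of the sample-average proximal point eventually equals the true active set $\mathcal{I}^*$, is easy to observe numerically. In the hypothetical sketch below, a box-constrained quadratic with a sample-average linear term stands in for $F_k$ (the problem data and solver choice are our own), and the reported set of binding coordinates settles as $k$ grows:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
Q = np.diag([2.0, 1.0, 0.5])
c_mean = np.array([-4.0, 1.0, 0.2])   # E[c(omega)]; pins x_1 and x_2 at their bounds
lo, hi = np.zeros(3), np.ones(3)
x_k, alpha = np.full(3, 0.5), 1.0

for k in (10, 100, 1_000, 10_000):
    # Sample-average objective F_k plus the proximal term of the map T^X_{alpha F_k}.
    c_bar = c_mean + rng.normal(size=(k, 3)).mean(axis=0)
    obj = lambda x: 0.5 * x @ Q @ x + c_bar @ x + np.sum((x - x_k) ** 2) / (2 * alpha)
    x_prox = minimize(obj, x_k, bounds=list(zip(lo, hi))).x
    active = np.flatnonzero((x_prox < lo + 1e-6) | (x_prox > hi - 1e-6))
    print(k, active)    # the reported active set settles as k grows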
Proof of Lemma 2.4. Assume $x \neq x'$ and $T_{\alpha\check{g}}\,x \neq T_{\alpha\check{g}}\,x'$, otherwise there is nothing to show. For the proximal mapping $T_{\alpha\check{g}}\,x$, the optimal solution pair $(T_{\alpha\check{g}}\,x, \mu)$ satisfies the following KKT conditions:
\[
\begin{pmatrix} I + \alpha Q & \bar{A}^\top \\ \bar{A} & 0 \end{pmatrix} \begin{pmatrix} T_{\alpha\check{g}}\,x \\ \mu \end{pmatrix} = \begin{pmatrix} x \\ \bar{b} \end{pmatrix}. \tag{C.4}
\]
Since the matrix $\bar{A}$ has linearly independent rows, from the formula for the inverse of a block matrix the unique global optimal solution can be represented as
\[
T_{\alpha\check{g}}\,x = K_\alpha x + J_\alpha \bar{b}, \tag{C.5}
\]
where $K_\alpha = G_\alpha - G_\alpha \bar{A}^\top (\bar{A} G_\alpha \bar{A}^\top)^{-1} \bar{A} G_\alpha$ and $G_\alpha = (I + \alpha Q)^{-1}$. The same representation also holds for $x'$, so that $T_{\alpha\check{g}}\,x' = K_\alpha x' + J_\alpha \bar{b}$. Therefore $T_{\alpha\check{g}}\,x - T_{\alpha\check{g}}\,x' = K_\alpha (x - x')$. Moreover,
\[
\frac{\| T_{\alpha\check{g}}\,x - T_{\alpha\check{g}}\,x' \|}{\| x - x' \|}
= \Bigg( \frac{(x - x')^\top K_\alpha^2 (x - x')}{\| x - x' \|^2} \Bigg)^{1/2}
\leq \theta_{\max}(K_\alpha^2)^{1/2} = \theta_{\max}(K_\alpha).
\]
From the definition, $K_\alpha = G_\alpha^{1/2} \big( I - G_\alpha^{1/2} \bar{A}^\top (\bar{A} G_\alpha \bar{A}^\top)^{-1} \bar{A} G_\alpha^{1/2} \big) G_\alpha^{1/2}$. Moreover, notice that $G_\alpha^{1/2} \bar{A}^\top (\bar{A} G_\alpha \bar{A}^\top)^{-1} \bar{A} G_\alpha^{1/2}$ is a hat matrix, defined formally in [39], with eigenvalues 0 and 1. We then derive
\[
\theta_{\max}(K_\alpha) \leq \theta_{\max}(G_\alpha^{1/2})\, \theta_{\max}\big( I - G_\alpha^{1/2} \bar{A}^\top (\bar{A} G_\alpha \bar{A}^\top)^{-1} \bar{A} G_\alpha^{1/2} \big)\, \theta_{\max}(G_\alpha^{1/2}) \leq \theta_{\max}(G_\alpha) < 1.
\]
Hence, the proximal mapping $T_{\alpha\check{g}}$ has the contraction property
\[
\| T_{\alpha\check{g}}\,x - T_{\alpha\check{g}}\,x' \| \leq \theta_{\max}(G_\alpha) \cdot \| x - x' \|, \qquad \forall\, x, x' \in \bar{X}. \quad \Box
\]
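Both the closed form (C.5) and the contraction factor $\theta_{\max}(G_\alpha)$ can be checked numerically by solving the KKT system (C.4) directly. A small sketch with randomly generated problem data (all sizes and the value of $\alpha$ are arbitrary choices of ours):

import numpy as np

rng = np.random.default_rng(3)
n, m, alpha = 6, 2, 0.5
M = rng.normal(size=(n, n))
Q = M @ M.T                              # a symmetric positive definite Q
A_bar = rng.normal(size=(m, n))          # full row rank with probability 1
b_bar = rng.normal(size=m)

def prox(x):
    # Solve the KKT system (C.4) for the proximal point T_{alpha g}(x).
    K = np.block([[np.eye(n) + alpha * Q, A_bar.T],
                  [A_bar, np.zeros((m, m))]])
    return np.linalg.solve(K, np.concatenate([x, b_bar]))[:n]

G = np.linalg.inv(np.eye(n) + alpha * Q)   # G_alpha from the proof
theta = np.linalg.eigvalsh(G).max()        # contraction factor theta_max(G_alpha) < 1
x, y = rng.normal(size=n), rng.normal(size=n)
assert np.linalg.norm(prox(x) - prox(y)) <= theta * np.linalg.norm(x - y) + 1e-9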
Proof of Lemma 2.6. Since $k_0$ is a large integer, we use the approximation $\log(1 - \Lambda/i) \approx -\Lambda/i$ for $i \geq k_0$. Then
\[
Z_\Lambda(k_0, k) = \exp\Bigg\{ \sum_{i=k_0}^{k} \log\Big( 1 - \frac{\Lambda}{i} \Big) \Bigg\} \approx \exp\Bigg\{ -\sum_{i=k_0}^{k} \frac{\Lambda}{i} \Bigg\}.
\]
Besides,
\[
\sum_{i=k_0}^{k} \frac{\Lambda}{i} \geq \int_{k_0}^{k} \frac{\Lambda}{u}\, du = \Lambda \log\Big( \frac{k}{k_0} \Big).
\]
Therefore, claim 1 holds. Next consider the summation $\sum_{i=\hat{k}+1}^{k} e^{-i} Z_\Lambda(i, k)$. Since $Z_\Lambda(k_0, k) \lesssim (k_0/k)^\Lambda$, we have
\[
\sum_{i=\hat{k}+1}^{k} e^{-i} Z_\Lambda(i, k) \lesssim \sum_{i=\hat{k}+1}^{k} e^{-i} \Big( \frac{i}{k} \Big)^\Lambda \leq \frac{1}{k^\Lambda} \int_{\hat{k}+1}^{k} e^{-u} u^\Lambda\, du.
\]
Through some computation, the integral can be bounded as
\[
\int_{\hat{k}+1}^{k} e^{-u} u^\Lambda\, du \lesssim \begin{cases} k^\Lambda e^{-k}, & \Lambda \geq 1, \\ k\, e^{-k}, & \Lambda < 1. \end{cases}
\]
Therefore, the summation is bounded as follows:
\[
\sum_{i=\hat{k}+1}^{k} e^{-i} Z_\Lambda(i, k) \lesssim \begin{cases} e^{-k}, & \Lambda \geq 1, \\ k^{1-\Lambda} e^{-k}, & \Lambda < 1. \end{cases}
\]
We then consider the summation $\sum_{i=\hat{k}+1}^{k} \frac{1}{i^2} Z_\Lambda(i, k)$:
\[
\sum_{i=\hat{k}+1}^{k} \frac{1}{i^2} Z_\Lambda(i, k) \lesssim \sum_{i=\hat{k}+1}^{k} \frac{1}{i^2} \Big( \frac{i}{k} \Big)^\Lambda \leq \frac{1}{k^\Lambda} \int_{\hat{k}+1}^{k} u^{\Lambda-2}\, du.
\]
Therefore,
\[
\sum_{i=\hat{k}+1}^{k} \frac{1}{i^2} Z_\Lambda(i, k) \lesssim \begin{cases} \dfrac{\log k}{k}, & \Lambda = 1, \\[4pt] \dfrac{1}{(\Lambda-1)\,k}, & \Lambda > 1, \\[4pt] \dfrac{1}{(1-\Lambda)\,k^\Lambda}, & \Lambda < 1. \end{cases}
\]
We finally consider the summation $\sum_{i=\hat{k}+1}^{k} \frac{1}{i} Z_\Lambda(i, k)$:
\[
\sum_{i=\hat{k}+1}^{k} \frac{1}{i} Z_\Lambda(i, k) \lesssim \sum_{i=\hat{k}+1}^{k} \frac{1}{i} \Big( \frac{i}{k} \Big)^\Lambda \leq \frac{1}{k^\Lambda} \int_{\hat{k}+1}^{k} u^{\Lambda-1}\, du \approx \frac{1}{\Lambda}. \quad \Box
\]
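The approximation behind claim 1, namely $Z_\Lambda(k_0, k) \lesssim (k_0/k)^\Lambda$, can also be sanity-checked numerically; in the sketch below the values of $\Lambda$ and $k_0$ are arbitrary choices:

import numpy as np

def Z(Lam, k0, k):
    # Product form of Z_Lambda(k0, k) = prod_{i = k0}^{k} (1 - Lambda / i).
    i = np.arange(k0, k + 1)
    return np.prod(1.0 - Lam / i)

Lam, k0 = 1.5, 100
for k in (10**3, 10**4, 10**5):
    print(k, Z(Lam, k0, k) / (k0 / k) ** Lam)   # the ratio settles near a constant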
Abstract
To transform data into insights for decision-making, one needs to go through a scientific process of predictive and prescriptive analytics. However, a lack of interaction between these two phases can limit what is achieved in each domain of knowledge. In this dissertation, we study both sequential and coupled approaches that integrate predictive and prescriptive analytics, and we develop convergent algorithms that provide mathematically justifiable decisions through the lenses of optimization, statistics, and prediction models.

This dissertation begins with an introduction to stochastic optimization, followed by three chapters of new research results. These three chapters fall into two parts: the first focuses on methodologies that assume a distribution describing the uncertainty can be obtained prior to decision optimization, whereas the second is devoted to a method in which the distribution of the uncertain quantities is unknown, although data regarding these variables can be collected as necessary. Further details of these parts are provided below.

In the first part, we assume that we are able to infer a distribution associated with the uncertainty using statistical methodology. These types of problems lead to stochastic programming (SP) models, for which we study sampling-based algorithms. In Chapter 2, we first investigate a class of two-stage stochastic quadratic programming problems and develop a stochastic decomposition algorithm with a sublinear convergence rate. We also design a stopping rule using bootstrap estimators to efficiently deliver an asymptotically optimal decision. In Chapter 3, we then address a new class of SP problems whose distinguishing feature is that the second-stage objective function is linearly parameterized by the first-stage decision variable, in addition to the standard linear parameterization in the constraints. For a convex second-stage objective function, the SP is a difference-of-convex program, although the difference-of-convex decomposition itself is not practically suitable for computation. This feature leads to the notion of a generalized critical point, to which almost-sure subsequential convergence of the combined sample average approximation and difference-of-convex algorithm is established via regularization.

In the second part of the dissertation, we develop a coupled scheme of estimation and optimization for stochastic programming problems with latently endogenous uncertainty. Without assuming any prior knowledge of the dependency, we develop a novel approach that couples prediction models, for estimating the local dependency, with an optimization method, for seeking a better decision. We prove that the sequence of decisions produced by the coupled scheme converges to a first-order stationary solution with probability 1. Numerical experiments show that the coupled approach improves significantly upon the sequential scheme. By harnessing algorithmic advances in stochastic programming and the power of prediction, we can deliver data-driven decisions with better statistical optimality and computational efficiency.