UNIVERSITY OF SOUTHERN CALIFORNIA

DOCTORAL THESIS

Learning Enabled Optimization for Data and Decision Sciences

Author: Yunxiao DENG
Supervisor: Dr. Suvrajeet SEN

USC Graduate School
Doctor of Philosophy (Daniel J. Epstein Department of Industrial and Systems Engineering)

May 2019
Contents

List of Figures
List of Tables
1 Introduction: An Integrated View of Analytics
  1.1 Algorithmic Stochastic Programming Background
    1.1.1 Sample Average Approximation (SAA)
    1.1.2 Stochastic Decomposition (SD)
2 Coalescing Data and Decision Sciences for Analytics
  2.1 Predict-then-Optimize with Deterministic Constraints: Uncertainty in the Objective
    2.1.1 Static Predict-then-Optimize
    2.1.2 Evolving Predict-then-Optimize
  2.2 SP: Cost and Constraint Uncertainty
    2.2.1 SP as a Bridge and the Structure of Hedged Solutions
    2.2.2 SP Model Validation
    2.2.3 SP Model Verification
3 LEO: Models and Concepts
  3.1 Introduction
  3.2 Connections to the Literature
  3.3 Learning Enabled Optimization
    3.3.1 Model Instantiation
    3.3.2 Examples of LEO Models
  3.4 Parallel Replications and Statistical Optimality
  3.5 Model Validation, Assessment and Selection
    3.5.1 Metrics and Model Validation
    3.5.2 Comparison across LEO Models
  3.6 Illustrative Computations
  3.7 A Model with Disjoint Spaces: LEO-ELECEQUIP (Time-Series Data)
    3.7.1 Month by Month Validation Results for 2001-2002
    3.7.2 Snapshot Computational Results for ELECEQUIP
  3.8 A Model with Shared Spaces: LEO-Wyndor
    3.8.1 Results for Error Terms
    3.8.2 Results for Decisions and Optimal Value Estimates
4 SVM using Stochastic Decomposition
  4.1 SD for Support Vector Machines
  4.2 Stochastic Decomposition Algorithm for SVM
  4.3 SVM Instances using SD algorithm
  4.4 Conclusion
5 Computational OR Exchange (cORe) Platform
  5.1 Background and Motivation
  5.2 Characteristics and Designs for the cORe Platform
  5.3 cORe Applications
Bibliography
List of Figures

1.1 Types/Phases of Analytics. Source: Gartner Analytics. Integrative Analytics box added by the authors.
2.1 Network Routing
2.2 Particle Filtering Steps
2.3 Network of Example 3.1
2.4 Solution of Deterministic LP
2.5 Solution of Stochastic LP
2.6 SP Validation for SSN Problem (Sen, Doverspike, and Cosares, 1994)
2.7 Statistical Learning and Learning Enabled Optimization
3.1 Statistical Learning and Learning Enabled Optimization
3.2 LEO-ELECEQUIP: Stochastic Dominance of SLP Validated Objectives over DAF
3.3 The Advertising Data Set (Source: James et al., 2011)
3.4 q-q plot before and after data preprocessing
3.5 LEO-Wyndor: Stochastic Dominance of Validated Objectives under Alternative Models
5.1 cORe: Computational Operations Research Exchange Platform Design
5.2 Practical Applications for Community Data
5.3 Homepage of the cORe Platform
5.4 MnM Problem
5.5 SL Phase of MnM Problem
5.6 Jupyter Notebook in R (MnM Problem)
5.7 LEO-Wyndor in cORe Platform
5.8 MLR of LEO-Wyndor in cORe Platform
5.9 Validation Phase of LEO-Wyndor in cORe Platform
List of Tables

2.1 High Demand Distribution
2.2 Low Demand Distribution
2.3 Verification of Electricity Dispatch (10% Wind Penetration) (Gangammanavar, Sen, and Zavala, 2016)
2.4 Data for the Wyndor Glass Problem (Hillier and Lieberman, 2012)
2.5 Validation Exercise using Standard SP
3.1 LEO-ELECEQUIP: Monthly Back-Testing Costs
3.2 LEO-ELECEQUIP: Comparison of Solutions under Alternative Models
3.3 LEO-ELECEQUIP: Hypothesis Test Results under Alternative Models
3.4 LEO-ELECEQUIP: Errors under Alternative Models
3.5 Data for the Wyndor Glass Problem (Hillier and Lieberman, 2012)
3.6 LEO-Wyndor: Comparison of Chi-square Test
3.7 LEO-Wyndor: Comparison of Solutions for Alternative Models
3.8 LEO-Wyndor: Hypothesis Test Results for Alternative Models
3.9 LEO-Wyndor: Errors for Alternative Models
3.10 LEO-Wyndor: Kruskal-Wallis Test Results (p > 0.01)
4.1 SVM with Heart Disease Dataset
4.2 SVM with Adult (Census Income) Dataset
4.3 SVM with Breast Cancer Dataset
4.4 SVM with Iris Dataset
4.5 SVM with Car Evaluation Dataset
Chapter 1
Introduction: An Integrated View of Analytics
In a recent survey (KPMG 2014), over one hundred CFOs and CIOs of large organizations (over a billion dollars in annual turnover) were interviewed regarding the effectiveness of business analytics. Over 96% of those surveyed acknowledged that they “could do better with big data, and make better use of analytics.” Two of the top three significant questions which emerged from the KPMG survey relate to: a) “How will predictive analytics drive future decision making?” and b) “What technology will be required to operationalize data and analytics within the organization?” We interpret these questions from an OR/MS perspective as a) how should predictive models be integrated to support decisions using OR/MS models? and b) what formal support can we provide decision-makers so that they can assess the quality of decisions with a quantifiable degree of confidence? This dissertation coalesces concepts from data and decision sciences as the foundational basis from which to answer these questions.
A class of applications which come under the umbrella of Analytics of Things (Tom Davenport, Deloitte Insights, 2014) builds upon the “sense-and-respond” paradigm in which the Internet is used to communicate data for the purposes of Analytics. However, transforming insight into action is not as straightforward as one may be led to believe. To recognize the hurdles, consider a common view of analytics, as summarized by Figure 1.1 (source: Gartner Analytics). While the mapping of different forms of analytics to “hindsight, insight, foresight” is itself very insightful, it also highlights the challenges of integration. Because of disparate managerial levels, data types, sources, and frequency of updates, as well as disparate mathematical assumptions, the specific areas identified in Figure 1.1 tend to operate within their own silos. This dissertation is intended to help readers bridge silos at the higher end of the value spectrum.

FIGURE 1.1: Types/Phases of Analytics. Source: Gartner Analytics. Integrative Analytics box added by the authors.

The fundamentals of data science draw upon statistical (or machine) learning (SL/ML) and optimization. In addition, there is a wealth of data that is being collected, curated and disseminated via the world-wide-web. If this growing deluge of data can be harnessed by powerful decision models, then the OR/MS community will not only provide answers to questions raised in the KPMG report, it will also be legitimate in laying claims to thought-leadership in this emerging area. One of the goals of this dissertation is to help researchers recognize the potential for exploiting the interdependence between data and decisions. While researchers in both these areas are making important strides within each silo, there are many opportunities for advances in the interstices of data and decision sciences. From modeling to algorithmic frameworks, and model validation to model selection, there are a host of open questions that remain unresolved.
Some of the early work in the area started with research on operations management (OM) models with special structures, such as newsvendor problems. For instance, Liyanage and Shanthikumar, 2005 and more recently Ban and Rudin, 2018 studied the integration of optimization and learning to identify optimal inventory ordering policies. Both papers use a specialized form of the newsvendor model to illustrate the potential for joint models of learning and optimization (see also Cooper, Homem-de-Mello, and Kleywegt, 2015). Other OM applications motivating joint estimation and optimization models consider problems arising from combining learning and pricing (Besbes and Zeevi, 2015, Cooper, Homem-de-Mello, and Kleywegt, 2015, Harsha, Natarajan, and Subramanian, 2015). Another avenue of application-specific use of learning within optimization arises from the introduction of specialized regularization in the context of portfolio optimization in Ban, Karoui, and Lim, 2017. Finally, we also recommend the broad personal perspective on data-driven OM of Simchi-Levi, 2013.

As for algorithmic frameworks which tie data and decision sciences, an early paper is the work on Directed Regression (Kao, Roy, and Yan, 2009). In this setup, regression coefficients are chosen based on a combined objective consisting of a loss function associated with regression, as well as a quadratic cost function as in standard unconstrained optimization. Subsequently, Kao and Van Roy, 2014 extended these ideas to another quadratic optimization model whose optimum solution depends on an unknown variance-covariance matrix of a Gaussian random variable. Another viewpoint at the interface between predictive and prescriptive analytics is presented in Bertsimas and Kallus, 2014. Their work demonstrates that in the presence of dependencies among random variables, approaches which accommodate learning perform much better than those which do not (e.g., optimization with sample average approximation).

Despite much interest in the two fundamental building blocks, i.e., statistics and optimization, there are gaps in the fundamental questions that these areas serve. In general, the algorithmic aspects of both building blocks derive from common variational principles (see Rockafellar and Uryasev, 2013). However, there is a fair amount of separation between other aspects such as modeling, model validation, and model selection. In the world of SL/ML, objectives are typically stated in the form of “goodness of fit”. This leads to a specific form of optimization objective. For instance, a lower bound of zero is often valid for optimization models arising in SL/ML, and one can take advantage of such a priori knowledge within optimization. On the other hand, model assessment and selection are not as common in the optimization literature. Recent exceptions to this comment are the works of Den Boer and Sierag, 2016 and Deng and Sen, 2018. The first of these formalizes the notion of decision-based model selection, whereas the latter provides an “end-to-end” framework covering modeling, algorithms, model assessment and selection. This framework, known as “Learning Enabled Optimization”, will be summarized in Chapter 3 of this thesis.
In keeping with our goals to accomplish more with data in OR/MS applications, there have been some attempts to have optimization methods guide information gathering for predictive analytics in the work of Frazier, 2012 and Ryzhov, Powell, and Frazier, 2012, which are intended to help an experimentalist improve the effectiveness of predictive models by using sensitivity of the response (using a concept known as the knowledge gradient) to design experiments. This line of work uses algorithmic notions of optimization for experimental design (including simulation experiments). A more decision-oriented framework is proposed in Kao, Roy, and Yan, 2009, where the regression coefficients are chosen based on a combined objective consisting of a loss function associated with regression, as well as that of optimization. The assumptions of Kao, Roy, and Yan, 2009, namely, unconstrained quadratic optimization of both problems, render their simultaneous optimization manageable. However, if the optimization model were inequality constrained (as in many applications), such simultaneous optimization would lead to bilevel stochastic programs, which are much harder than the SP setting of the LEO model. Another viewpoint at the interface between predictive and prescriptive analytics is presented in the recent paper by Bertsimas and Kallus, 2014. Their work demonstrates that in the presence of dependencies among random variables, using predictive models which capture dependencies (e.g., k-nearest neighbors) leads to more cost-effective decisions than using SAA without exploiting dependency. It should be noted that the setting of the above paper is such that the random variables and the decision variables assume values in disjoint spaces (to be formalized in the next chapter as LEO models with disjoint spaces). The LEO models presented in this thesis not only accommodate the above case, but they are also able to accommodate situations in which both SL and SP models share the same spaces. In this way, the LEO model can accept a statistical model whose parameters may be continuous random variables, as in many types of regression. This may lead to an infinite-dimensional optimization problem for which we propose an approximate solution through a new concept of statistical optimality.

As for the Operations Management (OM) literature, the primary focus of statistical learning has been on specific classes of problems (e.g., newsvendor models). For instance, Liyanage and Shanthikumar, 2005 and more recently Ban and Rudin, 2018 have studied the integration of optimization and learning to identify optimal inventory ordering policies. Both papers use the specialized form of the newsvendor model to illustrate the potential for learning and optimization. Another avenue of application-specific use of learning within optimization arises from the introduction of specialized regularization in the context of portfolio optimization in Ban, Karoui, and Lim, 2017. Moreover, Lim, Shanthikumar, and Shen, 2006 have summarized different ideas for modeling uncertainty and identifying optimal decisions for unknown parameters using robust optimization. More generally, the types of questions which motivate our work may be summarized as follows.
• Given that data has become such a widely available commodity, and SL models have begun to play a major role in most operations, are there OR/MS questions which might benefit from leveraging tools of SL and SP within a unified framework?

• If SL models are introduced into an SP framework, what is the appropriate way to incorporate them, even though SL models may be governed by continuous random variables?

• How should we compute optimal decisions in cases involving infinite-dimensional models (i.e., extremely large/big data sets)?

• How should we assess the effectiveness of these models so that we can support decisions based on validation procedures grounded in statistical principles?
The LEO approach provides a relatively general framework for decisions via SP, and uncertainty modeling via SL. This not only allows a more streamlined approach to building a statistical model to support decision-making, but also recognizes that, as in SL, the world of SP could benefit from providing decision-makers with predictions of future costs, as well as estimates of the validity of these predictions. In addition, the LEO approach is based on choosing the most promising model from among a collection of alternatives which seem relevant in a learning process. For each model-type, we will carry out a collection of tests both before and after optimization to help guide model-choice. In this context, we suggest statistical estimates and tests which support this choice.
This thesis is organized as follows. Chapter 2 includes preliminary results for combining data and decision sciences, in which we present “Predict-then-Optimize” models and stochastic optimization models with random variables in cost/constraint coefficients. In each class of models, we provide a couple of illustrative examples associated with data sources. In Chapter 3 we present general concepts of Learning Enabled Optimization, with a collection of metrics for model validation and assessment. Chapter 4 focuses on solving Support Vector Machine instances using the convex SD algorithm, which is an extension of the stochastic cutting plane method. In addition, we introduce statistical optimality results which extend the result of Sen and Liu, 2016 to a situation in which the probability of optimality of a compromise decision can be verified up to a certain accuracy; the computational results and detailed problem instances are also included in Chapter 4. Finally, the Computational Operations Research Exchange (cORe) Platform is introduced in Chapter 5, which is intended to help users build the entire pipeline of LEO applications in software. There are other practical applications using community data that can be implemented on the cORe platform.
1.1 Algorithmic Stochastic Programming Background

Before we present our work, brief introductions to the Sample Average Approximation method and the Stochastic Decomposition algorithm are in order.
1.1.1 Sample Average Approximation (SAA)

Sample Average Approximation is a standard sampling-based SP methodology, which involves replacing the expectation in the objective function by a sample average over a finite number of data points. Suppose we have a sample of size K; an SAA instance is as follows:

$$\min_{x \in X} \; F_K(x) = c^\top x + \frac{1}{K} \sum_{i=1}^{K} h(x, \xi^i). \qquad (1.1)$$
As an overview, the SAA process may be summarized as follows (a small illustrative Python sketch follows the list).

1. Choose a sample size K, and sample K outcomes from the training data-set.
2. (Optimization Step). Create the approximation function F_K(x), and solve an SAA instance (1.1).
3. (Validation Step). Take the decision from F_K(x), follow the steps in section 4, and estimate the validated confidence interval, generalization error and optimization error.
4. If the estimated objective does not agree with the validated confidence interval, or the generalization error and optimization error are not acceptable, increase the sample size K and repeat from step 1.
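To make the loop above concrete, here is a minimal Python sketch of SAA applied to a newsvendor-style recourse function h(x, ξ) = p·max(ξ − x, 0) + s·max(x − ξ, 0); the cost values, the exponential demand model, and the 5% agreement tolerance are illustrative assumptions rather than anything specified in the text. Doubling K until the training and validation estimates agree mirrors step 4.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
c, p, s = 1.0, 3.0, 0.5             # hypothetical order, shortage, holding costs

def h(x, xi):                        # second-stage (recourse) cost
    return p * np.maximum(xi - x, 0.0) + s * np.maximum(x - xi, 0.0)

def F_K(x, sample):                  # F_K(x) = c*x + (1/K) * sum_i h(x, xi_i)
    return c * x + h(x, sample).mean()

K = 100
while True:
    train = rng.exponential(scale=10.0, size=K)           # step 1: sample
    x_K = minimize_scalar(F_K, bounds=(0.0, 100.0), args=(train,),
                          method="bounded").x             # step 2: optimize
    valid = rng.exponential(scale=10.0, size=50_000)      # step 3: validate
    est, val = F_K(x_K, train), F_K(x_K, valid)
    if abs(est - val) <= 0.05 * val or K >= 100_000:      # step 4: agreement?
        break
    K *= 2                                                # grow sample, repeat
print(f"K={K}, x_K={x_K:.2f}, estimated={est:.2f}, validated={val:.2f}")
```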
Assumption SAA-a. The expectation function f(x) remains finite and well defined for all x ∈ X. For δ > 0 we denote

$$S(\delta) := \{x \in X : f(x) \le f^* + \delta\} \quad \text{and} \quad \hat{S}_K(\delta) := \{x \in X : \hat{f}_K(x) \le \hat{f}_K^* + \delta\},$$

where f* denotes the true optimal objective, and f̂*_K denotes the optimal objective of the SAA problem with sample size K.

Assumption SAA-b. There exists a function κ : Ξ → ℝ₊ such that its moment-generating function M_κ(t) is finite valued for all t in a neighborhood of zero and

$$|F(x', \xi) - F(x, \xi)| \le \kappa(\xi)\, \|x' - x\|$$

for a.e. ξ ∈ Ξ and all x', x ∈ X.
Assumption SAA-c. There exists a constant λ > 0 such that for any x', x ∈ X the moment-generating function M_{x',x}(t) of the random variable [F(x', ξ) − f(x')] − [F(x, ξ) − f(x)] satisfies

$$M_{x',x}(t) \le \exp\!\left(\lambda^2 \|x' - x\|^2 t^2 / 2\right), \quad \forall\, t \in \mathbb{R}.$$

From assumption SAA-b,

$$[F(x',\xi) - f(x')] - [F(x,\xi) - f(x)] \le 2L\,\|x' - x\| \quad \text{w.p. } 1,$$

and one may take λ = 2L.
Proposition 1 Suppose that assumptions SAA(a-c) hold, the feasible set X has a finite diameter D, and let δ_u > 0, δ ∈ [0, δ_u), ε ∈ (0, 1), and L = E[κ(ξ)]. Then for a sample size K satisfying

$$K(\varepsilon, \delta) \ge \frac{8\lambda^2 D^2}{(\delta_u - \delta)^2} \left[\, n \ln\!\left(\frac{8 L D}{\delta_u - \delta}\right) + \ln\!\left(\frac{1}{1 - \varepsilon}\right) \right], \qquad (1.2)$$

we have

$$\mathrm{Prob}\!\left(\hat{S}_K(\delta) \subseteq S(\delta_u)\right) \ge \varepsilon.$$

Proof: This is Corollary 5.19 of Shapiro, Dentcheva, and Ruszczynski, 2009, with the assumption that the sample size K is larger than that required by large deviations theory (see 5.122 of Shapiro, Dentcheva, and Ruszczynski, 2009).
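For a sense of scale, the snippet below evaluates the right-hand side of (1.2) for a hypothetical problem; all constants (λ, D, L, n) are illustrative stand-ins for the quantities defined above.

```python
import math

def saa_sample_size(eps, delta, delta_u, lam, D, L, n):
    """Right-hand side of (1.2): a sample size K(eps, delta) guaranteeing
    Prob(S_K(delta) within S(delta_u)) >= eps under SAA(a-c)."""
    gap = delta_u - delta
    bound = (8 * lam**2 * D**2 / gap**2) * (n * math.log(8 * L * D / gap)
                                            + math.log(1.0 / (1.0 - eps)))
    return math.ceil(bound)

# e.g., n = 10 decision variables, unit diameter and Lipschitz constants:
print(saa_sample_size(eps=0.95, delta=0.0, delta_u=0.1, lam=1.0, D=1.0, L=1.0, n=10))
```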
1.1.2 Stochastic Decomposition (SD)

Unlike SAA, which separates sampling from optimization, SD is based on sequential sampling, and the method discovers sample sizes “on-the-fly” (Higle and Sen, 1991, Higle and Sen, 1994). Because of sampling, any stochastic algorithm must contend with variance reduction in objective values as well as solutions. SD uses M independent replications of value functions denoted f_ν, ν = 1, ..., M. Each of these functions is a max-function whose affine pieces represent Sample Average Subgradient Approximations (SASA). Because SD uses proximal point iterations, Higle and Sen, 1994 shows that the maximum number of affine pieces is n₁ + 3, where n₁ is the number of first-stage decision variables, and these pieces are automatically identified using positive Lagrange multiplier estimates during the iterations. In theory, one can control the number of affine pieces to be smaller, but that can also be chosen “on-the-fly” depending on the number of first-stage decision variables n₁. When the number of affine functions reduces to only one, the SD method reduces to a proximal stochastic variance reduction method (prox-SVRG) (Xiao and Zhang, 2014). Among other strengths, such as parallelizability and variance reduction, one of the main usability issues which SD overcomes is that when a model has a large number of random variables (as in the case of multi-dimensional random variables) it does not require users to choose a sample size, because the sequential process automatically discovers an appropriate sample size. Together with the notion of Statistical Optimality as set forth in Chapter 3, SD provides a very convenient optimization tool for LEO models.
For SLP models, Sen and Liu, 2016 have already presented significant computational evidence of the advantage of SD over plain SAA. The reduced computational effort also facilitates replications for variance reduction (VR). VR in SD is achieved by creating the so-called compromise decision, denoted x_c, which minimizes the grand-mean approximation

$$\bar{F}_M(x) := \frac{1}{M} \sum_{\nu=1}^{M} f_\nu(x),$$

where {f_ν}_{ν=1}^{M} denotes the terminal value function approximations, one for each replication ν. Suppose that x_ν(ε) is an ε-optimal solution of min{f_ν(x) : x ∈ X} and x_c(δ) is a δ-optimal solution of min{F̄_M(x) : x ∈ X}. Then Sen and Liu, 2016 have proved consistency in the sense that lim_{δ→0} Pr(|F̄_M(x_c(δ)) − f*| ≥ t) → 0. Here are the critical assumptions of SD (Higle and Sen, 1996b).

Assumption SD-a. The objective functions in the first and second stage models are either linear or quadratic, and all constraints are linear. Moreover, the set of first stage solutions is compact.

Assumption SD-b. The second stage optimization problem is feasible, and possesses a finite optimal value for all x ∈ X and outcomes ξ (i.e., the relatively complete recourse assumption holds).
Assumption SD-c. The second stage constraint functions are deterministic (i.e., fixed recourse), although the right-hand side can be governed by random variables. The set of outcomes of the random variables is compact.

Assumption SD-d. The recourse function h is non-negative. So long as a lower bound on the optimal value is known, we can relax this assumption (Higle and Sen, 1996b).

A high-level structure of the SD algorithm can be summarized as follows (Sen and Liu, 2016). For greater specifics regarding two-stage SD, we refer to Higle and Sen, 1991, Higle and Sen, 1994, and Higle and Sen, 1996b, and for the multi-stage version we refer to Sen and Zhou, 2014. A sketch of the compromise step (step 7) appears after the list.
1. (Initialize). Let ν represent the number of completed replications, and set ν = 0.
2. (Out-of-Sample Loop). Increment ν and start the next replication.
3. (In-Sample Loop). Add one sampled outcome to the available samples and update the empirical frequencies.
4. (Updated Value Function Approximation). Using the new outcome from step 3 and previously generated approximations, update the value function approximation f_ν^k(x).
5. (Optimization Step). Solve the regularization of f_ν^k(x) from step 4, and update an incumbent solution for the first stage.
6. (In-Sample Stopping Rule). If an in-sample stopping rule is satisfied, output the incumbent solution x_ν and continue to step 7. Else repeat from step 3.
7. (Out-of-Sample Stopping Rule). If the number of replications is greater than or equal to M, calculate a compromise decision x_c using the set {x_ν}_{ν=1}^{M}. Else, repeat from step 2.
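Once each replication ends with a max-affine terminal approximation f_ν, the compromise step is an ordinary LP. The sketch below is a minimal illustration, assuming a hypothetical linear first-stage cost c and a polyhedral feasible set {x ≥ 0 : Ax ≤ b}; it computes x_c via the standard epigraph reformulation.

```python
import numpy as np
from scipy.optimize import linprog

def compromise_decision(c, cuts, A, b):
    """Minimize F_M(x) = c'x + (1/M) * sum_nu max_j (alpha_j + beta_j'x)
    over X = {x >= 0 : A x <= b}. `cuts[nu]` holds the (alpha, beta)
    pairs of replication nu's terminal max-affine approximation."""
    n, M = len(c), len(cuts)
    obj = np.concatenate([c, np.full(M, 1.0 / M)])  # variables: [x, eta_1..eta_M]
    rows, rhs = [], []
    for nu, pieces in enumerate(cuts):
        for alpha, beta in pieces:                  # beta'x - eta_nu <= -alpha
            row = np.zeros(n + M)
            row[:n], row[n + nu] = beta, -1.0
            rows.append(row)
            rhs.append(-alpha)
    for a_i, b_i in zip(A, b):                      # feasible-set constraints on x
        rows.append(np.concatenate([a_i, np.zeros(M)]))
        rhs.append(b_i)
    res = linprog(obj, A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(0, None)] * n + [(None, None)] * M)
    return res.x[:n]

# two replications, one-dimensional x with X = {0 <= x <= 5}:
cuts = [[(0.0, np.array([-1.0])), (-4.0, np.array([1.0]))],
        [(0.5, np.array([-1.0])), (-3.5, np.array([1.0]))]]
print(compromise_decision(np.array([0.1]), cuts, np.array([[1.0]]), np.array([5.0])))
```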
The value function approximation for replication ν is denoted f_ν and the terminal solution for that replication is x_ν. Note that we generate sample average subgradient approximations (SASA) using K_ν(ε) observations. Since these observations are i.i.d., the in-sample stopping rule ensures that an unbiased estimate of the second stage objective is used for the objective function estimate at x_ν. Hence, the Central Limit Theorem (CLT) implies that [K_ν(ε)]^{1/2}[f(x_ν) − f_ν(x_ν)] is asymptotically normal N(0, σ_ν²), where σ_ν² < ∞ denotes the variance of f_ν(x_ν). Since

$$N = \min_{\nu} K_\nu(\varepsilon), \qquad (1.3)$$

it follows that the error [f(x_ν) − f_ν(x_ν)] is no greater than O_p(N^{-1/2}). The following result provides the basis for compromise solutions x_c, as proved in Sen and Liu, 2016.
Proposition 2 Suppose that assumptions SD(a-d) hold. Suppose x̄ is defined as in Theorem 2, and x_c = x̄. Then,

a) x_c solves

$$\min_{x \in X} \; \bar{F}_M(x) := \frac{1}{M} \sum_{\nu=1}^{M} f_\nu(x), \qquad (1.4)$$

b)

$$f(x_c) \le \bar{F}_M(x_c) + O_p\!\left((NM)^{-1/2}\right), \qquad (1.5)$$

c) let x_c(δ) denote a δ-optimal solution to (1.4), and let f* denote the optimal value of the problem; then

$$\lim_{\delta \to 0} \|\bar{x}(\delta) - x_c(\delta)\| \to 0 \quad (\text{w.p. } 1), \qquad (1.6)$$

d)

$$\lim_{\delta \to 0} P\!\left(\,|\bar{F}_{\delta,N}(\bar{x}(\delta)) - f^*| \ge t\,\right) \to 0 \quad \text{for all } t \ge 0. \qquad (1.7)$$

Proof: See Sen and Liu, 2016.
Chapter 2
Coalescing Data and Decision Sciences for Analytics
The main structure of this chapter pertains to two classes of decision models:

1. Deterministic optimization models for which cost parameters are estimated via predictive analytics. To date, most practical interfaces between data and decision sciences have been studied via a sequential process, sometimes referred to as “Predict-then-Optimize” (PO) (Elmachtoub and Grigas, 2017). We begin our presentation using this basic idea for two examples which highlight how modern data sources have altered the landscape of decisions associated with some traditional questions.

2. Stochastic optimization models in which cost as well as constraint coefficients (including right-hand side parameters) are modeled using random variables. While this approach falls under traditional stochastic programming (SP) (Birge and Louveaux, 2011), we will address certain non-traditional aspects such as model validation. These models are also viewed in the PO setting with the emphasis on model validation, which highlights the impact of data science on SP. We refer to Wallace and Ziemba, 2005 for a more comprehensive collection of applications.

This thesis also provides links to data sources which may be used for the pedagogical examples discussed below. Finally, the concluding section presents some open problems at the interface between data and decision sciences.
2.1 Predict-then-Optimize with Deterministic Constraints: Uncertainty in the Objective

We begin with two applications for which the set of feasible decisions is deterministic, whereas the objective function for the models is not known precisely. The first example is a thoroughly revamped version of the famous Diet Problem (DP), and the second one is the ubiquitous Dynamic Routing Problem (DRP). In the first model (DP, see subsection 2.1.1), the objective function represents an ambiguous quantity (e.g., ratings), and as the model is exercised many times, by many individuals, the errors of prediction reduce; hence the description as “Collaborative Filtering”. For this setting, once the objective function is revealed, it stays the same until the decision model is solved; hence it is a static estimate of the objective function. In the case of DRP (see subsection 2.1.2), the objective function changes during the decision process, as updated sensor data becomes available. The updates for DRP are based on “Particle Filtering”, and only a subset of decisions is implemented, while changes to other (future) decisions are possible as the cost vector evolves. In both cases, the data science aspect deals with the concept of “filtering”, which refers to the act of prediction with reduced errors as larger amounts of data become available to the prediction method.
2.1.1 Static Predict-then-Optimize

Example 1 (Meal Planning for the New Millennium (MnM Problem)) The famous Diet Problem was first stated by George Stigler (Stigler, 1945), who also used a heuristic to find an approximate solution for a data set used in the 1930s and 40s. While there is continued pedagogical value of the Diet Problem for LP formulations, there has always been the criticism that in the absence of a reasonable objective function, a diet based on minimizing cost or a similar objective may not provide particularly tasty meals. Moreover, tasty meals result from recipes which are targeted to individual tastes. Thanks to Internet-based feedback for recipes, it is now possible to formulate such optimization models by using ratings on various recipes available on the web (e.g., https://www.kaggle.com/hugodarwood/epirecipes/data). These recipes not only provide the specific quantities of various foods and the required ingredients, they also tend to provide nutrition content, cooking time estimates, etc. Note that one can also estimate costs of meals from online groceries (e.g., Walmart or even the neighborhood groceries). Finally, note that modern Python-based programs can be used to automatically download content using web-scraping (e.g., https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe). As a result, data for the MnM Problem no longer reside in library archives (as they did in the days of G. Stigler).
Consider two roommates who are planning their meals for the following week; they are friendly enough to share cooking and shopping responsibilities but, like many students, they are strapped for time and money. These roommates have well-trained taste buds, and are interested in eating healthy, tasty meals, as well as eating within budgetary and dietary restrictions. The MnM problem focuses on meal planning with an OR/MS twist: the roommates will share shopping, cooking, and meal selection responsibilities, and will also respect each other’s schedules and budgets. They will use social media (and websites) to choose recipes. This shift from buying food-types in the Diet Problem to choosing recipes in the MnM results in a shift of decision modeling paradigm from LP to mixed-integer programming (MIP), including mixed-binary variables. We omit presenting the MIP model for a couple of reasons: a) those familiar with a first course in OR/MS should be able to formulate such a decision model without much trouble, and b) we do not wish to curb the creativity of students in OR/MS courses, since they might find this effort to be intellectually interesting. Nevertheless, the “proof is in the pudding”, and as a result, we feel obliged to report that students in our Analytics course at USC have created week-long dinner menus which include “Grilled brined chicken with chimichurri sauce” on Monday, “Beef tenderloin with red wine sauce, creamed spinach and french fries” on Tuesday, . . ., “Country bread stuffing with smoked ham, goat cheese and dried cherries” for Friday. Not only do these recommendations fit the roommates’ tastes, the students also seemed to be satisfied with their cooking schedules, which met their time and budget constraints. With this motivation, we proceed to discuss the data science aspects for this application.
Data Science: Collaborative Filtering.

As for the data science side, one of its more interesting contributions to the meal-planning effort is the ability to include diet preferences. Although it may not be easy to get personalized ratings in the early weeks of training a meal planning model, one might be able to eventually infer personal ratings by using modern tools such as matrix completion or other collaborative filtering techniques. These data science tools are commonly used in recommendation systems by online retailers (e.g., Amazon, Netflix, etc.) to recommend products and services inferred from prior individual ratings, as well as overlapping tastes with other users. Note that personalization only works after several trials have been undertaken to populate a ratings matrix with some user data. Nevertheless, one may use average ratings as a surrogate objective function before sufficient personal data becomes available.

The goal of matrix completion is to provide item recommendations or “predictions” based on the opinions of other users with similar “tastes” (Koren, Bell, and Volinsky, 2009). Based on the input data of customers and the products of interest to them, one can generate personalized coefficients for a certain user, and then use the set of coefficients to predict the user’s taste, according to which we can recommend products with higher ratings. Specifically, suppose we are given a rating matrix M in which each entry (i, j) represents the rating of product j by user i, if user i has rated product j. Otherwise, the entry is missing. Since we would like to predict the taste of users based on the given rating matrix, the problem is converted into predicting the remaining missing entries of M in order to make good recommendations to the users.
Note that matrix completion usually assumes that a low-rank matrix is preferred, since over-fitting may result if there are no restrictions placed on the number of degrees of freedom. (Therefore, we assume the rank of the completed matrix is given, say r, and we would like to find a matrix M̂ with rank r which matches the known entries.) Based on this idea, one formulates the following mathematical model:

$$\min_{X} \; \frac{1}{2} \left\| P_\Omega(M - X) \right\|_F^2 + \lambda \|X\|_*, \qquad (2.1)$$

where Ω represents the positions of entries which are already known in the rating matrix M, and P_Ω(·) represents the orthogonal projection onto the span of matrices vanishing outside of Ω, so that the (i, j)th component of P_Ω(X) is equal to X_ij if (i, j) ∈ Ω and zero otherwise. Let ‖·‖_F denote the Frobenius norm, whose square is the sum of squares of all entries, and ‖·‖_* denote the nuclear norm, which is the sum of the singular values of the matrix. The nuclear norm provides a convex relaxation of rank in (2.1); therefore we obtain a convex optimization problem for matrix completion of M. In addition, there exist cases where the entries are provided with a small amount of noise. For example, a noisy model can have Y_ij = M_ij + ε_ij, where ε_ij denotes a noise term and Y_ij is the observed rating entry. From Hastie et al., 2015, we can summarize the algorithm for completing M as follows (a numpy sketch follows the list). At iteration i:

1. The estimated matrix at iteration i is denoted X̂_i. Calculate M̂ = P_Ω(M) + P_Ω^⊥(X̂_i), where P_Ω^⊥ projects onto the entries outside Ω.
2. Compute the soft-thresholded singular value decomposition of M̂, represented as UDV^⊤.
3. Construct a diagonal matrix D̂_λ based on D: if D_ii > λ, let (D̂_λ)_ii = D_ii − λ; otherwise, (D̂_λ)_ii = 0.
4. Update X̂_{i+1} = U D̂_λ V^⊤.
5. Repeat steps 1 through 4 until convergence.
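A minimal numpy sketch of this Soft-Impute iteration follows; the boolean observation mask, the choice λ = 0.1, and the rank-1 demo matrix are illustrative assumptions.

```python
import numpy as np

def soft_impute(M, mask, lam, n_iter=200, tol=1e-6):
    """Soft-Impute sketch (Hastie et al., 2015): mask[i, j] = True marks
    observed entries of M; missing entries are iteratively refilled with a
    soft-thresholded SVD estimate."""
    X = np.zeros_like(M)
    for _ in range(n_iter):
        M_hat = np.where(mask, M, X)            # P_Omega(M) + P_Omega^perp(X)
        U, d, Vt = np.linalg.svd(M_hat, full_matrices=False)
        d_thr = np.maximum(d - lam, 0.0)        # soft-threshold singular values
        X_new = (U * d_thr) @ Vt                # U * D_lambda * V'
        if np.linalg.norm(X_new - X) <= tol * max(1.0, np.linalg.norm(X)):
            break
        X = X_new
    return X

# tiny demo: recover a rank-1 "ratings" matrix with ~40% missing entries
rng = np.random.default_rng(1)
M = np.outer(rng.random(8), rng.random(6)) * 5
mask = rng.random(M.shape) > 0.4
print(np.round(soft_impute(M, mask, lam=0.1), 2))
```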
2.1.2 Evolving Predict-then-Optimize

Unlike the previous example, sensor data updates speed predictions, which then require updating the route (i.e., the decision) itself. Even though the constraints in the optimization model remain deterministic, the fact that only part of the decisions are implemented changes the way data and decisions interact within this setting. Hence the decision problem in this example includes a “look-ahead” feature which was absent from the previous example.

Example 2 (Dynamic Routing Problem) Consider a situation where we are planning to find the best route (with least travel time) from node s to node t in a network whose links are divided into several pieces by speed sensors (see Figure 2.1). For link i, there are n_i speed sensors located at d_1, ..., d_{n_i}, and the sensors provide noise-corrupted data z(d_1, t), ..., z(d_{n_i}, t) at time t for all locations. We will use particle filtering on these noise-corrupted observations to infer state variable estimates (denoted w) throughout the network. Suppose we plan to start traveling at time T_0 on a particular weekday.

FIGURE 2.1: Network Routing

Traditional approaches might use historical speed data from previous weeks as a way to estimate travel time. Because of the availability of sensor data in today’s traffic network, more current data on speeds can be inferred from the speed data. In this example we use filtering techniques as a prediction tool (Chen, 2003) for all sensor locations in Figure 2.1. As for the decision problem, we study a specialized shortest-path Monte Carlo Tree Search algorithm to find the best route.
Data Science: Particle Filtering. Perhaps the most commonly used paradigm for sequential state estimation is the Kalman filter, which is applicable in the context of linear dynamic systems subject to Gaussian noise. Since the setting of urban traffic flow is known to be nonlinear, and the noise is usually not Gaussian, we will work with a more modern filtering scheme known as the Particle Filter. This scheme, also known as a Sequential Monte Carlo (SMC) method, assumes that the state space of the dynamic system can be partitioned into many small segments over which the density of particles approximates the probability distribution of the state of the system. Together, the ensemble of particles represents the distribution of the state at the most recent (past) time instant. The method uses a Bayesian framework to obtain a posterior distribution which is predicted using past observations of the stochastic process. In this mechanism, each particle is assigned its own likelihood (weight), which indicates the probability of being sampled. This framework of particle filters is illustrated in Figure 2.2.

FIGURE 2.2: Particle Filtering Steps
To describe the specific update, suppose that the traffic states are estimated consecutively at discrete time instants t_1, t_2, ..., t_k, ..., possibly asynchronously, based on all the incoming information up to the current time T_0 transmitted by speed sensors to the filter. Let w_k denote the state vector of traffic speed at time t_k, and z_k be the observation vector of traffic speed at time t_k at sensor locations. The observation history Z_k = {z_1, ..., z_k} will be used to infer the state vector w_k (speed) for each link-segment of the network. This will be accomplished using a recursive Bayesian update in which the conditional density function p(w_k | Z_k) of the state w_k, given a set of measurements Z_k, can be updated according to

$$p(w_k \mid Z_{k-1}) = \int p(w_k \mid w_{k-1})\, p(w_{k-1} \mid Z_{k-1})\, dw_{k-1} \qquad (2.2)$$

$$p(w_k \mid Z_k) = \frac{p(z_k \mid w_k)\, p(w_k \mid Z_{k-1})}{p(z_k \mid Z_{k-1})} \propto p(z_k \mid w_k)\, p(w_k \mid Z_{k-1}) \qquad (2.3)$$
However, it is computationally expensive to evaluate (2.2) and (2.3). Particle Filtering provides an approximation of the dynamic process by a discrete-time recursive update of the posterior probability distribution p(w_k | Z_k) corresponding to a collection of particles {w_k^(i)}_{i=1}^{N}. Specifically, in the main step of the algorithm (see step 2 in the summary below) at iteration k, by sampling the state space with respect to certain probabilities, we are able to get a set of particles {w_{k-1}^(i)}_{i=1}^{N} as an approximation of the evolving probability distribution function p(w_{k-1} | Z_{k-1}). However, the real posterior density model p(w_k | w_{k-1}) is usually unknown. Therefore it is common to choose an approximating density function of the true posterior, which is also known as an importance distribution. Then we sample the particles {ŵ_k^(i)}_{i=1}^{N} using the importance distribution p(w_k | w_{k-1}) given the particles {w_{k-1}^(i)}_{i=1}^{N}. The weights {b_k^(i)}_{i=1}^{N} of the particles can thus be updated by the sampled likelihoods p(z_k | ŵ_k^(i)) with normalization. One could discard the particles with low normalized importance weights, and then resample a new set of particles {w_k^(i)}_{i=1}^{N} using the updated weights. After sufficiently many iterations, only a few particles have non-zero importance weights, and the weights converge. A generic particle filtering algorithm using the transition prior density can be summarized as follows; a minimal numerical sketch in Python follows the list.
For steps k = 0, 1, 2, ...:

1. Initialization: for i = 1, ..., N, generate samples {w_0^(i)}_{i=1}^{N} from the initial probability distribution p(w_0). Set initial weights b_0^i = 1/N for i = 1, ..., N.
2. Importance Sampling: for each sample i = 1, ..., N, draw a particle ŵ_k^(i) from an importance distribution p(w_k | w_{k-1}^(i)).
3. Weights Update: calculate the updated weights b_k^i = p(z_k | ŵ_k^(i)) for i = 1, ..., N.
4. Weights Normalization: calculate the normalized importance weights b̂_k^i = b_k^i / Σ_{j=1}^{N} b_k^j.
5. Resampling: discard the particles with low normalized importance weights, then resample to generate N new particles from the set {ŵ_k^(i), 1 ≤ i ≤ N} according to the importance weights {b̂_k^i}_{i=1}^{N}.
6. Repeat steps 2 through 5 until the weights converge.
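The sketch below instantiates steps 1 through 6 for a one-dimensional speed state; the random-walk transition model, the Gaussian observation noise, and all numerical constants are illustrative assumptions, not the traffic model used in the experiment reported below.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 1000, 50
true_w = 60.0                                     # latent link speed (mph)
particles = rng.normal(60.0, 10.0, size=N)        # step 1: samples from p(w_0)

for k in range(T):
    true_w += rng.normal(0.0, 2.0)                # latent state evolves
    z = true_w + rng.normal(0.0, 5.0)             # noisy sensor reading z_k
    particles += rng.normal(0.0, 2.0, size=N)     # step 2: sample p(w_k | w_{k-1})
    wts = np.exp(-0.5 * ((z - particles) / 5.0) ** 2)  # step 3: b_k = p(z_k | w_k)
    wts /= wts.sum()                              # step 4: normalize weights
    particles = particles[rng.choice(N, size=N, p=wts)]  # step 5: resample

print(f"true speed {true_w:.1f}, filtered estimate {particles.mean():.1f}")
```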
Decision Science

For a given location d_j and an initial time t, we are able to train a particle filter on a past dataset to predict the expected future speeds {E[v(d_j, t + t_0)], E[v(d_j, t + 2t_0)], ..., E[v(d_j, t + m·t_0)]}, where the cycle t_0 is decided by the past dataset, and m denotes the number of prediction periods. Once the predicted speed values are obtained, we are able to calculate and update the travel time of each link. Note that in this example the travel time of each link is time-varying, because the speed prediction values change over time. As a result, to obtain the route with shortest travel time, we consider using an algorithm called Monte Carlo Tree Search (MCTS), in which the travel time of each link is updated during each time period; hence the algorithm becomes more responsive to the changing traffic conditions (Bertsimas et al., 2014, Jiang, Al-Kanj, and Powell, 2017). In the context of SPO, the decision vector evolves with each prediction round of the particle filter.
Monte Carlo Tree Search (MCTS) is a search algorithm which has already made an impact on Artificial Intelligence (AI), and on other cases in which the domain of the problem can be represented as a tree of sequential decisions (Fu, 2017). In each iteration, we use a certain policy (tree policy) to find the most promising node of the current tree. Then one can update the current tree by running a simulation from the selected node. The standard MCTS algorithm is summarized in the following. For each iteration, there are four main steps.

1. Selection. Starting from the root node, select a child node by a given tree policy to reach an optimal node. Notice that in this case, the selected node is required to be expandable, which means that the selected node has at least one unvisited child.
2. Expansion. Since the selected node is expandable, add all the child nodes of the selected node to expand the tree.
3. Simulation. For the newly added child nodes, run a simulation according to the given policy and calculate the outcome.
4. Backpropagation. Based on the simulation result in step 3, backpropagate through the current tree via the selected nodes, and update their values.
In this example, the routing algorithm is performed by using one-step-ahead MCTS to find the current optimal route. The policy used for updating the statistics of newly created tree nodes is based on the travel times predicted by particle filtering. The shortest path implementation of MCTS is summarized as follows; a minimal sketch in Python appears after the list.
For time slots j = 0, 1, 2, ...:

1. Initialization. Initialize the vertex set Q to include all the nodes. For each node v in the network, set time[v] = Infinity, which represents an unknown travel time from source s to v in the initialization state. For the source node s, set time[s] = 0.
2. Selection. Choose a node in Q with minimum travel time from source s and denote the node as u. Remove node u from set Q.
3. Expansion. For each neighbor v of u, compute the candidate travel time time[u] + t_j(u, v), where t_j(u, v) represents the predicted travel time from u to v in time slot j provided by particle filtering.
4. Simulation. If the candidate travel time time[u] + t_j(u, v) < time[v], then we update the state of node v: set time[v] = time[u] + t_j(u, v), and update the previous node of v on the current optimal path: prev[v] = u.
5. Backpropagation. If Q is not empty, repeat steps 2 through 4.
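A compact Python rendering of these steps follows, using a heap in place of scanning Q; the function travel_time(u, v, tau) stands in for the particle-filter prediction t_j(u, v), and the three-node demo network with linearly growing congestion is made up for illustration.

```python
import heapq

def dynamic_route(succ, travel_time, s, t, t0=0.0):
    """Extended Dijkstra with time-varying link costs: succ[u] lists the
    neighbors of u; travel_time(u, v, tau) is the predicted (nonnegative)
    travel time on link (u, v) when departing u at time tau."""
    time, prev = {s: t0}, {}
    heap, done = [(t0, s)], set()
    while heap:
        t_u, u = heapq.heappop(heap)          # Selection: closest unsettled node
        if u in done:
            continue
        done.add(u)
        if u == t:
            break
        for v in succ.get(u, []):             # Expansion with time-varying cost
            t_v = t_u + travel_time(u, v, t_u)
            if t_v < time.get(v, float("inf")):   # Simulation: improve label
                time[v], prev[v] = t_v, u
                heapq.heappush(heap, (t_v, v))
    path = [t]
    while path[-1] != s:                      # Backpropagation: recover the path
        path.append(prev[path[-1]])
    return path[::-1], time[t] - t0

# toy network: congestion on link (s, t) grows with the departure time
succ = {"s": ["a", "t"], "a": ["t"]}
tt = lambda u, v, tau: {("s", "a"): 10, ("a", "t"): 10, ("s", "t"): 15 + tau}[(u, v)]
print(dynamic_route(succ, tt, "s", "t"))
```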
In this DRP experiment, the traffic travel time data was obtained from PeMS (the Caltrans Performance Measurement System, https://pems.dot.ca.gov/), in which the traffic data is collected in real time using individual detectors. We run particle filtering as well as the extended Dijkstra algorithm on a network of the freeway system across all major areas of Los Angeles. The predicted travel time of the optimal path using the dynamic routing algorithm is 56.10 minutes, and we validate this result by using the observed travel time data for the calculated optimal path (after the route is completed). The validated travel time is 58.04 minutes, which is close to the predicted travel time. For comparison, using the deterministic Dijkstra algorithm with the cost of each link estimated from expected historical travel-time data, we find that the estimated travel time of an optimal path is 49.87 minutes, which appears better than the result from the dynamic routing algorithm. However, when we validate the calculated path with real travel time data, the resulting simulated time is 65.84 minutes, which indicates that there exists a mismatch between actual and estimated travel time for deterministic routing. Moreover, comparing the validated results, the travel time of the path calculated using particle filtering is significantly smaller.
2.2 SP: Cost and Constraint Uncertainty

From its earliest version by George Dantzig in 1955 to the most recent class of Stochastic Programming (SP) models, uncertainty has been modeled in SP using random variables in both objectives and constraints. It is usually stated in the setting of an evolving sequence of random variables, although for our purposes we will focus on its simplest two-stage version, which may be stated as follows.
$$\min \;\; c_1(x_1) + \mathbb{E}[h_2(x_1, \tilde{\omega}_1)] \qquad (2.4a)$$
$$\mathrm{Pr}\!\left(g_1(x_1, \tilde{\omega}_1) \le 0\right) \ge p_1 \qquad (2.4b)$$
$$x_1 \in X_1, \qquad (2.4c)$$

where, for a particular realization ω_1 of ω̃_1, h_2(x_1, ω_1) is defined by

$$h_2(x_1, \omega_1) = \min \; f_2(x_2;\, x_1, \omega_1) \qquad (2.5a)$$
$$\text{s.t.} \;\; x_2 \in X_2(x_1, \omega_1). \qquad (2.5b)$$
It is important to note that the random variables in SP accommodate uncertainty in constraints as well as the objective function. In some instances, one could use the data directly, as in empirical risk minimization or sample average approximation (SAA), and in others, a distribution is assumed to have been chosen.

The above statement of SP covers specific formulations of “SP with recourse” as well as “SP with chance constraints”. This structure is general enough to allow risk measures (e.g., Downside Risk, Conditional Value at Risk, and others). Indeed, with an appropriately nested collection of random variables, one can also state a multi-stage decision model using extensions of the above mathematical statement. We refer to Birge and Louveaux, 2011 for SP models and algorithms, Shapiro, Dentcheva, and Ruszczynski, 2009 for mathematical structures and properties, and to Sen, 2013 for shorter encyclopedic overviews. For the most part, SP models have focused on convex optimization models, although there have been significant advances in stochastic mixed-integer programming in recent years (see Küçükyavuz and Sen, 2017). If one wishes to guard against misspecified distributions, we recommend models which are distributionally robust (e.g., Delage and Ye, 2010). Our focus on the fusion of data and decision science limits our ability to cover many aspects of modeling uncertainty, including the work on chance-constrained problems. It is also worth reiterating that there is a variety of applications in the volume edited by Wallace and Ziemba, 2005, which is entirely devoted to applications of SP.

For the purposes of this dissertation we use SP in dual roles: a) as a prototypical decision model (which accommodates uncertainty in both the objective and constraints) to make the transition from PO (or SPO) to Learning Enabled Optimization (LEO), presented in the next chapter, and b) as a vehicle to illustrate an often-overlooked aspect of SP modeling, namely, model validation. Because model validation and selection are an integral part of data science (and SL/ML), we present some ideas underlying SP model validation in this chapter. We begin with a simple commentary on the solutions from SP.
2.2.1 SP as a Bridge and the Structure of Hedged Solutions

In order to highlight how SP provides a bridge between data and decision sciences, we first present the well-known SVM model as an SP. To do so, we will first admit the possibility of using an empirical distribution in the SP formulation. In this case, the notion of empirical risk minimization in SL/ML translates to a Sample Average Approximation (SAA), commonly used in SP.
Example 3 (Support Vector Machines) Consider a binary classification problem. Suppose that we have a training set consisting of N objects which belong to one of two categories: I⁺ and I⁻ (and I = I⁺ ∪ I⁻). With each of these categories we have a collection of objects i ∈ I⁺ whose features are quantified in a vector Z_i ∈ ℝⁿ. For instance, the number of features n may be two (say, weight and volume), and therefore Z_i will denote the weight and volume of object i ∈ I⁺. Similarly, for each i in I⁻, we also have feature vectors Z_i ∈ ℝⁿ. It is common to represent membership in I⁺ by setting a scalar W_i = +1, whereas membership in I⁻ is recorded via a scalar W_i = −1. For the binary classification problem, we state the classification problem as one of identifying a vector β ∈ ℝⁿ and a scalar β₀ such that the function

$$h(\beta, \beta_0;\, z) = \beta^\top z + \beta_0 \qquad (2.6)$$

assumes positive values for i ∈ I⁺, and analogously assumes negative values when i ∈ I⁻. In the case of affine functions such as the one shown in (2.6), the goal of the SVM problem is to find a vector (β₀, β) ∈ ℝ^{n+1} such that the points in I are classified correctly, to the extent possible.
Note that because the threshold zero is common to both inequalities, there is the likelihood of ambiguities in decision-making when the value β^⊤z + β₀ turns out to be relatively small. It might actually be more effective to compare the expression in (2.6) to +1 or −1, because then there is a separation. Thus we should check whether β^⊤z + β₀ ≥ 1 or β^⊤z + β₀ ≤ −1. As a result, we seek a vector β ∈ ℝⁿ (where n is the number of features) and a scalar β₀ such that the points in I⁺ and I⁻ are properly classified. One way to do this is to solve the following SAA approximation using the empirical distribution induced by the sample {W_i, Z_i}_{i∈I}. Then, denoting the expectation with respect to the empirical distribution by Ê, we have

$$\min \;\; \frac{1}{2}\beta^\top \beta + \gamma\, \hat{\mathbb{E}}_Z\!\left[h(\beta, \beta_0, Z)\right]$$

where

$$h(\beta, \beta_0, Z = z) = \min \sum_{i \in I} (\delta_i^+ + \delta_i^-)$$
$$\beta^\top z + \beta_0 + \delta_i^+ \ge 1, \quad i \in I^+$$
$$\beta^\top z + \beta_0 - \delta_i^- \le -1, \quad i \in I^-$$
$$\delta_i^+,\, \delta_i^- \ge 0, \quad \forall\, i.$$

Notice from this formulation that the data (representing uncertainty) appears in the constraints.
When there are few data points, the above problem can be solved using deterministic quadratic programming (QP). However, for a large number of data points, a QP becomes very unwieldy, and decomposition approaches of SP have been studied (e.g., PEGASOS, Shalev-Shwartz et al., 2011). It is also interesting to note that the regularizer (quadratic term) in SVM also appears within SP, especially in the Stochastic Decomposition algorithm (Higle and Sen, 1994 and Sen and Liu, 2016). In this sense, SP provides an excellent vehicle for bridging the fields of Data and Decision Sciences.
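For intuition about the SAA form of this model, the sketch below minimizes (1/2)‖β‖² + γ Ê[max(0, 1 − W(β^⊤Z + β₀))], i.e., the hinge-loss restatement of the recourse value above, by plain subgradient descent; the synthetic two-Gaussian data and step sizes are illustrative choices, and this is not the SD method developed in Chapter 4.

```python
import numpy as np

def svm_saa_subgradient(Z, W, gamma=10.0, steps=2000, lr=0.01):
    """Subgradient descent on the SAA objective
    (1/2)||beta||^2 + (gamma/N) * sum_i max(0, 1 - W_i * (beta'Z_i + beta_0))."""
    N, n = Z.shape
    beta, beta0 = np.zeros(n), 0.0
    for _ in range(steps):
        margins = W * (Z @ beta + beta0)
        active = margins < 1.0                 # misclassified or inside the margin
        g_beta = beta - (gamma / N) * (W[active, None] * Z[active]).sum(axis=0)
        g_beta0 = -(gamma / N) * W[active].sum()
        beta, beta0 = beta - lr * g_beta, beta0 - lr * g_beta0
    return beta, beta0

# toy data: class +1 around (2, 2), class -1 around (-2, -2)
rng = np.random.default_rng(3)
Z = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
W = np.concatenate([np.ones(50), -np.ones(50)])
beta, beta0 = svm_saa_subgradient(Z, W)
print((np.sign(Z @ beta + beta0) == W).mean())   # training accuracy
```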
Example 4 (Hedging when Network Links Fail) The act of "hedging" is a common strategy in financial applications and for this reason, SP has made significant in-roads into that discipline. The same design principles are very useful for designing resilient engineering systems. Such a principle essentially states that the number of non-zero elements of an optimal solution of a stochastic LP is usually larger than that of an instance in which the random variables are replaced by their expected values (Murphy and Sen, 2002). The following example illustrates this principle for a small network planning problem.
Consider an undirected network with K links, and assume that some of these links are failure prone. When a link fails, it cannot carry any flow; however, if the link is up, it allows a flow within the range $[0, x_i]$, where $x_i > 0$ denotes the capacity of the link with index $i$. In this example, there are two types of links: i) those with zero failure rate, and therefore a fixed capacity. However, such links can be relatively expensive, with a cost denoted by $c_0$. ii) links with failure rate $r \in (0, 1)$; in this example $r$ is considered to be a constant among all the links which have non-zero failure rate. Therefore, the size of a given non-zero failure rate link can be considered as a random variable with a two-point (binomial) distribution (i.e., the size of link $i$ is 0 with probability $r$, and is $x_i$ with probability $1 - r$). We also assume that link failures are independent of each other.
For the sake of illustration consider the simple network shown in Figure 2.3. For simplicity we refer to node pairs A-B, B-C and A-C by an index $i$ for $i = 1, 2, 3$ respectively, and $w_1, w_2$ and $w_3$ denote the number of requests (demand) for the three node-pairs. The demand values of pairs 1 and 2 are labeled as "high" level, which means that the demands obey the distribution in Table 2.1.

FIGURE 2.3: Network of Example 3.1

Value        0     1     2     3     4
Probability  0.05  0.2   0.5   0.2   0.05

TABLE 2.1: High Demand Distribution.

The demand of pair 3 follows the "low" distribution shown in Table 2.2.

Value        0     1     2     3
Probability  0.1   0.4   0.4   0.1

TABLE 2.2: Low Demand Distribution.
These demands are served by flows on links AB, BC, and AC. Suppose the links AB and BC have a failure rate $r = 0.3$, and the cost $c_0 = \$1$. On the other hand, the link AC is a zero failure rate link with cost \$4. Given a total budget of \$10, the goal is to find the best plan of link capacities. If we adopt a deterministic approach, then we might be tempted to replace the random variables by their expected values. If we do that, the standard LP formulation would be as follows.
Let $f_{IJ}$ = value of flow directly from $I$ to $J$, where $I, J \in \{A, B, C\}$, $I \neq J$;
$f_{IJK}$ = value of flow from $I$ through $J$, ending at $K$, where $I, J, K \in \{A, B, C\}$, $I \neq J \neq K$;
$r$ = failure rate of the non-zero failure rate links;
$s_i$ = number of rejects corresponding to node-pair $i$, $i = 1, 2, 3$;
$x_i$ = size of the link joining node-pair $i$, where $i = 1, 2, 3$.

The problem can be formulated as follows:

$$\min_{x_i, s_i, \, i=1,2,3} \ \sum_{i=1}^{3} s_i$$
$$\text{s.t.} \quad s_1 + f_{AB} + f_{ACB} = w_1, \quad s_2 + f_{BC} + f_{BAC} = w_2, \quad s_3 + f_{AC} + f_{ABC} = w_3$$
$$f_{AB} + f_{BAC} + f_{ABC} \leq (1-r)x_1, \quad f_{ACB} + f_{BC} + f_{ABC} \leq (1-r)x_2$$
$$f_{ACB} + f_{BAC} + f_{AC} \leq x_3, \quad x_1 + x_2 + 4x_3 \leq 10$$
$$x_1, x_2, x_3 \geq 0, \quad s_1, s_2, s_3 \geq 0$$
FIGURE 2.4: Solution of Deterministic LP
The solution of the LP formulation indicates that the link sizes of A-B and B-C are both 5, while the link size of A-C is zero, which means A-C is disconnected. It is not difficult to show that for this decision, the expected number of rejects in this network would be 2. Note also that since the LP optimum can be found at an extreme point, the associated network will have the structure of a tree (no cycles) as in Figure 2.4.
Now suppose that we wish to use SP to capture the uncertainty of the demands $w_1, w_2$ and $w_3$. Furthermore, the failure rate $r$ becomes a binomial random variable. In this case, we minimize the expectation of $h(x, \omega)$ with respect to the random variables $\{w_1, w_2, w_3, r\}$. Therefore the SP formulation is as follows.
$$\min \ E_{\tilde{w}, \tilde{r}}[h(x, \tilde{w}, \tilde{r})]$$
$$\text{s.t.} \quad x_1 + x_2 + 4x_3 \leq 10, \quad x_1, x_2, x_3 \geq 0$$

where for each given outcome $(w, r)$ and any $x$, we define

$$h(x, w, r) = \min \ \sum_{i=1}^{3} s_i$$
$$\text{s.t.} \quad s_1 + f_{AB} + f_{ACB} = w_1, \quad s_2 + f_{BC} + f_{BAC} = w_2, \quad s_3 + f_{AC} + f_{ABC} = w_3,$$
$$f_{AB} + f_{BAC} + f_{ABC} \leq (1-r)x_1, \quad f_{ACB} + f_{BC} + f_{ABC} \leq (1-r)x_2,$$
$$f_{ACB} + f_{BAC} + f_{AC} \leq x_3,$$
$$s_1, s_2, s_3 \geq 0$$
For this SP formulation, the solution of the two-stage SP is $\{x_1, x_2, x_3\} = \{3, 3, 1\}$ with an expected number of rejects of 1.8. Thus, we expect a 10% improvement in performance by using the SP model, provided the distribution is correct. We should also comment on the notion of hedging which we mentioned at the beginning of this section. Recall that the LP formulation provided a network design which was a tree, whereas the SP formulation did not. The presence of a cycle in the SP solution creates the opportunity for hedging: in cases where one of the failure-prone links fails, all three nodes are still able to communicate. On the other hand, the solution provided by the deterministic model, being a tree, is such that if one link breaks down, some node pairs can no longer communicate, leading to poor performance.
FIGURE 2.5: Solution of Stochastic LP
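The hedging effect can be checked numerically. The sketch below (our own illustration, not code from the thesis) estimates the expected number of rejects for both designs by Monte Carlo, solving the second-stage LP with scipy for each sampled demand/failure outcome; the scenario count and variable ordering are assumptions, so the printed estimates will only approximate the values 2 and 1.8 reported above.

```python
import numpy as np
from scipy.optimize import linprog

def expected_rejects(x, n_scen=2000, r=0.3, seed=0):
    """Monte Carlo estimate of E[h(x, w, r)] for the three-node network.
    Second-stage variables: s1, s2, s3, f_AB, f_ACB, f_BC, f_BAC, f_AC, f_ABC."""
    rng = np.random.default_rng(seed)
    high = rng.choice([0, 1, 2, 3, 4], p=[.05, .2, .5, .2, .05], size=(n_scen, 2))
    low = rng.choice([0, 1, 2, 3], p=[.1, .4, .4, .1], size=n_scen)
    up = rng.random((n_scen, 2)) > r            # links AB, BC survive w.p. 1 - r
    c = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0])   # minimize total rejects
    A_eq = np.array([[1, 0, 0, 1, 1, 0, 0, 0, 0],   # s1 + f_AB + f_ACB = w1
                     [0, 1, 0, 0, 0, 1, 1, 0, 0],   # s2 + f_BC + f_BAC = w2
                     [0, 0, 1, 0, 0, 0, 0, 1, 1]])  # s3 + f_AC + f_ABC = w3
    A_ub = np.array([[0, 0, 0, 1, 0, 0, 1, 0, 1],   # flow on link AB
                     [0, 0, 0, 0, 1, 1, 0, 0, 1],   # flow on link BC
                     [0, 0, 0, 0, 1, 0, 1, 1, 0]])  # flow on link AC
    total = 0.0
    for k in range(n_scen):
        b_eq = [high[k, 0], high[k, 1], low[k]]
        b_ub = [x[0] * up[k, 0], x[1] * up[k, 1], x[2]]
        total += linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq).fun
    return total / n_scen

print("tree design  (5,5,0):", expected_rejects([5, 5, 0]))
print("cycle design (3,3,1):", expected_rejects([3, 3, 1]))
```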
2.2.2 SP Model Validation
Depending on the size of the sample space associated with the random variables in an SP formulation, one can adopt direct methods (as in solving SVM problems by QP solvers), or use decomposition schemes which are designed to take advantage of the decomposable structure of the SP. Both kinds of algorithms are applicable to SAA (or empirical risk minimization). If sampling is carried out in an "outer loop", the methodology gets the label of "external" sampling because the numerical solution of the approximation is carried out in an "inner loop", separately from sampling. For most SP applications, this is the more popular approach (see Mak, Morton, and Wood, 1999, Bayraksan and Morton, 2009) and has been used for some problems of practical size (Linderoth, Shapiro, and Wright, 2006). For "internal" sampling algorithms, every iteration of optimization is followed by a sampling phase which enlarges the sample size according to some rule (Higle and Sen, 1996b, Royset and Szechtman, 2013). In either case, replications are essential for reduced variance estimates of both the objective value as well as solutions.
One intrinsic difference between replication for simulation studies and replications in SP is that the latter creates the possibility of generating several candidate solutions. Indeed, Linderoth, Shapiro, and Wright, 2006 report wide disparity among solutions of sampled instances of SSN, even with a sample size of 5000. (For further information on SSN, please see Sen, Doverspike, and Cosares, 1994 as well as the section on "Verification" in this chapter.) In such cases, the question is: which decision should be recommended? The traditional idea is one where a preliminary objective function estimate is obtained for solutions from each replication; one then successively prunes them, and the objective value estimates are refined, until the subset of solutions is small enough to recommend a final decision (Linderoth, Shapiro, and Wright, 2006). Such procedures are better formalized in terms of ranking and selection methods in simulation optimization. Nevertheless, they can be extremely time and resource intensive. In order to overcome these problems, Sen and Liu, 2016 recommend the "Compromise Decision". This idea provides a relatively fast and effective technique for identifying a low-variance solution from among several candidates. Although it was originally tested for the SD algorithm, it is applicable to any scheme where each sampled approximation is solved by using a stochastic proximal (sub)gradient algorithm (e.g., Atchade, Fort, and Moulines, 2014).
To be specific, let $\nu$ denote an index of the replications ($\nu = 1, \ldots, M$), and suppose the sample average approximation (SAA) for replication $\nu$ has a sample size $N_\nu$. Such an SAA function can be defined by

$$\hat{f}_\nu(x) = c_1(x) + \frac{1}{N_\nu}\sum_{i=1}^{N_\nu} h_1(x, \omega^i). \qquad (2.7)$$

The right hand side of the above function has subscripts as in (2.4a), although we omit the subscript on the left hand side to ease the notational burden. Let $f_\nu(x)$ denote a piecewise linear lower bounding approximation of $\hat{f}_\nu(x)$, using finitely many subgradients. Given $\rho_\nu > 0$, suppose that

$$x_\nu \in \operatorname*{argmin}_{x \in X} \ f_\nu(x) + \frac{\rho_\nu}{2}\|x - \bar{x}_\nu\|^2, \qquad (2.8)$$

where $\bar{x}_\nu$ denotes the incumbent solution of replication $\nu$.
The above setting is also the standard instance at the culmination of a replication in SD (Higle and Sen, 1994), where $\rho_\nu$ is updated during the course of the algorithm. Let $\bar{\rho}$ denote the sample average of $\{\rho_\nu\}_{\nu=1}^{M}$, and consider the following problem, known as the "Compromise Problem":

$$\min_{x \in X} \ \frac{1}{M}\sum_{\nu=1}^{M} f_\nu(x) + \frac{\bar{\rho}}{2}\|x - \bar{x}\|^2. \qquad (2.9)$$
Problem (2.9) may be interpreted as a "grand" SAA approximation; because it pools many more samples, it produces a decision with much lower variance than any of the individual solutions, and moreover, the bias from SP is also reduced. To formalize this result, let $x_c$ denote an optimal solution to (2.9); we refer to it as the compromise decision. Define $\bar{x} = \frac{1}{M}\sum_\nu x_\nu$, and define $\bar{F}_M(x) = \frac{1}{M}\sum_{\nu=1}^{M} f_\nu(x)$. The following theorem is presented in Sen and Liu, 2016.
Theorem 1 Assume that the SP model is a convex program and let $N = \min_\nu N_\nu$. For replications $\nu = 1, \ldots, M$, let $x_\nu$ denote an optimal solution of (2.8). Assume that for each $\nu$, at least one subgradient of $f_\nu$ is a subgradient of $\hat{f}_\nu$ at the point $x_\nu$, while all other subgradients (finite in number) are $\epsilon$-subgradients of $\hat{f}_\nu$ at $x_\nu$. Then the following hold true.

a) If $x_c = \bar{x}$, then $x_c$ solves

$$\min_{x \in X} \ \bar{F}_M(x) := \frac{1}{M}\sum_{\nu=1}^{M} f_\nu(x), \qquad (2.10)$$

and b)

$$|f(x_c) - \bar{F}_M(x_c)| = O_p((NM)^{-\frac{1}{2}}). \qquad (2.11)$$
The most important consequence of the above result is that it creates an easily implemented stopping rule even when the solutions $x_\nu$ are very disparate. Moreover, the right hand side of (2.11) implies that for $M$ reasonably large, the objective function estimate at the compromise decision has lower variance than that at the solution $x_\nu$ of any individual replication (which is of the order of $O_p(N^{-1/2})$).

The condition that $x_c = \bar{x}$ might seem somewhat strict, but as it turns out, it can be relaxed to the case in which $\|x_c - \bar{x}\| = \epsilon$, whence one can infer an error tolerance on the sub-optimality of $x_c$, depending on the value $\epsilon$. This result is also the basis of the "Statistical Optimality" concept presented in chapter 3.
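As an illustration of the compromise problem (2.9), the following minimal sketch builds hypothetical terminal cut collections $f_\nu$ for M replications and solves the regularized grand-mean problem numerically; all data here are synthetic stand-ins, and a derivative-free solver is used only because the instance is tiny.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
M, n = 5, 2
# Hypothetical terminal data from M replications: each f_nu is a max of
# affine cuts, with a terminal solution x_nu and a proximal parameter rho_nu.
cuts = [(rng.normal(size=4), rng.normal(size=(4, n))) for _ in range(M)]
x_nu = rng.normal(size=(M, n))
rho = rng.uniform(0.5, 2.0, size=M)
x_bar, rho_bar = x_nu.mean(axis=0), rho.mean()

def F_bar(x):
    """Grand-mean approximation (1/M) sum_nu f_nu(x)."""
    return np.mean([np.max(a + B @ x) for a, B in cuts])

# Compromise problem (2.9): regularized minimization of the grand mean.
obj = lambda x: F_bar(x) + 0.5 * rho_bar * np.sum((x - x_bar) ** 2)
x_c = minimize(obj, x_bar, method="Nelder-Mead").x  # adequate for tiny n
delta = rho_bar * np.sum((x_c - x_bar) ** 2)        # tolerance as in (2.8)-(2.9)
print("compromise decision:", np.round(x_c, 3), " delta:", round(delta, 4))
```

The distance between the compromise decision and the average solution directly yields the tolerance used in the stopping rule, which is what makes the test easy to implement.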
2.2.3 SP Model Verification
In very large scale (or difficult) instances, where validation may be too onerous, the verification process becomes a "zero-th" order (as opposed to a first-order) test. This subsection therefore revolves around large scale and difficult instances. Unlike the validation process, which seeks optimality through replicated runs, the verification process is intended only to test whether the predicted objective lies within the 95% confidence interval, and hence is less stringent than the optimality tests presented in the previous subsection. The intuition behind this is that an SP solver produces biased (low) objective estimates until the sample size is large enough. Thus if one can ascertain that the 95% confidence interval obtained using a Monte Carlo process already contains the objective estimate predicted by the SP solver, then this lends credence to the view that the sample size is large enough. However, it is important to recognize that the above process may not be acceptable if the width of the 95% C.I. is so large that unattractive decisions become acceptable.
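The verification test itself is simple to implement. Below is a minimal sketch, assuming a predicted objective value from the solver and an independent Monte Carlo sample of out-of-sample costs at the proposed decision; both inputs here are hypothetical stand-ins.

```python
import numpy as np

def verify(predicted, costs, z=1.96):
    """Zero-th order verification: is the solver's predicted objective inside
    the 95% C.I. of the out-of-sample cost estimate at the proposed decision?
    `costs` holds second-stage costs evaluated on an independent sample."""
    m = costs.mean()
    se = costs.std(ddof=1) / np.sqrt(len(costs))
    lo, hi = m - z * se, m + z * se
    return lo <= predicted <= hi, (lo, hi)

# Hypothetical numbers: solver predicts 13.72; a simulator supplies the sample
rng = np.random.default_rng(0)
costs = 13.7 + 0.5 * rng.standard_normal(1000)
ok, ci = verify(13.72, costs)
print(f"predicted value inside 95% C.I. {ci}: {ok}")
```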
Example 5 (Verification of Statewide Electricity Dispatch) This example is from a case study of a realistic power grid. The economic dispatch model for the state of Illinois was part of a study reported in Gangammanavar, Sen, and Zavala, 2016. Because the study included a significant number of wind farms (12), the electricity dispatch problem was required to be solved in 10 minute intervals, with each planning period covering the next 60 minutes in the second stage model. This two-stage SP (with 7 ten-minute intervals) included 84 wind random variables together with the other thermal generators in the state's portfolio. If we were to use an "external" sampling approach with only 100 scenarios representing 84 random variables, we would have to solve an LP with 4 million columns and 3 million rows within a few minutes (on the order of 5-7 mins). This was out of the question because our preliminary experiments suggested that an external sampling approach (using an LP solver) with only 5 scenarios would exceed 10 hours of computational time with a state-of-the-art LP solver. We decided to use the SD algorithm mentioned above, and were able to get the SD solver to produce results within 20 min. on an ordinary desktop machine. We concluded that we could easily scale up to about 5 minutes per SD solve if we used a more powerful computing platform. However, replications requiring several runs would be far too stringent for an application that runs every 5-10 minutes. So we decided to verify (using Monte Carlo sampling) whether the predicted objective value is reliable.
The verification results in Table 2.3 are reproduced from Gangammanavar, Sen, and Zavala, 2016. These computational experiments were based on the electrical grid of the Illinois system, and the table provides both the predicted value as estimated by SD, and a validated 95% C.I., for which an external simulator generates a verification dataset. Note that the predicted value in row 2, column 2 is within the 95% C.I. reported in row 2, column 3. This is in line with the comments of the previous paragraph.⁴ The point of this example is to illustrate that when the SP algorithm is able to predict an objective function estimate within a 95% C.I. (produced using Monte Carlo sampling), it may be a reasonably good decision. The intuition behind this is that an SP solver produces biased objective estimates until the sample size is large enough. Thus finding that the 95% C.I. obtained using a Monte Carlo process already includes the SD objective estimate lends credence to the view that the sample size is large enough.

Unlike the validation process, which seeks optimality through replicated runs, this verification process only tests whether the prediction lies within the 95% C.I., and hence is less stringent than the optimality tests presented in Sen and Liu, 2016 and further refined in Deng and Sen, 2018.

⁴ Over a 10 min. interval, the estimated savings by SD over LP is $58,583 for the state.
Algorithm                Sample Size   Predicted value   95% C.I.
External Sampling (LP)   5             12,811,076        [13,722,876; 13,797,483]
Internal Sampling (SD)   138           13,721,268        [13,660,228; 13,742,965]

TABLE 2.3: Verification of Electricity Dispatch (10% Wind Penetration) (Gangammanavar, Sen, and Zavala, 2016)
Example 6 (Verification of a Predecessor to SSN) SSN is a notoriously difficult SLP model which has been circulated within the SP community for over 20 years. The data set which we discuss below is the original version of SSN, which could not be circulated because of confidentiality agreements. This network planning problem was associated with a study to introduce private line services in a network with 86 demand pairs, 89 links and 706 potential routes (Sen, Doverspike, and Cosares, 1994). The plots shown in Figure 2.6 are intended to verify whether the population of predicted objectives matches the verification dataset. Note that Figure 2.6(a) plots the estimated mean value of rejected calls associated with plans corresponding to certain iterations of the algorithm, whereas Figure 2.6(b) plots the estimated value of percent blocking within a discrete-event simulation. Figure 2.6(c) can be obtained by normalizing the data for estimated mean values and percent blocking values so that the two sets of values can be compared in a more revealing way. The normalized objective values of candidate solutions as measured by the SP agree completely with those obtained via a discrete event simulation.

FIGURE 2.6: SP Validation for SSN Problem (Sen, Doverspike, and Cosares, 1994)

This is revealed by the fact that the scaled data points in Figure 2.6(c) lie approximately around the diagonal, which indicates that the reported SD objective function value fits the simulated results reasonably well. Again, this is not a validation of optimality, but a verification of the objective value estimate through a discrete-event simulation.
As discussed above, SL/ML provides support for predictive analytics, whereas optimization forms the basis for prescriptive analytics, and the methodologies for these are built (somewhat) independently of each other. The process recommended for SL/ML is summarized in Figure 2.7(a), in which the entire data set is divided into two parts (Training and Validation), with the former being used to learn model parameters (for a predictive model), and the latter data set used for model assessment and selection. Once a model is selected, it can be finally tested by either simulation or using an additional "test data set" for trials before adoption. The LEO framework greatly enhances the potential for fusion by inserting the prescriptive model.
FIGURE 2.7: Statistical Learning and Learning Enabled
Optimization
Example 7 (Wyndor-Uncertainty with Latent Random Effects) Consider a well known example called "The Wyndor Glass Co." In this example, Hillier and Lieberman, 2012 address resource utilization questions arising in the production of high quality glass doors: some with aluminum frames (A), and others with wood frames (B). These doors are produced by using resources available in three plants, named 1, 2, and 3. The data associated with this problem is shown in Table 2.4.

Plant              Prod. time for A   Prod. time for B   Total Hours
                   (Hours/Batch)      (Hours/Batch)      Available
1                  1                  0                  4
2                  0                  2                  12
3                  3                  2                  18
Profit per Batch   $3,000             $5,000

TABLE 2.4: Data for the Wyndor Glass Problem (Hillier and Lieberman, 2012)

The product mix will not only be decided by production capacity, but also by the potential of future sales. Sales information, however, is uncertain and depends on the marketing strategy to be
adopted. Given 200 advertising time slots, the marketing strategy involves choosing a mix of advertising outlets through which to reach consumers. Exercising some "artistic license" here, we suggest that the advertising data set in James et al., 2013 reflects sales resulting from an advertising campaign undertaken by Wyndor Glass. That is, the company advertises both types of doors through one campaign which uses two different media, namely, TV and radio.⁵ Note that in the original data set the advertising strategy is represented as budgeted dollars, whereas we have revised it to represent the number of advertising time slots. Thus in our statistical model, sales predictions are based on the number of TV and radio advertising time slots.⁶ In our interpretation, product-sales reflect the total number of doors sold ($\{W_i\}$) when the advertising expenditure for TV is $Z_{i,1}$ and that for radio is $Z_{i,2}$, in number of advertising time slots. (This data set has 200 data points, that is, $i = 1, \ldots, 200$.) Let $x_1$ denote the number of TV advertising time slots, and $x_2$ denote the number of radio advertising time slots.

⁵ The actual data set discussed in James et al., 2013 also includes newspapers. However we have dropped it here to keep the example very simple.
⁶ The numbers are the same as in James et al., 2013.
Standard SP Approach

$i$: index of product, $i \in \{A, B\}$.
$y_i$: number of batches of product $i$ produced.

$$\max \ \{-0.1x_1 - 0.5x_2 + E[\text{Profit}(x, W, Z)]\} \qquad (2.12a)$$
$$\text{s.t.} \quad x_1 + x_2 \leq 200 \qquad (2.12b)$$
$$x_1 - 0.5x_2 \geq 0 \qquad (2.12c)$$
$$L_1 \leq x_1 \leq U_1, \quad L_2 \leq x_2 \leq U_2 \qquad (2.12d)$$
where

$$\text{Profit}(x, W_i, Z_i) = \max \ 3y_A + 5y_B - M_1(\delta_{i1}^+ + \delta_{i1}^-) - M_2(\delta_{i2}^+ + \delta_{i2}^-) \qquad (2.13a)$$
$$\text{s.t.} \quad y_A \leq 4 \qquad (2.13b)$$
$$2y_B \leq 12 \qquad (2.13c)$$
$$3y_A + 2y_B \leq 18 \qquad (2.13d)$$
$$y_A + y_B \leq W_i \qquad (2.13e)$$
$$\delta_{ij}^+ - \delta_{ij}^- = Z_{ij} - x_j, \quad j = 1, 2 \qquad (2.13f)$$
$$y_A, y_B, \delta_{ij}^+, \delta_{ij}^- \geq 0, \quad j = 1, 2 \qquad (2.13g)$$
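For concreteness, a minimal sketch of the second-stage LP (2.13) using scipy is shown below; the sample observation and penalty values are illustrative assumptions, not numbers from the thesis.

```python
import numpy as np
from scipy.optimize import linprog

def profit(x, W_i, Z_i, M1, M2):
    """Second-stage Profit LP (2.13); variables y_A, y_B, d1+, d1-, d2+, d2-.
    Returns the optimal profit (in $1000s) for one observation (W_i, Z_i)."""
    c = np.array([-3, -5, M1, M1, M2, M2])       # linprog minimizes, so negate
    A_ub = np.array([[1, 0, 0, 0, 0, 0],         # y_A <= 4
                     [0, 2, 0, 0, 0, 0],         # 2 y_B <= 12
                     [3, 2, 0, 0, 0, 0],         # 3 y_A + 2 y_B <= 18
                     [1, 1, 0, 0, 0, 0]])        # y_A + y_B <= W_i
    b_ub = np.array([4, 12, 18, W_i])
    A_eq = np.array([[0, 0, 1, -1, 0, 0],        # d1+ - d1- = Z_i1 - x_1
                     [0, 0, 0, 0, 1, -1]])       # d2+ - d2- = Z_i2 - x_2
    b_eq = np.array([Z_i[0] - x[0], Z_i[1] - x[1]])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq)
    return -res.fun

# One hypothetical observation: 20 batches sold, 120 TV / 20 radio slots observed
print(profit(x=[122.3, 24.4], W_i=20, Z_i=[120, 20], M1=0.1, M2=0.1))
```

Evaluating this LP over many observations and penalty choices reproduces the kind of validation exercise summarized in Table 2.5 below.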
M_1    M_2    x_1     x_2    Objective      Validated C.I.
0      0      0.7     0      $60,750        [$6,230, $9,620]
0.1    0.1    122.3   24.4   $35,902        [$25,572, $28,518]
0.1    0.5    124.1   10.2   $33,804        [$20,130, $23,487]
1      1      128.1   12.9   -$48,432       [$21,762, $25,044]
10     10     145.2   24.9   -$827,220      [$28,341, $30,941]
100    100    146.8   24.5   -$8,568,612    [$28,404, $31,009]

TABLE 2.5: Validation Exercise using standard SP
Table 2.5 shows the difficulty with validation when one uses penalties in standard SP modeling. In this formulation the data $Z_{ij}$ (TV and radio advertising slots) are not relevant in the second stage because the production variables ($y_A, y_B$) are only constrained by the sales data ($W_i$). In the absence of a connection between advertising and sales (i.e., the regression) in the SP model, the penalties $M_1, M_2$ play a vicious role: regardless of the value assigned to these penalties, the total profit predicted by the model is overwhelmed by the arbitrary values (of the penalties). Consequently, the objective function estimates predicted by the model are untenable, as are the first stage decisions $x_1, x_2$ themselves.
The above situation includes some latent random effects, and without statistical modeling, it is not clear how these effects can be introduced into a decision model. If one considers an entirely data-driven SP model (i.e., without using a statistical model), then the connection between sales and advertising is not represented within the SP approach, and this gap makes the SP model untenable. Our approach to this issue is a combination of PO and SP. In general, we will refer to this approach, combining statistical learning and stochastic optimization, as Learning Enabled Optimization (LEO).
Chapter 3
LEO: Models and Concepts
3.1 Introduction
In recent years, optimization algorithms have become the work-horse of statistical (or machine) learning. Whether studying classification using linear/quadratic programming for support vector machines (SVM) or logistic regression using a specialized version of Newton's method, deterministic optimization algorithms have provided a strong foundation for statistical learning (SL).¹ Indeed, SL could be labeled as "optimization enabled learning". The class of models studied in this chapter, entitled Learning Enabled Optimization (LEO), is intended to leverage advances in statistical learning to support the work-flow associated with stochastic programming (SP).

¹ We will use the term statistical learning when the discussion is application-agnostic. When referring to specific applications we use the term machine learning.

In the realm of SP, it is customary to model uncertainty via random variables, which essentially means that there exists a reasonable hope
of approximating a distribution function to model random variables. Traditionally, the general statement of SP (Birge and Louveaux, 2011) is given in the form of the following optimization problem

$$\min_{x \in X} \ c(x) + E[H(x, \tilde{\omega})], \qquad (3.1)$$

where $x$ and $X$ denote a vector of decision variables and a feasible set (resp.), $c$ denotes a lower-semicontinuous cost function, and $\tilde{\omega}$ is a given random variable (with known distribution) which induces the random variable $H$ whose outcomes $h$ provide the so-called recourse (function) value (i.e., the value function of another optimization model). In cases where $\tilde{\omega}$ obeys a continuous distribution or even an empirical distribution with a large number of data points, one uses sampling to set up a more manageable version known as the Sample Average Approximation (SAA) problem. In the SL literature, such an approximation is known as Empirical Risk Minimization.
In order to pose the kind of questions which motivate this chapter, consider a common SL situation in which we have a dataset of i.i.d. observations $(Z_i, W_i)$ which represent covariates $(Z, W)$, generically referred to as predictors and responses respectively. Unlike SP, where the distribution is assumed to be available, the goal of a learning model is to discover a relationship (e.g., a regression) between the covariates $(Z, W)$. These statistical relationships can reveal a treasure-trove of information. Depending on the model, these relationships can predict outcomes of the response variable, given a realization $Z = z$. Thus a Learning Enabled Optimization (LEO) model is a parametric version of an SP stated as follows.
$$\min_{x \in X} \ c(x) + E[H(x, W \mid Z = z)], \qquad (3.2)$$

where $E[H(x, W \mid Z = z)]$ is the conditional expectation, given the parameter $z$. Note however that, unlike SP, the distribution of the random variable $W \mid Z = z$ is not available directly. Instead we may infer a distribution of $W \mid Z = z$ using the available dataset. Moreover, since we will be working with finite datasets, we will adopt the SL orientation based on training and validation, which requires that we undertake statistical tests a posteriori (after optimization).
In some applications, the predictors $Z$ assume values which do not belong to the same space as the decisions $x$. In such cases, the resulting model is said to be "LEO with disjoint spaces." At the other end of the spectrum, if $z = x$ in (3.2), then the resulting model is said to be "LEO with shared spaces." Such models often arise in areas such as design optimization, marketing optimization, and similar circumstances where decisions can be interpreted as "bets". The analogous SP models are known as problems with decision-dependent uncertainty. The mathematical structures for these two types of LEO models may be quite different (e.g., convexity or lack thereof).
Because the emphasis of SL is to work directly with the dataset (instead of a distribution), our plan for fusion involves using regression models to infer stochastic models, which can be used to represent the uncertainty impacting the decision environment. The current chapter adopts a two-stage stochastic programming approach, and we address several open questions as summarized below.

1. (Modeling) Expected Risk Minimization in SL and Sample Average Approximation in SP are subsumed under (3.1). However, when data is available as covariates $(Z, W)$, how should we accommodate decision models such as (3.2), where the conditional expectation can only be approximated? The process of creating such approximations will be referred to as Learning, and the methodology for solving the decision problem is referred to as Optimization.

2. (Statistical Optimality) In the presence of stochastic forecasts from SL, what statistical estimates can we provide regarding the probability of $\delta$-optimality of a proposed decision ($\delta > 0$)?

3. (Generalizability) How should one estimate "generalizability", which provides cost-to-go estimates of a decision beyond "in-sample" data (i.e., the ability to perform reasonably well even for outcomes which are not included within the training set)? Can we provide other commonly used statistical metrics (e.g., a 95% prediction interval of costs) to support decision-making?

4. (Model Validation and Selection) How should we validate the effectiveness of data-driven models so that we can defend decisions as being validated via statistical principles? Moreover, since it is common to examine multiple SL models, how should one select a model which provides the most appropriate decision?
This chapter is organized as follows. Connections to the litera-
ture are presented in section 3.2. In section 3.3, we present the fun-
damental structures, namely,“LEO Models with Disjoint Spaces” and
“LEO Models with Shared Spaces”. We illustrate the first of these
structures with an inventory control problem, and the second one
is illustrated via a production-marketing coordination problem. Be-
cause LEO models will allow both continuous and discrete random
variables, the statement of optimization will be relaxed to solutions
with high probability (greater than 95%, say) of optimality. This con-
cept, which is set forth in section 3.4, will be referred to as “statis-
tical optimality” for sequential sampling algorithms. In section 3.5
we study hypothesis tests for model validation. Such tests identify
the contenders (models) which may be most promising. In addition,
we also define a concept of generalization error for optimization. For
LEO models, this measure aims to quantify the degree of flexibility
expected in the decision model. This entire protocol is illustrated in
section 3.6 via computations for the examples introduced in section
3.3.
3.2 Connections to the Literature
In terms of scientific genealogy, one can trace the introduction of learn-
ing into optimization from the work on approximate dynamic pro-
gramming (ADP , Bertsekas, 2012, Powell, 2011) and approximate lin-
ear programming (ALP , e.g. De Farias and Van Roy, 2004). The canon-
ical structure of these approaches pertains to DP , where one uses ap-
proximations of the DP value function by using basis functions. In
this chapter, the canonical setup derives from constrained optimiza-
tion, although we will state our objectives in the context of approxi-
mate solutions. In this sense, one refers to the technical content of our
approach as “approximate stochastic programming.”
An alternative data-driven approach to modeling uncertainty is via the ideas of Robust Optimization (RO) (Bertsimas and Sim, 2004, Ben-Tal and Nemirovski, 2001), where performance is measured in terms of the ability to withstand extreme or near-extreme events. There are many applications (e.g., engineering design) where the ability to survive unforeseen circumstances is important. Slightly less demanding criteria come about via risk-averse optimization, where the decision-making model attempts to strike a balance between "risk" and "return" (e.g., Miller and Ruszczyński, 2011). The line of work pursued in this chapter pertains to SP using covariate data and estimated conditional expectations as in (3.2). With regard to data-driven SP, we mention the recent work of Van Parys, Esfahani, and Kuhn, 2017, where it is shown that minimizing disappointment (or regret) within a data-driven SP automatically leads to the solution of a Distributionally Robust Optimization (DRO) model. Nevertheless, their paper focuses on a data-driven version of (3.1), not (3.2). All of the above models serve alternative types of applications of decisions under uncertainty, and they each possess their "sweet spot" in terms of applications and the ability to cope with constrained decisions under uncertainty.
In keeping with our goals to accomplish more with data in OR/MS applications, there have been some studies of optimization methods with information gathering in predictive analytics (Frazier, 2012 and Ryzhov, Powell, and Frazier, 2012). That line of work is intended to help experimentalists improve the effectiveness of predictive models by using sensitivity of the response, via a concept known as the knowledge gradient. Their work uses algorithmic notions of optimization for experimental design (including simulation experiments). A more decision-oriented framework is proposed in Kao, Roy, and Yan, 2009, where the regression coefficients are chosen based on a combined objective consisting of a loss function associated with regression, and a quadratic objective for optimization. The assumptions of Kao, Roy, and Yan, 2009, namely, unconstrained quadratic optimization of both problems, render their simultaneous optimization manageable. However, if the optimization model were inequality constrained (as in many applications), such simultaneous optimization would lead to bilevel stochastic programs, which are much harder than the SP setting of the LEO model. Another viewpoint at the interface between predictive and prescriptive analytics is presented in the recent paper by Bertsimas and Kallus, 2014. Their work demonstrates that in the presence of random covariates (as in (3.2)), direct usage of SAA (using the data set $\{Z_i, W_i\}$) can produce decisions which are not consistent with optimality, even when the size of the dataset grows indefinitely. The reasons underlying this phenomenon may be attributed to the difference in model representations in (3.1) and (3.2).
Before closing this section, we also mention that the OM literature
has adopted a combined learning and optimization focus on specific
classes of problems. For instance, Liyanage and Shanthikumar, 2005
and more recently Ban and Rudin, 2018 have studied the integration
of optimization and learning to identify optimal inventory ordering
policies. Both papers use the specialized form of the newsvendor
model to illustrate the potential for learning and optimization. An-
other avenue of application-specific use of learning within optimiza-
tion arises from the introduction of specialized regularization in the
context of portfolio optimization in Ban, Karoui, and Lim, 2017. While
these special cases have made a case for the potential for the fusion of
SL and SP , this chapter focuses on some general principles of fusion,
via LEO.
3.3 Learning Enabled Optimization
Statistical Learning provides support for predictive analytics, whereas optimization forms the basis for prescriptive analytics, and the methodologies for these are built independently of each other. The process recommended for SL is summarized in Figure 3.1(a), in which the entire data set is divided into two parts (Training and Validation), with the former being used to learn model parameters, and the latter data set used for model assessment and selection. Once a model is selected, it can be finally tested via either simulation or using an additional "test data set" for trials before adoption. This last phase is not depicted in Figure 3.1 because the concepts for that phase can mimic those from the model validation phase. By inserting the prescriptive model into Figure 3.1(b), the LEO framework greatly enhances the potential for fusion.
FIGURE 3.1: Statistical Learning and Learning Enabled
Optimization
3.3.1 Model Instantiation
This section presents our aspirations for LEO models. As one might expect, this framework consists of two major pieces: the SL piece and the SP piece. We begin by stating a regression model in its relatively standard form. Towards this end, let $m$ denote an arbitrary regression model for a training dataset $\{W_i, Z_i\}$, indexed by $i \in T$. Here $W_i$ denotes the observation and $Z_i$ the predictor. For notational simplicity we assume that $W_i \in \mathbb{R}$, whereas $Z_i \in \mathbb{R}^p$. Given the training data, a class of deterministic models $\hat{\mathcal{M}}$, and a loss function $\ell$, the regression is represented as follows:

$$\hat{m} \in \operatorname*{argmin} \left\{ \frac{1}{|T|}\sum_{i \in T} \ell(m) \ \middle|\ m \in \hat{\mathcal{M}} \right\}. \qquad (3.3)$$

We wish to emphasize that in many circumstances (e.g., modeling the impact of natural phenomena such as earthquakes, hurricanes etc.), model fidelity may be enhanced by building the statistical model of the phenomena independently of decision models. For such applications, one may prefer to treat the development of the SL piece prior to instantiating the SP piece.
Alternative Error Models
Empirical additive errors: Suppose that an error random variable has outcomes defined by $\xi^i := W_i - \hat{m}(Z_i)$. By associating equal weights with these outcomes, one obtains a discrete random variable with an empirical distribution. For the case of multiple linear regression (MLR), $\hat{m}(z) = \sum_t \beta_t z_t$, where $t$ ranges over the index set of predictors. In that case, the empirical additive errors are given by substituting the deterministic affine function in place of $\hat{m}$. For $t = 0$, it is customary to use $z_0 = 1$. When the distribution does not depend on any specific choice of $Z = z$, the random variable is said to be homoscedastic (as is commonly assumed for MLR). For a more general setup (known as projection pursuit (Friedman and Stuetzle, 1981)), one may define $\hat{m}(z) = \sum_{t \in \mathcal{T}} f_t(\beta_t^\top z)$, where $\mathcal{T}$ is a finite index set. Again, the errors are assumed to have the same homoscedasticity properties. This projection pursuit approach leads to a rather general setting due to a result of Diaconis and Shahshahani, 1984, which suggests that nonlinear functions of linear combinations of general directions (as opposed to coordinate directions) can produce arbitrarily close approximations of smooth nonlinear functions. We should note that the notion of homoscedasticity translates to stationary error distributions in the context of time series models, and there are standard autocorrelation and partial autocorrelation tests which can discern stationarity. While on the subject of error distributions, we should mention the work of Rios, Wets, and Woodruff, 2015, which uses splines and epi-splines to model additive error distributions (see also Royset and Wets, 2014). However, the corresponding error distributions may not necessarily satisfy homoscedasticity. Consequently, extensions to non-convex optimization would be required to accommodate such generality.
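A minimal sketch of the EAE construction for MLR, on synthetic stand-in data: fit $\hat{m}$ by least squares, then treat the residuals $\xi^i = W_i - \hat{m}(Z_i)$, each with weight $1/|T|$, as an empirical error distribution attached to predictions inside the SP. The data-generating numbers below are assumptions for illustration only.

```python
import numpy as np

# Hypothetical training data: p = 2 predictors (say, TV and radio slots)
rng = np.random.default_rng(0)
Z = rng.uniform(0, 100, size=(100, 2))
W = 3.0 + 0.05 * Z[:, 0] + 0.10 * Z[:, 1] + rng.normal(0, 1.5, 100)

A = np.column_stack([np.ones(len(Z)), Z])     # z_0 = 1 for the intercept
beta, *_ = np.linalg.lstsq(A, W, rcond=None)  # least-squares instance of (3.3)
xi = W - A @ beta                             # empirical additive errors

# Scenarios of the stochastic model m(z, xi) at a new point z:
z_new = np.array([1.0, 50.0, 25.0])
scenarios = z_new @ beta + xi                 # one outcome per training point
print("prediction:", z_new @ beta, " scenario std:", scenarios.std())
```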
Multi-dimensional errors: Define a stochastic function $m(z, \xi) = \sum_t \tilde{\beta}_t \phi_t(z)$, where $\phi_0 = 1$ and the $\phi_t(\cdot)$ are deterministic functions, but the parameters $\tilde{\beta}_t$ are elements of a multivariate random variable. Let $\hat{m}(z) = \sum_t \bar{\beta}_t \phi_t(z)$, where $\bar{\beta}_t = E(\tilde{\beta}_t)$. In this case, a vector of errors associated with an outcome of the random coefficients $\{\tilde{\beta}_t\}$ is given by the difference $\tilde{\xi}_t = \tilde{\beta}_t - E(\tilde{\beta}_t)$. The coefficients $\{\tilde{\beta}_t\}$ may be correlated, but their multivariate distribution is assumed to not depend on any specific choice of $Z = z$ (homoscedasticity).
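A sketch of the multi-dimensional error model for MLR, assuming (as is standard for ordinary least squares) that the fitted coefficients are treated as multivariate normal with mean $\bar{\beta}$ and covariance $s^2 (A^\top A)^{-1}$; the data are the same synthetic stand-ins as above. Zeroing the off-diagonal covariance entries would give the uncorrelated variant discussed later in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.uniform(0, 100, size=(100, 2))
W = 3.0 + 0.05 * Z[:, 0] + 0.10 * Z[:, 1] + rng.normal(0, 1.5, 100)

A = np.column_stack([np.ones(len(Z)), Z])
beta_bar, *_ = np.linalg.lstsq(A, W, rcond=None)
resid = W - A @ beta_bar
s2 = resid @ resid / (len(W) - A.shape[1])     # residual variance estimate
cov = s2 * np.linalg.inv(A.T @ A)              # full covariance: correlated
# np.diag(np.diag(cov)) would give the uncorrelated variant instead
beta_tilde = rng.multivariate_normal(beta_bar, cov, size=1000)
xi = beta_tilde - beta_bar                     # coefficient error outcomes

z_new = np.array([1.0, 50.0, 25.0])
print("std of m(z, xi) across outcomes:", (beta_tilde @ z_new).std())
```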
Remark 1. (a) In SL it is common to formulate a stochastic model $m(z, \xi)$, although predictions are deterministic and made with $\hat{m}(Z = z)$. In contrast, LEO will use an optimization model which recognizes the inherent randomness of $m(z, \xi)$. For this reason, the deterministic model $\hat{m}(z)$ is not as relevant (for LEO) as the stochastic model $m(z, \xi)$. Such stochasticity will provide the optimization model a sense of the uncertainty faced by the decision-maker, although the decisions produced by LEO must also be deterministic (at least for the first period, i.e., here-and-now). To draw another distinction between Directed Regression and LEO, we observe that the former uses the deterministic model $\hat{m}(z)$ rather than the stochastic model $m(z, \xi)$. (b) Note that alternative stochastic models $m$ may lead to the same deterministic model $\hat{m}$. This is certainly the case for MLR. For this reason, the LEO setting will consider a finite list of plausible alternative stochastic models, which will be indexed by the letter $q$. We suggest representing the class of models by $(\ell_q, \hat{\mathcal{M}}_q)$, and the specific deterministic model obtained in (3.3) is denoted $\hat{m}_q$. Whenever the specific index of a model is not important, we will refer to the deterministic model as $\hat{m}$, and its stochastic counterpart by $m$.
Assumption 1 (A1). $\{W_i, Z_i\}$ are assumed to be i.i.d. observations of the data process $\{W, Z\}$. Moreover, we assume that the errors are homoscedastic.

Assumption 2 (A2). We will assume that decisions in the SP model, denoted $x$, have no impact on the continuing data process $\{(W, Z)\}$ to be observed in the future.

To put this assumption in the context of some applications, note that in the advertising/financial market, it may be assumed that an individual advertiser/investor may not be large enough to change future market conditions.
For the remainder of this section, we present two alternative structures for the SP part of a LEO model.

LEO Models with Disjoint Spaces.

This is the simplest version of a LEO model, in which the statistical inputs (denoted $Z$) do not assume values in the space of optimization variables ($x$). Let $x \in X \subseteq \mathbb{R}^{n_1}$ denote the optimization variables, and suppose that the predictors $Z$ have $p$ elements indexed by the set $\mathcal{J} = \{1, \ldots, p\}$. As suggested in Remark 1(b), SL models $m_q$ are now assumed to be given. For the case of MLR, the parameters of the regression are Gaussian random variables with a corresponding distribution $P_q$. Then we have an objective function of the following form:

$$f_q(x) := c(x) + E_{\xi_q}[H(x, \tilde{\xi}_q \mid Z = z)] \qquad (3.4)$$

where $c: \mathbb{R}^{n_1} \to \mathbb{R}$ is a convex function, the expectation in (3.4) is calculated with respect to the rv $\tilde{\xi}_q \mid Z$, and this rv induces a random variable $H$ whose outcomes $h$ are given as follows:

$$h(x, \xi_q \mid Z = z) = \min\{d(y) \mid g(x, y) - m_q(z, \xi_q) \leq 0, \ y \in Y \subseteq \mathbb{R}^{n_2}\} \qquad (3.5)$$

Here $g: \mathbb{R}^{n_1 + n_2} \to \mathbb{R}$. Clearly these constraints could be multi-dimensional, but some of the same conceptual challenges would persist. Nevertheless, there is an important algorithmic advantage resulting from the disjoint structure: because $z$ and $x$ belong to disjoint spaces, optimization with respect to $x$ is not restricted by $z$. Hence, (3.4) can be optimized by simply passing the predicted values $m_q(z, \xi_q)$, and the complexity of $m_q$ does not impact the optimization part of the LEO model. As a result, very general regressions (e.g., kernel regression and others) can be included for the prediction process in (3.4).
Assumption 3 (A3-a). We assume that $c$, $d$, $g$ are convex functions, the sets $X$, $Y$ are convex sets, and $h(\cdot, \xi)$ is a convex function in $x$ as well.

The value function $h$ defined in (3.5) goes by several alternative names, such as the "recourse" function in SP or the "cost-to-go" function in DP. While the underpinnings of LEO models are closer to SP than DP, we adopt the "cost-to-go" terminology because we plan to extend LEO models to allow other types of "cost" predictions in future papers (e.g., forecasts of computational times, and other elements of an uncertain future). Note that $H$ is a random cost-to-go function, $h$ denotes its outcomes, and $E_\xi[H]$ is the expected cost-to-go function.
LEO Models with Shared Spaces.

We continue with assumptions A1-A3, and will include another assumption for this class of models. Consider an SP model whose decisions $x$ have a subset of variables ($x_j$, $j \in J \subseteq \mathcal{J}$) which take values in the same space as the predictor data. Hence these models will be referred to as "models with Shared Spaces", and the objective function we formulate is as follows:

$$f_q(x) := c(x) + E_{\xi_q}[H(x, \tilde{\xi}_q \mid Z = z, z_r = x_r, r \in J)], \qquad (3.6)$$

where $H$ is a convex function over the feasible set $X$, and the outcomes $h$ are defined as follows:

$$h(x, \xi_q \mid Z = z, z_j = x_j, j \in J) := \min\{d(y) \mid g(x, y) - m_q(z, \xi_q) \leq 0, \ y \in Y \subseteq \mathbb{R}^{n_2}\} \qquad (3.7)$$
As in (3.5), we assume that $h(x, \cdot)$ defined in (3.7) is a convex function in $x$. Note that in this form of the model, the decision maker is called upon to make a "bet" ($x_j$, $j \in J$), and the response is a rv $H(x, \tilde{\xi}_q \mid Z = z, z_j = x_j, j \in J)$. While the objective functions of both the Disjoint and Shared Spaces models have similar forms, the interplay between decisions and random variables is different. To accommodate this, we state the following assumptions.

Assumption 3 (A3-b). In addition to Assumption (A3-a), (3.7) imposes the assumption that $m_q(z, \xi_q)$ is concave in $z_j$, $j \in J$, for all $\xi$ except on a set of $P_q$-measure zero.
Assumption 4 (A4). When decisions $x$ are allowed to assume values in a subspace of observations of the rv $Z$, we assume that $X$ is a subset of $P_J(\text{conv}\{Z_i\}_{i \in T})$, where the notation $P_J(\cdot)$ denotes the projection onto the subspace of variables indexed by $J$. If the number of decision variables is the same as those in $Z$, then the set of feasible solutions ($X$) is compact as well.
Due to the structural differences between (3.4) and (3.6), we will use a time-series model (ARIMA) to illustrate models with a disjoint structure, whereas the illustration for the latter will be restricted to MLR, because this form is much easier to manage in the shared-space case.

Mathematical Formulation of a LEO Model. As mentioned in Remark 1, the decision model in LEO is an SP for which $m(z, \xi)$ is more critical than $\hat{m}(z)$, which is a deterministic forecast. The mathematical statement of a LEO model will be based on the recognition that importing a statistical model into an optimization problem can be demanding, but the payoff (via increased flexibility, and greater optimization) may be significant; the illustrative examples included in this chapter demonstrate clear advantages of such a fusion. Due to the possibility that the parameters of a statistical model (e.g., MLR) can include several coefficients which are continuous random variables (e.g., Gaussian), evaluating $f(x)$ may require some sampling-based SP algorithm. One of the more popular approaches for such SP is the Sample Average Approximation (SAA), summarized in section 3.4. However, it has been observed by many authors (e.g., Mello and Bayraksan, 2014) that the sample size recommended by SAA theory can be extremely conservative (see also comments regarding sample sizes in the concluding section). For this reason, the algorithmic use of SAA consists of solving a sequence of sampled problems, each using a larger sample size (e.g., Linderoth, Shapiro, and Wright, 2006). Since most deterministic algorithms used for solving the SAA problem are not designed for changes in sample size, the algorithmic exercise becomes computationally burdensome due to the lack of coordination between solution algorithms and stopping criteria.
In order to meet the aspirations of a modeler (in search of decisions with performance guarantees), we state the LEO model in terms of a probabilistic guarantee. Let $P_q$ denote the probability distribution of a random variable $\xi_q$. For LEO models, we will seek a pair $(x_q, \gamma_q)$ such that for a pre-specified accuracy tolerance $\delta > 0$, we have

$$\gamma_q := P_q(x_q \in \delta\text{-argmin}\{f_q(x) \mid x \in X\}), \qquad (3.8)$$

with $\gamma_q \geq \gamma$ ($= 0.95$, say). As the reader might notice, (3.8) states our aspirations for a LEO model in such a manner that we can report the following critical quantities: $\delta$, $\gamma_q$, $x_q$ for a model indexed by $q$. The manner in which we verify these conditions will be discussed in the section on Statistical Optimality.
3.3.2 Examples of LEO Models
Example 1. Inventory Control: A LEO model with Disjoint Spaces - LEO-ELECEQUIP

The ELECEQUIP data-set in R provides 10 years of demand data for electrical equipment. We present an inventory control model with this data-set. Consider making equipment ordering choices in period $t$ based on demand data from previous periods (i.e., periods $j < t$). Since the optimization model chooses decisions for periods $j \geq t$, we treat the optimization variables and those of the statistical model as disjoint. Clearly, this property holds for rolling horizon models as well. In essence, disjoint spaces allow the statistical and optimization models to operate by simply passing values assumed by rvs. Because of the disjoint spaces, such a LEO model can entertain reasonably complex descriptions of data (e.g., time series, nonlinear models and the so-called link functions). Our preliminary results provide an example using an ARIMA model, with (p,d,q), (P,D,Q) = (0,0,0), (1,1,0). This notation (p,d,q) is standard for ARIMA, and represents the number of periods used in predicting three parts of the stationary series (i.e., after removal of trends): autoregressive terms (p), integration terms (d) and moving average errors (q). The quantities P, D, Q refer to the "seasonal" component, which in this case was annual. In choosing these parameters, it is customary to test for stationarity using autocorrelation and partial autocorrelation functions (ACF and PACF) (Box et al., 2015). Note that because P=1, D=1, the ARIMA model implies that two previous periods (from a year ago) are necessary for forecasting the demand for period $t = 1$. In order to acknowledge the rolling-horizon nature of the decision models, we also include an additional period of look-ahead after period 1. Thus we have a three period model with $t = 0, 1, 2$. The detailed model formulation is provided in section 3.6.
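A minimal sketch of the SL piece for this example, using statsmodels' SARIMAX with the (0,0,0)x(1,1,0) annual specification; the series below is a synthetic stand-in for ELECEQUIP (not the actual data), and the simulated paths would be passed to the inventory model as demand scenarios for the look-ahead periods.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic monthly demand series standing in for ELECEQUIP
rng = np.random.default_rng(0)
idx = pd.date_range("1996-01", periods=120, freq="MS")
demand = pd.Series(80 + 10 * np.sin(np.arange(120) * 2 * np.pi / 12)
                   + 0.1 * np.arange(120) + rng.normal(0, 3, 120), index=idx)

# ARIMA (p,d,q) x (P,D,Q)_12 = (0,0,0) x (1,1,0)_12, as in the example
model = SARIMAX(demand, order=(0, 0, 0), seasonal_order=(1, 1, 0, 12))
res = model.fit(disp=False)

# Demand scenarios for periods t = 1, 2 of the three-period inventory model;
# because of disjoint spaces, only these simulated values cross the interface.
scenarios = res.simulate(nsimulations=2, anchor="end", repetitions=500)
print("scenario means:", np.asarray(scenarios).reshape(2, -1).mean(axis=1))
```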
Example 2. Production-Marketing Coordination: A LEO model with Shared Spaces - LEO-Wyndor

An important piece of data for production planning is predicted sales, which in turn depends on how much advertising is carried out. Suppose that a company buys advertising slots (units of time) on several media channels; then this decision has an impact on future sales figures. We present an example which we refer to as LEO-Wyndor, in which the decision vector $x$ represents the allocation of advertising time slots to each type of media (TV and radio). The name Wyndor and the production part of this problem are borrowed from a very popular OR textbook (Hillier and Lieberman, 2012). Our example extends the original Wyndor model to one in which the production plan is to be made while bearing in mind that the allocation of the advertising time slots affects sales of Wyndor products (two types of doors), and the final production plan will be guided by firm orders (sales) in the future. For this example, a statistical model predicting future sales is borrowed from the advertising data set of James et al., 2013, which also presents an MLR model relating sales (W) with advertising slots (Z) on TV and radio. In this example, the advertising decisions constitute a "bet" on the first stage (advertising) decisions $x$, and the second stage decisions are the production planning choices, given "firm orders" (sales). A specific numerical instance of this model is given in section 3.6.
This is a case where a linear regression model reveals the relationship between advertising and sales. When we include an SL model in the production planning model, we are able to accommodate the relationships discovered by a learning model. Because of the following assumptions of linear regression, one can justify the fusion of regression and SP. We have the following assumptions underlying this statistical estimation model.

1. Linearity: A linear fit must be deemed appropriate for any Multiple Linear Regression (MLR) model.

2. Independence: The data $(W_i, Z_i)$ are assumed to be i.i.d.

3. Normality: Another standard assumption in regression is that the error (i.e., the difference between the "least-squares" estimate and the observed data) is approximately normally distributed.

4. Error Distribution: This refers to the property that the distribution of the error $\varepsilon$ does not vary with the choice of $x$.
If all these assumptions hold, it can be shown that the variance satisfies $\sigma^2 = E\left[\frac{\sum_{i=1}^{N} \varepsilon_i^2}{N - n - 1}\right]$. Hence with data points indexed by $i$, one can estimate $\sigma^2$ as follows:

$$s^2 = \frac{\sum_{i=1}^{N} \varepsilon_i^2}{N - n - 1}, \qquad (3.9)$$

where $N$ is the number of data points, and $n + 1$ is the total number of parameters $\{\beta_j\}_{j=0}^{n}$ in MLR.
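In code, the estimate in (3.9) is a one-liner; a minimal sketch with stand-in residuals:

```python
import numpy as np

def sigma2_hat(residuals, n_params):
    """Estimate of sigma^2 as in (3.9): sum of squared residuals divided by
    the degrees of freedom N - n - 1 (here n_params = n + 1)."""
    N = len(residuals)
    return residuals @ residuals / (N - n_params)

eps = np.random.default_rng(0).normal(0, 1.5, 100)  # stand-in residuals
print(sigma2_hat(eps, n_params=3))                  # N - n - 1 = 97 dof
```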
Ordinarily, for somewhat large data sets, one should be able to justify that the parameters yield stable estimates of the estimated mean $\beta_0 + \sum_j \beta_j x_j$, and the random variable $\tilde{\varepsilon}$ can be approximated by a normal distribution when the number of degrees of freedom is large enough. In the statistical literature, it is customary to suggest that when the degrees of freedom (i.e., $N - n - 1$) exceed about 60, the parameters $\{\beta_j\}_{j=0}^{n}$ are quite stable, and the errors $\varepsilon$ are approximately $N(0, s^2)$. Thus, for the example with the advertising data set, the number of data points is $N = 100$ (for the training data set), and the number of parameters being used is $n + 1 = 3$. Hence, the number of degrees of freedom far exceeds the suggested 60, and as a result, $\varepsilon$ in such examples may be reasonably approximated by using normal random variates from $N(0, s^2)$.
In case of the LEO-Wyndor example, several alternative LEO models are plausible based on the SL model used. When we use empirical additive errors (EAE), the MLR coefficients ($\beta_j$, $j = 0, 1, 2$) are fixed scalars, and they become the deterministic entries of the matrix $C$, while the empirical additive error $\xi$ is the random right hand side. On the other hand, when the MLR model with normally distributed coefficients $\{\tilde{\beta}_j\}_{j=0,1,2}$ is used, the entries of $C$ are also random. If the cross-correlations among these coefficients are ignored, then the covariance matrix of the entries of $C$ is diagonal (normally distributed, uncorrelated random variables, denoted NDU). On the other hand, if the cross-correlations of $\{\tilde{\beta}_j\}_{j=0,1,2}$ are included, then the resulting LEO model has error terms which are normally distributed and correlated (denoted NDC).
While these alternative statistical models are plausible, statistical theory recommends that we choose the model with the least generalization error (in a statistical sense). Since the goal of LEO models is to recommend optimal decisions (e.g., order quantities or advertising allocations), we will choose that $x_q$ for which the validated objective value $\hat{f}$ turns out to be the lowest, where

$$\hat{f}(x_q) := c(x_q) + (1/|V|)\sum_{j \in V} h(x_q, W_j \mid z = Z_j). \qquad (3.10)$$

Note that this validation process is based on observations set aside for the validation set $V$, and is entirely data-driven. Hence the index $q$ only appears for the solutions $x_q$, but this index does not appear in the definition of $\hat{f}$. Finally, there are some obvious amendments to the definition of $\hat{f}$ when we use cross-validation (see section 3.5).
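A minimal sketch of this model-selection rule, with hypothetical first-stage cost, recourse function, and candidate decisions standing in for the LEO-Wyndor quantities; only the selection logic of (3.10) is intended to be faithful.

```python
import numpy as np

def validated_objective(x_q, c, h, V):
    """Validated objective (3.10): first-stage cost plus the average
    second-stage cost over held-out validation pairs (W_j, Z_j) in V."""
    return c(x_q) + np.mean([h(x_q, W_j, Z_j) for (W_j, Z_j) in V])

# Hypothetical stand-ins for the validation set and model components
rng = np.random.default_rng(0)
V = [(rng.uniform(5, 20), rng.uniform(0, 150, size=2)) for _ in range(50)]
c = lambda x: 0.1 * x[0] + 0.5 * x[1]                        # toy ad cost
h = lambda x, W, Z: max(0.0, W - 0.05 * x[0] - 0.10 * x[1])  # toy recourse

# Candidate decisions from two alternative stochastic models (e.g., EAE, NDC)
candidates = {"EAE": np.array([120.0, 25.0]), "NDC": np.array([80.0, 60.0])}
scores = {q: validated_objective(x, c, h, V) for q, x in candidates.items()}
print("selected model:", min(scores, key=scores.get), scores)
```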
3.4 Parallel Replications and Statistical Optimality
Statistical optimality bounds have been studied in the literature for a while (e.g., Higle and Sen, 1991, Higle and Sen, 1996b, Mak, Morton, and Wood, 1999, Kleywegt, Shapiro, and Mello, 2002, Bayraksan and Morton, 2011, Glynn and Infanger, 2013). A complete mathematical treatment of these concepts appears under the banner of "Statistical Validation" in Shapiro, Dentcheva, and Ruszczynski, 2009, and a detailed tutorial for SAA appears in Mello and Bayraksan, 2014. In addition, there have been theoretical investigations on how the computational budget ought to be allocated so that increases in sample size can be determined in an online manner (e.g., Bayraksan and Pierre-Louis, 2012, Royset and Szechtman, 2013). Despite this relatively long history, their use in identifying near-optimal decisions for realistic instances has been limited. There are at least two hurdles to overcome here: 1) as mentioned earlier, the sample size requirements predicted by the current theory lead to relatively large approximation problems, and 2) while replications of sampled algorithms are important for lower variance estimates of the objective function $f$, the unfortunate reality of sampling-based optimization is that replications may introduce significant variability in decisions (see the experience with SSN reported in Freimer, Linderoth, and Thomas, 2012). The current section extends the compromise-decision result of section 2.2.2 to a situation in which statistical optimality of a compromise decision can be verified up to an accuracy of $\delta$ as in (3.8). To accomplish this goal, we combine the algorithmic framework of Sen and Liu, 2016 and the convergence rate results of SAA (section 5, Shapiro, Dentcheva, and Ruszczynski, 2009) to obtain a distribution-free estimate of the probability of optimality of the proposed decision. Thus our approach combines concepts from external sampling (SAA), as well as internal sampling (SD), within one common framework.
In case of the former, the analysis is typically presented with a view towards an existence result which shows that for a finite sample size $N$, the SAA setup provides an $\epsilon$-optimal solution with high probability. Unfortunately, the sample size estimates for SAA are, more-or-less, existence results because some of the parameter estimates are unavailable a priori (i.e., before the algorithm is executed). Hence one can only use very loose bounds for some of the Lipschitz constants, and as a result the sample sizes themselves are either not computable, or so large that the authors suggest that they are too conservative for practical purposes (Shapiro, Dentcheva, and Ruszczynski, 2009).
In the following, we impose the assumptions required for the asymptotic convergence of SD. Let $\nu = 1, \ldots, M$ denote the index of replications; for each $\nu$, the SD algorithm is assumed to run for $K_\nu(\epsilon)$ samples, producing a terminal solution $x_\nu(\epsilon)$ and a terminal value $f_\nu^\epsilon$, where $\epsilon$ is the stopping tolerance used for each replication. From section 4, note that the grand-mean approximation is $\bar{F}_M(x) := (1/M)\sum_{\nu=1}^{M} f_\nu(x)$, where $\{f_\nu\}_{\nu=1}^{M}$ denotes the terminal value function approximations for each replication $\nu$. In addition, $\bar{x} = (1/M)\sum_\nu x_\nu$, and the compromise decision $x_c$ is defined by

$$x_c \in \arg\min \left\{ \bar{F}_M(x) + \frac{\bar{\rho}}{2}\|x - \bar{x}\|^2 : x \in X \right\},$$

where $\bar{\rho}$ is the sample average of $\{\rho_\nu\}$, the terminal proximal parameters of the replications.
Theorem 2 Assume $X$ is non-empty, closed and convex, and the approximations $f_\nu$ are proper convex functions over $X$. For $\delta = \bar{\rho}\|x_c - \bar{x}\|^2$, we have

$$\frac{1}{M}\sum_{\nu=1}^{M} f_\nu^\epsilon + \delta \ \geq\ \bar{F}_M(x_c), \qquad (3.11)$$

which implies $x_c$ is a $\delta$-argmin of $\frac{1}{M}\sum_{\nu=1}^{M} f_\nu(\cdot)$, and the tolerance level satisfies $\delta = \bar{\rho}\|x_c - \bar{x}\|^2$.
Proof: Since $x_c \in \arg\min\{\bar{F}_M(x) + \frac{\bar{\rho}}{2}\|x - \bar{x}\|^2 : x \in X\}$, we have

$$0 \in \partial \bar{F}_M(x_c) + N_X(x_c) + \bar{\rho}(x_c - \bar{x}).$$

Hence $-\bar{\rho}(x_c - \bar{x})$ can be used as a subgradient of the function $\bar{F}_M(x) + I_X(x)$ at $x = x_c$. Hence, for all $x \in X$,

$$\bar{F}_M(x) + I_X(x) \ \geq\ \bar{F}_M(x_c) + I_X(x_c) - \bar{\rho}(x_c - \bar{x})^\top (x - x_c).$$

Since $\bar{x}, x_c \in X$, the indicator terms vanish, and therefore

$$\bar{F}_M(\bar{x}) + \bar{\rho}(x_c - \bar{x})^\top (\bar{x} - x_c) \ \geq\ \bar{F}_M(x_c).$$

Since $\bar{\rho}(x_c - \bar{x})^\top (\bar{x} - x_c) \leq \bar{\rho}\|x_c - \bar{x}\|\,\|\bar{x} - x_c\| = \bar{\rho}\|x_c - \bar{x}\|^2$, we have

$$\bar{F}_M(\bar{x}) + \bar{\rho}\|x_c - \bar{x}\|^2 \ \geq\ \bar{F}_M(x_c). \qquad (3.12)$$

Recall that $\bar{x} = \frac{1}{M}\sum_\nu x_\nu$ and $\bar{F}_M$ is convex; therefore $\bar{F}_M(\bar{x}) \leq \frac{1}{M}\sum_\nu \bar{F}_M(x_\nu)$. Because $f_j(x_\nu) \leq f_\nu(x_\nu)$ for all pairs $(j, \nu)$, and $f_\nu^\epsilon = f_\nu(x_\nu)$, we have

$$\bar{F}_M(\bar{x}) \ \leq\ \frac{1}{M}\sum_\nu \bar{F}_M(x_\nu) \ \leq\ \frac{1}{M}\sum_\nu f_\nu^\epsilon. \qquad (3.13)$$

Combining (3.12) and (3.13), we get

$$\frac{1}{M}\sum_\nu f_\nu^\epsilon + \delta \;=\; \frac{1}{M}\sum_\nu f_\nu^\epsilon + \bar{\rho}\|x_c - \bar{x}\|^2 \;\geq\; \bar{F}_M(x_c). \qquad \blacksquare$$
If we define $\hat{S}_M(\delta) = \{x \in X \mid \bar{F}_M(x) \leq \frac{1}{M}\sum_\nu f_\nu(x_\nu) + \delta\}$, let $f^*$ denote the optimal value, and let $S(\delta_u) = \{x \in X \mid f(x) \leq f^* + \delta_u\}$, then Theorem 2 has proved that $x_c \in \hat{S}_M(\delta)$. Note that $S(\delta_u)$ defines the solution set which is $\delta_u$-optimal with respect to the true problem, and as a result, we should also analyze the relationship between $x_c$ and $S(\delta_u)$.
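To make the compromise-decision construction above concrete, the following is a minimal sketch, assuming each replication's terminal approximation $f_\nu$ is available as a piecewise-linear lower envelope of cuts and $X$ is a simple box; the data layout, function names, and the projected-subgradient solver are illustrative choices, not the SD implementation used in this thesis.

```python
import numpy as np

def pwl_value_and_subgrad(cuts, x):
    """Evaluate f(x) = max_k (alpha_k + beta_k @ x) for a list of cuts
    (alpha_k, beta_k), and return the value and one subgradient."""
    vals = np.array([a + b @ x for (a, b) in cuts])
    k = int(np.argmax(vals))
    return vals[k], cuts[k][1]

def compromise_decision(replication_cuts, x_bar, rho_bar, lb, ub,
                        iters=5000, step0=1.0):
    """Projected subgradient sketch for
       min F_bar_M(x) + (rho_bar/2) * ||x - x_bar||^2  over the box [lb, ub],
    where F_bar_M averages the replications' terminal approximations."""
    x = x_bar.copy()
    best_x, best_val = x.copy(), np.inf
    M = len(replication_cuts)
    for t in range(1, iters + 1):
        vg = [pwl_value_and_subgrad(c, x) for c in replication_cuts]
        F = sum(v for v, _ in vg) / M + 0.5 * rho_bar * np.sum((x - x_bar) ** 2)
        g = sum(gr for _, gr in vg) / M + rho_bar * (x - x_bar)
        if F < best_val:
            best_val, best_x = F, x.copy()
        x = np.clip(x - (step0 / np.sqrt(t)) * g, lb, ub)  # project onto box
    return best_x

# Tolerance of Theorem 2 for the returned x_c:
#   delta = rho_bar * np.sum((x_c - x_bar) ** 2)
```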
Unless one restricts the model to using only Empirical Additive Errors (EAE), it is difficult for a user to prescribe a sample size for a stochastic programming model, for reasons discussed earlier (see also the computational experience reported in Sen and Liu, 2016). Hence we do not recommend this approach for cases where the distribution used is continuous, or discrete with a countably infinite number of outcomes. Instead, we use SD to suggest sample sizes, and estimate the probability that a recommendation satisfies $x_c \in S(\delta_u)$. The following theorem gives the probability bound for $x_c \in S(\delta_u)$.
Theorem 3 Let $F(x, \tilde{\xi}) := c(x) + H(x, \tilde{\xi})$ denote the objective rv in (3.4) and (3.6). Suppose for each outcome $\xi$, $\kappa(\xi)$ satisfies $|F(x', \xi) - F(x, \xi)| \leq \kappa(\xi)\|x' - x\|$. We define the Lipschitz constant of $\mathbb{E}_\xi[F(x, \tilde{\xi})]$ as $L = \mathbb{E}_\xi[\kappa(\tilde{\xi})]$. Suppose $X \subseteq \mathbb{R}^n$ has a finite diameter $D$, $M$ stands for the number of replications in solving the SP, $N$ denotes the minimum sample size over all the replications, and let the tolerance level $\delta_u > \delta$, with $\delta$ defined in Theorem 2. Then we have the following inequality:

$$\mathrm{Prob}\left(\hat{S}_M(\delta) \subseteq S(\delta_u)\right) \ \geq\ 1 - \exp\left( -\frac{NM(\delta_u - \delta)^2}{32 L^2 D^2} + n \ln\!\left( \frac{8LD}{\delta_u - \delta} \right) \right). \qquad (3.14)$$
Proof: If we solve for the probability from (1.2) in Proposition 1, the following inequality holds:

$$\mathrm{Prob}\left(\hat{S}_M(\delta) \subseteq S(\delta_u)\right) \ \geq\ 1 - \exp\left( -\frac{K(\delta_u - \delta)^2}{8 \lambda^2 D^2} + n \ln\!\left( \frac{8LD}{\delta_u - \delta} \right) \right). \qquad (3.15)$$

From assumption SAA-c in section 1, $\lambda = 2L$. Also, recall from (1.3) that each replication uses a sample size of at least $N$; therefore, in this case the total sample size $K$ is at least $NM$. The conclusion holds by substituting $\lambda$ and $K$ in (3.15). $\blacksquare$
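As a quick numerical illustration of (3.14), the following sketch computes the right-hand side for user-supplied constants; the values of L and D shown are placeholders, since these quantities are instance-specific.

```python
import math

def optimality_probability(N, M, delta, delta_u, L, D, n):
    """Lower bound (3.14) on Prob(S_hat_M(delta) is contained in S(delta_u))."""
    gap = delta_u - delta
    exponent = -N * M * gap**2 / (32 * L**2 * D**2) + n * math.log(8 * L * D / gap)
    return 1.0 - math.exp(exponent)

# Placeholder constants, for illustration only:
print(optimality_probability(N=500, M=30, delta=0.05, delta_u=0.5,
                             L=1.0, D=10.0, n=2))
```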
Remark 2. To the best of our knowledge, the sample size formulas for SAA (section 5, Shapiro, Dentcheva, and Ruszczynski, 2009) are not intended for use in setting up computational instances for solution. Instead, their primary role has been in showing that the growth of the sample size for SAA depends logarithmically on the size of the feasible set and the reliability level $(1-\alpha)$. Our approach, seeking probabilistic bounds, allows us to estimate the reliability of a solution for a sampling-based algorithm.
We make two other observations in connection with the vision of
statistical optimality: (i) due to replications, there is a natural affinity
towards parallel algorithms, and (ii) it promotes the use of adaptive
solution algorithms which can increase the sample size of any repli-
cation without having to restart the algorithmic process from scratch.
The SD algorithm for SLP problems, and the convex SD algorithm in-
troduced in the next chapter illustrate this strategy.
3.5 Model Validation, Assessment and Selection
The stochastic programming (SP) literature has some foundational results for assessing solution quality, as proposed in Mak, Morton, and Wood, 1999, Shapiro and Mello, 1998, and Higle and Sen, 1996a. However, these tests are not proposed within the larger context of model validation and assessment. The field of statistics, and more recently, statistical learning, have developed notions of model selection on the basis of estimated errors for models which use empirical distributions.
Because the LEO setup allows alternative plausible SL models, includ-
ing the deterministic prediction model ˆ m as a potential alternative, it
is important to adopt the SL approach of model validation, assess-
ment and selection. Note that any model validation schema should
depend on measuring the specific response function we wish to opti-
mize. In this chapter, our optimization objective reflects a sample av-
erage criterion. In cases where the optimization objective is not the ex-
pectation (such as robust optimization, mean-risk optimization), the
validation methods proposed in this chapter may not be appropriate.
The protocol we adopt is one based on Figure 1b, where validation is a critical part of the modeling process. In section 3.5.1 we discuss metrics for any LEO model, and comparisons between alternative LEO models will be presented in section 3.5.2. These tests correspond to the hypothesis tests used in the lower diamond-shaped block of Figure 1b, and require a decision as an input.
3.5.1 Metrics and Model Validation
The following tests will be included for each alternative LEO model (indicated by an index q, as mentioned in section 3.3):

- χ² test for error terms and cost-to-go objectives
- T-test for the mean of the cost-to-go function
- F-test for the variance of the cost-to-go function
- Tests to identify outliers
- Prediction and confidence intervals such as those motivated by SL
The three standard statistical tests summarized below are the χ², T and F tests. For any of these tests, if the null hypothesis is rejected, then the corresponding LEO model is rejected; otherwise, it is not rejected (i.e., models will be either rejected or not rejected). In addition, we include the calculation of confidence intervals of the estimated expected cost-to-go function for a given decision.

The T-test and F-test are based on asymptotic normality of the optimal value and the optimal solutions of the stochastic programming problem. The asymptotic normality of the value is obtained via uniform convergence of SAA (Mello and Bayraksan, 2014), whereas the asymptotic normality of solutions follows from the uniqueness of limits, as shown in Sen and Liu, 2016. The latter property does not necessarily hold for the general SAA setting, which does not impose any algorithmic conditions.
χ² Test for Error Terms and Cost-to-go Objectives. We perform χ² tests for the error terms and the cost-to-go objective functions. The data-set is expected to have two parts, and we test the null hypothesis that both parts of the data-set share a common distribution. Given a data-set, we allocate the data into B bins; for the i-th bin, denote $E_{1i}$ as the observed frequency for bin i from one sample, and $E_{2i}$ as the observed frequency for bin i from the other sample. Then the χ² statistic for this data is estimated as:

$$\hat{\chi}^2 = \sum_{i=1}^{B} \frac{(E_{1i} - E_{2i})^2}{E_{1i} + E_{2i}} \qquad (3.16)$$

We check the χ² distribution with B degrees of freedom, which provides the standard value $\chi^2(B)$, and the probability $p = \mathrm{Prob}(\chi^2(B) > \hat{\chi}^2)$. Given a significance level $\alpha$, we reject the null hypothesis if $p \leq \alpha$; otherwise, we do not reject the null hypothesis.
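A minimal sketch of this two-sample χ² test, assuming equal-size samples binned on a common grid (the bin count and input arrays are illustrative):

```python
import numpy as np
from scipy.stats import chi2

def two_sample_chi2(sample1, sample2, bins=10, alpha=0.05):
    """Binned two-sample chi-square statistic (3.16): sum (E1-E2)^2/(E1+E2)."""
    edges = np.histogram_bin_edges(np.concatenate([sample1, sample2]), bins=bins)
    e1, _ = np.histogram(sample1, bins=edges)
    e2, _ = np.histogram(sample2, bins=edges)
    mask = (e1 + e2) > 0                     # skip empty bins
    stat = np.sum((e1[mask] - e2[mask]) ** 2 / (e1[mask] + e2[mask]))
    p = chi2.sf(stat, df=bins)               # upper-tail probability, B dof
    return stat, p, ("reject" if p <= alpha else "do not reject")
```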
Cost-to-go Hypothesis Test for the Mean Value using the T-statistic. For cost-to-go objectives, we introduce the null hypothesis test for the mean value. Suppose we have two independent samples which reflect the training and validation sets, of size $m_1$ and $m_2$ respectively; their mean values are $h_1, h_2$, and their standard deviations are $s_1, s_2$. To determine whether the two sample means are significantly different with $\alpha = 0.05$, the T-statistic of the two group means is

$$t = \frac{|h_1 - h_2|}{\sqrt{s_1^2/m_1 + s_2^2/m_2}}.$$

We compare the calculated T-value with $t_0 = 1.96$; if $t > t_0$, we reject the null hypothesis because it indicates that the mean values of the two samples are significantly different. If this hypothesis is rejected, the objectives of the training and validation sets are considered to be different on the basis of the first moment.
Cost-to-go Hypothesis Test for the Variance using the F-statistic. Besides the hypothesis test at the first moment level, we also perform a test for the variance based on the F distribution. The F-test is often used to test if the variances of two samples are consistent. The null hypothesis $H_0$ is defined as: the variances of the two samples are equal. Suppose we denote the observed variances of the two samples as $s_1^2$ and $s_2^2$; then the F statistic is $f = s_1^2 / s_2^2$. Suppose we choose the significance level $\alpha$, sample 1 has sample size $N_1$, and sample 2 has sample size $N_2$; then the critical region is decided by two values from the F-distribution: $F_{\alpha/2, N_1-1, N_2-1}$ and $F_{1-\alpha/2, N_1-1, N_2-1}$. If $F_{\alpha/2, N_1-1, N_2-1} \leq f \leq F_{1-\alpha/2, N_1-1, N_2-1}$, we do not reject the null hypothesis.
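The two tests above can be sketched directly with scipy (the sample arrays are placeholders):

```python
import numpy as np
from scipy.stats import f as f_dist

def mean_and_variance_tests(train, valid, alpha=0.05):
    """T-statistic for means and F-statistic for variances, as described
    above; returns the statistics and reject/do-not-reject decisions."""
    m1, m2 = len(train), len(valid)
    s1, s2 = np.var(train, ddof=1), np.var(valid, ddof=1)
    t = abs(np.mean(train) - np.mean(valid)) / np.sqrt(s1 / m1 + s2 / m2)
    reject_mean = t > 1.96                    # critical value at alpha = 0.05
    F = s1 / s2
    lo = f_dist.ppf(alpha / 2, m1 - 1, m2 - 1)
    hi = f_dist.ppf(1 - alpha / 2, m1 - 1, m2 - 1)
    reject_var = not (lo <= F <= hi)
    return t, reject_mean, F, reject_var
```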
Confidence Intervals of the Estimated Expected Cost-to-go Functions. Next we present the calculation of the confidence interval (of the mean). Suppose we let $\bar{h}$ denote the mean of the validated cost-to-go values, and $d$ denote the approximate standard deviation of the validated cost-to-go mean values among all the replications. The $1-\alpha = 95\%$ confidence interval is given by

$$\left[\, \bar{h} - \frac{t_{\alpha/2,\,k-1}}{\sqrt{k-1}}\, d, \ \ \bar{h} + \frac{t_{\alpha/2,\,k-1}}{\sqrt{k-1}}\, d \,\right],$$

in which $k \in [5, 10]$ represents the number of folds in cross-validation and $t_{\alpha/2,\,k-1}$ denotes the t distribution value with probability $1-\alpha$ and $k-1$ degrees of freedom.
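A short sketch of this interval, where fold_means holds the k cross-validation fold means (an illustrative input):

```python
import numpy as np
from scipy.stats import t as t_dist

def cv_confidence_interval(fold_means, alpha=0.05):
    """1-alpha confidence interval for the mean validated cost-to-go,
    built from k cross-validation fold means as described above."""
    k = len(fold_means)
    h_bar = np.mean(fold_means)
    d = np.std(fold_means, ddof=1)
    half = t_dist.ppf(1 - alpha / 2, k - 1) * d / np.sqrt(k - 1)
    return h_bar - half, h_bar + half
```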
Identifying outliers is an integral part of regression modeling (James
et al., 2013), and a similar undertaking is advisable for the optimiza-
tion setting. In this section, we will also identify prediction intervals
to quantify the range of cost-to-go values which might result as a con-
sequence of the decision.
Identifying Outliers. In stating the LEO model, the class of regressions M can be quite general. However, a model with Shared Spaces may call for a constrained regression, where M may include bounds on predictions. For instance, in the LEO-Wyndor example, an unconstrained regression may lead to predictions which violate the bounds of the data-set. Unlike robust optimization, where outliers may be critical to a decision model, our setting is more in line with the regression models of statistics, where outliers can have a detrimental impact on the estimated conditional expectation. As in regression, where the focus is on approximating a sample average prediction (using empirical residual minimization), data that are considered to be outliers should be removed. Similar considerations also hold for clustering algorithms (see Bertsimas and Shioda, 2007).

To identify outliers, it is important to choose the type of errors (additive scalar or multi-dimensional) to be used during cross-validation. Outliers from additive errors can be identified via Q-Q plots, whereas the case of multi-dimensional errors requires greater computational effort. In cases such as MLR, the regression coefficients are allowed to be random variables, and hence lead to multi-dimensional error terms.
Outliers for an additive error model. Let $W_L = \min_i\{W_i : i \in T\}$ and $W_U = \max_i\{W_i : i \in T\}$. Once these bounds $W_L, W_U$ have been computed, we identify those $i \in V$ as outliers by checking whether $m(Z_i, \xi) \in [W_L, W_U]$. Hence, data points with predictions outside the bounds $[W_L, W_U]$ are considered to be outliers, and should be removed. Figure 3.4 shows the Q-Q plots for the error terms of the training and validation data sets of the LEO-Wyndor example before and after data preprocessing. We also compared the χ² test results for the error sets before and after preprocessing. The detailed results of the χ² test are included in section 5, where all computational results are presented.
An alternative way to identify outliers for the additive scalar SL model is to identify a $1-\alpha$ percent range to which the parameter $\beta_0$ should belong. Choosing $\alpha = 0.05$, the acceptable range of $\beta_0$ is

$$\mathcal{E}_{M_0} = \left\{ \beta_0 \ \middle|\ \frac{(\beta_0 - \bar{\beta}_0)^2}{s^2_{\beta_0}} \leq \chi^2(\alpha) \right\},$$

where $\bar{\beta}_0$ denotes the mean value of the constant coefficient from MLR, and $s_{\beta_0}$ is the standard deviation of the coefficient. We declare a data point $(W_i, Z_i)$ to be an outlier if the following system is empty:

$$(\bar{\beta}_0 + \xi_{i0}) + \bar{\beta}^\top Z_i = W_i, \qquad \bar{\beta}_0 + \xi_{i0} \in \mathcal{E}_{M_0}.$$

The above test for uni-dimensional outliers is generalized to the multi-dimensional case below.
Outliers for a multi-dimensional error model. Here we consider the case of statistical models in which the parameters are considered to be rvs. Given our current emphasis on MLR, we study the case of parameters which have multivariate normal distributions. For such statistical models, the Mahalanobis distance is known to provide a justifiable test for identifying outliers (Kulis, 2013). The essence of our test revolves around identifying parameters $\beta_i$ which are expected to belong to a multi-dimensional ellipsoid

$$\mathcal{E}_M = \left\{ \beta \mid (\beta - \bar{\beta})^\top \Sigma_\beta^{-1} (\beta - \bar{\beta}) \leq \chi^2_p(\alpha) \right\},$$

where $\bar{\beta}$ denotes the mean value of the coefficients reported for MLR, $\Sigma_\beta$ is the variance-covariance matrix associated with the coefficients, $p$ denotes the degrees of freedom for the χ² distribution, and $1-\alpha$ is the probability. If, for a given data point $(W_i, Z_i)$, the following system is infeasible, then we declare such a data point to be an outlier:

$$(\bar{\beta}_0 + \xi_{i0}) + (\bar{\beta} + \xi_i)^\top Z_i = W_i, \qquad (\bar{\beta}_0 + \xi_{i0},\ \bar{\beta} + \xi_i) \in \mathcal{E}_M.$$

These multi-dimensional feasibility problems are best solved in parallel using a convex optimization solver for each i.
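Since the feasibility system above involves a single linear equality and an ellipsoid, one way to check it is to minimize the Mahalanobis distance subject to the equality, which admits a closed form. The following sketch (names and inputs illustrative) exploits this observation rather than calling a general convex solver:

```python
import numpy as np
from scipy.stats import chi2

def is_outlier(W_i, Z_i, beta_bar, Sigma, alpha=0.05):
    """Check infeasibility of the multi-dimensional outlier system:
       minimize (b - beta_bar)' Sigma^{-1} (b - beta_bar)  s.t.  a' b = W_i,
    where a = (1, Z_i) stacks the intercept with the predictors.
    The point is an outlier iff the minimum distance exceeds the
    chi-square critical value."""
    a = np.concatenate(([1.0], Z_i))          # intercept plus predictors
    resid = W_i - a @ beta_bar                # deviation from the mean fit
    min_dist = resid**2 / (a @ Sigma @ a)     # closed-form minimum distance
    return min_dist > chi2.ppf(1 - alpha, df=len(beta_bar))
```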
Prediction Intervals. Under uncertainty, decision-makers (DM) not only seek a recommendation for a well-hedged decision, but are also interested in predictions of future costs. The 95% prediction interval, which is a population property, is most commonly used when one is interested in making inference on a single future observation of the total cost of the decision model. We recommend prediction intervals which represent a non-parametric interval estimate of the population value of the cost-to-go function at the 95% level. Unlike confidence intervals (and even prediction intervals in regression), non-parametric prediction intervals may not be symmetric. The prediction interval can be obtained using the shortest interval which covers 95% of the validated cost-to-go population (see Theorem 1 in Frey, 2013). As noted in the cited paper, the coverage probability of the shortest interval is accurate for a uniform distribution, whereas it is a lower bound on the coverage probability for other distributions.
While the problem of seeking the shortest interval can be stated as an extension of a knapsack problem, we propose a somewhat tighter formulation (with additional valid inequalities). Let $\{h_j\}_{j=1}^{|V|}$ denote the validated cost-to-go data points, where $|V|$ represents the sample size of the validation dataset. We recommend that the dataset be sorted in increasing order (i.e., $h_j \leq h_{j+1}$, $j = 1, \ldots, |V|-1$). Let $z_j$ denote a binary variable which assumes the value 1 if $h_j$ is included in the prediction interval, and 0 otherwise. The problem of identifying a $1-\alpha$ prediction interval is formulated as follows.
$$\begin{aligned}
\min\ & \sum_{j=1}^{|V|} h_j v_j - \sum_{j=1}^{|V|} h_j u_j \\
\text{s.t.}\ & \sum_{j=1}^{|V|} v_j = 1, \qquad \sum_{j=1}^{|V|} u_j = 1, \\
& \sum_{j=1}^{|V|} z_j \geq (1-\alpha)|V| \\
& v_j \geq z_{j-1} - z_j, \quad u_j \geq z_j - z_{j-1}, \quad j = 1, \ldots, |V| \\
& u_j \leq 1 - v_{j-t}, \quad t = 0, \ldots, j-1, \ j = 1, \ldots, |V| \\
& \sum_{t \leq j-1} u_t \geq v_j, \quad j = 2, \ldots, |V| \\
& z_t \leq 1 - u_j, \ t = 0, \ldots, j-1; \quad z_t \leq 1 - v_j, \ t = j+1, \ldots, |V|; \quad j = 1, \ldots, |V| \\
& z_j, u_j, v_j \in \{0, 1\}, \quad j = 1, \ldots, |V|.
\end{aligned}$$
In this formulation, the binary variable $u_j$ ($v_j$) is assumed to be 1 if $h_j$ is the smallest (largest) index to be included in the prediction interval, and 0 otherwise. The constraints $\sum_{j=1}^{|V|} v_j = 1$ and $\sum_{j=1}^{|V|} u_j = 1$ ensure that only one smallest (largest) index exists, and the constraint $v_j \geq z_{j-1} - z_j$ guarantees that if j is not the index of the largest element (which indicates $v_j = 0$) and the index $j-1$ is included in the prediction interval, then j is also within the prediction interval (the case of $u_j$ for the smallest index is similar). The $1-\alpha$ prediction interval is $\left[\sum_{j=1}^{|V|} h_j u_j,\ \sum_{j=1}^{|V|} h_j v_j\right]$.
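Because the $h_j$ are sorted, the same shortest covering interval can also be found without integer programming by sliding a window of $\lceil (1-\alpha)|V| \rceil$ consecutive points; the following sketch (illustrative, not the formulation above) does exactly that:

```python
import math
import numpy as np

def shortest_prediction_interval(h, alpha=0.05):
    """Shortest interval covering a (1-alpha) fraction of the validated
    cost-to-go values; for sorted data this matches the integer program."""
    h = np.sort(np.asarray(h))
    m = math.ceil((1 - alpha) * len(h))      # points the window must cover
    widths = h[m - 1:] - h[:len(h) - m + 1]  # width of each candidate window
    j = int(np.argmin(widths))
    return h[j], h[j + m - 1]
```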
3.5.2 Comparison across LEO Models.
In this subsection, we discuss how alternative LEO models are as-
sessed and which of these should be recommended as the most ap-
propriate. In order to do so, we first estimate generalization error and
optimization error. Finally, we include the Kruskal-Wallis test, which
provides a sense of reliability of the estimates.
Generalization Error. This quantity is a prediction of the out-of-sample cost-to-go error in prescriptive models which may be observed when the system is implemented for a given decision x. Note that in our approach the generalization error is estimated based on the cost-to-go function values of SP models. The term "in-sample" includes all training as well as the validation data. For a given decision x, define

- $h_i^+$ : a new value of the cost-to-go function;
- $\hat{h}_i$ : a cost-to-go function value in the training dataset of sample size $|T|$;
- $h_i^v$ : a cost-to-go function value in the validation dataset of sample size $|V|$.

The random variable corresponding to $h_i^+$ ($h_i^v$) will be denoted $h^+$ ($h^v$). For a given decision x, the new observation of the cost-to-go function can be estimated by the k nearest neighbors (k-NN) approximation from the training set. For a new observation $(Z^+, W^+)$, define $\Delta_j^+ = \|Z_j - Z^+\|$, where $j \in T$, and let $\Delta_{[j]}^+$ denote the $j$-th order statistic of $\{\Delta_j^+,\, j \in T\}$. Let the indicator variable $I_{j,+}^k$ be 0 if $\Delta_j^+ > \Delta_{[k]}^+$, and 1 otherwise. Then the k-NN estimate for the case of disjoint spaces can be approximated as

$$h^+ = \frac{1}{k} \sum_{j=1}^{N} I_{j,+}^k\, h(x, \xi_j \mid Z = Z_j), \qquad (3.17)$$

where $N = |T|$. For the case of shared spaces, the new data would take the form $(x, W^+)$; hence we would use $Z^+ = x$ to compute the nearest neighbors. In keeping with Lemma 1 of Walk, 2008, we make the following assumption.

Assumption 4 (A4). Assume that $k/N \to 0$.
Define an approximation of the in-sample cost-to-go error as

$$\mathrm{Err}_{in} \approx \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{h^+}\!\left(h_i^+ - \hat{h}_i\right)^2. \qquad (3.18)$$

The approximate equality ($\approx$) in (3.18) is intended to convey that the right hand side is asymptotically equal to the left hand side as N approaches infinity. In any event, the in-sample cost-to-go error provides an estimated average error between a new cost-to-go response and the training set cost-to-go.
Next we consider how well the training data predict the cost-to-go values of the validation set. Similarly, we provide a k-NN approximation of the cost-to-go values in the validation dataset for a given decision x. Let $(Z^v, W^v)$ denote a validation data point, and let $\Delta_{[j]}^v$ denote the $j$-th order statistic of $\{\Delta_j^v,\, j \in T\}$, where $\Delta_j^v = \|Z_j - Z^v\|$. Hence the k-NN approximation of the validation cost-to-go function value is given by

$$h^v = \frac{1}{k} \sum_{j=1}^{N} I_{j,v}^k\, h(x, \xi_j \mid Z = Z_j), \qquad (3.19)$$

where the indicator variable $I_{j,v}^k$ is 0 if $\Delta_j^v > \Delta_{[k]}^v$, and 1 otherwise. Similar to the k-NN approximation in (3.17), we require A4 to be satisfied. Now define the cost-to-go training error ($\mathrm{err}_{tr}$) as

$$\mathrm{err}_{tr} = \frac{1}{N} \sum_{i=1}^{N} \left(h_i^v - \hat{h}_i\right)^2. \qquad (3.20)$$
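A minimal sketch of the k-NN cost-to-go estimate in (3.17)/(3.19); the cost-to-go evaluator for a fixed decision x is passed in as a callable, and all names are illustrative:

```python
import numpy as np

def knn_cost_to_go(z_new, Z_train, xi_train, cost_to_go, k):
    """Average the cost-to-go over the k training points whose predictors
    are nearest to z_new, as in (3.17) and (3.19)."""
    dists = np.linalg.norm(Z_train - z_new, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.mean([cost_to_go(xi_train[j]) for j in nearest])
```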
Given (3.18) and (3.20), the generalization error for the cost-to-go is estimated by $\mathbb{E}_h(\mathrm{Err}_{in} - \mathrm{err}_{tr})$. The following theorem suggests a mechanism to estimate this error. Although a similar result is available in the SL context of "goodness of fit" (Hastie, Tibshirani, and Friedman, 2011), it is not applicable here because there is no parametric function to approximate the cost-to-go value. Since the setup for the cost-to-go function is based on a non-parametric estimate, we provide the following result.
Theorem 4 Let A1, A2 and A4 hold. Then the generalization error for the cost-to-go is estimated by

$$\mathbb{E}_h(\mathrm{Err}_{in}) - \mathbb{E}_h(\mathrm{err}_{tr}) \approx \frac{2}{N} \sum_{i=1}^{N} \mathrm{Cov}\!\left(h_i^v, \hat{h}_i\right). \qquad (3.21)$$
Proof: Based on assumptions A1, A2 and A4, $\mathbb{E}_{h^+}(h_i^+)^2 = \mathbb{E}_h(h_i^v)^2$ and $\mathbb{E}_{h^+} h_i^+ = \mathbb{E}_h h_i^v$. By the definitions of $\mathbb{E}_h(\mathrm{Err}_{in})$ and $\mathbb{E}_h(\mathrm{err}_{tr})$, the following equations hold:

$$\begin{aligned}
\mathbb{E}_h(\mathrm{Err}_{in}) - \mathbb{E}_h(\mathrm{err}_{tr})
&\approx \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_h \mathbb{E}_{h^+}\!\left(h_i^+ - \hat{h}_i\right)^2 - \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_h\!\left(h_i^v - \hat{h}_i\right)^2 \\
&= \frac{1}{N}\sum_{i=1}^{N} \left[ \mathbb{E}_h \mathbb{E}_{h^+}\!\left((h_i^+)^2 + \hat{h}_i^2 - 2 h_i^+ \hat{h}_i\right) - \mathbb{E}_h\!\left((h_i^v)^2 + \hat{h}_i^2 - 2 h_i^v \hat{h}_i\right) \right] \\
&= \frac{1}{N}\sum_{i=1}^{N} \left[ \mathbb{E}_{h^+}(h_i^+)^2 - 2\,\mathbb{E}_{h^+}\mathbb{E}_h\!\left(h_i^+ \hat{h}_i\right) - \mathbb{E}_h(h_i^v)^2 + 2\,\mathbb{E}_h\!\left(h_i^v \hat{h}_i\right) \right] \\
&= \frac{1}{N}\sum_{i=1}^{N} \left[ 2\,\mathbb{E}_h\!\left(h_i^v \hat{h}_i\right) - 2\,\mathbb{E}_{h^+}(h_i^+)\,\mathbb{E}_h(\hat{h}_i) \right] \\
&= \frac{1}{N}\sum_{i=1}^{N} \left[ 2\,\mathbb{E}_h\!\left(h_i^v \hat{h}_i\right) - 2\,\mathbb{E}_h(h_i^v)\,\mathbb{E}_h(\hat{h}_i) \right] \\
&= \frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}\!\left(h_i^v, \hat{h}_i\right). \qquad \blacksquare
\end{aligned}$$
As is common in statistical learning, one obtains a better estimate of the generalization error by using cross-validation (Hastie, Tibshirani, and Friedman, 2011). In one run of cross-validation, the data is partitioned randomly into two complementary subsets. To analyze the generalization error for a given decision, we calculate the covariance of the cost-to-go objectives from these independent subsets. Multiple runs of cross-validation are performed to sample the generalization error, and finally, we report the estimate of the generalization error as the average value over k runs. Finally, note that if the decision model is a deterministic problem (the number of outcomes is 1), then assumption A4 is not satisfied, and the estimate in (3.21) may only be a lower bound.
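A sketch of the resulting estimator for one cross-validation run, where h_v and h_hat hold paired cost-to-go values from the two complementary subsets (inputs illustrative; in practice this value is averaged over several runs):

```python
import numpy as np

def generalization_error_estimate(h_v, h_hat):
    """Estimate (2/N) * sum_i Cov(h_v_i, h_hat_i) from paired cost-to-go
    samples of one cross-validation run, in the spirit of Theorem 4."""
    cov = np.cov(h_v, h_hat, ddof=1)[0, 1]
    return 2.0 * cov
```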
Optimization Error and Model Selection.

In the SL/ML pieces of a LEO model, multiple classes of models will be available to choose from. Therefore, we need to design a criterion based on the decisions to choose the best model class among multiple classes. Den Boer and Sierag, 2016 propose a decision-based model selection (DBMS) method that evaluates models based on the quality of the decisions they produce. The key idea of the DBMS method is to use a resampling technique to evaluate the performance of the decisions generated by different models. Our approach to model selection is similar to the DBMS methodology, with an important distinction which we note subsequently. Specifically, one considers a decision making problem

$$\min\ \mathbb{E}[f(x, W(x))] \qquad (3.22)$$
$$\text{s.t.}\ x \in X \qquad (3.23)$$

where $f : X \times \mathcal{W} \to \mathbb{R}$ is a cost function and $\{W(x), x \in X\} \subseteq \mathcal{W}$ is a collection of random variables. The distribution of $W(x)$ is unknown to the decision maker, but a data set $T_0 = \{(X_i, W_i)\}_{i=1}^{n}$ is available. Suppose there is a set $\mathcal{H} = \{\mathcal{M}^{(0)}, \mathcal{M}^{(1)}, \ldots, \mathcal{M}^{(Q)}\}$ which includes the classes of parametric models of the probability distributions we consider. In case of the LEO-Wyndor example, the total number of models is Q = 5. Each class of models is $\mathcal{M}^{(q)} = \{F^q_{x,\theta} \in \mathcal{F} \mid x \in X, \theta \in \Theta^{(q)}\}$, where $\mathcal{F}$ is the set of cumulative distribution functions on $\mathcal{W}$, and $\Theta^{(q)}$ is a nonempty parameter set. Suppose for each q, the estimator is a function $\varphi^{(q)} : (X \times \mathcal{W})^n \to \Theta$ which maps the data to parameter values, and an optimization algorithm is a mapping $\chi^{(q)} : \Theta \to X$ such that $\chi^{(q)}(\theta)$ generates an exact/inexact solution which optimizes

$$\int_{w \in \mathcal{W}} f(x, w)\, dF^{(q)}_{x,\theta}(w). \qquad (3.24)$$
Therefore, the decision maker first picks a class of models $\mathcal{M}^{(q)} \in \mathcal{H}$, then estimates the parameter $\theta_q = \varphi^{(q)}(T_0)$, and finally makes a decision $x^{(q)} = \chi^{(q)}(\theta_q)$. In the development by Den Boer and Sierag, 2016, it is assumed that $\mathcal{M}^{(0)}$ is the class of models containing the true model $F^0_{z,\theta}$, and an $x^{(0)}$ can be obtained from this class. Specifically, let $\hat{\theta} = \varphi^{(0)}(T^{(r)})$, where $T^{(r)}$ is the resampled data such that $T^{(r)} = \{(z_i, w_{\tau_i})\}$ and $\tau = \{\tau_1, \tau_2, \ldots, \tau_n\}$ is a permutation of the set $\{1, 2, \ldots, n\}$. The recommendation of Den Boer and Sierag, 2016 is to choose the model for which the following minimum is attained:

$$\min_{q \in \{0, \ldots, Q\}} \int_{w \in \mathcal{W}} f(x^{(q)}, w)\, dF^0_{x^{(q)}, \hat{\theta}}(w). \qquad (3.25)$$

This strategy is shown to be consistent in the sense that the optimal cost value of the model chosen by the DBMS method converges in probability to the optimal cost value of the true model as the data size goes to infinity. Moreover, the assumption of a true model and true parameter value can be relaxed to any model and parameter that provide the best explanation of the data.
Our approach adopts a somewhat relaxed set of requirements, namely: a) instead of solving (3.24) exactly, we find an approximate solution through (3.8); b) we estimate the objective value in (3.25) using the validation data set representing $F^0$, without assuming that $\mathcal{M}^{(0)}$ is available; and c) instead of (3.25), we use the Kruskal-Wallis non-parametric analysis of variance for pairwise comparisons between alternative models. To formally state the process for model selection, let $\tilde{f}_q$ denote the entire population vector of the validated objectives for model q, and let $\bar{f}_q$ denote the sample average of the validated objectives for model q. Then the model selection procedure is as follows. Let $\mathcal{Q} = \{1, \ldots, Q\}$, choose small scalars $\alpha, \epsilon_1, \epsilon_2 > 0$, and do the following for $r \in \mathcal{Q}$:

$$\text{if } \exists\, q \in \mathcal{Q} \text{ such that } \Pr\!\left(\|\tilde{f}_q - \tilde{f}_r\| > \epsilon_1\right) \geq 1 - \alpha \ \text{ and } \ \bar{f}_q \leq \bar{f}_r - \epsilon_2, \quad \mathcal{Q} \leftarrow \mathcal{Q} \setminus \{r\}. \qquad (3.26)$$

Letting $f^* = \min\{\bar{f}_q \mid q \in \mathcal{Q}\}$, we define the optimization error to be the difference $\bar{f}_q - f^*$ for any $q \in \{1, \ldots, Q\}$. The probability estimates above are provided by the non-parametric Kruskal-Wallis test, which does not require normality as an assumption. For details regarding the generalization error for cost-to-go estimates, we refer to Deng and Sen, 2018. The final choice of the model should be based on a combination of both the optimization and generalization errors of the models.
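A sketch of the pairwise screening step using scipy's Kruskal-Wallis test; validated_objs maps model names to arrays of validated objectives, and the threshold eps2 is a placeholder:

```python
from itertools import combinations
import numpy as np
from scipy.stats import kruskal

def screen_models(validated_objs, alpha=0.05, eps2=1.0):
    """Pairwise Kruskal-Wallis screening in the spirit of (3.26): drop a
    model when another model differs significantly in ranked medians and
    has a better (smaller) average validated objective."""
    surviving = set(validated_objs)
    for q, r in combinations(validated_objs, 2):
        _, p = kruskal(validated_objs[q], validated_objs[r])
        if p > alpha:
            continue                          # medians not distinguishable
        mq, mr = np.mean(validated_objs[q]), np.mean(validated_objs[r])
        if mq <= mr - eps2:
            surviving.discard(r)
        elif mr <= mq - eps2:
            surviving.discard(q)
    return surviving
```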
3.6 Illustrative Computations
The instances discussed below are developed using existing data-sets
and existing optimization models. As with the rest of the paper, the
novelty here is in the fusion of learning data-sets and optimization
models. We include one example for each type of a LEO structure:
Disjoint Spaces and Shared Spaces. Since the data-sets are not new, we
append the acronym LEO to the names of the existing data-sets. Each
model requires two aspects: one is the SL aspect, and the other is the
decision/optimization aspect. In case of the SL part of the model, we
assume that measures that are necessary to create acceptable SL mod-
els are undertaken. The time series setting (LEO-ELECEQUIP Inventory
model) calls for stationarity tests, whereas in case of cross-sectional
data (LEO-Wyndor model), outliers in the data should be identified
and removed.
3.7 A Model with Disjoint Spaces: LEO-ELECEQUIP (Time-Series Data)
This model is based on a "rolling-horizon" decision model commonly used in inventory control. Before starting a sequence of decisions, one typically analyzes historical demand. In this particular example, we use a commonly available data set referred to as ELECEQUIP, which provides demand for electrical equipment over a ten-year period. We will use the first five years to discover the time series pattern of demand, and then use it within a rolling horizon inventory model. We conduct the validation exercise for two years, 2001-2002.

When fitting an ARIMA model to the training data, one identifies $(p, d, q)$ and $(P, D, Q)_\tau$, where $\tau$ represents the seasonal backshift, and $(p, d, q)$ specifies the AR(p), I(d) and MA(q) components of ARIMA. In order to ascertain whether a series is stationary, it is customary to create an autocorrelation function (ACF) plot, and then examine whether such a plot drops off to zero relatively fast. If so, we venture to guess that the data are stationary. While checking for correlation is not a substitute for independence, the lack of correlation can be considered a surrogate.
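A sketch of this workflow with statsmodels; the ARIMA orders shown are placeholders, since the fitted orders are not reported here:

```python
import pandas as pd
from statsmodels.tsa.stattools import acf
from statsmodels.tsa.arima.model import ARIMA

def fit_demand_model(demand: pd.Series):
    """Inspect stationarity via the ACF, then fit a seasonal ARIMA to the
    monthly training series (orders below are illustrative placeholders)."""
    autocorr = acf(demand, nlags=24)           # does it decay quickly?
    model = ARIMA(demand, order=(1, 1, 1),
                  seasonal_order=(0, 1, 1, 12))
    fitted = model.fit()
    return autocorr, fitted                    # fitted.forecast(k) gives D_hat
```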
Model Details: Without loss of generality, we can view the decision epoch as j = 0, and the most recent demand will be denoted $d_0 = D_{j+1}$ (from the validation data set). The beginning inventory $y_0$ and ending inventory $u_0$ are also available. The inventory model will look ahead into periods $t = 0, \ldots, T$, although, as in rolling horizon schemes, only $\Delta_0$ will be implemented. The model will be a two-stage multi-period stochastic program, with the first stage making ordering decisions $x = (\Delta_0, \ldots, \Delta_{T-1})$, and the second stage predicting the potential cost of the choice $\Delta_0$. As the decision clock moves forward in time, the total cost of inventory management is estimated by this process. The various relationships in the inventory model are summarized below.
- Because of the delivery capacity $U_t$, we must have $0 \leq \Delta_t \leq U_t$.

- We will sample demand realizations in period t using a forecast $D_t(\xi)$ (using the time series model created in the training phase). Here the notation $\xi$ denotes one sample path of demands over the periods $t = 1, \ldots, T-1$. The notation $d_0$ (in lower case) denotes a deterministic quantity, whereas the notation $D_t(\xi)$ denotes the outcome $\xi$ of the demand (stochastic process) observed in period t. The planning horizon used in the linear program will be T = 3, which requires us to simulate two periods of demand.

- We assume that $y_0$ and $d_0$ are given. Let $u_t(\xi)$ denote the ending inventory in period t, and $y_{t+1}(\xi)$ denote the beginning inventory in period t+1. We have $y_{t+1}(\xi) = u_t(\xi) + \Delta_t$, and a storage capacity constraint requires that $y_{t+1}(\xi) \leq R_{t+1}$, where the latter quantity is given. Then the ending inventory of period t, denoted $u_t(\xi)$, must obey the relationship $u_t(\xi) = \max\{0,\, y_t(\xi) - D_t(\xi)\}$. The unit cost of holding inventory is $c_u$, where $c_u \geq 0$. The total inventory holding cost for period t is then $c_u u_t(\xi)$.

- Let $v_t(\xi)$ denote the lost sales in period t, so that $v_t(\xi) = \max\{0,\, D_t(\xi) - y_t(\xi)\}$. Suppose that the per unit cost of lost sales in period t is $c_v$, where $c_v \geq 0$. Then the total cost of lost sales for period t is $c_v v_t(\xi)$, and the first stage cost is zero. Recalling that $x = (\Delta_0, \ldots, \Delta_{T-1})$, the cost-to-go function is defined as follows.
$$\begin{aligned}
\min_{0 \leq \Delta_t \leq U_t}\ f(x) = \mathbb{E}_{\tilde{\xi}}\Big[ h(x, \tilde{\xi}) = \sum_{t=0}^{T-1} &\, c_u u_t(\tilde{\xi}) + c_v v_t(\tilde{\xi}) \Big] \\
\text{s.t.}\quad y_{t+1}(\xi) - u_t(\xi) &= \Delta_t && \text{for all } \xi \\
y_{t+1}(\xi) &\leq R_{t+1} && \text{for all } \xi \\
u_t(\xi) &\geq y_t(\xi) - D_t(\xi) && \text{for all } \xi \\
y_t(\xi) + v_t(\xi) &\geq D_t(\xi) && \text{for all } \xi \\
u_t(\xi),\ v_t(\xi) &\geq 0 && (3.27)
\end{aligned}$$
Note that the constraints in (3.27) should be imposed for all possible errors in the training set. However, not all error combinations are sampled, and as a result, we say that the constraints must hold for a large enough sample size (which is what we mean by the phrase "almost all" $\xi$). It suffices to say that the sample size used in optimization is decided during the Stochastic Decomposition (SD) algorithmic process. The computational results for T = 2 are presented in this section. In this example, we use $c_u = 1$, $c_v = 3$ and $U_t = R_t = \infty$.
(a) Deterministic ARIMA Forecasting (DAF). Since $U_t$ and $R_t$ are infinite, we can use the predicted demand to define the order quantity as $\Delta_t = \max\{0,\, \hat{D}_t - u_t\}$, where $\hat{D}_t$ is the expected value from the ARIMA model. This is a case of using $\hat{m}$ in section 2.

(b) Stochastic Linear Programming (SLP), which gives the decision by solving the instance in equation (3.27). Note that our rolling horizon approach solves three-period problems (0, 1, 2); we use the solution of period 0 as our current decision, and drop the other decisions. We then use the demand of the following period, update the inventory status, and move the clock forward to the next period. This is a case of $m(z, \xi) = \hat{m}(z) + \xi$, where $\xi$ is an outcome of $\tilde{\xi}$, the normal error from ARIMA.
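A sketch of the rolling-horizon loop for the DAF policy in (a); the SLP policy in (b) would replace the order rule with a call to an SP solver. The forecaster callable and demand series are illustrative inputs:

```python
def simulate_daf(forecast_fn, demand, y0, c_u=1.0, c_v=3.0):
    """Roll the DAF policy forward: order Delta_t = max(0, D_hat_t - u_t),
    realize demand, and accrue holding and lost-sales costs."""
    y, total_cost = y0, 0.0
    for t, d_t in enumerate(demand):
        d_hat = forecast_fn(t)              # ARIMA point forecast for period t
        delta = max(0.0, d_hat - y)         # order up to the forecast
        y += delta
        u = max(0.0, y - d_t)               # ending inventory
        v = max(0.0, d_t - y)               # lost sales
        total_cost += c_u * u + c_v * v
        y = u                               # carry ending inventory forward
    return total_cost
```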
3.7.1 Month by Month Validation Results for 2001-2002.
The ARIMA model was trained on data from 1996-2000, and the performance of the models was validated during the two-year period 2001-2002. Table 3.1 presents costs for the years 2001 and 2002 (24 months) for each of the two inventory policies, DAF and SLP, specified in (a) and (b) above. Note that of the 24 runs (simulating two years of inventory), the LEO approach cost more only in month 1.
Month        1      2      3      4      5      6
DAF Cost   12.33  14.41  39.02  14.54  26.44  28.86
SLP Cost   16.53   3.06  12.28   9.49  20.63  17.77

Month        7      8      9     10     11     12
DAF Cost    7.65  37.99  25.26  38.46  16.92  30.34
SLP Cost    7.38  31.27  14.82  28.66  13.23  21.92

Month       13     14     15     16     17     18
DAF Cost   11.35   3.05  15.11  26.74  15.67  38.98
SLP Cost    6.04   1.11  11.06  15.78  14.22  24.56

Month       19     20     21     22     23     24
DAF Cost   33.23  23.81  17.90  16.62  15.31  29.72
SLP Cost   11.90  19.88   5.13   8.66   9.05  23.20

TABLE 3.1: LEO-ELECEQUIP: Monthly Back-Testing Costs
Thereafter, it cost less in each subsequent month, with some months reducing costs by over 66%. The average inventory cost reduction over the deterministic ARIMA forecast is approximately 34% over the two-year run.
3.7.2 Snapshot Computational Results for ELECEQUIP

For practical decision-making, the back-testing exercise performed above is more convincing than any snapshot study of a dynamic process. Nevertheless, we present a snapshot of this inventory model in the interest of parsimony of presentation for both examples of this section. We select the end of the first year as the point in time when statistical comparisons are made. For this snapshot study, such a choice, allowing the model to run for a year, helps to avoid initialization bias.
Table 3.2 provides the estimated objective, the validated objective (95% confidence and prediction intervals), and the solution quality. The probability of optimality reported in Table 3.2 is a result of the computations suggested in Theorem 3, where $\delta_u$ is chosen to be 1% of the total cost. Notice that we do not report a probability for the DAF model because it is simply a result of the ARIMA forecast. On the other hand, we include the probability for the SLP model, and this is consistent with the statistical optimality of section 3.4.
Table 3.3 summarizes results for three hypothesis tests for both the DAF and SLP cases. A hypothesis test rejects the null hypothesis at the 95% level when the statistic lies outside the range provided in the table. Upon examining the entries for the T-test, the null hypotheses for both DAF and SLP are not rejected. We also perform the F-test for DAF and SLP, and the F-test rejects the hypothesis that the variances of training and validation are the same at the 95% level for DAF, but not for SLP. The results of the χ² test are presented in the last two rows, which analyze the consistency of the two data sets. Note that both DAF and SLP are not rejected at level $\alpha = 0.05$, but SLP shows a higher p-value. From these test results, we conclude that the SLP approach performs better in terms of consistency between training and validation sets.
Models                               DAF             SLP
Estimated Obj.                      25.52           22.75
95% Validated Confidence Interval   30.34 (±3.84)   21.92 (±3.38)
95% Validated Prediction Interval   30.87 (±9.03)   21.75 (±8.32)
Probability (γ)                     n/a             0.9934
Tolerance (δ)                       n/a             0.092

TABLE 3.2: LEO-ELECEQUIP: Comparison of Solutions under Alternative Models

The comparison across models is provided in Table 3.4. The cost of SLP in validation is smaller than that of DAF by 9.42, and SLP shows a smaller generalization error as well. We include the p-value of the Kruskal-Wallis test between the DAF and SLP approaches, and the result shows that the objectives of the DAF and SLP methodologies have significantly different ranked medians. Since LEO-ELECEQUIP is a minimization problem, better solutions result in costs that are at the lower end of the horizontal (cost) axis. In this case, the better decision results from SLP, and Figure 3.2 gives evidence of this conclusion because for all cost levels C, Prob(Cost ≤ C) is higher for SLP than it is for DAF.
Models                           DAF            SLP
T-statistic (t < 1.96)           t = 1.21       t = 0.20
Cost-to-go Test (Mean)           not rejected   not rejected
F-statistic (0.62 < f < 1.62)    f = 2.53       f = 1.23
Cost-to-go Test (Variance)       rejected       not rejected
χ² Test p-value (p > 0.05)       p = 0.13       p = 0.37
Cost-to-go Test (Distribution)   not rejected   not rejected

TABLE 3.3: LEO-ELECEQUIP: Hypothesis Test Results under Alternative Models
Models                          DAF     SLP
Generalization Error            1.45    0.96
Kruskal-Wallis Test (p-value)   1.24 × 10^-6
Optimization Error              9.42    n/a

TABLE 3.4: LEO-ELECEQUIP: Errors under Alternative Models

FIGURE 3.2: LEO-ELECEQUIP: Stochastic Dominance of SLP Validated Objectives over DAF
3.8 A Model with Shared Spaces: LEO-Wyndor

We study a "textbook"-ish example which has been amalgamated from two textbooks: one on Operations Research (Hillier and Lieberman, 2012) and another on Statistical Learning (James et al., 2013). Consider the well known pedagogical product-mix model under the banner of "The Wyndor Glass Co." In this example, Hillier and Lieberman, 2012 address resource utilization questions arising in the production of high quality glass doors: some with aluminum frames (A), and others with wood frames (B). These doors are produced by using resources available in three plants, named 1, 2, and 3. The data associated with this problem are shown in Table 3.5.

Plant   Prod. time for A   Prod. time for B   Total Hours
        (Hours/Batch)      (Hours/Batch)      Available
1              1                  0                 4
2              0                  2                12
3              3                  2                18
Profit per Batch   $3,000        $5,000

TABLE 3.5: Data for the Wyndor Glass Problem (Hillier and Lieberman, 2012)

The product mix will not only be decided by production capacity, but also by the potential of future sales. Sales information, however, is uncertain and depends on the marketing strategy to be adopted. Given 200 advertising time slots, the marketing strategy involves choosing a mix of advertising outlets through which to reach consumers. Exercising some "artistic license" here, we suggest that the advertising data set in James et al., 2013 reflects sales resulting from an advertising campaign undertaken by Wyndor Glass. That is, the company advertises both types of doors through one campaign which uses two different media, namely, TV and radio (the actual data set discussed in James et al., 2013 also includes newspapers, which we have dropped here to keep the example very simple). Note that in the original data set the advertising strategy is represented as budgeted dollars, whereas we have revised it to represent advertising time slots. Thus in our statistical model, sales predictions are based on the number of TV and radio advertising time slots (the numbers used are the same as in James et al., 2013). In our interpretation, product-sales reflect the total number of doors sold ($\{W_i\}$) when the number of advertising time slots for TV is $Z_{i,1}$ and that for radio is $Z_{i,2}$. (This data set has 200 data points, that is, $i = 1, \ldots, 200$.) For the SP side, $x_1$ denotes the number of TV advertising time slots, and $x_2$ denotes the number of radio advertising time slots.
[Comments Regarding a Standard SP Approach] The above situation includes some latent random effects, and without statistical modeling, it is not clear how these effects can be introduced into a decision model. If one considers an entirely data-driven SP model (i.e., without using a statistical model), then the connection between sales and advertising is not represented within the SP approach, and this gap makes the SP model untenable. Our approach to this issue is a combination of PO and SP. In general, we will refer to this approach combining statistical learning and stochastic optimization as Learning Enabled Optimization (LEO). Further comments regarding this difficulty appear at the end of this subsection.

FIGURE 3.3: The Advertising Data Set (Source: James et al., 2013)
This is a case where a linear regression model reveals the relationship between advertising and sales. When we include an SL model in the production planning model, we are able to accommodate the relationships discovered by a learning model. Because of the standard assumptions of linear regression, one can justify the fusion of regression and SP. The linear regression model for sales is shown in Figure 3.3, and will be represented by $\hat{m}(x)$. We consider the following statistical models, whose computational results are reported in section 3.6.

FIGURE 3.4: Q-Q plots before and after data preprocessing

1. (DF) For deterministic forecasts (DF) we simply use the sales given by $\hat{m}_1(z) = \bar{\beta}_0 + \bar{\beta}_1 z_1 + \bar{\beta}_2 z_2$. Thus, for deterministic predictions, we do not account for errors in the forecast, whereas the remaining cases below use some form of error distribution.

2. (NDU) One approximation of the sales forecast is one where the correlations between the coefficients are ignored, and the resulting model takes the form $m_2(z, \xi) = (\bar{\beta}_0 + \xi_0) + (\bar{\beta}_1 + \xi_1) z_1 + (\bar{\beta}_2 + \xi_2) z_2$, where $(\xi_0, \xi_1, \xi_2)$ are normally distributed and uncorrelated with mean zero, and the standard deviations are computed within MLR.

3. (NDC) Another approximation of the sales forecast is $m_3(z, \xi) = (\bar{\beta}_0 + \xi_0) + (\bar{\beta}_1 + \xi_1) z_1 + (\bar{\beta}_2 + \xi_2) z_2$, where $(\xi_0, \xi_1, \xi_2)$ are normally distributed and correlated according to the variance-covariance matrix reported by MLR.

4. (EAE) This is the empirical additive error model, where $m_4(z, \xi) = \bar{\beta}_0 + \bar{\beta}_1 z_1 + \bar{\beta}_2 z_2 + \xi_0$, and $\xi_0$ denotes a rv whose outcomes are $W_i - \hat{m}_4(Z_i)$. We refer to this model as empirical additive errors.
As in the set up for (3.6), the index q refers to the alternative error models (DF, NDU, NDC and EAE). The corresponding first stage is given as follows. The formulation presented below mimics (3.6), and since all decision variables x share the same space as the rv Z, we explicitly remind the reader that $Z = z = x$.

Index Sets and Variables:
i : index of product, $i \in \{A, B\}$
$y_i$ : number of batches of product i produced

$$\begin{aligned}
\max\ f(x) = -0.1 x_1 - 0.5 x_2 &+ \mathbb{E}[h(x, \tilde{\xi}_q \mid Z = z = x)] && (3.28a) \\
\text{s.t.}\quad x_1 + x_2 &\leq 200 && (3.28b) \\
x_1 - 0.5 x_2 &\geq 0 && (3.28c) \\
L_1 \leq x_1 \leq U_1, \quad L_2 &\leq x_2 \leq U_2 && (3.28d)
\end{aligned}$$

$$\begin{aligned}
h(x, \xi_q \mid Z = z = x) = \max\ 3 y_A &+ 5 y_B && (3.29a) \\
\text{s.t.}\quad y_A &\leq 4 && (3.29b) \\
2 y_B &\leq 12 && (3.29c) \\
3 y_A + 2 y_B &\leq 18 && (3.29d) \\
y_A + y_B &\leq m_q(z, \xi_q) && (3.29e) \\
y_A,\ y_B &\geq 0 && (3.29f)
\end{aligned}$$
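A sketch of evaluating the second stage (3.29) for one sampled demand outcome $m_q(z, \xi)$ using scipy's LP solver; since linprog minimizes, the profit is negated:

```python
from scipy.optimize import linprog

def wyndor_recourse(m_sample):
    """Solve the second-stage LP (3.29) for one outcome of m_q(z, xi):
       max 3 yA + 5 yB  s.t.  yA <= 4, 2 yB <= 12, 3 yA + 2 yB <= 18,
                              yA + yB <= m_sample, yA, yB >= 0."""
    c = [-3.0, -5.0]                                  # negate for maximization
    A_ub = [[1, 0], [0, 2], [3, 2], [1, 1]]
    b_ub = [4, 12, 18, m_sample]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 2)
    return -res.fun                                   # second-stage profit

# Monte Carlo estimate of E[h] for a fixed x (samples are illustrative):
# profits = [wyndor_recourse(m) for m in sampled_m_values]
```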
Note that the ranges $[L_1, U_1]$ and $[L_2, U_2]$ are chosen so that assumption A2 is satisfied. Note also that this instance is stated as a "maximization" model, whereas our previous discussions were set in the context of "minimization"; when interpreting the results, it helps to keep this distinction in mind. The LEO models presented above are relatively general, allowing very general regression models such as kernel-based methods, projection pursuit, and others. However, our current computational infrastructure is limited to stochastic linear programming (SLP), and as a result the regression used for models with Shared Spaces will be restricted to MLR. We now turn to the LEO-ELECEQUIP and LEO-Wyndor illustrations. All computations reported below are carried out by the original SD algorithm because the models are SLPs.
We now present the LEO-Wyndor problem under alternative models. DF/LP represents learning enabled optimization using deterministic forecasts, in which we use the expected value of the linear regression as the demand model; this results in a deterministic LP. In addition, we also study other models where linear regression suggests alternative parameters: a) the additive scalar error model, using the empirical additive errors (EAE) and deterministic model coefficients $\beta_0, \beta_1, \beta_2$, where the first is the constant term, the second is the coefficient for TV advertising, and the third is the coefficient for radio advertising; b) a linear regression whose coefficients are random variables $\{\tilde{\beta}_j\}$ which are normally distributed and uncorrelated (NDU); c) a linear regression whose coefficients are random variables $\{\tilde{\beta}_j\}$ which are normally distributed and correlated (NDC). We reiterate that all three models EAE, NDU, NDC correspond to specific types of errors (which are indexed by q in the presentation in section 3.3). Note that for models NDU and NDC, we have continuous rvs, and as a result we adopted SD as the solution methodology because it manages discrete and continuous random variables with equal ease. We refer to these results by NDU/SD and NDC/SD. Also note that for the case of EAE, the dataset is finite and reasonably manageable. Hence we will use both SAA and SD for this model, and refer to them by EAE/SAA and EAE/SD.
3.8.1 Results for Error Terms.
The calculations begin with the first test, corresponding to the top diamond block in Figure 1b. Table 3.6 shows the p-values and test results of the χ² test for NDU/SD, NDC/SD and EAE. From the values reported in Table 3.6, the fit appears to improve when a few of the data points near the boundary are eliminated (see Figure 3.4).

          NDU/SD               NDC/SD               EAE
Before    0.44, not rejected   0.42, not rejected   0.45, not rejected
After     0.59, not rejected   0.57, not rejected   0.78, not rejected

TABLE 3.6: LEO-Wyndor: Comparison of χ² Test Results
3.8.2 Results for Decisions and Optimal Value Estimates.

The decisions and various metrics discussed earlier are shown in Table 3.7. Although the prediction interval is listed in a symmetric way, the actual data points are asymmetric with respect to the estimated mean. The last two rows report the probability $\gamma$ and the corresponding tolerance level $\delta$, which are provided by the SD algorithm based on the theorems in Chapter 4. We choose 1% of the mean value of the validated objective to be $\delta_u$ in Theorem 3. Once again, notice that for both DF/LP and EAE/SAA, we do not report any probability because we use a deterministic solver as in (3.6).
The hypothesis test results for the cost-to-go objectives (the lowest diamond in Figure 1b) for each model are reported in Table 3.8. The T-test rejects the DF/LP model, whereas the others are not rejected. The next two rows give the test results for the variance based on the F-statistic, and we conclude that none of the models can be rejected. We also performed a χ² test for the cost-to-go objectives using the training and validation sets. Again, the DF/LP model was rejected whereas the others were not.
Remark 3. The concept of cross-validation (k-fold) is uncommon in the stochastic programming literature. With k > 1, this tool is a computational embodiment of (3.18), and provides a prediction of the error. Without such cross-validation, it is often likely that model assessment can go awry. For instance, in this example we have observed that if we use k = 1, then the EAE/SAA model can get rejected, although using k = 5, EAE/SAA is no longer rejected. This can be attributed to the fact that the variance reduction due to 5-fold cross-validation reduces the Type I error (when compared with k = 1).
Table 3.9 reports the optimization error, as well as the generalization error, for all models. DF/LP shows the largest optimization error, which indicates that it is not an appropriate model to recommend for this application. On the other hand, NDU/SD and NDC/SD have comparable and relatively small generalization errors. However, their optimization errors appear to be significant; therefore NDU/SD and NDC/SD do not yield the most profitable decisions. Had the criterion been simply the generalization error (as in SL), we might have chosen non-profit-maximizing decisions.
In Table 3.10 we present the pairwise comparisons of the Kruskal-Wallis test. For the tests of DF/LP against the other methodologies, the p-values are all smaller than 0.01, which implies that there are significant differences between the median ranks of DF/LP and each of the other four approaches. The stepped curves in Figure 3.5 illustrate the ordering discovered by the Kruskal-Wallis test. Note that DF/LP shows a significant difference from the other approaches. Moreover, the curves for NDU/SD and NDC/SD are relatively close, whereas EAE/SAA and EAE/SD are indistinguishable. These similarities are quantified in Table 3.10 by the fact that the p-values for these comparisons are greater than 0.01. Finally, EAE/SAA and EAE/SD give the largest objective values, as also reported in Table 3.7. The LEO-Wyndor example is a profit maximization problem; therefore EAE/SAA and EAE/SD lead to more profitable decisions, since they stochastically dominate the others. The Kruskal-Wallis test suggests that the difference between EAE/SAA and EAE/SD is not significant; therefore both EAE/SAA and EAE/SD provide the most profitable decision (see Table 3.7 for the validated objective values estimated for EAE/SAA and EAE/SD).
Models               DF/LP             NDU/SD           NDC/SD           EAE/SAA           EAE/SD
x_1                  173.48            181.70           181.40           191.27            191.40
x_2                   26.52             18.30            18.60             8.73              8.60
Estimated Obj.       $41,391           $41,580          $41,492          $42,009           $42,045
Validated 95% C.I.   $39,869 (±692)    $41,903 (±335)   $41,865 (±302)   $42,269 (±522)    $42,274 (±493)
Validated 95% P.I.   $39,856 (±1,302)  $41,911 (±632)   $41,841 (±639)   $42,258 (±1,012)  $42,279 (±973)
Probability (γ)      n/a               0.9633           0.9698           n/a               0.9872
Tolerance (δ)        n/a               760              694              n/a               842

TABLE 3.7: LEO-Wyndor: Comparison of Solutions for Alternative Models
Models                           DF/LP          NDU/SD         NDC/SD         EAE/SAA        EAE/SD
T-statistic (t < 1.96)           t = 2.18       t = 0.72       t = 0.84       t = 0.62       t = 0.49
Cost-to-go Test (Mean)           rejected       not rejected   not rejected   not rejected   not rejected
F-statistic (0.67 < f < 1.49)    f = 1.23       f = 1.43       f = 1.29       f = 0.79       f = 1.16
Cost-to-go Test (Variance)       not rejected   not rejected   not rejected   not rejected   not rejected
χ² Test p-value (p > 0.05)       p = 0.038      p = 0.34       p = 0.32       p = 0.42       p = 0.42
Cost-to-go Test (Distribution)   rejected       not rejected   not rejected   not rejected   not rejected

TABLE 3.8: LEO-Wyndor: Hypothesis Test Results for Alternative Models
FIGURE 3.5: LEO-Wyndor: Stochastic Dominance of Validated Objectives under Alternative Models

Models                 DF/LP    NDU/SD   NDC/SD   EAE/SAA   EAE/SD
Optimization Error      2405      371      409       5       n/a
Generalization Error   29.751   19.406   19.554   21.889    21.326

TABLE 3.9: LEO-Wyndor: Errors for Alternative Models
Models     EAE/SD         EAE/SAA        NDC/SD         NDU/SD
DF/LP      2.76 × 10^-8   1.34 × 10^-7   1.12 × 10^-7   5.60 × 10^-7
NDU/SD     8.46 × 10^-7   6.2 × 10^-3    0.37
NDC/SD     2.05 × 10^-7   1.72 × 10^-3
EAE/SAA    5.87 × 10^-2

TABLE 3.10: LEO-Wyndor: Kruskal-Wallis Test Results (p-values; significance threshold p = 0.01)
Chapter 4

SVM using Stochastic Decomposition
The Support Vector Machine (SVM) was proposed in the thesis of Vapnik, and has become a powerful and popular tool for building classifiers. Moreover, the kernelized SVM is able to solve the problem in a higher dimension where the dataset can become linearly separable, or where prediction accuracy at least improves. SVM problems are usually formulated as large scale Quadratic Programming (QP) problems, and finding SVM classifiers can be achieved using QP solvers (Suykens and Vandewalle, 1999) or by the stochastic gradient descent method (Bottou, 2010). Such SVM instances are also solvable with the Stochastic Decomposition algorithm, in which data points can be treated as samples of random variables; we can therefore avoid building large scale QP models.

Stochastic Decomposition (SD) is a sequentially sampled decomposition method. It is inspired by Benders' decomposition, and improves upon it by introducing a sequential sampling strategy and an approximation scheme in which subgradients of the convex recourse function are approximated with improving accuracy as the method continues its operations. The improvement in approximations is obtained by using newly discovered dual extreme points as the algorithm proceeds from one iteration to the next. A two-stage stochastic programming problem can be stated as:
$$\begin{aligned}
\min\ & c(x) + \mathbb{E}[h(x, \tilde{w})] && (4.1) \\
\text{s.t.}\ & x \in X && (4.2)
\end{aligned}$$

where the recourse function is defined as follows:

$$\begin{aligned}
h(x, w) = \min\ & d(y) && (4.3) \\
\text{s.t.}\ & y \in Y(x, w) && (4.4)
\end{aligned}$$
The random variable ˜ w is defined over an appropriate probability
space, and w denotes a realization of the random variable ˜ w. This
model extends the deterministic programming framework for plan-
ning under uncertainty. While the associated probability space for ˜ w
can be large, our focus is to address this formulation computationally
and identify statistical optimality bounds for the decision support.
Problems with continuous random variables arise in a variety of applications, and for these the standard finite-scenario SP formulation can be very cumbersome. For instance, the SAA sample size formula is known to provide sample sizes which are far too conservative for algorithmic purposes. As a result we turn to a so-called internal sampling algorithm: the stochastic cutting plane (SCP) method. This was a precursor to SD, and its presentation in Higle and Sen, 1996b shows its asymptotic convergence. This chapter focuses on the SD algorithm and related machine learning applications, and we summarize our contributions by section below.
- Section 1 focuses on our motivation, in which we introduce the two-stage stochastic programming formulation of the Support Vector Machine problem, so that such instances can be solved by stochastic decomposition algorithms. In addition, stochastic programming based kernelized methods for Support Vector Machines are also introduced.

- In Section 2, we introduce the algorithmic background, including Sample Average Approximation and SD. In the background of the SD algorithm, the notion of compromise problems, which reduce the variance as well as the computational time, is presented. For detailed mathematical proofs of this algorithm, we refer the reader to Higle and Sen, 1991, Higle and Sen, 1994, and Sen and Liu, 2016.

- Following the algorithmic background, in Section 3 we present the detailed steps in which the standard Stochastic Decomposition algorithm is revised for solving SP-based SVM instances. Furthermore, the SD solver is also able to provide optimality guarantees for its solutions.

- In Section 4, where we present all the computational results, several examples of Support Vector Machines (SVM) are included, which are solved by the SD algorithm. The SVM instances include the Heart Disease, Census Income, Breast Cancer, Iris and Car Evaluation datasets. We compare SD with a variety of algorithms, such as Stochastic Gradient Descent and Quadratic Programming. Furthermore, we also present computational results with kernel functions.
4.1 SD for Support Vector Machines
The objective function of the SVM formulation maximizes the margin width of the classifier. The mathematical formulation of the binary SVM classifier can be stated as follows:
$$\min \; f(\beta, b) = \frac{1}{2}\|\beta\|^2 + \frac{C}{N}\sum_{i=1}^{N}\xi_i \tag{4.5}$$
$$\text{s.t.} \quad w_i(\beta^\top z_i + b) \ge 1 - \xi_i, \quad i = 1, \dots, N \tag{4.6}$$
$$\qquad\ \ \xi_i \ge 0, \quad i = 1, \dots, N. \tag{4.7}$$
Here $(z_i, w_i)$ denotes a data point in the training set, where $z_i$ is the predictor and $w_i$ is the corresponding response. The first-stage decision variables $(\beta, b)$ denote the weights of the separating hyperplane. Note that a weighting parameter $C$ is also included in the objective to balance the empirical errors of the classifier against the margin width of the separation. We can also write the two-stage formulation of the SVM problem as follows.
$$\min \; f(\beta, b) = \frac{1}{2}\|\beta\|^2 + C\,\mathbb{E}[h((\beta, b), \tilde{d})] \tag{4.8}$$
$$h((\beta, b), d) = \min \; \xi \tag{4.9}$$
$$\text{s.t.} \quad \xi \ge 1 - w(d)\,z(d)^\top \beta - w(d)\,b \tag{4.10}$$
$$\qquad\ \ \xi \ge 0, \tag{4.11}$$
where $\tilde{d}$ denotes the random variable, and for each realization $d$ the recourse function $h$ is defined as above.
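For intuition, the second-stage LP (4.9)-(4.11) has a closed-form optimal value for a fixed first-stage decision: $\xi^*(d) = \max\{0, 1 - w(d)(z(d)^\top\beta + b)\}$, i.e., the hinge loss. The following is a minimal sketch (not the SD solver itself) that evaluates a Monte Carlo estimate of the objective (4.8), treating the training points as equally likely realizations; the synthetic data and array names are illustrative assumptions.

```python
import numpy as np

def recourse(beta, b, z, w):
    # Optimal value of the second-stage LP (4.9)-(4.11): the hinge loss.
    return max(0.0, 1.0 - w * (z @ beta + b))

def sample_objective(beta, b, Z, W, C):
    # Monte Carlo estimate of (4.8): the N training points act as
    # equally likely realizations of the random variable d.
    N = len(W)
    exp_h = sum(recourse(beta, b, Z[i], W[i]) for i in range(N)) / N
    return 0.5 * beta @ beta + C * exp_h

# Illustrative usage with synthetic data (assumed labels in {-1, +1}).
rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 5))                        # predictors z_i
W = np.sign(Z[:, 0] + 0.1 * rng.normal(size=100))    # responses w_i
beta, b = np.zeros(5), 0.0
print(sample_objective(beta, b, Z, W, C=1.0))
```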
Kernel Methods of SVM
Consider the dual form of the SVM obtained via the Lagrangian, which gives the following formulation.
$$L(\beta, b, \alpha) = \max_{\alpha, \mu} \; \min_{\beta, b} \; \frac{1}{2}\beta^\top\beta + \frac{C}{N}\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\left[w_i(\beta^\top z_i + b) - 1 + \xi_i\right] - \sum_{i=1}^{N}\mu_i\xi_i, \tag{4.12}$$
where $\alpha_i$ and $\mu_i$ are Lagrangian dual multipliers satisfying
$$\beta = \sum_{i=1}^{N}\alpha_i w_i z_i \tag{4.13}$$
$$\sum_{i=1}^{N}\alpha_i w_i = 0 \tag{4.14}$$
$$C/N - \alpha_i - \mu_i = 0. \tag{4.15}$$
With the above conditions, we have
$$\begin{aligned}
L(\beta, b, \alpha)
&= \frac{1}{2}\beta^\top\beta + \frac{C}{N}\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i w_i \beta^\top z_i - b\sum_{i=1}^{N}\alpha_i w_i + \sum_{i=1}^{N}\alpha_i - \sum_{i=1}^{N}\alpha_i\xi_i - \sum_{i=1}^{N}\mu_i\xi_i \\
&= \frac{1}{2}\beta^\top\beta + \sum_{i=1}^{N}\left(\frac{C}{N} - \alpha_i - \mu_i\right)\xi_i - \beta^\top\beta + \sum_{i=1}^{N}\alpha_i \\
&= \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\beta^\top\beta \\
&= \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} w_i w_j \alpha_i \alpha_j \langle z_i, z_j\rangle,
\end{aligned}$$
where $\langle\cdot,\cdot\rangle$ denotes the inner product of two vectors. Therefore the SVM problem can be formulated as follows.
$$\max \; \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} w_i w_j \alpha_i \alpha_j \langle z_i, z_j\rangle \tag{4.16}$$
$$\text{s.t.} \quad \sum_{i=1}^{N}\alpha_i w_i = 0 \tag{4.17}$$
$$\qquad\ \ \alpha_i \le \frac{C}{N}, \quad i = 1, \dots, N \tag{4.18}$$
$$\qquad\ \ \alpha_i \ge 0, \quad i = 1, \dots, N. \tag{4.19}$$
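For reference, the finite-dimensional dual (4.16)-(4.19) is a standard QP. A minimal sketch of one possible encoding, using cvxpy and identity (4.13) to keep the objective in a numerically convenient form, is given below; the synthetic data are an illustrative assumption, and this is not the CPLEX setup used in our experiments.

```python
import cvxpy as cp
import numpy as np

def dual_svm(Z, W, C):
    # Solve the dual QP (4.16)-(4.19); rows of Z are z_i, W holds labels w_i in {-1,+1}.
    N = len(W)
    alpha = cp.Variable(N)
    beta = Z.T @ cp.multiply(alpha, W)   # beta = sum_i alpha_i w_i z_i, cf. (4.13)
    obj = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(beta))
    cons = [W @ alpha == 0, alpha >= 0, alpha <= C / N]
    cp.Problem(obj, cons).solve()
    return alpha.value

# Illustrative usage with small synthetic data (an assumption, not thesis data).
rng = np.random.default_rng(0)
Z = rng.normal(size=(40, 3))
W = np.sign(Z[:, 0] + 0.1 * rng.normal(size=40))
alpha = dual_svm(Z, W, C=1.0)
```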
An SVM dual problem in Stochastic Programming
One advantage of stating the dual SVM problem in the SP setting is that the structure of the Lagrangian dual problem is more scalable, and the resulting dual problem is directly solvable using the SD algorithm. For a standard SVM formulation, we introduce dual multipliers. Following the dual formulation studied in Higle, Rayco, and Sen, 2010, let $\alpha$ denote a subgradient of $h$; with the necessary conditions of the Lagrangian formulation, we have the following dual formulation.
$$\max \; -\frac{1}{2}\beta^\top\beta + \mathbb{E}_{d}\!\left[\frac{\alpha(d)}{N}\right] \tag{4.20}$$
$$\text{s.t.} \quad \gamma = 0 \tag{4.21}$$
$$\qquad\ \ \beta = \mathbb{E}_{d}[\alpha(d)\,w(d)\,z(d)] \tag{4.22}$$
$$\qquad\ \ \gamma = \mathbb{E}_{d}[\alpha(d)\,w(d)] \tag{4.23}$$
$$\qquad\ \ 0 \le \alpha(d) \le C. \tag{4.24}$$
If we consider sampling from the dataset $\{\langle z_i, z_j\rangle, \ i = 1, \dots, N, \ j = 1, \dots, N\}$, then the kernel method can be applied to build a nonlinear classifier in cases where the training data can then be classified with smaller error, or where the dataset becomes linearly separable in a higher-dimensional space. Let $\phi(\cdot)$ denote a feature mapping; we obtain the kernel version by replacing the inner product $\langle z_i, z_j\rangle$ with $K(z_i, z_j) = \langle\phi(z_i), \phi(z_j)\rangle$. In the computational results of Section 4.3, we discuss several illustrative SVM examples and compare the performance of the SD algorithm with other standard methods.
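To make the kernel substitution concrete, here is a minimal sketch of replacing the linear Gram matrix $\langle z_i, z_j\rangle$ with an RBF (Gaussian) kernel $K(z_i, z_j) = \exp(-\|z_i - z_j\|^2/(2\sigma^2))$; the bandwidth $\sigma$ and the random data are illustrative assumptions.

```python
import numpy as np

def linear_gram(Z):
    # Gram matrix of inner products <z_i, z_j>.
    return Z @ Z.T

def rbf_gram(Z, sigma=1.0):
    # RBF kernel: K(z_i, z_j) = exp(-||z_i - z_j||^2 / (2 sigma^2)).
    sq_norms = np.sum(Z**2, axis=1)
    sq_dists = sq_norms[:, None] - 2.0 * (Z @ Z.T) + sq_norms[None, :]
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

Z = np.random.default_rng(1).normal(size=(5, 3))
K_lin, K_rbf = linear_gram(Z), rbf_gram(Z, sigma=0.5)
# Either matrix can be sampled entrywise in place of <z_i, z_j>.
```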
4.2 Stochastic Decomposition Algorithm for SVM
In the following we present an extension of SD which includes a quadratic term. As shown in Sen and Liu, 2016, our goal in this section is to show that a finite convergence result still holds for the SD algorithm with an SVM module. As in the standard SD method (Sen and Liu, 2016), the piecewise linear approximation we maintain requires only finitely many cutting planes.

As in the original SD method (for SLPs), the algorithm works in an "online" manner, augmenting the sample by a small number of additional observations (say, one) at each iteration. Any iteration of the algorithm, denoted $k$, starts with an incumbent solution denoted $\hat{x}^k$. The method is as follows.
During initialization, set the iteration index $k = 1$, choose a starting incumbent $\hat{x}^1 \in X$, set a value for the proximal parameter $\rho > 1$, and set $f^0(x) := c(x)$.
As in the standard SD method, at iteration $k$ a value-function approximation is formed as the pointwise maximum of several affine functions; in our case, $f^{k-1}(x) = c(x) + \max\{h^{j,k-1}(x),\ j \in J^{k-1}\}$. For each index $j$, the algorithm also records the number of samples used to create the approximation. Then, at the $k$-th iteration, we assume that the value function from the previous iteration is already known, and we solve the following optimization problem to find a candidate decision.
$$x^k \in \operatorname{argmin}\left\{ f^{k-1}(x) + \frac{\rho}{2}\|x - \hat{x}^{k-1}\|^2 \ \middle|\ x \in X \right\}.$$
For the methodology of updating the incumbent decisions ($\hat{x}^{k-1}$) and candidate solutions ($x^k$), we refer to the original SD method (Sen and Liu, 2016).
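To make the candidate-selection step concrete, a minimal sketch of the regularized master problem follows, under two simplifying assumptions: the cuts already subsume the first-stage cost $c(x)$, and $X$ is a simple box. cvxpy is used here only for brevity; it is not the solver used in our experiments.

```python
import cvxpy as cp
import numpy as np

def candidate_step(cuts, x_hat, rho, lb, ub):
    """One regularized master step: minimize the cutting-plane model
    plus a proximal term around the incumbent x_hat. `cuts` is a list of
    (intercept a_j, gradient g_j) pairs; bounds (lb, ub) stand in for X.
    All inputs here are illustrative assumptions."""
    n = x_hat.size
    x = cp.Variable(n)
    eta = cp.Variable()                         # epigraph of max_j {a_j + g_j' x}
    cons = [eta >= a + g @ x for (a, g) in cuts]
    cons += [x >= lb, x <= ub]
    obj = cp.Minimize(eta + (rho / 2) * cp.sum_squares(x - x_hat))
    cp.Problem(obj, cons).solve()
    return x.value

# Illustrative usage: two cuts in R^2, incumbent at the origin.
cuts = [(1.0, np.array([0.5, -0.2])), (0.3, np.array([-0.1, 0.4]))]
x_next = candidate_step(cuts, x_hat=np.zeros(2), rho=1.0, lb=-5.0, ub=5.0)
```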
Theorem 5. Let assumptions SAA-(a)-(d) and SD-(b)-(d) hold. In place of assumption SD-(a), assume that $g(x,y)$ is convex in both variables. In the case of LEO models with disjoint spaces (3.4)-(3.5), no additional assumptions on $m(z,x)$ are necessary (beyond those of Section 3.3). In the case of LEO models with shared spaces, i.e., (3.6)-(3.7), assume that $m(z,x)$ is a proper concave integrand. Finally, we assume that the set of all subgradients of the expected cost-to-go function is bounded (i.e., $\exists B > 0$ such that $\bar{\beta} \in \partial \mathbb{E}[H(x,\xi)] \implies \|\bar{\beta}\|^2 \le B^2$, $\forall x \in X$). Then the SD algorithm produces a $\gamma$-optimal solution which satisfies (3.8) in finitely many iterations.
Proof: Under the assumptions for SAA, we may set a tolerance level $\delta/2$ and a reliability level $\epsilon = (1+\gamma)/2 > \gamma$. Then Proposition 1 ensures that there exists a finite sample size $K(\epsilon, \delta/2) < \infty$ such that $K(\gamma, \delta) \le K(\epsilon, \delta/2)$, and the SAA approximation provides a $\delta/2$-optimal solution with probability $\epsilon > \gamma$. Because an SP with sample size $K(\epsilon, \delta/2)$ has finitely many outcomes, its sample space is compact. Hence, if SD is applied to this empirical problem, assumptions SD-(a),(b),(c),(d) of traditional SD are satisfied. Therefore the SD algorithm provides a $\delta/2$-optimal solution with probability 1 for the SAA with sample size $K(\epsilon, \delta/2)$. Since $\gamma < \epsilon < 1$, there exists a finite iteration $K(\gamma, \delta) \le K(\epsilon, \delta/2)$ such that the solution of SD is $\delta$-optimal. $\square$
Proposition 3. Suppose that assumptions SD-(a),(b),(c),(d) are satisfied, and the set of all subgradients of the expected cost-to-go function is bounded. Let $\gamma_k = f^{k-1}(x^k) - f^{k-1}(\bar{x}^{k-1})$, and let $\{k_n\}$, $n \in \mathcal{N}$, represent the sequence of iterations at which the incumbent is changed. If $\mathcal{N}$ is finite, then $\lim_{k \to \infty} \gamma_k = 0$ with probability one. If $\mathcal{N}$ is infinite, then $\lim_{m \to \infty} \frac{1}{m}\sum_{n=1}^{m} \gamma_{k_n} = 0$ with probability one.
Proposition 4. Suppose that assumptions SD-(a),(b),(c),(d) hold. Let $\{\hat{x}^k\}$ and $\{x^k\}$ denote the sequences of incumbent and candidate solutions generated by the regularized SD algorithm. With probability one, there exists a subsequence of iterations, indexed by a set $\mathcal{K}$, such that $\lim_{k \in \mathcal{K}} \left[ f^k(x^{k+1}) + \frac{1}{2}\|x^k - \hat{x}^{k-1}\|^2 - f^k(\hat{x}^k)\right] = 0$, and every accumulation point of $\{\hat{x}^k\}_{k \in \mathcal{K}}$ is an optimal solution.
Theorem 6. Assume that assumptions SD-(a),(b),(c),(d) are satisfied, the set of all subgradients of the expected cost-to-go function is bounded, and $\rho \ge 1$. Then the sequence of incumbent solutions $\{\hat{x}^k\}$ generated by the regularized SD algorithm for SVM converges to an optimal solution with probability one.

Proof: See Sen and Liu, 2016. $\square$
4.3 SVM Instances using SD algorithm
In this section we illustrate the SP-based SVM method, solved with the SD algorithm, on several instances. As mentioned in Section 1, SVMs can be solved using a variety of algorithms, such as the Stochastic Gradient Descent (SGD) method, deterministic quadratic programming (QP), and the Stochastic Decomposition algorithm. Here we study several SVM instances using SGD, QP, as well as SD, and then compare the rates of correct predictions. We carried out the computations on a MacBook Pro with the following specifications: Core i7 CPU 2.8GHz, 16GB 2133 MHz memory. For the SGD method, we use the R package gradDescent (https://cran.r-project.org/web/packages/gradDescent/gradDescent.pdf). For the QP case, we solve the deterministic QP formulation with CPLEX version 12.6. Note that for the instances solved by SD, probabilities of optimality are also provided.
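For orientation, a typical primal subgradient update for the hinge-loss objective (4.5) takes the following form (in the spirit of the Pegasos-style update of Shalev-Shwartz et al., 2011). The step-size schedule and data layout here are illustrative assumptions, not the exact settings of the gradDescent package.

```python
import numpy as np

def svm_sgd(Z, W, C, epochs=20, seed=0):
    # Subgradient descent on (1/2)||beta||^2 + (C/N) sum_i max(0, 1 - w_i(beta'z_i + b)).
    # A single sampled index i gives the unbiased gradient beta + C * dhinge_i.
    rng = np.random.default_rng(seed)
    N, p = Z.shape
    beta, b, t = np.zeros(p), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(N):
            t += 1
            lr = 1.0 / (1.0 + t)            # illustrative step-size schedule
            margin = W[i] * (Z[i] @ beta + b)
            g_beta = beta - (C * W[i] * Z[i] if margin < 1 else 0.0)
            g_b = -C * W[i] if margin < 1 else 0.0
            beta -= lr * g_beta
            b -= lr * g_b
    return beta, b
```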
Heart Disease Dataset (https://archive.ics.uci.edu/ml/datasets/Heart+Disease). The dataset contains four sections concerning heart disease diagnosis, collected from four different locations in 1988. Each data point includes 76 attributes, and the response feature refers to the presence of heart disease in the patient. The total sample size of this problem is 303. In Table 4.1, we compare the results of multiple methods on this instance.
Methods                   Training   Validation   Probability (γ)   Time
SGD                       93.6%      80.9%        N/A               0.53s
QP                        96.6%      83.1%        N/A               12.94s
SD                        94.5%      83.3%        96.9%             15.75s
SD (Cross-Validation)     94.7%      83.5%        97.4%

TABLE 4.1: SVM with Heart Disease Dataset

From Table 4.1 we can see that SGD provides the shortest running time; however, its validation accuracy is lower than that of the other methods. In this instance both QP and SD attain relatively good validation accuracy, with SD running slightly longer. For SD with cross-validation, we choose k = 5 (5-fold cross-validation); with cross-validation, both the overall accuracy rate and the probability of optimality improve.
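The 5-fold cross-validation used throughout these experiments can be sketched as follows; `train_svm` and `accuracy` are hypothetical placeholders for any of the solvers compared here (the SD solver in our experiments), not actual package functions.

```python
import numpy as np

def five_fold_cv(Z, W, train_svm, accuracy, k=5, seed=0):
    # Split the data into k folds; train on k-1 folds, validate on the held-out fold.
    idx = np.random.default_rng(seed).permutation(len(W))
    folds = np.array_split(idx, k)
    scores = []
    for j in range(k):
        val = folds[j]
        trn = np.concatenate([folds[m] for m in range(k) if m != j])
        model = train_svm(Z[trn], W[trn])            # placeholder trainer
        scores.append(accuracy(model, Z[val], W[val]))
    return float(np.mean(scores))
```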
Adult Dataset (Census Income Dataset) (https://archive.ics.uci.edu/ml/datasets/Adult). This dataset contains 48,842 data points, and the response of each data point reflects whether income exceeds $50,000 per year. The dataset was extracted from the 1994 Census database. There are 14 attributes, including age, workclass, level of education, hours-per-week, etc. Table 4.2 below summarizes the rates of correct predictions for both the training and validation sets, the probability guarantee of optimality for the SD solver, and execution times. In particular, we also include results from a logistic regression model for comparison.
Methods                   Training   Validation   Probability (γ)   Time
Logistic Regression       84.1%      82.3%        N/A               241.43s
SGD                       94.8%      81.8%        N/A               277.68s
QP                        94.6%      83.0%        N/A               830.11s
SD                        95.3%      82.7%        97.8%             431.52s
SD (Cross-Validation)     95.2%      84.2%        98.2%

TABLE 4.2: SVM with Adult (Census Income) Dataset

In the training-accuracy column of Table 4.2, note that all methods except Logistic Regression attain high training accuracy. On the other hand, the validation accuracies are comparable across all the methods, which indicates that for Logistic Regression the training and validation accuracy rates are closer to each other. Compared with Logistic Regression, SGD has a longer running time yet lower validation accuracy. The validation accuracy of QP is the largest among the first four methods (those without cross-validation); nevertheless, it shows a significantly longer running time, roughly twice that of SD. In addition, by including a 5-fold cross-validation scheme, the validation accuracy improves further and becomes the best among all the algorithms.
Breast Cancer Dataset (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer). This dataset is from the Wisconsin Diagnostic Breast Cancer database created by Dr. William H. Wolberg at the University of Wisconsin. The Breast Cancer dataset consists of 400 observations of patients, and each data point has 20 features, computed from digital images of characteristics of the cell nuclei. In this instance we performed 5 experiments with different classification algorithms: SGD, QP, SD, SD with a Radial Basis Function (RBF) kernel (also known as a Gaussian kernel), and 5-fold cross-validated SD with the RBF kernel.
Methods                          Training   Validation   Probability (γ)   Time
SGD                              92.4%      84.9%        N/A               1.98s
QP                               95.2%      85.2%        N/A               19.38s
SD                               94.0%      85.9%        95.5%             16.23s
SD (RBF)                         94.9%      88.7%        95.2%             16.92s
SD (Cross-Validation, RBF)       93.3%      89.0%        96.0%

TABLE 4.3: SVM with Breast Cancer Dataset
In Table 4.3, we summarize the training/validation accuracy rates, probability of optimality and running time for each algorithm. From the table, we can see that among the first three algorithms, which do not use kernel functions, SGD has the shortest running time, whereas the running time of QP is 19.38s. The execution time of SD lies in the middle, yet it provides the best validation accuracy among these three algorithms.

Another observation from Table 4.3 is that by introducing the RBF kernel function, the validation accuracy increases by about 2%, which indicates that the RBF kernel performs well in this example. In addition, for SD with the RBF kernel, when the 5-fold cross-validation scheme is also used, improvements can be observed in both the accuracy rate and the probability of optimality.
Iris Dataset (https://archive.ics.uci.edu/ml/datasets/Iris). The Iris dataset is a well-known dataset from the pattern recognition literature (Fisher, 1936). There are 150 data points in the dataset, classified into 3 categories, each referring to a type of iris plant. Note that one category is linearly separable from the others, while the other two classes are not separable from each other. Each data point has 4 attributes: sepal length, sepal width, petal length and petal width.
Similarly, we performed 4 experiments using a variety of algorithms: SGD, QP, SD, and a 5-fold cross-validated version of SD. The results are presented in Table 4.4. In these experiments, compared with SGD and QP, the SD method shows better validation accuracy (an improvement of approximately 3%), and the cross-validated method increases the accuracy by another 1%.

Methods                   Training   Validation   Probability (γ)   Time
SGD                       93.1%      87.4%        N/A               1.77s
QP                        94.9%      88.0%        N/A               8.62s
SD                        94.3%      91.2%        97.1%             10.13s
SD (Cross-Validation)     96.5%      92.1%        96.3%

TABLE 4.4: SVM with Iris Dataset
Car Evaluation Dataset (https://archive.ics.uci.edu/ml/datasets/Car+Evaluation). This dataset was originally developed by Bohanec and Rajkovič, 1990. The response evaluates cars into 4 conditions: unacceptable, acceptable, good and very good, according to 6 different attributes. The sample size of this instance is 1728, and we compared the results of the following methods: SGD, QP, SD and SD with cross-validation.
The accuracy rates are summarized in Table 4.5, where the SGD method has the shortest running time. Compared with SGD, SD takes approximately 10 seconds longer, but the validation accuracy improves by 1%. Although SD with cross-validation provides a higher training accuracy, the validation accuracy is not significantly improved (compared with SD without cross-validation).

Methods                   Training   Validation   Probability (γ)   Time
SGD                       98.8%      95.9%        N/A               6.43s
QP                        96.7%      95.6%        N/A               28.55s
SD                        97.1%      97.0%        98.3%             17.28s
SD (Cross-Validation)     98.4%      97.0%        98.7%

TABLE 4.5: SVM with Car Evaluation Dataset
4.4 Conclusion
In this chapter, we presented an extension of Stochastic Decomposition to address SP-based support vector machine problems. As in Sen and Liu, 2016, the idea of the compromise solution is to use the solutions and approximation functions generated from different replications to create a compromise problem. By solving the compromise problem, we obtain a compromise solution, which provides a more reliable decision. As suggested in Deng and Sen, 2018, a statistical optimality probability bound can also be calculated. In addition, five SVM instances were presented to illustrate the SD algorithm with the SVM module, and from the computational results we conclude that in most cases the SD algorithm provides improvements in solution quality.
Chapter 5

Computational OR Exchange (cORe) Platform
In this chapter, we present the Computational Operations Research Exchange (cORe) platform, which allows OE (Operations Engineering) researchers to leverage data, models, and software created by an ecosystem of researchers, so that they can use the exchange to demonstrate the value of their research results using the infrastructure developed, maintained, and updated by this system.
5.1 Background and Motivation
The entire workflow (i.e., protocol) of LEO applications is software intensive, because the setting is intended to investigate several plausible statistical models and their corresponding optimization models. Ultimately, the decisions from these models are pitted against each other so that the model validation, assessment, and selection procedures can guide either a user or a mechanism toward appropriate choices. Because SP models are usually communicated using algebraic formulations, while SL methods are available through open-source software, we plan to integrate these alternative architectures through a novel open-source platform freely available for academic use. In the meantime, our integrative analytics framework generates additional data due to alternative learning and optimization models, in which a collection of hypotheses is tested and evolved to provide better decisions. Therefore, user-friendly tools are required to manipulate, organize and analyze large-scale data, thereby maximizing the knowledge gained for research problems.
One existing life-sciences cyberinfrastructure, CyVerse (formerly the iPlant Collaborative; see Goff et al., 2011), provides user interfaces to access and manage data for scientific discovery, and includes a variety of services for storing and sharing large and diverse biological datasets. In addition, CyVerse provides a collection of products for scientific analyses, including the Discovery Environment, which allows non-technical users to run existing bioinformatics software and other command-line tools. To carry out such scientific experiments and analyses, multiple software packages, libraries and tools need to be installed through the CyVerse platform, and each program requires specific versions of operating systems. In order to collaborate with other researchers or reproduce past results, the same versions of packages are usually required. These requirements on local packages make it challenging for users to share their applications and may impair the reproducibility of experiments. The CI (cyberinfrastructure) for cORe will leverage a system known as DERIVA (Discovery Environment for Relational Information and Versioned Assets) (Schuler, Kesselman, and Czajkowski, 2016), which has been used to support both small and large scientific communities in the Biosciences. DERIVA is a model-driven, web-based interface for data discovery and sharing, making it easier for users to exchange and reproduce experimental results without requirements on the operating system and underlying programs. Because of its roots in the experimental sciences, DERIVA is designed to support collaboration through the full life-cycle of scientific data, and is intended to streamline the storage and management of digital assets such as images, including experimental data from instruments (e.g., microscopes, sequencers and flow cytometers).
OE research, however, differs significantly in its life-cycle because of its focus on mathematical modeling and algorithm development. We think of the model and the algorithm of an OE researcher as the experimental setup which a scientist typically designs. For instance, in the case of the Statewide Electricity Dispatch example, we may look upon model building and algorithm design as the experimental setup for which a DERIVA user is responsible. Once the model and algorithm have been developed, we begin the process of integrating these procedures into the larger experiment, which needs to be instantiated with data and algorithms for its other parts. This is accomplished through the services of DERIVA, which include: a loosely coupled web-services architecture with well-defined public interfaces for every component; Entity-Relationship Models (ERM) that leverage standardized vocabularies, with adaptive components that automatically respond to evolving ER models; model-driven user interfaces to enable navigation and discovery as the data model evolves; and data-oriented protocols in which distributed components coordinate complex activities via data state changes. DERIVA uses an entity-relationship data model to catalog and organize assets, which are digital objects (i.e., files). Assets are characterized by contextualizing metadata which place an asset into the model by relating it to a specific entity. Additional descriptive metadata are used to describe further attributes and relationships between assets. The components of the cORe ecosystem are shown in Figure 5.1 below. The cORe platform includes the acquisition, management, analysis and sharing of data, models and decisions, and provides the following capabilities.
- Characterization and acquisition of datasets, including input data for statistical learning and analysis, outputs from learning/optimization models, and results from validation analysis.
- Organization of data-driven decisions. The cORe system provides OE users with interactive ways of connecting and importing data-analysis packages/functions (such as R packages), optimization solvers and validation/visualization tools. The data assets are stored in distributed, cloud-based data storage systems.
- Sharing and exchange of model and data collections. The platform involves management of the IP associated with data assets, which is intended to protect proprietary data. At the same time, cORe allows users to export and share data and models for other usages.
5.2 Characteristics and Designs for the cORe Platform
Here we describe each cORe capability presented in Figure 5.1. The pipeline of each application begins with the acquisition of data assets, which may be provided by the cORe platform or uploaded by users. As new assets are acquired, analyzed and processed in the Metadata Repository, they are immediately cataloged for future searching, sorting and sharing. The analysis outputs are also collected and imported into the next layer, the Optimization Models Repository. In this step the platform integrates the data-analysis outputs with the corresponding mathematical models, and generates decisions and other related outputs by automatically connecting to and invoking external optimization solvers. Finally, the decisions and analytics results can be visualized and validated through the visualization interface. The platform archives newly generated data assets and then provides analytics reports. We now discuss the design and organization of the general architecture of the cORe ecosystem.
FIGURE 5.1: cORe: Computational Operations Research Exchange Platform Design

Metadata Repository: Acquisition, Characterization and Analysis of Data Assets.
As discussed above, datasets may be generated or uploaded by users; in addition, the platform provides diverse sources of assets, such as open-source government databases. The metadata repository automatically integrates packages and functions commonly used in data analysis, which allows users to create, manage and analyze datasets. For instance, in the LEO-Wyndor example, the input dataset is analyzed using a multiple linear regression model, and in the cORe platform such regression functions are integrated so that the user can easily access and use them.

Note that since the above architecture imports learning and analysis tools arbitrarily, certain pre-conditions and assumptions must be satisfied. For example, in multiple linear regression, assumptions such as linearity, independence and normality should hold so that the learning models and outputs are reasonable and acceptable. Therefore, the cORe platform provides a list of necessary pre-conditions for each candidate learning tool to help users identify appropriate methods and functions.

During the learning process, model files as well as new data assets (e.g., error sets) are generated and stored in the system. All datasets are labeled with name, owner, category, modification time, keywords and other related information provided by collaborators for searching and sorting purposes. In the metadata repository, a learning step is not always necessary, so users can create and manage data assets without analysis methods. Hence the cORe platform is compatible with standard SP applications, in which the metadata repository simply manages samples or distributions of random variables.
5.3 cORe Applications
One of the major conveniences today is the widespread availability of local data, ranging from weather and traffic to crime, hospitals, health care and more. Many of these data resources are freely available on the web, while others need some level of authentication, and still others call for subscription to a service. In the cORe platform, we intend to include the following three classes of applications.
(a) Experiments using pedagogical datasets for educational purposes, such as the dynamic routing application of Section 2, which illustrates the power of combining filtering and optimization. Such learning+optimization experiments can be carried out with very little effort on this platform. If authors are willing to share their models, there could be a new experimental paradigm in which users get near-instantaneous feedback on the usefulness of methods for semi-real-world applications. Over time, as cORe and its user base grow, we expect a rich set of models and data for these and other urban issues (police-force allocation, safest bicycle-route choice, etc.).
FIGURE 5.2: Practical Applications for Community Data
(b) Advanced applications, which have larger sample sizes and higher dimensions compared with the pedagogical examples. These applications are mainly for academic research, especially for semi-real-world experiments which involve a series of computations such as data processing or solving an optimization model. For instance, in the platform the SSN example is labeled as an advanced application. A cORe workflow can help authors manage the datasets as well as the algorithm programs, so that it is more convenient to share and reproduce the results which appear in the corresponding literature.
(c) Practical applications for community datasets: other applications which come to mind include the choice of healthcare facilities (based on hospital performance records as well as convenience for a patient). The increased use of bicycles in some urban areas also prompts questions such as safety-first routing. Such applications would require accident data from which one can deduce the probability of accidents, and then choose routes which are least accident-prone (i.e., the path with the minimum probability of accidents). Incidentally, one of the major advantages of the new data-driven movement is that many of the databases (e.g., Open Data and Socrata Open Data) are available in most U.S. jurisdictions; as a result, algorithms tested on a dataset for LA should be usable for datasets in other parts of the country, creating an explosion of possibilities for OE researchers to make a difference to communities.
MnM (Meal Planning for the New Millennium) problem

FIGURE 5.3: Homepage of the cORe platform

FIGURE 5.4: MnM problem

As introduced in Chapter 2, the MnM (Meal Planning for the New Millennium) problem is inspired by the famous Diet Problem, and it includes preference ratings based on feedback datasets for the recipes. In this particular instance, we use a collaborative filtering method to predict the recipe preferences of an individual. On the optimization side, the goal is to choose from a variety of recipes and maximize the total rating, while the solution satisfies all the nutrition constraints; a sketch of this model appears below. In the cORe platform, this MnM problem is classified as a pedagogical example.
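A minimal sketch of the MIP side of MnM is given below, written in Pyomo (which the platform already employs via PySP). The recipe ratings, nutrition matrix, and bounds are illustrative placeholders for the parameters produced by the collaborative filtering step, and the solver call is only indicative.

```python
import pyomo.environ as pyo

def build_mnm_model(ratings, nutrition, lower, upper):
    """Choose a subset of recipes maximizing total predicted rating,
    subject to nutrient bounds. All data are illustrative placeholders."""
    R, K = nutrition.shape                       # recipes x nutrients
    m = pyo.ConcreteModel()
    m.x = pyo.Var(range(R), domain=pyo.Binary)   # 1 if recipe i is selected
    m.obj = pyo.Objective(
        expr=sum(ratings[i] * m.x[i] for i in range(R)), sense=pyo.maximize)
    def bounds_rule(m, k):
        total = sum(nutrition[i, k] * m.x[i] for i in range(R))
        return pyo.inequality(lower[k], total, upper[k])
    m.nutrients = pyo.Constraint(range(K), rule=bounds_rule)
    return m

# Illustrative usage; a MIP solver (e.g., CPLEX, as in Section 4.3) would solve it:
# pyo.SolverFactory("cplex").solve(build_mnm_model(r, A, lo, hi))
```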
Figure 5.4 shows the main page of the MnM problem in our cORe platform. The MnM instance consists of two phases: a Statistical Learning phase and a Mixed Integer Programming phase. On the left-hand side, the user can click buttons to view or edit each phase. The statistical learning phase includes an Input File, a Step Program and an Output File. The input file is a dataset of recipes with ratings and nutrition facts.

Based on the input dataset, we include the collaborative filtering program as a Jupyter Notebook in R, and by clicking on the code-reference link, the user can view the collaborative filtering code and reproduce the experimental results (see Figure 5.5 and Figure 5.6). Similarly, in the Mixed Integer Programming phase, the platform includes the MIP model file as an input, which takes the parameters calculated in the collaborative filtering step.
FIGURE 5.5: SL Phase of MnM problem

FIGURE 5.6: Jupyter Notebook in R (MnM problem)

LEO-Wyndor problem. Another illustrative instance, which we have covered in Chapter 3, is the LEO-Wyndor problem. This problem is borrowed from a very popular OR textbook (Hillier and Lieberman, 2012), and it includes three phases: Statistical Learning, Stochastic Programming and Validation. In particular, the Stochastic Programming phase can be divided into two sub-phases: PySP modeling and SP solving. All these phases are listed on the LEO-Wyndor instance page in the cORe platform (see Figure 5.7).
FIGURE 5.7: LEO-Wyndor in cORe Platform
The MLR program, written in a Jupyter Notebook, is presented in Figure 5.8. The program takes the input advertising dataset from the platform and calculates the MLR parameters. In addition, the program also provides the variance-covariance matrix, which can be used for the NDC and NDU models.

The PySP modeling step takes the MLR parameters and generates SMPS-format inputs. The stochastic programming step then solves the SMPS input models and obtains solutions for the alternative models. Figure 5.9 shows the variety of metrics included in the cORe platform; the details of these metrics are presented in Chapter 3.
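The MLR step can also be reproduced outside the notebook with a few lines. Here is a minimal sketch using statsmodels; the file name and column names are hypothetical placeholders standing in for the platform's advertising dataset, not the actual asset identifiers.

```python
import pandas as pd
import statsmodels.api as sm

# Illustrative placeholders for the platform's advertising dataset.
df = pd.read_csv("advertising.csv")
X = sm.add_constant(df[["TV", "Radio"]])   # predictors (assumed column names)
y = df["Sales"]                            # response (assumed column name)

fit = sm.OLS(y, X).fit()
print(fit.params)          # MLR coefficients passed to the PySP modeling step
print(fit.cov_params())    # variance-covariance matrix used by the NDC/NDU models
```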
FIGURE 5.8: MLR of LEO-Wyndor in cORe Platform
FIGURE 5.9: Validation Phase of LEO-Wyndor in cORe Platform
Bibliography

Atchade, Yves F., Gersende Fort, and Eric Moulines (2014). "On stochastic proximal gradient algorithms". In: arXiv preprint arXiv:1402.2365 23.

Ban, Gah-Yi, Noureddine El Karoui, and Andrew E. B. Lim (2017). "Machine Learning and Portfolio Optimization". In: Management Science. DOI: 10.1287/mnsc.2016.2644.

Ban, Gah-Yi and Cynthia Rudin (2018). "The big data newsvendor: Practical insights from machine learning". In: Operations Research.

Bayraksan, Güzin and David P. Morton (2009). "Assessing solution quality in stochastic programs via sampling". In: Tutorials in Operations Research 5, pp. 102–122.

— (2011). "A sequential sampling procedure for Stochastic Programming". In: Operations Research 59.4, pp. 898–913.

Bayraksan, Güzin and Péguy Pierre-Louis (2012). "Fixed-width sequential stopping rules for a class of stochastic programs". In: SIAM Journal on Optimization 22.4, pp. 1518–1548.

Ben-Tal, Aharon and Arkadi Nemirovski (2001). Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. SIAM.

Bertsekas, D. P. (2012). Dynamic Programming and Optimal Control. Athena Scientific optimization and computation series v. 2. Athena Scientific. ISBN: 9781886529441.

Bertsimas, Dimitris and Nathan Kallus (2014). "From predictive to prescriptive analytics". In: arXiv preprint arXiv:1402.5481.

Bertsimas, Dimitris and Romy Shioda (2007). "Classification and Regression via Integer Optimization". In: Operations Research 55.2, pp. 252–271.

Bertsimas, Dimitris and Melvyn Sim (2004). "The price of robustness". In: Operations Research 52.1, pp. 35–53.

Bertsimas, Dimitris et al. (2014). "A comparison of Monte Carlo tree search and mathematical optimization for large scale dynamic resource allocation". In: arXiv preprint arXiv:1405.5498.

Besbes, Omar and Assaf Zeevi (2015). "On the (surprising) sufficiency of linear models for dynamic pricing with demand learning". In: Management Science 61.4, pp. 723–739.

Birge, John R. and François Louveaux (2011). Introduction to Stochastic Programming. Springer Science & Business Media.

Bohanec, Marko and Vladislav Rajkovič (1990). "DEX: An expert system shell for decision support". In: Sistemica 1.1, pp. 145–157.

Bottou, Léon (2010). "Large-scale machine learning with stochastic gradient descent". In: Proceedings of COMPSTAT'2010. Springer, pp. 177–186.

Box, George E. P. et al. (2015). Time Series Analysis: Forecasting and Control. John Wiley & Sons.

Chen, Zhe et al. (2003). "Bayesian filtering: From Kalman filters to particle filters, and beyond". In: Statistics 182.1, pp. 1–69.

Cooper, William L., Tito Homem-de-Mello, and Anton J. Kleywegt (2015). "Learning and pricing with models that do not explicitly incorporate competition". In: Operations Research 63.1, pp. 86–103.

De Farias, Daniela Pucci and Benjamin Van Roy (2004). "On constraint sampling in the linear programming approach to approximate dynamic programming". In: Mathematics of Operations Research 29.3, pp. 462–478.

Delage, Erick and Yinyu Ye (2010). "Distributionally robust optimization under moment uncertainty with application to data-driven problems". In: Operations Research 58.3, pp. 595–612.

Den Boer, Arnoud V. and Dirk D. Sierag (2016). Decision-based model selection.

Deng, Yunxiao and Suvrajeet Sen (2018). "Learning Enabled Optimization: Towards a Fusion of Statistical Learning and Stochastic Programming". In: INFORMS Journal on Optimization (submitted).

Diaconis, Persi and Mehrdad Shahshahani (1984). "On nonlinear functions of linear combinations". In: SIAM Journal on Scientific and Statistical Computing 5.1, pp. 175–191.

Elmachtoub, Adam N. and Paul Grigas (2017). "Smart "Predict, then Optimize"". In: arXiv preprint arXiv:1710.08005.

Fisher, Ronald A. (1936). "The use of multiple measurements in taxonomic problems". In: Annals of Eugenics 7.2, pp. 179–188.

Frazier, Peter (2012). "Optimization via simulation with Bayesian statistics and dynamic programming". In: Proceedings of the Winter Simulation Conference. Winter Simulation Conference, p. 7.

Freimer, Michael B., Jeffrey T. Linderoth, and Douglas J. Thomas (2012). "The impact of sampling methods on bias and variance in stochastic linear programs". In: Computational Optimization and Applications 51.1, pp. 51–75.

Frey, Jesse (2013). "Data-driven nonparametric prediction intervals". In: Journal of Statistical Planning and Inference 143.6, pp. 1039–1048.

Friedman, Jerome H. and Werner Stuetzle (1981). "Projection pursuit regression". In: Journal of the American Statistical Association 76.376, pp. 817–823.

Fu, Michael C. (2017). "Markov Decision Processes, AlphaGo, and Monte Carlo Tree Search: Back to the Future". In: Leading Developments from INFORMS Communities. INFORMS, pp. 68–88.

Gangammanavar, Harsha, Suvrajeet Sen, and Victor M. Zavala (2016). "Stochastic optimization of sub-hourly economic dispatch with wind energy". In: IEEE Transactions on Power Systems 31.2, pp. 949–959.

Glynn, Peter W. and Gerd Infanger (2013). "Simulation-based confidence bounds for two-stage stochastic programs". In: Mathematical Programming 138.1-2, pp. 15–42.

Goff, Stephen A. et al. (2011). "The iPlant Collaborative: cyberinfrastructure for plant biology". In: Frontiers in Plant Science 2, p. 34.

Harsha, Pavithra, Ramesh Natarajan, and Dharmashankar Subramanian (2015). "A data-driven, distribution-free, multivariate approach to the price-setting newsvendor problem".

Hastie, Trevor et al. (2015). "Matrix completion and low-rank SVD via fast alternating least squares". In: Journal of Machine Learning Research 16, pp. 3367–3402.

Hastie, Trevor J., Robert John Tibshirani, and Jerome H. Friedman (2011). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

Higle, Julia L., Brenda Rayco, and Suvrajeet Sen (2010). "Stochastic scenario decomposition for multistage stochastic programs". In: IMA Journal of Management Mathematics 21.1, pp. 39–66.

Higle, Julia L. and Suvrajeet Sen (1991). "Stochastic decomposition: An algorithm for two-stage linear programs with recourse". In: Mathematics of Operations Research 16.3, pp. 650–669.

— (1994). "Finite master programs in regularized stochastic decomposition". In: Mathematical Programming 67.1-3, pp. 143–168.

— (1996a). "Duality and statistical tests of optimality for two stage stochastic programs". In: Mathematical Programming 75.2, pp. 257–275.

— (1996b). Stochastic Decomposition. Springer.

Hillier, Frederick S. and G. J. Lieberman (2012). Introduction to Operations Research. Tata McGraw-Hill Education.

James, Gareth et al. (2013). An Introduction to Statistical Learning. Vol. 6. Springer.

Jiang, Daniel R., Lina Al-Kanj, and Warren B. Powell (2017). "Monte Carlo Tree Search with Sampled Information Relaxation Dual Bounds". In: arXiv preprint arXiv:1704.05963.

Kao, Y-H., Benjamin V. Roy, and Xiang Yan (2009). "Directed Regression". In: Advances in Neural Information Processing Systems 22. Ed. by Y. Bengio et al. Curran Associates, Inc., pp. 889–897. URL: http://papers.nips.cc/paper/3686-directed-regression.pdf.

Kao, Yi-Hao and Benjamin Van Roy (2014). "Directed principal component analysis". In: Operations Research 62.4, pp. 957–972.

Kleywegt, Anton J., Alexander Shapiro, and Tito Homem-de-Mello (2002). "The sample average approximation method for stochastic discrete optimization". In: SIAM Journal on Optimization 12.2, pp. 479–502.

Koren, Yehuda, Robert Bell, and Chris Volinsky (2009). "Matrix factorization techniques for recommender systems". In: Computer 42.8.

Küçükyavuz, Simge and Suvrajeet Sen (2017). "An Introduction to Two-Stage Stochastic Mixed-Integer Programming". In: INFORMS TutORials in Operations Research.

Kulis, Brian et al. (2013). "Metric learning: A survey". In: Foundations and Trends in Machine Learning 5.4, pp. 287–364.

Lim, Andrew E. B., J. George Shanthikumar, and Z. J. Max Shen (2006). "Model uncertainty, robust optimization, and learning". In: Tutorials in Operations Research: Models, Methods, and Applications for Innovative Decision Making, pp. 66–94.

Linderoth, Jeff, Alexander Shapiro, and Stephen Wright (2006). "The empirical behavior of sampling methods for stochastic programming". In: Annals of Operations Research 142.1, pp. 215–241.

Liyanage, Liwan H. and J. George Shanthikumar (2005). "A practical inventory control policy using operational statistics". In: Operations Research Letters 33.4, pp. 341–348.

Mak, Wai-Kei, David P. Morton, and R. Kevin Wood (1999). "Monte Carlo bounding techniques for determining solution quality in stochastic programs". In: Operations Research Letters 24.1, pp. 47–56.

Mello, Tito Homem-de and Güzin Bayraksan (2014). "Monte Carlo sampling-based methods for stochastic optimization". In: Surveys in Operations Research and Management Science 19.1, pp. 56–85.

Miller, Naomi and Andrzej Ruszczyński (2011). "Risk-averse two-stage stochastic linear programming: Modeling and decomposition". In: Operations Research 59.1, pp. 125–132.

Murphy, Frederic H. and Suvrajeet Sen (2002). "Qualitative implications of uncertainty in economic equilibrium models". In: Decision Making Under Uncertainty. Springer, pp. 135–151.

Powell, Warren B. (2011). Approximate Dynamic Programming: Solving the Curses of Dimensionality. Vol. 842. John Wiley & Sons.

Rios, Ignacio, Roger J. B. Wets, and David L. Woodruff (2015). "Multi-period forecasting and scenario generation with limited data". In: Computational Management Science 12.2, pp. 267–295.

Rockafellar, R. Tyrrell and Stan Uryasev (2013). "The fundamental risk quadrangle in risk management, optimization and statistical estimation". In: Surveys in Operations Research and Management Science 18.1-2, pp. 33–53.

Royset, Johannes O. and Roberto Szechtman (2013). "Optimal budget allocation for sample average approximation". In: Operations Research 61.3, pp. 762–776.

Royset, Johannes O. and Roger J. B. Wets (2014). "From data to assessments and decisions: Epi-spline technology". In: Tutorials in Operations Research: Bridging Data and Decisions. INFORMS, pp. 27–53.

Ryzhov, Ilya O., Warren B. Powell, and Peter I. Frazier (2012). "The knowledge gradient algorithm for a general class of online learning problems". In: Operations Research 60.1, pp. 180–195.

Schuler, Robert E., Carl Kesselman, and Karl Czajkowski (2016). "Accelerating data-driven discovery with scientific asset management". In: e-Science (e-Science), 2016 IEEE 12th International Conference on. IEEE, pp. 31–40.

Sen, Suvrajeet (2013). "Stochastic Programming". In: Encyclopedia of Operations Research and Management Science. Springer, pp. 1486–1497.

Sen, Suvrajeet, Robert D. Doverspike, and Steve Cosares (1994). "Network planning with random demand". In: Telecommunication Systems 3.1, pp. 11–30.

Sen, Suvrajeet and Yifan Liu (2016). "Mitigating Uncertainty via Compromise Decisions in Two-Stage Stochastic Linear Programming: Variance Reduction". In: Operations Research 64.6, pp. 1422–1437.

Sen, Suvrajeet and Zhihong Zhou (2014). "Multistage Stochastic Decomposition: a bridge between Stochastic Programming and Approximate Dynamic Programming". In: SIAM Journal on Optimization 24.1, pp. 127–153.

Shalev-Shwartz, Shai et al. (2011). "Pegasos: Primal estimated sub-gradient solver for SVM". In: Mathematical Programming 127.1, pp. 3–30.

Shapiro, Alexander, Darinka Dentcheva, and Andrzej Ruszczyński (2009). "Lectures on Stochastic Programming". In: SIAM, Philadelphia 10.

Shapiro, Alexander and Tito Homem-de-Mello (1998). "A simulation-based approach to two-stage Stochastic Programming with recourse". In: Mathematical Programming 81.3, pp. 301–325.

Simchi-Levi, David (2013). "OM forum—OM research: From problem-driven to data-driven research". In: Manufacturing & Service Operations Management 16.1, pp. 2–10.

Stigler, George J. (1945). "The cost of subsistence". In: Journal of Farm Economics 27.2, pp. 303–314.

Suykens, Johan A. K. and Joos Vandewalle (1999). "Least squares support vector machine classifiers". In: Neural Processing Letters 9.3, pp. 293–300.

Van Parys, Bart P. G., Peyman Mohajerin Esfahani, and Daniel Kuhn (2017). "From data to decisions: Distributionally robust optimization is optimal". In: arXiv preprint arXiv:1704.04118.

Walk, Harro (2008). "A universal strong law of large numbers for conditional expectations via nearest neighbors". In: Journal of Multivariate Analysis 99.6, pp. 1035–1050.

Wallace, Stein W. and William T. Ziemba (2005). Applications of Stochastic Programming. SIAM.

Xiao, Lin and Tong Zhang (2014). "A proximal stochastic gradient method with progressive variance reduction". In: SIAM Journal on Optimization 24.4, pp. 2057–2075.
Abstract

Several emerging applications call for a fusion of statistical learning (SL) and stochastic programming (SP). The Learning Enabled Optimization paradigm fuses concepts from these disciplines in a manner which not only enriches both SL and SP, but also provides a framework which supports rapid model updates and optimization, together with a methodology for rapid model validation, assessment, and selection. Moreover, in many "big data/big decisions" applications, these steps are repetitive, and realizable only through a continuous cycle involving data analysis, optimization, and validation. This thesis sets forth the foundation for such a framework by introducing several novel concepts such as statistical optimality, hypothesis tests for model fidelity, generalization error of stochastic programming, and finally, a non-parametric methodology for model selection. These new concepts provide a formal framework for modeling, solving, validating, and reporting solutions for Predictive Stochastic Programming (PSP). We illustrate the PSP framework by applying it to an inventory control model in which we use demand data available for ARIMA modeling in the statistical package "R". In addition, we introduce a computational Operations Research exchange platform, which is intended to help users build the entire pipeline of LEO applications in software.