Decision-Aware Learning in the Small-Data, Large-Scale Regime
by
Michael Huang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BUSINESS ADMINISTRATION)
December 2023
Copyright 2024 Michael Huang
Acknowledgements
Writing acknowledgments is always challenging, given the countless individuals who have contributed to my Ph.D. journey. I am sure that I will fail to thank everyone. Still, I want to express
my gratitude to everyone—family, friend, teacher, or advocate—who has supported, fostered, and
promoted my education. Throughout my life, I have faced few barriers to achieving my educational
goals, and these individuals’ efforts have made my education journey smooth and successful.
First, I would like to thank my advisors Vishal Gupta and Paat Rusmevichientong, who guided
and supported me through my Ph.D. journey. Vishal always highlighted the importance of maintaining a good work-life balance, which was essential for my survival through the Ph.D. I greatly
appreciated that he would always make time to listen to and help refine my research ideas. With
his help, I formed the kind of data-driven optimization dissertation I wished for when I entered
the program. Paat, on the other hand, helped me become a more rigorous researcher by always
setting higher standards for work, whether for a weekly meeting or publication. His higher standard pushed me to think more systematically and, in turn, made me a better and more confident
researcher. Together, they made a great team and are the kind of researchers and people I hope to
become one day.
I would also like to express my gratitude to Carri Chan and Vahid Sarhangian, who mentored
me during my Masters at Columbia and were co-authors of my very first journal paper. Without
them, I might not have seriously considered pursuing a Ph.D. Their mentorship allowed me to fall
in love with research and understand the practical impacts we can make in society.
Many other researchers also helped support my Ph.D. journey. In particular, I would like
to thank my other committee members: Kimon Drakopoulos for qualifying and the dissertation,
Meissam Razaviyayn for the dissertation, Raman Randhawa for qualifying, and John Carlsson
for qualifying. Your feedback and support helped elevate my research by providing key outside
perspectives. I also would like to thank the rest of the Data Sciences and Operations department,
faculty, and staff for making my experience at USC enjoyable.
I also cannot fail to mention fellow students and postdocs that formed the bulk of my Ph.D.
experience. I am overly grateful for my department cohort, Wilson Lin, Simone Shao, Bradley
Rava, and Patrick Vossler, who all had to put up with me waltzing into their offices and distracting
them from their work with mostly pointless conversations. Outside the cohort, I was also lucky to
have many chats and adventures in LA with Shobhit Jain, Justin Mulvany, Junxiong Yin, Katerina
Giannoutsou, Mohammed Mehrabi, Greg Faletto, Luella Fu, Yiqiu Shen, Rashmi Bhuyan. I am
also thankful to Justin Hyun for being the postdoc who was still willing to hang out with Ph.D.
students and provide his wisdom. I was thankful to friends Andrew Grasberger, Bill Tang, Louis
Yang, and AJ Chen, who were willing and, on rare occasions, asked to hear me talk about my
work.
Last but certainly not least, I would like to thank my family. My parents have supported me and
made me who I am today. My father, Jianzhong Huang, always fostered my education, ensuring
I had the best learning opportunities. I remember and appreciate every weekend he took me to
my math team practices, Chinese school, violin lessons, and painting class. My mother, Lishuang
Su, always instilled in me the value of being realistic, which inspires my teaching and research
philosophy. I am also nothing without my wife Simone, who not only accompanied me throughout
my entire Ph.D. journey but also kept me sane and allowed me to enjoy the endless sun in Los
Angeles. She supported me and is a constant reminder that the most important thing in life is
spending time with the people you love. I also want to thank my cat Opal, who is sleeping next to
me as I write this.
Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Small-Data, Large-Scale Regime . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Decision-Aware Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Chapter Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Chapter 2: Debiasing In-Sample Policy Performance for Small-Data, Large-Scale Optimization . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Chapter 3: Decision-Aware Denoising . . . . . . . . . . . . . . . . . . . . 4
1.2.3 Chapter 4: Learning Best-in-Class Policies in the Predict-then-Optimize
Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Chapter 2: Debiasing In-Sample Policy Performance for Small-Data, Large-Scale Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.2 A Motivating Example: Poor Performance of Cross-Validation with Limited Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.3 Relationship to Prior Work . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.2 Affine Plug-in Policy Classes . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Variance Gradient Correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.1 Motivating the Variance Gradient Correction (VGC) . . . . . . . . . . . . 27
2.3.2 Bias of Variance Gradient Correction . . . . . . . . . . . . . . . . . . . . 30
2.3.3 The Variance of the VGC . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3.4 Smoothness and Boundedness of the VGC . . . . . . . . . . . . . . . . . 34
2.4 Estimating Out-of-sample Performance for Weakly-Coupled Problems . . . . . . . 36
2.4.1 Problems Weakly-Coupled by Variables . . . . . . . . . . . . . . . . . . . 38
2.4.2 Problems Weakly-Coupled by Constraints . . . . . . . . . . . . . . . . . . 41
2.5 Numerical Case Study: Drone-Assisted Emergency Medical Response . . . . . . . 49
2.5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.7 Appendix: Background Results on Empirical Processes . . . . . . . . . . . . . . . 56
2.7.1 Method of Bounded Differences Excluding an Exceptional “Bad” Set . . . 62
2.8 Appendix: Properties of the Variance Gradient Correction (VGC) . . . . . . . . . . 63
2.8.1 Proof of Theorem 2.3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.8.2 Properties of VGC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.8.3 Bias Under Violations of Assumption 2.2.4 . . . . . . . . . . . . . . . . . 73
2.8.4 Proof of Theorem 2.3.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2.8.5 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
2.9 Appendix: Problems that are Weakly Coupled by Variables . . . . . . . . . . . . . 82
2.9.1 Convergence of In-Sample Optimism . . . . . . . . . . . . . . . . . . . . 82
2.9.2 Convergence of VGC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
2.10 Appendix: Problems that are Weakly Coupled by Constraints . . . . . . . . . . . . 89
2.10.1 Properties of the Dual Optimization Problem . . . . . . . . . . . . . . . . 89
2.10.2 Constructing the Good Set . . . . . . . . . . . . . . . . . . . . . . . . . . 92
2.10.3 Bounding the “Bad” Set . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
2.10.4 Properties of the Good Set . . . . . . . . . . . . . . . . . . . . . . . . . . 101
2.10.5 Pointwise Convergence Results . . . . . . . . . . . . . . . . . . . . . . . 115
2.10.6 Uniform Convergence of Dual Solutions . . . . . . . . . . . . . . . . . . . 121
2.10.7 Uniform Convergence of In-Sample Optimism . . . . . . . . . . . . . . . 123
2.10.8 Uniform Convergence of VGC . . . . . . . . . . . . . . . . . . . . . . . . 130
2.10.9 Proof of Theorem 2.4.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Chapter 3: Decision-Aware Denoising . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
3.1.1 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
3.1.2 Relationship to Prior Work . . . . . . . . . . . . . . . . . . . . . . . . . . 140
3.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
3.2.1 Policy Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
3.2.2 A Decision-Aware Oracle Benchmark . . . . . . . . . . . . . . . . . . . . 145
3.2.3 Examples of Denoising Plug-Ins . . . . . . . . . . . . . . . . . . . . . . . 146
3.3 One-Shot Variance Gradient Correction for Policy Evaluation . . . . . . . . . . . . 148
3.3.1 A Heuristic Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
3.3.2 Properties of the One-Shot VGC . . . . . . . . . . . . . . . . . . . . . . . 150
3.3.3 Comparison to Multi-Shot VGC . . . . . . . . . . . . . . . . . . . . . . . 151
3.4 Bounding the Estimation Error by Solution Stability . . . . . . . . . . . . . . . . . 153
3.4.1 Is Dependence on Solution Stability Unavoidable? . . . . . . . . . . . . . 156
3.5 Bounds for Regularized Plug-in Policies in Convex Problems . . . . . . . . . . . . 157
3.6 Bounds for Unregularized Plug-in Policies in Weakly-Coupled Problems . . . . . . 158
3.6.1 Block Decoupled Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 161
3.6.2 A Modified VGC for Weakly-Coupled Problems . . . . . . . . . . . . . . 163
3.6.3 Estimation Error of Weakly-Coupled One-Shot VGC . . . . . . . . . . . . 166
3.7 Case Study: Prioritizing Speed Hump Requests . . . . . . . . . . . . . . . . . . . 168
3.7.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
3.9 Appendix: Background Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
3.10 Appendix: Section 3.3 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
3.10.1 Auxiliary Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
3.10.2 Proof for Theorem 3.3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
3.10.3 Proof for Theorem 3.3.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
3.11 Appendix: Section 3.4 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
3.11.1 Proof for Lemma 3.4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
3.11.2 Proof for Proposition 3.4.3. . . . . . . . . . . . . . . . . . . . . . . . . . 180
3.12 Appendix: Regularized Policy Proofs . . . . . . . . . . . . . . . . . . . . . . . . 181
3.12.1 Proof for Lemma 3.5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
3.12.2 Proof for Theorem 3.5.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
3.12.2.1 Proofs of Corollaries . . . . . . . . . . . . . . . . . . . . . . . . 182
3.13 Appendix: Proofs and Notes for Block Decoupled Problems . . . . . . . . . . . . 183
3.13.1 What are typical values of Tmax and Xmax? . . . . . . . . . . . . . . . . . 183
3.13.2 Proof for Bounding Solution Stability, Lemma 3.6.5 . . . . . . . . . . . . 185
3.13.3 Bound on Eq. (3.22a) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
3.13.4 Bound on Eq. (3.22b) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
3.13.5 Proof for Block Decoupled Estimation Error, Theorem 3.6.6 . . . . . . . . 197
3.14 Appendix: Weakly-Coupled Policy Proofs . . . . . . . . . . . . . . . . . . . . . . 198
3.15 Appendix: Details for Numerics and Examples . . . . . . . . . . . . . . . . . . . 205
3.15.1 Details for Fig. 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
3.15.2 Implementation Details for Numerics . . . . . . . . . . . . . . . . . . . . 205
Chapter 4: Learning Best-in-Class Policies in the Predict-then-Optimize Framework . 207
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
4.1.1 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
4.1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
4.2 Differentiable Surrogate Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
4.2.1 Properties of Differentiable Surrogate Loss . . . . . . . . . . . . . . . . . 213
4.3 Generalization Results for Surrogate Loss . . . . . . . . . . . . . . . . . . . . . . 214
4.3.1 Generalization Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
4.4 Numerics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
4.4.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
4.6 Appendix: Proofs for Section 4.2.1 Results . . . . . . . . . . . . . . . . . . . . . 222
4.7 Appendix: Proofs for Section 4.3 Results . . . . . . . . . . . . . . . . . . . . . . 224
Chapter 5: Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 231
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
List of Tables
2.1 Expected Estimates of Out-of-Sample Performance by Policy for Problem (2.2).
In the left table, we take n = 100, S = 3 and µj = 1 if j ≤ 14 and µj = −1 otherwise. We estimate the expected out-of-sample perf. across 1,000,000 simulations.
Std. errors are less than 0.005. In the right graph, we plot the bias of the estimates with respect to the expected out-of-sample performance as we increase the
signal-to-noise ratio. The In-Sample error not shown at 0.75 SNR exceeds 500%. . 15
List of Figures
2.1 When X is polyhedral, x_j(Z + λξ_j e_j) must occur at a vertex if it is unique. Hence, small perturbations to λ do not change the solution (see figure), and the derivative of V(Z + λξ_j e_j) is entirely determined by the derivative of r(Z + λξ_j e_j). Similar
intuition holds for non-polyhedral X . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2 Approximate Strong Convexity of L(λ). Figure (a) shows a strongly convex function f(λ) and visualizes the strong convexity condition Eq. (2.25). Figure (b) shows that because L(λ) is piecewise linear, it does not satisfy Eq. (2.25) for points on the same line segment (λ_1 and λ_2′). However, when λ_1, λ_2 are sufficiently far apart, they are on different line segments and Eq. (2.25) is satisfied. . . . . . . 46
2.3 Bias and variance as K → ∞. The two graphs plot the bias and variance of the
different out-of-sample performance estimators for the Sample Average Approximation (SAA) policy. The bias and variance were estimated across 100 simulations for each K. Although variance vanishes for all methods as K increases,
cross-validation exhibits a non-vanishing bias and is uniformly worse for all K. . . 51
2.4 Estimating Performance across Policy Class. The first graph shows the estimates of out-of-sample performance across the policy class for the parameter
τ ∈ [25νmin,100νmin] and µ0 = 1000 for one sample path when K = 400. The
second graph is similar, but for K = 3200. Both plots highlight the smoothness of
VGC relative to the Stein-Correction. . . . . . . . . . . . . . . . . . . . . . . . . 51
2.5 Comparing Policy Decisions. Left (resp. right) panel shows routing decisions for
the policy selected by VGC (resp. cross-validation). Color indicates time-saved
relative to an ambulance-only policy (green is good, red is bad) computed relative to the ground truth. Although routing is largely similar, Regions (a) and (b)
highlight some differences where the cross-validation policy makes poorer routing
decisions (more orange dots). The larger black points are drone depots. . . . . . . 52
2.6 Estimating Performance across Policy Class. Each data point in the graph represents the performance metric of each selected policy for a ring shaped region corresponding to distance in time from a drone depot. Graph (a) shows the fraction
of patients served in each region for the patients serviced by the Southern depot in
Fig. 2.5. Graph (b) plots the time saved by each policy. The plots highlight the
performance difference in routing decisions between the two policies. . . . . . . . 53
2.7 Varying Cross-Validation Folds. We plot the policy evaluation and learning performance of cross-validation with different folds across 500 simulations. In each
simulation there are 100 samples of Z_i. In (a), we plot the mean squared error of
cross-validation for three different policies and compare them with the respective
VGC estimates represented by the dotted lines. In (b), we plot the percent improvement VGC has over cross-validation, so larger bars indicate lower cross-validation
performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.1 Numerical Comparison of Multi-Shot vs. One-Shot. The graphs show the mean
squared error (MSE) of the one-shot VGC (Eq. (3.11)) is better than the MSE of the
multi-shot VGC (Eq. (3.10)) for the plug-in policy with T(Z, θ̄) = Z. The values are computed over 100 trials and use the same δ_h. Plot (a) varies the step size
parameter h for n = 320 and plot (b) varies the number of decisions n for h = .5.
See Section 3.15.1 for details of the optimization problem. . . . . . . . . . . . . . 152
3.2 Performance Results. We compare the performance of our method to a MSE
benchmark over 50 trials. We show how the performance changes as we increase
the amount of data. While the VGC approach achieves best-in-class performance as
the amount of data increases, the MSE approach does not. . . . . . . . . . . . . . 170
3.3 Map of Census Tracts Selected. We plot the census tracts selected by the MSE
policy and the one-shot VGC policy for the Manhattan and Brooklyn boroughs
of New York City. The contour lines plot the density of crashes with pedestrian
injuries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
3.4 Amount of Smoothing by Method. The figure compares the values of θ chosen by each method. Across realizations, θ^MSE tends to oversmooth relative to θ^OR, particularly as the amount of historical data grows. By contrast, the one-shot VGC
performs less smoothing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
4.1 Choice of h. The figure plots the ERM loss and DSL loss for a single simulation
where n = 40. The figure highlights how the DSL approach constructs a differentiable surrogate by effectively smoothing the ERM loss. As h → 0, we see the
ERM loss is recovered. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
4.2 Regret of Learned Policies. The figure plots the regret of the policies learned using the different loss functions as the number of samples increases over 100 trials. We see that SPO+ does not learn the best-in-class policy even as the number of samples increases. Additionally, we see DSL seems to converge to optimal regret at a
faster rate than even ERM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
4.3 Why SPO+ Fails. The figure plots a single simulation of the different losses for n = 80 in the misspecified setting. The vertical lines plot the choices of β_10 that minimize the respective losses. The plot shows that SPO+ does not directly estimate ERM, and thus its minimum can be drastically different from the minimum of ERM. Thus, in misspecified settings SPO+ may learn a suboptimal policy. . 221
4.4 Choice of h on Regret. The figure plots the regret of the policy learned by DSL
for different values of h across 100 simulations. We plot the results for n = 20 and
n = 80 to highlight different noise settings. The figure highlights a bias-variance
trade-off for the choice of h. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
Abstract
Modern decision-making has rapidly evolved to integrate data into the decision-making pipeline
via machine learning. While many machine learning algorithms have become more specialized
to better leverage the intricacies of specific “big data” datasets to make better predictions, there
are fewer algorithms that target general decision-making problems. Instead, most practical data-driven approaches use machine learning as an auxiliary step to predict unknown parameters used in decision-making. Unfortunately, a vast majority of decision-making problems often have "small-data," that is, noisy or sparse data. As a result, the machine learning predictions often have high uncertainty, which leads to poor decisions.
To address these challenges, this dissertation proposes new methods and algorithms that combine prediction and decision-making into a single learning procedure to improve data-driven decision-making. We broadly characterize these approaches as decision-aware learning as they integrate
downstream information of the decision tasks into learning predictions. The decision information
helps machine learning algorithms learn predictions that lead to better decisions, particularly in the
aforementioned small-data settings.
The first part of the dissertation focuses on a fundamental issue in decision-aware learning,
that is, evaluating the performance of policies. The chapter highlights that in "small-data, large-scale" settings, the gold-standard approach of cross-validation produces poor estimates of policy performance due to sample splitting, which requires sacrificing already limited data in the training of decision-making policies. As a result, we propose a novel debiasing approach known as the
Variance Gradient Correction (VGC) that does not require any data to be sacrificed. The second
part of the dissertation builds upon the VGC and proposes the one-shot VGC that generalizes its
use to evaluate policies that plug in predictions from any machine learning algorithm. In particular, we use the one-shot VGC to expand decision-aware learning to the broad field of denoising algorithms. Finally, the last chapter focuses on developing theoretically sound yet computationally tractable decision-aware learning algorithms. The chapter considers an alternative use of the insights of the VGC developed in the previous chapters. Instead of using the VGC to evaluate policies, we utilize it to develop a novel surrogate loss for decision-aware learning in the contextual linear optimization setting. Unlike existing surrogate losses, our approach has broader theoretical
guarantees that hold in the small-data setting where machine learning models are fundamentally
misspecified.
Decision-aware learning provides a general framework for integrating the power of modern
machine learning models into decision-making. This dissertation provides a starting point for
understanding the true benefits of decision-aware learning.
Chapter 1
Introduction
Data-driven decision-making is at its core an exploration of two key questions: i) How do we use
data to predict the future? ii) How do we make good decisions under uncertainty? Traditionally,
practitioners approach these questions separately. In scenarios abundant with data, this approach
often yields effective decisions. However, when the available relevant data for each decision is
limited, relying solely on such methods can result in sub-optimal decisions. To address small-data
settings, this dissertation proposes decision-aware approaches that treat the two questions in tandem by combining prediction and optimization into a single task.
Throughout the chapters, we primarily focus on downstream decision-making tasks that take
the form of the linear optimization problem
min_{x ∈ X} µ⊤x

where X is an arbitrary decision space. This problem generalizes linear programming problems, combinatorial optimization problems, mixed-integer programming problems, etc. Due to the generality of X, it can also capture non-linear objectives, as we show in Example 2.2.3 in Chapter 2.

The objective in each chapter is also the same, that is, we observe unbiased estimates Z of µ and features W and would like to learn the best-in-class plug-in policy. Plug-in policies have the form

x_f(Z,W) ∈ argmin_{x ∈ X} f(Z,W)⊤x.

The best-in-class plug-in policy corresponds to the plug-in that solves

f^OR ∈ argmin_{f ∈ F} µ⊤x_f(Z,W),

where F is the class of plug-ins. In the first two chapters, µ = E[Z] and W are fixed. In the last chapter, µ = E[Z|W] = f*(W) is the expected objective cost conditioned on the context W, and W is random.
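To make these objects concrete, the following minimal Python sketch is purely illustrative: the feasible region X = {0,1}^n, the noise level, and the small shrinkage-style plug-in class below are hypothetical choices, not taken from the dissertation. It builds plug-in policies x_f(Z) and identifies the oracle best-in-class member of the class by evaluating the true objective µ⊤x_f(Z). Computing that objective requires knowing µ, and estimating it from data alone is precisely the problem studied in the later chapters.

import numpy as np

rng = np.random.default_rng(0)
n = 200
mu = rng.choice([-1.0, 1.0], size=n)        # unknown true costs (hypothetical)
Z = mu + rng.normal(scale=1.5, size=n)      # noisy, unbiased observation of mu

def x_f(costs):
    # Plug-in policy for X = {0,1}^n: argmin_x costs^T x sets x_j = 1{costs_j < 0}.
    return (costs < 0.0).astype(float)

# A small, hypothetical plug-in class F: shrink Z toward its grand mean.
# alpha = 1 recovers the plug-in f(Z) = Z (the SAA-style choice).
def f(Z, alpha):
    return alpha * Z + (1.0 - alpha) * Z.mean()

alphas = [0.25, 0.5, 0.75, 1.0]
perf = {alpha: mu @ x_f(f(Z, alpha)) for alpha in alphas}   # true out-of-sample cost mu^T x_f(Z)
best_alpha = min(perf, key=perf.get)                        # oracle best-in-class member of F
print("out-of-sample cost by alpha:", perf)
print("oracle best-in-class alpha :", best_alpha)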
The notation is not necessarily consistent from chapter to chapter and is updated throughout as
needed.
1.1 Background
Below we provide a brief summary of the background information related to the different settings
of the chapters.
1.1.1 Small-Data, Large-Scale Regime
The small-data, large-scale regime describes decision-making settings where the number of decisions is large, but the amount of data per decision is small. Thus, while the datasets for these
settings may be large, the signal-to-noise ratio of the data is small. The challenge of these settings is developing policies that perform well given the small-data setting and evaluating policies
to select the best one. For example, Gupta and P. Rusmevichientong (2021a) propose an Empirical Bayes-inspired policy class for linear optimization and propose a method to identify the best-in-class policy using Stein's Lemma. Another key idea of small-data, large-scale settings is showing
that pooling data across different decisions can improve policy performance. For example, Gupta
and Kallus (2022) propose a data-pooling algorithm for stochastic optimization that shows pooling
data from unrelated tasks can improve decision-making. The key results of both papers study how the performance of algorithms improves as the number of decisions grows. In this dissertation, Chapters 2 and 3 both consider this setting and show how the performance of their respective approaches
improves as the number of decisions increases.
1.1.2 Decision-Aware Learning
Decision-Aware Learning broadly generalizes end-to-end learning approaches for the predict-then-optimize framework. In the framework, there is an upstream prediction task that generates predictions and parameters for the downstream optimization task. Decision-aware approaches such as
Grigas, Qi, et al. (2021), Adam N Elmachtoub and Grigas (2022), and Wilder, Dilkina, and Tambe
(2019) train the upstream prediction task using the empirical objective performance rather than
the prediction accuracy of the unknown problem parameters.
In particular, chapters 2 and 3 are closely related to decision-aware work in contextual linear
optimization (Hu, Kallus, and Mao, 2022; Adam N Elmachtoub and Grigas, 2022) which considers
policies that map contexts to decisions. The chapters differ as they primarily focus on developing
decision-aware methods for the small-data, large-scale setting which consider policies that depend
explicitly on observed costs rather than contexts.
Chapter 4, on the other hand, studies the computational issues of decision-aware approaches for
contextual linear optimization. Existing works primarily propose approaches to estimate gradients
of the decision loss (Wilder, Dilkina, and Tambe, 2019; Vlastelica et al., 2019), which has zero
gradients almost everywhere. Adam N Elmachtoub and Grigas (2022) construct a convex surrogate
loss with attractive theoretical properties.
1.2 Chapter Overview
We now provide a brief overview of each chapter of the dissertation.
1.2.1 Chapter 2: Debiasing In-Sample Policy Performance for Small-Data,
Large-Scale Optimization
In response to the limitations observed in the performance of cross-validation when data is limited,
this chapter introduces a novel approach called the Variance Gradient Correction (VGC) to estimate the out-of-sample performance of what we call separable affine plug-in policies. Our method
leverages the sensitivity analysis of the optimization problem to estimate the gradient of the policy
objective value with respect to the data’s noise level. This estimated gradient is then utilized to
debias the in-sample performance of the policy. Notably, unlike cross-validation techniques, our
approach does not sacrifice data for a separate test set, utilizing all available data during training. This makes our method particularly well-suited for scenarios with scarce data. The chapter
provides bounds on the bias and variance of the VGC as well as high-probability tail bounds on
its estimation error of the out-of-sample performance. Finally, we conclude with a case study
on dispatching drones for emergency medical services. We highlight how the VGC outperforms
cross-validation in learning decision-making policies, especially for data-scarce regions.
1.2.2 Chapter 3: Decision-Aware Denoising
Leveraging the Variance Gradient Correction methodology from chapter 2, we propose a new
problem class that selects and trains traditional denoising methods in a decision-aware fashion.
These denoising methods produce plug-in estimates for the plug-in policies studied in chapter 2.
Unfortunately, the Variance Gradient Correction in Chapter 2 only applies to plug-in policies with separable affine plug-ins, which exclude most denoising methods. Separable affine plug-ins only leverage data local to a single decision. In contrast, denoising methods allow the plug-ins to leverage data for decisions with similar covariates. We address this limitation with the one-shot Variance Gradient Correction. This allows us to capture a significantly richer class of plug-ins
which include many modern machine learning methods and denoising techniques. We show the
relative error of the one-shot VGC as an estimate of the out-of-sample policy performance can be bounded by a notion of solution stability. We show that the relative size of the solution stability term
is vanishing as the dimension of the problem grows for plug-in and regularized plug-in policies
under different settings.
1.2.3 Chapter 4: Learning Best-in-Class Policies in the Predict-then-Optimize
Framework
The last chapter leverages insights of the one-shot VGC to construct a surrogate loss of the decision
loss for contextual linear optimization. We then utilize results from our small-data, large-scale setting to prove generalization bounds on the surrogate loss in large-sample settings. In particular, our
results hold even in misspecified settings, that is, when the plug-in class does not contain the plug-in function that maps contexts to the true expected objective costs. Additionally, our surrogate loss has many attractive computational properties such as non-zero gradients and Lipschitz smoothness.
We provide numerical experiments to highlight the benefits of these properties.
Chapter 2
Debiasing In-Sample Policy Performance for Small-Data,
Large-Scale Optimization
The work below inspired the paper “Debiasing In-Sample Policy Performance for Small-Data,
Large-Scale Optimization” (Gupta, Huang, and P. Rusmevichientong, 2022) which appeared in
Operations Research.
Motivated by the poor performance of cross-validation in settings where data are scarce, we
propose a novel estimator of the out-of-sample performance of a policy in data-driven optimization.
Our approach exploits the optimization problem’s sensitivity analysis to estimate the gradient of
the optimal objective value with respect to the amount of noise in the data and uses the estimated
gradient to debias the policy’s in-sample performance. Unlike cross-validation techniques, our
approach avoids sacrificing data for a test set, utilizes all data when training, and, hence, is well-suited to settings where data are scarce. We prove bounds on the bias and variance of our estimator
for optimization problems with uncertain linear objectives but known, potentially non-convex,
feasible regions. For more specialized optimization problems where the feasible region is "weakly-coupled" in a certain sense, we prove stronger results. Specifically, we provide explicit high-probability bounds on the error of our estimator that hold uniformly over a policy class and depend
on the problem’s dimension and policy class’s complexity. Our bounds show that under mild
conditions, the error of our estimator vanishes as the dimension of the optimization problem grows,
even if the amount of available data remains small and constant. Said differently, we prove our
estimator performs well in the small-data, large-scale regime. Finally, we numerically compare
our proposed method to state-of-the-art approaches through a case-study on dispatching emergency
medical response services using real data. Our method provides more accurate estimates of out-of-sample performance and learns better-performing policies.
2.1 Introduction
The crux of data-driven decision-making is using past data to identify decisions that will have good
out-of-sample performance on future, unseen data. Indeed, estimating out-of-sample performance
is key to both policy evaluation (assessing the quality of a given policy), and to policy learning
(identifying the best policy from a potentially large set of candidates). Estimating out-of-sample
performance, however, is non-trivial. Naive estimates that leverage the same data to train a policy
and to evaluate its performance often suffer a systematic, optimistic bias, referred to as “in-sample
bias” in machine learning and the “optimizer’s curse” in optimization (Smith and Winkler, 2006).
Consequently, cross-validation and sample-splitting techniques have emerged as the gold-standard
approach to estimating out-of-sample performance. Despite the multitude of cross-validation
methods, at a high-level, these methods all proceed by setting aside a portion of the data as “testing” data not to be used when training the policy, and then evaluating the policy on these testing
data. The policy’s performance on testing data then serves as an estimate of its performance on
future, unseen data, thereby circumventing the aforementioned in-sample bias. Cross-validation
is ubiquitous in machine learning and statistics with provably good performance in large sample
settings (Bousquet and Elisseeff, 2001; Kearns and Ron, 1999).
Unfortunately, when data are scarce, cross-validation can perform poorly. Gupta and P. Rusmevichientong, 2021a prove that for the small-data, large-scale regime — when the number of
uncertain parameters in an optimization problem is large but the amount of relevant data per parameter is small — each of hold-out, 5-fold, 10-fold and leave-one-out cross validation can have
poor performance when used for policy learning, even for very simple optimization problems. J.
Shao, 1993 observes a similar failure for leave-one-out cross-validation in a high-dimensional linear regression setting. The key issue in both cases is that when relevant data are scarce, estimates
of uncertain parameters are necessarily imprecise, and omitting even a small amount of data when
training a policy dramatically degrades its performance. Hence, the performance of a policy trained
with a portion of the data on the test set is not indicative of the performance of the policy trained
with all the data on future unseen data. We elucidate this phenomenon with a stylized example in
Section 2.1.2 below.
Worse, this phenomenon is not merely an intellectual curiosity. Optimization problems plagued
by numerous low-precision estimates are quite common in modern, large-scale operations. For
example, optimization models for personalized pricing necessarily include parameters for each
distinct customer type, and these parameters can be estimated only imprecisely since relevant data
for each type are limited. Similar issues appear in large-scale supply-chain design, promotion
optimization, and dispatching emergency response services; see Section 2.2 for further discussion.
In this paper, we propose a new method for estimating out-of-sample performance without
sacrificing data for a test set. The key idea is to debias the in-sample performance of the policy
trained on all the data. Specifically, we focus on the optimization problem
x* ∈ argmin_{x ∈ X ⊆ [0,1]^n} µ⊤x    (2.1)

where X is a known, potentially non-convex feasible region contained within [0,1]^n, and µ ∈ R^n is an unknown vector of parameters. We assume access to a vector Z of noisy, unbiased predictions of µ (based on historical data) and are interested in constructing a policy x(Z) with good out-of-sample performance µ⊤x(Z). (For clarity, the in-sample performance of x(Z) is Z⊤x(Z).) Note that for many applications of interest, µ⊤x* = O(n) as n → ∞; i.e., the full-information solution grows at least linearly as the dimension grows. Hence, the unknown out-of-sample performance µ⊤x(Z) must also be at least O_p(n) as n → ∞.¹ See Section 2.2 for examples.

¹ Following Van der Vaart, 2000, we say a sequence of random variables X_n = O_p(a_n) if the sequence X_n/a_n is stochastically bounded, i.e., for every ε > 0, there exists finite M > 0 and finite N > 0 such that P{X_n/a_n ≥ M} < ε, for all n > N.
Despite its simplicity, Problem (2.1) subsumes a wide class of optimization problems because
X can be non-convex and/or discrete. This class includes mixed-binary linear optimization problems such as facility location, network design, and promotion maximization. By transforming
decision variables, even some non-linear optimization problems such as personalized pricing can
be rewritten as Problem (2.1); see Section 2.2. In this sense, Problem (2.1) is fairly general.
Our estimator applies to classes of affine plug-in policies which are formally defined in Section 2.2. Loosely, affine plug-in policies are those obtained by solving Problem (2.1) after "plugging in" some estimator r_j(Z_j) in place of µ_j, where r_j(Z_j) depends affinely on Z_j. Many policies used in practice and previously studied in the literature can be viewed as elements of an affine plug-in policy class, including Sample Average Approximation (SAA), estimate-then-optimize policies
based on regression, the Bayes-Inspired policies of Gupta and P. Rusmevichientong, 2021a, and
the SPO+ policy of Adam N Elmachtoub and Grigas, 2021. Thus, our estimator provides a theoretically rigorous approach to assessing the quality of optimization policies based on many modern
machine learning techniques.
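As a schematic only (the formal definition, which also lets the coefficients incorporate covariate information, appears in Section 2.2), an affine plug-in policy can be written as
\[
  x_r(Z) \;\in\; \operatorname*{argmin}_{x \in \mathcal{X}} \; r(Z)^\top x,
  \qquad\text{with}\qquad
  r_j(Z_j) \;=\; a_j + b_j Z_j ,
\]
so that each plugged-in estimate depends affinely on the corresponding component of Z; taking a_j = 0 and b_j = 1 for all j recovers the SAA policy.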
We debias Z⊤x(Z) by exploiting the structure of Problem (2.1) with the plug-in r(Z). Specifically, by leveraging this problem's sensitivity analysis, we approximately compute the gradient
of its objective value with respect to the variance of Z, and use the estimated gradient to debias
the in-sample performance. We term this correction the Variance Gradient Correction (VGC). Because our method strongly exploits optimization structure, the VGC is Lipschitz continuous in the
plug-in values r(Z). This continuity is not enjoyed by other techniques such as those in Gupta and
P. Rusmevichientong, 2021a.
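The sensitivity-analysis ingredient can be illustrated with a small numerical check of Danskin's theorem; this is a generic illustration over a hypothetical feasible set, not the VGC itself, which is defined in Section 2.3. For a linear objective, the derivative of the optimal value with respect to a single cost coefficient equals the corresponding component of the optimal solution whenever that solution is unique:

import numpy as np

rng = np.random.default_rng(1)
n, k = 20, 5                              # problem size and budget (hypothetical)
Z = rng.normal(size=n)                    # plugged-in costs

def value_and_solution(c, k):
    # Solve min_{x in X} c^T x for X = {x in {0,1}^n : sum_j x_j <= k}:
    # select at most k coordinates, and only those with negative cost.
    x = np.zeros_like(c)
    idx = np.argsort(c)[:k]
    x[idx[c[idx] < 0.0]] = 1.0
    return c @ x, x

j, eps = 3, 1e-6
v0, x0 = value_and_solution(Z, k)
v_eps, _ = value_and_solution(Z + eps * np.eye(n)[j], k)

# Danskin's theorem: d/dλ [ min_x (Z + λ e_j)^T x ] = x_j(Z) at λ = 0 when the minimizer is unique.
print("finite-difference derivative:", (v_eps - v0) / eps)
print("j-th component of solution  :", x0[j])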
Although the VGC's continuity may seem like a mere mathematical nicety, empirical evidence
suggests it improves empirical performance. Similar empirical phenomena – where an estimator
that varies smoothly in the data often outperforms similar estimators that change discontinuously –
are rife in machine learning. Compare k-nn regression with Gaussian kernel smoothing (Friedman,
T. Hastie, and R. Tibshirani, 2001), CART trees with bagged trees (Breiman, 1996), or best-subset regression with lasso regression (Trevor Hastie, Robert Tibshirani, and Ryan Tibshirani, 2020a).
Theoretically, we exploit this smoothness heavily to establish bounds that hold uniformly over the
policy class.
Specifically, we show that, when Z is approximately Gaussian, the bias of our estimator for
out-of-sample performance is O˜(h) as h → 0, where h is a user-defined parameter that controls
the accuracy of our gradient estimate (Theorem 2.3.2). Characterizing the variance is more delicate. We introduce the concept of Average Solution Instability, and prove that if the instability
of the policy vanishes at rate O(n^{−α}) for α ≥ 0, then the variance of our estimator is roughly O(n^{3−α}/h). Collectively, these results suggest interpreting h as a parameter controlling the bias-variance tradeoff of our estimator. Moreover, when α > 1, the variance of our estimator is o(n²). Since, as mentioned, the unknown out-of-sample performance often grows at least linearly in n, i.e., µ⊤x(Z) = O_p(n), our variance bound shows that when α > 1 and n is large, the stochastic
fluctuations of our estimator are negligible relative to the out-of-sample performance. In other
words, our estimator is quite accurate in these settings.
Our notion of Average Solution Instability is formally defined in Section 2.3.3. Loosely, it
measures the expected change in the j-th component of the policy after replacing the k-th data point with an i.i.d. copy, where j and k are chosen uniformly at random from {1,...,n}. This notion
of stability is similar to hypothesis stability (Bousquet and Elisseeff, 2001), but, to the best of our
knowledge, is distinct. Moreover, insofar as we expect a small perturbation of the data to be unlikely to cause a large change in the solution for most real-world, large-scale optimization problems, we expect Average Solution Instability to be small and our estimator to have low variance.
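Schematically, and rendering only the loose description above rather than the precise definition of Section 2.3.3, the quantity behaves like
\[
  \text{Instability}
  \;\approx\;
  \mathbb{E}_{j,k,\,Z,\,Z_k'}
  \Big[\, \big|\, x_j(Z) - x_j\big(Z^{(k)}\big) \,\big| \,\Big],
  \qquad
  Z^{(k)} \;=\; (Z_1,\dots,Z_{k-1},\,Z_k',\,Z_{k+1},\dots,Z_n),
\]
where Z_k' is an i.i.d. copy of Z_k and the indices j and k are drawn uniformly at random from {1,...,n}.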
We then prove stronger high-probability tail bounds on the error of our estimator for two special classes of “weakly-coupled” instances of Problem (2.1): weakly-coupled-by-variables and
weakly-coupled-by-constraints. In Section 2.4.1, we consider problems that are weakly-coupled-by-variables, i.e., problems that decouple into many, disjoint subproblems once a small number of decision variables are fixed. In Section 2.4.2 we consider problems that are weakly-coupled-by-constraints, i.e., problems that decouple into many, disjoint subproblems once a small number
of constraints are removed. For each problem class, we go beyond bounding the variance to provide an explicit tail bound on the relative error of our estimator that holds uniformly over the
policy class. We show that for problems weakly-coupled-by-variables the relative error scales like Õ(C_PI polylog(1/ε)/n^{1/3}), where C_PI is a constant measuring the complexity of the policy class; see Theorem 2.4.3. Similarly, we show the relative error for problems weakly-coupled-by-constraints scales like Õ(C_PI polylog(1/ε)/n^{1/4}), where C_PI measures both the complexity of the policy class and the number of
constraints of the problem; see Theorem 2.4.7. Importantly, since these bounds hold uniformly,
our debiased in-sample performance can be used both for policy evaluation and policy learning,
even when data are scarce, so long as n (the dimension of the problem) is sufficiently large. Said
differently, our estimator of out-of-sample performance is particularly well-suited to small-data,
large-scale optimization.
Admittedly, weakly-coupled problems as described above do not cover all instances of Problem (2.1) and the appropriateness of modeling Z as approximately Gaussian is application specific.
Nonetheless, our results and their proofs highly suggest our estimator will have strong performance
whenever the underlying optimization problem is well-behaved enough for certain uniform laws
of large numbers to pertain.
Finally, to complement these theoretical results, we perform a numerical case study of dispatching emergency medical services with real data from cardiac arrest incidents in Ontario, Canada.
With respect to policy evaluation and learning, we show that our debiased in-sample performance
outperforms both traditional cross-validation methods and the Stein correction of Gupta and P.
Rusmevichientong, 2021a. In particular, while the bias of cross-validation is non-vanishing as the
problem size grows for a fixed amount of data, the bias of our VGC converges to zero. Similarly,
while both the Stein correction and our VGC have similar asymptotic performance, the smoothness
of VGC empirically leads to lower bias and variance for moderately sized instances.
2.1.1 Our Contributions
We summarize our contributions as follows:
1. We propose an estimator of out-of-sample performance for Problem (2.1) by debiasing
in-sample performance through a novel Variance Gradient Correction (VGC). Our VGC
applies to a general class of affine plug-in policies that subsumes many policies used in practice. Most importantly, unlike cross-validation, VGC does not sacrifice data when training,
and, hence, is particularly well-suited to settings where data are scarce.
2. We prove that under some assumptions on the data-generating process, for general instances
of Problem (2.1), the bias of our estimator is at most O˜(h) as h → 0, where h is a user-defined
parameter. For policy classes that satisfy a certain Average Solution Instability condition, we
also prove that its variance scales like o(n²/h) as n → ∞.
3. We prove stronger results for instances of Problem (2.1) in which the feasible region is only
weakly-coupled. When the feasible region is weakly-coupled by variables, we prove that,
with probability at least 1−ε, debiasing in-sample performance with our VGC recovers the
true out-of-sample performance up to relative error that is at most Õ(C_PI log(1/ε)/n^{1/3}) as n → ∞, uniformly over the policy class, where C_PI is a constant that measures the complexity of the plug-in policy class (Theorem 2.4.3). Similarly, for certain linear optimization problems that are weakly coupled by constraints, we prove that, with probability at least 1−ε, debiasing in-sample performance with VGC estimates the true out-of-sample performance uniformly over the policy class with relative error that is at most Õ(C_PI √(log(1/ε))/n^{1/4}), where C_PI is a constant measuring the complexity of the plug-in policy class and the number of constraints of the problem (Theorem 2.4.7). We stress that since both these bounds hold uniformly, our
debiased in-sample performance can not only be used for policy evaluation, but also policy
learning, even when data are scarce, so long as n (the size of the problem) is sufficiently
large.
4. Finally, we present a numerical case study based on real data from dispatching emergency
response services to highlight the strengths and weaknesses of our approach relative to cross-validation and the Stein correction of Gupta and P. Rusmevichientong, 2021a. Overall, we find that since our VGC exploits the optimization structure of Problem (2.1), it outperforms the benchmarks when the number of uncertain parameters is sufficiently large. Additionally, in settings where the signal-to-noise ratio is low, VGC more effectively balances the bias-variance trade-off than cross-validation, which can be quite sensitive to the number of folds
used.
2.1.2 A Motivating Example: Poor Performance of Cross-Validation with
Limited Data
Before proceeding, we present an example that highlights the shortcomings of cross-validation and
the benefits of our method when data are limited. Consider a special case of Problem (2.1)
max_{x ∈ {0,1}^n} ∑_{j=1}^n µ_j x_j    (2.2)

where the true parameters µ ∈ {−1,1}^n are unknown, but we observe S samples Y_1,...,Y_S where Y_i ∈ R^n and Y_i ∼ N(µ, 2I) for all i, and I is the identity matrix. A standard data-driven policy in this setting is Sample Average Approximation (SAA), also called empirical risk minimization, which prescribes the policy

x^SAA(Z) ∈ argmax_{x ∈ {0,1}^n} ∑_{j=1}^n Z_j x_j,   where Z_j = (1/S) ∑_{i=1}^S Y_{ij}.

The key question, of course, is "What is SAA's out-of-sample behavior µ⊤x^SAA(Z)?"
To study this question, the left panel of Table 2.1 shows different estimators for the out-of-sample performance of SAA, E[µ⊤x^SAA(Z)], when S = 3 in Problem (2.2). The right panel shows the expected relative error (with respect to the oracle) of these estimators as the number of samples S grows. To account for the noise level of the samples, we plot the estimation error with respect to the signal-to-noise ratio (SNR) of Z_j.² For reference, Trevor Hastie, Robert Tibshirani, and Ryan Tibshirani, 2020b argues that SNR greater than 1 is "rare" when working with "noisy, observational data," and an SNR of 0.25 is more "typical."

² Following Trevor Hastie, Robert Tibshirani, and Ryan Tibshirani, 2020b, we define SNR = Var(µ_π)/Var(Z_π) = (S/2n) ∑_{j=1}^n (µ_j − (1/n) ∑_{i=1}^n µ_i)², where π is an index drawn uniformly at random from 1 to n.
The first row of Table 2.1 presents the in-sample performance, i.e., the objective of the SAA problem. As expected, we see in-sample performance significantly over-estimates the out-of-sample performance. The right panel of Table 2.1 suggests this effect persists across SNRs and the
relative error is at least 23% for SNRs less than 2.
The second row of the left panel of Table 2.1 shows the leave-one-out cross-validation error, which aims to correct the over-optimistic bias and computes (1/S) ∑_{i=1}^S Y_i⊤ x^SAA(Z^{−i}), where Z^{−i} = (1/(S−1)) ∑_{j≠i} Y_j.
. Cross-validation is also fairly inaccurate, suggesting SAA performs worse
than the trivial, non-data-driven policy x = 0, which has an out-of-sample performance of 0. In the
right panel, this incorrect implication occurs for SNRs less than about 0.875.
Why does cross-validation perform so poorly? By construction cross-validation omits some data in training, and hence, does not estimate the out-of-sample performance of x^SAA(Z), but rather, that of x^SAA(Z^{−1}).³ From the second column of Table 2.1, we see the cross-validation estimate does nearly match the true (oracle) performance of x^SAA(Z^{−1}). When data are scarce, sacrificing even a small amount of data in training can dramatically degrade a policy. As seen in the right panel of Table 2.1, this phenomenon is non-negligible (at least 10% relative error) for signal-to-noise ratios less than or equal to 1.75. Thus, the performance of x^SAA(Z^{−1}) may not always be a good proxy of the performance of x^SAA(Z).
How then might we resolve the issue? The third row of the left panel of Table 2.1 presents our estimator based on debiasing the in-sample performance of x^SAA(Z) with our VGC. Our estimate is essentially unbiased (see also Theorem 2.3.2 below). The right panel of Table 2.1 confirms this excellent behavior across a range of signal-to-noise ratios. Finally, although this example focuses on the bias of our estimator, our results in Section 2.4 are stronger and bound the (random) error of our estimator directly, rather than its expectation.

³ For clarity, this is the same as the performance of x^SAA(Z^{−2}) because the data are i.i.d.
             x^SAA(Z)    x^SAA(Z^{−i})
In-Sample      18.36        22.33
Cross-Val      −1.86        −9.98
Our Method      2.95        −1.89
Oracle          2.97        −1.87

[Right panel: estimation error (% of oracle) for In-Sample, Our Method, and Cross-Val plotted against the signal-to-noise ratio, 0.75–2.00.]

Table 2.1: Expected Estimates of Out-of-Sample Performance by Policy for Problem (2.2). In the left table, we take n = 100, S = 3 and µ_j = 1 if j ≤ 14 and µ_j = −1 otherwise. We estimate the expected out-of-sample perf. across 1,000,000 simulations. Std. errors are less than 0.005. In the right graph, we plot the bias of the estimates with respect to the expected out-of-sample performance as we increase the signal-to-noise ratio. The In-Sample error not shown at 0.75 SNR exceeds 500%.
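The phenomenon behind Table 2.1 is straightforward to reproduce. The Python sketch below re-estimates the In-Sample, Cross-Val, and Oracle rows for x^SAA(Z) under the table's setup (n = 100, S = 3, Y_i ∼ N(µ, 2I)); our VGC estimator, defined in Section 2.3, is omitted here, and with only a few thousand simulations rather than 1,000,000 the estimates are correspondingly noisier.

import numpy as np

rng = np.random.default_rng(0)
n, S, n_sims = 100, 3, 2000
mu = np.where(np.arange(1, n + 1) <= 14, 1.0, -1.0)   # mu_j = 1 if j <= 14, else -1

def saa(z):
    # argmax_{x in {0,1}^n} z^T x  =>  x_j = 1{z_j > 0}
    return (z > 0.0).astype(float)

in_sample, cross_val, oracle = [], [], []
for _ in range(n_sims):
    Y = mu + rng.normal(scale=np.sqrt(2.0), size=(S, n))   # Y_i ~ N(mu, 2 I)
    Z = Y.mean(axis=0)
    x = saa(Z)
    in_sample.append(Z @ x)
    oracle.append(mu @ x)
    # Leave-one-out cross-validation: train on S-1 samples, test on the held-out sample.
    cv = sum(Y[i] @ saa(np.delete(Y, i, axis=0).mean(axis=0)) for i in range(S)) / S
    cross_val.append(cv)

print("In-Sample:", round(np.mean(in_sample), 2))   # roughly 18, optimistic
print("Cross-Val:", round(np.mean(cross_val), 2))   # negative, overly pessimistic
print("Oracle   :", round(np.mean(oracle), 2))      # roughly 3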
2.1.3 Relationship to Prior Work
Cross-validation is the gold-standard for estimating out-of-sample performance in the large-sample
regime with i.i.d. data; see Bousquet and Elisseeff, 2001; Kearns and Ron, 1999 for some fundamental results. As discussed above, when estimating the performance of a fixed-policy, these
approaches entail sacrificing some data in the training step to set aside for validation, and, hence,
may be ill-suited to data-scarce settings. Similar issues arise in a variety of other sample-splitting
methods, including "honest-trees" Wager and Athey, 2018 and most forms of doubly-robust estimation Dudík, Langford, and L. Li, 2011. By contrast, our VGC-based approach to debiasing
the in-sample performance effectively uses all the data when training, making it somewhat better
suited to data-scarce settings and small-data, large-scale optimization.
Our work also contributes to the growing literature on “optimization-aware” estimation. These
works employ a variety of names including operational statistics (Liyanage and Shanthikumar,
2005), learning-enabled optimization (Deng and Sen, 2018), decision-focused learning (Wilder,
Dilkina, and Tambe, 2019), end-to-end learning (Wilder, Ewing, et al., 2019) and task-based learning (Donti, Amos, and Kolter, 2017). Fundamentally, this area of research seeks estimators that
optimize the out-of-sample performance of a policy in a downstream optimization problem rather
than the prediction error of the estimate. Closest to our work is the “Smart ‘Predict then Optimize’” framework studied in Adam N Elmachtoub and Grigas, 2021 and A. Elmachtoub, Liang,
and McNellis, 2020. These works also study Problem (2.1), but in a slightly different data setting,
and propose policy selection methods for affine and tree-based policies, respectively. Also related
is Ito, Yabe, and Fujimaki, 2018a which develops an unbiased estimate of the sample average approximation (SAA) policy for Problem (2.1), but does not consider higher level moments, policy
evaluation for other policies, or policy learning.
Recently, El Balghiti, Adam N Elmachtoub, et al., 2019; Hu, Kallus, and Mao, 2022 have
sought to establish generalization guarantees for data-driven plug-in policies for Problem (2.1), i.e., bounds on the difference between in-sample performance and out-of-sample performance that hold uniformly over the policy class. Both works prove generalization guarantees
that vanish in the large-sample regime (when the amount of data grows large). We similarly bound
the difference between our debiased in-sample performance and out-of-sample performance, uniformly over a policy class. However, unlike the previous works, our bounds are specifically constructed to vanish relative to the unknown out-of-sample performance in the small-data, large-scale
regime. When applied to the types of policies studied in El Balghiti, Adam N Elmachtoub, et al.,
2019; Hu, Kallus, and Mao, 2022, our debiased in-sample performance equals the ordinary insample performance (see discussion of “Policy Classes that Do Not Depend on Z” in Section 2.3).
Hence, our results can be reinterpreted as generalization guarantees for these classes, showing
that generalization error vanishes relative to the out-of-sample performance as the problem size
grows. In this sense, our work complements the large-sample analysis of El Balghiti, Adam N
Elmachtoub, et al., 2019; Hu, Kallus, and Mao, 2022 with new, small-data, large-scale analysis.
More generally, there has been somewhat less work in the small-data, large-scale regime,
most notably Gupta and Kallus, 2021 and Gupta and P. Rusmevichientong, 2021a. Of these,
Gupta and P. Rusmevichientong, 2021a, henceforth GR 2021, is most closely related to our work.
Loosely, GR 2021 study a class of weakly coupled linear optimization problems and propose an
estimator of the out-of-sample performance based on Stein’s Lemma. By leveraging a careful duality argument, the authors prove that the estimation error of their procedure vanishes in both the
large-sample and small-data, large-scale regime.
Our work differs in two important respects: First, our estimator applies to a more general
class of problems and more general policy classes. Indeed, we focus on Problem (2.1) with
specialized results for weakly-coupled instances. Our weakly-coupled by constraints variant in
Section 2.4.2 mirrors the setting of GR 2021, and our weakly-coupled by variables variant in Section 2.4.1 is more general, allowing us to model, for example, discrete optimization problems.
Moreover, our affine plug-in policy class significantly generalizes the “Bayes-Inspired” policy
class of GR 2021 by incorporating covariate information.
The second important difference from GR 2021 relates to exploiting optimization structure in
Problem (2.1). GR 2021 fundamentally relies on Stein’s lemma, a result which applies to general functions and does not specifically leverage optimization structure. By contrast, our method
directly leverages the structure of Problem (2.1) through its sensitivity analysis and Danskin’s theorem. By leveraging optimization structure, our VGC is, by construction, continuous in the policy
class. Evidence from Section 2.5 suggests this smoothness yields an empirical advantage of our
method.
Finally, our work also contributes to a growing literature on debiasing estimates in high-dimensional statistics, most notably for LASSO regression (Javanmard and Montanari, 2018; C.-H.
Zhang and S. S. Zhang, 2014) and M-estimation (Javanmard and Montanari, 2014). Like these
works, VGC involves estimating a gradient of the underlying system and using this gradient information to form a correction. Unlike these works, however, our gradient estimation strongly
leverages ideas from sensitivity analysis in optimization. Moreover, the proofs of our performance
guarantees involve substantively different mathematical techniques.
2.2 Model
As mentioned, our focus is on data-driven instances of Problem (2.1) where the feasible region X
is known, but the parameters µ are unknown. Despite its simplicity, several applications can be
modeled in this form after a suitable transformation of variables.
Example 2.2.1 (Promotion Optimization). Promotion optimization is an increasingly well-studied
application area (Maxime C. Cohen et al., 2017; Baardman et al., 2019). Our formulation mirrors
a formulation from the ride-sharing company Lyft around incentive allocation (Schmoys and S.
Wang, 2019), but also resembles the online advertising portfolio optimization problem (D. P. W.
Rusmevichientong P., 2006; Pani and Sahin, 2017; GR 2021).
The decision-maker (platform) seeks to allocate J different types of coupons (promotions) to K
different customer (passenger) types. Coupons are costly, and there is a finite budget C available.
Let µ_jk be the reward (induced spending) and c_jk be the cost of assigning coupon type j to customer
type k. Using x_jk to denote the fraction of customers of type k who receive coupons of type j, we
can formulate the following linear optimization problem of the form of Problem (2.1):
\[
\max_{x \ge 0} \;\left\{ \sum_{k=1}^{K}\sum_{j=1}^{J} \mu_{jk}\, x_{jk} \;:\; \sum_{j=1}^{J} x_{jk} \le 1 \ \text{ for each } k = 1,\ldots,K, \quad \sum_{k=1}^{K}\sum_{j=1}^{J} c_{jk}\, x_{jk} \le C \right\}.
\]
In typical instances, the costs c_jk are likely known (a "$10 off" coupon costs $10), whereas the
rewards µ_jk must be estimated from historical data. In settings with many types of coupons and
customers, we might further expect that the reward estimates may be imprecise.
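To make the formulation concrete, the following is a minimal illustrative sketch (not part of the dissertation's experiments) that solves a small instance of the promotion LP above with scipy.optimize.linprog; the data J, K, mu_hat, c, and C are hypothetical placeholders standing in for estimated rewards and known costs.

```python
# Illustrative sketch of the promotion-optimization LP above.
# All data (J, K, mu_hat, c, C) are hypothetical placeholders.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
J, K = 3, 5                                  # coupon types, customer types
mu_hat = rng.uniform(0.0, 10.0, size=(J, K)) # estimated rewards (stand-in for Z)
c = rng.uniform(1.0, 5.0, size=(J, K))       # known coupon costs
C = 10.0                                     # total budget

# Variables x_{jk} (fraction of type-k customers receiving coupon j), flattened row-major.
obj = -mu_hat.ravel()                        # linprog minimizes, so negate rewards

# Assignment constraints: sum_j x_{jk} <= 1 for each customer type k.
A_assign = np.zeros((K, J * K))
for k in range(K):
    A_assign[k, k::K] = 1.0                  # picks out x_{0k}, x_{1k}, ..., x_{J-1,k}
# Budget constraint: sum_{j,k} c_{jk} x_{jk} <= C.
A_ub = np.vstack([A_assign, c.ravel()])
b_ub = np.concatenate([np.ones(K), [C]])

res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=(0, 1))
print("in-sample objective:", -res.fun)
```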
Some reflection suggests that many linear optimization problems, including shortest-path problems with uncertain edge costs and even binary linear optimization problems like the multi-choice knapsack with
uncertain rewards, can be cast in the above form.
We next observe that some two-stage linear optimization problems can also be framed as Problem (2.1).
Example 2.2.2 (Drone-Assisted Emergency Medical Response). In recent years, emergency response systems have begun utilizing drones as part of their operations, specifically for the rapid delivery of automated external defibrillators (AEDs) for out-of-hospital cardiac arrests (OHCAs)
(Sanfridsson et al., 2019; Cheskes et al., 2020). The intention is that a drone might reach a patient
in a remote region before a dispatched ambulance, and (untrained) bystanders can use the AED to
assist the patient until the ambulance arrives. Consequently, researchers have begun studying both
how to design a drone-assisted emergency response network (where to locate depots) (Boutilier
and T. C. Chan, 2019) and how to create optimal dispatch rules (to which locations should we
allocate a drone and from which depot) (Chu et al., 2021). Combining these two problems yields
a two-stage optimization problem, similar to facility location, aimed at minimizing the response
time.
Namely, let µ_kl be the response time of a drone routed from a source l to a patient at location k,
l = 1,...,L and k = 1,...,K. Let yl ∈ {0,1} be a binary decision variable encoding if we build
a drone depot at location l, and let xkl be a binary decision variable encoding if, after building
the network, we should dispatch a drone from location l to patient requests at location k. We let
xk0 be the choice not to route a drone (sending only an ambulance) to location k and µk0 be the
corresponding ambulance travel time. Suppose we can build at most B depots. Then, we have the
following optimization problem.
\[
\begin{aligned}
\min_{y \in \{0,1\}^{L},\; x \in \{0,1\}^{K \times L}} \;& \sum_{k=1}^{K}\sum_{l=0}^{L} \mu_{kl}\, x_{kl} \\
\text{s.t. } & \sum_{l=1}^{L} y_l \le B, \qquad x_{kl} \le y_l, \qquad \sum_{l=0}^{L} x_{kl} = 1, \qquad \forall k = 1,\ldots,K,\; l = 1,\ldots,L.
\end{aligned}
\]
Insofar as some drone response times are difficult to predict (depending on the weather, local
environment, ability of bystanders to locate and use the drone’s payload), we expect in typical
instances that estimates µkl may be imprecise.
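Since this example recurs in Section 2.5, a small brute-force sketch may help fix ideas. The instance below (K, L, B, and the excess times mu) is hypothetical; for the small L and B used here, enumerating depot subsets and letting each location pick the better of its best open depot and the ambulance-only option recovers the optimum of the two-stage problem.

```python
# Brute-force sketch of the drone depot-location / dispatch problem above.
# K, L, B and the excess response times mu are hypothetical placeholders.
import itertools
import numpy as np

rng = np.random.default_rng(1)
K, L, B = 20, 4, 2
mu = rng.normal(-100.0, 200.0, size=(K, L))  # mu[k, l]: excess time of a drone from depot l
mu0 = np.zeros(K)                            # ambulance-only option (column l = 0 in the text)

best_cost, best_depots = np.inf, None
for depots in itertools.combinations(range(L), B):
    # With the depots fixed, the dispatch decision decouples across locations k:
    # each k takes the cheaper of "no drone" and its best open depot.
    cost = np.minimum(mu0, mu[:, list(depots)].min(axis=1)).sum()
    if cost < best_cost:
        best_cost, best_depots = cost, depots
print("best depots:", best_depots, "objective:", best_cost)
```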
Interestingly, some non-linear problems can be transformed into the form of Problem (2.1).
Example 2.2.3 (Personalized Pricing). Personalized pricing strategies seek to assign a tailored
price to each of many customer types reflecting their heterogeneous willingness-to-pay (Maxime C
Cohen, Kalas, and Perakis, 2021; Javanmard, Nazerzadeh, and S. Shao, 2020; Aouad et al., 2019).
One simple formulation posits a distinct demand model D_j(p) = m_j φ_j(p) + b_j in each customer
segment j, for some decreasing function φ_j(p). This yields the revenue maximization problem
\[
\max_{p \ge 0} \; \sum_{j=1}^{n} m_j\, p_j\, \phi_j(p_j) + b_j\, p_j,
\]
where p_j is the price for the j-th segment. We can cast this nonlinear objective in the form of Problem (2.1) by transforming variables,
\[
\max_{p \ge 0,\, x} \;\left\{ \sum_{j=1}^{n} m_j\, x_j + b_j\, p_j \;:\; x_j = p_j \phi_j(p_j) \ \text{ for each } j = 1,\ldots,n \right\},
\]
where the resulting feasible region is now non-convex. In typical settings, we expect the parameters
m_j, b_j are unknown and estimated via machine learning methods (Aouad et al., 2019). When there
are many customer types, these estimates may be imprecise for rarely occurring types.
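As an aside, because the objective above is separable across segments, each transformed price can be found by a one-dimensional search. The sketch below is purely illustrative: it assumes a hypothetical exponential demand shape φ_j(p) = exp(−p) and placeholder estimates m_hat, b_hat.

```python
# Per-segment grid search for the pricing example above.
# The demand shape phi(p) = exp(-p) and the estimates m_hat, b_hat are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
n = 4
m_hat = rng.uniform(5.0, 10.0, size=n)
b_hat = rng.uniform(0.0, 1.0, size=n)
prices = np.linspace(0.0, 10.0, 1001)

phi = np.exp(-prices)                         # assumed decreasing demand shape
# Separable objective: m_j * p * phi(p) + b_j * p for each segment j.
revenue = m_hat[:, None] * prices * phi + b_hat[:, None] * prices
p_star = prices[revenue.argmax(axis=1)]
print("per-segment prices:", p_star)
```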
Finally, we mention as an aside that some dynamic programs like the economic lot-sizing
problem in inventory management can be cast in the above form through a careful representation
of the dynamic program; see Adam N Elmachtoub and Grigas, 2021 for details.
2.2.1 Data
Following GR 2021, we do not assume an explicit data generation procedure. Rather, we assume
that as a result of analyzing whatever raw data are available, we obtain noisy, unbiased predictions
Z such that E[Z] = µ, with known precisions E[(Z_j − µ_j)²] = 1/ν_j for j = 1,…,n. These predictions might arise as sample averages as in Section 2.1.2, or as the outputs of some pre-processing
regression procedure. We further assume that for each j = 1,…,n, we observe a non-random
covariate (feature) vector W_j ∈ R^p, which may (or may not) be informative for the unknown µ_j.
We believe this set-up reasonably reflects many applications. In the case of drone-assisted
emergency response (Example 2.2.2), W j encodes features that are predictive of EMS response
times such as physical road distance between the patient and the responding ambulance, time of
day, day of week, and weather conditions (Chu et al., 2021), while Zk0 may be an average of
historical response times to location k.
An advantage of modeling Z in lieu of the data generation process is that the precisions ν_j
implicitly describe the amount of relevant data available for each µ_j. Let ν_min ≡ min_j ν_j and
ν_max ≡ max_j ν_j. Then, loosely speaking, the large-sample regime describes instances where ν_min is
large, i.e., where data are plentiful and we can estimate µ easily. By contrast, the small-data, large-scale regime describes instances in which n is large (large-scale), but there are limited relevant data,
and, hence, ν_max is small.
To simplify our exposition, we will also assume:
Assumption 2.2.4 (Independent Gaussian Corruptions). For each j = 1,…,n, Z_j has a Gaussian
distribution, Z_j ∼ N(µ_j, 1/ν_j), where ν_j is the known precision of Z_j. Moreover, Z_1,…,Z_n
are independent.
Assumption 2.2.4 is common. GR 2021 employ a similar assumption; Javanmard and Montanari, 2018 strongly leverages a Gaussian design assumption when debiasing lasso estimates, and
Ito, Yabe, and Fujimaki, 2018a also assumes Gaussian errors in their debiasing technique. In each
case, the idea is that if a technique enjoys provably good performance under Gaussian corruptions, it
will likely have good practical performance when data are approximately Gaussian. Indeed, if Z
is obtained by maximum likelihood estimation, ordinary linear regression, simple averaging as in
Section 2.1.2, or Gaussian process regression, then the resulting estimates will be approximately
Gaussian. We adopt the same perspective in our work.
Note that the independence assumption in Assumption 2.2.4 is without loss of generality, as illustrated in the following example.
Example 2.2.5 (Correlated Predictions). Suppose we are given an instance of Problem (2.1) and
predictions Z ∼ N(µ, Σ), where Σ is a known, positive semidefinite matrix. Consider a Cholesky
decomposition Σ = LL^⊤ and the transformed predictions Z̃ ≡ L^{-1}Z. Notice Z̃ ∼ N(µ̃, I), where
µ̃ ≡ L^{-1}µ. We then recast Problem (2.1) as the equivalent problem
\[
\min_{\tilde x \in \tilde{\mathcal{X}}} \; \tilde{\mu}^\top \tilde x, \qquad \text{where } \tilde{\mathcal{X}} \equiv \{ L^\top x : x \in \mathcal{X} \}.
\]
Our new problem is of the required form, with transformed predictions that are independent across j.
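The whitening step in Example 2.2.5 is a one-line computation; the sketch below, with a hypothetical covariance Σ and hypothetical feasible point, simply illustrates the transformation.

```python
# Whitening transformation from Example 2.2.5 with a hypothetical covariance Sigma.
import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)              # known covariance of the predictions Z
Z = rng.multivariate_normal(np.zeros(n), Sigma)

L_chol = np.linalg.cholesky(Sigma)            # Sigma = L L^T
Z_tilde = np.linalg.solve(L_chol, Z)          # transformed predictions L^{-1} Z, covariance I

# A feasible point x of the original region maps to x_tilde = L^T x, and the
# objective is preserved: (L^{-1} mu)^T (L^T x) = mu^T x.
x = rng.uniform(0.0, 1.0, size=n)             # stand-in for a feasible point of X
x_tilde = L_chol.T @ x
```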
Most importantly, Assumption 2.2.4 is not crucial to many of our results. Violating the
Gaussian assumption only affects the bias of our estimator (see Theorem 2.3.2). Our analysis
bounding the variance and tails of the stochastic errors utilizes empirical process theory and can
easily be adapted for non-Gaussian corruptions. Moreover, although the bias of our estimator is
non-negligible when Z is non-Gaussian, we can bound this bias in terms of the Wasserstein distance
between Z and a multivariate Gaussian, suggesting our method has good performance as long as
the corruptions are approximately Gaussian and this distance is small (see Lemma 2.8.5 in the Appendix
for the bound).
Finally, similar results also hold when ν is, itself, estimated noisily, with the addition of a small
bias term related to the estimate's accuracy (see Lemma 2.8.4 in the Appendix).
2.2.2 Affine Plug-in Policy Classes
A data-driven policy for Problem (2.1) is a mapping Z 7→ x(Z) ∈ X that determines a feasible
decision x(Z) from the observed data Z. We focus on classes of affine plug-in policies. Intuitively,
a plug-in policy first proxies the unknown µ by some estimate, r(Z), and then solves Problem (2.1)
after “plugging-in” this estimate for µ.
Definition 2.2.6 (Affine Plug-in Policy Classes). For j = 1,…,n, let r_j(z, θ) = a_j(θ)z + b_j(θ) be
an affine function of z, where a_j(θ) and b_j(θ) are arbitrary functions of the parameter θ ∈ Θ. Let
r(Z, θ) = (r_1(Z_1, θ), r_2(Z_2, θ), …, r_n(Z_n, θ))^⊤ ∈ R^n. The plug-in policy with respect to r(·, θ) is
given by
\[
x(Z, \theta) \;\in\; \arg\min_{x \in \mathcal{X}} \; r(Z, \theta)^\top x, \tag{2.3}
\]
where ties are broken arbitrarily. Furthermore, we let X_Θ(Z) ≡ {x(Z, θ) ∈ X : θ ∈ Θ} ⊆ X
denote the corresponding class of plug-in policies over Θ.
When θ is fixed and clear from context, we suppress its dependence, writing x(Z) and r(Z).
Moreover, for a fixed θ, r_j(Z_j, θ) only depends on the data (linearly) through the j-th component.
Plug-in policies are attractive in data-driven optimization because computing x(Z, θ) involves
solving a problem of the same form as Problem (2.1). Thus, if a specialized algorithm exists for
solving Problem (2.1) – e.g., as with many network optimization problems – the same algorithm
can be used to compute the policy. This property does not necessarily hold for other classes of
policies, such as regularization-based policies (GR 2021).
Moreover, many policies used in practice are of the form x(Z, θ̂(Z)) for some θ̂(Z). (See examples below.) Such policies are not affine plug-in policies; r_j(Z_j, θ̂(Z)) may depend nonlinearly
on all the data Z. Nonetheless, our analysis will bound the error of our estimator applied to such
policies. Namely, in Section 2.4, we provide error bounds on our estimator that hold uniformly
over X_Θ(Z). Since these bounds hold uniformly, such bounds also hold for all policies of the form
x(Z, θ̂(Z)).
For clarity, we make no claim about the optimality of affine plug-in policies for Problem (2.1);
for a particular application, there may exist non-affine policies with superior performance. Our
focus on affine plug-ins is motivated by their ubiquity and computational tractability.
We next present examples:
• Sample Average Approximation (SAA). The Sample Average Approximation (SAA) is a canonical data-driven policy for Problem (2.1). It is defined by
\[
x^{\mathrm{SAA}}(Z) \;\in\; \arg\min_{x \in \mathcal{X}} \; Z^\top x. \tag{2.4}
\]
SAA is thus an affine plug-in policy where the plug-in function is r_j(z, θ) = z.
• Plug-ins for Regression Models. Consider the linear model r_j(Z, θ) = θ^⊤ W_j, which does not
depend on Z, and the affine plug-in policy
\[
x^{\mathrm{LM}}(Z, \theta) \;\in\; \arg\min_{x \in \mathcal{X}} \; \sum_{j=1}^{n} W_j^\top \theta \cdot x_j. \tag{2.5}
\]
As mentioned, many policies in the literature are of the form x^LM(Z, θ̂(Z)) for a particular θ̂(Z). For example, letting θ^OLS(Z) ∈ argmin_θ ∑_{j=1}^{n} (Z_j − θ^⊤ W_j)² be the ordinary least-squares fit yields the estimate-then-optimize policy x^LM(Z, θ^OLS(Z)). Similarly, by appropriately padding the covariates with zeros, we can write the "optimization-aware" SPO and SPO+
methods of Adam N Elmachtoub and Grigas, 2021 over linear hypothesis classes in the form
x^LM(Z, θ^SPO(Z)) and x^LM(Z, θ^SPO+(Z)), where θ^SPO(Z) and θ^SPO+(Z) are obtained by minimizing
the so-called SPO and SPO+ losses, respectively. Other methods, e.g., (Wilder, Dilkina, and
Tambe, 2019), can be rewritten similarly. As mentioned, our analysis will bound the error when
debiasing these policies as well.
Of course, we are not limited to a linear model for r_j(z, θ). We could alternatively use a nonlinear
specification r_j(z, θ) = f(W_j, θ) for some given nonlinear regression f with parameters θ.
This specification of r(Z, θ) still gives rise to a class of affine plug-in policies. Again, many
policies in the literature, including estimate-then-optimize policies and SPO+ over nonlinear
hypothesis classes, can be written in the form x(Z, θ(Z)) for some particular mapping θ(Z) ∈ Θ.
• Mixed-Effects Policies. When W_j is not informative for µ_j, plug-ins for regression models
perform poorly because no choice of θ yields a good estimate of µ. By contrast, if ν_min is large,
SAA performs quite well. Mixed-effects policies interpolate between these choices. Define
\[
x^{\mathrm{ME}}\bigl(Z, (\tau, \beta)\bigr) \;\in\; \arg\min_{x \in \mathcal{X}} \; \sum_{j=1}^{n} \left( \frac{\nu_j}{\nu_j + \tau}\, Z_j + \frac{\tau}{\nu_j + \tau}\, W_j^\top \beta \right) x_j, \tag{2.6}
\]
where we have focused on a linear model for simplicity and made the dependence on θ = (τ, β)
explicit for clarity. Mixed-effects policies are strongly motivated by Bayesian analysis (Gelman
et al., 2014). These policies generalize the Bayes-Inspired policy class considered in GR 2021.
Again, we observe that x^ME(Z, (τ, β)) is an affine plug-in policy. Moreover, we can also consider shrinking towards a nonlinear regression model as in (Ignatiadis and Wager, 2019a).
Note that in Definition 2.2.6 we require that r_j(Z_j) depends only on Z_j, not on Z_k for k ≠ j. We exploit this structure in the design and analysis of our debiasing technique. However, this requirement
precludes certain types of plug-ins, e.g., those based on linear smoothers (Buja, Trevor Hastie, and
Robert Tibshirani, 1989), including local polynomial regression and k-nearest neighbors. Extension
of our method to these settings remains an interesting open research question.
The choice of which affine plug-in policy class to use is largely application dependent. Our
bounds in Section 2.4 provide some preliminary guidance, suggesting a tradeoff between the expressiveness of the policy class and the error of our estimator.
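To summarize the examples above in code, the following minimal sketch writes down the plug-in functions r(·, θ) for SAA, a linear-regression plug-in, and a mixed-effects plug-in, together with a generic plug-in policy that defers to a user-supplied optimization oracle; the oracle solve_box used in the demo (for X = [0, 1]^n) and all data are hypothetical placeholders.

```python
# Minimal sketch of the affine plug-in classes from Section 2.2.2.
# The optimization oracle and all data below are hypothetical placeholders.
import numpy as np

def r_saa(Z):
    return Z                                   # SAA: r_j(z) = z

def r_linear(theta, W):
    return W @ theta                           # linear model: r_j = theta^T W_j (no dependence on z)

def r_mixed(Z, tau, beta, W, nu):
    w = nu / (nu + tau)                        # shrink Z toward the regression fit
    return w * Z + (1.0 - w) * (W @ beta)

def plug_in_policy(r_vec, solve_oracle):
    # solve_oracle: returns argmin_{x in X} c^T x, using the same solver as Problem (2.1)
    return solve_oracle(r_vec)

# Demo with the trivial feasible region X = [0, 1]^n.
solve_box = lambda c: (c < 0).astype(float)
rng = np.random.default_rng(4)
n, p = 8, 3
Z, W, nu = rng.normal(size=n), rng.normal(size=(n, p)), rng.uniform(0.5, 2.0, size=n)
x_saa = plug_in_policy(r_saa(Z), solve_box)
x_me = plug_in_policy(r_mixed(Z, tau=1.0, beta=np.zeros(p), W=W, nu=nu), solve_box)
```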
2.3 Variance Gradient Correction
We make the following assumption on problem parameters for the remainder of the paper:
Assumption 2.3.1 (Assumptions on Parameters). There exists a constant Cµ > 1 such that ∥µ∥∞ ≤
Cµ, and constants 0 < νmin < 1 < νmax < ∞ such that νmin ≤ νj ≤ νmax for all j. Moreover, we
assume that n ≥ 3.
The assumptions for Cµ and νmin,νmax are without loss of generality. These assumptions and
the assumption on n allow us to simplify the presentation of some results by absorbing lower order
terms into leading constants.
The in-sample performance of a policy x(Z) is Z^⊤ x(Z). Let ξ = Z − µ. We call the difference
between in-sample and out-of-sample performance, corresponding to (Z − µ)^⊤ x(Z) = ξ^⊤ x(Z),
the in-sample optimism. The expected in-sample optimism E[ξ^⊤ x(Z)] is the in-sample bias.
Our method estimates the in-sample optimism of an affine plug-in policy x(Z, θ). To this end,
denote the plug-in objective value by
\[
V(Z, \theta) \;\equiv\; r(Z, \theta)^\top x(Z, \theta) \;=\; \min_{x \in \mathcal{X}} \; r(Z, \theta)^\top x. \tag{2.7}
\]
Because it is the minimum of linear functions, Z ↦ V(Z, θ) is always concave. We then estimate
the in-sample optimism by the Variance Gradient Correction (VGC), defined by
\[
D(Z, (\theta, h)) \;\equiv\; \sum_{j=1}^{n} D_j(Z, (\theta, h)), \tag{2.8}
\]
where, for j = 1, 2, …, n,
\[
D_j(Z, (\theta, h)) \;\equiv\;
\begin{cases}
\mathbb{E}\left[ \dfrac{1}{h \sqrt{\nu_j}\, a_j(\theta)} \bigl( V(Z + \delta_j e_j) - V(Z) \bigr) \,\Big|\, Z \right], & \text{if } a_j \neq 0, \\[1ex]
0, & \text{otherwise},
\end{cases} \tag{2.9}
\]
and δ_1, …, δ_n are independent Gaussian random variables with δ_j ∼ N(0, h² + 2h/√ν_j) for all
j. To reduce notation, we define Θ̄ ≡ Θ × [h_min, h_max] for 0 < h_min ≤ h_max and write D(Z, θ) with
θ ∈ Θ̄. We utilize the expanded notation for results that require separating h and θ.
The VGC is defined as a (conditional) expectation over the auxiliary random variables δj
. In
practice, we can approximate this expectation to arbitrary precision by simulating δj and averaging;
see Appendix 2.8.5 for more efficient implementations.
Given the VGC, we estimate the out-of-sample performance by
\[
\mu^\top x(Z, \theta) \;\approx\; Z^\top x(Z, \theta) - D(Z, \theta). \tag{2.10}
\]
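The conditional expectation in Eq. (2.9) can be approximated by simulation, as noted above. The following is a minimal Monte Carlo sketch (first-order differences only, for an SAA-style plug-in with r(Z) = Z); the optimization oracle solve_box, the feasible region X = [0, 1]^n, and the data are hypothetical placeholders, and the variance of δ_j follows the definition in Eq. (2.9).

```python
# Monte Carlo sketch of the VGC (Eqs. (2.8)-(2.10)) for an SAA-style plug-in r(Z) = Z.
# The oracle solve_box (X = [0,1]^n) and the data Z, nu are hypothetical placeholders.
import numpy as np

def vgc_estimate(Z, nu, h, solve_oracle, n_sims=200, rng=None):
    """Return (in-sample value, VGC estimate) for the plug-in cost vector Z (so a_j = 1)."""
    rng = np.random.default_rng() if rng is None else rng
    x0 = solve_oracle(Z)
    V0 = Z @ x0                                      # plug-in value V(Z)
    sd = np.sqrt(h**2 + 2.0 * h / np.sqrt(nu))       # std. dev. of delta_j from Eq. (2.9)
    D = 0.0
    for j in range(len(Z)):
        diffs = np.empty(n_sims)
        for s in range(n_sims):                      # average over simulated delta_j
            Zp = Z.copy()
            Zp[j] += rng.normal(0.0, sd[j])
            diffs[s] = Zp @ solve_oracle(Zp) - V0    # V(Z + delta_j e_j) - V(Z)
        D += diffs.mean() / (h * np.sqrt(nu[j]))     # a_j = 1 for the SAA plug-in
    return V0, D

solve_box = lambda c: (c < 0).astype(float)          # argmin over X = [0,1]^n
rng = np.random.default_rng(5)
n = 50
mu, nu = rng.normal(size=n), np.full(n, 1.0)
Z = mu + rng.normal(size=n) / np.sqrt(nu)
V_in, D = vgc_estimate(Z, nu, h=n**(-1/6), solve_oracle=solve_box, rng=rng)
print("debiased out-of-sample estimate:", V_in - D)  # Eq. (2.10)
```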
In Section 2.3.1, we motivate the VGC. We then establish some of its key properties, namely
that it is almost an unbiased estimator for the in-sample optimism, its variance is often vanishing
as n → ∞, and it is smooth in the policy class.
Policy Classes that Do Not Depend on Z: Recall our plug-ins for linear regression models
example from Section 2.2.2. From Eq. (2.5), we can see that D(Z,(θ ,h)) = 0 uniformly over
the class. Said differently, the in-sample performance is already an unbiased estimator of out-of-sample performance. This happy coincidence occurs whenever the plug-in function rj(z,θ ) does
not depend on z for each j and θ . That said, we stress that although in-sample performance is an
unbiased estimator, it is not immediately clear what the variance of this estimator is. We discuss
this further in Sections 2.3.3 and 2.4. Moreover, when rj(z,θ ) does depend on z, the VGC is
typically non-zero.
2.3.1 Motivating the Variance Gradient Correction (VGC)
Throughout this section, θ is fixed so we drop it from the notation. Our heuristic derivation of
D(Z) proceeds in three steps.
Step 1: Re-expressing the In-Sample Optimism via Danskin's Theorem. Fix some j. If a_j = 0,
then from the plug-in policy problem (Problem (2.3)) we see that x(Z) is independent of Z_j and
the corresponding term in the in-sample bias is mean-zero, i.e., E[ξ_j x_j(Z)] = 0 = D_j(Z). In other
words, we do not correct such terms.
When a_j ≠ 0, consider the function
\[
\lambda \;\mapsto\; V(Z + \lambda \xi_j e_j). \tag{2.11}
\]
This function is an example of a parametric optimization problem. Danskin's Theorem (Bertsekas,
1997, Section B.5) characterizes its derivative with respect to λ.⁴ Specifically, for any λ ∈ R such
that x(Z + λ ξ_j e_j) is the unique optimizer of Problem (2.3), we have
\[
\frac{\partial}{\partial \lambda} V(Z + \lambda \xi_j e_j) \;=\; a_j\, \xi_j\, x_j(Z + \lambda \xi_j e_j).
\]
⁴ See Theorem 2.8.1 in Appendix 2.8 for a statement of Danskin's Theorem.
[Figure 2.1: When X is polyhedral, x_j(Z + λξ_j e_j) must occur at a vertex if it is unique. Hence, small perturbations to λ do not change the solution, and the derivative of V(Z + λξ_j e_j) is entirely determined by the derivative of r(Z + λξ_j e_j). Similar intuition holds for non-polyhedral X.]
When x(Z + λξ_j e_j) is not the unique optimizer, a_j ξ_j x_j(Z + λξ_j e_j) is a subgradient; see Fig. 2.1
for intuition.
Notice that ∂/∂λ V(Z + λξ_j e_j) is the derivative of the plug-in value when we make the j-th component of Z more variable, i.e., its variance increases by a factor (1 + ∂λ)², where ∂λ represents an
infinitesimally small perturbation to λ. This observation motivates our nomenclature "Variance Gradient
Correction."
Evaluating the above derivative at λ = 0, dividing by a_j, and summing over j such that a_j ≠ 0
allows us to re-express the in-sample optimism, whenever x(Z) is unique, as
\[
\sum_{j=1}^{n} \xi_j\, x_j(Z) \;=\; \sum_{j : a_j \neq 0} \frac{1}{a_j} \cdot \frac{\partial}{\partial \lambda} V(Z + \lambda \xi_j e_j) \Big|_{\lambda = 0}.
\]
Unfortunately, it is not clear how to evaluate these derivatives from the data. This leads to the
second step in our derivation.
Step 2: Approximating the Derivative via Randomized Finite Differencing. As a first attempt,
we approximate the above derivatives with first-order, forward finite differences (LeVeque, 2007,
Chapter 1). Intuitively, we expect that for a sufficiently small step size h > 0,
\[
\frac{\partial}{\partial \lambda} V(Z + \lambda \xi_j e_j)\Big|_{\lambda=0} \;=\; \frac{1}{h\sqrt{\nu_j}} \bigl( V(Z + h\sqrt{\nu_j}\, \xi_j e_j) - V(Z) \bigr) + o_p(1) \quad \text{as } h \to 0, \tag{2.12}
\]
which suggests that
\[
\sum_{j=1}^{n} \xi_j\, x_j(Z) \;=\; \sum_{j : a_j \neq 0} \frac{1}{h\sqrt{\nu_j}\, a_j} \bigl( V(Z + h\sqrt{\nu_j}\, \xi_j e_j) - V(Z) \bigr) + o_p(n) \quad \text{as } h \to 0. \tag{2.13}
\]
Unfortunately, the right side of Eq. (2.13) is not computable from the data, because we do not
observe µj
, and, hence, do not observe ξj = Zj − µj
.
To circumvent this challenge, recall that ξ_j is Gaussian and independent across j, and let δ_j
be the independent Gaussian random variables from the definition of the VGC (Eq. (2.9)). A
direct computation shows that Z + h√ν_j ξ_j e_j ∼_d Z + δ_j e_j, because both Z + h√ν_j ξ_j e_j and Z + δ_j e_j
are Gaussian with matching means and covariances.⁵ Hence, V(Z + h√ν_j ξ_j e_j) ∼_d V(Z + δ_j e_j).
Inspired by this relation, we replace the unknown V(Z + h√ν_j ξ_j e_j) by V(Z + δ_j e_j) in our
first-order finite difference approximation, yielding a randomized finite difference:
\[
\underbrace{\sum_{j=1}^{n} \xi_j\, x_j(\xi + \mu; \theta)}_{\text{In-Sample Optimism}}
\;\approx\;
\sum_{j : a_j \neq 0} \frac{1}{h\sqrt{\nu_j}\, a_j(\theta)} \underbrace{\bigl( V(\mu + \xi + \delta_j e_j) - V(\mu + \xi) \bigr)}_{\text{Randomized Finite Difference}}. \tag{2.14}
\]
Step 3: De-Randomizing the Correction. Finally, in the spirit of Rao-Blackwellization, we then
de-randomize this correction by taking conditional expectations over δ . This de-randomization
reduces the variability of our estimator and yields the VGC (Eq. (2.9)).
Higher Order Finite Difference Approximations: Our heuristic motivation above employs a
first-order finite difference approximation, and our theoretical analysis below focuses on this setting for simplicity. However, it is possible to use higher order approximations, which in turn reduce
the bias. Theoretical analysis of such higher order approximations is tedious, but not substantively
different from the first-order case. Hence, it is omitted. In our experiments, we use a particular
second-order approximation described in Appendix 2.8.5.
⁵ Here, ∼_d denotes equality in distribution.
2.3.2 Bias of Variance Gradient Correction
Our first main result shows that one can make the heuristic derivation of the previous section
rigorous when quantifying the bias of the VGC.
Theorem 2.3.2 (Bias of the Variance Gradient Correction). Under Assumptions 2.2.4 and 2.3.1,
for any 0 < h < 1/e and any affine plug-in policy x(Z, θ), there exists a constant C (depending on
ν_min) such that
\[
0 \;\le\; \mathbb{E}\left[ \sum_{j=1}^{n} \xi_j\, x_j(\mu + \xi, \theta) - \sum_{j=1}^{n} D_j(\mu + \xi, \theta) \right] \;\le\; C \cdot h\, n \log\frac{1}{h}.
\]
Recall we expect that, in typical instances, the full-information performance of Problem (2.1)
is O(n) as n → ∞. Thus, this theorem asserts that as long as h is small, say h = hn = o(1) as n → ∞,
the bias of VGC is negligible relative to the true out-of-sample performance. In this sense, VGC is
asymptotically unbiased for large n.
The proof of the theorem is in Appendix 2.8.1 and proceeds similarly to our heuristic derivation
but uses the following monotonicity property to precisely quantify the “little oh” terms.
Lemma 2.3.3 (Monotonicity of Affine Plug-in Policies). For any z and j, the function t ↦ x_j(z + t e_j)
is non-increasing if a_j ≥ 0 and non-decreasing if a_j < 0.
Intuitively, the lemma holds because z ↦ V(z) is a concave function (it is the minimum
of affine functions). In particular, t ↦ V(z + t e_j) is also concave, and by Danskin's Theorem,
(d/dt) V(z + t e_j) = a_j x_j(z + t e_j) whenever x(z + t e_j) is unique. Informally, the lemma then follows since
the derivative of a concave function is non-increasing. Appendix 2.8.1 provides a formal proof
accounting for points of non-differentiability.
Before proceeding, we remark that Theorem 2.3.2 holds with small modifications under mild
violations of the independent Gaussian assumption (Assumption 2.2.4). Specifically, in cases
where νj are not known but are estimated, the bias of the VGC constructed with the estimated
νj
increases by a small term depending on the accuracy of the precisions. See Lemma 2.8.4 in the
appendix for formal statements and proof.
2.3.3 The Variance of the VGC
As mentioned in the contributions, the parameter h controls the tradeoff between bias and variance
in our estimator. Unfortunately, while Theorem 2.3.2 gives a direct analysis of the bias under mild
assumptions, a precise analysis of the variance (or tail behavior) of the VGC is more delicate. In
this section we provide a loose but intuitive bound on the variance of the VGC that illustrates the
types of problems for which our estimator should perform well. The main message of this section
is that the VGC concentrates at its expectation so long as the policy x(Z) is "stable," in the sense
that perturbing one element of Z does not cause the solution x(Z) to change too much.
The main challenge in showing D(Z) concentrates at its expectation is that D(Z) is a sum
of dependent random variables Dj(Z). Worse, this dependence subtly hinges on the structure
of Problem (2.1) and the plug-in policy problem (Problem (2.3)) and hence is not amenable to
techniques based on mixing or bounding the correlations between terms. We require a different
approach.
As a first step towards analyzing D(Z), we upper bound its variance by that of a related, fully-randomized estimator.
Lemma 2.3.4 (Fully-Randomized VGC). Suppose that the solution x(Z) to Problem (2.3) is almost
surely unique. For each j such that a_j ≠ 0, let
\[
D_j^R(Z) \;\equiv\; \frac{\delta_j}{h\sqrt{\nu_j}\, a_j}\, x_j(Z + \delta_j \tilde{U}_j e_j), \tag{2.15}
\]
where Ũ_j ∼ Uniform[0,1] and δ_j is defined in Eq. (2.9). Let D^R(Z) = ∑_{j : a_j ≠ 0} D_j^R(Z) denote the
fully-randomized VGC. Then, for any j such that a_j ≠ 0,
\[
D_j(Z) = \mathbb{E}\bigl[ D_j^R(Z) \mid Z \bigr] \qquad \text{and} \qquad \mathrm{Var}(D(Z)) \;\le\; \mathrm{Var}(D^R(Z)).
\]
Proof. Again, by Danskin's Theorem and the fundamental theorem of calculus,
\[
V(Z + \delta_j e_j) - V(Z) \;=\; \int_0^{\delta_j} a_j\, x_j(Z + t e_j)\, dt \;=\; \int_0^{1} a_j\, \delta_j\, x_j(Z + t \delta_j e_j)\, dt \;=\; \mathbb{E}\bigl[ a_j \delta_j\, x_j(Z + \delta_j \tilde{U}_j e_j) \mid Z, \delta_j \bigr].
\]
Scaling both sides by 1/(a_j h √ν_j) and taking expectations over δ_j proves the first statement. The second
then follows from Jensen's inequality.
We next propose upper bounding Var(D^R(Z)) with the Efron–Stein Inequality. In particular,
let Z̄, δ̄ and Ū be i.i.d. copies of Z, δ and Ũ, respectively, and let Z^k denote the vector Z but
with the k-th component replaced by Z̄_k. Define δ^k and Ũ^k similarly. Let D^R(Z, δ, Ũ) be the fully-randomized VGC with the dependence on all constituent random variables made explicit. Then, by
the Efron–Stein Inequality,
\[
\begin{aligned}
\mathrm{Var}(D^R(Z)) \;\le\;& \frac{1}{2} \sum_{k=1}^{n} \mathbb{E}\Bigl[\bigl(D^R(Z,\delta,\tilde U) - D^R(Z^k,\delta,\tilde U)\bigr)^2\Bigr] && (2.16\mathrm{a})\\
&+ \frac{1}{2} \sum_{k=1}^{n} \mathbb{E}\Bigl[\bigl(D^R(Z,\delta,\tilde U) - D^R(Z,\delta^k,\tilde U)\bigr)^2\Bigr] && (2.16\mathrm{b})\\
&+ \frac{1}{2} \sum_{k=1}^{n} \mathbb{E}\Bigl[\bigl(D^R(Z,\delta,\tilde U) - D^R(Z,\delta,\tilde U^k)\bigr)^2\Bigr]. && (2.16\mathrm{c})
\end{aligned}
\]
Recall that in the typical case, µ^⊤ x(Z) = O_p(n). Hence, in what follows we focus on
developing conditions under which the upper bound in Eqs. (2.16a) to (2.16c) is o(n²). Indeed, such
a bound would suggest D(Z) − E[D(Z)] = o_p(n), i.e., the stochastic fluctuations in the VGC are
negligible relative to the magnitude of the out-of-sample error for n sufficiently large. With this
perspective, it is not difficult to argue that Eqs. (2.16b) and (2.16c) each contribute at most
O(n/h) (see the proof of Theorem 2.3.5). Thus, we focus on Eq. (2.16a).
Consider the k-th element of the sum. Write
\[
\bigl| D^R(Z) - D^R(Z^k) \bigr|
\;\le\; \frac{1}{h\, a_{\min} \nu_{\min}} \left| \sum_{j=1}^{n} \delta_j \bigl( x_j(Z + \delta_j \tilde U_j e_j) - x_j(Z^k + \delta_j \tilde U_j e_j) \bigr) \right|
\;\le\; \frac{\|\delta\|_2 \sqrt{n}}{h\, a_{\min} \nu_{\min}} \left( \frac{1}{n} \sum_{j=1}^{n} \bigl( x_j(Z + \delta_j \tilde U_j e_j) - x_j(Z^k + \delta_j \tilde U_j e_j) \bigr)^2 \right)^{1/2}, \tag{2.17}
\]
where the last inequality follows from the Cauchy–Schwarz inequality and a_min ≡ min_{j : a_j ≠ 0} |a_j|. Since each δ_j is Gaussian, we
expect ∥δ∥²₂ to concentrate sharply at its mean, i.e., ∥δ∥²₂ = O_p(hn). Thus, by squaring Eq. (2.17),
taking expectations, and substituting into Eq. (2.16a), we roughly expect
\[
\begin{aligned}
\mathrm{Var}(D^R(Z)) \;\le\; \underbrace{O\!\left(\frac{n}{h}\right)}_{\text{Eqs. (2.16b) and (2.16c)}} &+ \mathbb{E}\left[ \frac{\|\delta\|_2^2\, n^2}{h^2 \nu_{\min}^2 a_{\min}^2} \cdot \frac{1}{n^2} \sum_{k=1}^{n}\sum_{j=1}^{n} \bigl( x_j(Z + \delta_j \tilde U_j e_j) - x_j(Z^k + \delta_j \tilde U_j e_j) \bigr)^2 \right] \\
\approx\; O\!\left(\frac{n}{h}\right) &+ O\!\left(\frac{n^3}{h}\right) \cdot \underbrace{\frac{1}{n^2} \sum_{k=1}^{n}\sum_{j=1}^{n} \mathbb{E}\Bigl[ \bigl( x_j(Z + \delta_j \tilde U_j e_j) - x_j(Z^k + \delta_j \tilde U_j e_j) \bigr)^2 \Bigr]}_{\text{Avg. Solution Instability}}.
\end{aligned}
\]
We call the indicated term the Average Solution Instability. In the worst case, it is at most 1,
since X ⊆ [0,1]^n. If, however, it were O(n^{−α}) for some α > 1, then Var(D^R(Z)) = o(n²),
as desired.
How do we intuitively interpret the Average Solution Instability? Roughly, in the limit as h → 0,
we might expect that x(Z + δ_j Ũ_j e_j) ≈ x(Z) because δ_j = O_p(√h). Then the Average Solution
Instability is essentially the expected change in the solution at a randomly chosen component j
when we replace the data for a randomly chosen component k with an i.i.d. copy. This interpretation suggests the Average Solution Instability should be small so long as a small perturbation to the
k-th component does not change the entire solution vector x(Z) by a large amount, i.e., if small perturbations lead to small, local changes in the solutions. Intuitively, many large-scale optimization
problems exhibit such a phenomenon (see, e.g., Gamarnik, 2013), so we broadly expect the VGC to
have low variance.
The above heuristic argument can be made formal as in the following theorem.
Theorem 2.3.5 (Variance of the VGC). Suppose that the solution x(Z) to Problem (2.3) is almost
surely unique, that there exists a constant C₁ (not depending on n) such that
\[
\mathbb{E}\left[ \frac{1}{n^2} \sum_{k=1}^{n}\sum_{j=1}^{n} \bigl( x_j(Z + \delta_j \tilde U_j e_j) - x_j(Z^k + \delta_j \tilde U_j e_j) \bigr)^2 \right] \;\le\; C_1 n^{-\alpha},
\]
and that Assumption 2.3.1 holds. Then, there exists a constant C₂ (depending on ν_min and a_min ≡
min_j a_j) such that for any 0 < h < 1/e,
\[
\mathrm{Var}(D(Z)) \;\le\; \frac{C_2}{h} \max\bigl( n^{3-\alpha},\, n \bigr).
\]
In particular, in the typical case where the full-information solution to Problem (2.1) is O(n),
the stochastic fluctuations in the VGC are negligible relative to the out-of-sample performance if
α > 1.
The proof of Theorem 2.3.5 is in Appendix 2.8.4.
We remark that Theorem 2.3.5 provides a sufficient condition for the variance of the VGC to
be negligible asymptotically and shows that h controls the bias-variance tradeoff; however, the
bound is not tight. In Section 2.4 we provide a tighter analysis under more stringent assumptions on
Problems (2.1) and (2.3), which also provides guidance on how to select h to approximately
balance the bias-variance tradeoff.
2.3.4 Smoothness and Boundedness of the VGC
One of the key advantages of our VGC is that it is smooth in the policy class, provided θ 7→ r(·,θ )
is “well-behaved.” Other corrections, like the Stein Correction of GR 2021, do not enjoy such
smoothness. In Section 2.5, we argue this smoothness improves the empirical performance of our
method. We formalize “well-behaved” in the next assumption:
Assumption 2.3.6 (Plug-in Function is Smooth). We assume the functions a_j(θ), b_j(θ) are each
L-Lipschitz continuous for all j = 1,…,n. Moreover, we assume there exist a_max, b_max < ∞ such
that
\[
\sup_{\theta \in \Theta} \bigl| a_j(\theta) \bigr| \le a_{\max} \quad \text{and} \quad \sup_{\theta \in \Theta} \bigl| b_j(\theta) \bigr| \le b_{\max} \qquad \forall j = 1,\ldots,n.
\]
Finally, we assume there exists a_min such that
\[
0 < a_{\min} \;\le\; \inf\bigl\{ \bigl| a_j(\theta) \bigr| : a_j(\theta) \neq 0,\; j = 1,\ldots,n,\; \theta \in \Theta \bigr\}.
\]
In words, Assumption 2.3.6 requires the functions aj(θ ) and bj(θ ) to be Lipschitz smooth,
bounded, and that the non-zero components of aj(θ ) be bounded away from 0.
Bias and Variance of the VGC for Plug-In Linear Regression Models. Recall our plug-in linear model class from Section 2.2.2. Since a_j(θ) = 0 for all j, D(Z, θ) = 0 for all Z and (non-data-driven) θ for this class. Said differently, the in-sample performance of a policy is, itself,
our estimate of the out-of-sample performance, and both Theorems 2.3.2 and 2.3.5 can be
strengthened: the bias of our estimator and the variance of the correction are both zero. More generally,
D(Z, θ) = 0 whenever the plug-in functionals r_j(z, θ) do not depend on z for all j.
We stress, however, that this analysis does not immediately guarantee that the in-sample performance of policies of the form x^LM(Z, θ(Z)) is a good estimate of out-of-sample performance,
because θ(Z) depends on Z. In Section 2.4 we provide sufficient conditions to ensure that in-sample performance is, indeed, a good estimate of out-of-sample performance. Moreover, when
r(Z, θ) does depend on Z, e.g., as with our mixed-effects class, D(Z, θ) is generally
non-zero.
Lemma 2.3.7 (Smoothness of the Variance Gradient Correction). Under Assumptions 2.3.1 and 2.3.6,
the following hold:
i) There exists a constant C₁ (depending on a_min, a_max, b_max, and ν_min) such that for any z ∈ R^n and any 0 < h < 1/e, the function θ ↦ D(z, (θ, h)) is Lipschitz continuous with parameter
\[
\frac{C_1 n^2 L}{h \sqrt{\nu_{\min}}} \bigl( \|z\|_\infty + 1 \bigr).
\]
Moreover, there exists a constant C₂ (depending on C_µ and C₁) such that for any R > 1, with probability at least 1 − e^{−R}, the (random) function θ ↦ D(Z, (θ, h)) is Lipschitz continuous with parameter
\[
\frac{C_2 L}{h} \sqrt{\frac{R}{\nu_{\min}}} \cdot n^2 \sqrt{\log n}.
\]
ii) Consider D(z, (θ, h)) where h ∈ [h_min, h_max] and 0 < h_max − h_min < 1. There exists an absolute
constant C₃ such that for any z ∈ R^n, θ ∈ Θ, and h, h̄ ∈ [h_min, h_max], the following holds:
\[
\bigl| D(z, (\theta, h)) - D(z, (\theta, \bar h)) \bigr| \;\le\; \frac{C_3\, n}{h_{\min}\, \nu_{\min}^{3/4}} \sqrt{\bigl| h - \bar h \bigr|}.
\]
See Appendix 2.8.2 for a proof. Intuitively, the result follows because θ 7→V(z,θ ) is Lipschitz
by Danskin’s theorem and D(z,θ ) is a linear combination of such functions. The second part
follows from a high-probability bound on ∥Z∥∞.
In addition to being smooth, the VGC is also bounded as a direct result of taking the conditional
expectation over the perturbation parameters δj
.
Lemma 2.3.8 (VGC is Bounded). Suppose that Assumptions 2.3.1 and 2.3.6 hold. For any z and
any j = 1,…,n,
\[
\bigl| D_j(z) \bigr| \;\le\; \frac{\sqrt{3}}{\nu_{\min}^{3/4} \sqrt{h}}.
\]
The proof can be found in Appendix 2.8.2. The result follows from observing that the j-th component of the VGC is the difference of the optimal objective values of two optimization problems
whose cost vectors differ by O(|δ_j|) in one component. Thus, the two optimal objective values can
only differ by O(|δ_j|), which is at most a constant once we take the conditional expectation.
2.4 Estimating Out-of-sample Performance for Weakly-Coupled Problems
In this section we provide high-probability tail bounds on the error of our estimator of out-of-sample
performance that hold uniformly over a given policy class. Such bounds justify using our estimator
for policy learning, i.e., identifying the best policy within the class. They are also substantively
stronger than the variance analysis of Theorem 2.3.5 as they provide exponential bounds on the
tail behavior, rather than bounding the second moment. Additionally, we show the uniform results
hold even when θ ∈ Θ¯ is chosen in a data-driven manner (which recall also includes h).
From the definition of the VGC out-of-sample estimator (Eq. (2.10)), the error of our estimator
of the out-of-sample performance of x(Z, θ(Z)) satisfies
\[
\begin{aligned}
\underbrace{\Bigl| \mu^\top x(Z, \theta(Z)) - \bigl( Z^\top x(Z, \theta(Z)) - D(Z, \theta(Z)) \bigr) \Bigr|}_{\text{Error Estimating Out-of-Sample Perf.}}
\;\le\;& \sup_{\theta \in \bar\Theta} \underbrace{\Bigl| \xi^\top x(Z, \theta) - D(Z, \theta) \Bigr|}_{\text{Error Estimating In-Sample Optimism}} \\
\;\le\;& \sup_{\theta \in \bar\Theta} \Bigl| \xi^\top x(Z, \theta) - \mathbb{E}\bigl[ \xi^\top x(Z, \theta) \bigr] \Bigr| && (2.18\mathrm{a})\\
&+ \sup_{\theta \in \bar\Theta} \bigl| D(Z, \theta) - \mathbb{E}[D(Z, \theta)] \bigr| && (2.18\mathrm{b})\\
&+ \sup_{\theta \in \bar\Theta} \Bigl| \mathbb{E}\bigl[ \xi^\top x(Z, \theta) - D(Z, \theta) \bigr] \Bigr|. && (2.18\mathrm{c})
\end{aligned}
\]
We bounded Eq. (2.18c) in Theorem 2.3.2. Our goal will be to find sufficient conditions to
show the remaining terms are also op(n) uniformly over θ ∈ Θ. Then, in the typical case where
the out-of-sample performance is Op(n), the error of our estimator will be negligible relative to the
true out-of-sample performance. Our strategy will be to leverage empirical process theory, since
the argument of each supremum is a sum of random variables. Importantly, this empirical process
analysis does not strictly require the independent Gaussian assumption (Assumption 2.2.4). The
challenge of course is that the constraints of Problem (2.3) introduce a complicated dependence
between the terms.
Inspired by the average stability condition of Theorem 2.3.5, we focus on classes of "weakly-coupled" optimization problems. We consider two such classes of problems: those weakly-coupled
by variables in Section 2.4.1 and those weakly-coupled by constraints in Section 2.4.2. We provide
formal definitions below.
2.4.1 Problems Weakly-Coupled by Variables
We say an instance of Problem (2.1) is weakly-coupled by variables if fixing a small number of
variables causes the problem to separate into many decoupled subproblems. Generically, such
problems can be written as
\[
\begin{aligned}
\min_{x} \;& {\mu^0}^\top x^0 + \sum_{k=1}^{K} {\mu^k}^\top x^k \\
\text{s.t. } & x^0 \in \mathcal{X}^0, \qquad x^k \in \mathcal{X}^k(x^0), \qquad \forall k = 1,\ldots,K.
\end{aligned} \tag{2.19}
\]
Here, x⁰ represents the coupling variables and k = 1,…,K indexes the distinct subproblems. Notice
that once x⁰ is fixed, each subproblem can be solved separately. Intuitively, if dim(x⁰) is small
relative to n, the subproblems of Eq. (2.19) are only "weakly" coupled. Some reflection shows
both Examples 2.2.2 and 2.2.3 from Section 2.2 are weakly-coupled by variables.
Let S_k ⊆ {1,…,n} be the indices corresponding to x^k for k = 0,…,K, and let S_max = max_{k≥0} |S_k|.
The sets S_0,…,S_K form a disjoint partition of {1,…,n}. Without loss of generality, reorder the
indices so that the S_k occur "in order"; i.e., (j : j ∈ S_0),…,(j : j ∈ S_K) is a consecutive sequence.
Given the weakly-coupled structure of Eq. (2.19), we define a generalization of x(Z, θ): for
each x⁰ ∈ X⁰ and θ ∈ Θ̄, let
\[
x^k(Z, \theta, x^0) \;\in\; \arg\min_{x^k \in \mathcal{X}^k(x^0)} \; r^k(Z, \theta)^\top x^k, \qquad k = 1,\ldots,K, \tag{2.20}
\]
where r^k(Z, θ) = (r_j(Z, θ) : j ∈ S_k). Intuitively, the vector
\[
x(Z, \theta, x^0) \;\equiv\; \Bigl( (x^0)^\top,\; x^1(Z, \theta, x^0)^\top,\; \ldots,\; x^K(Z, \theta, x^0)^\top \Bigr)^\top
\]
satisfies the Average Solution Instability condition of Theorem 2.3.5 so long as S_max is not too large, since
the j-th component of the solution changes when perturbing the k-th data point if and only if j and k
belong to the same subproblem. This event happens with probability at most S_max/n².
The key to making this intuition formal and obtaining exponential tails for the error of the
out-of-sample estimator (Eq. (2.18)) is that
\[
\sup_{\theta \in \bar\Theta} \Bigl| \xi^\top x(Z, \theta) - \mathbb{E}\bigl[ \xi^\top x(Z, \theta) \bigr] \Bigr|
\;\le\; \sup_{\theta \in \bar\Theta,\, x^0 \in \mathcal{X}^0} \Bigl| \xi^\top x(Z, \theta, x^0) - \mathbb{E}\bigl[ \xi^\top x(Z, \theta, x^0) \bigr] \Bigr|
\]
and
\[
\sup_{\theta \in \bar\Theta} \bigl| D(Z, \theta) - \mathbb{E}[D(Z, \theta)] \bigr|
\;\le\; \sup_{\theta \in \bar\Theta,\, x^0 \in \mathcal{X}^0} \bigl| D(Z, \theta, x^0) - \mathbb{E}\bigl[ D(Z, \theta, x^0) \bigr] \bigr|,
\]
where both ξ^⊤ x(Z, θ, x⁰) and D(Z, θ, x⁰) can, for a fixed θ, x⁰, be seen as sums of K independent
random variables. To obtain uniform bounds, we then need to control only the metric entropy of
the resulting (lifted) stochastic processes indexed by (θ, x⁰).
We propose a simple assumption on the policy class to control this metric entropy. We believe
this assumption is easier to verify than other assumptions used in the literature (e.g., bounded linear
subgraph dimension or bounded Natarajan dimension), but admittedly slightly more stringent.
Assumption 2.4.1 (Lifted Affine Plug-in Policy). Given an affine plug-in policy class defined
by r(·, θ) for θ ∈ Θ, we say this class satisfies the lifted affine plug-in policy assumption for
problems weakly-coupled by variables (Eq. (2.19)) if there exist a mapping φ(·) and mappings
g_k(·) for k = 1,…,K such that
\[
x^k(Z, \theta, x^0) \;\in\; \arg\min_{x^k \in \mathcal{X}^k(x^0)} \; \varphi(\theta)^\top g_k(Z^k, x^k, x^0), \qquad k = 1,\ldots,K, \;\; \forall x^0 \in \mathcal{X}^0.
\]
We stress that the mapping φ(·) is common to all K subproblems and all x⁰ ∈ X⁰, and both φ(·)
and g_k(·) can be arbitrarily nonlinear. Moreover, g_k(·) may implicitly depend on the precisions ν
and covariates W, as these are fixed constants. With the exception of policies based on linear smoothers,
each of our examples from Section 2.2.2 satisfies Assumption 2.4.1. For example, for plug-ins for
linear regression models, we can simply take φ(θ) = θ and g_k(Z^k, x^k, x⁰) = ∑_{j ∈ S_k} W_j x_j.
When x^k(Z, θ, x⁰) is not the unique minimizer of the weakly-coupled-by-variables problem defined
in Eq. (2.20), we require that ties are broken consistently. Let Ext(Conv(X^k(x⁰))) denote the
set of extreme points of Conv(X^k(x⁰)), and let X_max = max_{k≥0} |Ext(Conv(X^k(x⁰)))|. Note that if
x^k(Z, θ, x⁰) is unique, it is an extreme point.
Assumption 2.4.2 (Consistent Tie-Breaking). We assume there exist functions σ_{k,x⁰} : 2^{X^k(x⁰)} → Ext(Conv(X^k(x⁰))) such that
\[
x^k(Z, \theta, x^0) \;=\; \sigma_{k,x^0}\left( \arg\min_{x^k \in \mathcal{X}^k(x^0)} \; \varphi(\theta)^\top g_k(Z^k, x^k, x^0) \right), \qquad k = 1,\ldots,K, \;\; \forall x^0 \in \mathcal{X}^0.
\]
Consistent tie-breaking requires that if (θ₁, x⁰₁) and (θ₂, x⁰₂) induce the same minimizers in
Eq. (2.20) for some Z, then x^k(Z, θ₁, x⁰₁) = x^k(Z, θ₂, x⁰₂), and this point is an extreme point of
Conv(X^k(x⁰)).
Assumptions 2.4.1 and 2.4.2 allow us to bound the cardinality of the set
\[
\Bigl\{ \bigl( x^0,\; x^1(Z, \theta, x^0),\; \ldots,\; x^K(Z, \theta, x^0) \bigr) \;:\; x^0 \in \mathcal{X}^0,\; \theta \in \Theta \Bigr\}
\]
by adapting a geometric argument counting regions in a hyperplane arrangement from Gupta and
Kallus, 2021 (Lemma C.7). The cardinality of this set characterizes the metric entropy of the policy
class.
Finally, for this section, we say a constant C is dimension-independent if C does not depend
on {K,Smax,h,X 0
,Xmax,dim(φ)}, but may depend on {νmin,Cµ,L}. We now present the main
result of this section:
Theorem 2.4.3 (Policy Learning for Problems Weakly-Coupled by Variables). Suppose Assumptions 2.2.4, 2.3.1, 2.3.6, 2.4.1 and 2.4.2 all hold. Let X_max ≥ |Ext(Conv(X^k(x⁰)))| for all
k = 1,…,K and x⁰ ∈ X⁰, and assume X_max < ∞. Then, for 0 < h_min ≤ h_max ≤ 1, there exists a
dimension-independent constant C such that, for any R > 1, with probability at least 1 − 2exp(−R),
\[
\begin{aligned}
\sup_{\theta \in \bar\Theta} \Bigl| \xi^\top x(Z, \theta) - D(Z, \theta, h) \Bigr|
\;\le\;& C K S_{\max} \cdot h_{\max} \log\frac{1}{h_{\min}} \\
&+ C S_{\max} R \sqrt{\frac{K \log\bigl(1 + |\mathcal{X}^0|\bigr)}{h_{\min}}}\, \log(K)
\left( \sqrt{ \log(S_{\max}) \log\Bigl( h_{\min}^{-1} \cdot N\Bigl( \sqrt{\tfrac{h_{\min}}{K n^2}}, \Theta \Bigr) \Bigr) }
+ \sqrt{ \log(K)\, \dim(\varphi) \log(1 + X_{\max}) } \right),
\end{aligned}
\]
where N(ε, Θ) is the ε-covering number of the set Θ.
In the typical case where Θ does not depend on n or K and
\[
\max\bigl( S_{\max},\, |\mathcal{X}^0|,\, \dim(\varphi) \bigr) \;=\; \tilde{O}(1) \quad \text{as } K \to \infty, \tag{2.21}
\]
we can approximately minimize the above bound by selecting h ≡ h_K = O(K^{−1/3}) and noting that the
relevant covering number grows at most logarithmically in K. This choice of h approximately
balances the deterministic and stochastic contributions, and the bound reduces to Õ(C_PI K^{2/3}) for
some constant C_PI (depending on |X⁰|, dim(φ), and dim(θ)) that measures the complexity of the policy class.
Many applications satisfy the conditions in Eq. (2.21), including the drone-assisted emergency
medical response application (Example 2.2.2).
To illustrate, recall Example 2.2.2. Here, y represents the binding variables "x⁰", and we
see |X⁰| = \binom{L}{B}. Moreover, S_max = L, since x decouples across k. Inspecting the constraints,
X_max ≤ B, since for each k we choose exactly one depot from which to serve location k, and
there are at most B available depots. Finally, for a fixed policy class, dim(φ) is constant and the log
covering number above grows at most logarithmically in K. Most importantly, we expect L (the
number of possible depots) and B (the budget) to be fairly small relative to K, since regulations
and infrastructure limit the placement of depots, but there are many possible locations for cardiac
events. Hence, typical instances of Example 2.2.2 satisfy Eq. (2.21). We return to Example 2.2.2 in
Section 2.5, where we study the performance of our method numerically.
2.4.2 Problems Weakly-Coupled by Constraints
An instance of Problem (2.1) is weakly-coupled by constraints if, after removing a small number
of binding constraints, the problem decouples into many separate subproblems. Data-driven linear
optimization problems of this form have been well studied by X. Li and Ye, 2019 and GR 2021. In
order to facilitate comparisons to the existing literature, we study the specific problem
\[
\max_{x \in \mathcal{X}} \; \sum_{j=1}^{n} \mu_j x_j, \qquad \mathcal{X} = \left\{ x \in [0,1]^n \;:\; \frac{1}{n} \sum_{j=1}^{n} A_j x_j \le b \right\}, \tag{2.22}
\]
and corresponding plug-in policies
\[
x(Z, \theta) \;\in\; \arg\max_{x \in \mathcal{X}} \; \sum_{j=1}^{n} r_j(Z_j, \theta)\, x_j. \tag{2.23}
\]
Here, A j ∈ R
m with m ≥ 1. In particular, we consider a maximization instead of a minimization.
We next introduce a dual representation of Problem (2.23). Specifically, scaling the objective
of Problem (2.23) by 1/n and dualizing the binding constraints yields
\[
\lambda(Z, \theta) \;\in\; \arg\min_{\lambda \in \mathbb{R}^m_+} \left\{ b^\top \lambda + \max_{x \in [0,1]^n} \frac{1}{n} \sum_{j=1}^{n} \bigl( r_j(Z_j, \theta) - A_j^\top \lambda \bigr) x_j \right\}. \tag{2.24}
\]
For a fixed λ, the inner maximization of Eq. (2.24) can be solved explicitly, yielding
\[
\lambda(Z, \theta) \;\in\; \arg\min_{\lambda \in \mathbb{R}^m_+} \; \mathcal{L}(\lambda, Z, \theta), \qquad \text{where } \mathcal{L}(\lambda, Z, \theta) \;\equiv\; b^\top \lambda + \frac{1}{n} \sum_{j=1}^{n} \bigl( r_j(Z_j, \theta) - A_j^\top \lambda \bigr)^+ .
\]
By strong duality, V(Z, θ) = r(Z, θ)^⊤ x(Z, θ) = n L(λ(Z, θ), Z, θ).
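For intuition, the dual function L(λ, Z, θ) is easy to evaluate and to minimize numerically. The sketch below uses hypothetical data (A, b, and a stand-in for r(Z, θ)) and a generic bound-constrained solver purely for illustration; since L is piecewise linear, a subgradient method would be the more natural choice.

```python
# Sketch of the dual representation in Eq. (2.24) with hypothetical data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
n, m = 200, 3
A = rng.uniform(0.0, 1.0, size=(m, n))        # columns A_j stacked as an m x n matrix
b = np.full(m, 0.3)
r_vec = rng.normal(size=n)                    # stand-in for r_j(Z_j, theta)

def dual_objective(lam):
    # L(lambda, Z, theta) = b^T lambda + (1/n) * sum_j (r_j - A_j^T lambda)_+
    slack = r_vec - A.T @ lam
    return b @ lam + np.maximum(slack, 0.0).mean()

res = minimize(dual_objective, x0=np.zeros(m), bounds=[(0.0, None)] * m)
lam_star = res.x
# Up to at most m fractional components, the primal policy is recovered by thresholding.
x = (r_vec > A.T @ lam_star).astype(float)
print("dual value (scaled):", res.fun, "fraction of items selected:", x.mean())
```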
This dual representation highlights the weakly-coupled structure. Indeed, the dependence
across terms in the sum in L (λ (Z,θ ),Z,θ ) arises because λ (Z,θ ) depends on the entire vector Z.
However, this dependence has to be "channeled" through the m-dimensional vector λ(Z, θ), and,
hence, when m is small relative to n, cannot create too much dependence between the summands.
Indeed, we will show that if m is small relative to n, then λ (Z,θ ) concentrates at its expectation,
i.e., a constant, as n → ∞, and, hence, the summands become independent asymptotically. This
insight is key to the analysis.
To formalize these ideas, we make assumptions similar to those in GR 2021 and X. Li and Ye,
2019:
Assumption 2.4.4 (s₀-Strict Feasibility). There exist s₀ > 0 and x⁰ ∈ X such that (1/n) ∑_{j=1}^{n} A_j x⁰_j + s₀ e ≤ b.
Assumption 2.4.5 (Regularity of the Matrix A). There exists a constant C_A such that ∥A_j∥_∞ ≤ C_A
for all 1 ≤ j ≤ n. Moreover, there exists a constant β > 0 such that the minimal eigenvalue of
(1/n) ∑_{j=1}^{n} A_j A_j^⊤ is at least β.
The strict feasibility assumption can often be satisfied by perturbing A or b and ensures the dual
optimal values λ (Z,θ ) are bounded with high probability. The regularity assumptions on A ensure
the function λ 7→ E[L (λ ,Z,θ )] is strongly convex, a key property in our proof (see below). Such
a property holds, e.g., if the columns A j are drawn randomly from some distribution.
As in Section 2.4.1, in order to obtain uniform bounds we must also control the metric entropy
of the different stochastic error terms in the out-of-sample error decomposition (Eq. (2.18)). Generalizing GR 2021, we
make the following assumption:
Assumption 2.4.6 (VC Policy Class). There exists a function ρ(·) such that
\[
r_j(z_j, \theta) = \rho\bigl( (z_j, \nu_j, W_j, A_j), \theta \bigr), \qquad j = 1,\ldots,n,
\]
and a constant V such that the class of functions
\[
\Bigl\{ (z, \nu, W, A) \mapsto \rho\bigl( (z, \nu, W, A), \theta \bigr) - A^\top \lambda \;:\; \theta \in \Theta,\; \lambda \in \mathbb{R}^m_+ \Bigr\}
\]
has pseudo-dimension at most V. Without loss of generality, we further assume V ≥ max(m, 2).
The size of the constant V captures the complexity of the policy class, which typically depends
upon the dimension of θ as well as the number of binding constraints m. As an illustration, recall
the plug-ins for linear regression models policy class from Section 2.2. By Pollard, 1990, Lemma
4.4, this policy class satisfies Assumption 2.4.6 with V = dim(θ) + m.
Finally, we say a constant C is dimension-independent if C does not depend on {n, h, m, V, dim(θ)},
but may depend on {ν_min, C_A, C_µ, β, s₀, a_min, a_max, b_max, L}.
The main result of this section is then:
Theorem 2.4.7 (Estimation Error for Problems Weakly-Coupled by Constraints). Under Assumptions 2.2.4, 2.3.1, 2.3.6 and 2.4.4 to 2.4.6, for 0 < h_min ≤ h_max ≤ 1 there exist dimension-independent
constants C and n₀ such that, for any R > 1 and all n ≥ n₀ e^R, we have that with probability
1 − C exp(−R),
\[
\sup_{\theta \in \bar\Theta} \left| \sum_{j=1}^{n} \xi_j\, x_j(Z, \theta) - D_j(Z, \theta) \right|
\;\le\; C n \cdot h_{\max} \log\frac{1}{h_{\min}}
+ C \cdot V^3 \log^3 V \cdot R \cdot \sqrt{ n \log\bigl( n \cdot N(n^{-3/2}, \Theta) \bigr) } \cdot \log^5\!\left( \frac{n}{h_{\min}} \right).
\]
To build intuition, consider instances where Θ does not depend on n. Then, V = O(1) and the
covering number grows at most logarithmically as n → ∞. We can then minimize the above bound
(up to logarithmic terms) by taking h ≡ h_n = O(n^{−1/4}), yielding a bound of order Õ(n^{3/4}).
In particular, in the typical instance where the full-information optimum (cf. Problem (2.22)) is
O(n), the relative error of our estimate is Õ(n^{−1/4}), which is vanishing as n → ∞.
Remark 2.4.8. The rate above (Õ(n^{−1/4})) is slightly slower than the rate of convergence of the
Stein correction in GR 2021 (Õ(n^{−1/3})). We attribute this difference to our choice of a first-order
finite difference when constructing the VGC. A heuristic argument strongly suggests that had
we instead used a second-order forward finite difference scheme as in Appendix 2.8.5, we would
recover the rate Õ(n^{−1/3}). Moreover, our numerical experiments (with the second-order scheme)
in Section 2.5 show our second-order VGC to be very competitive.
Proof Intuition: Approximate Strong-Convexity and Dual Stability
To build intuition, recall that to show the convergence of the VGC, it suffices to bound the Average Solution Instability defined in Theorem 2.3.5. By complementary slackness, x_j(Z) = I{ r_j(Z_j) > A_j^⊤ λ(Z) }
except possibly for m fractional components. Hence, by rounding the fractional components, we
have, for j ≠ k,
\[
\begin{aligned}
x_j(Z + \delta_j \tilde U_j e_j) - x_j(Z^k + \delta_j \tilde U_j e_j)
&\;\le\; \mathbb{I}\bigl\{ r_j(Z_j + \delta_j \tilde U_j) \ge A_j^\top \lambda(Z + \delta_j \tilde U_j e_j) \bigr\}
- \mathbb{I}\bigl\{ r_j(Z_j + \delta_j \tilde U_j) > A_j^\top \lambda(Z^k + \delta_j \tilde U_j e_j) \bigr\} \\
&\;\le\; \mathbb{I}\Bigl\{ r_j(Z_j + \delta_j \tilde U_j) \in \bigl\langle A_j^\top \lambda(Z + \delta_j \tilde U_j e_j),\; A_j^\top \lambda(Z^k + \delta_j \tilde U_j e_j) \bigr\rangle \Bigr\},
\end{aligned}
\]
where we use ⟨l, u⟩ to denote the interval [min(l, u), max(l, u)]. By symmetry, the same bound
holds for x_j(Z^k + δ_j Ũ_j e_j) − x_j(Z + δ_j Ũ_j e_j). Since the summands where j = k each contribute at most
1 to the Average Solution Instability, we thus have that
\[
\frac{1}{n^2} \sum_{k=1}^{n}\sum_{j=1}^{n} \bigl( x_j(Z + \delta_j \tilde U_j e_j) - x_j(Z^k + \delta_j \tilde U_j e_j) \bigr)^2
\;\le\;
\frac{1}{n^2} \sum_{k=1}^{n}\sum_{j=1}^{n} \mathbb{I}\Bigl\{ r_j(Z_j + \delta_j \tilde U_j) \in \bigl\langle A_j^\top \lambda(Z + \delta_j \tilde U_j e_j),\; A_j^\top \lambda(Z^k + \delta_j \tilde U_j e_j) \bigr\rangle \Bigr\} + O_p\!\left(\frac{1}{n}\right).
\]
The principal driver of the Solution Instability is the first double sum; in the worst case, it might
be O_p(1). It will be small if ∥λ(Z + δ_j Ũ_j e_j) − λ(Z^k + δ_j Ũ_j e_j)∥ is small for most j and k. Said
differently, problems like Problem (2.23) that are weakly-coupled by constraints will have small
Solution Instability if the dual solutions λ (·) are, themselves, stable, i.e., if the dual solution does
not change very much when we perturb one data point. Our analysis thus focuses on establishing
this dual solution stability.
Unfortunately, solutions of linear optimization problems need not be stable – a small perturbation to the cost vector might cause the optimal solution to switch between extreme points, inducing
a large change. By contrast, solutions of convex optimization problems with strongly-convex objectives are stable (see, e.g., Shalev-Shwartz et al., 2010). The next key step of our proof is to show
that although λ 7→ L (λ ,Z) is not strongly-convex, it is still “approximately” strongly-convex with
high probability in a certain sense.
[Figure 2.2: Approximate Strong Convexity of L(λ). Panel (a) shows a strongly convex function f(λ) and visualizes the strong convexity condition Eq. (2.25). Panel (b) shows that because L(λ) is piecewise linear, it does not satisfy Eq. (2.25) for points on the same line segment (λ₁ and λ′₂). However, when λ₁, λ₂ are sufficiently far apart, they lie on different line segments and Eq. (2.25) is satisfied.]
To be more precise, recall that a function f(λ) is κ-strongly convex if
\[
f(\lambda_2) - f(\lambda_1) \;\ge\; \nabla_\lambda f(\lambda_1)^\top (\lambda_2 - \lambda_1) + \frac{\kappa}{2} \|\lambda_2 - \lambda_1\|_2^2, \qquad \forall \lambda_1, \lambda_2 \in \mathrm{Dom}(f), \tag{2.25}
\]
where κ > 0 and ∇_λ denotes the subgradient. The left panel of Fig. 2.2 depicts this condition
graphically. For any two points λ₂ and λ₁, the first-order Taylor series underestimates the function
value, and one can "squeeze in" a quadratic correction (κ/2)∥λ₂ − λ₁∥²₂ in the gap.
The function λ ↦ L(λ, Z) does not satisfy this condition, as seen in the right panel for points
λ₁ and λ′₂. This function is piecewise linear, and, for two points on the same line segment, the
first-order Taylor series is exact. However, for points on different line segments, such as λ₁ and
λ₂, the first-order Taylor series is a strict underestimation, and we can squeeze in a quadratic
correction. Said differently, Eq. (2.25) does not hold for all λ₁, λ₂, but holds for most λ₁, λ₂. In
this sense, λ ↦ L(λ, Z) is "approximately" strongly convex.
To make a formal statement, it is more convenient to use a different, equivalent definition of
strong convexity. Equation (2.25) is equivalent to the condition
\[
\bigl( \nabla_\lambda f(\lambda_1) - \nabla_\lambda f(\lambda_2) \bigr)^\top (\lambda_1 - \lambda_2) \;\ge\; \kappa \|\lambda_1 - \lambda_2\|_2^2, \qquad \forall \lambda_1, \lambda_2 \in \mathrm{Dom}(f). \tag{2.26}
\]
Lemma 2.10.9 then shows that λ ↦ L(λ, Z) is approximately strongly convex in the sense that,
with high probability, there exists a C > 0 such that
\[
\bigl( \nabla_\lambda \mathcal{L}(\lambda_1, Z) - \nabla_\lambda \mathcal{L}(\lambda_2, Z) \bigr)^\top (\lambda_1 - \lambda_2) \;\ge\; C \|\lambda_1 - \lambda_2\|_2^2 - \frac{\|\lambda_1 - \lambda_2\|_2^{3/2}}{C \sqrt{n}}, \tag{2.27}
\]
for all ∥λ₁∥₁ ≤ λ_max, ∥λ₂∥₁ ≤ λ_max, and ∥λ₁ − λ₂∥₂ ≥ 4/n, where λ_max is a dimension-independent constant satisfying E∥λ(Z, θ)∥₁ ≤ λ_max.
Equation (2.27) mirrors Eq. (2.26). As n → ∞, Eq. (2.27) reduces to the analogue of Eq. (2.26)
for ∥λ₁∥₁, ∥λ₂∥₁ ≤ λ_max. Moreover, for ∥λ₁ − λ₂∥₂ ≫ 1/n, the first term on the right of Eq. (2.27)
dominates the second, so that the right-hand side is essentially quadratic in ∥λ₁ − λ₂∥₂.
These relations motivate our terminology "approximately strongly convex."
Using this notion of approximate strong convexity, we show in Lemma 2.10.12 that there exists
a set E_n ⊆ R^n such that Z ∈ E_n with high probability, and, more importantly, for any z ∈ E_n, the
dual solutions are stable, i.e.,
\[
\|\lambda(z, \theta) - \lambda(\bar z, \theta)\|_2 \;\le\; \frac{C}{n} \sum_{j=1}^{n} \mathbb{I}\{ z_j \neq \bar z_j \} \qquad \forall \bar z \ \text{ s.t. } \ \|\lambda(\bar z)\|_1 \le \lambda_{\max}. \tag{2.28}
\]
Equipped with this dual-stability condition, we can bound the average solution instability and the
variance of D(Z) as in Theorem 2.3.5.
However, since the above stability condition holds with high probability instead of in expectation, we can actually use a modification of McDiarmid’s inequality (see Theorem 2.7.5) to prove
the following, stronger tail bound:
Lemma 2.4.9 (Pointwise Convergence of the VGC). Fix some θ ∈ Θ. Under the assumptions of
Theorem 2.4.7, there exist dimension-independent constants C, n₀ such that, for any R > 1 and
n ≥ n₀ e^R, we have with probability at least 1 − 4exp(−R),
\[
\bigl| D(Z, \theta) - \mathbb{E}[D(Z, \theta)] \bigr| \;\le\; C\, V^3 \log^2(V)\, \log^4(n)\, \frac{\sqrt{n}}{h}\, \sqrt{R}.
\]
We then complete the proof of Theorem 2.4.7 by i) invoking a covering argument over Θ to
extend this tail bound to a uniform concentration for the VGC and ii) again leveraging dual stability
to show the in-sample optimism (Eq. (2.18a)) concentrates similarly. See Appendix 2.10.5 for the
details.
Comparison to the Proof Techniques of GR 2021 and X. Li and Ye, 2019
GR 2021 and X. Li and Ye, 2019 also analyze the behavior of λ(Z) in the limit as n → ∞. The
key to their analysis is observing that the function λ ↦ E[L(λ, Z)] is strongly convex. Using this
property, they prove that
\[
\|\lambda(Z) - \lambda^*\|_2 \;=\; \tilde{O}_p\!\left( \frac{1}{\sqrt{n}} \right) \tag{2.29}
\]
for some constant λ* that does not depend on the realization of Z.
Our analysis via approximate strong convexity takes a different perspective. Specifically, instead of studying the function λ ↦ E[L(λ, Z)], we study the (random) function λ ↦ L(λ, Z).
While more complex, this analysis permits a tighter characterization of the behavior of the dual
variables. In particular, leveraging Eq. (2.28), one can prove a statement similar to Eq. (2.29)
(see Lemma 2.10.17); however, to the best of our knowledge, one cannot easily prove Eq. (2.28)
given only the strong convexity of λ ↦ E[L(λ, Z)] or Eq. (2.29). A simple triangle inequality from
Eq. (2.29) would suggest only the much slower rate ∥λ(Z) − λ(Z̄)∥ = O_p(1/√n).
It is an open question whether this tighter analysis might yield improved results for the online
linear programming setting studied in X. Li and Ye, 2019.
2.5 Numerical Case Study: Drone-Assisted Emergency Medical Response
We reconsider Example 2.2.2 using real data from Ontario, Canada. Our data analysis and setup largely mirror Boutilier and T. C. Chan, 2019; however, unlike that work, our optimization
formulation explicitly models response-time uncertainty.
Data and Setup. Recall that our formulation decides the location of drone depots (y_l) and dispatch
rules (xkl) where a dispatch rule determines whether to send a drone from depot l to location k when
requested. Our objective is to minimize the expected response time to out-of-hospital cardiac arrest
(OHCA) events. We consider L = 31 potential drone depot locations at existing paramedic, fire,
and police stations in the Ontario area.
To study the effect of problem dimension on our estimator, we vary the number of OHCA
events via simulation similarly to Boutilier and T. C. Chan, 2019. Specifically, we estimate a
(spatial) probability density over Ontario for the occurrence of OHCA events using a kernel density
estimator trained on 8 years of historical OHCA events. We then simulate K (ranging from 50 to
3,200) events according to this density giving the locations k used in our formulation.
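A minimal sketch of this simulation step is below; the file name, kernel bandwidth, and event count are illustrative placeholders rather than the values used in our experiments.

import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical file of historical OHCA coordinates (longitude, latitude).
historical_events = np.loadtxt("ohca_history.csv", delimiter=",")  # shape (N, 2)

# Fit a spatial kernel density estimator to the historical events.
kde = KernelDensity(kernel="gaussian", bandwidth=0.05).fit(historical_events)

# Simulate K OHCA locations for one problem instance (K ranges from 50 to 3,200).
K = 400
simulated_locations = kde.sample(K, random_state=0)  # shape (K, 2)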
In our case study, µ_kl represents the excess time a drone-assisted response takes over an ambulance-only response. (This objective is typically negative.) We learn these constants by first training a k-nearest neighbors (kNN) regression on the historical ambulance response times to nearby OHCAs.
(For a sense of scale, the maximum ambulance response time is less than 1500s = 25 min.) We
estimate a drone response time based on the (straight-line) distance between k and l assuming an
average speed of 27.8 m/s and 20s for take-off and landing (assumptions used in Boutilier and
T. C. Chan, 2019). We then set µkl to the difference of the drone time minus the ambulance time.
These values are fixed constants throughout our simulations and range from −3100 seconds to
1200 seconds.
We take Z_kl to be normally corrupted predictions of µ_kl, where the precisions ν_kl are determined by bootstrapping. Specifically, we take many bootstrap samples of our original historical dataset, refit the k-nearest neighbor regression, and recompute an estimate of the ambulance and drone response times. The precision ν_kl is taken to be the reciprocal of the variance of these bootstrap replicates. Precisions range from ν_min = 4×10⁻⁷ to ν_max = 2×10⁻⁴.
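The following sketch illustrates this bootstrap construction for a single precision; the number of bootstrap replicates and the number of neighbors are illustrative assumptions.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def bootstrap_precision(X_hist, y_hist, x_query, n_boot=200, k=10, seed=0):
    # Resample the historical OHCA data, refit the kNN response-time regression,
    # and re-predict at the query location; the precision is the reciprocal of
    # the variance of the bootstrap predictions.
    rng = np.random.default_rng(seed)
    n, preds = len(y_hist), np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        model = KNeighborsRegressor(n_neighbors=k).fit(X_hist[idx], y_hist[idx])
        preds[b] = model.predict(x_query.reshape(1, -1))[0]
    return 1.0 / preds.var()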
Policy Class. To determine dispatch rules for our case study, we consider the following policies:
x(Z, W, (τ, µ_0)) ∈ argmin_{x ∈ X} ∑_{k=1}^K ∑_{j=1}^L ( ν_jk/(ν_jk + τ) Z_jk + τ/(ν_jk + τ) W_jk − µ_0 ) x_jk,
where Wjk is the computed drone travel time between facility j and OHCA k. Our policy class
consists of policies where τ ∈ [0,100νmin] and µ0 ∈ [0,1500]. Similar to the Mixed-Effects Policies
from Section 2.2.2, each policy is a weighted average between the SAA policy and a deterministic
policy that dispatches to any location whose drone travel-time is at most µ0.
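As a sketch, the plug-in objective coefficients indexed by (τ, µ_0) can be formed as below; the array shapes and names are illustrative.

import numpy as np

def policy_coefficients(Z, W, nu, tau, mu0):
    # Z, W, nu are (L, K) arrays of predicted excess times, drone travel times,
    # and precisions. The coefficient of x_jk in the dispatch problem shrinks the
    # prediction Z_jk toward the deterministic travel-time rule as tau grows.
    weight = nu / (nu + tau)
    return weight * Z + (1.0 - weight) * W - mu0

# tau = 0 recovers the SAA coefficients; large tau approaches the deterministic
# policy that dispatches whenever the drone travel time is at most mu0.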
For the first three experiments, we generate out-of-sample estimates using our VGC, the Stein correction of GR 2021, and hold-out (2-fold) cross-validation. We assume that we are given two samples Z^1, Z^2 so that Z_jk = (Z^1_jk + Z^2_jk)/2. We set h = n^{−1/6} for both the VGC and the Stein correction based on the recommendation of GR 2021. In the last experiment, we are given one hundred samples Z^i for i = 1, ..., 100, where Z_jk = (1/100) ∑_{i=1}^{100} Z^i_jk, and generate out-of-sample estimates for 2, 5, 10, 20, 50, and 100 fold cross-validation. For ease of comparison, we present all results as a percentage relative to full-information optimal performance.
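For concreteness, a sketch of the hold-out baseline with the two samples Z^1, Z^2 follows; train_policy and evaluate are placeholder callables standing in for the dispatch optimization and the plug-in performance estimate.

def two_fold_cv_estimate(Z1, Z2, train_policy, evaluate):
    # Train on one sample and score on the other, then average both directions.
    # The trained policy only sees half the data, which is the source of the
    # non-vanishing bias discussed in Section 2.5.1.
    est_12 = evaluate(train_policy(Z1), Z2)
    est_21 = evaluate(train_policy(Z2), Z1)
    return 0.5 * (est_12 + est_21)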
2.5.1 Results
In our first experiment, we evaluate the bias and square root variance of each method for the
out-of-sample performance of the SAA policy (τ = 0) as K, the number of OHCA events, grows (see Fig. 2.3). As predicted by our theoretical analysis, the quality of the out-of-sample estimates improves as we increase the problem size for both the VGC and the Stein Correction. However,
cross-validation incurs a non-vanishing bias because it only leverages half the data in training.
As a second experiment, in Fig. 2.4, we can observe the quality of the estimators over multiple
policies in the policy class. We highlight the smoothness of the VGC as τ varies. Since, for large
K, the true performance is quite smooth, the worst-case error of VGC is generally smaller than that
[Figure 2.3 panels: Abs. Bias (% Full-Info) and Root Variance (% Full-Info) versus OHCA Events (K); legend: Cross-Val, Stein Correction, VGC.]
Figure 2.3: Bias and variance as K → ∞. The two graphs plot the bias and variance of the different
out-of-sample performance estimators for the Sample Average Approximation (SAA) policy. The
bias and variance were estimated across 100 simulations for each K. Although variance vanishes
for all methods as K increases, cross-validation exhibits a non-vanishing bias and is uniformly
worse for all K.
[Figure 2.4 panels: Sub-Optimality versus τ; legend: Cross-Val, VGC, Oracle, Stein Correction.]
Figure 2.4: Estimating Performance across Policy Class. The first graph shows the estimates
of out-of-sample performance across the policy class for the parameter τ ∈ [25νmin,100νmin] and
µ0 = 1000 for one sample path when K = 400. The second graph is similar, but for K = 3200.
Both plots highlight the smoothness of VGC relative to the Stein-Correction.
[Figure 2.5 panels: maps colored by Time Saved (sec), with Regions (a) and (b) marked in each panel.]
Figure 2.5: Comparing Policy Decisions. Left (resp. right) panel shows routing decisions for
the policy selected by VGC (resp. cross-validation). Color indicates time-saved relative to an
ambulance-only policy (green is good, red is bad) computed relative to the ground truth. Although routing is largely similar, Regions (a) and (b) highlight some differences where the cross-validation policy makes poorer routing decisions (more orange dots). The larger black points are
drone depots.
of the Stein Correction. We also note that while it appears both Stein and VGC systematically overestimate performance, this is an artifact of the particular sample path chosen. By contrast, cross-validation does systematically underestimate performance, because it estimates the performance
of a policy trained with half the data, which is necessarily worse.
In our third experiment, we highlight the differences in the policy selected by the VGC estimator and the policy selected by cross-validation. In Fig. 2.5, we plot the routing decisions of each
policy and color code them by the true (oracle) amount of time saved. We highlight two regions
(labeled (a) and (b)) on the map where drones arrive after the ambulance. We see that in those regions, the cross-validation policy routes to more patients/regions where the drone arrives after the
ambulance, thus potentially wasting drone resources and availability. We see in the regions outside
of (a) and (b) that routing decisions of the two policies are similar and result in the drone arriving before the ambulance. Additionally, we note that the drone depots in the southern region of the
map are the same for both policies while the drone depots in the northern region are different.
[Figure 2.6 panels: (a) Fraction Served versus Coverage Ring (sec) for Cross-Val, VGC, and Full Info; (b) Time Saved (sec) versus Coverage Ring (sec), split into Northern Region and Southern Region, for Cross-Val and VGC.]
Figure 2.6: Estimating Performance across Policy Class. Each data point in the graph represents
the performance metric of each selected policy for a ring shaped region corresponding to distance
in time from a drone depot. Graph (a) shows the fraction of patients served in each region for the
patients serviced by the Southern depot in Fig. 2.5. Graph (b) plots the time saved by each policy.
The plots highlight the performance difference in routing decisions between the two policies.
In Fig. 2.6, we plot key performance metrics for regions organized by their distance from
a drone depot for these two policies. Specifically, we group OHCA events into “coverage rings”
based on the travel time from the depot to the event. Each ring is of “width” 100s. For example, the
800 seconds coverage ring corresponds to all OHCA events that are between 701 and 800 seconds
away from the drone depot. In Panel (a) of Fig. 2.6, we restrict attention to the southern region of Fig. 2.5, where both policies have selected the same drone depot, so that we can focus on routing decisions. We plot the fraction of patients served for each coverage ring. We see that the policy chosen by VGC is more conservative with routing in comparison to the policy chosen by cross-validation and more closely aligns with the full-information benchmark.
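A small sketch of the coverage-ring grouping used in these panels is below; variable names are illustrative.

import numpy as np

def fraction_served_by_ring(travel_time_sec, served, ring_width=100):
    # Assign each OHCA event to a coverage ring (e.g., 701-800s -> the 800s ring)
    # and compute the fraction of events in that ring where the drone arrived first.
    rings = np.ceil(travel_time_sec / ring_width) * ring_width
    return {int(r): served[rings == r].mean() for r in np.unique(rings)}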
In panel (b) of Fig. 2.6, we compare the time saved between the two policies. We organize
the regions into the North and South corresponding to the servicing depots. In the northern region,
we see that the VGC policy saves more time in coverage rings further away from the drone depot
by sacrificing time saved in closer coverage rings. This difference partially corresponds with the
more conservative routing decisions of the VGC, but also can be attributed to the choice of drone
depot. We see the VGC policy chooses a depot in a less densely populated region that is more centralized overall, while cross-validation chooses a depot closer to more densely populated
regions in terms of OHCA occurrences. In total, we see that the VGC policy saves 1.43% more
time in comparison to the cross-validation policy. However, if we break down the time saved with respect to minimum distance from a depot, we see that for patients within 600 seconds of an existing drone depot, the VGC performs less than 1% worse in comparison to the cross-validation policy. However, if we consider patients more than 600 seconds away from an existing drone depot, the VGC policy saves 13.8% more time in comparison to cross-validation. We interpret this to mean that both the VGC and cross-validation policies make similar-performing depot decisions,
but VGC makes significantly better routing decisions, particularly at long distances. Since these
long distances are precisely where the imprecision is most crucial, we argue this is a relevant
advantage.
Finally, we also compare how higher-fold cross-validation performs with respect to VGC. In (a) of Fig. 2.7, we first show how the cross-validation estimators for three different policies perform in estimation error relative to VGC as we vary the number of folds. For all the policies (where increasing τ corresponds to lower-variance policies), the plot shows the root mean-squared error (MSE) of cross-validation is uniformly larger than that of VGC over the folds considered. Furthermore, the plot highlights a drawback of cross-validation, which is that it is not even clear how to select the optimal number of folds in order to minimize the bias-variance trade-off of the cross-validation estimator. In comparison, the VGC with minimal guidance on the choice of the h parameter outperforms common choices of folds for cross-validation such as 5-fold cross-validation and leave-one-out cross-validation. In (b) of Fig. 2.7, we show how the performance of policy evaluation translates to policy learning in regions further away from the depot, where drone decisions are most crucial. As expected, VGC outperforms cross-validation across all choices of folds, and the fold that performs most similarly to VGC, 20-fold, also corresponds to the cross-validation fold with the lowest MSE. This observation suggests that with an "ideal" number of folds, cross-validation can perform well in this example, but identifying the right number of folds is non-trivial.
[Figure 2.7 panels: (a) Root MSE (% Full-Info) versus Folds for CV and VGC at τ = 0, τ = 8×10⁻⁷, and τ = 3.96×10⁻⁵; (b) VGC Performance Improvement versus Folds for OHCAs more than 810s, 870s, and 930s away.]
Figure 2.7: Varying Cross-Validation Folds. We plot the policy evaluation and learning performance of cross-validation with different folds across 500 simulations. In each simulation there are
100 samples of Z^i. In (a), we plot the mean squared error of cross-validation for three different
policies and compare them with the respective VGC estimates represented by the dotted lines. In
(b), we plot the percent improvement VGC has over cross-validation, so larger bars indicate lower
cross-validation performance.
2.6 Conclusion
Motivated by the poor performance of cross-validation in data-driven optimization problems where
data are scarce, we propose a new estimator of the out-of-sample performance of an affine plug-in
policy. Unlike cross-validation, our estimator avoids sacrificing data and uses all the data when
training, making it well-suited to settings with scarce data. We prove that our estimator is nearly
unbiased, and for “stable” optimization problems – problems whose optimal solutions do not
change too much when the coefficient of a single random component changes – the estimator’s
variance vanishes. Our notion of stability leads us to consider two special classes of weakly-coupled optimization problems: weakly-coupled-by-variables and weakly-coupled-by-constraints.
For each class of problems, we prove an even stronger result and provide high-probability bounds
on the error of our estimator that hold uniformly over a policy class. Additionally, in our analysis
of optimization problems weakly-coupled-by-constraints, we provide new insight on the stability
of the dual solutions. This insight may prove useful in other problems that leverage the dual solution, such as online linear programming.
Our work offers many exciting directions for future research. Our solution approach to the
weakly-coupled problems exploits the decomposability of the underlying optimization problems.
We believe that such an approach can be generalized to other settings. Finally, our analysis strongly
leverages the linearity of the affine plug-in policy class; it is an open question if similar debiasing
techniques might be developed to handle nonlinear objective functions as well.
2.7 Appendix: Background Results on Empirical Processes
In this appendix we collect some results on the suprema of empirical processes that we will require
in our proofs. All results are either known or easily derived from known results. Our summary is
necessarily brief and we refer the reader to Pollard, 1990 for a self-contained exposition.
Let Ψ(t) = (1/5) exp(t²). For any real-valued random variable Z, we define the Ψ-norm ∥Z∥_Ψ to be ∥Z∥_Ψ ≡ inf{C > 0 : E[Ψ(|Z|/C)] ≤ 1}. Random variables with finite Ψ-norm are sub-Gaussian random variables. We first recall a classical result on the suprema of sub-Gaussian processes over finite sets.
Theorem 2.7.1 (Suprema of Stochastic Processes over Finite Sets). Let f(θ) = (f_1(θ), ..., f_K(θ)) ∈ R^K be a vector of K independent stochastic processes indexed by θ ∈ Θ. Let F ∈ R^K_+ be a random vector such that |f_k(θ)| ≤ F_k for all θ ∈ Θ, k = 1, ..., K, and suppose there exists a constant M < ∞ such that |{f(θ) : θ ∈ Θ}| ≤ M almost surely. Then, for any R > 1, there exists an absolute constant C such that with probability 1 − e^{−R},
sup_{θ∈Θ} | ∑_{k=1}^K f_k(θ) − E[∑_{k=1}^K f_k(θ)] | ≤ C · R · ∥∥F∥_2∥_Ψ √(log M).
Proof: The result follows from the discussion leading up to Eq. (7.4) of Pollard, 1990 after noting that the entropy integral (J_n(ω) in the notation of Pollard, 1990) is at most 9∥F∥_2 √(log M) given the conditions of the theorem. □
When considering the suprema over potentially infinite sets, we must characterize the “size” of
{ f(θ ) : θ ∈ Θ} more carefully. Recall for any set F, the ε-packing number of F is the largest
number of points we can select in F such that no two points are within ℓ2 distance ε. We denote
this packing number by D(ε,F). We restrict attention to sets whose packing numbers do not grow
too fast.
Definition 2.7.2 (Euclidean Sets). We say a set F is Euclidean if there exist constants A and W such that
D(εδ, F) ≤ A ε^{−W}   ∀ 0 < ε < 1,
where δ ≡ sup_{f∈F} ∥f∥.
Furthermore, note that in the special case that F_k ≤ U, Theorem 2.7.1 bounds the suprema by a term that scales like U√K. This bound can be quite loose since f_k(θ) typically takes values much
smaller than U and is only occasionally large. Our next result provides a more refined bound on the
suprema when the pointwise variance of the process is relatively small and the relevant (random)
set is Euclidean. We stress the parameters A and W below must be deterministic.
Theorem 2.7.3 (Suprema of Stochastic Processes with Small Variance). Suppose that the set {f(θ) : θ ∈ Θ} ⊆ R^K is Euclidean with parameters A and W almost surely. Suppose also that
i) There exists a constant U such that sup_{θ∈Θ} ∥f(θ)∥_∞ ≤ U, almost surely, and
ii) There exists a constant σ² such that sup_{θ∈Θ} E[∥f(θ)∥_2^2] ≤ Kσ².
Then, there exists an absolute constant C such that for any R > 1, with probability at least 1 − e^{−R},
sup_{θ∈Θ} | ∑_{k=1}^K f_k(θ) − E[∑_{k=1}^K f_k(θ)] | ≤ C R · V(A, W) √K ( σ + U V(A, W)/√K ),
where V(A, W) ≡ log A + W √(log A).
Remark 2.7.4. Notice that when K is sufficiently large, the term in the parenthesis is dominated
by 2σ, and hence the bound does not depend on U. Theorem 2.7.3 is not tight in its dependence on
R. See, for example, Talagrand’s Inequality for the suprema of the empirical process (Wainwright,
2019). We prefer Theorem 2.7.3 to Talagrand’s inequality in what follows because it is somewhat
easier to apply and is sufficient for our purposes.
Proof of Theorem 2.7.3: For convenience in what follows, let F be the (random) set {f(θ) : θ ∈ Θ}, and let δ ≡ sup_{f∈F} ∥f∥_2.
Our goal will be to apply Theorem A.2 of GR 2021. That theorem shows that there exists an absolute constant C_1 such that
sup_{f∈F} | ∑_{k=1}^K f_k − E[∑_{k=1}^K f_k] | ≤ C_1 R V(A, W) ∥δ∥_Ψ.   (2.30)
Thus, the remainder of the proof focuses on bounding ∥δ∥_Ψ. As an aside, a naive bound is ∥δ∥_Ψ ≤ U√K, so we know that this value is finite. In what follows, we seek a stronger bound.
Write
δ² = sup_{f∈F} ∥f∥_2^2 ≤ sup_{f∈F} (∥f∥_2^2 − E[∥f∥_2^2]) + sup_{f∈F} E[∥f∥_2^2] ≤ sup_{f∈F} (∥f∥_2^2 − E[∥f∥_2^2]) + Kσ².
Let C_2 > 0 be a constant to be determined later. Dividing by C_2 and taking expectations above shows
E[e^{δ²/C_2}] ≤ e^{Kσ²/C_2} · E[e^{Z̄/C_2}],   (2.31)
where
Z̄ ≡ sup_{f∈F} (∥f∥_2^2 − E[∥f∥_2^2]) = sup_{f∈F} ∑_{k=1}^K (f_k^2 − E[f_k^2]).
Importantly, Z̄ is again the suprema of an empirical process, namely for the "squared" elements. Pollard, 1990 provides bounds on Z̄ in terms of the entropy integral of the process.
Specifically, let f² denote the vector whose j-th element is f_j². Let F² = {f² : f ∈ F}. Then the entropy integral of the squared process is defined to be
J̄ ≡ 9 ∫_0^{δ̄} √(log D(x, F²)) dx,
where δ̄ ≡ sup_{f∈F} ∥f²∥_2.
Then, in the discussion just prior to Eq. (7.4) of Pollard, 1990, it is proven that
E[e^{Z̄/∥J̄∥_Ψ}] ≤ 25.   (2.32)
Hence, to bound the right side of Eq. (2.31), we will next bound ∥J̄∥_Ψ. This in turn will allow us to bound ∥δ∥_Ψ and invoke Theorem A.2 of GR 2021.
To this end, observe that for any f, g ∈ F, we have
∥f² − g²∥² = ∑_{k=1}^K (f_k² − g_k²)² = ∑_{k=1}^K (f_k + g_k)² (f_k − g_k)² ≤ 4U² ∥f − g∥².
Hence, D(ε, F²) ≤ D(ε/(2U), F). Write
J̄ ≡ 9 ∫_0^{δ̄} √(log D(x, F²)) dx ≤ 9 ∫_0^{δ̄} √(log D(x/(2U), F)) dx = 2U · 9 ∫_0^{δ̄/(2U)} √(log D(x, F)) dx,
where the last equality is a change of variables.
We now claim we can upper bound this last expression by replacing the upper limit of integration with δ. Indeed, if δ̄/(2U) ≤ δ, then because the integrand is nonnegative, we only increase the integral. If δ̄/(2U) > δ, then note that
∫_δ^{δ̄/(2U)} √(log D(x, F)) dx = 0,
since D(x, F) = 1 for all x ≥ δ. Thus, in either case we can replace the upper limit of integration, yielding
J̄ ≤ 18U ∫_0^δ √(log D(x, F)) dx.
Recall the entropy integral of the original process is given by
J ≡ 9 ∫_0^δ √(log D(x, F)) dx.
Hence, J̄ ≤ 2U J.
Moreover, Theorem A.2 of GR 2021 shows that ∥J∥_Ψ ≤ C_3 ∥δ∥_Ψ V(A, W) for some absolute constant C_3. Thus we have successfully bounded
∥J̄∥_Ψ ≤ 2U C_3 ∥δ∥_Ψ V(A, W).
Substituting back into Eq. (2.32) shows that
E[exp(Z̄ / (2U C_3 ∥δ∥_Ψ V(A, W)))] ≤ 25.
Now choose C_2 in Eq. (2.31) to be C_2 = α · 2U C_3 ∥δ∥_Ψ V(A, W) for some α > 0 to be determined later. Substituting our bound on Z̄ into Eq. (2.31) shows
E[exp(δ² / (α 2U C_3 ∥δ∥_Ψ V(A, W)))] ≤ exp(Kσ² / (α 2U C_3 ∥δ∥_Ψ V(A, W))) · E[exp(Z̄ / (α 2U C_3 ∥δ∥_Ψ V(A, W)))] ≤ exp(Kσ² / (α 2U C_3 ∥δ∥_Ψ V(A, W))) · 25^{1/α},   (2.33)
where we have used α > 0 and Jensen's Inequality to simplify.
We now choose α large enough that the right side is at most 5. Taking logs, it suffices to choose α such that
log(5) ≥ (1/α) ( log(25) + Kσ² / (2U C_3 ∥δ∥_Ψ V(A, W)) )   ⟺   α ≥ 2 + Kσ² / (2U C_3 log(5) · ∥δ∥_Ψ V(A, W)).
Substituting into Eq. (2.33) shows
∥δ∥_Ψ² ≤ ( 2 + Kσ² / (2U C_3 log(5) · ∥δ∥_Ψ V(A, W)) ) · 2U C_3 ∥δ∥_Ψ V(A, W) = 4 C_3 U V(A, W) ∥δ∥_Ψ + Kσ² / log(5).
In summary, ∥δ∥_Ψ is at most the largest solution to the quadratic inequality
y² − b y − c ≤ 0,   where b = 4 C_3 U V(A, W) and c = Kσ² / log(5).
Bounding the largest root shows
y ≤ b/2 + √(b² + 4c)/2 ≤ b/2 + (b + 2√c)/2 = b + √c,   (Triangle Inequality)
or, in other words,
∥δ∥_Ψ ≤ 4 C_3 U V(A, W) + σ √K,
where we upper bounded (√(log 5))^{−1} ≤ 1.
Now simply substitute into Eq. (2.30) and collect constants to complete the proof. □
2.7.1 Method of Bounded Differences Excluding an Exceptional “Bad” Set
In our analysis, we utilize an extension of McDiarmid's inequality due to Combes, 2015. Recall, McDiarmid's inequality shows that for a random vector Z ∈ X with independent components and function f : X → R such that
|f(x) − f(y)| ≤ ∑_{i=1}^n c_i I{x_i ≠ y_i}   ∀ (x, y) ∈ X²,   (2.34)
for some c ∈ R^n, we have that
P{|f(Z) − E[f(Z)]| ≥ t} ≤ 2 exp( −2t² / ∑_{i=1}^n c_i² ).
The next result extends McDiarmid’s inequality to a setting where Eq. (2.34) only holds for all
(x, y) ∈ Y
2 where Y ⊆ X is a certain “good” set:
Theorem 2.7.5 (Combes, 2015). Let Z ∈ X be a random vector with independent components and f : X ↦ R be a function such that
|f(x) − f(y)| ≤ ∑_{j=1}^n c_j I{x_j ≠ y_j}   ∀ (x, y) ∈ Y²,
for some vector c ∈ R^n, where Y ⊆ X. Let c̄ = ∑_{i=1}^n c_i, and p = P{Z ∉ Y}. Then, for any t > 0,
P( |f(Z) − E[f(Z) | Z ∈ Y]| ≥ t + p c̄ ) ≤ 2 ( p + exp( −2t² / ∑_{i=1}^n c_i² ) ).
In particular, this implies that for any ε > 2p, with probability at least 1 − ε,
|f(Z) − E[f(Z) | Z ∈ Y]| ≤ p c̄ + ∥c∥_2 √( log( 2 / (ε − 2p) ) ).
Remark 2.7.6. In the special case that Y = X , then p = 0, the theorem recovers McDiarmid’s
inequality.
2.8 Appendix: Properties of the Variance Gradient Correction (VGC)
First, we state the relevant portion of Danskin’s Theorem for reference. See Bertsekas, 1997,
Section B.5 for a proof of a more general result:
Theorem 2.8.1 (Derivative Result of Danskin's Theorem). Let Z ⊆ R^m be a compact set, and let φ : R^n × Z ↦ R be continuous and such that φ(·, z) : R^n ↦ R is convex for each z ∈ Z. Additionally, define
Z(x) = { z̄ : φ(x, z̄) = max_{z∈Z} φ(x, z) }.
Consider the function f : R^n ↦ R given by
f(x) = max_{z∈Z} φ(x, z).
If Z(x) consists of a unique point z̄ and φ(·, z̄) is differentiable at x, then f is differentiable at x, and ∇f(x) = ∇_x φ(x, z̄), where ∇_x φ(x, z̄) is the vector with coordinates ∂φ(x, z̄)/∂x_i, i = 1, ..., n.
The remainder of the section contains proofs of the results in Section 2.3.
2.8.1 Proof of Theorem 2.3.2
This section contains the omitted proofs leading to the proof of Theorem 2.3.2. We first relate finite difference approximations of the subgradients of V(z + t e_j) to their true values.
Lemma 2.8.2 (Subgradients Bound Finite Difference Approximation). For any z ∈ R^n and t ∈ R, we have
a_j x_j(z + t e_j) t ≤ V(z + t e_j) − V(z) ≤ a_j x_j(z) t.
Proof: Let f(t) = V(z + t e_j). Recall that f(t) is concave and f′(t) = a_j x_j(z + t e_j) by Danskin's Theorem. Hence, by the subgradient inequality for concave functions, f(t) ≤ f(0) + f′(0) t and f(0) ≤ f(t) − t f′(t), and thus, t f′(t) ≤ f(t) − f(0) ≤ t f′(0). This is equivalent to a_j x_j(z + t e_j) t ≤ V(z + t e_j) − V(z) ≤ a_j x_j(z) t, which is the desired result. □
Equipped with Lemma 2.8.2, the proof of Lemma 2.3.3 is nearly immediate.
Proof. Proof of Lemma 2.3.3: The bounds in Lemma 2.8.2 show that aj
t(x j(z)−x j(z +te j)) ≥ 0.
If aj ≥ 0, it follows that t(x j(z) − x j(z +te j)) ≥ 0 for all t, which shows that t 7→ x j(z +te j) is
non-increasing. Similarly, if aj < 0, it follows that t(x j(z)−x j(z +te j)) ≤ 0 for all t, which shows
that t 7→ x j(z +te j) is non-decreasing.
Before proving Theorem 2.3.2, we establish the following intermediary result on the error of a
non-randomized forward-step finite difference.
Lemma 2.8.3 (Forward Step Finite Difference Error). Fix some j such that a_j ≠ 0 and 0 < h < 1/e. Then,
| E[ ξ_j x_j(Z) − (V(Z + h√ν_j ξ_j e_j) − V(Z)) / (h√ν_j a_j(θ)) ] | ≤ 4 h log( 1 / (h√ν_min) ).
In other words, the forward finite-step difference introduces a bias of order Õ(h).
Proof.Proof: From Lemma 2.8.2, we see that the term inside the expectation can be upper-bounded
by the non-negative term ξj
x j(Z)−x j(Z +h
√νjξje j)
. Hence,
E
ξjx j(Z)−
V(Z +h
√νjξje j)−V(Z)
h
√νja j(θ )
≤ E
ξj
x j(Z)−x j(Z +h
p
νjξje j)
.
To simplify notation, let g(t) = E
x j
Z−j +ξje j
|ξj = t
where Z−j
is identical to Z but has a 0
at the j
th component. Then,
E
ξj
x j(Z)−x j(Z +h
p
νjξje j)
=
Z ∞
−∞
t
g(t)−g(t +h
p
νj
t)
φj(t)dt
where φj(t) is the density for N (0,1/νj).
To bound the integral, choose a constant U > 0 (which we optimize later) and break the integral
into three regions, (−∞,−U),(−U,U),(U,∞). This yields the upper bound
Z ∞
−∞
t
g(t)−g(t +thp
νj)
φj(t)dt
≤
Z U
−U
U
g(t)−g(t +thp
νj)
φj(t)dt
| {z }
(a)
+
Z −U
−∞
|t|φj(t)dt +
Z ∞
U
|t|φj(t)dt
| {z }
(b)
.
We first bound (a). As the fist step, we attempt to remove the absolute value. From Lemma 2.8.2,
g(·) is a monotone function. We claim that for |t| < U,
g(t)−g(t +h
p
νj
t)
≤
g
t −Uhp
νj
−g
t +Uhp
νj
, (2.3
since (t −Uh√νj
,t +Uh√νj) always contains the interval (t,t +h
√νj
t). Let
b =
Uh√νj
if aj > 0,
−Uh√νj otherwise,
so that
g(t −Uh√νj)−g(t +Uh√νj)
= g(t −b)−g(t +b). Then,
Z ∞
−∞
g(t −b)−g(t +b)
φj(t)dt =
Z ∞
−∞
g
t +b
φj(t +2b)−φj(t)
dt, (Change of variables)
=
Z ∞
−∞
g(t +b)
Z t+2b
t
−νjzφj(z)dzdt, (since φ
′
j
(z) = −νjzφj(z))
≤ νj
Z ∞
−∞
|z|φj(z)
Z z
z−2b
g(t +b) dt dz, (Fubini’s Theorem)
≤ 2|b|νj
Z ∞
−∞
|z|φj(z)dz, (Since |g(t +b)| ≤ 1 for all t)
= 4b
r
νj
2π
≤ 2
r
2
π
Uhνj
.
To summarize, we have shown that the term (a) satisfies
Z U
−U
U
g(t)−g(t +thp
νj)
φj(t)dt ≤ 2
r
2
π
U
2
hνj
To bound (b), we see that first see that
Z −U
−∞
|t|φj(t)dt +
Z ∞
U
|t|φj(t)dt = 2
Z ∞
U
tφj(t)dt = 2
s
1
2πνj
exp
−U
2νj
2
where the first equality holds by symmetry.
Putting the bounds of (a) and (b) together, we have
Z ∞
−∞
t
g(t)−g(t +thp
νj)
φj(t)dt
≤ 2
r
2
π
U
2
hνj +2
s
1
2πνj
exp
−U
2νj
2
We approximately balance the two terms by letting U
2 =
2
νj
log
1
h
√νj
. Substituting and simplifying yields
4
r
2
π
hlog
1
h
√νj
+h
r
2
π
≤ 4
r
2
π
hlog
1
h
√
νmin
+h
r
2
π
.
To simplify, note that h < 1/e and νmin ≤ 1 implies that log
1
h
√νj
≥ 1. Hence, combining the
two terms and simplifying provides a bound of 10q
1
2π
hlog
1
h
√
νmin
. Note that 10/
√
2π ≤ 4 to
complete the proof.
We can now prove Theorem 2.3.2.
Proof of Theorem 2.3.2: Notice that if a_j(θ) = 0, then the j-th term contributes nothing to the bias because x(Z) is independent of Z_j, so E[ξ_j x_j(Z)] = 0 = E[D_j(Z)]. Hence, we focus on terms j where a_j(θ) ≠ 0.
Decompose the j-th term as
E[ξ_j x_j(Z) − D_j(Z)] = E[ ξ_j x_j(Z) − (V(Z + h√ν_j ξ_j e_j) − V(Z)) / (h√ν_j a_j(θ)) ]   (a)
  + E[ E[ (V(Z + h√ν_j ξ_j e_j) − V(Z)) / (h√ν_j a_j(θ)) − (V(Z + δ_j e_j) − V(Z)) / (h√ν_j a_j(θ)) | Z ] ].   (b)
We first bound (b). Canceling out the V(Z) terms yields
(1 / (h√ν_j a_j(θ))) E[ V(Z + h√ν_j ξ_j e_j) − V(Z + δ_j e_j) ].
From our previous discussion, V(Z + h√ν_j ξ_j e_j) ∼_d V(Z + δ_j e_j), whereby
E[ V(Z + h√ν_j ξ_j e_j) − V(Z + δ_j e_j) ] = 0.
Lemma 2.8.3 bounds (a) by 4 h log( 1 / (h√ν_min) ). Summing over j gives us our intended bound. □
2.8.2 Properties of VGC
We next establish smoothness properties of the VGC.
Proof. Proof of Lemma 2.3.7. We begin with i). We first claim that θ 7→ V(z,θ ) is Lipschitz
continuous with parameter Ln(1+∥z∥∞). To this end, write
V(z,
¯θ )−V(z,θ ) =
r
z,
¯θ
−r(z,θ )
⊤
x(z,
¯θ )−r(z,θ )
⊤
x(z,θ )−x(z,
¯θ )
| {z }
≤0 by optimality of x(z,θ )
≤
r
z,
¯θ
−r(z,θ )
⊤
x(z,
¯θ )
≤ ∥r(z,
¯θ )−r(z,θ )∥1∥x(z,
¯θ )∥∞
≤ ∥r(z,
¯θ )−r(z,θ )∥1 (since x(z,
¯θ ) ∈ X ⊆ [0,1]
n
)
≤
n
∑
j=1
aj(
¯θ )−aj(θ )
zj
+
bj(
¯θ )−bj(θ )
≤
n
∑
j=1
L∥z∥∞∥
¯θ −θ ∥+L∥
¯θ −θ ∥
= Ln ·(1+∥z∥∞)∥
¯θ −θ ∥.
Reversing the roles of θ and ¯θ yields an analogous bound, and, hence,
r
z,
¯θ
⊤
x(z,
¯θ )−r(z,θ )
⊤
x(z,θ )
≤ Ln(1+∥z∥∞
)
¯θ −θ
This proves the first statement.
Next, we claim for any z,
1
aj(
¯θ )
V(z,
¯θ )−
1
aj(θ )
V(z,θ )
≤
2nL
amin
amax
amin
∥z∥∞ +
amax +bmax
amin
∥
¯θ −θ ∥. (2.
Write
1
aj(
¯θ )
V(z,
¯θ )−
1
aj(θ )
V(z,θ )
=
aj(θ )V(z,
¯θ )−aj(
¯θ )V(z,θ )
aj(θ )aj(
¯θ )
≤
V(z,
¯θ )−V(z,θ )
aj(
¯θ )
+
V(z,θ )(aj(θ )−aj(
¯θ ))
aj(θ )aj(
¯θ )
,
≤
Ln(1+∥z∥∞)∥
¯θ −θ ∥
amin
+
|V(z,θ )|L∥
¯θ −θ ∥
a
2
min
,
where the first inequality follows by adding and subtracting aj(θ )V(θ ) in the numerator, and the
second inequality follows from the Lipschitz continuity of aj(θ ) and V(z,θ ) (Assumption 2.3.6).
Next note that
|V(z,θ )| ≤ ∥r(z,θ )∥1∥x(z,θ )∥∞ ≤ ∥a(θ ) ◦ z∥1 +∥b(θ )∥1 ≤ n∥z∥∞amax +nbmax.
Substituting above and simplifying proves Eq. (2.36)
We can now prove the lemma. Fix a component j. Then,
Dj(z,
¯θ )−Dj(z,θ ) = E
"
1
h
√νjaj(
¯θ )
V(Z +δje j
,
¯θ )−V(Z,
¯θ )
| Z = z
#
−E
1
h
√νjaj(θ )
V(Z +δje j
,θ )−V(Z,θ )
| Z = z
=
1
h
√νj
E
1
aj(
¯θ )
V(Z +δje j
,
¯θ )−
1
aj(θ )
V(Z +δje j
,θ ) | Z = z
+
1
h
√νj
E
1
aj(θ )
V(Z,θ )−
1
aj(
¯θ )
V(Z,
¯θ ) | Z = z
.
Hence, by taking absolute values and applying Eq. (2.36) twice we obtain
Dj(z,
¯θ )−Dj(z,θ )
≤
2nL
hamin
amax
amin
E
∥Z +δje j∥∞ | Z = z
+
amax +bmax
amin
∥
¯θ −θ ∥
+
2nL
hamin
amax
amin
∥z∥∞ +
amax +bmax
amin
∥
¯θ −θ ∥,
where we have passed through the conditional expectation. Finally, note that E
∥Z +δje j∥∞ | Z = z
≤
∥z∥∞ + E
δj
≤ ∥z∥∞ +
p
h
2 +2h/νj by Jensen’s inequality. We simplify this last expression
by noting for h < 1/e, h
2 < h, so that
q
h
2 +2h/νj ≤
√
h
p
1+2/νmin ≤ 2
s
h
νmin
,
using νmin ≤ 1. Thus, E
∥Z +δje j∥∞ | Z = z
≤ ∥z∥∞ +2
q h
νmin
. Substituting above and collecting terms yields
4nL
hamin
amax
amin
∥z∥∞ +
amax +bmax
amin
+
amax
amin s
h
νmin!
∥
¯θ −θ ∥. (2.37)
We can simplify this expression by letting
C3 ≥
4
amin
·max
amax
amin
,
amax +bmax
amin
,
amax
amin
.
Then Eq. (2.37) is at most
C3nL
h
∥z∥∞ +1+
s
h
νmin!
∥
¯θ −θ ∥2 ≤
C3nL
h
∥z∥∞ +
2
√
νmin
∥
¯θ −θ ∥2 ≤
2C3nL
h
√
νmin
(∥z∥∞ +1)∥
¯θ −θ ∥2,
where we have used the bounds on the precisions (Assumption 2.3.1) and h < 1/e to simplify.
Letting C1 = 2C3 proves the first part of the theorem.
To complete the proof, we require a high-probability bound on ∥Z∥∞. Since Z − µ is subGaussian, such bounds are well-known (Wainwright, 2019), and we have with probability 1−e
−R
,
∥Z∥∞ ≤ Cµ +∥Z − µ∥∞ ≤ Cµ +
C4
√
νmin
p
logn
√
R,
for some universal constant C4. Substitute this bound into our earlier Lipschitz bound for an
arbitrary z, and use the Assumption 2.3.1, h < 1/e, and R > 1 to collect terms and simplify. We
then sum over the n terms of D(Z,θ ) to complete the proof for i).
We now bound ii). Focusing on the j
th component of D(Z,(θ ,h)) and writing
Dj(Z,(θ ,h)) ≡ Dj
Z,h,δ
h
j
,θ
= E
"
V(Z +δ
h
j
e j
,θ )−V(Z,θ )
aj(θ )h
√νj
Z
#
,
we see that
Dj
Z,h,δ
h
j
,θ
−Dj
Z,h,δ
h
j
,θ
= Dj
Z,h,δ
h
j
,θ
−Dj
Z,h,δ
h
j
,θ
| {z }
(a)
+Dj
Z,h,δ
h
j
,θ
−Dj
Z,h,δ
h
j
,θ
| {z }
(b)
.
To bound (a) and (b), we see from the proof of Lemma 2.3.8 that,
V(Z,θ )−V(Z +Y e j
,θ )
aj(θ )
≤ |Y|. (2.38)
We first bound (a). We see
Dj
Z,h,δ
h
j
,θ
−Dj
Z,h,δ
h
j
,θ
=
h−h
hh
E
"
V(Z +δ
h
j
e j
,θ )−V(Z,θ )
√νj
Z
#
≤
h−h
h
2
min
E
"
|δ
h
j
|
√νj
Z
#
, by Eq. (2.38)
≤
h−h
h
2
min
1
√
νmins
3h
√
νmin
≤
√
3
h−h
h
2
minν
3/4
min
,
where the second to last inequality applies the inequality E
h
|δ
h
j
|
i
= E
hq
|δ
h
j
|
2
i
≤
r
E
h
|δ
h
j
|
2
i
≤
q
√
3h
νmin
. We then bound (b). We see
Dj
Z,h,δ
h
j
,θ
−Dj
Z,h,δ
h
j
,θ
=
E
"
V(Z +δ
h
j
e j
,θ )−V(Z +δ
h
j
e j
,θ )
aj(θ )h
√νj
Z
#
=
1
h
√νj
E
h
f
δ
h
j
− f
δ
h
j
Z
i
≤
1
h
√νj
W2
δ
h
j
,δ
h
j
, by Wainwright (2019, pg. 76)
The Wasserstein distance between two mean-zero Gaussians is known in closed form:
W2
δ
h
j
,δ
h
j
=
s
h
2 +
2h
√νj
−
s
h
2
+
2h
√νj
≤
vuut
h
2 −h
2
+
2
h−h
√νj
≤
s
2+
2
√νj
h−h
,
where the first inequality comes from the common inequality
√
a−
√
b
≤
p
|a−b|. Thus,
Dj
Z,h,δ
h
j
,θ
−Dj
Z,h,δ
h
j
,θ
≤
1
hmin√
νmins
2+
2
√
νmin
h−h
.
Collecting constants of the bounds of (a) and (b), we obtain our result.
We now show that the components of the VGC are bounded.
Proof. Proof of 2.3.8: We see
V(Z +δje j
,θ )−V(Z,θ )
aj(θ )h
√νj
=
1
aj(θ )h
√νj
r(Z,θ )
⊤
x(Z +δje j
,θ )−x(Z,θ )
| {z }
≤0 by optimality of x(Z,θ )
+ aj(θ )δjx j(Z +δje j
,θ )
≤
δjx j(Z +δje j
,θ )
h
√νj
≤
δj
h
√
νmin
.
By an analogous argument,
V(Z,θ )−V(Z +δje j
,θ )
aj(θ )h
√νj
≤
δj
h
√
νmin
.
Taking the conditional expectation, we see
Dj(z)
≤ E
V(Z +δje j
,θ )−V(Z,θ )
aj(θ )h
√νj
Z
≤ E
"
δj
h
√
νmin#
≤
√
3
ν
3/4
min
√
h
.
where the first inequality holds by Jensen’s inequality and the last inequality holds by Jensen’s
inequality as E
δj
≤ E
hq
δ
2
j
i
≤
r
E
h
δ
2
j
i
=
q
h
2 +2h/
√νj ≤
q
√
3h
νmin
.
2.8.3 Bias Under Violations of Assumption 2.2.4
In cases where the precisions ν_j are not known but estimated by a quantity ν̃_j, we can construct the VGC in the same fashion, but replacing instances of ν_j with ν̃_j, giving us
∑_{j : a_j ≠ 0} ( 1 / (h√ν̃_j a_j(θ)) ) E[ V(µ + ξ + δ̃_j e_j) − V(µ + ξ) | Z ],
where δ̃_j ∼ N(0, h² + 2h/√ν̃_j). The bias of this VGC is similar to Theorem 2.3.2, except that it picks up an additional bias term due to the approximation error incurred from ν̃_j, which we quantify in the following lemma.
Lemma 2.8.4 (Bias of VGC with Estimated Precisions). Suppose Assumption 2.3.1 holds. Let ν̃_j be an estimate of ν_j, let δ̃_j ∼ N(0, h² + 2h/√ν̃_j), and assume ν_min ≤ min_j ν̃_j. For any 0 < h < 1/e, there exists a constant C dependent on ν_min such that
| E[ ∑_{j=1}^n ξ_j x_j(Z) − ∑_{j : a_j ≠ 0} E[ (V(Z + δ̃_j e_j) − V(Z)) / (a_j(θ) h√ν̃_j) | Z ] ] |
  ≤ C · n h log(1/h) + (C/√h) ∑_{j : a_j ≠ 0} ( |ν_j^{1/2} − ν̃_j^{1/2}| + √|ν_j^{1/2} − ν̃_j^{1/2}| ).
Proof. Proof of Lemma 2.8.4 Move the inner conditional expectation outwards and consider a sample path with a fixed Z. Let Dj(t) ≡
V(Z+te j)−V(Z)
ajh
√
ν˜ j
if aj ̸= 0 and 0 otherwise, so that E
h
Dj(
˜δj) | Z
i
is the j
th component of the VGC with the estimated precisions and qν˜ j
νj
E
Dj(δj) | Z
is the j
th
component of the VGC with the correct νj
. Fix some j
th where aj ̸= 0. Note,
ξjx j(Z)−Dj(
˜δj) =
ξjx j(Z)−
V(Z +δje j)−V(Z)
ajh
√νj
+
s
ν˜j
νj
Dj(δj)−Dj(
˜δj)
!
The expectation of the first term was bounded in Theorem 2.3.2, so we focus on the expectation of
the second. We see
s
ν˜j
νj
Dj(δj)−Dj(
˜δj) = s
ν˜j
νj
!
Dj(δj)−Dj(
˜δj)
| {z }
(a)
+
1−
s
ν˜j
νj
!
Dj(
˜δj)
| {z }
(b)
.
To bound the expectation of (a), we see first see t 7→ h
p
ν˜jDj(t) is 1−Lipschitz because
∂
∂t
h
p
ν˜jDj(t) = ∂
∂t
V(Z +te j)−V(Z)
aj
=
1
aj
ajx j(Z) = x j(Z)
by Danskin’s theorem and because x j(Z) is between 0 and 1. Thus, by Wainwright, 2019, pg. 76
E
h
h
p
ν˜j
Dj(δj)−Dj(
˜δj)
Z
i
≤ W2
δj
,
˜δj
,
where W2
δj
,
˜δj
is the Wasserstein distance between two mean-zero Gaussians δj and ˜δj
, which
is known in closed form:
W2
δj
,
˜δj
=
s
h
2 +
2h
ν˜
1/2
j
−
s
h
2 +
2h
ν
1/2
j
≤
vuuut
2h
ν˜
1/2
j
−
2h
ν
1/2
j
≤
vuuut2h
ν
1/2
j −ν˜
1/2
j
νmin
where νmin ≤ minj
min
νj
,ν˜j
.
To bound the expectation of (b), we see
E
h
Dj(
˜δj)
Z
i
≤ E
aj
˜δj
|aj
|ν˜
1/2
j
h
Z
=
1
ν˜
1/2
j
h
vuuut
2
π
h
2 +
2h
ν˜
1/2
j
≤
s
2
π
3
νminh
where the first equality holds by directly evaluating the expectation and the last inequality holds
because h < 1/e and νmin ≤ 1.
Putting it all together, we see
E
"
n
∑
j=1
ξjx j(Z)−
n
∑
j=1
V(Z + ˜δje j)−V(Z)
ajh
p
ν˜j
#
≤
E
"
n
∑
j=1
ξjx j(Z)−
s
ν˜j
νj
Dj(δj)
#
+
n
∑
j=1
E
"
E
"s
ν˜j
νj
Dj(δj)−Dj(
˜δj)
Z
##
≤
n
∑
j=1
C· hlog
1
h
+ ∑
j:aj̸=0
1
h
√
νmin
vuuut2h
ν
1/2
j −ν˜
1/2
j
νmin
+
ν
1/2
j −ν˜
1/2
j
νmin
s
2
π
3
hνmin
≤
n
∑
j=1
C· hlog
1
h
+ ∑
j:aj̸=0
r
2
ν
1/2
j −ν˜
1/2
j
√
hν
3/2
min
+
ν
1/2
j −ν˜
1/2
j
q
6
π
√
hν
3/2
min
where the first inequality follows from triangle inequality and the second from applying Theorem 2.3.2 and our bounds on (a) and (b). Collecting constants we obtain our intended result.
We now consider the case where the ξ_j are not Gaussian but only sub-Gaussian.
Lemma 2.8.5 (Bias of VGC when the Gaussian Assumption Is Violated). Suppose Assumption 2.3.1 holds. Let ξ_j be a mean-zero, sub-Gaussian random variable with variance proxy at most σ² that admits a density φ(·). Additionally, let ξ̄_j ∼ N(0, 1/√ν_j) with density φ̄(·). Then, there exists a dimension-independent constant C such that
| E[ ∑_{j=1}^n ξ_j x_j(Z) − ∑_{j : a_j ≠ 0} E[ (V(Z + δ_j e_j) − V(Z)) / (a_j(θ) h√ν_j) | Z ] ] |
  ≤ n ( σ√(2π) − log( ∥φ − φ̄∥_1 / 4 ) ) ∥φ − φ̄∥_1 + C n h log(1/h) + ∑_{j : a_j ≠ 0} W_2(ξ_j, ξ̄_j).
Proof. Proof of Lemma 2.8.5 Let Z¯ be Z, but with the j
th component replaced by Z¯
j = µj + ¯ξj and
let
Dj(t) = 1
ajh
√νj
E
V
Z +
δj +t −ξj
e j
−V
Z +
t −ξj
e j
Z
.
We see
E
ξjx j(Z)−Dj(ξj)
≤
E
E
ξjx j(Z)− ¯ξjx j(Z¯)
Z
−j
+
E
¯ξjx j(Z¯)−Dj(
¯ξj)
+
E
E
Dj(
¯ξj)−Dj(ξj)
Z
−j
By Lemma C.2 GR 2021, we see
E
ξjx j(Z)− ¯ξjx j(Z¯)
Z
−j
≤ T
φ −φ¯
1 +4exp
−
T
2
2σ2
T +σ
√
2π
.
To optimize T, we first upperbound the latter term as follows
4exp
−
T
2
2σ2
T +σ
√
2π
= 4exp
−
T
2
2σ2
+log
T +σ
√
2π
≤ 4exp
−
T
2
2σ2
+
T +σ
√
2π −1
, since logt < t −1
= 4exp
−
T
2
2σ2
+2T +σ
√
2π −1−T
≤ 4exp
σ
√
2π −1−T
where the last inequality used the fact that the quadratic −
T
2
2σ2 + 2T is maximized at T
∗ = 4σ
2
.
Substituting the upperbound, we see
T
φ −φ¯
1 +4exp
−
T
2
2σ2
T +σ
√
2π
≤ T
φ −φ¯
1 +4exp
σ
√
2π −1−T
We see the right hand side is minimized at T
∗ = σ
√
2π −1−log
∥φ−φ¯∥1
4
. Thus, we see
E
ξjx j(Z)− ¯ξjx j(Z¯)
Z
−j
≤
σ
√
2π −log
φ −φ¯
1
4
!!
φ −φ¯
1
By Theorem 2.3.2, we see
E
¯ξjx j(Z¯)−Dj(Z¯)
Z
−j
≤ Chlog
1
h
.
Finally, since t 7→ h
√νjDj(t) is 1−Lipschitz from Lemma 2.8.4, we see that
E
Dj(Z¯)−Dj(Z)
Z
−j
≤ W2(ξj
,
¯ξj).
Putting it all together, we see
E
ξjx j(Z)−Dj(ξj)
≤
σ
√
2π −log
φ −φ¯
1
4
!!
φ −φ¯
1 +Chlog
1
h
+W2(ξj
,
¯ξj).
Summing over the j terms, we obtain our result
2.8.4 Proof of Theorem 2.3.5.
Before proving the theorem, we require the following lemma.
Lemma 2.8.6 (A χ²-Tail Bound). Consider δ = (δ_1, ..., δ_n)^⊤ where δ_j is defined as in the definition of the VGC (Eq. (2.9)) and 0 < h < 1/e. Suppose Assumption 2.3.1 holds. Then,
E[ ∥δ∥_2^2 I{ ∥δ∥_2^2 > 18hn/√ν_min } ] ≤ (36hn/√ν_min) e^{−n}.
Proof. Proof of Lemma 2.8.6. Let Y1,...,Yn be independent standard normals.
Observe that since h < 1/e, the variance of δj
is at most h
2+ √
h
νj
≤ 2h/
√
νmin. Then, for t > 1
P
∥δ ∥
2
2 >
2hn
√
νmin
(1+t)
≤ P
2h
√
νmin
n
∑
j=1
Y
2
j >
2hn
√
νmin
(1+t)
!
= P
1
n
n
∑
j=1
Y
2
j > 1+t
!
≤ e
−nt/8
, (2.39)
where the last inequality follows from Wainwright, 2019, pg. 29.
Next, by the tail formula for expectation,
E
∥δ ∥
2
2
I
∥δ ∥
2
2 >
18hn
√
νmin =
Z ∞
0
P
∥δ ∥
2
2
I
∥δ ∥
2
2 >
18hn
√
νmin
> t
dt
=
Z √
18hn
νmin
0
P
∥δ ∥
2
2 >
18hn
√
νmin
dt +
Z ∞
√
18hn
νmin
P
∥δ ∥
2
2 > t
dt
≤
18hn
√
νmin
e
−n +
Z ∞
√
18hn
νmin
P
∥δ ∥
2
2 > t
dt (Applying Eq. (2.39))
≤
18hn
√
νmin
e
−n +
2hn
√
νmin
Z ∞
8
P
∥δ ∥
2
2 >
2hn
√
νmin
(1+s)
ds
≤
9hn
√
νmin
e
−n +
2hn
√
νmin
Z ∞
8
e
−ns/8
ds (Applying Eq. (2.39))
=
18hn
√
νmin
e
−n +
16h
√
νmin
e
−n
Rounding up and combining proves the theorem.
We can now prove the theorem.
Proof. Proof of Theorem 2.3.5. Proceeding as in the main body, we bound each of the three terms
of the out-of-sample estimator error (Eq. (2.16)). Before beginning, note that under Assumption 2.3.1, Var(δj) ≤ √
3h
νmin
. We use this upper bound frequently.
We start with Eq. (2.16b). Consider the k
th non-zero element of the sum. By definition of D
R,
D
R
(Z,δ ,U˜ )−D
R
(Z,δ
k
,U˜ )
=
δk
h
√
νkak
xk(Z +δkU˜
kek)−
δk
h
√
νkak
xk(Z +δkU˜
kek)
≤
1
h
√
νkak
(|δk
|+
δk
)
Hence, squaring and taking expectations,
E
D
R
(Z,δ ,U˜ )−D
R
(Z,δ
k
,U˜ )
2
≤
2
h
2νka
2
k
E
δ
2
k
+E
h
δ
2
k
i
≤
12
hν
3/2
minamin
.
Summing over k shows
Eq. (2.16b) ≤
6n
hν
3/2
minamin
.
We now bound Eq. (2.16c). Again, consider the k
th non-zero element. By definition,
D
R
(Z,δ ,U˜ )−D
R
(Z,δ ,U˜
k
)
=
|δk
|
h
√
νkak
xk(Z +δkUkek)−xk(Z +δkUkek)
≤
2|δk
|
h
√
νminamin
Hence,
E
D
R
(Z,δ ,U˜ )−D
R
(Z,δ ,U˜
k
)
2
≤
4
h
2νmina
2
min
E
δ
2
k
≤
12
hν
3/2
mina
2
min
Summing over k shows
Eq. (2.16c) ≤
6n
hν
3/2
mina
2
min
.
Finally, we bound Eq. (2.16a). For convenience, let W k ∈ R
n be the random vector with components Wk j = x j(Z +δjUe˜
j)−x j(Z
k +δjUe˜
j). Then, proceeding as in the main text, we have
D
R
(Z)−D
R
(Z
k
)
2
≤
∥δ ∥
2
2
h
2a
2
minνmin
·
n
∑
j=1
x j(Z +δjUe˜
j)−x j(Z
k +δjUe˜
j)
2
≤
∥δ ∥
2
2
h
2a
2
minνmin
· ∥W k∥
2
2
Notice that E
∥δ ∥
2
2
= O(nh/νmin). We upper bound this expression by splitting on cases
where ∥δ ∥
2
2 > √
18hn
νmin
or not. Note this quantity is much larger than the mean, so we expect contributions when ∥δ ∥
2
2
is large to be small.
Splitting the expression yields
D
R
(Z)−D
R
(Z
k
)
2
≤
∥δ ∥
2
2
h
2a
2
minνmin
· ∥W k∥
2
2
I
∥δ ∥
2
2 >
18hn
√
νmin
+
∥δ ∥
2
2
h
2a
2
minνmin
· ∥W k∥
2
2
I
∥δ ∥
2
2 ≤
18hn
√
νmin
≤
n
h
2a
2
minνmin
∥δ ∥
2
2
I
∥δ ∥
2
2 >
18hn
√
νmin
+
18n
ha2
minν
3/2
min
· ∥W k∥
2
2
Next take an expectation and apply Lemma 2.8.6 to obtain
E
D
R
(Z)−D
R
(Z
k
)
2
≤
36n
2
ha2
minν
3/2
min
e
−n +
18n
ha2
minν
3/2
min
·E
∥W k∥
2
2
Finally summing over k shows
Eq. (2.16a) ≤
36n
3
ha2
minν
3/2
min
e
−n +
18n
3
ha2
minν
3/2
min
·
1
n
2
n
∑
k=1
E
∥W k∥
2
2
Finally, we combine all three terms in Eq. (2.16) yielding
Var(D(Z)) ≤
6n
hν
3/2
minamin
+
6n
hν
3/2
mina
2
min
+
36n
3
2ha2
minν
3/2
min
e
−n +
18n
3
2ha2
minν
3/2
min
·
1
n
2
n
∑
k=1
E
∥W k∥
2
2
≤
C2
h
max(n
3−α
,n),
by collecting the dominant terms.
2.8.5 Implementation Details
As mentioned, in our experiments we utilize a second-order forward finite difference approximation. Namely, instead of approximating the derivative using a first-order approximation as in
Eq. (2.12), we approximate
∂/∂λ V(Z + λ ξ_j e_j) |_{λ=0} = (1 / (2h√ν_j a_j)) ( 4V(Z + h√ν_j ξ_j e_j) − V(Z + 2h√ν_j ξ_j e_j) − 3V(Z) ) + O(h²).
The coefficients in this expansion can be derived directly from a Taylor series. We then use randomization to replace the unknown h ξ_j and 2h ξ_j terms as before. The j-th element of our estimator becomes
D_j(Z) ≡ E[ (1 / (2h√ν_j a_j)) ( 4V(Z + δ_j^h e_j) − V(Z + δ_j^{2h} e_j) − 3V(Z) ) | Z ],
where δ_j^h ∼ N(0, h² + 2h/√ν_j) and δ_j^{2h} ∼ N(0, 4h² + 4h/√ν_j).
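A Monte Carlo sketch of one component D_j(Z) under this second-order scheme follows. The objective oracle V and the sample count are placeholders; independent draws of δ_j^h and δ_j^{2h} suffice here because the conditional expectation is linear in the two perturbation terms.

import numpy as np

def vgc_component_mc(V, Z, j, a_j, nu_j, h, n_mc=1000, seed=0):
    # Estimate D_j(Z) = E[(4V(Z + d1 e_j) - V(Z + d2 e_j) - 3V(Z)) / (2 h sqrt(nu_j) a_j) | Z]
    # by sampling d1 ~ N(0, h^2 + 2h/sqrt(nu_j)) and d2 ~ N(0, 4h^2 + 4h/sqrt(nu_j)).
    rng = np.random.default_rng(seed)
    d1 = rng.normal(0.0, np.sqrt(h**2 + 2*h/np.sqrt(nu_j)), size=n_mc)
    d2 = rng.normal(0.0, np.sqrt(4*h**2 + 4*h/np.sqrt(nu_j)), size=n_mc)
    e_j = np.zeros_like(Z)
    e_j[j] = 1.0
    base = V(Z)
    vals = [4*V(Z + s1*e_j) - V(Z + s2*e_j) - 3*base for s1, s2 in zip(d1, d2)]
    return np.mean(vals) / (2*h*np.sqrt(nu_j)*a_j)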
As mentioned, one can always compute the above conditional expectation by Monte Carlo simulation. In special cases, a more computationally efficient method is to utilize a parametric programming algorithm to determine the values of x(Z + t e_j) as t ranges over R. Importantly, for many classes of optimization problems, including, e.g., linear optimization and mixed-binary linear optimization, x(Z + t e_j) is piecewise constant on the intervals (c_i, c_{i+1}), taking value x^i, for i = 1, ..., I, with c_0 = −∞ and c_I = ∞. In this case,
E[ V(Z + δ_j e_j) | Z ] = ∑_{i=0}^{I−1} r(Z)^⊤ x^i · ∫_{c_i}^{c_{i+1}} φ_{δ_j}(t) dt + √(σ²/(2π)) exp( −(c_j − Z_j)² / (2σ²) ),   (2.40)
where φ_{δ_j}(·) is the pdf of δ_j. These integrals can then be evaluated in closed form in terms of the standard normal CDF. A similar argument holds for E[V(Z + δ_j^{2h} e_j) | Z]. We follow this strategy in our case study in Section 2.5.
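As a sketch of this parametric-programming route, the leading sum in Eq. (2.40) reduces to CDF differences once the breakpoints c_i and the piecewise values r(Z)^⊤x^i are known; the density correction terms in Eq. (2.40) are omitted below for brevity, and all names are illustrative.

import numpy as np
from scipy.stats import norm

def piecewise_expectation(breakpoints, piece_values, sigma_delta):
    # breakpoints: c_1 < ... < c_{I-1} at which x(Z + t e_j) changes;
    # piece_values: r(Z)^T x^i on each of the I intervals (c_i, c_{i+1});
    # sigma_delta: standard deviation of the Gaussian perturbation delta_j.
    c = np.concatenate(([-np.inf], np.asarray(breakpoints, dtype=float), [np.inf]))
    probs = np.diff(norm.cdf(c / sigma_delta))  # P(delta_j in (c_i, c_{i+1}])
    return float(np.dot(piece_values, probs))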
2.9 Appendix: Problems that are Weakly Coupled by Variables
2.9.1 Convergence of In-Sample Optimism
In this section we provide a high-probability bound on
sup_{x^0 ∈ X^0} sup_{θ ∈ Θ} ξ^⊤ x(Z, θ, x^0) − E[ ξ^⊤ x(Z, θ, x^0) ].
As a first step, we bound the cardinality of
X^{Θ, X^0}(Z) ≡ { (x^0, x^1(Z, θ, x^0), ..., x^K(Z, θ, x^0)) : θ ∈ Θ, x^0 ∈ X^0 } ⊆ R^n.
Lemma 2.9.1 (Cardinality of Lifted, Decoupled Policy Class). Under the assumptions of Theorem 2.4.3, there exists an absolute constant C such that
log |X^{Θ, X^0}(Z)| ≤ dim(φ) · log( C K |X^0| X_max² ).
Proof.Proof: We adapt a hyperplane arrangement argument from Gupta and Kallus, 2021. We
summarize the pertinent details briefly. For any x
0 ∈ X 0
, x
k
(Z,θ , x
0
) is fully determined by the
relative ordering of the values
n
φ(θ )
⊤gk(Z
k
, x
k
, x
0
) : x
k ∈ X k
(x
0
)
o
.
This observation motivates us to consider the hyperplanes in R
dim(φ)
Hk,Z,x
0 (x
k
, x¯
k
) = n
φ(θ ) : φ(θ )
⊤
gk(Z
k
, x
k
, x
0
)−gk(Z
k
, x¯
k
, x
0
)
= 0
o
(2.41)
for all x
k
, x¯
k ∈ Ext
X k
(x
0
)
, k = 1,...,K, and x
0 ∈ X 0
.
On one side of Hk,Z,x
0 (x
k
, x¯
k
), x
k
is preferred to x¯
k
in the policy problem Eq. (2.20), on the other
side x¯
k
is preferred, and on the hyperplane we are indifferent. Thus, if we consider drawing all such
hyperplanes in R
dim(φ)
, then the vector x(Z,θ , x
0
) is constant for all φ(θ ) in the relative interior
of each induced region. Hence,
X Θ,X 0
(Z)
is at most the number of such regions. Gupta and
Kallus, 2021 prove that the number of such regions is at most (1+2m)
dim(φ) where m is number
of hyperplanes in Eq. (2.41). By assumption, m ≤ KX 2
max
X 0
, and, hence,
X Θ(Z, x
0
)
≤
1+2K
X 0
X 2
maxdim(φ)
. Collecting constants yields the bound. □
We can now prove our high-probability tail bound.
Lemma 2.9.2 (Bounding In-Sample Optimism). Under the assumptions of Theorem 2.4.3, there exists a constant C (depending on ν_min) such that, for any R > 1, with probability 1 − e^{−R},
sup_{θ ∈ Θ, x^0 ∈ X^0} ξ^⊤ x(Z, θ, x^0) − E[ ξ^⊤ x(Z, θ, x^0) ] ≤ C S_max R √( K dim(φ) log(K |X^0| X_max) ).
Proof.Proof: By triangle inequality,
sup
θ∈Θ,x
0∈X 0
ξ
⊤
x(Z,θ , x
0
)−E
h
ξ
⊤
x(Z,θ , x
0
)
i
≤ sup
x
0∈X 0
(ξ
0
)
⊤x
0 −E
h
(ξ
0
)
⊤x
0
i
+ sup
x
0∈X 0,θ∈Θ
K
∑
k=1
∑
j∈Sk
ξjx j(Z,θ , x
0
)−E
ξjx j(Z,θ , x
0
)
Consider the first term. For a fixed x0, this is a sum of sub-Gaussian random variables each
with parameter at most 1/
√
νmin. We apply Hoeffding’s inequality to obtain a pointwise bound for
a fixed x0 and then take a a union bound over X 0
. This yields for some absolute constant c,
P
(
sup
x
0∈X 0
∑
j∈S0
ξjx
0
j −E
ξjx
0
j
> t
)
≤ 2|X 0
| exp
−
cνmint
2
|S0|
,
Rearranging shows that, with probability at least 1−exp{−R},
sup
x
0∈X 0
∑
j∈S0
ξjx
0
j −E
ξjx
0
j
≤
s
|S0|R
cνmin
log(2|X 0
|)
8
For the second component, we apply Theorem 2.7.1. We bounded the cardinality X Θ,X 0
(Z)
in Lemma 2.9.1. Consider the vector
∑
j∈S1
ξjx j(Z;θ , x
0
),..., ∑
j∈SK
ξjx j(Z;θ , x
0
)
!
.
This vector has independent components. We next construct an envelope F(Z) for it and bound
the Ψ-norm of F(Z).
Since 0 ≤ x j(Z,θ , x
0
) ≤ 1, we take
F(Z) =
∑
j∈S1
ξj
,..., ∑
j∈SK
ξj
!
.
Note that
∑
j∈Sk
ξj
!2
(a)
≤ |Sk
| ∑
j∈Sk
ζj
√νj
2
≤
Smax
νmin
∑
j∈Sk
ζ
2
j
where ζj ∼ N (0,1). Inequality (a) uses ∥t∥1 ≤
√
d ∥t∥2
for t ∈ R
d
. Plugging into ∥∥F(ξ )∥2
∥Ψ
we have
∥∥F(Z)∥2
∥Ψ ≤
s
|Smax|
νmin
·
vuut
K
∑
k=1
∑
j∈Sk
ζ
2
j
Ψ
(b)
≤
s
|Smax|
νmin
· 2|Smax|K.
Inequality (b) follows from Lemma A.1 iv) of GR 2021.
Applying Theorem 2.7.1, combining the bounds of our two components, and collecting constants, we obtain our result. □
2.9.2 Convergence of VGC
Define V(Z, θ, x^0) and D(Z, θ, x^0) analogously to V(Z, θ) and D(Z, θ) with x(Z, θ) replaced by x(Z, θ, x^0) throughout. The next step in our proof provides a high-probability bound on
sup_{θ ∈ Θ, x^0 ∈ X^0} | D(Z, θ, x^0) − E[ D(Z, θ, x^0) ] |.
We first establish a pointwise bound for a fixed θ, x^0.
Lemma 2.9.3 (Pointwise Convergence of VGC for a Fixed x^0). Under the assumptions of Theorem 2.4.3, for fixed θ, x^0, there exists a constant C (that depends on ν_min) such that, for any R > 1, we have with probability 1 − 2 exp(−R),
|D(Z, θ) − E[D(Z, θ)]| ≤ C |S_max| √(K R / h).
Proof.Proof: By definition,
D(Z,θ , x
0
) = ∑
j∈S0
aj(θ )̸=0
Dj(Z,θ , x
0
) +
K
∑
k=1
E
∑
j∈Sk
aj(θ )̸=0
V(Z +δje j
,θ , x
0
)−V(Z,θ , x
0
)
aj(θ )h
√νj
Z
. (2.42)
Consider the first term. Since x
0
is fixed (deterministic),
Dj(Z,θ , x
0
) = 1
aj(θ )h
√νj
E
h
V(Z +δje j
,θ , x
0
)−V(Z,θ , x
0
)
Z
i
=
1
h
√νj
E
δjx
0
j
= 0.
This equality holds almost surely. Hence, it suffices to focus on the second term in Eq. (2.42).
Importantly, the second term is a sum of K independent random variables for a fixed x
0
. We
next claim that each of these random variables is bounded. For any j such that aj(θ ) ̸= 0, let Si be
the subproblem such that j ∈ Si
. Then, write
V(Z +δje j
,θ , x
0
)−V(Z,θ , x
0
)
aj(θ )h
√νj
=
1
aj(θ )h
√νj
r
k
(Z,θ , x
0
)
⊤
x
k
(Z +δje j
,θ , x
0
)−x
k
(Z,θ , x
0
)
| {z }
≤0 by optimality of x
k(Z,θ ,x
0)
+ aj(θ )δjx j(Z +δj
, e j
,θ , x
0
)
≤
δjx j(Z +δje j
,θ , x
0
)
h
√νj
≤
δj
h
√νj
.
By an analogous argument,
V(Z,θ , x
0
)−V(Z +δje j
,θ , x
0
)
aj(θ )h
√νj
≤
δj
h
√νj
.
Hence,
Dj(Z,θ , x
0
)
≤ E
V(Z +δje j
,θ , x
0
)−V(Z,θ , x
0
)
aj(θ )h
√νj
Z
≤
1
h
√νj
E
δj
≤
1
h
√νj
·
√
h
ν
1/4
j
≤
1
√
hν
3/4
min
.
(2.43)
Applying Hoeffding’s inequality to Eq. (2.42) and collecting constants shows
P{|D(Z;θ )−E[D(Z;θ )]| ≥ t} ≤ 2exp
−
h ·t
2
C0K|Smax|
2
!
for some constant C0 (depending on νmin). Thus, with probability 1−ε, we see
|D(Z;θ )−E[D(Z;θ )]| ≤
s
C0 ·K|Smax|
2
h
log
2
ε
Combining constants completes our proof. □
We now bound
sup_{θ ∈ Θ} | D(Z, θ, x^0) − E[ D(Z, θ, x^0) ] |
for a fixed x^0. The key idea is to use the Lipschitz continuity of θ ↦ D(Z, θ, x^0) to cover the set Θ.
Lemma 2.9.4 (Uniform Convergence of VGC for a Fixed x^0). Under the assumptions of Theorem 2.4.3 and for H ≡ [h_min, h_max], there exists a constant C (that depends on ν_min, C_µ, L) such that for any R > 1, we have with probability 1 − 2e^{−R},
sup_{θ ∈ Θ̄} |D(Z, θ) − E[D(Z, θ)]| ≤ C S_max √(K R / h_min) √( log n · log( N(√(h_min/(K n²)), Θ) · N(h_min/K, H) ) ).
Proof. Proof. Using the full notation D(Z,(θ ,h)), we first write the supremum to be over θ ∈ Θ
and h ∈ H . Let Θ0 be a ε1-covering of Θ and let H be a ε2-covering of H ≡ [hmin,hmax]. Then,
sup
θ∈Θ
h∈H
|D(Z,(θ ,h))−E[D(Z,(θ ,h))]| ≤ sup
θ∈Θ0
h∈H
|D(Z,(θ ,h))−E[D(Z,(θ ,h))]| (2.44a)
+ sup
θ ,θ :∥θ −θ ∥≤ε1
h∈H
D(Z,(θ ,h))−D(Z,(θ ,h))
+ sup
θ ,θ :∥θ −θ ∥≤ε1
h∈H
E
h
D(Z,(θ ,h))−D(Z,(θ ,h))i
(2.44b)
+ sup
θ∈Θ0
h,h:∥h−h∥≤ε2
D(Z,(θ ,h))−D(Z,(θ ,h))
+ sup
θ∈Θ0
h,h:∥h−h∥≤ε2
E
h
D(Z,(θ ,h))−D(Z,(θ ,h))i
(2.44c)
We bound Eq. (2.44b) and Eq. (2.44c) using Lemma 2.3.7. For Eq. (2.44b), there exists a constant
C1 such that with probability at least 1−e
−R
D(Z,(θ ,h))−D(Z,(θ ,h))
≤
C1ε1n
2
h
p
Rlogn.
Similarly, there exists C2, C3 and C4 (depending on νmin,L,Cµ,amin,amax,bmax) such that
E
h
D(Z,(θ ,h))−D(Z,(θ ,h))i
≤
C2ε1n
2
h
(E[∥z∥∞] +1) ≤
C3ε1n
2
h
p
logn+Cµ
≤
C4ε1n
2
h
p
logn,
where the second inequality uses a standard bound on the maximum of n sub-Gaussian random
variables, and we have used Assumption 2.3.1 to simplify. Combining and taking the supremum
over h shows
Eq. (2.44b) ≤
C5ε1n
2
hmin
p
Rlogn,
for some constant C5.
For Eq. (2.44c), there exists a constant C6 such that
D(Z,(θ ,h))−D(Z,(θ ,h))
≤
C6ε
1/2
2
n
hmin
.
Since the same holds for the expectation of the same quantity, we see
Eq. (2.44c) ≤
2C6ε
1/2
2
n
hmin
,
Similarly, using Lemma 2.9.3 and applying a union bound over elements in Θ and H shows with
probability at least 1−e
−R
,
Eq. (2.44a) ≤ C7Smaxr
KR
hmin
log(N(ε1,Θ)N(ε2,H )),
for some constant C7. Choosing ε1 =
Smax√
Khmin
n
2
, ε2 =
S
2
maxKhmin
n
2
, we see
Eq. (2.44a)+Eq. (2.44b)+Eq. (2.44c) ≤ C8Smaxr
KR
hmin
log(N(ε1,Θ)N(ε2,H )),
Finally, we obtain our result by simplifying the above bound slightly since n = ∑
K
k=0
|Sk
| ≤ KSmax,
and hence,
ε1 =
KSmax√
Khmin
Kn2
≥
r
hmin
Kn2
ε2 =
S
2
maxK
2hmin
Kn2
≥
hmin
K
.
Substituting the lower-bounds, we obtain our intended result.
□
We can now prove Theorem 2.4.3.
Proof. Proof of Theorem 2.4.3: We proceed to bound each term on the right side of Eq. (2.18).
To bound Eq. (2.18a), observe by definition of our lifted policy class and Lemma 2.9.2, we
have, with probability at least 1−e
−R
, that
sup
θ∈Θ¯
ξ
⊤
x(Z,θ )−E
h
ξ
⊤
x(Z,θ )
i
≤ sup
θ∈Θ¯ ,x
0∈X 0
ξ
⊤
x(Z,θ , x
0
)−E
h
ξ
⊤
x(Z,θ , x
0
)
i
≤ CSmaxR
q
Kdim(φ)log(K|X 0
|Xmax).
To bound Eq. (2.18b), let H ≡ [hmin,hmax]. Then, by applying the union bound to Lemma 2.9.4
with R ← R+log(1+
X 0
) we have that with probability at least 1−2e
−R
,
sup
θ∈Θ
x
0∈X 0
|D(Z,θ )−E[D(Z,θ )]| ≤ C1Smaxs
K(R+log(|1+X 0
|)
hmin
p
logn ·log(N (ε1,Θ)N (ε2,H ))
≤ C2Smaxs
KRlog(|1+X 0
|)
hmin
p
logn ·log(N (ε1,Θ)N (ε2,H )),
for some constants C1 and C2 and where ε1 =
qhmin
Kn2 and ε2 =
hmin
K
.
Finally, to bound Eq. (2.18c), use Theorem 2.3.2 and take the supremum over h ∈ H to obtain
Eq. (2.18c) ≤ C5hmaxKSmax log(1/hmin).
Substituting these three bounds into Eq. (2.18) and collecting constants proves the theorem.
□
2.10 Appendix: Problems that are Weakly Coupled by Constraints
2.10.1 Properties of the Dual Optimization Problem
Throughout the section, we use the notation ⟨ℓ,u⟩ to denote the interval [min(ℓ,u),max(ℓ,u)].
Recall our dual formulation
λ(z, θ) ∈ argmin_{λ ≥ 0} L(λ, z, θ),   where L(λ, z, θ) ≡ b^⊤ λ + (1/n) ∑_{j=1}^n [ r_j(z_j, θ) − A_j^⊤ λ ]^+.
Since L(λ) is non-differentiable, its (partial) subgradient is not unique. We identify a particular subgradient by
∇_λ L(λ, z, θ) = b − (1/n) ∑_{j=1}^n I{ r_j(z_j, θ) > A_j^⊤ λ } A_j.
The following identity characterizes the remainder term in a first-order Taylor-series expansion of L(λ) with this subgradient.
Lemma 2.10.1 (A Taylor Series for L(λ)). For any λ_1, λ_2 ∈ R^m_+, z ∈ R^n, θ ∈ Θ,
L(λ_2, z, θ) − L(λ_1, z, θ) = ∇_λ L(λ_1, z, θ)^⊤ (λ_2 − λ_1) + (1/n) ∑_{j=1}^n I{ r_j(z_j, θ) ∈ ⟨A_j^⊤ λ_1, A_j^⊤ λ_2⟩ } ( r_j(z_j, θ) − A_j^⊤ λ_2 ).
Proof. Proof of Lemma 2.10.1: Since z and θ are fixed, drop them from the notation. Using the
definitions of L and ∇λ L , we see it is sufficient to prove that for each j,
[rj−A
⊤
j λ 2]
+ −[rj−A
⊤
j λ 1]
+ +I
n
rj > A
⊤
j λ 1
o
A
⊤
j
(λ 2−λ 1) = I
n
rj ∈ ⟨A
⊤
j λ 1,A
⊤
j λ 2⟩
o
rj −A
⊤
j λ 2
(2.45)
First notice that if A
⊤
j λ 1 = A
⊤
j λ 2, then both sides of Eq. (2.45) equal zero. Further, ifrj ̸∈ ⟨A
⊤
j λ 1,A
⊤
j λ 2⟩,
then both sides are again zero. Thus, we need only considering the case where rj ∈ ⟨A
⊤
j λ 1,A
⊤
j λ 2⟩.
We can confirm the identity directly by considering the cases where A
⊤λ 1 < A
⊤λ 2 and A
⊤λ 1 > A
⊤λ 2
separately. □
The following result is proven in Lemma D.3 of GR 2021. We reproduce it here for completeness.
Lemma 2.10.2 (Dual Solutions Bounded by Plug-in). If X is s_0-strictly feasible, then
i) ∥λ(z, θ)∥_1 ≤ (2/(n s_0)) ∥r(z, θ)∥_1,
ii) E[∥λ(Z, θ)∥_1] ≤ (2/(n s_0)) E[∥r(Z, θ)∥_1].
Proof. Proof of Lemma 2.10.2: By optimality, L (λ (z,θ ),z,θ ) ≤ L (0,z,θ ) ≤
1
n
∥r(z,θ )∥1. Since
λ (z,θ ) ≥ 0, it follows that ∥λ (z,θ )∥1 = e
⊤λ (z,θ ). Thus,
∥λ (z,θ )∥1 ≤ max
λ ≥0
e
⊤λ
s.t. b
⊤λ +
1
n
n
∑
j=1
rj(zj
,θ )−A
⊤
j λ
+
≤
1
n
∥r(z,θ )∥1.
We upper bound this optimization by relaxing the constraint with penalty 1/s0 > 0 to see that
∥λ (z,θ )∥1 ≤ max
λ ≥0
e
⊤λ +
1
s0
1
n
∥r(z,θ )∥1 −b
⊤λ −
1
n
n
∑
j=1
rj(zj
,θ )−A
⊤
j λ
+
!
= max
λ ≥0
e
⊤λ +
1
s0
1
n
∥r(z,θ )∥1 −b
⊤λ −
1
n
n
∑
j=1
max
x j∈[0,1]
x j
rj(zj
,θ )−A
⊤
j λ
!
≤ max
λ ≥0
e
⊤λ +
1
s0
1
n
∥r(z,θ )∥1 −b
⊤λ −
1
n
n
∑
j=1
x
0
j
rj(zj
,θ )−A
⊤
j λ
!
= max
λ ≥0
e −
1
s0
b +
1
ns0
Ax0
⊤
λ +
1
ns0
∥r(z,θ )∥1 −r(z,θ )
⊤x
0
.
By s0-strict feasibility, 1
n
Ax0 +s0e ≤ b ←→ e −
1
s0
b +
1
ns0
Ax0 ≤ 0. Hence, λ = 0 is optimal for this
optimization problem. Thus, for all θ ∈ Θ,
∥λ (z,θ )∥1 ≤
1
ns0
∥r(z,θ )∥1 −r(z,θ )
⊤x
0
≤
2
ns0
∥r(z,θ )∥1.
This proves i). Applying the expectation to both sides completes the proof. □
2.10.2 Constructing the Good Set
To construct the set of Z where approximate strong convexity holds (the "good" set), we first define the following constants:
λ_max ≡ (2/s_0) ( a_max (C_µ + 4/√ν_min) + b_max ),   (2.46a)
φ_min ≡ (√ν_min / (a_max √(2π))) exp( −ν_max (a_max C_µ + b_max + C_A λ_max)² / (2 a_min²) ),   (2.46b)
Λ_n = { (λ_1, λ_2) ∈ R^m_+ × R^m_+ : ∥λ_1∥_1 ≤ λ_max, ∥λ_2∥_1 ≤ λ_max, ∥λ_1 − λ_2∥_2 ≥ 4/n },   (2.46c)
T_n = { (λ, θ, Γ) ∈ R^m_+ × Θ × R : ∥λ∥_1 ≤ λ_max, Γ ≥ 1/n }.   (2.46d)
These values depend on the constants defined in Assumption 2.4.4 and Assumption 2.4.5.
We now define the "good" set,
E_n ≡ { z :
  (∇_λ L(λ_1, z, θ) − ∇_λ L(λ_2, z, θ))^⊤ (λ_1 − λ_2) ≥ φ_min β ∥λ_1 − λ_2∥_2^2 − ∥λ_1 − λ_2∥_2^{3/2} V² log(V) log²(n) / √n   ∀ (λ_1, λ_2) ∈ Λ_n, ∀ θ ∈ Θ,
  (1/n) ∑_{j=1}^n I{ |r_j(z_j, θ) − A_j^⊤ λ| ≤ Γ } ≤ Γ √ν_max + Γ^{1/2} V log(V) log²(n) / √n   ∀ (λ, θ, Γ) ∈ T_n,
  ∥z∥_1 ≤ n C_µ + 2n/√ν_min,
  ∥z∥_∞ ≤ log n }.
For clarity, we stress that φ_min > 0 and λ_max > 0 are dimension-independent constants.
We show in the next section that P(Z ∉ E_n) = Õ(1/n). Thus, the event {Z ∈ E_n} happens with high probability, and we will perform our subsequent probabilistic analysis conditional on this "good" set.
2.10.3 Bounding the “Bad” Set
The purpose of this section is to bound P(Z ∉ E_n). Since E_n consists of four conditions, we
treat each separately. The last two conditions on ∥Z∥1 and ∥Z∥∞ can be analyzed using standard
techniques for sub-Gaussian random variables.
Lemma 2.10.3 (Bounding ∥Z∥1). Under Assumption 2.3.1,
P( ∥Z∥_1 > n C_µ + 2n/√ν_min ) ≤ e^{−n/32}.
Proof. Proof of Lemma 2.10.3: Note that $\mathbb{E}|Z_j - \mu_j| \le \frac{1}{\sqrt{\nu_{\min}}}$ by Jensen's inequality. Furthermore, because each $Z_j$ is sub-Gaussian with variance proxy $\frac{1}{\nu_{\min}}$, we have by Lemma A.1 of GR 2021 that $\bigl\||Z_j-\mu_j|\bigr\|_\Psi = \|Z_j - \mu_j\|_\Psi \le \frac{2}{\sqrt{\nu_{\min}}}$. Thus, $|Z_j - \mu_j| - \mathbb{E}|Z_j-\mu_j|$ is a mean-zero sub-Gaussian random variable with variance proxy at most $\frac{16}{\nu_{\min}}$. Finally, observe $\|Z\|_1 \le nC_\mu + \sum_{j=1}^n |Z_j - \mu_j|$. Hence,
\[
P\Bigl(\|Z\|_1 > nC_\mu + \frac{2n}{\sqrt{\nu_{\min}}}\Bigr) \;\le\; P\Bigl(\frac{1}{n}\sum_{j=1}^n |Z_j-\mu_j| > \frac{2}{\sqrt{\nu_{\min}}}\Bigr) \;\le\; P\Bigl(\frac{1}{n}\sum_{j=1}^n \bigl(|Z_j-\mu_j| - \mathbb{E}|Z_j-\mu_j|\bigr) > \frac{1}{\sqrt{\nu_{\min}}}\Bigr) \;\le\; e^{-\frac{n}{32}},
\]
by the usual bound on the sum of independent sub-Gaussian random variables. □
Lemma 2.10.4 (Bounding $\|Z\|_\infty$). Under Assumption 2.3.1, there exists a dimension independent constant $n_0$ such that for all $n \ge n_0$,
\[
P(\|Z\|_\infty > \log n) \le \frac{1}{n^2}.
\]
Proof. Proof of Lemma 2.10.4: By Wainwright, 2019, $\mathbb{E}[\|Z\|_\infty] \le C_1\sqrt{\log n}$ for some dimension independent constant $C_1$. Moreover, by Wainwright, 2019, Example 2.29, $\|Z\|_\infty - \mathbb{E}[\|Z\|_\infty]$ is sub-Gaussian with variance proxy at most $1/\nu_{\min}$. Hence,
\[
P(\|Z\|_\infty > \log n) \le P\bigl(\|Z\|_\infty - \mathbb{E}[\|Z\|_\infty] > \log n - C_1\sqrt{\log n}\bigr).
\]
For $n$ sufficiently large, $\log n - C_1\sqrt{\log n} \ge \frac{1}{2}\log n$. Hence, for $n$ sufficiently large, this last probability is at most
\[
P\Bigl(\|Z\|_\infty - \mathbb{E}[\|Z\|_\infty] > \frac{1}{2}\log n\Bigr) \le \exp\Bigl(-\frac{\nu_{\min}\log^2 n}{8}\Bigr) \le n^{-\nu_{\min}\log n/8}.
\]
For $n$ sufficiently large, the exponent is at most $-2$, proving the lemma. □
We next establish that the inequality bounding the behavior over $T_n$ holds with high probability. As preparation, we first bound the supremum of a particular stochastic process over this set.

Lemma 2.10.5 (Suprema over $T_n$). Recall the definition of $T_n$ in Eq. (2.46d). Under Assumptions 2.3.1, 2.3.6 and 2.4.6, there exist dimension independent constants $C$ and $n_0$ such that for all $n \ge n_0$, we have that for any $R > 1$, with probability at least $1 - e^{-R}$,
\[
\sup_{(\lambda,\theta,\Gamma)\in T_n} \Bigl| \sum_{j=1}^n \Bigl( \mathbb{I}\bigl\{\bigl|r_j(Z_j,\theta) - A_j^\top\lambda\bigr| \le \Gamma\bigr\} - P\bigl(\bigl|r_j(Z_j,\theta) - A_j^\top\lambda\bigr| \le \Gamma\bigr) \Bigr)\,\Gamma^{-1/2} \Bigr| \;\le\; C\,R\,V\log V\,\sqrt n.
\]
Proof. Proof of Lemma 2.10.5: Our goal will be to apply Theorem 2.7.3. As a first step, we claim that for fixed $(\lambda,\theta,\Gamma)$, there exists a dimension independent constant $C_1$ such that
\[
P\bigl(\bigl|r_j(Z_j,\theta) - A_j^\top\lambda\bigr| \le \Gamma\bigr) \le \sqrt{\nu_j}\,C_1\,\Gamma.
\]
To prove the claim, notice that this quantity is the probability that a Gaussian random variable lives in an interval of length $2\Gamma$. Upper bounding the density of the Gaussian by its value at its mean shows the density is at most $\sqrt{\nu_j/(2\pi)}$. Thus,
\[
P\bigl(\bigl|r_j(Z_j,\theta) - A_j^\top\lambda\bigr| \le \Gamma\bigr) \le 2\Gamma\sqrt{\frac{\nu_j}{2\pi}} \le \Gamma\sqrt{\nu_j}. \tag{2.47}
\]
This upper bound further implies that there exists a dimension independent constant $C_2$ such that the parameter "$\sigma^2$" in Theorem 2.7.3 is at most $C_2$, because the indicator squared equals the indicator. We also take the parameter "$U$" to be $\sqrt n$ since $\Gamma \ge \frac{1}{n}$.
Thus, to apply the theorem it remains to show the set
\[
\mathcal{F} \equiv \Bigl\{ \bigl(\mathbb{I}\bigl\{\bigl|r_j(Z_j,\theta) - A_j^\top\lambda\bigr| \le \Gamma\bigr\}\bigr)_{j=1}^n : (\lambda,\theta,\Gamma)\in T_n \Bigr\}
\]
is Euclidean and compute its parameters.
Consider the set
\[
\mathcal{F}_1 \equiv \Bigl\{ \bigl(\mathbb{I}\bigl\{r_j(Z_j,\theta) - A_j^\top\lambda \le \Gamma\bigr\}\bigr)_{j=1}^n : (\lambda,\theta,\Gamma)\in T_n \Bigr\}.
\]
By Assumption 2.4.6, this set has VC-dimension at most $V$, and, hence, also has pseudo-dimension at most $V$. The same is true of the set
\[
\mathcal{F}_2 \equiv \Bigl\{ \bigl(\mathbb{I}\bigl\{A_j^\top\lambda - r_j(Z_j,\theta) \le \Gamma\bigr\}\bigr)_{j=1}^n : (\lambda,\theta,\Gamma)\in T_n \Bigr\}.
\]
Since $\mathcal{F} = \mathcal{F}_1 \wedge \mathcal{F}_2$, by Pollard, 1990, Lemma 5.1 there exists an absolute constant $C_2$ such that $\mathcal{F}$ has pseudo-dimension at most $C_2 V$. By Theorem A.3 of GR 2021, $\mathcal{F}$ is Euclidean with parameters $A = (C_2V)^{6C_2V}$ and $W = 4C_2V$. The relevant complexity parameter "$V(A,W)$", determined by $\log A = 6C_2V\log(C_2V)$ and $W = 4C_2V$, is then at most $C_3\sqrt{V\log V}$ for some dimension independent constant $C_3$.
Theorem 2.7.3 now bounds the suprema by $C_4\,R\,V\log(V)\sqrt n$, completing the proof. □
Equipped with Lemma 2.10.5, we can now show the relevant condition holds with high probability.

Lemma 2.10.6 (Bounding Away from Degeneracy). Recall the definition of $T_n$ in Eq. (2.46d). Under Assumptions 2.3.1, 2.3.6 and 2.4.6 there exists a dimension independent constant $n_0$ such that for all $n \ge n_0$, with probability at least $1 - 1/n$ we have that
\[
\frac{1}{n}\sum_{j=1}^n \mathbb{I}\bigl\{\bigl|r_j(Z_j,\theta) - A_j^\top\lambda\bigr| \le \Gamma\bigr\} \;\le\; \Gamma\sqrt{\nu_{\max}} + \Gamma^{1/2}\,V\log(V)\,\frac{\log^2 n}{\sqrt n}, \qquad \forall(\lambda,\theta,\Gamma)\in T_n.
\]
Proof. Proof of Lemma 2.10.6: Apply Lemma 2.10.5 with $R = \log n$ to conclude that with probability at least $1-1/n$, for all $(\lambda,\theta,\Gamma)\in T_n$ simultaneously, we have
\[
\sum_{j=1}^n \Bigl( \mathbb{I}\bigl\{\bigl|r_j(Z_j,\theta) - A_j^\top\lambda\bigr| \le \Gamma\bigr\} - P\bigl(\bigl|r_j(Z_j,\theta) - A_j^\top\lambda\bigr| \le \Gamma\bigr) \Bigr) \;\le\; C\,\Gamma^{1/2}\,V\log(V)\log(n)\sqrt n.
\]
Then observe that, as in the proof of Eq. (2.47), $P\bigl(\bigl|r_j(Z_j,\theta) - A_j^\top\lambda\bigr| \le \Gamma\bigr) \le \Gamma\sqrt{\nu_{\max}}$. Finally, for $n$ sufficiently large, $C \le \log n$. Rearranging then completes the proof. □
Remark 2.10.7. We describe the condition in Lemma 2.10.6 as "Bounding Away from Degeneracy" because $r_j(Z_j,\theta) - A_j^\top\lambda$ is the reduced cost of the $j$th component at the dual solution $\lambda$. Hence, the lemma asserts that there are not too many reduced costs that are less than $1/n$ in magnitude.

It remains to establish that the approximate strong convexity condition over $\Lambda_n$ holds with high probability. As preparation, we again bound the suprema of a particular stochastic process.
Lemma 2.10.8 (Suprema over $\Lambda_n$). Under Assumptions 2.3.1, 2.3.6 and 2.4.6, there exists a dimension independent constant $C$ such that for any $R > 1$, with probability at least $1 - e^{-R}$, we have
\[
\sup_{(\lambda_1,\lambda_2)\in\Lambda_n,\,\theta\in\Theta} \Bigl| \frac{1}{n}\sum_{j=1}^n \Bigl( \mathbb{I}\bigl\{r_j(Z_j,\theta) \in \langle A_j^\top\lambda_1, A_j^\top\lambda_2\rangle\bigr\} - P\bigl(r_j(Z_j,\theta) \in \langle A_j^\top\lambda_1, A_j^\top\lambda_2\rangle\bigr) \Bigr)\, \frac{\bigl|A_j^\top(\lambda_1-\lambda_2)\bigr|}{\|\lambda_1-\lambda_2\|_2^{3/2}} \Bigr| \;\le\; \frac{C\,R\,V^2\log(V)}{\sqrt n}.
\]
Proof. Proof of Lemma 2.10.8: Our strategy will be to apply Theorem 2.7.3. To this end, we first claim that there exists a dimension independent constant $\varphi_{\max}$ such that for any fixed $(\lambda_1,\lambda_2)\in\Lambda_n$ and $\theta\in\Theta$,
\[
P\bigl(r_j(Z_j,\theta) \in \langle A_j^\top\lambda_1, A_j^\top\lambda_2\rangle\bigr) \le \varphi_{\max}\|\lambda_1-\lambda_2\|_1.
\]
To prove the claim, notice that this is the probability that a Gaussian random variable lives in an interval of length at most $\bigl|A_j^\top(\lambda_1-\lambda_2)\bigr| \le C_A\|\lambda_1-\lambda_2\|_1$. Upper bounding the Gaussian density by the square root of its precision proves the claim.
We next argue that this claim implies that there exists a dimension independent constant $C_1$ bounding the parameter "$\sigma^2$" in Theorem 2.7.3. Indeed, an indicator squared is still the same indicator. Scaling by
\[
\frac{\bigl|A_j^\top(\lambda_1-\lambda_2)\bigr|^2}{\|\lambda_1-\lambda_2\|_2^3} \le \frac{C_A^2\,m}{\|\lambda_1-\lambda_2\|_2},
\]
and then averaging over $j$ proves that $\sigma^2$ is at most $C_1m$.
We can take the parameter "$U$" to be $C_A\sqrt{mn}$ since
\[
\frac{\bigl|A_j^\top(\lambda_1-\lambda_2)\bigr|}{\|\lambda_1-\lambda_2\|_2^{3/2}} \le C_A\sqrt m\,\|\lambda_1-\lambda_2\|_2^{-1/2} \le C_A\sqrt{mn},
\]
because $(\lambda_1,\lambda_2)\in\Lambda_n$.
Thus, to apply Theorem 2.7.3 we need only show that the set
\[
\mathcal{F} \equiv \Bigl\{ \Bigl(\mathbb{I}\bigl\{r_j(Z_j,\theta)\in\langle A_j^\top\lambda_1, A_j^\top\lambda_2\rangle\bigr\}\,\frac{\bigl|A_j^\top(\lambda_1-\lambda_2)\bigr|}{\|\lambda_1-\lambda_2\|_2^{3/2}}\Bigr)_{j=1}^n : (\lambda_1,\lambda_2)\in\Lambda_n,\ \theta\in\Theta \Bigr\}
\]
is Euclidean and determine its parameters. To this end, first consider the sets
\begin{align*}
\mathcal{F}_1 &\equiv \bigl\{ \bigl(\mathbb{I}\{r_j(Z_j,\theta) \ge A_j^\top\lambda_1\}\bigr)_{j=1}^n : \lambda_1\in\mathbb{R}^m_+,\ \theta\in\Theta \bigr\}, \\
\mathcal{F}_2 &\equiv \bigl\{ \bigl(\mathbb{I}\{r_j(Z_j,\theta) \le A_j^\top\lambda_2\}\bigr)_{j=1}^n : \lambda_2\in\mathbb{R}^m_+,\ \theta\in\Theta \bigr\}, \\
\mathcal{F}_3 &\equiv \bigl\{ \bigl(\mathbb{I}\{r_j(Z_j,\theta) \ge A_j^\top\lambda_2\}\bigr)_{j=1}^n : \lambda_2\in\mathbb{R}^m_+,\ \theta\in\Theta \bigr\}, \\
\mathcal{F}_4 &\equiv \bigl\{ \bigl(\mathbb{I}\{r_j(Z_j,\theta) \le A_j^\top\lambda_1\}\bigr)_{j=1}^n : \lambda_1\in\mathbb{R}^m_+,\ \theta\in\Theta \bigr\}.
\end{align*}
By Assumption 2.4.6, each of these sets has VC-dimension at most $V$ (they are indicator sets for functions with pseudo-dimension at most $V$). Now define the set
\[
\mathcal{F}_5 \equiv \bigl\{ \bigl(\mathbb{I}\{r_j(Z_j,\theta) \in \langle A_j^\top\lambda_1, A_j^\top\lambda_2\rangle\}\bigr)_{j=1}^n : (\lambda_1,\lambda_2)\in\Lambda_n \bigr\},
\]
and notice that
\[
\mathcal{F}_5 \subseteq (\mathcal{F}_1 \wedge \mathcal{F}_2) \vee (\mathcal{F}_3 \wedge \mathcal{F}_4).
\]
Hence, by Pollard, 1990, Lemma 5.1, there exists an absolute constant $C_2 > 1$ such that $\mathcal{F}_5$ has pseudo-dimension at most $C_2V$. By Theorem A.3 of GR 2021, $\mathcal{F}_5$ is thus Euclidean with parameters $A = (C_2V)^{6C_2V}$ and $W = 4C_2V$.
Now consider the set
\[
\mathcal{F}_6 = \Bigl\{ \Bigl(\frac{A_j^\top(\lambda_1-\lambda_2)}{\|\lambda_1-\lambda_2\|_2^{3/2}}\Bigr)_{j=1}^n : (\lambda_1,\lambda_2)\in\Lambda_n \Bigr\},
\]
and notice
\[
\mathcal{F}_6 \subseteq \bigl\{ (A_j^\top\lambda)_{j=1}^n : \lambda\in\mathbb{R}^m \bigr\}.
\]
This latter set belongs to a vector space of dimension at most $m$, and hence has pseudo-dimension at most $m \le V$. Thus, by Theorem A.3 of GR 2021, it is Euclidean with parameters at most $A = V^{6V}$ and $W = 4V$.
To conclude, notice that $\mathcal{F}$ is the pointwise product of $\mathcal{F}_5$ and $\mathcal{F}_6$. Hence, by Pollard, 1990, Lemma 5.3, we have that $\mathcal{F}$ is Euclidean with parameters $A = (C_3V)^{C_3V}\cdot C_3^{C_3V}$ and $W = C_3V$ for some absolute constant $C_3$. In particular, the relevant complexity parameter "$V(A,W)$" for $\mathcal{F}$ is at most $C_4\sqrt{V\log(V)}$ for some dimension independent parameter $C_4$.
Applying Theorem 2.7.3 now shows that the suprema of the lemma is at most $C_5\,R\,V\log(V)\,m/\sqrt n$, for some dimension independent $C_5$. Since $m \le V$, this completes the lemma. □
Equipped with Lemma 2.10.8, we can prove the approximate strong convexity condition holds with high probability.

Lemma 2.10.9 (Approximate Strong Convexity with High Probability). Under Assumptions 2.3.1, 2.3.6 and 2.4.6, there exists a dimension independent constant $n_0$ such that for all $n \ge n_0$, we have with probability at least $1-\frac{1}{n}$ that the following inequality holds simultaneously for all $(\lambda_1,\lambda_2)\in\Lambda_n$ and $\theta\in\Theta$:
\[
\bigl(\nabla_\lambda\mathcal{L}(\lambda_1,z,\theta) - \nabla_\lambda\mathcal{L}(\lambda_2,z,\theta)\bigr)^\top(\lambda_1-\lambda_2) \;\ge\; \varphi_{\min}\beta\|\lambda_1-\lambda_2\|_2^2 - \|\lambda_1-\lambda_2\|_2^{3/2}\,V^2\log(V)\,\frac{\log^2 n}{\sqrt n}.
\]
Proof. Proof of Lemma 2.10.9: By choosing $R = \log n$, Lemma 2.10.8 shows that there exists a dimension independent constant $C$ such that, with probability at least $1-1/n$, simultaneously for all $(\lambda_1,\lambda_2)\in\Lambda_n$ and $\theta\in\Theta$,
\[
\frac{1}{n}\sum_{j=1}^n \Bigl( \mathbb{I}\bigl\{r_j(Z_j,\theta)\in\langle A_j^\top\lambda_1,A_j^\top\lambda_2\rangle\bigr\} - P\bigl(r_j(Z_j,\theta)\in\langle A_j^\top\lambda_1,A_j^\top\lambda_2\rangle\bigr) \Bigr)\frac{\bigl|A_j^\top(\lambda_1-\lambda_2)\bigr|}{\|\lambda_1-\lambda_2\|_2^{3/2}} \;\ge\; -\,C\,V^2\log(V)\,\frac{\log n}{\sqrt n}.
\]
This inequality implies that for any $(\lambda_1,\lambda_2)\in\Lambda_n$ and $\theta\in\Theta$,
\[
\frac{1}{n}\sum_{j=1}^n \mathbb{I}\bigl\{r_j(Z_j,\theta)\in\langle A_j^\top\lambda_1,A_j^\top\lambda_2\rangle\bigr\}\,\bigl|A_j^\top(\lambda_1-\lambda_2)\bigr| \;\ge\; \frac{1}{n}\sum_{j=1}^n P\bigl(r_j(Z_j,\theta)\in\langle A_j^\top\lambda_1,A_j^\top\lambda_2\rangle\bigr)\,\bigl|A_j^\top(\lambda_1-\lambda_2)\bigr| \;-\; C\,\|\lambda_1-\lambda_2\|_2^{3/2}\,V^2\log(V)\,\frac{\log n}{\sqrt n}.
\]
Thus, our first goal will be to bound the summation on the right side. Isolate the $j$th term. The probability $P\bigl(r_j(Z_j,\theta)\in\langle A_j^\top\lambda_1,A_j^\top\lambda_2\rangle\bigr)$ is the probability that a Gaussian random variable lives in an interval of length at most $\bigl|A_j^\top(\lambda_1-\lambda_2)\bigr|$. Moreover, the endpoints of this interval are at most $\bigl|A_j^\top\lambda_i\bigr| \le C_A\lambda_{\max}$ for $i=1,2$, since $(\lambda_1,\lambda_2)\in\Lambda_n$. It follows that these endpoints are no further than $a_j(\theta)\mu_j + b_j(\theta) + C_A\lambda_{\max}$ from the mean of the relevant Gaussian. Thus, we can lower bound the density of the Gaussian on this interval. This reasoning proves
\[
P\bigl(r_j(Z_j,\theta)\in\langle A_j^\top\lambda_1,A_j^\top\lambda_2\rangle\bigr)\,\bigl|A_j^\top(\lambda_1-\lambda_2)\bigr| \;\ge\; \bigl|A_j^\top(\lambda_1-\lambda_2)\bigr|^2\cdot\frac{\sqrt{\nu_j}}{a_j(\theta)\sqrt{2\pi}}\exp\Bigl(-\frac{\nu_j\bigl(a_j(\theta)\mu_j + b_j(\theta) + C_A\lambda_{\max}\bigr)^2}{2a_j(\theta)^2}\Bigr) \;\ge\; \varphi_{\min}\,\bigl|A_j^\top(\lambda_1-\lambda_2)\bigr|^2.
\]
Averaging over $j$ shows
\[
\frac{1}{n}\sum_{j=1}^n P\bigl(r_j(Z_j,\theta)\in\langle A_j^\top\lambda_1,A_j^\top\lambda_2\rangle\bigr)\,\bigl|A_j^\top(\lambda_1-\lambda_2)\bigr| \;\ge\; \varphi_{\min}(\lambda_1-\lambda_2)^\top\Bigl(\frac{1}{n}\sum_{j=1}^n A_jA_j^\top\Bigr)(\lambda_1-\lambda_2) \;\ge\; \varphi_{\min}\beta\|\lambda_1-\lambda_2\|_2^2,
\]
by Assumption 2.4.5.
Substituting above, and noting that if $n_0 = e^C$ then $\log n \ge C$, completes the proof. □
Finally, a simple union bound gives

Lemma 2.10.10 ($Z \in E_n$ with High Probability). Under Assumptions 2.3.1, 2.3.6 and 2.4.6 there exists a dimension independent constant $n_0$ such that for all $n \ge n_0$, $P(Z\in E_n) \ge 1 - \frac{4}{n}$.

Proof. Proof. Combine Lemmas 2.10.3, 2.10.4, 2.10.6 and 2.10.9 and apply a union bound.
2.10.4 Properties of the Good Set

In this section, we argue that for data realizations $z \in E_n$, our optimization problems satisfy a number of properties, and, in particular, the dual solutions and VGC satisfy a bounded differences condition. We start by showing that small perturbations to the data $z$ still yield dual solutions that are bounded. Note, any $z \in E_n$ satisfies the assumptions of the next lemma.

Lemma 2.10.11 (Bounded Duals). Suppose Assumptions 2.3.1 and 2.3.6 hold, $\|t\|_\infty \le 3\sqrt{n/\nu_{\min}}$, and $z$ satisfies $\|z\|_1 \le nC_\mu + \frac{2n}{\sqrt{\nu_{\min}}}$. Then, for all $j = 1,\dots,n$,
\[
\sup_{\theta\in\Theta}\ \|\lambda(z + t_je_j,\theta)\|_1 \le \lambda_{\max}.
\]
Proof. Proof of Lemma 2.10.11: Write
\begin{align*}
\|\lambda(z+t_je_j,\theta)\|_1 &\le \frac{2}{ns_0}\|r(z+t_je_j,\theta)\|_1 &&\text{(Lemma 2.10.2)}\\
&\le \frac{2}{ns_0}\bigl(a_{\max}\|z\|_1 + a_{\max}|t_j| + b_{\max}n\bigr) &&\text{(definition of } r(\cdot,\theta))\\
&\le \frac{2}{s_0}\Bigl(a_{\max}\Bigl(C_\mu + \frac{2}{\sqrt{\nu_{\min}}} + \frac{3}{\sqrt{n\nu_{\min}}}\Bigr) + b_{\max}\Bigr) &&\text{(by the assumptions on } \|z\|_1 \text{ and } \|t\|_\infty)\\
&\le \lambda_{\max},
\end{align*}
since $3/\sqrt n \le \sqrt 3 \le 2$. Taking the supremum of both sides over $\theta\in\Theta$ completes the proof. □
We next establish a bounded differences condition for the dual solution $\lambda(z,\theta)$.

Lemma 2.10.12 (Bounded Differences for the Dual). Suppose Assumptions 2.3.1 and 2.3.6 hold and that $z \in E_n$ and $\|\lambda(\bar z,\theta)\| \le \lambda_{\max}$. Then, there exists a dimension independent constant $C$ such that
\[
\|\lambda(z,\theta) - \lambda(\bar z,\theta)\|_2 \;\le\; C\,V^3\log^2(V)\,\frac{\log^4 n}{n}\sum_{j=1}^n \mathbb{I}\{z_j \neq \bar z_j\}.
\]
Proof. Proof of Lemma 2.10.12: To declutter the notation, define
\[
f_1(\lambda) \equiv \mathcal{L}(\lambda,z,\theta), \quad \lambda_1 \equiv \lambda(z,\theta), \qquad
f_2(\lambda) \equiv \mathcal{L}(\lambda,\bar z,\theta), \quad \lambda_2 \equiv \lambda(\bar z,\theta).
\]
Furthermore, let $I_j = \langle A_j^\top\lambda_1, A_j^\top\lambda_2\rangle$.
Notice if $\|\lambda_1 - \lambda_2\|_2 \le 4/n$, the inequality is immediate for $C = 4$ since $m \ge 1$. Hence, we assume throughout that $\|\lambda_1-\lambda_2\|_2 > 4/n$.
Using Lemma 2.10.1 we have that
\begin{equation}
f_1(\lambda_2) - f_1(\lambda_1) = \nabla f_1(\lambda_1)^\top(\lambda_2-\lambda_1) + \frac{1}{n}\sum_{j=1}^n \mathbb{I}\{r_j(z_j)\in I_j\}\,\bigl|r_j(z_j) - A_j^\top\lambda_2\bigr| \;\ge\; \frac{1}{n}\sum_{j=1}^n \mathbb{I}\{r_j(z_j)\in I_j\}\,\bigl|r_j(z_j) - A_j^\top\lambda_2\bigr|, \tag{2.48}
\end{equation}
where the inequality uses the fact that $f_1(\cdot)$ is convex and $\lambda_1$ is an optimizer. Analogously, we have that
\begin{equation}
f_2(\lambda_1) - f_2(\lambda_2) \;\ge\; \frac{1}{n}\sum_{j=1}^n \mathbb{I}\{r_j(\bar z_j)\in I_j\}\,\bigl|r_j(\bar z_j) - A_j^\top\lambda_1\bigr|. \tag{2.49}
\end{equation}
Adding Eqs. (2.48) and (2.49) yields
\[
f_1(\lambda_2) - f_1(\lambda_1) + f_2(\lambda_1) - f_2(\lambda_2) \;\ge\; \frac{1}{n}\sum_{j=1}^n \Bigl( \mathbb{I}\{r_j(\bar z_j)\in I_j\}\bigl|r_j(\bar z_j) - A_j^\top\lambda_1\bigr| + \mathbb{I}\{r_j(z_j)\in I_j\}\bigl|r_j(z_j) - A_j^\top\lambda_2\bigr| \Bigr).
\]
Isolate the $j$th term on the right. To lower bound this term, note that when $z_j \neq \bar z_j$,
\begin{align*}
\Bigl| \mathbb{I}\{r_j(\bar z_j)\in I_j\}\bigl|r_j(\bar z_j) - A_j^\top\lambda_1\bigr| - \mathbb{I}\{r_j(z_j)\in I_j\}\bigl|r_j(z_j) - A_j^\top\lambda_1\bigr| \Bigr|
&\le \mathbb{I}\{r_j(\bar z_j)\in I_j\}\bigl|r_j(\bar z_j) - A_j^\top\lambda_1\bigr| + \mathbb{I}\{r_j(z_j)\in I_j\}\bigl|r_j(z_j) - A_j^\top\lambda_1\bigr| \\
&\overset{(a)}{\le} 2\bigl|A_j^\top(\lambda_1-\lambda_2)\bigr| \le 2C_A\|\lambda_1-\lambda_2\|_2,
\end{align*}
where inequality (a) follows because each indicator is non-zero only when the corresponding $r$ is in the interval $\langle A_j^\top\lambda_1, A_j^\top\lambda_2\rangle$. Hence, substituting above and rearranging yields
\begin{align}
f_1(\lambda_2) - f_1(\lambda_1) + f_2(\lambda_1) - f_2(\lambda_2) + 2C_A\|\lambda_1-\lambda_2\|_2\cdot\frac{1}{n}\sum_{j=1}^n \mathbb{I}\{z_j\neq\bar z_j\}
&\ge \frac{1}{n}\sum_{j=1}^n \mathbb{I}\{r_j(z_j)\in I_j\}\Bigl(\bigl|r_j(z_j) - A_j^\top\lambda_2\bigr| + \bigl|r_j(z_j) - A_j^\top\lambda_1\bigr|\Bigr) \nonumber\\
&= \frac{1}{n}\sum_{j=1}^n \mathbb{I}\{r_j(z_j)\in I_j\}\bigl|A_j^\top(\lambda_1-\lambda_2)\bigr| \tag{2.50}\\
&= \bigl(\nabla_\lambda\mathcal{L}(\lambda_1,z,\theta) - \nabla_\lambda\mathcal{L}(\lambda_2,z,\theta)\bigr)^\top(\lambda_1-\lambda_2) \nonumber\\
&\ge \varphi_{\min}\beta\|\lambda_1-\lambda_2\|_2^2 - V^2\log(V)\cdot\frac{\log^2 n}{\sqrt n}\cdot\|\lambda_1-\lambda_2\|_2^{3/2}, \quad\text{(since } z\in E_n,\ (\lambda_1,\lambda_2)\in\Lambda_n)\nonumber
\end{align}
where Eq. (2.50) follows because when the indicator is non-zero, $r_j(z_j)$ is between $A_j^\top\lambda_1$ and $A_j^\top\lambda_2$.
To summarize the argument so far, we have shown that
\begin{equation}
f_1(\lambda_2) - f_1(\lambda_1) + f_2(\lambda_1) - f_2(\lambda_2) + 2C_A\|\lambda_1-\lambda_2\|_2\,\frac{1}{n}\sum_{j=1}^n \mathbb{I}\{z_j\neq\bar z_j\} \;\ge\; \varphi_{\min}\beta\|\lambda_1-\lambda_2\|_2^2 - V^2\log(V)\frac{\log^2 n}{\sqrt n}\|\lambda_1-\lambda_2\|_2^{3/2}. \tag{2.51}
\end{equation}
The next step of the proof upper bounds the left side. By definition of $f_1(\cdot)$, $f_2(\cdot)$,
\[
f_1(\lambda_2) - f_1(\lambda_1) + f_2(\lambda_1) - f_2(\lambda_2) = \frac{1}{n}\sum_{j=1}^n \Bigl( \bigl[r_j(z_j) - A_j^\top\lambda_2\bigr]^+ - \bigl[r_j(z_j) - A_j^\top\lambda_1\bigr]^+ + \bigl[r_j(\bar z_j) - A_j^\top\lambda_1\bigr]^+ - \bigl[r_j(\bar z_j) - A_j^\top\lambda_2\bigr]^+ \Bigr).
\]
The $j$th term is non-zero only if $z_j \neq \bar z_j$. In that case,
\[
\bigl[r_j(z_j) - A_j^\top\lambda_2\bigr]^+ - \bigl[r_j(z_j) - A_j^\top\lambda_1\bigr]^+ + \bigl[r_j(\bar z_j) - A_j^\top\lambda_1\bigr]^+ - \bigl[r_j(\bar z_j) - A_j^\top\lambda_2\bigr]^+ \;\le\; 2\bigl|A_j^\top(\lambda_2-\lambda_1)\bigr| \;\le\; 2C_A\|\lambda_1-\lambda_2\|_2.
\]
Summing over $j$ for which $z_j \neq \bar z_j$ and substituting into the left side of Eq. (2.51) yields
\begin{equation}
4C_A\|\lambda_1-\lambda_2\|_2\cdot\frac{1}{n}\sum_{j=1}^n \mathbb{I}\{z_j\neq\bar z_j\} \;\ge\; \varphi_{\min}\beta\|\lambda_1-\lambda_2\|_2^2 - V^2\log(V)\frac{\log^2 n}{\sqrt n}\|\lambda_1-\lambda_2\|_2^{3/2}. \tag{2.52}
\end{equation}
To simplify this expression, recall that by assumption
\[
\|\lambda_1-\lambda_2\|_2 \ge \frac{4}{n} \;\Longrightarrow\; \sqrt n\,\|\lambda_1-\lambda_2\|_2^{1/2} \ge 1.
\]
Hence we can inflate the left side of Eq. (2.52) by multiplying by $\sqrt n\,\|\lambda_1-\lambda_2\|_2^{1/2}$ and then rearranging to obtain
\begin{align*}
\varphi_{\min}\beta\|\lambda_1-\lambda_2\|_2^2 &\le 4C_A\|\lambda_1-\lambda_2\|_2^{3/2}\cdot\frac{1}{\sqrt n}\sum_{j=1}^n \mathbb{I}\{z_j\neq\bar z_j\} + V^2\log(V)\frac{\log^2 n}{\sqrt n}\|\lambda_1-\lambda_2\|_2^{3/2} \\
&\le C_1\,V^2\log(V)\frac{\log^2 n}{\sqrt n}\sum_{j=1}^n \mathbb{I}\{z_j\neq\bar z_j\}\,\|\lambda_1-\lambda_2\|_2^{3/2},
\end{align*}
for some dimension independent constant $C_1$. Dividing both sides by $\|\lambda_1-\lambda_2\|_2^{3/2}$ and combining constants yields
\begin{equation}
\|\lambda_1-\lambda_2\|_2^{1/2} \;\le\; C_2\,V^2\log(V)\frac{\log^2 n}{\sqrt n}\sum_{j=1}^n \mathbb{I}\{z_j\neq\bar z_j\}, \tag{2.53}
\end{equation}
for some dimension independent constant $C_2$. Multiply Eq. (2.53) by $V\log(V)\frac{\log^2 n}{\sqrt n}\|\lambda_1-\lambda_2\|_2$ to see that
\[
V\log(V)\frac{\log^2 n}{\sqrt n}\|\lambda_1-\lambda_2\|_2^{3/2} \;\le\; C_2\,V^3\log^2(V)\frac{\log^4 n}{n}\sum_{j=1}^n \mathbb{I}\{z_j\neq\bar z_j\}\,\|\lambda_1-\lambda_2\|_2.
\]
Substituting this upper bound into Eq. (2.52) yields
\[
4C_A\|\lambda_1-\lambda_2\|_2\cdot\frac{1}{n}\sum_{j=1}^n \mathbb{I}\{z_j\neq\bar z_j\} \;\ge\; \varphi_{\min}\beta\|\lambda_1-\lambda_2\|_2^2 - C_2\,V^3\log^2(V)\frac{\log^4 n}{n}\sum_{j=1}^n \mathbb{I}\{z_j\neq\bar z_j\}\,\|\lambda_1-\lambda_2\|_2.
\]
Now divide by $\|\lambda_1-\lambda_2\|_2$ and rearrange to complete the proof. □
We now use this result to show the VGC is also Lipschitz in the Hamming distance. The key to the following proof is that strong duality shows $V(z,\theta) = n\mathcal{L}(\lambda(z,\theta),z,\theta)$.

Lemma 2.10.13 (Bounded Differences for VGC). Let $z,\bar z \in E_n$. Suppose Assumptions 2.3.1 and 2.3.6 hold. Then, there exists a dimension independent constant $C$ such that for any $n$ with $\frac{\log n}{nh} \le 1$, we have that
\[
|D(z,\theta) - D(\bar z,\theta)| \;\le\; \frac{C}{h}\,V^3\log^2(V)\log^4(n)\sum_{i=1}^n \mathbb{I}\{z_i \neq \bar z_i\}.
\]
Proof. Proof of Lemma 2.10.13: Notice if $z = \bar z$ the lemma is trivially true. Hence, throughout, we assume $z \neq \bar z$. Since $\theta$ is fixed throughout, we also drop it from the notation.
As a first step, we will prove the two inequalities
\begin{align}
V(z+\delta_je_j) - V(z) &\ge \bigl[r_j(z_j) + a_j\delta_j - A_j^\top\lambda(z+\delta_je_j)\bigr]^+ - \bigl[r_j(z_j) - A_j^\top\lambda(z+\delta_je_j)\bigr]^+, \tag{2.54a}\\
V(z+\delta_je_j) - V(z) &\le \bigl[r_j(z_j) + a_j\delta_j - A_j^\top\lambda(z)\bigr]^+ - \bigl[r_j(z_j) - A_j^\top\lambda(z)\bigr]^+. \tag{2.54b}
\end{align}
To prove Eq. (2.54b), write
\begin{align}
V(z+\delta_je_j) - V(z) &= n\mathcal{L}\bigl(\lambda(z+\delta_je_j),\,z+\delta_je_j\bigr) - n\mathcal{L}(\lambda(z),z) \tag{2.55}\\
&\le n\mathcal{L}\bigl(\lambda(z),\,z+\delta_je_j\bigr) - n\mathcal{L}(\lambda(z),z) \nonumber\\
&= \bigl[r_j(z_j) + a_j\delta_j - A_j^\top\lambda(z)\bigr]^+ - \bigl[r_j(z_j) - A_j^\top\lambda(z)\bigr]^+, \nonumber
\end{align}
where the inequality holds by the sub-optimality of $\lambda(z)$ for $\mathcal{L}(\lambda, z+\delta_je_j)$, and the last equality holds since all terms except the $j$th in the summation of the Lagrangian cancel out. A similar argument using the sub-optimality of $\lambda(z+\delta_je_j)$ for $\mathcal{L}(\lambda,z)$ proves the lower bound Eq. (2.54a).
The next step of the proof establishes that there exists a dimension independent constant $C_1$ such that
\begin{equation}
\mathbb{E}\Bigl|\bigl(V(z+\delta_je_j) - V(z)\bigr) - \bigl(V(\bar z+\delta_je_j) - V(\bar z)\bigr)\Bigr| \;\le\; C_1\Bigl(V^3\log^2(V)\frac{\log^4 n}{n}\sum_{i=1}^n \mathbb{I}\{z_i\neq\bar z_i\}\Bigr)\mathbb{I}\{z_j = \bar z_j\} + C_1\sqrt h\,\mathbb{I}\{z_j\neq\bar z_j\}. \tag{2.56}
\end{equation}
As suggested by the bound, we will consider two cases depending on whether $z_j = \bar z_j$.

Case 1: $z_j \neq \bar z_j$. Notice the inequalities Eq. (2.54) apply as well when $z$ is replaced by $\bar z$. Hence, applying the upper bound for the first term and the lower bound for the second term shows
\begin{align*}
\bigl(V(z+\delta_je_j) - V(z)\bigr) - \bigl(V(\bar z+\delta_je_j) - V(\bar z)\bigr)
&\le \bigl[r_j(z_j) + a_j\delta_j - A_j^\top\lambda(z)\bigr]^+ - \bigl[r_j(z_j) - A_j^\top\lambda(z)\bigr]^+ \\
&\quad - \bigl[r_j(\bar z_j) + a_j\delta_j - A_j^\top\lambda(\bar z+\delta_je_j)\bigr]^+ + \bigl[r_j(\bar z_j) - A_j^\top\lambda(\bar z+\delta_je_j)\bigr]^+ \\
&\le 2\,a_j\,|\delta_j|,
\end{align*}
because $t\mapsto t^+$ is a 1-Lipschitz function. Taking expectations of both sides, using Jensen's inequality, and upper bounding the variance of $\delta_j$ shows
\[
\mathbb{E}\Bigl|\bigl(V(z+\delta_je_j)-V(z)\bigr) - \bigl(V(\bar z+\delta_je_j)-V(\bar z)\bigr)\Bigr| \;\le\; 2a_{\max}\mathbb{E}|\delta_j| \;\le\; \frac{2a_{\max}\sqrt{3h}}{\nu_{\min}^{1/4}}.
\]
Collecting constants proves the inequality when $z_j \neq \bar z_j$.

Case 2: $z_j = \bar z_j$. Proceeding as in Case 1,
\begin{align*}
\bigl(V(z+\delta_je_j) - V(z)\bigr) - \bigl(V(\bar z+\delta_je_j) - V(\bar z)\bigr)
&\le \bigl[r_j(z_j) + a_j\delta_j - A_j^\top\lambda(z)\bigr]^+ - \bigl[r_j(z_j) - A_j^\top\lambda(z)\bigr]^+ \\
&\quad - \bigl[r_j(\bar z_j) + a_j\delta_j - A_j^\top\lambda(\bar z+\delta_je_j)\bigr]^+ + \bigl[r_j(\bar z_j) - A_j^\top\lambda(\bar z+\delta_je_j)\bigr]^+ \\
&\le 2\bigl|A_j^\top\bigl(\lambda(z) - \lambda(\bar z+\delta_je_j)\bigr)\bigr| \\
&\le 2C_A\|\lambda(z) - \lambda(\bar z)\| + 2C_A\|\lambda(\bar z) - \lambda(\bar z+\delta_je_j)\|,
\end{align*}
where the second inequality follows again because $t\mapsto t^+$ is a contraction, but we group the terms in a different order, and the last inequality follows from the triangle inequality and the Cauchy-Schwarz inequality. We can bound the first term by invoking Lemma 2.10.12, yielding
\begin{equation}
\bigl(V(z+\delta_je_j)-V(z)\bigr) - \bigl(V(\bar z+\delta_je_j)-V(\bar z)\bigr) \;\le\; C_2\,V^3\log^2(V)\frac{\log^4 n}{n}\sum_{i=1}^n \mathbb{I}\{z_i\neq\bar z_i\} + 2C_A\|\lambda(\bar z) - \lambda(\bar z+\delta_je_j)\|, \tag{2.57}
\end{equation}
for some dimension independent constant $C_2$. Taking expectations shows that to prove a bound, it will suffice to bound $\mathbb{E}\|\lambda(\bar z) - \lambda(\bar z+\delta_je_j)\|$. To this end, consider splitting the expectation based on whether $|\delta_j| \ge 3\sqrt{n/\nu_{\min}}$.
If $|\delta_j| \le 3\sqrt{n/\nu_{\min}}$, then by Lemma 2.10.11, $\|\lambda(\bar z+\delta_je_j)\|_1 \le \lambda_{\max}$. Hence we can invoke Lemma 2.10.12 again, yielding
\[
\mathbb{E}\Bigl[\|\lambda(\bar z)-\lambda(\bar z+\delta_je_j)\|\,\mathbb{I}\Bigl\{|\delta_j| \le 3\sqrt{\tfrac{n}{\nu_{\min}}}\Bigr\}\Bigr] \;\le\; C_3\,V^3\log^2(V)\frac{\log^4 n}{n}\,P\Bigl(|\delta_j| \le 3\sqrt{\tfrac{n}{\nu_{\min}}}\Bigr) \;\le\; C_3\,V^3\log^2(V)\frac{\log^4 n}{n}.
\]
Next, assume $|\delta_j| \ge 3\sqrt{n/\nu_{\min}}$. Write
\begin{align*}
\|\lambda(\bar z)-\lambda(\bar z+\delta_je_j)\| &\le \lambda_{\max} + \frac{2}{ns_0}\|r(\bar z+\delta_je_j)\|_1 &&\text{(by Lemma 2.10.2)}\\
&\le \lambda_{\max} + \frac{2}{ns_0}\|r(\bar z)\|_1 + \frac{2a_j}{ns_0}|\delta_j| &&\text{(by definition of } r(\cdot))\\
&\le \lambda_{\max} + C_4 + \frac{2a_{\max}}{ns_0}|\delta_j|,
\end{align*}
for some dimension independent constant $C_4$, because $\bar z\in E_n$ implies that $\|r(\bar z)\|_1/n$ is bounded by a (dimension independent) constant.
Thus, for some dimension independent constant $C_5$ we have
\[
\mathbb{E}\Bigl[\|\lambda(\bar z)-\lambda(\bar z+\delta_je_j)\|\,\mathbb{I}\Bigl\{|\delta_j| > 3\sqrt{\tfrac{n}{\nu_{\min}}}\Bigr\}\Bigr]
\le \mathbb{E}\Bigl[\Bigl(C_5 + \frac{C_5}{n}|\delta_j|\Bigr)\mathbb{I}\Bigl\{|\delta_j| > 3\sqrt{\tfrac{n}{\nu_{\min}}}\Bigr\}\Bigr]
\le \mathbb{E}\Bigl[2C_5|\delta_j|\,\mathbb{I}\Bigl\{|\delta_j| > 3\sqrt{\tfrac{n}{\nu_{\min}}}\Bigr\}\Bigr],
\]
where the final inequality uses $3\sqrt{n/\nu_{\min}} \ge 1$ since $n \ge 3$. Integration by parts with the Gaussian density shows there exists a dimension independent constant $C_6$ such that
\[
\mathbb{E}\Bigl[|\delta_j|\,\mathbb{I}\Bigl\{|\delta_j| > 3\sqrt{\tfrac{n}{\nu_{\min}}}\Bigr\}\Bigr] \;\le\; C_6\sqrt h\,e^{-nh/\sqrt{\nu_{\min}}} \;\le\; C_6\sqrt h\,e^{-nh} \;\le\; \frac{C_6}{n},
\]
because $nh > \log n$ and $h < 1$ by assumption.
Combining the two cases shows that
\[
\mathbb{E}\|\lambda(\bar z)-\lambda(\bar z+\delta_je_j)\| \;\le\; C_7\,V^3\log^2(V)\frac{\log^4 n}{n}
\]
for some dimension independent constant $C_7$.
Taking the expectation of Eq. (2.57), substituting this bound, and collecting constants proves
\[
\mathbb{E}\Bigl|\bigl(V(z+\delta_je_j)-V(z)\bigr) - \bigl(V(\bar z+\delta_je_j)-V(\bar z)\bigr)\Bigr| \;\le\; C_8\,V^3\log^2(V)\frac{\log^4 n}{n}\sum_{j=1}^n \mathbb{I}\{z_j\neq\bar z_j\}
\]
for some constant $C_8$. Combining with Case 1 establishes Eq. (2.56).
Now, by symmetry, Eq. (2.56) holds with the roles of $z$ and $\bar z$ reversed. Hence, Eq. (2.56) also holds after taking the absolute values of both sides.
We can now write
\begin{align*}
|D(z) - D(\bar z)| &\le \sum_{j=1}^n \frac{1}{h}\frac{\sqrt{\nu_j}}{a_j}\,\mathbb{E}\Bigl|\bigl(V(z+\delta_je_j)-V(z)\bigr) - \bigl(V(\bar z+\delta_je_j)-V(\bar z)\bigr)\Bigr| \\
&\le \frac{C_1}{h\sqrt{\nu_{\min}}\,a_{\min}}\,V^3\log^2(V)\frac{\log^4 n}{n}\Bigl(\sum_{i=1}^n \mathbb{I}\{z_i\neq\bar z_i\}\Bigr)\Bigl(\sum_{j=1}^n \mathbb{I}\{z_j=\bar z_j\}\Bigr) + \frac{C_1}{\sqrt{h\,\nu_{\min}}\,a_{\min}}\sum_{j=1}^n \mathbb{I}\{z_j\neq\bar z_j\} \\
&\le \frac{C_9}{h}\,V^3\log^2(V)\log^4(n)\sum_{i=1}^n \mathbb{I}\{z_i\neq\bar z_i\},
\end{align*}
for some constant $C_9$. This completes the proof. □
Finally, we show that $\theta\mapsto\lambda(z,\theta)$ is also smooth on $E_n$, at least locally.

Lemma 2.10.14 (Local Smoothness of Dual Solution in $\theta$). Suppose $z\in E_n$ and that Assumptions 2.3.1, 2.3.6 and 2.4.6 hold. Then, there exist dimension independent constants $C$ and $n_0$ such that for any $n\ge n_0$ and any $\bar\theta$ such that $\|\bar\theta - \theta\| \le \frac{1}{n}$, we have that
\[
\bigl\|\lambda(z,\theta) - \lambda(z,\bar\theta)\bigr\|_2 \;\le\; C\,V^2\log V\,\frac{\log^{5/4} n}{\sqrt n}.
\]
Proof. Proof of Lemma 2.10.14: The proof is similar to that of Lemma 2.10.12. To declutter the notation, define
\[
f_1(\lambda) \equiv \mathcal{L}(\lambda,z,\theta), \quad \lambda_1 \equiv \lambda(z,\theta), \qquad
f_2(\lambda) \equiv \mathcal{L}(\lambda,z,\bar\theta), \quad \lambda_2 \equiv \lambda(z,\bar\theta).
\]
Furthermore, let $I_j = \bigl\langle A_j^\top\lambda_1, A_j^\top\lambda_2\bigr\rangle$.
If $\|\lambda_1-\lambda_2\|_2 \le 4V^2\log V\,\frac{\log n}{\sqrt n}$, then the lemma holds trivially for $C = 4$. Hence, for the remainder, we assume $\|\lambda_1-\lambda_2\|_2 > 4V^2\log V\,\frac{\log n}{\sqrt n}$. In particular, by Lemma 2.10.11, this implies $(\lambda_1,\lambda_2)\in\Lambda_n$.
Using Lemma 2.10.1, we have that
\[
f_1(\lambda_2) - f_1(\lambda_1) = \nabla f_1(\lambda_1)^\top(\lambda_2-\lambda_1) + \frac{1}{n}\sum_{j=1}^n \mathbb{I}\{r_j(z_j,\theta)\in I_j\}\bigl|r_j(z_j,\theta) - A_j^\top\lambda_2\bigr| \;\ge\; \frac{1}{n}\sum_{j=1}^n \mathbb{I}\{r_j(z_j,\theta)\in I_j\}\bigl|r_j(z_j,\theta) - A_j^\top\lambda_2\bigr|,
\]
because $f_1(\cdot)$ is convex and $\lambda_1$ is an optimizer. Similarly, we have that
\[
f_2(\lambda_1) - f_2(\lambda_2) \;\ge\; \frac{1}{n}\sum_{j=1}^n \mathbb{I}\{r_j(z_j,\bar\theta)\in I_j\}\bigl|r_j(z_j,\bar\theta) - A_j^\top\lambda_1\bigr|.
\]
Adding yields
\begin{equation}
f_1(\lambda_2) - f_1(\lambda_1) + f_2(\lambda_1) - f_2(\lambda_2) \;\ge\; \frac{1}{n}\sum_{j=1}^n \Bigl( \mathbb{I}\{r_j(z_j,\theta)\in I_j\}\bigl|r_j(z_j,\theta) - A_j^\top\lambda_2\bigr| + \mathbb{I}\{r_j(z_j,\bar\theta)\in I_j\}\bigl|r_j(z_j,\bar\theta) - A_j^\top\lambda_1\bigr| \Bigr). \tag{2.58}
\end{equation}
We would like to combine the $j$th summand to simplify. To this end, adding and subtracting $\mathbb{I}\{r_j(z_j,\theta)\in I_j\}\bigl|r_j(z_j,\theta) - A_j^\top\lambda_1\bigr|$ yields
\begin{align}
&\mathbb{I}\{r_j(z_j,\theta)\in I_j\}\bigl|r_j(z_j,\theta)-A_j^\top\lambda_2\bigr| + \mathbb{I}\{r_j(z_j,\bar\theta)\in I_j\}\bigl|r_j(z_j,\bar\theta)-A_j^\top\lambda_1\bigr| \nonumber\\
&\qquad = \mathbb{I}\{r_j(z_j,\theta)\in I_j\}\Bigl(\bigl|r_j(z_j,\theta)-A_j^\top\lambda_2\bigr| + \bigl|r_j(z_j,\theta)-A_j^\top\lambda_1\bigr|\Bigr) \tag{2.59a}\\
&\qquad\quad + \mathbb{I}\{r_j(z_j,\bar\theta)\in I_j\}\bigl|r_j(z_j,\bar\theta)-A_j^\top\lambda_1\bigr| - \mathbb{I}\{r_j(z_j,\theta)\in I_j\}\bigl|r_j(z_j,\theta)-A_j^\top\lambda_1\bigr|. \tag{2.59b}
\end{align}
We simplify Eq. (2.59a) by noting that when the indicator is non-zero,
\[
\bigl|r_j(z_j,\theta)-A_j^\top\lambda_2\bigr| + \bigl|r_j(z_j,\theta)-A_j^\top\lambda_1\bigr| = \bigl|A_j^\top(\lambda_2-\lambda_1)\bigr|.
\]
Hence,
\[
\text{Eq. (2.59a)} = \mathbb{I}\{r_j(z_j,\theta)\in I_j\}\bigl|A_j^\top(\lambda_2-\lambda_1)\bigr|.
\]
We rewrite Eq. (2.59b) as
\begin{align*}
\text{Eq. (2.59b)} &= \mathbb{I}\{r_j(z_j,\bar\theta)\in I_j\}\Bigl(\bigl|r_j(z_j,\bar\theta)-A_j^\top\lambda_1\bigr| - \bigl|r_j(z_j,\theta)-A_j^\top\lambda_1\bigr|\Bigr) + \bigl|r_j(z_j,\theta)-A_j^\top\lambda_1\bigr|\Bigl(\mathbb{I}\{r_j(z_j,\bar\theta)\in I_j\} - \mathbb{I}\{r_j(z_j,\theta)\in I_j\}\Bigr) \\
&\overset{(a)}{\ge} -\mathbb{I}\{r_j(z_j,\bar\theta)\in I_j\}\bigl|r_j(z_j,\bar\theta)-r_j(z_j,\theta)\bigr| + \bigl|r_j(z_j,\theta)-A_j^\top\lambda_1\bigr|\Bigl(\mathbb{I}\{r_j(z_j,\bar\theta)\in I_j\} - \mathbb{I}\{r_j(z_j,\theta)\in I_j\}\Bigr) \\
&\overset{(b)}{\ge} -\bigl|r_j(z_j,\bar\theta)-r_j(z_j,\theta)\bigr| - \bigl|r_j(z_j,\theta)-A_j^\top\lambda_1\bigr|\,\mathbb{I}\bigl\{r_j(z_j,\bar\theta)\notin I_j,\ r_j(z_j,\theta)\in I_j\bigr\} \\
&\overset{(c)}{\ge} -L\|\theta-\bar\theta\|_2\bigl(|z_j|+1\bigr) - \bigl|r_j(z_j,\theta)-A_j^\top\lambda_1\bigr|\,\mathbb{I}\bigl\{r_j(z_j,\bar\theta)\notin I_j,\ r_j(z_j,\theta)\in I_j\bigr\} \\
&\overset{(d)}{\ge} -\frac{L}{n}\bigl(|z_j|+1\bigr) - \bigl|r_j(z_j,\theta)-A_j^\top\lambda_1\bigr|\,\mathbb{I}\bigl\{r_j(z_j,\bar\theta)\notin I_j,\ r_j(z_j,\theta)\in I_j\bigr\},
\end{align*}
where inequality (a) is the triangle inequality, inequality (b) rounds the indicators, inequality (c) follows from the Lipschitz assumptions on $a_j(\theta)$ and $b_j(\theta)$, and inequality (d) uses $\|\theta-\bar\theta\| \le \frac{1}{n}$.
Finally, note that when the last indicator is non-zero,
\[
\bigl|r_j(z_j,\theta)-A_j^\top\lambda_1\bigr| \le \bigl|A_j^\top(\lambda_2-\lambda_1)\bigr| \le 2C_A\sqrt m\,\lambda_{\max}.
\]
Substituting this bound above and the resulting lower bound on Eq. (2.59b) into Eq. (2.59) proves
\begin{align}
&\mathbb{I}\{r_j(z_j,\theta)\in I_j\}\bigl|r_j(z_j,\theta)-A_j^\top\lambda_2\bigr| + \mathbb{I}\{r_j(z_j,\bar\theta)\in I_j\}\bigl|r_j(z_j,\bar\theta)-A_j^\top\lambda_1\bigr| \tag{2.60}\\
&\qquad \ge \mathbb{I}\{r_j(z_j,\theta)\in I_j\}\bigl|A_j^\top(\lambda_2-\lambda_1)\bigr| - \frac{L}{n}\bigl(|z_j|+1\bigr) - 2C_A\sqrt m\,\lambda_{\max}\,\mathbb{I}\bigl\{r_j(z_j,\bar\theta)\notin I_j,\ r_j(z_j,\theta)\in I_j\bigr\}.\nonumber
\end{align}
We can further clean up the last indicator by noting that
\[
\mathbb{I}\bigl\{r_j(z_j,\bar\theta)\notin I_j,\ r_j(z_j,\theta)\in I_j\bigr\} = 1 \;\Longrightarrow\; \text{either } \bigl|r_j(z_j,\theta)-A_j^\top\lambda_1\bigr| \le \bigl|r_j(z,\theta)-r_j(z,\bar\theta)\bigr| \ \text{or}\ \bigl|r_j(z_j,\theta)-A_j^\top\lambda_2\bigr| \le \bigl|r_j(z,\theta)-r_j(z,\bar\theta)\bigr|.
\]
Moreover, because $z\in E_n$, we can use the Lipschitz assumptions on $a_j(\theta)$ and $b_j(\theta)$ and the fact that $\|\theta-\bar\theta\|\le\frac{1}{n}$ to write
\[
\bigl|r_j(z,\theta)-r_j(z,\bar\theta)\bigr| \le 2L\|\theta-\bar\theta\|\log n \le \frac{2L\log n}{n}.
\]
Thus,
\[
\mathbb{I}\bigl\{r_j(z_j,\bar\theta)\notin I_j,\ r_j(z_j,\theta)\in I_j\bigr\} \;\le\; \mathbb{I}\Bigl\{\bigl|r_j(z_j,\theta)-A_j^\top\lambda_1\bigr| \le \frac{2L\log n}{n}\Bigr\} + \mathbb{I}\Bigl\{\bigl|r_j(z_j,\theta)-A_j^\top\lambda_2\bigr| \le \frac{2L\log n}{n}\Bigr\}.
\]
Making this substitution into Eq. (2.60), averaging over $j$, and substituting this bound into Eq. (2.58) shows
\begin{align*}
&f_1(\lambda_2) - f_1(\lambda_1) + f_2(\lambda_1) - f_2(\lambda_2) \\
&\quad\ge \frac{1}{n}\sum_{j=1}^n \mathbb{I}\{r_j(z_j,\theta)\in I_j\}\bigl|A_j^\top(\lambda_2-\lambda_1)\bigr| - \frac{L}{n^2}\sum_{j=1}^n \bigl(|z_j|+1\bigr) \\
&\qquad - 2C_A\sqrt m\,\lambda_{\max}\,\frac{1}{n}\sum_{j=1}^n \Bigl( \mathbb{I}\Bigl\{\bigl|r_j(z_j,\theta)-A_j^\top\lambda_1\bigr| \le \frac{2L\log n}{n}\Bigr\} + \mathbb{I}\Bigl\{\bigl|r_j(z_j,\theta)-A_j^\top\lambda_2\bigr| \le \frac{2L\log n}{n}\Bigr\} \Bigr) \\
&\quad= \bigl(\nabla_\lambda\mathcal{L}(\lambda_1,z,\theta) - \nabla_\lambda\mathcal{L}(\lambda_2,z,\theta)\bigr)^\top(\lambda_1-\lambda_2) - \frac{L}{n}\bigl(\|z\|_1/n + 1\bigr) \\
&\qquad - 2C_A\sqrt m\,\lambda_{\max}\,\frac{1}{n}\sum_{j=1}^n \Bigl( \mathbb{I}\Bigl\{\bigl|r_j(z_j,\theta)-A_j^\top\lambda_1\bigr| \le \frac{2L\log n}{n}\Bigr\} + \mathbb{I}\Bigl\{\bigl|r_j(z_j,\theta)-A_j^\top\lambda_2\bigr| \le \frac{2L\log n}{n}\Bigr\} \Bigr) \\
&\quad\ge \varphi_{\min}\beta\|\lambda_1-\lambda_2\|_2^2 - V^2\log(V)\frac{\log^2 n}{\sqrt n}\|\lambda_1-\lambda_2\|_2^{3/2} - \frac{2L}{n}\Bigl(C_\mu + \frac{2}{\sqrt{\nu_{\min}}}\Bigr) - 8C_A\sqrt m\,\lambda_{\max}\Bigl(L\sqrt{\nu_{\max}}\frac{\log n}{n} + \sqrt{2L}\,V\log(V)\frac{\log^{5/2} n}{n}\Bigr),
\end{align*}
because $\frac{2L\log n}{n} \ge \frac{1}{n}$ by Assumption 2.3.1 and $z\in E_n$. Using Assumption 2.3.1 to further simplify, we have thus far shown that for some dimension independent constant $C_2$,
\begin{equation}
f_1(\lambda_2) - f_1(\lambda_1) + f_2(\lambda_1) - f_2(\lambda_2) \;\ge\; \varphi_{\min}\beta\|\lambda_1-\lambda_2\|_2^2 - V^2\log(V)\frac{\log^2 n}{\sqrt n}\|\lambda_1-\lambda_2\|_2^{3/2} - C_2\,V^2\log V\,\frac{\log^{5/2} n}{n}. \tag{2.61}
\end{equation}
We next proceed to upper bound the left side of this inequality. By definition of $f_1(\cdot)$, $f_2(\cdot)$,
\[
f_1(\lambda_2) - f_1(\lambda_1) + f_2(\lambda_1) - f_2(\lambda_2) = \frac{1}{n}\sum_{j=1}^n \Bigl( \bigl[r_j(z_j,\theta)-A_j^\top\lambda_2\bigr]^+ - \bigl[r_j(z_j,\theta)-A_j^\top\lambda_1\bigr]^+ + \bigl[r_j(z_j,\bar\theta)-A_j^\top\lambda_1\bigr]^+ - \bigl[r_j(z_j,\bar\theta)-A_j^\top\lambda_2\bigr]^+ \Bigr).
\]
Focusing on the $j$th term, we see
\[
\bigl[r_j(z_j,\theta)-A_j^\top\lambda_2\bigr]^+ - \bigl[r_j(z_j,\theta)-A_j^\top\lambda_1\bigr]^+ + \bigl[r_j(z_j,\bar\theta)-A_j^\top\lambda_1\bigr]^+ - \bigl[r_j(z_j,\bar\theta)-A_j^\top\lambda_2\bigr]^+ \;\le\; 2\bigl|r_j(z_j,\theta)-r_j(z_j,\bar\theta)\bigr| \;\le\; 2L\|\theta-\bar\theta\|_2\bigl(|z_j|+1\bigr) \;\le\; \frac{2L}{n}\bigl(|z_j|+1\bigr),
\]
where the penultimate inequality uses Assumption 2.3.6. Averaging over $j$, we see
\[
f_1(\lambda_2) - f_1(\lambda_1) + f_2(\lambda_1) - f_2(\lambda_2) \;\le\; \frac{2L}{n}\Bigl(\frac{\|z\|_1}{n} + 1\Bigr) \;\le\; \frac{C_3}{n}
\]
for some constant $C_3$, since $z\in E_n$.
Substitute into Eq. (2.61) to see that
\[
\frac{C_3}{n} \;\ge\; \varphi_{\min}\beta\|\lambda_1-\lambda_2\|_2^2 - \|\lambda_1-\lambda_2\|_2^{3/2}\,V^2\log(V)\frac{\log^2 n}{\sqrt n} - C_2\,V^2\log V\,\frac{\log^{5/2}n}{n}.
\]
Rearranging and collecting constants shows
\[
\varphi_{\min}\beta\|\lambda_1-\lambda_2\|_2^2 - \|\lambda_1-\lambda_2\|_2^{3/2}\,V^2\log(V)\frac{\log^2 n}{\sqrt n} \;\le\; C_4\,V^2\log V\,\frac{\log^{5/2}n}{n},
\]
for some dimension independent constant $C_4$.
We can also lower bound the left side by recalling
\[
\|\lambda_1-\lambda_2\|_2 > V^2\log V\,\frac{\log n}{\sqrt n} \;\Longrightarrow\; \frac{n^{1/4}\|\lambda_1-\lambda_2\|_2^{1/2}}{\log^{1/2}(n)\,V^2\log(V)} > 1.
\]
Hence, inflating the second term on the left yields
\[
\varphi_{\min}\beta\|\lambda_1-\lambda_2\|_2^2 - \frac{\log^{3/2}n}{n^{1/4}}\|\lambda_1-\lambda_2\|_2^2 \;\le\; C_5\,V^2\log V\,\frac{\log^{5/2}n}{n}.
\]
For $n$ sufficiently large, the first term on the left is at least twice the second. Rearranging and taking square roots shows
\[
\|\lambda_1-\lambda_2\|_2 \;\le\; C_6\sqrt{V^2\log V}\,\frac{\log^{5/4}n}{\sqrt n}.
\]
Recalling that $V\ge 2$ proves the theorem. □
2.10.5 Pointwise Convergence Results

To prove our theorem, we require the uniform convergence of the in-sample optimism to its expectation and the uniform convergence of the VGC to its expectation. In this section, we first establish several pointwise convergence results to assist with this task. Our main workhorse will be Theorem 2.7.5, where $E_n$ defines the good set on which our random variables satisfy a bounded differences condition.
As a first step, we will show that the dual solutions converge (for a fixed $\theta$) to their expectations. In preparation, we first bound the behavior of the dual on the bad set.
Lemma 2.10.15 (Dual Solution Conditional Expectation Bound). Suppose Assumptions 2.3.1 and 2.3.6 both hold. Let
\[
E_{1,n} \equiv \Bigl\{ z : \|z\|_1 \le nC_\mu + \frac{2n}{\sqrt{\nu_{\min}}} \Bigr\}.
\]
Then, there exists a dimension independent constant $C$ such that
\[
\mathbb{E}\bigl[\lambda_i(Z,\theta)\,\mathbb{I}\{Z\in E_{1,n}^c\}\bigr] \le C\exp\Bigl(-\frac{n}{C}\Bigr).
\]
Proof. Proof of Lemma 2.10.15: We first bound
\begin{align*}
\mathbb{E}\bigl[\|Z\|_1\,\mathbb{I}\{Z\in E_{1,n}^c\}\bigr]
&= \int_0^\infty P\bigl(\|Z\|_1\,\mathbb{I}\{Z\in E_{1,n}^c\} \ge t\bigr)\,dt \\
&= \int_0^{nC_\mu + \frac{2n}{\sqrt{\nu_{\min}}}} P\Bigl(\|Z\|_1 > nC_\mu + \frac{2n}{\sqrt{\nu_{\min}}}\Bigr)\,dt + \int_{nC_\mu+\frac{2n}{\sqrt{\nu_{\min}}}}^\infty P(\|Z\|_1 \ge t)\,dt \\
&\le \Bigl(nC_\mu + \frac{2n}{\sqrt{\nu_{\min}}}\Bigr)e^{-n/32} + \int_{nC_\mu+\frac{2n}{\sqrt{\nu_{\min}}}}^\infty P(\|Z\|_1 \ge t)\,dt.
\end{align*}
By inspection, there exists a dimension independent constant $C_1$ such that the first term is at most $C_1 e^{-n/C_1}$.
To analyze the second term, recall $\|Z\|_1 \le nC_\mu + \sum_{j=1}^n |Z_j-\mu_j|$. Hence,
\begin{align*}
\int_{nC_\mu+\frac{2n}{\sqrt{\nu_{\min}}}}^\infty P(\|Z\|_1 \ge t)\,dt
&= \int_{nC_\mu+\frac{2n}{\sqrt{\nu_{\min}}}}^\infty P\Bigl(\frac{1}{n}\sum_{j=1}^n |Z_j-\mu_j| \ge \frac{t}{n} - C_\mu\Bigr)\,dt \\
&= \int_{nC_\mu+\frac{2n}{\sqrt{\nu_{\min}}}}^\infty P\Bigl(\frac{1}{n}\sum_{j=1}^n \bigl(|Z_j-\mu_j| - \mathbb{E}|Z_j-\mu_j|\bigr) \ge \frac{t}{n} - C_\mu - \frac{1}{\sqrt{\nu_{\min}}}\Bigr)\,dt,
\end{align*}
since $\mathbb{E}|Z_j-\mu_j| \le \frac{1}{\sqrt{\nu_{\min}}}$ by Jensen's inequality. Now make the change of variables $s = \frac{t}{n} - C_\mu - \frac{1}{\sqrt{\nu_{\min}}}$ to obtain
\[
n\int_{\nu_{\min}^{-1/2}}^\infty P\Bigl(\frac{1}{n}\sum_{j=1}^n \bigl(|Z_j-\mu_j| - \mathbb{E}|Z_j-\mu_j|\bigr) \ge s\Bigr)\,ds \;\le\; n\int_{\nu_{\min}^{-1/2}}^\infty e^{-\frac{s^2\nu_{\min}n}{32}}\,ds,
\]
because $|Z_j-\mu_j| - \mathbb{E}|Z_j-\mu_j|$ is a mean-zero, sub-Gaussian random variable with variance proxy at most $\frac{16}{\nu_{\min}}$. (See Lemma 2.10.3 for clarification.) Making another change of variables proves this last integral is equal to
\[
\frac{4n}{\sqrt{\nu_{\min}}}\int_{\sqrt n/4}^\infty e^{-t^2/2}\,dt \;\le\; \frac{16n}{\sqrt{\nu_{\min}}}\,e^{-\frac{n}{32}}.
\]
This value is also at most $C_2e^{-n/C_2}$ for some constant $C_2$.
In summary, we have shown that there exists a dimension independent constant $C_3$ such that
\[
\mathbb{E}\bigl[\|Z\|_1\,\mathbb{I}\{Z\in E_{1,n}^c\}\bigr] \le C_3 e^{-n/C_3}.
\]
Now to prove the lemma, recall by Lemma 2.10.2,
\begin{equation}
\lambda_i(Z,\theta) \le \|\lambda(Z,\theta)\|_1 \le \frac{2}{s_0 n}\|r(Z,\theta)\|_1 \le \frac{2}{s_0 n}\bigl(a_{\max}\|Z\|_1 + b_{\max}n\bigr), \tag{2.62}
\end{equation}
where the second inequality holds by Assumption 2.3.6. Multiplying by $\mathbb{I}\{Z\in E_{1,n}^c\}$ and taking expectations shows
\[
\mathbb{E}\bigl[\lambda_i(Z,\theta)\,\mathbb{I}\{Z\in E_{1,n}^c\}\bigr] \;\le\; \frac{C_4}{n}e^{-n/C_4} + C_4\,P\bigl(Z\in E_{1,n}^c\bigr) \;\le\; \frac{C_4}{n}e^{-n/C_4} + C_4\,e^{-n/32},
\]
by Lemma 2.10.3. Collecting constants proves the lemma. □
We now use Theorem 2.7.5 to prove that the dual solution concentrates at its expectation for any fixed $\theta\in\Theta$.

Lemma 2.10.16 (Pointwise Convergence of the Dual Solution). Fix some $\theta\in\Theta$ and $i = 1,\dots,m$. Under Assumptions 2.3.1, 2.3.6 and 2.4.6, there exist dimension independent constants $C$ and $n_0$ such that for all $n \ge n_0e^R$, the following holds with probability $1-\exp(-R)$:
\[
\bigl|\lambda_i(Z,\theta) - \mathbb{E}[\lambda_i(Z,\theta)]\bigr| \;\le\; C\,V^3\log^2 V\,\frac{\log^4 n}{\sqrt n}\sqrt R.
\]
Proof. Proof of Lemma 2.10.16: The proof will use the dual stability condition (Lemma 2.10.12) to apply Theorem 2.7.5. Since $\theta$ is fixed throughout, we drop it from the notation.
By the triangle inequality,
\begin{equation}
|\lambda_i(Z) - \mathbb{E}[\lambda_i(Z)]| \;\le\; \underbrace{|\lambda_i(Z) - \mathbb{E}[\lambda_i(Z)\mid Z\in E_n]|}_{(a)} + \underbrace{|\mathbb{E}[\lambda_i(Z)\mid Z\in E_n] - \mathbb{E}[\lambda_i(Z)]|}_{(b)}. \tag{2.63}
\end{equation}
We first bound (b) by a term that is $O(1/n)$. We see that
\[
\bigl|\mathbb{E}[\lambda_i(Z)\mid Z\in E_n] - \mathbb{E}[\lambda_i(Z)]\bigr| = \bigl|\mathbb{E}[\lambda_i(Z)\mid Z\in E_n] - \mathbb{E}[\lambda_i(Z)\mid Z\notin E_n]\bigr|\,P\{Z\notin E_n\} \;\le\; \frac{C_1}{n} + \mathbb{E}\bigl[\lambda_i(Z)\,\mathbb{I}\{Z\notin E_n\}\bigr],
\]
where we used Lemmas 2.10.10 and 2.10.11 to bound the first term, and $C_1$ is a dimension independent constant. To bound the second term, define the set
\[
E_0 \equiv \Bigl\{ z : \|z\|_1 \le nC_\mu + \frac{2n}{\sqrt{\nu_{\min}}} \Bigr\}.
\]
Notice $E_n \subseteq E_0$. Then write
\begin{align*}
\mathbb{E}\bigl[\lambda_i(Z)\,\mathbb{I}\{Z\notin E_n\}\bigr] &= \mathbb{E}\bigl[\lambda_i(Z)\bigl(\mathbb{I}\{Z\in E_0^c\} + \mathbb{I}\{Z\in E_0\cap E_n^c\}\bigr)\bigr] \\
&\le C_2\exp\Bigl(-\frac{n}{C_2}\Bigr) + \lambda_{\max}P\{Z\in E_n^c\} &&\text{(Lemmas 2.10.11 and 2.10.15)}\\
&\le C_3\exp\Bigl(-\frac{n}{C_3}\Bigr) + \frac{C_2}{n} &&\text{(Lemma 2.10.10)},
\end{align*}
for dimension independent constants $C_2$ and $C_3$.
Collecting terms shows that there exists a dimension independent constant $C_3$ such that for $n$ sufficiently large,
\[
\text{Term (b) of Eq. (2.63)} = \bigl|\mathbb{E}[\lambda_i(Z)\mid Z\in E_n] - \mathbb{E}[\lambda_i(Z)]\bigr| \le \frac{C_3}{n}.
\]
We now bound Term (a) by leveraging Theorem 2.7.5. First note that for any $Z,\bar Z\in E_n$, we have
\[
\bigl|\lambda_i(Z) - \lambda_i(\bar Z)\bigr| = \sqrt{\bigl(\lambda_i(Z)-\lambda_i(\bar Z)\bigr)^2} \le \sqrt{\sum_{i=1}^m \bigl(\lambda_i(Z)-\lambda_i(\bar Z)\bigr)^2} = \bigl\|\lambda(Z)-\lambda(\bar Z)\bigr\|_2.
\]
Thus, by Lemma 2.10.12, we see that
\[
\bigl|\lambda_i(Z) - \lambda_i(\bar Z)\bigr| \le C_4\,V^3\log^2 V\cdot\frac{\log^4(n)}{n}\sum_{j=1}^n \mathbb{I}\{Z_j\neq\bar Z_j\},
\]
and, hence, $\lambda_i(\cdot)$ satisfies the bounded differences condition on $E_n$. By Lemma 2.10.10, $P\{Z\notin E_n\} \le \frac{C_5}{n}$.
By the assumptions, $n > 4C_5e^R \implies e^{-R} > \frac{2C_5}{n}$. Theorem 2.7.5 then shows that with probability at least $1-e^{-R}$,
\begin{align*}
\bigl|\lambda_i(Z) - \mathbb{E}[\lambda_i(Z)\mid Z\in E_n]\bigr| &\le C_5C_4\,V^3\log^2(V)\frac{\log^4 n}{n} + C_4\,V^3\log^2(V)\frac{\log^4 n}{\sqrt n}\sqrt{\log\frac{2}{e^{-R}-2C_5/n}} \\
&\le C_6\,V^3\log^2(V)\frac{\log^4 n}{\sqrt n}\sqrt{\log\frac{2}{e^{-R}-2C_5/n}}
\le C_6\,V^3\log^2(V)\frac{\log^4 n}{\sqrt n}\sqrt{\log\frac{4}{e^{-R}}}
\le C_7\,V^3\log^2(V)\frac{\log^4 n}{\sqrt n}\sqrt R,
\end{align*}
where the third inequality again uses $n > 4C_5e^R$, and the remaining inequalities simply collect constants and dominant terms.
To summarize, substituting the two bounds into the upper bound of $|\lambda_i(Z)-\mathbb{E}[\lambda_i(Z)]|$ in Eq. (2.63), we obtain that with probability at least $1-e^{-R}$,
\[
|\lambda_i(Z)-\mathbb{E}[\lambda_i(Z)]| \;\le\; C_8\,V^3\log^2(V)\frac{\log^4 n}{\sqrt n}\sqrt R + \frac{C_8}{n}.
\]
Collecting terms completes the proof. □
Proof. Proof of Lemma 2.4.9: Since $\theta$ is fixed, we drop it from the notation. By the triangle inequality,
\begin{equation}
|D(Z)-\mathbb{E}[D(Z)]| \;\le\; |D(Z)-\mathbb{E}[D(Z)\mid Z\in E_n]| + |\mathbb{E}[D(Z)\mid Z\in E_n]-\mathbb{E}[D(Z)]|. \tag{2.64}
\end{equation}
We bound the latter term first. Since $D(Z)$ is bounded by Lemma 2.3.8, we see
\[
|\mathbb{E}[D(Z)\mid Z\in E_n]-\mathbb{E}[D(Z)]| = \bigl|\mathbb{E}[D(Z)\mid Z\in E_n]-\mathbb{E}[D(Z)\mid Z\notin E_n]\bigr|\,P\{Z\notin E_n\} \;\le\; \frac{C_1}{n\sqrt h}
\]
for some dimension independent constant $C_1$, by using Lemma 2.10.10.
We now bound the first term using Theorem 2.7.5. Recall from Lemma 2.10.13 that
\[
\bigl|D(Z)-D(\bar Z)\bigr| \le C_2\log^4 n\cdot V^3\log^2 V\cdot\frac{1}{h}\sum_{j=1}^n \mathbb{I}\{Z_j\neq\bar Z_j\}
\]
for $Z,\bar Z\in E_n$, and from Lemma 2.10.10, $P\{Z\notin E_n\} \le \frac{C_1}{n}$. Finally, if $n > 4C_1e^R$, then $2C_1/n < \frac{1}{2}e^{-R}$, and we have by Theorem 2.7.5 that
\begin{align*}
|D(Z)-\mathbb{E}[D(Z)\mid Z\in E_n]| &\le C_3\,V^3\log^2(V)\frac{\log^4(n)}{h} + C_3\,V^3\log^2(V)\log^4(n)\frac{\sqrt n}{h}\sqrt{\log\frac{2}{e^{-R}-2C_1/n}} \\
&\le C_4\,V^3\log^2(V)\log^4(n)\frac{\sqrt n}{h}\sqrt{\log\frac{2}{e^{-R}-2C_1/n}}
\le C_5\,V^3\log^2(V)\log^4(n)\frac{\sqrt n}{h}\sqrt R,
\end{align*}
where the last line again uses $n > 4C_1e^R$.
Returning to the initial upper bound Eq. (2.64), we apply our two bounds to see
\[
|D(Z)-\mathbb{E}[D(Z)]| \;\le\; C_6\,V^3\log^2(V)\log^4(n)\frac{\sqrt n}{h}\sqrt R + \frac{C_6}{n\sqrt h}.
\]
By Assumption 2.3.1, $h < 1 < n$ implies that $\frac{\sqrt n}{h} \ge \frac{1}{n\sqrt h}$. Hence, collecting dominant terms completes the proof. □
2.10.6 Uniform Convergence of Dual Solutions

The goal of this section is to extend our previous pointwise results to uniform results over all $\theta\in\Theta$. Let $\bar\Theta$ be a minimal $\frac{1}{n}$-covering of $\Theta$. Then, for every $\theta\in\Theta$ there exists $\bar\theta\in\bar\Theta$ such that $\|\theta-\bar\theta\|_2 \le \frac{1}{n}$.

Lemma 2.10.17 (Uniform Convergence of the Dual Solution). Under the assumptions of Theorem 2.4.7, there exist dimension independent constants $C$ and $n_0$ such that for any $R > 1$ and any $n \ge n_0e^R$, the following holds with probability $1-2e^{-R}$:
\[
\sup_{\theta\in\Theta}\ \|\lambda(Z,\theta)-\mathbb{E}[\lambda(Z,\theta)]\|_\infty \;\le\; C\,V^2\log^2 V\,\log m\,\sqrt{R\log N\Bigl(\frac{1}{n},\Theta\Bigr)}\;\frac{\log^4 n}{\sqrt n}.
\]
Proof. Proof of Lemma 2.10.17: By the triangle inequality,
\[
\sup_{\theta\in\Theta}\|\lambda(Z,\theta)-\mathbb{E}[\lambda(Z,\theta)]\|_\infty \;\le\; \underbrace{\sup_{\theta\in\Theta}\|\lambda(Z,\theta)-\lambda(Z,\bar\theta)\|_\infty}_{(a)} + \underbrace{\sup_{\theta\in\Theta}\bigl\|\mathbb{E}\bigl[\lambda(Z,\theta)-\lambda(Z,\bar\theta)\bigr]\bigr\|_\infty}_{(b)} + \underbrace{\sup_{\bar\theta\in\bar\Theta}\bigl\|\lambda(Z,\bar\theta)-\mathbb{E}\bigl[\lambda(Z,\bar\theta)\bigr]\bigr\|_\infty}_{(c)}.
\]
We bound each term separately.
First we bound Term (a). If $Z\in E_n$, then, from Lemma 2.10.14, and bounding the $\ell_\infty$-norm by the $\ell_2$-norm,
\[
\sup_{\theta\in\Theta}\|\lambda(Z,\theta)-\lambda(Z,\bar\theta)\|_\infty \;\le\; C_1\,V^3\log^2 V\,\frac{\log^{5/4}n}{\sqrt n}
\]
for some dimension independent constant $C_1$. By Lemma 2.10.10, this occurs with probability at least $1-4/n$.
Next, we bound (b). Telescoping the expectation as before, we have for any $i=1,\dots,m$ that
\[
\mathbb{E}\bigl[\lambda_i(Z,\theta)-\lambda_i(Z,\bar\theta)\bigr] = \mathbb{E}\bigl[\lambda_i(Z,\theta)-\lambda_i(Z,\bar\theta)\,\big|\,Z\in E_n\bigr]P\{Z\in E_n\} + \mathbb{E}\bigl[\lambda_i(Z,\theta)-\lambda_i(Z,\bar\theta)\,\big|\,Z\notin E_n\bigr]P\{Z\notin E_n\}.
\]
We can bound the first term using Lemmas 2.10.10 and 2.10.14. To bound the second term, define the set
\[
E_{1,n} \equiv \Bigl\{ z : \|z\|_1 \le nC_\mu + \frac{2n}{\sqrt{\nu_{\min}}} \Bigr\},
\]
and recall that $E_n\subseteq E_{1,n}$. Observe that
\begin{align*}
\mathbb{E}\bigl[\lambda_i(Z,\theta)\,\big|\,Z\notin E_n\bigr]P(Z\notin E_n) &= \mathbb{E}\bigl[\lambda_i(Z,\theta)\,\mathbb{I}\{Z\notin E_n\}\bigr] \\
&\le \mathbb{E}\bigl[\lambda_i(Z,\theta)\bigl(\mathbb{I}\{Z\notin E_{1,n}\} + \mathbb{I}\{Z\in E_{1,n},\,Z\notin E_n\}\bigr)\bigr] \\
&\le \mathbb{E}\bigl[\lambda_i(Z,\theta)\,\mathbb{I}\{Z\notin E_{1,n}\}\bigr] + \lambda_{\max}P(Z\notin E_n) &&\text{(Lemma 2.10.11)}\\
&\le C_2\exp\Bigl(-\frac{n}{C_2}\Bigr) + \frac{C_2\lambda_{\max}}{n} &&\text{(Lemmas 2.10.10 and 2.10.15)},
\end{align*}
for some dimension independent constant $C_2$. Combining these observations shows that
\[
\mathbb{E}\bigl[\lambda_i(Z,\theta)-\lambda_i(Z,\bar\theta)\bigr] \;\le\; C_3\,V^3\log^2 V\,\frac{\log^{5/4}n}{\sqrt n} + C_3\exp\Bigl(-\frac{n}{C_3}\Bigr) + \frac{C_3\lambda_{\max}}{n} \;\le\; C_4\,V^3\log^2 V\,\frac{\log^{5/4}n}{\sqrt n},
\]
where $C_3$ and $C_4$ are dimension independent constants. Taking the supremum over $\theta\in\Theta$ and over $i=1,\dots,m$ bounds Term (b).
Finally, we bound Term (c). We see that
\[
\sup_{\bar\theta\in\bar\Theta,\,1\le i\le m}\bigl|\lambda_i(Z,\bar\theta)-\mathbb{E}\bigl[\lambda_i(Z,\bar\theta)\bigr]\bigr| \;\le\; C_5\,V^3\log^2(V)\frac{\log^4 n}{\sqrt n}\sqrt{R\log\Bigl(m\cdot N\Bigl(\frac{1}{n},\Theta\Bigr)\Bigr)}
\]
by applying Lemma 2.10.16 and taking a union bound over the $|\bar\Theta| \le N\bigl(\frac{1}{n},\Theta\bigr)$ elements of $\bar\Theta$ and the $m$ choices of $i$.
Taking a union bound over the probabilities that the bounds on Terms (a) and (c) hold and adding Term (b) shows that there exists a dimension independent constant $C$ such that with probability $1-e^{-R}-4/n$,
\[
\sup_{\theta\in\Theta}\|\lambda(Z,\theta)-\mathbb{E}[\lambda(Z,\theta)]\|_\infty \;\le\; C\,V^3\log^2(V)\log m\,\sqrt{R\log N\Bigl(\frac{1}{n},\Theta\Bigr)}\,\frac{\log^4 n}{\sqrt n}.
\]
Finally, note that if $n > 4e^R$, this last probability is at least $1-2e^{-R}$, completing the proof. □
2.10.7 Uniform Convergence of In-Sample Optimism

In this section, we construct a high-probability bound for
\[
\sup_{\theta\in\Theta}\ \Bigl|\frac{1}{n}\sum_{j=1}^n \Bigl(\xi_jx_j(Z,\theta)-\mathbb{E}\bigl[\xi_jx_j(Z,\theta)\bigr]\Bigr)\Bigr|,
\]
where we recall that $Z = \mu+\xi$. Note that for convenience we have scaled this quantity by $\frac{1}{n}$.
Constructing the bound requires decomposing the in-sample optimism into several sub-components. We outline the sub-components by providing the proof of Lemma 2.10.18 first. For convenience, in this section only, we use the notation $\lambda(\theta) \equiv \mathbb{E}[\lambda(Z,\theta)]$ as shorthand.

Lemma 2.10.18 (Uniform In-Sample Optimism for Coupling Constraints). Let $N(\varepsilon,\Theta)$ be the $\varepsilon$-covering number of $\Theta$. Under the assumptions of Theorem 2.4.7, there exist dimension independent constants $C$ and $n_0$ such that for any $R > 1$ and $n \ge n_0e^R$, the following holds with probability $1-6\exp(-R)$:
\[
\sup_{\theta\in\Theta}\ \Bigl|\frac{1}{n}\sum_{j=1}^n \Bigl(\xi_jx_j(Z;\theta)-\mathbb{E}\bigl[\xi_jx_j(Z;\theta)\bigr]\Bigr)\Bigr| \;\le\; C\,V^3\log^3 V\,\sqrt{\log N\Bigl(\frac{1}{n},\Theta\Bigr)}\cdot\frac{R\log^4(n)}{\sqrt n}.
\]
Proof. Proof of Lemma 2.10.18: By the triangle inequality, we see
\begin{align*}
\sup_{\theta\in\Theta}\ \Bigl|\frac{1}{n}\sum_{j=1}^n \Bigl(\xi_jx_j(Z,\theta)-\mathbb{E}\bigl[\xi_jx_j(Z,\theta)\bigr]\Bigr)\Bigr|
&\le \underbrace{\sup_{\theta\in\Theta}\ \Bigl|\frac{1}{n}\sum_{j=1}^n \xi_j\Bigl(x_j(Z,\theta)-\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(Z,\theta)\bigr\}\Bigr)\Bigr|}_{\text{Rounding Error}} \\
&\quad + \underbrace{\sup_{\theta\in\Theta}\ \Bigl|\frac{1}{n}\sum_{j=1}^n \xi_j\Bigl(\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(Z,\theta)\bigr\}-\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(\theta)\bigr\}\Bigr)\Bigr|}_{\text{Dual Approximation Error}} \\
&\quad + \underbrace{\sup_{\theta\in\Theta}\ \Bigl|\frac{1}{n}\sum_{j=1}^n \Bigl(\xi_j\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(\theta)\bigr\}-\mathbb{E}\bigl[\xi_j\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(\theta)\bigr\}\bigr]\Bigr)\Bigr|}_{\text{ULLN for Dual Approximation}} \\
&\quad + \underbrace{\sup_{\theta\in\Theta}\ \Bigl|\frac{1}{n}\sum_{j=1}^n \mathbb{E}\Bigl[\xi_j\Bigl(\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(\theta)\bigr\}-\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(Z,\theta)\bigr\}\Bigr)\Bigr]\Bigr|}_{\text{Expected Dual Approximation Error}} \\
&\quad + \underbrace{\sup_{\theta\in\Theta}\ \Bigl|\frac{1}{n}\sum_{j=1}^n \mathbb{E}\Bigl[\xi_j\Bigl(\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(Z,\theta)\bigr\}-x_j(Z,\theta)\Bigr)\Bigr]\Bigr|}_{\text{Expected Rounding Error}}.
\end{align*}
For the Rounding and Expected Rounding Errors, we have
\[
\Bigl|\sum_{j=1}^n \xi_j\Bigl(x_j(Z,\theta)-\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(Z,\theta)\bigr\}\Bigr)\Bigr| \;\le\; \|\xi\|_\infty\,\Bigl\|\bigl(x_j(Z,\theta)-\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(Z,\theta)\bigr\}\bigr)_{j=1}^n\Bigr\|_1 \;\le\; \|\xi\|_\infty\,m,
\]
where the first inequality follows from Hölder's inequality and the second inequality holds by complementary slackness. Note
\[
P\{\|\xi\|_\infty \ge t\} \le \sum_{j=1}^n P\bigl(|\xi_j| \ge t\bigr) \le \sum_{j=1}^n 2\exp\Bigl(-\frac{\nu_{\min}t^2}{2}\Bigr).
\]
Moreover, $\mathbb{E}[\|\xi\|_\infty] \le C_1\sqrt{\log n}$ for some dimension independent constant $C_1$. Thus, with probability at least $1-e^{-R}$, we have
\[
\text{Rounding Error} + \text{Expected Rounding Error} \;\le\; \|\xi\|_\infty\frac{m}{n} + \mathbb{E}[\|\xi\|_\infty]\frac{m}{n} = \frac{m}{n}\bigl(\|\xi\|_\infty + \mathbb{E}[\|\xi\|_\infty]\bigr) \le \frac{C_2m}{n}\sqrt{R\log n},
\]
for some dimension independent constant $C_2$.
We bound the Dual Approximation Error terms in Lemma 2.10.20 below, using our uniform bounds on the dual solutions from Lemma 2.10.17, proving that with probability at least $1-4e^{-R}$,
\begin{align}
\text{Dual Approximation Error} + \text{Expected Dual Approximation Error} \tag{2.65}\\
\le C_3R\sqrt{\frac{V}{n}} + C_3\,V^2\log^2 V\log(m)\cdot\sqrt{R\log N\Bigl(\frac{1}{n},\Theta\Bigr)}\cdot\frac{\log^4(n)}{\sqrt n} \tag{2.66}\\
\le C_4\,V^2\log^2 V\log(m)\,R\cdot\sqrt{\log N\Bigl(\frac{1}{n},\Theta\Bigr)}\cdot\frac{\log^4(n)}{\sqrt n} \tag{2.67}
\end{align}
for some dimension independent constants $C_3$ and $C_4$.
We bound the ULLN for Dual Approximation term in Lemma 2.10.19 below to prove that
\[
\text{ULLN for Dual Approximation} \;\le\; C_5R\sqrt{\frac{V}{n}}
\]
with probability $1-\exp(-R)$.
Taking a union bound over all probabilities and summing all bounds yields the result. □
Lemma 2.10.19 (ULLN for Dual Approximation). Under Assumptions 2.3.1 and 2.4.6, there exists a dimension independent constant $C$ such that for any $R > 1$, with probability at least $1-e^{-R}$,
\[
\sup_{\theta\in\Theta}\ \Bigl|\sum_{j=1}^n \Bigl(\xi_j\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(\theta)\bigr\}-\mathbb{E}\bigl[\xi_j\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(\theta)\bigr\}\bigr]\Bigr)\Bigr| \;\le\; C\,R\,\sqrt{Vn}.
\]
Proof. Proof of Lemma 2.10.19: We first note that
\[
\sup_{\theta\in\Theta}\ \Bigl|\sum_{j=1}^n \Bigl(\xi_j\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(\theta)\bigr\}-\mathbb{E}\bigl[\xi_j\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(\theta)\bigr\}\bigr]\Bigr)\Bigr|
\;\le\;
\sup_{\theta\in\Theta,\,\lambda\in\mathbb{R}^m}\ \Bigl|\sum_{j=1}^n \Bigl(\xi_j\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda\bigr\}-\mathbb{E}\bigl[\xi_j\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda\bigr\}\bigr]\Bigr)\Bigr|,
\]
and the summation is a sum of centered independent random variables. We will apply Theorem 2.7.1 to the last expression. Specifically, we consider the envelope $F(Z) = (|\xi_j|)_{j=1}^n$. Then, we have
\[
\bigl\|\,\|F(Z)\|_2\,\bigr\|_\Psi \overset{(a)}{\le} \Bigl\|\frac{\|\zeta\|_2}{\sqrt{\nu_{\min}}}\Bigr\|_\Psi \overset{(b)}{\le} \sqrt{\frac{2n}{\nu_{\min}}} = C_1\sqrt n
\]
for some dimension independent constant $C_1$. Inequality (a) holds by letting $\zeta_j = \sqrt{\nu_j}\,\xi_j$, and (b) holds by Lemma A.1 iv) of GR 2021.
Next,
\[
\Bigl|\Bigl\{ \bigl(\xi_j\mathbb{I}\{r_j(Z_j,\theta)\ge A_j^\top\lambda\}\bigr)_{j=1}^n : \theta\in\Theta,\ \lambda\in\mathbb{R}^m \Bigr\}\Bigr| \;\le\; \Bigl|\Bigl\{ \bigl(\mathbb{I}\{r_j(Z_j,\theta)\ge A_j^\top\lambda\}\bigr)_{j=1}^n : \theta\in\Theta,\ \lambda\in\mathbb{R}^m \Bigr\}\Bigr|,
\]
and by Assumption 2.4.6, the latter set has VC-dimension $V$ and hence cardinality at most $2^V$. Thus, we see that with probability $1-e^{-R}$,
\[
\sup_{\theta\in\Theta,\,\lambda\in\mathbb{R}^m}\ \Bigl|\sum_{j=1}^n \Bigl(\xi_j\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda\bigr\}-\mathbb{E}\bigl[\xi_j\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda\bigr\}\bigr]\Bigr)\Bigr| \;\le\; C_2\,R\,\sqrt{Vn}
\]
for some absolute constant $C_2$. □
We next provide bounds for the Dual Approximation Error terms in the proof of Lemma 2.10.18.

Lemma 2.10.20 (Dual Approximation Error). Assume Assumptions 2.3.1 and 2.4.6 hold. Then, there exist dimension independent constants $C$ and $n_0$ such that for any $R > 1$ and $n > n_0e^R$, we have with probability at least $1-4e^{-R}$ that the following two inequalities hold simultaneously:
\begin{align*}
\sup_{\theta\in\Theta}\ \Bigl|\sum_{j=1}^n \xi_j\Bigl(\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(Z,\theta)\bigr\}-\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(\theta)\bigr\}\Bigr)\Bigr| &\le C\,R\sqrt{Vn}, \\
\sup_{\theta\in\Theta}\ \Bigl|\sum_{j=1}^n \mathbb{E}\Bigl[\xi_j\Bigl(\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(Z,\theta)\bigr\}-\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(\theta)\bigr\}\Bigr)\Bigr]\Bigr| &\le C\,V^2\log^3(V)\cdot\sqrt{R\log N\Bigl(\frac{1}{n},\Theta\Bigr)}\cdot\log^4(n)\,\sqrt n.
\end{align*}
Proof. Proof of Lemma 2.10.20: First observe that under the conditions of the theorem, Lemma 2.10.17 implies that for some dimension independent constant $C_1$, with probability at least $1-2e^{-R}$,
\[
\sup_{\theta\in\Theta}\ \|\lambda(Z,\theta)-\lambda(\theta)\|_2 \;\le\; \underbrace{C_1\,V^2\log^2(V)\log(m)\cdot\sqrt{R\log N\Bigl(\frac{1}{n},\Theta\Bigr)}\cdot\frac{\log^4(n)}{\sqrt n}}_{\equiv\,\delta},
\]
where we have used the fact that the $\ell_2$-norm is at most $\sqrt m$ times the $\ell_\infty$-norm. Define the right side to be the constant $\delta$ as indicated.
We will restrict attention to the events where both the above inequality holds and also $Z\in E_n$. By the union bound and Lemma 2.10.10, this event happens with probability at least $1-2e^{-R}-4/n$. For $n > 4e^R$, this probability is at least $1-3e^{-R}$.
Now write
\begin{align}
\Bigl|\sum_{j=1}^n \xi_j\Bigl(\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(Z,\theta)\bigr\}-\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(\theta)\bigr\}\Bigr)\Bigr|
&\le \sum_{j=1}^n |\xi_j|\,\Bigl|\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(Z,\theta)\bigr\}-\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(\theta)\bigr\}\Bigr| \tag{2.68}\\
&\le \sum_{j=1}^n |\xi_j|\,\mathbb{I}\bigl\{r_j(Z_j,\theta)-A_j^\top\lambda(\theta) \in [-C_A\delta,\,C_A\delta]\bigr\},\nonumber
\end{align}
because $\|\lambda(Z,\theta)-\lambda(\theta)\|_2 \le \delta$. Furthermore, when the indicator is non-zero, we can bound
\begin{equation}
|\xi_j| \;\le\; \frac{1}{a_{\min}}\bigl(C_A\lambda_{\max} + C_A\delta + b_{\max}\bigr) + C_\mu \;\le\; C_2(1+\delta), \tag{2.69}
\end{equation}
for some dimension independent $C_2$.
By Lemma 2.10.2, $\|\lambda(\theta)\|_1 \le \frac{2}{ns_0}\mathbb{E}\|r(Z,\theta)\|_1 \le \lambda_{\max}$, and thus we can upper bound
\begin{align*}
&\sup_{\theta\in\Theta}\ \Bigl|\sum_{j=1}^n \xi_j\Bigl(\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(Z,\theta)\bigr\}-\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(\theta)\bigr\}\Bigr)\Bigr| \\
&\quad\le C_2(1+\delta)\underbrace{\sup_{\theta\in\Theta,\,\|\lambda\|_1\le\lambda_{\max}}\ \Bigl|\sum_{j=1}^n \Bigl(\mathbb{I}\bigl\{r_j(Z_j,\theta)-A_j^\top\lambda \in [-C_A\delta,C_A\delta]\bigr\}-P\bigl\{r_j(Z_j,\theta)-A_j^\top\lambda \in [-C_A\delta,C_A\delta]\bigr\}\Bigr)\Bigr|}_{(i)} \\
&\qquad + C_2(1+\delta)\underbrace{\sup_{\theta\in\Theta}\ P\bigl\{r_j(Z_j,\theta)-A_j^\top\lambda(\theta) \in [-C_A\delta,C_A\delta]\bigr\}}_{(ii)}.
\end{align*}
To bound the supremum (i), we will apply Theorem 2.7.1. Note that the vector $e$ is a valid envelope. To bound the cardinality of
\[
\mathcal{F} \equiv \Bigl\{ \bigl(\mathbb{I}\bigl\{r_j(Z_j,\theta)-A_j^\top\lambda \in [-C_A\delta,C_A\delta]\bigr\}\bigr)_{j=1}^n : \lambda\in\mathbb{R}^m_+,\ \theta\in\Theta \Bigr\},
\]
consider the two sets
\begin{align*}
\mathcal{F}_1 &\equiv \Bigl\{ \bigl(\mathbb{I}\bigl\{r_j(Z_j,\theta)-A_j^\top\lambda \ge -C_A\delta\bigr\}\bigr)_{j=1}^n : \lambda\in\mathbb{R}^m_+,\ \theta\in\Theta \Bigr\}, \\
\mathcal{F}_2 &\equiv \Bigl\{ \bigl(\mathbb{I}\bigl\{r_j(Z_j,\theta)-A_j^\top\lambda \le C_A\delta\bigr\}\bigr)_{j=1}^n : \lambda\in\mathbb{R}^m_+,\ \theta\in\Theta \Bigr\}.
\end{align*}
Under Assumption 2.4.6, both sets have pseudo-dimension at most $V$. Furthermore, $\mathcal{F} = \mathcal{F}_1\wedge\mathcal{F}_2$. Hence, by Pollard, 1990, there exists an absolute constant $C_3$ such that the pseudo-dimension of $\mathcal{F}$ is at most $C_3V$, and hence its cardinality is at most $n^{C_3V}$.
Thus, applying Theorem 2.7.1 shows that there exists a constant $C_4$ such that with probability at least $1-e^{-R}$,
\[
\text{Term (i)} \le C_4(1+\delta)\,R\sqrt{Vn\log n}.
\]
To evaluate Term (ii), we recognize it as the probability that a Gaussian random variable lives in an interval of length $2C_A\delta$. Upper bounding the Gaussian density by its value at the mean shows
\begin{equation}
P\bigl\{r_j(Z_j,\theta)-A_j^\top\lambda \in [-C_A\delta,C_A\delta]\bigr\} \le 2C_A\sqrt{\frac{\nu_{\max}}{2\pi}}\,\delta \le C_5\delta. \tag{2.70}
\end{equation}
Thus,
\[
\text{Term (ii)} \le C_6(1+\delta)\delta.
\]
Combining our bounds, we see that with probability at least $1-4e^{-R}$,
\[
\sup_{\theta\in\Theta}\ \Bigl|\sum_{j=1}^n \xi_j\Bigl(\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(Z,\theta)\bigr\}-\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(\theta)\bigr\}\Bigr)\Bigr| \;\le\; C_7(1+\delta)R\sqrt{Vn} + C_7(1+\delta)\delta \;\le\; C_7R\sqrt{Vn},
\]
by substituting the value of $\delta$ and retaining only the dominant terms. This proves the first result of the lemma.
To prove the second result of the lemma, note that
\begin{align*}
\Bigl|\sum_{j=1}^n \mathbb{E}\Bigl[\xi_j\Bigl(\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(Z,\theta)\bigr\}-\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(\theta)\bigr\}\Bigr)\Bigr]\Bigr|
&\le \sum_{j=1}^n \mathbb{E}\Bigl[|\xi_j|\,\Bigl|\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(Z,\theta)\bigr\}-\mathbb{I}\bigl\{r_j(Z_j,\theta)\ge A_j^\top\lambda(\theta)\bigr\}\Bigr|\Bigr] \\
&\le C_2(1+\delta)\sum_{j=1}^n P\bigl(r_j(Z_j,\theta)-A_j^\top\lambda(\theta) \in [-C_A\delta,C_A\delta]\bigr) \\
&\le nC_8(1+\delta)\delta,
\end{align*}
where the second inequality uses the bound on $|\xi_j|$ (Eq. (2.69)) and the last inequality follows from the argument leading to Eq. (2.70) above. Substituting the value of $\delta$, using the assumption that $V\ge m$, and retaining only the dominant terms completes the proof. □
2.10.8 Uniform Convergence of VGC

Lemma 2.10.21 (Uniform VGC for Coupling Constraints). Let $N(\varepsilon,\Theta)$ be the $\varepsilon$-covering number of $\Theta$ and suppose the assumptions of Theorem 2.4.7 hold. There exist dimension independent constants $C$ and $n_0$ such that for $n \ge n_0e^R$ the following holds with probability $1-Ce^{-R}$:
\[
\sup_{\theta\in\Theta}\ |D(Z,\theta)-\mathbb{E}[D(Z,\theta)]| \;\le\; C\cdot V^2\log^2 V\cdot\sqrt{R\cdot n\log\bigl(n\cdot N(n^{-3/2},\Theta)\bigr)}\cdot\frac{\log^4 n}{h_{\min}}.
\]
Proof. Proof of Lemma 2.10.21: We follow a similar strategy to Lemma 2.9.4 and again consider the full-notation version of the VGC, $D(Z,(\theta,h))$, and take the supremum over $\theta\in\Theta$ and $h\in\mathcal{H}$, where $\mathcal{H} \equiv [h_{\min},h_{\max}]$. Let $\Theta_0$ be a minimal $n^{-3/2}$-covering of $\Theta$. In particular, for any $\theta\in\Theta$ there exists a $\bar\theta\in\Theta_0$ such that $\|\bar\theta-\theta\|_2 \le n^{-3/2}$. Similarly, let $\bar{\mathcal H}$ be an $n^{-1}$-covering of $\mathcal H$.
By telescoping,
\begin{align}
\sup_{\theta\in\Theta,\,h\in\mathcal H}\ \bigl|D(Z,(\theta,h))-\mathbb{E}\bigl[D(Z,(\theta,h))\bigr]\bigr|
&\le \underbrace{\sup_{\bar\theta\in\Theta_0,\,\bar h\in\bar{\mathcal H}}\ \bigl|D(Z,(\bar\theta,\bar h))-\mathbb{E}\bigl[D(Z,(\bar\theta,\bar h))\bigr]\bigr|}_{(i)} \tag{2.71}\\
&\quad + \underbrace{\sup_{\substack{\theta,\bar\theta:\,\|\theta-\bar\theta\|\le n^{-3/2}\\ h\in\mathcal H}}\ \bigl|D(Z,(\theta,h))-D(Z,(\bar\theta,h))\bigr| + \sup_{\substack{\theta,\bar\theta:\,\|\theta-\bar\theta\|\le n^{-3/2}\\ h\in\mathcal H}}\ \bigl|\mathbb{E}\bigl[D(Z,(\bar\theta,h))\bigr]-\mathbb{E}\bigl[D(Z,(\theta,h))\bigr]\bigr|}_{(ii)} \nonumber\\
&\quad + \underbrace{\sup_{\substack{\bar\theta\in\Theta_0\\ h,\bar h:\,|h-\bar h|\le n^{-1}}}\ \bigl|D(Z,(\bar\theta,h))-D(Z,(\bar\theta,\bar h))\bigr| + \sup_{\substack{\bar\theta\in\Theta_0\\ h,\bar h:\,|h-\bar h|\le n^{-1}}}\ \bigl|\mathbb{E}\bigl[D(Z,(\bar\theta,h))\bigr]-\mathbb{E}\bigl[D(Z,(\bar\theta,\bar h))\bigr]\bigr|}_{(iii)}.\nonumber
\end{align}
We bound Term (i) by taking a union bound over the $N(n^{-3/2},\Theta)$ elements of $\Theta_0$ and the $N(n^{-1},\mathcal H)$ elements of $\bar{\mathcal H}$, in combination with the pointwise bound from Lemma 2.4.9. This shows that with probability at least $1-4e^{-R}$,
\[
\sup_{\bar\theta\in\Theta_0,\,\bar h\in\bar{\mathcal H}}\ \bigl|D(Z,(\bar\theta,\bar h))-\mathbb{E}\bigl[D(Z,(\bar\theta,\bar h))\bigr]\bigr| \;\le\; C_1\,V^2\log^2(V)\sqrt{R\log\bigl(N(n^{-3/2},\Theta)\,N(n^{-1},\mathcal H)\bigr)}\,\frac{\sqrt n\,\log^4(n)}{h_{\min}}.
\]
Terms (ii) and (iii) of Eq. (2.71) can be bounded as follows.
First, for (ii), we see by Lemma 2.3.7 that for $\|\theta-\bar\theta\| \le n^{-3/2}$ there exists a constant $C_1$ such that
\[
\bigl|D(Z,(\theta,h))-D(Z,(\bar\theta,h))\bigr| \;\le\; \frac{C_1L}{h}\sqrt{\frac{R}{\nu_{\min}}}\cdot n^{1/2}\sqrt{\log n}
\]
with probability $1-\exp(-R)$. Similarly, there exist constants $C_2$, $C_3$ and $C_4$ (depending on $\nu_{\min}$, $L$, $C_\mu$, $a_{\min}$, $a_{\max}$, $b_{\max}$) such that
\[
\Bigl|\mathbb{E}\bigl[D(Z,(\theta,h))-D(Z,(\bar\theta,h))\bigr]\Bigr| \;\le\; \frac{C_2n^{1/2}}{h}\bigl(\mathbb{E}[\|Z\|_\infty]+1\bigr) \;\le\; \frac{C_3n^{1/2}}{h}\bigl(\sqrt{\log n}+C_\mu\bigr) \;\le\; \frac{C_4n^{1/2}}{h}\sqrt{\log n},
\]
where the second inequality uses a standard bound on the maximum of $n$ sub-Gaussian random variables, and we have used Assumption 2.3.1 to simplify. Combining the two terms and taking the supremum over $h\in[h_{\min},h_{\max}]$, we see there exists a constant $C_5$ (depending on $C_1$ and $C_4$) such that
\[
\text{(ii)} \;\le\; \frac{C_5\sqrt{Rn\log n}}{h_{\min}}.
\]
Finally, for (iii) we see by Lemma 2.3.7 that for $|h-\bar h| \le n^{-1}$, there exists an absolute constant $C_6$ such that
\[
\text{(iii)} \;\le\; \frac{C_6\sqrt n}{h_{\min}\nu_{\min}}.
\]
Combining, we see there exist dimension independent constants $C_7$ and $C_8$ such that with probability $1-5e^{-R}$,
\begin{align*}
\sup_{\theta\in\Theta,\,h\in\mathcal H}\ |D(Z,\theta)-\mathbb{E}[D(Z,\theta)]| &\le C_7\,V^2\log^2(V)\sqrt{Rn\log\bigl(N(n^{-3/2},\Theta)\,N(n^{-1},\mathcal H)\bigr)}\,\frac{\log^4(n)}{h_{\min}} + \frac{C_7\sqrt{Rn\log n}}{h_{\min}} \\
&\le C_8\cdot V^2\log^2 V\cdot\sqrt{R\cdot n\log\bigl(n\cdot N(n^{-3/2},\Theta)\bigr)}\cdot\frac{\log^4 n}{h_{\min}},
\end{align*}
where the last inequality holds because $N(n^{-1},\mathcal H) \le n$. This completes the proof.
To obtain uniform bounds, we characterize the complexity of the policy class through the $n^{-3/2}$-covering number of the parameter space $\Theta$. As an example to demonstrate the size of the covering number, consider the case where $\Theta$ is a compact subset of $\mathbb{R}^p$ with finite diameter $\Gamma$. Applying Lemma 4.1 of Pollard, 1990 and the fact that the $\varepsilon$-packing number bounds the $\varepsilon$-covering number, we see that $N(n^{-3/2},\Theta) \le \bigl(3n^{3/2}\Gamma\bigr)^p$. Combining this bound with Eq. (2.71), we obtain the following corollary.

Corollary 2.10.22 (Uniform Convergence for Finite Policy Class). Let $\Theta$ be a compact subset of $\mathbb{R}^p$ with finite diameter $\Gamma$. There exist dimension independent constants $C$, $n_0$ such that for $n \ge n_0$ the following holds with probability $1-C\exp\{-R\}$:
\[
\sup_{\theta\in\Theta}\ \frac{1}{n}|D(Z,\theta)-\mathbb{E}[D(Z,\theta)]| \;\le\; C\log^4 n\cdot V^2\log V\cdot\frac{1}{h_{\min}}\sqrt{\frac{R}{n}}\cdot\sqrt{p\log\bigl(3n^{3/2}\Gamma\bigr)}.
\]

This corollary shows that the complexity of the policy class depends on the number of parameters of the plug-in policies. We see from Section 2.2.2 that $p$ for many common policy classes does not depend on $n$, implying that the convergence of the VGC estimator to its expectation follows the rate from Corollary 2.10.22 up to log terms. For example, $p$ for mixed-effect policies depends on the dimension of $W_j$, which reflects the information available, such as features, for each $\mu_j$. This is typically fixed even as the number of observations $n$ increases. This implies that for many policy classes, the estimation error converges to 0 as $n\to\infty$.
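To make the rate in Corollary 2.10.22 concrete, the short sketch below (illustrative only; the constants $p$, $\Gamma$, $V$, $h_{\min}$, and $R$ are hypothetical choices, and the dimension independent constant $C$ is omitted) evaluates the bound for increasing $n$ and shows it decreasing toward zero, albeit slowly because of the polylogarithmic factors.

```python
# Illustrative evaluation of the Corollary 2.10.22 rate (constant C omitted):
# log^4(n) * V^2 log(V) * (1/h_min) * sqrt(R/n) * sqrt(p * log(3 n^{3/2} Gamma)).
# All parameter values below are hypothetical placeholders.
import math

def corollary_rate(n, p=5, Gamma=10.0, V=8, h_min=0.1, R=2.0):
    return (math.log(n) ** 4 * V ** 2 * math.log(V) / h_min
            * math.sqrt(R / n) * math.sqrt(p * math.log(3 * n ** 1.5 * Gamma)))

for n in (10**3, 10**5, 10**7, 10**9):
    print(f"n = {n:>10d}   rate ~ {corollary_rate(n):.3e}")
```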
2.10.9 Proof of Theorem 2.4.7

We can now prove Theorem 2.4.7.

Proof. Proof of Theorem 2.4.7: We proceed to bound each term on the right side of Eq. (2.18).
To bound Eq. (2.18a), we have by Lemma 2.10.18 that with probability at least $1-e^{-R}$,
\[
\sup_{\theta\in\bar\Theta}\ \Bigl|\xi^\top x(Z,\theta)-\mathbb{E}\bigl[\xi^\top x(Z,\theta)\bigr]\Bigr| \;\le\; C\,V^3\log^3 V\,\sqrt{n\cdot\log N\Bigl(\frac{1}{n},\Theta\Bigr)}\cdot R\log^4(n).
\]
To bound Eq. (2.18b), let $\mathcal H \equiv [h_{\min},h_{\max}]$. Then, by Lemma 2.10.21, we have for some dimension independent constant $C_1$ that with probability at least $1-C_1e^{-R}$,
\[
\sup_{\theta\in\bar\Theta}\ |D(Z,\theta)-\mathbb{E}[D(Z,\theta)]| \;\le\; C_1\cdot V^2\log^2 V\cdot\sqrt{R\cdot n\log\bigl(n\cdot N(n^{-3/2},\Theta)\bigr)}\cdot\frac{\log^4 n}{h_{\min}}.
\]
Finally, to bound Eq. (2.18c), use Theorem 2.3.2 and take the supremum over $h\in\mathcal H$ to obtain
\[
\text{Eq. (2.18c)} \le C_2\,h_{\max}\,n\log(1/h_{\min}).
\]
Substituting these three bounds into Eq. (2.18) and collecting constants proves the theorem. □
Chapter 3
Decision-Aware Denoising

Modern decision-making in urban planning, climate change, and healthcare leverages large geospatial and panel datasets. These data are often extremely noisy, resulting in low-quality downstream decisions. We propose a new "light touch" framework to adapt techniques from machine learning originally designed for denoising these data to instead guide decision-making. The key to our method is a novel approach to debiasing in-sample policy performance to estimate out-of-sample policy performance that we call the one-shot Variance Gradient Correction (one-shot VGC). Using the one-shot VGC, we tune the machine learning methods to minimize downstream costs (instead of minimizing prediction error or maximizing signal recovery). We uniformly bound the relative error of the one-shot VGC as an estimate of downstream costs by an intuitive measure of solution stability for the problem plus a term that vanishes as the problem dimension grows. This measure depends both on the policy and the structure of the downstream optimization problem. We bound the solution stability for two classes of policies and problems: i) regularized plug-in policies for convex problems and ii) (unregularized) affine plug-in policies for weakly-coupled (potentially non-convex) problems. In both cases, we show the solution stability vanishes (relative to out-of-sample cost) as the dimension of the optimization problem grows. Finally, we present a case study based on real traffic accident data from New York City on deploying speed humps to reduce pedestrian injury. Our "light-touch" decision-aware approach outperforms traditional decision-blind techniques provided the problem dimension is sufficiently large.
3.1 Introduction

Many decision-making problems involve large-scale systems, such as cities for urban design, road networks for transportation, and ecosystems for wildfire management. To analyze and make decisions in these settings, practitioners must use data to construct estimates of the many uncertain parameters that govern these large systems. Crucial data are often missing or very imprecise, as they must be collected over time through a limited number of sensors, satellites, and people. As a result, estimates of the key model parameters can be noisy, leading to poor decision-making policies.

A canonical solution to reducing the noise of estimates, particularly in large-scale systems, is to apply denoising. Originating in signal processing to separate signal from noise, denoising has been successfully applied in domain-specific settings such as extracting a patient's signal from electrocardiograms (Chatterjee et al., 2020) and separating speech from background noise (Wilson et al., 2008). For large systems, denoising has been applied in systems biology to better study the impact of drugs, cancer progression, and cell development at a molecular level (Kavran and Clauset, 2021). Additionally, denoising has also been used to improve weather forecasts (Kim et al., 2020). In each of these domains, the relevant denoising algorithm leverages similarities across these large systems to better learn the underlying signal. Moreover, the ultimate task is signal recovery; there is no explicit downstream decision task.

By contrast, this paper proposes a new method for denoising when the estimated signal is meant to inform a decision task. Our approach combines the denoising of parameters and the optimization of decisions into a unified task, sometimes known as decision-aware learning. Decision-aware methods contrast with estimate-then-optimize methods, which would first denoise the parameters using standard statistical and machine learning tools and then plug the estimates into the decision optimization task. Recent works have shown that decision-aware learning methods outperform estimate-then-optimize methods, particularly in small-data, large-scale settings (Gupta, Huang, and P. Rusmevichientong, 2022) and when the parameter estimation models are misspecified (Adam N. Elmachtoub and Grigas, 2022).
The decision-aware learning framework allows our method to be effective even in settings where the true model parameters cannot be recovered through denoising. Indeed, many applications with geospatial data exhibit poor data quality and/or missing covariates, which make learning the true model parameters difficult. For example, in Section 3.7, we consider a case study that involves predicting the probability of pedestrian injuries in motor vehicle crashes in New York City for every census tract. We utilize data collected by the New York City Police Department, which maps the locations of motor vehicle accidents. However, we lack important covariate information, such as foot and vehicle traffic, that would intuitively improve our prediction of injury rates for census tracts with limited crash data. The collection and integration of such useful covariate information into existing datasets can be costly, time consuming, and even in conflict with data privacy regulations. Our approach instead seeks to identify higher quality decisions using only the data at hand, even if these data are insufficient to generate high-quality estimates.
x
∗ ∈ argmin
x∈X ⊆[0,1]
n
µ
⊤x, (3.1)
where X is a known, potentially non-convex, feasible region contained within [0,1]
n
, and µ ∈ R
n
is an unknown vector of parameters. For each µj
, we have access to noisy, unbiased prediction
Zj and observe defining features W j ∈ R
p
that are fixed and known. Our goal is to find a policy
x(Z,W ) with good out-of-sample performance µ
⊤x(Z,W ). Since µ is unknown, even estimating
the out-of-sample performance of a given policy x(Z,W ) is not trivial. The naive in-sample estimator Z
⊤x(Z,W ) is biased, i.e., E
Z
⊤x(Z,W )
̸= E
µ
⊤x(Z,W )
. (See Gupta, Huang, and P.
Rusmevichientong, 2022 for discussion).
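The following short simulation, which is purely illustrative and not part of the chapter's formal development, makes this optimism bias concrete on a toy instance of Problem (3.1): the feasible region is the set of 0-1 vectors selecting $k$ components, the plug-in policy picks the $k$ components with the smallest predictions, and the instance sizes and noise model are hypothetical choices.

```python
# Illustrative sketch: the in-sample estimate Z'x(Z) is optimistically biased
# for a plug-in policy.  Here X = {x in {0,1}^n : sum_j x_j = k}, the plug-in
# policy x(Z) picks the k components with the smallest predictions Z_j, and
# Z_j ~ N(mu_j, 1).  All sizes below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n, k, trials = 1000, 100, 200
mu = rng.uniform(-1.0, 1.0, size=n)          # unknown parameters (fixed for the experiment)

in_sample, out_of_sample = [], []
for _ in range(trials):
    Z = mu + rng.standard_normal(n)          # noisy, unbiased predictions
    chosen = np.argpartition(Z, k)[:k]       # plug-in policy x(Z): the k smallest Z_j
    in_sample.append(Z[chosen].sum())        # Z' x(Z): what the data suggest we achieved
    out_of_sample.append(mu[chosen].sum())   # mu' x(Z): what we actually achieve

print(f"average Z'x(Z)  ~ {np.mean(in_sample):.1f}")    # noticeably smaller (more optimistic)
print(f"average mu'x(Z) ~ {np.mean(out_of_sample):.1f}")
```

Running the sketch shows the in-sample average is substantially below the out-of-sample average, which is exactly the optimism the debiasing corrections in this chapter are designed to remove.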
Instead, we propose a novel one-shot Variance Gradient Correction (one-shot VGC) to debias this in-sample performance $Z^\top x(Z,W)$ and obtain an estimate of $\mu^\top x(Z,W)$. The one-shot VGC approximately computes the gradient of the objective value with respect to $Z$, similar to the method of Gupta, Huang, and P. Rusmevichientong (2022), which we refer to as the multi-shot VGC. The novelty of the one-shot VGC is two-fold. First, we utilize the one-shot VGC to highlight a general approach to debiasing in-sample performance for any plug-in and regularized plug-in policy for linear optimization problems. By contrast, Gupta, Huang, and P. Rusmevichientong (2022) focuses on a particular separable affine plug-in policy class (see Remark 3.2.3). Second, we show that important theoretical guarantees of the one-shot VGC can be characterized by a measure of solution stability. Unlike Gupta, Huang, and P. Rusmevichientong (2022), which also showed that the multi-shot VGC depends on a solution stability term, we are able to bound the solution stability term directly. This allows us to obtain theoretical bounds and guarantees on the resulting out-of-sample performance estimator with minimal assumptions for a significantly larger variety of policy classes.

In addition to being more general and theoretically more tractable, the one-shot VGC also has better empirical estimation-error performance, as demonstrated in Fig. 3.1. We attribute the better performance to structural differences between the estimators. While both estimate the debiasing component by adding noise, we provide a heuristic derivation showing that the multi-shot VGC adds significantly more noise and is thus a uniformly higher-variance estimator. We provide a more in-depth comparison between the two estimators in Section 3.3.3.
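For readers unfamiliar with perturbation-based corrections, the sketch below illustrates the core Monte Carlo computation underlying such gradient-style debiasing, in the spirit of the multi-shot VGC of Chapter 2: estimating smoothed coordinate-wise sensitivities of the plug-in value by resolving the problem after small random perturbations. The toy problem, smoothing level, and replication count are hypothetical placeholders; the exact weights, the combination with the in-sample value, and the one-shot variant are defined later in this chapter.

```python
# Illustrative sketch only: the Monte Carlo structure behind a perturbation-
# based correction.  It estimates smoothed coordinate sensitivities
# E[V(Z + delta e_j) - V(Z)] / h of the plug-in value V(z) by re-solving the
# problem after perturbing one coordinate at a time.  The toy problem
# (select the k smallest costs), the smoothing level h, and the replication
# count are hypothetical; how these quantities enter the actual estimator is
# specified in Section 3.3.
import numpy as np

def plug_in_value(z: np.ndarray, k: int) -> float:
    """V(z): optimal value of the plug-in problem (here, sum of the k smallest costs)."""
    return float(np.sort(z)[:k].sum())

def smoothed_sensitivities(Z: np.ndarray, k: int, h: float, reps: int, rng) -> np.ndarray:
    """Monte Carlo estimates of E[V(Z + delta e_j) - V(Z)] / h with delta ~ N(0, h)."""
    base = plug_in_value(Z, k)
    grads = np.zeros(Z.shape[0])
    for j in range(Z.shape[0]):
        bumps = []
        for delta in rng.normal(scale=np.sqrt(h), size=reps):
            Zp = Z.copy()
            Zp[j] += delta                    # perturb a single coordinate
            bumps.append(plug_in_value(Zp, k) - base)
        grads[j] = np.mean(bumps) / h
    return grads

rng = np.random.default_rng(1)
n, k, h = 200, 20, 0.5
Z = rng.uniform(-1, 1, n) + rng.standard_normal(n)    # noisy predictions
print(smoothed_sensitivities(Z, k, h, reps=20, rng=rng)[:5])
```

The multi-shot structure above requires repeated re-solves per coordinate, which is precisely the source of the extra noise and computation that the one-shot variant introduced in this chapter is designed to avoid.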
For our theoretical bounds, we show in particular that our solution stability term bounds the bias of the one-shot VGC and that its variance and tail bounds can be bounded under common assumptions on the data and plug-in class. This differs from Gupta, Huang, and P. Rusmevichientong (2022), which bounds the variance relative to solution stability. Bounding the solution stability term allows us to obtain uniform estimation error bounds on our one-shot VGC estimator. Since the solution stability term depends on the structure of the policy, we provide separate theoretical guarantees for regularized plug-in policies and non-regularized plug-in policies. In both sets of results, we focus on instances where the full-information optimum of Problem (3.1) is $O(n)$ and show that the relative error vanishes as $n\to\infty$.
performance comparable to their non-regularized counterparts, which highlights the generality of
our decision-aware approach with the one-shot VGC.
For plug-in policies, the solution stability term depends more directly on the choice plug-in
class and optimization problem Problem (3.1). These choices can cause different decision components of the policy to be correlated, making the solution stability term difficult to analyze. Thus, our
results primarily focus on the setting where the plug-in class is affine and instances of Problem (3.1)
that are “weakly-coupled” which was first introduced in Gupta, Huang, and P. Rusmevichientong
(2022). In Section 3.2.3, we show many modern machine learning methods are affine plug-ins.
Additionally, many classes of optimization problems are “weakly-coupled” such as knapsack and
two-stage stochastic programs.
In our results, we show decision components in these settings are weakly correlated and can be
analyzed as if they were independent. To that end, we introduce novel approaches for removing
the dependence structure. To handle the “weakly-coupled” problem structure, we introduce a new
technique that lifts coupling parameters into the policy space, reducing the optimization problem
into independent components. The new insight allows us to propose a specialized one-shot VGC
for weakly-coupled problems that leverages the optimal dual solution and is computationally more
efficient. For the affine plug-in class, we show how to construct a set of decision components
that are independent. Our approach characterizes correlation between decision components via
edges in a network graph and constructs the independent sets via graph coloring. This allows us to
characterize the dependence through the network graph’s chromatic number.
Finally, we perform a numerical case study with motor vehicle accident data from New York
City. We highlight how our decision-aware denoising approach outperforms estimate-then-optimize
denoising in identifying locations with the most pedestrian injuries caused by motor vehicle accidents. Beyond being able to identify regions with more pedestrian injuries, we show how our
decision-aware approach can better address bias towards regions with lower crash rates and less
data found in estimate-then-optimize approaches.
138
3.1.1 Our Contributions
1. We propose the one-shot Variance Gradient Correction (one-shot VGC) for debiasing in-sample policy performance to estimate out-of-sample policy performance. The one-shot VGC applies to any plug-in or regularized plug-in policy for Problem (3.1).
2. Under common assumptions on the data-generating process, we bound the variance and tail behavior of the one-shot VGC. Furthermore, we show the expected error of the one-shot VGC in estimating in-sample bias is bounded by a specific measure of solution stability that depends on both the policy and the structure of Problem (3.1).
In particular, by further bounding this solution stability, we prove that:
3. When Problem (3.1) is convex, the plug-in function is Lipschitz, and the policy is regularized, we can uniformly bound the estimation error of the one-shot VGC. In particular, this bound implies the relative error of our procedure vanishes and one can learn a best-in-class policy as the problem size grows.
4. When Problem (3.1) is potentially non-convex but weakly coupled, the plug-in function is affine, and the policy is unregularized, we uniformly bound the estimation error of the one-shot VGC in terms of the chromatic number of a particular graph depending on both X and the affine policy. We show that even with crude upper bounds on this chromatic number, the relative error of our procedure vanishes and it is possible to learn a best-in-class policy as the problem size grows.
5. We present a case study on deploying speed humps to reduce pedestrian injuries in motor
vehicle accidents using data from New York City. We compare our decision-aware denoising
approach to a decision-blind estimate-then-optimize approach. We find that the ideal amount
of smoothing depends on the downstream loss, and, hence, decision-blind methods cannot
always perform optimally. By contrast, our decision-aware approach finds a policy with
near-oracle performance.
3.1.2 Relationship to Prior Work
Traditional denoising typically focuses on methods that recover an underlying signal corrupted by noise and covers a variety of different domains. These methods often aim to learn the underlying signal itself. For example, in audio denoising, Fourier transforms are used to decompose recorded sound into key sub-components, and components related to noise are filtered out (Rabiner and Gold, 1975). In compressed sensing, total variation denoising aims to learn a sparse signal through methods that solve an ℓ1-regularization problem similar to LASSO (Candes, Romberg, and Tao, 2006). Similarly, many modern denoising techniques utilize neural networks and autoencoders to reconstruct the underlying signal for video and images (Tian et al., 2020).
In our work, we primarily focus on denoising techniques that reduce noise by smoothing, known in the signal processing literature as linear filters. These techniques are simple: they smooth each data point by replacing it with a linear combination of all available data points. To illustrate,
digital image processing uses linear filters to replace each pixel with a linear combination of pixels
in its neighborhood (Gonzalez and Woods, 2006). The drawback of these simple methods is that
they can often blur images rather than recover the original image. Since we are primarily interested
in using denoising to improve decision-making rather than recovering the original signal, linear
filters offer the benefits of denoising while being simple enough to integrate into a decision-making
framework.
In the machine learning literature, denoising has been applied to training data to improve downstream classifier performance. Zhu and Wu (2004) highlight work showing how noise in the label data and attribute data can negatively impact the training time and resulting prediction accuracy of classifiers. Most of the existing work focuses on filtering out noisy training label data (Brodley and Friedl, 1999; Zhu, Wu, and Chen, 2003), which has been shown to improve the classification accuracy of the trained classifier. Our work highlights the benefits of applying denoising ahead of tasks similar to, but more general than, classification. Linear optimization problems such as Problem (3.1) resemble classification settings, but with more structure due to the feasible set and objective costs. Similar parallels have been drawn by El Balghiti, Adam N. Elmachtoub, et al. (2022).
Our work also closely relates to the growing literature on decision-aware learning, which combines the two separate processes of estimating unknown problem parameters and optimizing decisions into a single end-to-end pipeline (Grigas, Qi, et al., 2021; Adam N. Elmachtoub and Grigas, 2022; Wilder, Dilkina, and Tambe, 2019). These works focus on selecting estimators that optimize the empirical objective performance rather than optimizing the prediction accuracy of the unknown problem parameters. We focus on a contextual optimization setting for linear optimization similar to works such as Adam N. Elmachtoub and Grigas (2022) and Hu, Kallus, and Mao (2022). Our work differs as we focus on the data regime where our denoised estimators depend explicitly and non-trivially on the observed data. As discussed in Gupta, Huang, and P. Rusmevichientong (2022), empirical objective performance in this data regime is biased, which can lead to overfitting.
To debias the empirical performance and obtain an estimate of the out-of-sample performance, we utilize a technique inspired by the Variance Gradient Correction proposed by Gupta, Huang, and P. Rusmevichientong (2022). That work shows their debiasing approach outperforms cross-validation in estimating out-of-sample performance for a special class of plug-in policies. Our work considers a new set of policies with plug-ins that leverage cross-learning, or in other words, are functions of different components with similar covariates. Unlike the previous work, we consider plug-in policy classes that allow us to vary the amount of cross-learning between components. To evaluate this new class of plug-in policies, we propose a more general debiasing estimator in order to obtain good performance guarantees. Our debiasing estimator contributes to the class of perturbation approaches in the literature (Gupta, Huang, and P. Rusmevichientong, 2022; Ito, Yabe, and Fujimaki, 2018b; Guo, Jordan, and Zhou, 2022) but involves some novel randomization steps to facilitate a simpler theoretical analysis. For a more detailed comparison of the debiasing estimator to previous works, see Section 3.3.3.
Our results also contribute to generalization bounds related to data-driven policies for Problem (3.1). Within this literature, El Balghiti, Adam N. Elmachtoub, et al. (2022) and Hu, Kallus, and Mao (2022) bound the difference between in-sample performance and out-of-sample performance
in the large-sample regime (when the amount of data grows large). Similar to our results, Gupta,
Huang, and P. Rusmevichientong (2022) and Gupta and P. Rusmevichientong (2021b) consider a
small-data, large-scale setting and bound the difference between debiased in-sample performance
and out-of-sample performance.
In contrast to generalization results primarily centered on random design settings common
in machine learning, our research is tailored to small-data, large-scale problems with fixed design settings common for regression. This emphasis presents several practical advantages, notably
sidestepping covariate shift concerns. Furthermore, our work delves into characterizing the dependence induced by large-scale optimization problems. We introduce a simpler proof technique that
makes the dependence between optimization structure and the generalization result more transparent.
3.2 Model
Let µ ∈ R^n be a fixed but unknown signal. We observe corrupted estimates Z = µ + ξ, where ξ ∈ R^n is a mean-zero random vector with independent components, and ξj has (known) precision νj ≥ 0. (Recall the variance of ξj is 1/νj.) For convenience, let νmin ≡ minj νj. We observe fixed features Wj ∈ R^p that potentially contain information relating components of µ. Let W ∈ R^{n×p} be the matrix with jth row Wj, and let Σ be the covariance of ξ.
In traditional signal-processing applications, e.g., with geospatial data, Z represents sensor readings at n locations. The random corruptions ξ represent irreducible measurement error, and the features Wj encode information about the jth location (latitude, longitude, elevation, terrain type, rainfall level, etc.). Often, such applications entail many locations (large n), and improving sensor quality is prohibitively expensive (low precisions νj). Such applications are perhaps best studied in the so-called small-data, large-scale regime (Gupta and P. Rusmevichientong, 2021b; Gupta, Huang, and P. Rusmevichientong, 2022; Gupta and Kallus, 2022), and our analysis takes this perspective.
Our model can also approximate some machine learning applications with panel data (Ignatiadis and Wager, 2019b; Chung et al., 2022; Gupta and Kallus, 2023). Here, j refers to individual
“units” such as patients, facilities, or regions, with W j encoding the corresponding unit information. The “observations” Z represent predictions from an upstream, black-box machine learning
model with known precisions ν. Because of privacy constraints, missing data and a host of other
issues, precisions can be low (see above references for examples). Moreover, typical applications
often focus on a specific set of existing units — e.g., a certain set of existing facilities — rather than
assessing the performance of a policy on a hypothetical “out-of-sample” unit, such as a prospective
facility yet to be constructed.
In either interpretation, we might believe two components j and k with “similar” features Wj and Wk have similar underlying values µj and µk. Thus, we expect the observation Zk to be informative of the underlying signal µj.
To highlight an application of Problem (3.1) with such data, consider the following:
Example 3.2.1 (Speed Hump Planning in NYC). Certain studies such as Elvik et al. (2009) have
shown speed humps can reduce the number of crashes by 50%–60%. They are highly requested
traffic calming measures used to reduce crashes, injuries, and fatalities. In New York City, residents
can request speed humps; however, Kuntzman (2022) recently found that the wait time for such
requests can be as long as 12 years. Speed studies that assess the feasibility of deploying a speed
hump at a requested location are a salient bottleneck, taking an average of two years to complete.
To prioritize speed hump requests, consider the following linear optimization problem that chooses speed humps that would, if built, reduce the most pedestrian injuries:

    min_x  ∑_{k=1}^K µk xk,   s.t.  ∑_{k=1}^K xk ≤ B,   xk ∈ {0,1}  ∀k = 1,...,K.   (3.2)

Here K is the number of speed hump requests and B is the number of speed humps that can be processed in a year. The binary variable xk encodes whether request k will be processed. We assume that the potential relative reduction of pedestrian injuries is the same for all requests. The cost µk is the expected number of pedestrian injuries per year at location k. In cities like New York City, estimates Zk of µk can be formed from traffic accident reports, but are often noisy. Features Wk capture information such as the longitude and latitude of the accident, street width, speed limit, and number of bike lanes.
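For concreteness, the following minimal sketch (not the thesis code; the helper name and the sign convention are assumptions made for illustration) solves a problem with the structure of (3.2) exactly: because the only constraint is a cardinality budget, the optimum simply keeps the most favorable costs. If one proxies the costs by negated predicted injury counts, this amounts to selecting the B tracts with the largest predicted injuries.

```python
# Minimal sketch (illustrative only): exact solution of
# min_x c'x  s.t.  sum_k x_k <= B,  x_k in {0,1}.
import numpy as np

def solve_budgeted_selection(c: np.ndarray, B: int) -> np.ndarray:
    """Return a 0/1 vector x minimizing c @ x subject to x.sum() <= B."""
    x = np.zeros_like(c, dtype=float)
    order = np.argsort(c)                        # most negative costs first
    chosen = [j for j in order[:B] if c[j] < 0]  # a nonnegative cost is never worth selecting
    x[chosen] = 1.0
    return x

# Example: pick at most B = 2 of 5 requests.
c = np.array([-3.0, 1.0, -0.5, -2.0, 0.2])
print(solve_budgeted_selection(c, B=2))          # -> [1. 0. 0. 1. 0.]
```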
As mentioned, many denoising methods share information across similar locations to estimate
µ. We next introduce two policy classes that exploit this intuition.
3.2.1 Policy Classes
We study a class of (potentially regularized) plug-in policies. Let T : R^n × Θ̄ → R^n be a transformation of the data depending on a user-defined parameter θ̄ ∈ Θ̄, and let φ : [0,1] → R+ be a strongly convex regularizer.

Definition 3.2.2 (Regularized Plug-in Policy). A regularized plug-in policy x(Z, (ρ, θ̄)) for Problem (3.1) takes the form¹

    x(Z, (ρ, θ̄)) ∈ argmin_{x∈X} ∑_{j=1}^n [ Tj(Z, θ̄) xj + ρ φ(xj) ].   (3.3)

The policy is defined by (ρ, θ̄) – i.e., T and φ are fixed – hence we sometimes write θ = (ρ, θ̄) for brevity. Given Θ̄ and an interval [ρmin, ρmax], let Θ = R+ × Θ̄ for brevity, and define the policy class XΘ(Z) ≡ { x(Z, (ρ, θ̄)) : (ρ, θ̄) ∈ Θ }. We assume throughout that Θ is compact.
Intuitively, regularized plug-in policies proxy the unknown µ by a transformation T(Z, θ̄) of the data Z, and plug this proxy into a version of Problem (3.1) with a regularizer. The use of the regularizer φ(·) in this proxy problem is common (see, e.g., Grigas, Qi, et al. (2021) and Wilder, Dilkina, and Tambe (2019)), because when X is convex it ensures policies are (almost everywhere) differentiable with respect to θ̄. However, in settings where X is non-convex – e.g., when Problem (3.1) encodes a minimum spanning tree problem – we lose this differentiability, and the value of regularization is unclear. Hence, in what follows, we treat i) regularized policies, i.e., ρ > 0, and ii) unregularized policies, i.e., ρ = 0, separately in Sections 3.5 and 3.6, respectively.

¹When Eq. (3.3) admits multiple optima, we tie-break via some deterministic rule, e.g., taking the lexicographically smallest solution.
We focus on regularized plug-in policies because they are “light touch” in the sense of Chung et al. (2022), i.e., they are only marginally more complex to implement than an estimate-then-optimize procedure. Specifically, for a fixed θ̄, one need not alter the measurement or prediction procedure used to obtain Z. The transformation T(Z, θ̄) is applied as a post-processing step. Unregularized plug-in policies are also “light touch” with respect to the optimization. If one has a specialized algorithm to solve Problem (3.1) efficiently, one can apply this algorithm directly to the transformed vector T(Z, θ̄).
Remark 3.2.3 (Relation to Separable, Affine Plug-in Policies). Regularized plug-in policies generalize the separable affine plug-in policy class studied in Gupta, Huang, and P. Rusmevichientong (2022). Namely, separable affine plug-in policies take ρ = 0 and Tj(Z, θ̄) = aj(θ̄) Zj + bj(θ̄), where the functions aj and bj may depend on W. We focus on general plug-ins as they are strictly necessary to exploit our earlier intuition that Zk may be informative for µj when Wj and Wk are similar. Because Tj(Z, θ̄) may depend on Zk for k ≠ j, our analysis of general, regularized plug-in policies below must use substantively different techniques than those in Gupta, Huang, and P. Rusmevichientong (2022).
3.2.2 A Decision-Aware Oracle Benchmark
Unlike previous works on denoising, we do not assume that µ = T(Z, θ̄*) for some value of θ̄*. Instead, we take a best-in-class perspective. We define the oracle parameter

    θ^OR ∈ argmin_{θ∈Θ} µ⊤x(Z, θ),   (3.4)

and the corresponding oracle policy x(Z, θ^OR). This oracle is “best-in-class” over Θ, but not easily computable because µ is unknown.
In what follows, we consider two tasks:
1. (Policy Evaluation) Given a fixed θ, provide an estimate of µ⊤x(Z, θ). Our strategy will be to debias the observed performance Z⊤x(Z, θ).
2. (Policy Learning) Find a parameter θ̂(Z) (depending on the data) such that µ⊤( x(Z, θ̂(Z)) − x(Z, θ^OR) ) is “small.” Our strategy will be to show that our debiased estimators from task 1 are accurate uniformly over Θ, and hence optimizing the estimate yields a policy with near-oracle performance.
3.2.3 Examples of Denoising Plug-Ins
Before delving into the details of our approach, we describe popular denoising plug-ins inspired
by modern machine learning methods. The following examples are all affine transformations of Z.
Affine structure is not necessary for the majority of our paper, but is necessary for our results in
Section 3.6.
Example 3.2.4 (Kernel Smoothers). Kernel smoothers define similarity via a kernel function K_θ̄(Wj, Wk) with parameters θ̄. For example, the box kernel is defined by K_θ̄(Wj, Wk) = I{ ∥Wj − Wk∥ ≤ θ̄ }. Inspired by kernel smoothing, we consider a plug-in policy using Nadaraya-Watson regression as our transformation, i.e.,

    T_j^KR(Z, θ̄) = ∑_{i=1}^n [ K_θ̄(Wj, Wi) / ∑_{l=1}^n K_θ̄(Wj, Wl) ] Zi.   (3.5)

This expression is affine in Z. By varying θ̄ ∈ Θ̄, we obtain a class of plug-in policies, each with a different amount of smoothing.
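As a concrete illustration of Eq. (3.5), the following sketch (not the thesis code; names are illustrative) builds the linear smoother matrix for the box kernel, so that the plug-in T(Z, θ̄) is simply a matrix-vector product.

```python
# Minimal sketch (illustrative only): the box-kernel Nadaraya-Watson smoother of Eq. (3.5).
import numpy as np

def box_kernel_smoother_matrix(W: np.ndarray, theta_bar: float) -> np.ndarray:
    """Return the n-by-n linear smoother matrix L(theta_bar) for the box kernel."""
    # Pairwise Euclidean distances between feature rows W_j and W_i.
    dists = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=-1)
    K = (dists <= theta_bar).astype(float)       # box kernel I{||W_j - W_i|| <= theta_bar}
    return K / K.sum(axis=1, keepdims=True)      # each row sums to one (j is its own neighbor)

# Usage: smooth noisy observations Z with bandwidth 0.3; T(Z) = L @ Z is affine in Z.
rng = np.random.default_rng(0)
W = rng.uniform(size=(50, 2))                    # features, e.g., longitude/latitude
Z = rng.normal(size=50)                          # noisy observations
T = box_kernel_smoother_matrix(W, theta_bar=0.3) @ Z
```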
Example 3.2.5 (Local Regression). Local linear regression uses a kernel to upweight similar data points when predicting for a given test point. To estimate µj, it solves

    β_j^LR(Z, θ̄) ∈ argmin_β ∑_{i=1}^n K_θ̄(Wj, Wi) ( Wi⊤β − Zi )²,

and outputs T_j^LR(Z, θ̄) = Wj⊤ β_j^LR(Z, θ̄). Defining the Gram matrix K = ( K_θ̄(Wj, Wk) )_{jk} ∈ R^{n×n}, we can rewrite this estimator as

    T_j^LR(Z, θ̄) = Wj⊤ ( W⊤ diag(Kj) W )^{-1} W⊤ diag(Kj) Z,

which is again affine in Z. By varying θ̄ (and thus the Gram matrix), we obtain a class of plug-in policies based on local linear regression with different-sized neighborhoods. Polynomial local regression methods can also be used in a similar way.
Example 3.2.6 (Ridge Regression and Random Features). In high-dimensional settings when p ≫ n, regularized regression is often used to improve stability. For example, ridge regression solves

    β^RR(Z, θ̄) ∈ argmin_β ∥Wβ − Z∥²₂ + θ̄ ∥β∥²₂,   (3.6)

and then proxies µ by

    T^RR(Z, θ̄) = W β^RR(Z, θ̄) = W ( W⊤W + θ̄ I )^{-1} W⊤ Z,

which is affine in Z. By varying θ̄, we can construct a class of plug-in policies corresponding to different degrees of regularization.
If we first apply a non-linear transformation to each feature, Wj ↦ ψ(Wj) ∈ R^p̄, and then apply ridge regression, the resulting transformation is still affine in Z. Thus, plug-in policies based on kernelized ridge regression also fall within our scope. Recently, randomized feature maps (Gallicchio and Scardapane, 2020; Rahimi and Recht, 2008) have been used to approximate kernelized ridge regression. Such methods also give rise to affine policies.
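The ridge plug-in of Eq. (3.6) is equally simple to express; the sketch below (illustrative only, not the thesis code) computes the affine map T^RR(Z, θ̄) directly from its closed form.

```python
# Minimal sketch (illustrative only): the ridge-regression plug-in of Eq. (3.6),
# T(Z, theta_bar) = W (W'W + theta_bar I)^{-1} W' Z, affine in Z for fixed theta_bar.
import numpy as np

def ridge_plugin(Z: np.ndarray, W: np.ndarray, theta_bar: float) -> np.ndarray:
    """Return T^RR(Z, theta_bar), the ridge-smoothed proxy for mu."""
    n, p = W.shape
    beta = np.linalg.solve(W.T @ W + theta_bar * np.eye(p), W.T @ Z)
    return W @ beta

# Varying theta_bar traces out a one-parameter family of plug-in policies.
rng = np.random.default_rng(1)
W = rng.normal(size=(40, 10))
Z = rng.normal(size=40)
proxies = {t: ridge_plugin(Z, W, t) for t in (0.1, 1.0, 10.0)}
```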
Example 3.2.7 (Clustering). Consider a clustering method indexed by a hyperparameter θ̄. (For example, take K-means clustering indexed by the number of clusters K.) Applied to the features Wj, j = 1,...,n, clustering yields a partition S(θ̄) = {C1,...,CK} of the set {1,...,n}. We estimate µj by the average of its cluster:

    T_j^CL(Z, θ̄) = ∑_{k=1}^K I{ j ∈ Ck } ( (1/|Ck|) ∑_{i∈Ck} Zi ).

This transformation is affine in Z. By varying θ̄ (e.g., varying the number of clusters), we again obtain a class of plug-in policies with different local neighborhoods.
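A minimal sketch of the clustering plug-in follows (illustrative only; it assumes scikit-learn's KMeans is available, which the thesis does not specify).

```python
# Minimal sketch (illustrative only): the clustering plug-in of Example 3.2.7.
# Each mu_j is proxied by the mean of Z over its feature cluster, so the map
# Z -> T(Z, theta_bar) is an affine (block-averaging) linear smoother.
import numpy as np
from sklearn.cluster import KMeans

def cluster_plugin(Z: np.ndarray, W: np.ndarray, n_clusters: int, seed: int = 0) -> np.ndarray:
    """Replace each Z_j with the average of Z over its K-means cluster of features."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(W)
    T = np.empty_like(Z, dtype=float)
    for k in range(n_clusters):
        members = labels == k
        T[members] = Z[members].mean()
    return T

# Varying the number of clusters varies the size of the local neighborhoods.
rng = np.random.default_rng(2)
W = rng.uniform(size=(60, 2))
Z = rng.normal(size=60)
T_coarse, T_fine = cluster_plugin(Z, W, 3), cluster_plugin(Z, W, 15)
```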
The above list is not exhaustive. Many transformations T (·) can be cast as linear smoothers (see
Buja, Trevor Hastie, and Robert Tibshirani (1989)). Such predictors yield affine plug-in policies,
and are often indexed by a low dimensional parameter ¯θ .
Of course, not all policies are affine plug-in policies. For example, when using a deep neural
network to predict µj
, the resulting predictions are generally not affine in Z. Outside of Section 3.6,
our results will not require affine structure. However, we stress that non-affine plug-ins can sometimes be approximated by affine plug-ins. For example, Jacot, Gabriel, and Hongler (2018) show
that kernel regression using neural tangent kernels approximates infinitely wide fully-connected
neural networks, and Arora et al. (2020) provide evidence that this approximation has comparably
strong performance in small-data tasks.
3.3 One-Shot Variance Gradient Correction for Policy Evaluation
To denoise the in-sample performance Z⊤x(Z, θ), we must essentially estimate the in-sample optimism, since

    µ⊤x(Z, θ)   =   Z⊤x(Z, θ)   −   ξ⊤x(Z, θ).
    (out-of-sample performance)   (in-sample performance)   (in-sample optimism)

Loosely, our approach to estimating ξ⊤x(Z, θ) is as follows. Define

    x̂(t, ρ) ∈ argmin_{x∈X} { t⊤x + ρ ∑_{j=1}^n φ(xj) }   and   V(t, ρ) ≡ t⊤x̂(t, ρ) + ρ ∑_{j=1}^n φ( x̂j(t, ρ) ).

Fix some small user-defined constant h > 0 and a random vector δ_h ∈ R^n. (We discuss how this choice affects estimation quality in Lemma 3.4.2 below.) Finally, compute the one-shot VGC correction:

    D(Z, θ) = D(Z, (ρ, θ̄)) ≡ (1/h) E[ V(T(Z, θ̄) + δ_h, ρ) − V(T(Z, θ̄), ρ) | Z ].   (3.7)

Computationally, the expectation can be approximated by simulating δ_h and averaging.
The one-shot VGC D(Z, θ) is our estimate of the in-sample optimism ξ⊤x(Z, θ). Hence, our estimate of the out-of-sample performance is µ⊤x(Z, θ) ≈ Z⊤x(Z, θ) − D(Z, θ).
3.3.1 A Heuristic Derivation
To provide some intuition, we outline a heuristic derivation of our correction similar to that in Gupta, Huang, and P. Rusmevichientong (2022). We consider a fixed policy and drop ρ, θ̄ from the notation in this derivation.
Consider the function

    λ ↦ V(T(Z) + λξ),   (3.8)

where ξ ≡ Z − µ. By Danskin’s theorem, whenever x̂(T(Z) + λξ) is the unique optimizer, the derivative of this function is

    ∂/∂λ V(T(Z) + λξ) |_{λ=0} = ξ⊤x̂(T(Z)).

(When x̂(T(Z) + λξ) is not the unique optimizer, then ξ⊤x̂(T(Z)) is a subgradient.) Hence, instead of estimating the right-hand side, we attempt to estimate the left-hand side.
Since we cannot evaluate the derivative directly, we approximate it with a first-order, forward finite-difference approximation (LeVeque, 2007, Chapter 1). For sufficiently small h > 0, we expect that

    ∂/∂λ V(T(Z) + λξ) |_{λ=0} = (1/h) [ V(T(Z) + hξ) − V(T(Z)) ] + op(h)  as h → 0.   (3.9)

Finally, since ξ is unknown, we approximate T(Z) + hξ by T(Z) + δ_h. The resulting estimator (1/h)( V(T(Z) + δ_h) − V(T(Z)) ) depends on the random variable δ_h. To reduce variance, we derandomize our estimator by taking a conditional expectation over δ_h (Rao-Blackwellization).
3.3.2 Properties of the One-Shot VGC
Under minimal assumptions, we can bound the variance of our in-sample optimism estimator D(Z, θ). Our bound only requires that T(Z, θ̄) be Lipschitz continuous in Z.

Assumption 3.3.1 (Lipschitz Plug-in). The function Z ↦ T(Z, θ̄) is CT(θ̄)-Lipschitz with respect to the ℓ1 norm, i.e., ∥T(Z, θ̄) − T(Y, θ̄)∥1 ≤ CT(θ̄) ∥Z − Y∥1.

Define CT ≡ max_{θ̄∈Θ̄} CT(θ̄). When T(Z, θ̄) = L(θ̄)Z + l(θ̄) is affine, CT(θ̄) = maxj ∥Lj(θ̄)∥1, i.e., the maximal ℓ1-norm of the columns. For many of the linear smoothers presented in Section 3.2.3, the matrix L(θ̄) has nonnegative entries with columns that sum to 1. Hence, CT(θ̄) = 1 in these examples.
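For affine plug-ins, the Lipschitz constant of Assumption 3.3.1 is easy to compute; a small sketch (illustrative only, not from the thesis):

```python
# Minimal sketch (illustrative only): C_T(theta_bar) for an affine plug-in T(Z) = L Z + l
# is the maximal column l1-norm of L.
import numpy as np

def lipschitz_constant_l1(L: np.ndarray) -> float:
    """C_T(theta_bar) = max_j ||L_j||_1, the maximal column l1-norm of L(theta_bar)."""
    return float(np.abs(L).sum(axis=0).max())

# A doubly-stochastic smoother: rows and columns sum to one, so C_T = 1.
L = np.array([[0.6, 0.2, 0.2],
              [0.2, 0.6, 0.2],
              [0.2, 0.2, 0.6]])
print(lipschitz_constant_l1(L))   # 1.0
```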
We apply the Efron-Stein inequality to bound the variance of the one-shot VGC:

Theorem 3.3.2 (Variance of the VGC). Under Assumption 3.3.1, the variance of the one-shot VGC satisfies

    Var( D(µ + ξ, θ) ) ≤ 4 CT(θ̄)² n / (h² νmin).

This bound on the variance suggests that if CT(θ̄) is Op(1), the stochastic fluctuations of the one-shot VGC are Op(n^{-1/2}/h). Thus, in typical cases where the full-information solution to Problem (3.1) is O(n), the stochastic contributions are asymptotically negligible if h√n → ∞ as n → ∞.
We can strengthen this result to a high-probability tail bound assuming ξ has light tails:

Assumption 3.3.3 (Independent Sub-Gaussian Corruptions). For all j = 1,...,n, ξj is sub-Gaussian with variance proxy at most κ².

Theorem 3.3.4 (Concentration of the VGC). Suppose Assumptions 3.3.1 and 3.3.3 hold. Then, with probability at least 1 − ε,

    | D(Z, θ) − E[D(Z, θ)] | ≤ CT(θ̄) (κ/h) √( n log(2/ε) ).
3.3.3 Comparison to Multi-Shot VGC
In Gupta, Huang, and P. Rusmevichientong (2022), a similar approach is used to motivate what we refer to as the multi-shot VGC. Providing a rigorous, theoretical comparison of the methods is difficult because the multi-shot VGC only applies to separable affine plug-in policies. To develop some intuition, we compare the two estimators in the special case that T(Z, θ̄) = Z, ρ = 0, and the ξj are independent standard normals for all j.
Under these simplifying assumptions, the multi-shot VGC with a step size of h̄ is

    D^MS(Z) = (1/h̄) ∑_{j=1}^n E[ V(Z + δj ej) − V(Z) | Z ],  where δj ∼ N(0, h̄² + 2h̄),  j = 1,...,n.   (3.10)

Define the random variable δ̄ ∈ R^n by δ̄ ∼ δj ej with probability 1/n, and let h = h̄/n. Then,

    D^MS(Z) = (1/h) E[ V(Z + δ̄) − V(Z) | Z ].

Thus, the multi-shot VGC is a special case of the one-shot VGC with a specific random step.
Figure 3.1: Numerical Comparison of Multi-Shot vs. One-Shot. The graphs show that the mean squared error (MSE, as a % of the full-information objective) of the one-shot VGC (Eq. (3.11)) is better than the MSE of the multi-shot VGC (Eq. (3.10)) for the plug-in policy with T(Z, θ̄) = Z. The values are computed over 100 trials and use the same δ_h. Plot (a) varies the step-size parameter h for n = 320, and plot (b) varies the number of decisions n for h = 0.5. See Section 3.15.1 for details of the optimization problem.
As we will show in Lemma 3.4.2 and Proposition 3.4.3, when the ξj are independent standard normals, the random variable δ_h which minimizes estimation error in our one-shot VGC is δ_h ∼ N(0, (h² + 2h) I). Hence the relevant comparison is to

    D(Z) = (1/h) E[ V(Z + δ_h) − V(Z) | Z ].   (3.11)

When h < 1/n, we see that both methods take a step of length Op(√(nh)). When h > 1/n, the multi-shot VGC takes a step of size Op(nh), while the one-shot VGC takes a smaller step of size Op(√(nh)). In both cases, the multi-shot VGC only steps in coordinate directions, while the one-shot VGC steps in a uniformly random direction.
Empirically, the larger steps of the multi-shot VGC can cause it to have larger variance and estimation error than the one-shot VGC. We illustrate this phenomenon in Fig. 3.1.
Theoretically, the seemingly small difference in step size both greatly generalizes the applicability of the one-shot method and simplifies its analysis. Specifically, because it restricts to coordinate directions, bounding the bias of the multi-shot VGC is straightforward (Gupta, Huang, and P. Rusmevichientong, 2022, Theorem 3.2), but analyzing its stochastic behavior is quite difficult and involves a subtle duality argument and a number of assumptions on the policy and optimization. See Gupta, Huang, and P. Rusmevichientong (2022, Theorem 3.5). By contrast, as we have shown in Theorems 3.3.2 and 3.3.4, by fully randomizing the step direction, the one-shot VGC admits a very simple analysis of its stochastic deviation around its mean, but analyzing its bias is now more difficult.
3.4 Bounding the Estimation Error by Solution Stability
In this section, we bound the estimation error of our method by a more manageable term related to the stability of our policies. To that end, write

    sup_{θ∈Θ} [ Z⊤x(Z,θ) − D(Z,θ) − µ⊤x(Z,θ) ]                                        (Estimation Error)
      ≤ sup_{θ∈Θ} [ ξ⊤x(Z,θ) − (1/h)( V(T(Z,θ̄)+hξ, ρ) − V(T(Z,θ̄), ρ) ) ]             (Finite Difference Error)   (3.12a)
      + sup_{θ∈Θ} [ (1/h)( V(T(Z,θ̄)+hξ, ρ) − V(T(Z,θ̄), ρ) ) − D(Z,θ) ].              (Replication Error)         (3.12b)

Intuitively, the Finite Difference Error measures the error incurred by approximating the derivative by a forward finite-step approximation. The Replication Error measures the error introduced by replacing ξ by δ_h.
We can further bound the Finite Difference Error by invoking the concavity of f(λ) = V(T(Z, θ̄) + λξ, ρ).² By concavity,

    f(h) ≤ f(0) + h f′(0)   ⟹   [ V(T(Z,θ̄)+hξ, ρ) − V(T(Z,θ̄), ρ) ] / h ≤ ξ⊤x̂(T(Z,θ̄), ρ),
    f(0) ≤ f(h) − h f′(h)   ⟹   [ V(T(Z,θ̄)+hξ, ρ) − V(T(Z,θ̄), ρ) ] / h ≥ ξ⊤x̂(T(Z,θ̄)+hξ, ρ),

where we have again used Danskin’s theorem to compute the derivatives. Combining shows

    Eq. (3.12a) ≤ sup_{θ∈Θ} ξ⊤[ x̂(T(Z,θ̄), ρ) − x̂(T(Z,θ̄)+hξ, ρ) ].   (Solution Stability)   (3.13)

As we will argue in Section 3.4.1 below, Solution Stability depends strongly on the structure of Problem (3.1) and the choice of policy class.

²This map is concave because it is the minimum of linear functions of λ.
As we will argue in Section 3.4.1 below, Solution Stability depends strongly on the structure of
Problem (3.1) and the choice of policy class.
By contrast, we can bound the Replication Error with minimal additional assumptions. Our
results primarily depend on the complexity of the plug-in class which we capture by deterministic
covering of the set Θ. Specifically, we assume the following covering exists.
Assumption 3.4.1 (Plug-in and Regularization Covering). For any ε > 0, there exists a deterministic set Θ0(ε) such that for every (
¯θ ,ρ) ∈ Θ there exists a (
¯θ 0,ρ0) ∈ Θ0(ε) such that ∥T (Z,
¯θ )−
T (Z,
¯θ 0)∥1 ≤ εCΘ(∥Z∥1 +1) and |ρ −ρ0| ≤ ε for some universal constant C ≥ 1.
Many of the machine learning plug-ins in Section 3.2.3 satisfy the assumption due to their
affine structure as T (Z,
¯θ ) is C(∥Z∥1 + 1)-Lipschitz with respect to ¯θ for some constant C. We
also note the assumption is trivially satisfied if Θ0 is finite as often happens with clustering plug-ins
(Example 3.2.7).
Combining these two bounds yields our first bound on estimation error:
Lemma 3.4.2 (Bounding Estimation Error by Stability). Suppose Assumptions 3.3.1, 3.3.3 and 3.4.1 hold. Then, there exists an absolute constant C (depending on φmax, νmin, ∥µ∥∞) such that the following holds with probability at least 1 − 2ε:

    sup_{θ∈Θ} [ Z⊤x(Z,θ) − D(Z,θ,h) − µ⊤x(Z,θ) ]                                         (Estimation Error)
      ≤ C ((CT + h)κ / h) √( n · log n · log|Θ0(n^{-1/2})| · log(2/ε) )                     (Deviation of Replication Error)
      + (1/h) sup_{θ∈Θ} W1( T(Z,θ̄) + hξ, T(Z,θ̄) + δ_h )                                    (Expected Replication Error)
      + sup_{θ∈Θ} ξ⊤[ x̂(T(Z,θ̄), ρ) − x̂(T(Z,θ̄) + hξ, ρ) ],                                 (Solution Stability)

where W1 is the Wasserstein 1-distance metric.
For many of the policies presented earlier in Section 3.2.3, e.g., Examples 3.2.4 and 3.2.7, |Θ0(n^{-1/2})| is polynomial in n and, hence, the first term in the bound is Õ(√n/h). The second term of the bound depends on how well we replicate the (unknown) perturbation hξ. In special cases, it is possible to exactly replicate hξ so that this term is 0.

Proposition 3.4.3 (Exact Distribution Replication with Multivariate Gaussians). Let Z ∼ N(µ, Σ) and let T(Z, θ̄) = L(θ̄)Z + l(θ̄) be affine. If L(θ̄)Σ + ΣL(θ̄)⊤ is a positive semi-definite matrix, then choosing δ_h ∼ N( 0, h( L(θ̄)Σ + ΣL(θ̄)⊤ ) + h²Σ ) yields W1( T(Z, θ̄) + hξ, T(Z, θ̄) + δ_h ) = 0.
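A small sketch of the construction in Proposition 3.4.3 (illustrative only, not the thesis code):

```python
# Minimal sketch (illustrative only): building the replication noise delta_h of
# Proposition 3.4.3 for an affine plug-in T(Z) = L Z + l under Z ~ N(mu, Sigma).
# With this covariance, T(Z) + delta_h matches the distribution of T(Z) + h*xi,
# so the Expected Replication Error term in Lemma 3.4.2 is zero.
import numpy as np

def replication_noise_cov(L: np.ndarray, Sigma: np.ndarray, h: float) -> np.ndarray:
    """Covariance h(L Sigma + Sigma L') + h^2 Sigma of delta_h (must be PSD)."""
    return h * (L @ Sigma + Sigma @ L.T) + h**2 * Sigma

def sample_delta_h(L, Sigma, h, n_sims=1000, seed=0):
    """Draw delta_h ~ N(0, h(L Sigma + Sigma L') + h^2 Sigma)."""
    rng = np.random.default_rng(seed)
    cov = replication_noise_cov(L, Sigma, h)
    return rng.multivariate_normal(np.zeros(L.shape[0]), cov, size=n_sims)

# Sanity check: with L = I and Sigma = I this reduces to (h^2 + 2h) I,
# the choice used in the comparison of Section 3.3.3.
n, h = 20, 0.5
assert np.allclose(replication_noise_cov(np.eye(n), np.eye(n), h), (h**2 + 2*h) * np.eye(n))
```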
Finally, the third term in Lemma 3.4.2 is the solution stability. As we argue in the next section, we believe that some dependence on solution stability is fundamental and unavoidable. In Sections 3.5 and 3.6, we use problem structure to bound this term.
3.4.1 Is Dependence on Solution Stability Unavoidable?
We build on our heuristic derivation from Section 3.3.1 to develop some intuition for why solution
stability arises in our analysis and why we believe it is a fundamental quantity. Other authors (Yu,
2013; Shalev-Shwartz et al., 2010) have similarly observed the importance of stability in learning
applications. We fix θ and drop it from the notation throughout this subsection.
For a twice-differentiable function, a classical Taylor series argument shows that the error in approximating the first derivative by a first-order finite-step difference scales like the second derivative, i.e.,

    [ f(h) − f(0) ] / h − f′(0) = (h/2) f″(0) + o(h).

Hence, for the function λ ↦ V(T(Z) + λξ) from Eq. (3.8), we might expect that

    [ V(T(Z) + hξ) − V(T(Z)) ] / h − ξ⊤x̂(T(Z))  ≈  (h/2) ∂²/∂λ² V(T(Z) + λξ) |_{λ=0},

and

    ∂²/∂λ² V(T(Z) + λξ) |_{λ=0} = ∂/∂λ ξ⊤x̂(T(Z) + λξ) |_{λ=0} = ξ⊤ J(T(Z)) ξ,

where J(T(Z)) ∈ R^{n×n} is the Jacobian of λ ↦ x̂(T(Z) + λξ). This heuristic analysis suggests that the magnitude of changes in the solution vector x̂(T(Z)) to perturbations of its input drives the error in our method. More specifically, if we apply a first-order finite difference to ∂/∂λ ξ⊤x̂(T(Z) + λξ) at λ = 0, we recover ξ⊤( x̂(T(Z) + hξ) − x̂(T(Z)) ), the solution stability of Eq. (3.13).
This perspective also highlights why solution stability depends on the particular structure of Problem (3.1) and the policy class. Specifically, the (i, j)th element of J(T(Z)) is ∂x̂i(T(Z))/∂Zj, measuring how much the ith component of the solution changes with small perturbations of the jth component of the input. The magnitude of this change depends strongly on the constraints in the feasible set X of Problem (3.1) and how T(·) magnifies or attenuates the perturbation. Hence, unlike
bounding the replication error, bounding the finite difference error seemingly requires stronger
assumptions on Problem (3.1).
In the next two sections, we present rigorous bounds on the solution stability (and hence, finite
difference error) under different structural assumptions on Problem (3.1). In Section 3.5, we treat
the simpler case of regularized policies and convex X, where we can bound the solution stability
using strong convexity. In Section 3.6, we treat the more difficult case of unregularized affine
policies where we instead require a weakly-coupled assumption on X .
3.5 Bounds for Regularized Plug-in Policies in Convex Problems
We first present bounds for regularized plug-in policies when the feasible set X is convex. We leverage the strong convexity of the regularized objective t⊤x + ρ ∑j φ(xj) to argue that plug-in policies are Lipschitz in the plug-in costs.

Lemma 3.5.1 (Regularized Plug-in Policies are ρ-Lipschitz). When X is convex and ρ > 0,

    ∥ x̂(t, ρ) − x̂(t′, ρ) ∥ ≤ (1/ρ) ∥ t − t′ ∥∗,

where ∥·∥∗ is the dual norm of ∥·∥.
The lemma allows us to bound the solution stability directly, yielding,
Theorem 3.5.2 (Estimation Error for Regularized Policies). Assume Assumptions 3.3.1, 3.3.3 and 3.4.1 hold, X is convex, and Θ = [ρmin, ρmax] × Θ̄ for 0 < ρmin < ρmax. If φ(·) is 1-strongly convex with respect to some norm ∥·∥, then there exists a constant C (depending on φmax, νmin, ∥µ∥∞) such that with probability at least 1 − ε,

    sup_{θ∈Θ} [ Z⊤x(Z,θ) − D(Z,θ,h) − µ⊤x(Z,θ) ]                                (Estimation Error)
      ≤ (h/ρmin) ∥ξ∥²∗                                                           (Solution Stability Measure)
      + C ((CT + h)κ / h) √( n log|Θ0(n^{-1/2})| log(1/ε) )
      + (1/h) sup_{θ∈Θ} W1( T(Z,θ̄) + hξ, T(Z,θ̄) + δ_h ).                        (Replication Error)

The bound depends on the choice of regularizer. Common choices include φ(x) = (1/2)x², which is 1-strongly convex with respect to the ℓ2 norm, and φ(x) = x log x, which is 1-strongly convex with respect to the ℓ1 norm. We prove corollaries by bounding ∥ξ∥²∗ explicitly.
Corollary 3.5.3 (∥·∥1 Strongly Convex Regularizers). Under the assumptions of Theorem 3.5.2, if φ(·) is 1-strongly convex with respect to ∥·∥1, then there exists an absolute constant C (depending on φmax, νmin, ∥µ∥∞) such that with probability at least 1 − 2ε,

    sup_{θ∈Θ} [ Z⊤x(Z,θ) − D(Z,θ,h) − µ⊤x(Z,θ) ]                                (Estimation Error)
      ≤ C h κ² log n log(1/ε) / ρmin
      + C ((CT + h)κ / h) √( n log|Θ0(n^{-1/2})| log(1/ε) )
      + (1/h) sup_{θ∈Θ} W1( T(Z,θ̄) + hξ, T(Z,θ̄) + δ_h ).

Corollary 3.5.4 (∥·∥2 Strongly Convex Regularizers). Under the assumptions of Theorem 3.5.2, if φ(·) is 1-strongly convex with respect to ∥·∥2, then there exists an absolute constant C (depending on φmax, νmin, ∥µ∥∞) such that with probability at least 1 − 2ε,

    sup_{θ∈Θ} [ Z⊤x(Z,θ) − D(Z,θ,h) − µ⊤x(Z,θ) ]                                (Estimation Error)
      ≤ C h n κ² log(1/ε) / ρmin
      + C ((CT + h)κ / h) √( n log|Θ0(n^{-1/2})| log(1/ε) )
      + (1/h) sup_{θ∈Θ} W1( T(Z,θ̄) + hξ, T(Z,θ̄) + δ_h ).

In the special case that the Wasserstein component is zero, we can optimize the choice of h, yielding errors of Õp( n^{1/4} ρmin^{-1/2} ) for Corollary 3.5.3 and Õp( n^{3/4} ρmin^{-1/2} ) for Corollary 3.5.4. Both errors vanish relative to the out-of-sample performance in the typical case that the optimal value of Problem (3.1) is O(n).
3.6 Bounds for Unregularized Plug-in Policies in Weakly-Coupled Problems
For combinatorial optimization problems with non-convex feasible sets, Lemma 3.5.1 fails to hold. We next prove estimation error bounds for non-regularized plug-in policies, i.e., when ρ = 0 and Θ = {0} × Θ̄. We focus on a general class of linear optimization problems known as weakly-coupled linear optimization problems, first introduced in Gupta, Huang, and P. Rusmevichientong (2022).
To describe such problems, let S0,...,SK be a disjoint partition of {1,...,n} and let x^k = (xj : j ∈ Sk) denote the vector of decision variables for k = 0,...,K. Without loss of generality, reorder the indices so that the Sk occur “in order,” i.e., (j ∈ S0),...,(j ∈ SK) is a consecutive sequence. Weakly-coupled optimization problems have the following form:

    min_{x^0,...,x^K}  µ^0⊤x^0 + ∑_{k=1}^K µ^k⊤x^k                       (3.14a)
    s.t.  ∑_{k=0}^K ∑_{j∈Sk} A^0_j(x^0) xj ≤ b^0(x^0)                     (3.14b)
          x^0 ∈ X^0                                                       (3.14c)
          x^k ∈ X^k(x^0),  ∀k = 1,...,K,                                  (3.14d)

where, for each fixed x^0, A^0_j(x^0) ∈ R^m for all j, b^0(x^0) ∈ R^m, and X^k(x^0) for k = 1,...,K are convex. The coupling set X^0 is possibly non-convex. We let X^WC denote the feasible set that satisfies Eqs. (3.14b) to (3.14d) and assume X^WC ⊆ [0,1]^n. For convenience, define Smax ≡ maxk |Sk| to be the size of the largest subproblem.
The optimization problem is “weakly-coupled” in the sense that, by removing the linear constraints (Eq. (3.14b)) and fixing a choice of x^0 ∈ X^0 (Eq. (3.14c)), the optimization problem decouples into K separate convex subproblems of the form

    min_{x^k}  µ^k⊤x^k,   s.t.  x^k ∈ X^k(x^0).   (3.15)

We assume throughout that, after fixing x^0, Problem (3.14) satisfies strong convex duality.
Many applications of linear optimization have the form of Problem (3.14). For example, two-stage stochastic optimization problems are often weakly coupled by variables. In these problems, x^0 are first-stage decision variables (which are sometimes binary), and each x^k represents second-stage decisions for the K different scenarios. Gupta, Huang, and P. Rusmevichientong (2022) consider such an instance to model drone-assisted emergency medical response. In that application, m = 0 and |X^0| = O(1) as K → ∞. Other applications are weakly coupled by constraints, where the binding constraints (Eq. (3.14b)) model resource budgets, e.g., on time or labor. Example 3.2.1 is such an example, where the binding constraint limits the number of speed humps that can be built. In general, we argue that in typical applications, the number of subproblems is K = O(n), the number of coupling constraints is m = O(1), and log|X^0| = o(n). Although we present results for general instances, the reader may want to focus on these scalings as they interpret our results.
Additional Assumptions
Proving strong estimation error bounds for Problem (3.14) with general policies appears challenging. We make several simplifying assumptions. First, we specialize our policy class.
Definition 3.6.1 (Affine Plug-in Policy). An affine plug-in policy x(Z, (0, θ̄)) is a plug-in policy where T(z, θ̄) = L(θ̄)z + l(θ̄), and L(·), l(·) may implicitly depend on W.

As highlighted in Section 3.2.3, many machine learning methods are W-dependent affine transformations of Z. The separable affine plug-in policy class studied in Gupta, Huang, and P. Rusmevichientong (2022) corresponds to the special case where L(θ̄) is diagonal.
To simplify exposition, we also assume the parameters are appropriately rescaled:

Assumption 3.6.2 (Scaling of Plug-in Parameters). Let σj(θ̄) be the jth largest singular value of L(θ̄). We assume that sup_{θ̄∈Θ̄} σ1(θ̄) ≤ 1.

Assumption 3.6.2 almost holds without loss of generality because (L(θ̄), l(θ̄)) can be scaled by a small positive constant; such scaling does not alter x(Z, (0, θ̄)). We say “almost” because Assumption 3.6.2 requires this constant be chosen uniformly over Θ̄.
We also require invertibility of the submatrices of L(θ̄) and L(θ̄) + hI:

Assumption 3.6.3 (Invertibility of Submatrices). Let L^k(θ̄) ∈ R^{|Sk|×|Sk|} be the submatrix of L(θ̄) induced by the set Sk. We assume there exists a constant σmin such that

    0 < σmin ≤ inf_{θ̄∈Θ̄} σmin( L^k(θ̄) )   and   0 < σmin ≤ inf_{θ̄∈Θ̄} σmin( L^k(θ̄) + hI ),   ∀k = 1,...,K,

where σmin(·) denotes the smallest singular value, and h is the step size in the one-shot VGC (cf. Eq. (3.7)).
Practically, we can satisfy Assumption 3.6.3 by adding a small perturbation γI to L(θ̄). In Section 3.7, however, we show our approach is effective even when Assumption 3.6.3 does not hold.
Finally, following Gupta and P. Rusmevichientong (2021b) and Gupta, Huang, and P. Rusmevichientong (2022), we assume Gaussian corruptions. Recall Σ is the covariance of Z.

Assumption 3.6.4 (Gaussian Corruptions). We assume Z ∼ N(µ, Σ) and that L(θ̄)Σ + ΣL(θ̄)⊤ is positive semidefinite for every θ̄ ∈ Θ̄.

We focus on the Gaussian case for simplicity. Using Gupta, Huang, and P. Rusmevichientong (2022, Lemma B.4), one might extend our results to the case of approximately Gaussian noise, but we do not pursue this below. The assumption that L(θ̄)Σ + ΣL(θ̄)⊤ is positive semidefinite allows us to invoke Proposition 3.4.3 to simplify our analysis.
Under these assumptions, we first analyze the special case of block-decoupled problems, i.e., when m = 0 and |X^0| = 0, in Section 3.6.1. We utilize these results to construct a specialized one-shot VGC for general weakly-coupled problems in Section 3.6.2. We bound the estimation error of this specialized one-shot VGC for weakly-coupled problems in Section 3.6.3.
Since ρ = 0 throughout this section, we write x̂(t) and drop ρ from the notation.
3.6.1 Block Decoupled Problems
Our first result bounds the solution stability. The proof uses the chromatic number of a particular graph built from the policy class to bound the dependence between subproblems. For simplicity of exposition, we present a weaker bound using Brooks’ theorem to upper bound this chromatic number. (See Section 3.13.4 for the tighter bound and proof.)

Lemma 3.6.5 (Solution Stability of Block-Decoupled Problems). Suppose Assumptions 2.2.4, 3.3.1, 3.6.2 and 3.6.3 hold. Consider Problem (3.14) with m = 0 and |X^0| = 0. If Θ = {0} × Θ̄, then there exists a constant C (depending on νmin, νmax, σmin) such that, with probability at least 1 − ε,

    sup_{θ̄∈Θ̄} ξ⊤[ x̂(T(Z,θ̄)) − x̂(T(Z,θ̄) + hξ) ]
      ≤ C Smax n h √(log(1/h))                                              (Expected Solution Stability)
      + C Smax^{3/2} √( n Tmax log Xmax ) · log( Smax Tmax / ε ),           (Stochastic Error)

where Tmax = maxj ∑_{i=1}^n max_{θ̄∈Θ̄} I{ Lij(θ̄) ≠ 0 } and Xmax = max_{z∈R^n} |XΘ(z)|.
We interpret each term of the bound. The bound on the expected solution stability scales with Smax, the size of the blocks. In the typical case where Smax = O(1) as n → ∞, such as Example 3.2.1, the expected solution stability is Õp(nh).
The stochastic error depends on the level of independence between the policy components xj(Z, θ), as measured by Smax Tmax, which bounds the number of subproblems that can be correlated with one another. The √(log Xmax) term measures the policy class complexity to allow for a uniform bound. The constants Tmax and Xmax depend on our choice of policy class. Loosely, when L(θ̄) is sparse, e.g., when denoising is based on “local” neighborhoods, Tmax = O(1). Furthermore, if |Θ̄| is small, we expect Xmax is also small. We formalize these intuitions in Section 3.13.1 by revisiting our examples from Section 3.2.3.
We can now combine Lemma 3.4.2 and Lemma 3.6.5 to obtain a bound on the estimation error of the one-shot VGC. Let Lmax = sup_{θ̄∈Θ̄} sup_{i,j} |Lij(θ̄)|.
Theorem 3.6.6 (Estimation Error for Block-Decoupled Problems). Suppose Assumptions 2.2.4, 3.3.1, 3.4.1, 3.6.2 and 3.6.3 hold. Consider Problem (3.14) with m = 0 and |X^0| = 0. Define δ_h as in Proposition 3.4.3, and assume h < CT. If Θ = {0} × Θ̄, then there exists a constant C (depending on νmin, νmax, σmin, ∥µ∥∞, Lmax) such that with probability at least 1 − ε,

    sup_{θ∈Θ} [ Z⊤x(Z,θ) − D(Z,θ,h) − µ⊤x(Z,θ) ]                                                  (Estimation Error)
      ≤ C ( Smax n^{1/2} h + Tmax/h ) √( n log n · log(1/h) · log|Θ0(n^{-1/2})| · log(2/ε) )         (VGC Step-Size Trade-Off)
      + C Smax^{3/2} √( n Tmax log Xmax ) log( Smax Tmax / ε ).                                      (Stochastic Error)
The VGC step-size trade-off groups terms that depend on the choice of h, and the stochastic
error groups terms related to the structure of the optimization problem and plug-in estimator. By
optimizing h, we obtain the following corollary.
Corollary 3.6.7 (Optimized Step Size). Under the assumptions of Theorem 3.6.6, if h = O( n^{-1/4} Tmax^{1/2} Smax^{-1/2} ), there exists a constant C (depending on νmin, νmax, σmin, ∥µ∥∞, Lmax) such that with probability at least 1 − ε,

    sup_{θ∈Θ} [ Z⊤x(Z,θ) − D(Z,θ,h) − µ⊤x(Z,θ) ]
      ≤ C Smax^{3/4} n^{3/4} log n √( Tmax log Xmax log|Θ0(n^{-1/2})| · log( Smax Tmax / ε ) ).

The corollary shows that when all problem and policy parameters are Õ(1), the estimation error grows at a rate Õp(n^{3/4}). Thus, in typical settings where the full-information performance is of order O(n), the one-shot VGC estimator has vanishing relative error.
3.6.2 A Modified VGC for Weakly-Coupled Problems
Our strategy for the weakly-coupled Problem (3.14) will be to i) “fix” values of the coupling variables (Eq. (3.14c)) and Lagrangian duals for the coupling constraints (Eq. (3.14b)), ii) apply our previous results to the resulting decoupled problem, and iii) take a union bound over all possible values of the “fixed” quantities. To execute this strategy, we define a lifted policy class indexed by the “fixed” quantities and then define a modified VGC for this lifted class.

Lifted Affine Policy Class

We first introduce dual notation. Recall that the general affine plug-in policy for the weakly-coupled optimization problem (3.14) solves

    x(Z, θ) ∈ argmin_x  ∑_{j=1}^n Tj(Z, θ̄) xj                           (3.16a)
    s.t.  ∑_{k=0}^K ∑_{j∈Sk} A^0_j(x^0) xj ≤ b^0(x^0)                     (3.16b)
          x^0 ∈ X^0                                                       (3.16c)
          x^k ∈ X^k(x^0),  ∀k = 1,...,K,                                  (3.16d)

where A^0_j(x^0) ∈ R^m for j = 1,...,n and b^0(x^0) ∈ R^m are vectors that may or may not depend on x^0.
Fix some x^0 ∈ X^0 and consider relaxing Eq. (3.16b) with dual variables λ ∈ R^m_+ to obtain the block-decoupled optimization problem

    min_x ∑_{k=1}^K ⟨ T^k(Z, θ̄) + A^0_{Sk}(x^0)⊤λ, x^k ⟩,   s.t.  x^k ∈ X^k(x^0),  ∀k = 1,...,K,   (3.17)

where A^0_{Sk}(x^0) is formed by taking the columns j ∈ Sk. Let x̃^k(Z, λ, x^0, θ̄) denote the solution of the kth subproblem and x̃(Z, λ, x^0, θ̄) = ( x^0, x̃^1(Z, λ, x^0, θ̄)⊤, ..., x̃^K(Z, λ, x^0, θ̄)⊤ )⊤. Although the notation is onerous, the intuition is to view Eq. (3.17) as defining the plug-in policy x̃(Z, λ, x^0, θ̄) for the target optimization problem

    min_x ∑_{k=1}^K ⟨ µ^k, x^k ⟩,   s.t.  x^k ∈ X^k(x^0),  ∀k = 1,...,K,

where (x^0, λ, θ̄) define the plug-in policy. Our strategy will be to use Theorem 3.6.6 to debias the in-sample performance of this policy, and then relate it back to the original plug-in policy for our weakly-coupled problem. To that end, define the lifted affine policy class

    X̃(Z) = { x̃(Z, λ, x^0, θ̄) : λ ∈ R^m_+, x^0 ∈ X^0, θ̄ ∈ Θ̄ }

and the corresponding plug-in function

    Ṽ(t, x^0) = min{ t⊤x : x^k ∈ X^k(x^0), ∀k = 1,...,K }.

Our one-shot VGC applied to this lifted, block-decoupled problem is

    D̃(Z, λ, x^0, θ̄, h) ≡ (1/h) E[ Ṽ( T(Z, θ̄) + (A^0)⊤λ + δ_h, x^0 ) − Ṽ( T(Z, θ̄) + (A^0)⊤λ, x^0 ) | Z ],

where δ_h is defined as in Proposition 3.4.3.
Of course, Eq. (3.17) is not the same as Problem (3.16). However, one might intuit that if we set (λ, x^0) to their optimal values, Eq. (3.17) might closely approximate Problem (3.16) and the corresponding VGC might approximately debias Problem (3.16). To that end, for a fixed x^0, we define the optimal dual variables

    λ(Z, θ̄, x^0) ∈ argmax_{λ≥0} { −⟨ b^0(x^0), λ ⟩ + ∑_{k=0}^K ⟨ T^k(Z, θ̄) + A^0_{Sk}(x^0)⊤λ, x̃^k(Z, λ, x^0, θ̄) ⟩ }   (3.18)

and the optimal coupling variables x^0(Z, θ) as the portion of the optimal solution to Problem (3.16) corresponding to block S0. Our modified VGC correction is then

    D^WC(Z, θ, h) ≡ D̃( Z, λ(Z, θ̄, x^0(Z, θ)), x^0(Z, θ), θ̄, h ).   (3.19)
Remark 3.6.8 (Computing the Modified VGC). Computing the modified VGC is not much harder than solving Problem (3.16), since one can extract x^0(Z, θ) and then solve a convex problem to obtain λ(Z, θ̄, x^0) and x̃( Z, λ(Z, θ̄, x^0), x^0(Z, θ), θ̄ ). In practice, however, we adopt a simpler approach. Let V(t) denote the optimal objective value of Problem (3.16) when the cost coefficients are t ∈ R^n. We approximate D^WC(Z, θ, h) ≈ (1/h) E[ V( T(Z, θ̄) + δ_h ) − V( T(Z, θ̄) ) | Z ]. As can be seen in the proof, these quantities are asymptotically identical. We treat Eq. (3.19) in our analysis for simplicity.
3.6.3 Estimation Error of Weakly-Coupled One-Shot VGC
The key insight for leveraging Theorem 3.6.6 to analyze D^WC is relating solutions of Problem (3.16) to our lifted affine policy class. In particular, we will show that

    x^k(Z, θ) = x̃^k( Z, λ(Z, θ̄, x^0(Z, θ)), x^0(Z, θ), θ̄ )

whenever x̃^k( Z, λ(Z, θ̄, x^0(Z, θ)), x^0(Z, θ), θ̄ ) is the unique solution to its defining optimization problem. We next introduce a technical assumption that allows us to bound the number of blocks with multiple solutions.
Let D^k(x^0) be the set of unit vectors d such that min_{x∈X^k(x^0)} d⊤x has multiple optima, and define the set of vectors

    F(Z, x^0, θ̄) ≡ ∪_{k=1}^K { ( Tj(Z, θ̄), A^0_j, −dj ) : j ∈ Sk, d ∈ D^k(x^0) } ⊆ R^{m+2}.

We assume these vectors are in general position.

Assumption 3.6.9 (Induced Cost Vectors in General Position). We have

    P( F(Z, x^0, θ̄) is in general position for all θ̄ ∈ Θ̄ and x^0 ∈ X^0 ) = 1.

We use this assumption in Lemma 3.14.1 to show that the number of subproblems of our lifted affine policy that have multiple solutions is at most m.
Additionally, we require an assumption on the boundedness of the optimal dual values λ(Z, θ̄, x^0) in order to cover the policy class. We adopt a simple assumption similar to Gupta, Huang, and P. Rusmevichientong (2022) and Gupta and P. Rusmevichientong (2021b).

Assumption 3.6.10 (s̄-Strict Feasibility of Coupling Constraints). For each x^0 ∈ X^0, there exist an s̄ > 0 and x̄ ∈ X^WC such that ∑_{j=1}^n A^0_j(x^0) x̄j + s̄ e ≤ b^0(x^0).

Lemma 3.14.2 bounds ∥λ(Z, θ)∥1 under this assumption.
We can now present our uniform bound on the estimation error of the weakly-coupled one-shot VGC. Let ∥A^0∥∞ be the entry-wise infinity norm for matrices.

Theorem 3.6.11 (Estimation Error of the Weakly-Coupled VGC). Assume Assumptions 2.2.4, 3.3.1, 3.4.1, 3.6.2, 3.6.3, 3.6.9 and 3.6.10 hold, and let Θ = {0} × Θ̄. Choose h = O( n^{-1/4} Tmax^{1/2} Smax^{-1/2} ) and δ_h as in Proposition 3.4.3. Consider Problem (3.14). There exists a constant C (depending on νmin, νmax, σmin, ∥µ∥∞, Lmax, log s̄^{-1}, ∥A^0∥∞) such that the following holds with probability at least 1 − ε − exp(−n):

    sup_{θ∈Θ} [ Z⊤x(Z,θ) − D^WC(Z,θ,h) − µ⊤x(Z,θ) ]                                                             (Estimation Error)
      ≤ C Smax^{3/4} n^{3/4} log^{3/2} n √( m Tmax log Xmax log|Θ0(n^{-1/2}/2)| · log( Smax Tmax |X^0| / ε ) )     (Block-Decoupled Bound)
      + C m Smax √( log n · log(1/ε) ).                                                                             (Lifted Policy Estimation Error)

Since we apply the one-shot VGC to the lifted class, our bound for weakly-coupled problems only differs from Theorem 3.6.6 by the additional lifted policy approximation error. This error bounds the difference between the in-sample optimism of the lifted policy and the original policy. In typical settings where m = O(1) (cf. Example 3.2.1), this term is Õ(1).
We stress that since our weakly-coupled VGC applies to the lifted, decoupled problem, it depends on the complexity of the lifted policy class X̃Θ(Z), not just XΘ(Z). Since the lifted class simply shifts the plug-in by an affine amount, its complexity can often be bounded using the same techniques as in Section 3.13.1.
3.7 Case Study: Prioritizing Speed Hump Requests
Our case study further develops Example 3.2.1, i.e., prioritizing speed hump requests to improve traffic safety. We utilize motor vehicle accident data from the New York Police Department spanning 2012-2023 (New York, 2023). Our goal is to identify census tracts that have the highest incidence of pedestrian injury due to motor vehicle accidents, subject to a budget constraint on the number of tracts that can be serviced in a given year (B). All code and data for this example are available on GitHub.³

Data and Setup. We consider selecting the top 5% of the 2,157 total census tracts in New York City to prioritize speed hump requests, i.e., B = 107 and n = 2,157.
We let µj represent the expected number of crashes per year with pedestrian, cyclist, or motorist injuries in census tract j. We model µj = λj pj, where λj and pj are, respectively, the crash rate per year and the probability a pedestrian is injured given an accident for census tract j. As ground truth, we take pj to be the observed pedestrian injury rate for each census tract across the entire time period covered by the dataset (2012-2023). These ground-truth values range from 0 to 0.1576 across census tracts. We treat these values as unknown. Similarly, we take the ground-truth λj to be the observed average number of crashes per year over the time frame. These values range from 1.4 to 448 across census tracts, with an average of 52 crashes per year. However, we treat these values as known and fixed.
We set Nj to be the number of accidents observed in 2018. To generate our observations, we let p̂j ∼ Binomial(γNj, pj) for γ ∈ {1,2,3,4}. Intuitively, this represents estimating the pedestrian injury rate based on γ years of historical data. The predicted number of crashes with injuries is then Zj = λj p̂j/γ. This setup allows us to generate multiple simulation paths Zj and also vary the amount of historical data available in forming these estimates.

³https://github.com/mh3166/Decision-Aware-Denoising
Policy Class. We consider a non-regularized kernel smoother plug-in policy class (Example 3.2.4):

    Tj(p̂, W, θ) = λj ∑_{i=1}^n p̂i I{ ∥Wj − Wi∥ ≤ θ } / ∑_{l=1}^n I{ ∥Wj − Wl∥ ≤ θ },

where we let the bandwidth θ ≥ 0 vary. Here Wj are the longitude and latitude coordinates of the centroid of census tract j. Our goal is to select the θ that maximizes the out-of-sample performance ∑_{j=1}^n λj pj xj(p̂, W, θ).
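For concreteness, a sketch of this transformation follows (illustrative only; variable names are assumptions, and the actual implementation is in the authors' GitHub repository).

```python
# Minimal sketch (illustrative only): the case-study plug-in. Each tract's predicted
# injuries lambda_j * p_hat_j is replaced by lambda_j times the average injury rate
# p_hat over all tracts whose centroids lie within bandwidth theta.
import numpy as np

def case_study_plugin(p_hat: np.ndarray, lam: np.ndarray, W: np.ndarray, theta: float) -> np.ndarray:
    """T_j(p_hat, W, theta) = lambda_j * mean of p_hat over the theta-neighborhood of tract j."""
    dists = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=-1)
    K = (dists <= theta).astype(float)           # box-kernel neighborhoods (tract j is always included)
    return lam * (K @ p_hat) / K.sum(axis=1)

# Decision-aware selection of theta: sweep a grid of bandwidths and keep the one
# whose debiased estimate Z'x(Z, theta) - D(Z, theta) is best.
```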
Experiments. We compare decision-aware approaches to decision-blind approaches for selecting θ as we vary the amount of crash data observed (γ).
We utilize our one-shot VGC to estimate and select the choice of θ with the best out-of-sample performance estimate. We set h = n^{-1/4} and simulate δ_h as multivariate Gaussians, similar to Proposition 3.4.3. For full implementation details, see Appendix 3.15.2.
For our decision-blind policy, we consider a predict-then-optimize policy that chooses the θ that minimizes the (oracle) mean squared error (MSE) of the predicted number of pedestrians injured. By using the oracle value of the MSE, we provide an advantage to the decision-blind method and avoid any noise introduced by misestimating the MSE. Formally, the MSE policy selects

    θ^MSE ∈ argmin_{θ≥0} (1/n) ∑_{j=1}^n ( Tj(p̂, W, θ) − λj pj )².

Finally, we also benchmark against the oracle θ that optimizes the true out-of-sample performance for each realization of our data p̂.
For ease of comparison, we present the performance of all three benchmarks relative to the full-information optimal performance max_{x∈X} ∑_{j=1}^n λj pj xj.

Figure 3.2: Performance Results. We compare the performance of our method to an MSE benchmark over 50 trials, showing how performance (as a % of full-information) changes with the years of data available. While the VGC approach achieves best-in-class performance as the amount of data increases, the MSE approach does not.
3.7.1 Results
Figure 3.2 plots the performance of the three methods. The one-shot VGC outperforms the oracle
MSE approach across all amounts of past data. The performance of the one-shot VGC is also more
robust across data samples as seen in the shorter boxplot whiskers.
Figure 3.3 maps the selected census tracts along with contours of pedestrian injury rates generated from the observed crashes. Many of the census tracts selected by the MSE policy lie in regions with multiple contour lines, indicating a changing density of crashes with pedestrian injuries. By contrast, the one-shot VGC policy tends to select census tracts in the center of regions with a high pedestrian injury rate.
To better understand this phenomenon, we plot the distributions of $\theta^{\mathrm{MSE}}$, $\theta^{\mathrm{OR}}$, and $\theta^{\mathrm{VGC}}$ (the $\theta$ that optimizes our estimator) in Fig. 3.4. Note that $\theta^{\mathrm{OR}}$ and $\theta^{\mathrm{MSE}}$ are quite different. By construction, $\theta^{\mathrm{MSE}}$ is the best possible $\theta$ for minimizing MSE, not downstream decision loss. In this example, it systematically over-smooths the data. This partially explains why it chooses census tracts that cross contour lines – the larger $\theta$ effectively smooths away the differences across the contour. By contrast, the one-shot VGC selects a $\theta^{\mathrm{VGC}}$ much closer to the oracle value. We see this as strong evidence that i) decision-aware approaches are necessary for finding the "right" level
of smoothing in these types of applications, and ii) our one-shot VGC, despite being light-touch, is able to identify a good level of smoothing.

Figure 3.3: Map of Census Tracts Selected. We plot the census tracts selected by the MSE policy and the one-shot VGC policy for the Manhattan and Brooklyn boroughs of New York City. The contour lines plot the density of crashes with pedestrian injuries.

Figure 3.4: Amount of Smoothing by Method. The figure compares the values of $\theta$ chosen by each method (boxplots of $\theta$ versus years of data, 1–4, for MSE, VGC, and Oracle). Across realizations, $\theta^{\mathrm{MSE}}$ tends to oversmooth relative to $\theta^{\mathrm{OR}}$, particularly as the amount of historical data grows. By contrast, the one-shot VGC performs less smoothing.
3.8 Conclusion
Our paper highlights the benefits of adapting traditional denoising techniques to decision-making settings. We provide rigorous guarantees for our approach over a wide class of regularized and non-regularized plug-in policies. Specifically, we provide a general bound on the error of the approach in terms of the stability of the underlying policies. For regularized plug-in policies and convex optimization, this approach readily yields strong guarantees showing that the relative error of the method vanishes as the problem size increases. We also develop a customized VGC estimator for general affine plug-in policies for weakly-coupled, potentially non-convex optimization problems. Under somewhat stronger assumptions, we also show the relative error in these settings vanishes as the problem size grows.
The fundamental role of stability in these results complements the role of stability in more
traditional generalization guarantees in machine learning. Exploring other aspects of stability in
data-driven optimization is an open area.
Finally, while our results primarily focus on debiasing policy performance for optimization
problems with linear objectives, developing similar debiasing techniques for other structured optimization classes remains an exciting direction for future work.
3.9 Appendix: Background Results
Recall, a mean-zero random variable $X$ is sub-Gaussian with variance proxy $\kappa^2$ if $\mathbb{E}[\exp(\lambda X)] \le \exp\!\left( \tfrac{\lambda^2 \kappa^2}{2} \right)$. Let $f : \mathbb{R}^n \mapsto \mathbb{R}$ be a function. We say that $f$ is 1-Lipschitz with respect to the $c$-weighted $\ell_1$ norm if
\[
| f(x_1,\ldots,x_n) - f(y_1,\ldots,y_n) | \;\le\; \sum_{i=1}^n c_i \, |x_i - y_i|,
\]
for some $c \in \mathbb{R}^n_+$. Kontorovich (2014) proved that such functions applied to sub-Gaussian random variables concentrate.
Theorem 3.9.1 (Kontorovich 2014). Let $X = (X_1,\ldots,X_n)$ be a vector of independent random variables such that $X_i - \mathbb{E}[X_i]$ is sub-Gaussian with variance proxy $\kappa_i^2$. Suppose $f(\cdot)$ is 1-Lipschitz with respect to the $c$-weighted $\ell_1$ norm. Then, there exists a universal constant $C$ such that for any $0 < \varepsilon < \tfrac{1}{2}$, with probability at least $1-\varepsilon$,
\[
| f(X) - \mathbb{E}[f(X)] | \;\le\; C \sqrt{\sum_{i=1}^n c_i^2 \kappa_i^2} \cdot \sqrt{\log(2/\varepsilon)}.
\]
Remark 3.9.2. The original result in Kontorovich (2014) is stated with respect to the sub-Gaussian diameters of the $X_i$. Note, however, that if $X_i - \mathbb{E}[X_i]$ is sub-Gaussian with variance proxy $\kappa_i^2$, then $X_i$ has sub-Gaussian diameter at most a constant times $\kappa_i$.
3.10 Appendix: Section 3.3 Proofs
3.10.1 Auxiliary Lemmas
Many of our results leverage that the function $V(z,\rho) = \min_{x \in \mathcal{X}} z^\top x + \rho \sum_{j=1}^n \phi(x_j)$ is Lipschitz.

Lemma 3.10.1 (Lipschitz Bounds on V). Let $V(z,\rho) = \min_{x \in \mathcal{X}} z^\top x + \rho \sum_{j=1}^n \phi(x_j)$ and $\mathcal{X} \subseteq [0,1]^n$. Then, the following holds:
i) $|V(z,\rho) - V(y,\rho)| \le \|z - y\|_1$.
ii) $|V(z,\rho) - V(z,\rho')| \le n \phi_{\max} |\rho - \rho'|$, where $\phi_{\max} = \sup_{x \in [0,1]} |\phi(x)|$.
Proof. Proof of Lemma 3.10.1:
Part i) We first upper-bound the difference. We see
\[
V(z,\rho) - V(y,\rho) = \underbrace{z^\top \hat{x}(z,\rho) + \rho \sum_{j=1}^n \phi(\hat{x}_j(z,\rho)) - z^\top \hat{x}(y,\rho) - \rho \sum_{j=1}^n \phi(\hat{x}_j(y,\rho))}_{\le 0,\ \text{by optimality of } \hat{x}(z,\rho)} + (z - y)^\top \hat{x}(y,\rho)
\le \|z - y\|_1 \|\hat{x}(y,\rho)\|_\infty \le \|z - y\|_1,
\]
where the first inequality holds by Hölder's inequality and the second inequality holds because $\|\hat{x}(y,\rho)\|_\infty \le 1$. By symmetry, this completes the proof of Part i).
Part ii) We apply a similar argument. Letting $\boldsymbol{\phi}(x) = \sum_{j=1}^n \phi(x_j)$,
\[
V(z,\rho) - V(z,\rho') = \underbrace{z^\top \hat{x}(z,\rho) + \rho \boldsymbol{\phi}(\hat{x}(z,\rho)) - z^\top \hat{x}(z,\rho') - \rho \boldsymbol{\phi}(\hat{x}(z,\rho'))}_{\le 0,\ \text{by optimality of } \hat{x}(z,\rho)} + (\rho - \rho') \boldsymbol{\phi}(\hat{x}(z,\rho'))
\le |\rho - \rho'| \sum_{j=1}^n \sup_{x \in [0,1]} |\phi(x)| = n \phi_{\max} |\rho - \rho'|.
\]
By symmetry, we can also bound $V(z,\rho') - V(z,\rho)$, completing the proof.
Since $V(\cdot)$ is 1-Lipschitz relative to $\| \cdot \|_1$, we can bound the expected error introduced by proxying $\xi$ by $\delta_h$ in our proof:

Lemma 3.10.2 (Distribution Replication Bound). The following holds,
\[
\left| \mathbb{E}\left[ V(T(Z) + h\xi) - V(T(Z) + \delta_h) \right] \right| \;\le\; W_1\!\left( T(Z) + h\xi, \; T(Z) + \delta_h \right).
\]
Proof. Proof of Lemma 3.10.2: The result is immediate given Lemma 3.10.1 and the integral probability metric representation of the Wasserstein distance (Wainwright, 2019, pg. 76).
3.10.2 Proof for Theorem 3.3.2
We fix $\theta$ throughout and drop it from the notation. Recall $Z = \xi + \mu$. Let $\bar{\xi}$ be an i.i.d. copy of $\xi$. Let $\xi^k$ denote the vector $\xi$ with element $k$ replaced with $\bar{\xi}_k$. By the Efron-Stein Inequality,
\[
\operatorname{Var}(D(Z)) \;\le\; \frac{1}{2} \sum_{k=1}^n \mathbb{E}\left[ \left( D(\xi + \mu) - D(\xi^k + \mu) \right)^2 \right].
\]
Focusing on the term inside the square, we see
\[
\left| D(\xi + \mu) - D(\xi^k + \mu) \right|
\le \frac{1}{h} \mathbb{E}\left[ \left| V(T(\xi + \mu) + \delta_h) - V(T(\xi^k + \mu) + \delta_h) \right| \,\middle|\, \xi, \xi^k \right] + \frac{1}{h} \left| V(T(\xi + \mu)) - V(T(\xi^k + \mu)) \right|
\le \frac{2}{h} \| T(\xi + \mu) - T(\xi^k + \mu) \|_1
\le \frac{2}{h} C_T \| \xi - \xi^k \|_1
\le \frac{2}{h} C_T \left| \xi_k - \bar{\xi}_k \right|.
\]
The second inequality holds by Lemma 3.10.1 and the third holds by Assumption 3.3.1.
Plugging in the upper bound,
\[
\mathbb{E}\left[ \left( D(\xi + \mu) - D(\xi^k + \mu) \right)^2 \right] \;\le\; \mathbb{E}\left[ \frac{4}{h^2} \left| \xi_k - \bar{\xi}_k \right|^2 C_T^2 \right] \;\le\; \frac{8 C_T^2}{\nu_k h^2}.
\]
Letting $\nu_{\min} = \min_k \nu_k$, we see
\[
\operatorname{Var}(D(Z)) \;\le\; \frac{4 n C_T^2}{\nu_{\min} h^2}.
\]
This completes the proof.
3.10.3 Proof for Theorem 3.3.4
We leverage Theorem 3.9.1 on the function $D(Z)$. Let $Z^k$ be identical to $Z$ except possibly in the $k$th component. Following the same steps as the proof of Theorem 3.3.2, we have
\[
\left| D(Z) - D(Z^k) \right| \;\le\; \frac{2}{h} \left| Z_k - \bar{Z}_k \right| C_T.
\]
We take $c_i = \tfrac{2}{h} C_T$ and $\kappa(Z_i) \le \kappa$. Applying Theorem 3.9.1 yields the result.
3.11 Appendix: Section 3.4 Proofs
3.11.1 Proof for Lemma 3.4.2
Recall from Eq. (3.12) that the estimation error can be decomposed into the following two components:
\[
\underbrace{\sup_{\theta \in \Theta} \left| Z^\top x(Z,\theta) - D(Z,\theta) - \mu^\top x(Z,\theta) \right|}_{\text{Estimation Error}}
\le \underbrace{\sup_{\theta \in \Theta} \left| \xi^\top x(Z,\theta) - \frac{1}{h}\left( V(T(Z,\bar{\theta}) + h\xi, \rho) - V(T(Z,\bar{\theta}), \rho) \right) \right|}_{\text{Finite Difference Error}}
+ \underbrace{\sup_{\theta \in \Theta} \left| \frac{1}{h}\left( V(T(Z,\bar{\theta}) + h\xi, \rho) - V(T(Z,\bar{\theta}), \rho) \right) - D(Z,\theta) \right|}_{\text{Replication Error}}.
\]
We bound the finite difference error by solution stability in the main text (c.f. Eq. (3.13)). Hence, we focus on bounding the replication error.
To bound the replication error, first note that by definition,
\[
\frac{V(T(Z,\bar{\theta}) + h\xi, \rho) - V(T(Z,\bar{\theta}), \rho)}{h} - D(Z,\theta,h)
= \frac{V(T(Z,\bar{\theta}) + h\xi, \rho) - \mathbb{E}\left[ V(T(Z,\bar{\theta}) + \delta_h, \rho) \mid Z \right]}{h}.
\]
Applying the triangle inequality to the right side for a fixed $\theta$,
\[
\text{Replication Error}_\theta
\le \underbrace{\frac{\left| V(T(Z,\bar{\theta}) + h\xi, \rho) - \mathbb{E}\left[ V(T(Z,\bar{\theta}) + h\xi, \rho) \right] \right|}{h}}_{(i)}
+ \underbrace{\frac{\left| \mathbb{E}\left[ V(T(Z,\bar{\theta}) + \delta_h, \rho) \mid Z \right] - \mathbb{E}\left[ V(T(Z,\bar{\theta}) + \delta_h, \rho) \right] \right|}{h}}_{(ii)}
+ \underbrace{\frac{\left| \mathbb{E}\left[ V(T(Z,\bar{\theta}) + \delta_h, \rho) - V(T(Z,\bar{\theta}) + h\xi, \rho) \right] \right|}{h}}_{(iii)}
\]
Bounding Component (i).
We first fix $\theta = (\bar{\theta},\rho)$ and prove a point-wise bound. We will apply Theorem 3.9.1. To that end, define $\xi^k = (\xi_1,\ldots,\bar{\xi}_k,\ldots,\xi_n)^\top$ and $Z^k = (Z_1,\ldots,\bar{Z}_k,\ldots,Z_n)^\top$, where $\bar{\xi}_k$ and $\bar{Z}_k$ are i.i.d. copies of $\xi_k$ and $Z_k$, respectively. Then,
\[
\left| V(T(Z,\bar{\theta}) + h\xi, \rho) - V(T(Z^k,\bar{\theta}) + h\xi^k, \rho) \right|
\le \| T(Z,\bar{\theta}) + h\xi - T(Z^k,\bar{\theta}) - h\xi^k \|_1
\le (C_T(\bar{\theta}) + h) \| \xi - \xi^k \|_1
\le (C_T + h) \left| \xi_k - \bar{\xi}_k \right|,
\]
where $C_T = \max_{\bar{\theta}} C_T(\bar{\theta})$. The first inequality follows from Lemma 3.10.1 and the second inequality follows from Assumption 3.3.1. Since the $\xi_k$ are sub-Gaussian, we can then apply Theorem 3.9.1 with $c_k = C_T + h$ to show the following holds with probability at least $1 - \varepsilon$,
\[
\left| V(T(Z,\bar{\theta}) + h\xi, \rho) - \mathbb{E}\left[ V(T(Z,\bar{\theta}) + h\xi, \rho) \right] \right|
\;\le\; C \cdot (C_T + h)\, \kappa \sqrt{n \log\frac{2}{\varepsilon}}
\tag{3.20}
\]
for a fixed $\theta$ and universal constant $C$.
Now consider the covering $\Theta_0(\varepsilon)$. Letting $\theta_0$ be the element in $\Theta_0(n^{-1/2})$ closest to $\theta$, we have that for any $t_h \in \mathbb{R}^n$,
\begin{align}
\left| V(T(Z,\bar{\theta}) + t_h, \rho) - \mathbb{E}\left[ V(T(Z,\bar{\theta}) + t_h, \rho) \right] \right|
&\le \sup_{\theta \in \Theta_0(n^{-1/2})} \left| V(T(Z,\bar{\theta}) + t_h, \rho) - \mathbb{E}\left[ V(T(Z,\bar{\theta}) + t_h, \rho) \right] \right| \tag{3.21a} \\
&\quad + \left| V(T(Z,\bar{\theta}) + t_h, \rho) - V(T(Z,\bar{\theta}_0) + t_h, \rho_0) \right| \tag{3.21b} \\
&\quad + \left| \mathbb{E}\left[ V(T(Z,\bar{\theta}) + t_h, \rho) - V(T(Z,\bar{\theta}_0) + t_h, \rho_0) \right] \right|. \tag{3.21c}
\end{align}
Equation (3.20) and the union bound show that for $t_h = h\xi$, with probability at least $1 - \varepsilon$,
\[
\text{Eq. (3.21a)} \;\le\; C (C_T + h) \cdot \kappa \sqrt{n \log\left| \Theta_0\!\left( n^{-1/2} \right) \right| \log\frac{2}{\varepsilon}},
\]
where $\left| \Theta_0(n^{-1/2}) \right|$ is the cardinality of $\Theta_0(n^{-1/2})$.
To bound Eq. (3.21b), recall that $\theta = (\bar{\theta},\rho)$ and $\theta_0 = (\bar{\theta}_0,\rho_0)$, so with probability at least $1 - \varepsilon$,
\begin{align*}
\left| V(T(Z,\bar{\theta}) + t_h, \rho) - V(T(Z,\bar{\theta}_0) + t_h, \rho_0) \right|
&\le \left| V(T(Z,\bar{\theta}) + t_h, \rho) - V(T(Z,\bar{\theta}_0) + t_h, \rho) \right| + \left| V(T(Z,\bar{\theta}_0) + t_h, \rho) - V(T(Z,\bar{\theta}_0) + t_h, \rho_0) \right| \\
&\overset{(a)}{\le} \| T(Z,\bar{\theta}) - T(Z,\bar{\theta}_0) \|_1 + n \phi_{\max} |\rho - \rho_0| \\
&\overset{(b)}{\le} n^{-1/2} C \left( \|Z\|_1 + 1 \right) + n^{-1/2} \cdot n \phi_{\max} \\
&\overset{(c)}{\le} \frac{C}{\sqrt{n}} \left( \kappa \sqrt{n \log\frac{1}{\varepsilon}} + n \left( \nu_{\min}^{-1/2} + \|\mu\|_\infty + \phi_{\max} \right) + 1 \right)
\overset{(d)}{\le} C_1 \kappa \sqrt{n \log\frac{1}{\varepsilon}},
\end{align*}
where $\phi_{\max} = \max_{x \in [0,1]} \phi(x)$ and $C_1 = 4C\left( \nu_{\min}^{-1/2} + \|\mu\|_\infty + \phi_{\max} + 1 \right)$. We see (a) holds by the Lipschitz results for $V$ in Lemma 3.10.1, (b) holds by applying Assumption 3.4.1 with $\varepsilon = n^{-1/2}$ and universal constant $C \ge 1$, and (c) holds since $C \ge 1$ and
\[
\|Z\|_1 \le \|\xi\|_1 + \|\mu\|_1 \le \left( \|\xi\|_1 - \mathbb{E}\|\xi\|_1 \right) + \mathbb{E}\|\xi\|_1 + n \|\mu\|_\infty \le \kappa \sqrt{n \log\frac{1}{\varepsilon}} + n \nu_{\min}^{-1/2} + n \|\mu\|_\infty,
\]
where the last inequality applies Theorem 3.9.1 to $\|\xi\|_1 - \mathbb{E}\|\xi\|_1$ with $c_j = 1$ for all $j$ and uses $\mathbb{E}|\xi_j| \le \mathbb{E}\sqrt{\xi_j^2} \le \sqrt{\mathbb{E}\xi_j^2} \le \nu_{\min}^{-1/2}$. For the former, we see $c_j = 1$ since $\|\xi - \xi^k\|_1 = |\xi_k - \bar{\xi}_k|$. Inequality (d) holds for $\varepsilon < 1/e$ and noting $a + b \le 2ab$ for $a \ge 1$, $b \ge 1$.
A similar argument can be applied to bound Eq. (3.21c):
\[
\left| \mathbb{E}\left[ V(T(Z,\bar{\theta}) + t_h, \rho) - V(T(Z,\bar{\theta}_0) + t_h, \rho_0) \right] \right|
\le \mathbb{E}\left[ n^{-1/2} C_\Theta \left( \|Z\|_1 + 1 \right) + n^{-1/2} \cdot n \phi_{\max} \right]
\le C_\Theta n^{-1/2} \left( n \nu_{\min}^{-1/2} + n \|\mu\|_\infty + n \phi_{\max} + 1 \right)
\le C_1 \kappa \sqrt{n}.
\]
Combining Eq. (3.21a), Eq. (3.21b), and Eq. (3.21c), collecting terms, and applying a supremum over $\theta$, we have for some universal constant $C$,
\[
(i) \;\le\; C \, \frac{C_1 (C_T + h) \kappa}{h} \sqrt{n \cdot \log\left| \Theta_0\!\left( n^{-1/2} \right) \right| \cdot \log\frac{2}{\varepsilon}}.
\]
Bounding Component (ii).
We can make an argument similar to (i). By Jensen's inequality,
\begin{align*}
\sup_{\theta \in \Theta} \left| \mathbb{E}\left[ V(T(Z,\bar{\theta}) + \delta_h, \rho) \mid Z \right] - \mathbb{E}\left[ V(T(Z,\bar{\theta}) + \delta_h, \rho) \right] \right|
&= \sup_{\theta \in \Theta} \left| \mathbb{E}\left[ V(T(Z,\bar{\theta}) + \delta_h, \rho) - \mathbb{E}\left[ V(T(Z,\bar{\theta}) + \delta_h, \rho) \,\middle|\, \delta_h \right] \,\middle|\, Z \right] \right| \\
&\le \mathbb{E}\left[ \sup_{\theta \in \Theta} \left| V(T(Z,\bar{\theta}) + \delta_h, \rho) - \mathbb{E}\left[ V(T(Z,\bar{\theta}) + \delta_h, \rho) \,\middle|\, \delta_h \right] \right| \,\middle|\, Z \right].
\end{align*}
We then bound the term inside the expectation. For a fixed $\delta_h$, we apply the bounds in Eq. (3.21) with $t_h = \delta_h$. We only need to adjust the bound on Eq. (3.21a), since it depends on the choice of $t_h$. Note that
\[
\left| \mathbb{E}\left[ V(T(Z,\bar{\theta}) + \delta_h, \rho) \mid Z \right] - \mathbb{E}\left[ V(T(Z^k,\bar{\theta}) + \delta_h, \rho) \mid Z^k \right] \right|
\le \left\| \mathbb{E}\left[ T(Z,\bar{\theta}) + \delta_h - T(Z^k,\bar{\theta}) - \delta_h \,\middle|\, Z, \bar{Z}_k \right] \right\|_1
\le C_T(\bar{\theta}) \| \xi - \xi^k \|_1
\le C_T \left| \xi_k - \bar{\xi}_k \right|.
\]
Thus, for $t_h = \delta_h$ and applying Theorem 3.9.1 with $c_k = C_T$, we have
\[
\text{Eq. (3.21a)} \;\le\; C \, C_T \cdot \kappa \sqrt{n \log\left| \Theta_0\!\left( n^{-1/2} \right) \right| \log\frac{2}{\varepsilon}}.
\]
Since the bounds for Eq. (3.21b) and Eq. (3.21c) do not depend on $t_h$, we have
\[
(ii) \;\le\; C \, \frac{C_1 C_T \kappa}{h} \sqrt{n \cdot \log\left| \Theta_0\!\left( n^{-1/2} \right) \right| \cdot \log\frac{2}{\varepsilon}}.
\]
Bounding Component (iii).
Component (iii) is bounded by Lemma 3.10.2.
Combining (i), (ii), and (iii) proves the result.
3.11.2 Proof for Proposition 3.4.3
First observe that since $h > 0$ and $L(\bar{\theta})\Sigma + \Sigma L(\bar{\theta})^\top$ is positive semidefinite, the matrix $h\left( L(\bar{\theta})\Sigma + \Sigma L(\bar{\theta})^\top \right) + h^2 \Sigma$ is also positive semidefinite and $\delta_h$ is well-defined.
Since the sum of two multivariate Gaussians is also a multivariate Gaussian, we see that both $T(Z,\bar{\theta}) + h\xi$ and $T(Z,\bar{\theta}) + \delta_h$ are distributed as $\mathcal{N}\left( \mu, (L + hI)\Sigma(L + hI)^\top \right)$. Thus, their 1-Wasserstein distance is 0.
3.12 Appendix: Regularized Policy Proofs
3.12.1 Proof for Lemma 3.5.1
Let $h_\rho(x,t) \equiv \sum_{j=1}^n t_j x_j + \rho \phi(x_j)$. Since $h_\rho(x,t)$ is strongly convex over $\mathcal{X}$, and by the optimality of $\hat{x}(t,\rho)$ and $\hat{x}(y,\rho)$, we see
\begin{align*}
h_\rho(\hat{x}(t,\rho), y) - h_\rho(\hat{x}(y,\rho), y) &\ge \frac{\rho}{2} \|\hat{x}(t,\rho) - \hat{x}(y,\rho)\|^2, \\
h_\rho(\hat{x}(y,\rho), t) - h_\rho(\hat{x}(t,\rho), t) &\ge \frac{\rho}{2} \|\hat{x}(t,\rho) - \hat{x}(y,\rho)\|^2.
\end{align*}
Adding the two inequalities, we see
\begin{align*}
\rho \|\hat{x}(t,\rho) - \hat{x}(y,\rho)\|^2
&\le h_\rho(\hat{x}(t,\rho), y) - h_\rho(\hat{x}(y,\rho), y) + h_\rho(\hat{x}(y,\rho), t) - h_\rho(\hat{x}(t,\rho), t) \\
&= \sum_{j=1}^n y_j\left( \hat{x}_j(t,\rho) - \hat{x}_j(y,\rho) \right) + t_j\left( \hat{x}_j(y,\rho) - \hat{x}_j(t,\rho) \right)
= \sum_{j=1}^n \left( y_j - t_j \right)\left( \hat{x}_j(t,\rho) - \hat{x}_j(y,\rho) \right) \\
&\le \|y - t\|_* \, \|\hat{x}(t,\rho) - \hat{x}(y,\rho)\|,
\end{align*}
where the last line holds by Hölder's inequality. Rearranging gives the intended result.
3.12.2 Proof for Theorem 3.5.2
By Lemma 3.4.2, we can write
\begin{align*}
\underbrace{\sup_{\theta \in \Theta} \left| Z^\top x(Z,\theta) - D(Z,\theta,h) - \mu^\top x(Z,\theta) \right|}_{\text{Estimation Error}}
&\le \underbrace{C \, \frac{(C_T + h)\kappa}{h} \sqrt{n \cdot \log n \cdot \log\left| \Theta_0\!\left( \frac{C_T}{\sqrt{n}} \right) \right| \cdot \log\frac{2}{\varepsilon}}
+ \frac{1}{h} \sup_{\theta \in \Theta} W_1\!\left( T(Z,\bar{\theta}) + h\xi, \, T(Z,\bar{\theta}) + \delta_h \right)}_{\text{Replication Error}} \\
&\quad + \underbrace{\sup_{\theta \in \Theta} \left| \xi^\top \left( \hat{x}(T(Z,\bar{\theta}), \rho) - \hat{x}(T(Z,\bar{\theta}) + h\xi, \rho) \right) \right|}_{\text{Solution Stability}}.
\end{align*}
To bound the solution stability, we first observe by Lemma 3.5.1 that
\[
\left| \xi^\top \left( \hat{x}(T(Z,\bar{\theta}), \rho) - \hat{x}(T(Z,\bar{\theta}) + h\xi, \rho) \right) \right|
\le \|\xi\|_* \, \|\hat{x}(T(Z,\bar{\theta}), \rho) - \hat{x}(T(Z,\bar{\theta}) + h\xi, \rho)\|
\le \frac{h}{\rho} \|\xi\|_*^2.
\]
Plugging in the solution stability bound proves the result.
3.12.2.1 Proofs of Corollaries
Proof. Proof of Corollary 3.5.3. It suffices to bound $\|\xi\|_\infty^2$. Since $\xi$ is mean-zero, $\|\xi\|_\infty = \|\xi - \mathbb{E}[\xi]\|_\infty$. Furthermore, since $\xi_j - \mathbb{E}[\xi_j]$ is sub-Gaussian with variance proxy $\kappa^2$, we have that with probability at least $1 - \varepsilon$, there exists an absolute constant $C_1$ such that
\[
\left| \xi_j - \mathbb{E}[\xi_j] \right| \le C_1 \kappa \sqrt{\log(n/\varepsilon)} \le C_1 \kappa \sqrt{\log n \log(1/\varepsilon)},
\]
simultaneously for all $j$. Hence, with probability at least $1 - \varepsilon$,
\[
\|\xi\|_\infty^2 \le C_1^2 \kappa^2 \log n \log(1/\varepsilon).
\]
Substituting into Theorem 3.5.2 and collecting constants proves the corollary.

Proof. Proof of Corollary 3.5.4. It suffices to bound $\|\xi\|_2^2 = \sum_{j=1}^n \xi_j^2$. Since $\xi_j$ is sub-Gaussian with parameter $\kappa$, Vershynin (2018, Lemma 2.7.6) shows that there exists an absolute constant $C_1$ such that $\|\xi_j^2\|_{\Psi_1} \le C_1 \kappa^2$. (Recall, for any random variable $Y$, $\|Y\|_{\Psi_1} \equiv \inf\{ t > 0 : \mathbb{E}[\exp(|Y|/t)] \le 2 \}$, c.f. Vershynin (2018, pg. 31).) Hence, by the triangle inequality, $\|\sum_{j=1}^n \xi_j^2\|_{\Psi_1} \le C_1 n \kappa^2$. Finally, by Markov's Inequality, with probability at least $1 - \varepsilon$, $\sum_{j=1}^n \xi_j^2 \le C_2 n \kappa^2 \log(1/\varepsilon)$ for some absolute constant $C_2$. Plugging this upper bound into Theorem 3.5.2 and collecting constants proves the result.
3.13 Appendix: Proofs and Notes for Block Decoupled Problems
3.13.1 What are typical values of Tmax and Xmax?
In this subsection, we revisit some examples from Section 3.2.3 to provide explicit bounds on the constants $T_{\max}$ and $X_{\max}$.

Example 3.13.1 (Clustering Revisited). Recall our clustering-based policy class from Example 3.2.7. Let $x^{\mathrm{CL}}(Z,\bar{\theta})$ be the resulting policy when clustering into $\bar{\theta}$ clusters. We form a policy class by allowing the number of clusters to vary between $\bar{\theta}_{\min}$ and $\bar{\theta}_{\max}$.
Then, $T_{\max}$ corresponds to the size of the largest cluster. Hence, if $\bar{\theta}_{\min} = O(n)$, i.e., we only want to average over "local" neighborhoods, then $T_{\max} = O(1)$. Furthermore, the number of possible policies $X_{\max}$ is bounded by the number of possible clusterings $\bar{\theta}_{\max} - \bar{\theta}_{\min}$, which is in turn at most $n$. Hence, in the typical case where $S_{\max} = 1$ and we constrain $\bar{\theta}_{\min} = O(n)$, Corollary 3.6.7 bounds the error by a term of size $\tilde{O}_p(n^{3/4})$.
In the previous example, $\Theta$ is finite, since we can have at most $n$ clusters. We next provide an example where $\Theta$ is not finite, but the induced set of plug-in policies is finite.

Example 3.13.2 (Kernel Smoothers Revisited). Consider the kernel regression policy from Example 3.2.4 with the box-kernel with bandwidth $\bar{\theta}$, i.e., $K_{\bar{\theta}}(W_j, W_k) = \mathbb{I}\{ \|W_j - W_k\| \le \bar{\theta} \}$. Let $x^{\mathrm{KR}}(Z,\bar{\theta})$ be the corresponding plug-in policy. Consider the policy class $\mathcal{X}^{\mathrm{KR}}(Z) = \{ x^{\mathrm{KR}}(Z,\bar{\theta}) : 0 \le \bar{\theta} \le \bar{\theta}_{\max} \}$.
Then,
\[
T_{\max} = \max_j \sum_{i=1}^n \max_{\bar{\theta} \in [0, \bar{\theta}_{\max}]} \mathbb{I}\left\{ K_{\bar{\theta}}(W_j, W_i) \neq 0 \right\}
= \max_j \sum_{i=1}^n \mathbb{I}\left\{ \|W_j - W_i\| \le \bar{\theta}_{\max} \right\},
\]
which decreases as $\bar{\theta}_{\max}$ decreases. In other words, for small enough $\bar{\theta}_{\max}$, i.e., "local" neighborhoods, $T_{\max}$ can be made $O(1)$.
That said, $[0, \bar{\theta}_{\max}]$ is an infinite set. However, the set of kernel weights induced by $\bar{\theta} \in [0, \bar{\theta}_{\max}]$ is finite. Namely, partition the real number line at the $O(n^2)$ points $\|W_i - W_j\|$. Then, for any $\theta_1, \theta_2$ in the same interval, $K_{\theta_1}(W_i, W_j) = K_{\theta_2}(W_i, W_j)$. Letting $\mathcal{T}^{\mathrm{KR}}_\Theta(Z) = \{ T^{\mathrm{KR}}(Z,\theta) : \theta \in \Theta \}$, we see $\left| \mathcal{T}^{\mathrm{KR}}_\Theta(Z) \right| \le n(n+1)/2 + 1$ by counting the number of intervals. We can then bound $X_{\max}$ as follows,
\[
\left| \mathcal{X}^{\mathrm{KR}}_\Theta(Z) \right| = \left| \left\{ \hat{x}(t) : t \in \mathcal{T}^{\mathrm{KR}}_\Theta(Z) \right\} \right| \le \left| \mathcal{T}^{\mathrm{KR}}_\Theta(Z) \right| \le \frac{n(n+1)}{2} + 1.
\]
In settings where neither $\Theta$ nor $\mathcal{T}_\Theta(Z)$ are finite, Gupta, Huang, and P. Rusmevichientong (2022) describe how to bound $X_{\max}$ under a general assumption on the plug-in class. Roughly, they show that $|\mathcal{X}_\Theta(Z)| = O\left( n^{\dim(\Theta)} \right)$ for separable affine plug-in classes; in words, they show $X_{\max}$ is polynomial in $n$ if the dimension of $\theta \in \Theta$ is fixed. Similar techniques can be applied to bound the affine plug-ins considered in our paper. We omit the details.
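To see the interval-counting argument from Example 3.13.2 concretely, here is a small illustrative sketch (ours, not from the paper's code) that enumerates the $O(n^2)$ breakpoints $\|W_i - W_j\|$ and counts how many distinct box-kernel weight patterns a bandwidth sweep can induce; the helper name is hypothetical.

```python
import numpy as np

def count_kernel_configurations(W, theta_max):
    """Count distinct box-kernel weight patterns over theta in [0, theta_max].
    The pattern only changes when theta crosses a pairwise distance ||W_i - W_j||."""
    dists = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=-1)
    breakpoints = np.unique(dists[dists <= theta_max])
    # One configuration per interval between consecutive breakpoints, plus one beyond.
    return len(breakpoints) + 1
```

Since there are at most $n(n+1)/2$ distinct pairwise distances, this count is at most $n(n+1)/2 + 1$, consistent with the bound on $|\mathcal{T}^{\mathrm{KR}}_\Theta(Z)|$ above.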
3.13.2 Proof for Bounding Solution Stability, Lemma 3.6.5
We bound the solution stability by decomposing it into the following components:
\begin{align}
\sup_{\theta \in \Theta} \left| \xi^\top \left( \hat{x}\!\left( T(Z,\bar{\theta}) \right) - \hat{x}\!\left( T(Z,\bar{\theta}) + h\xi \right) \right) \right|
&\le \sup_{\theta \in \Theta} \left| \mathbb{E}\left[ \xi^\top \left( \hat{x}\!\left( T(Z,\bar{\theta}) \right) - \hat{x}\!\left( T(Z,\bar{\theta}) + h\xi \right) \right) \right] \right| \tag{3.22a} \\
&\quad + 2 \sup_{\substack{\theta \in \Theta \\ h' \in \{0,h\}}} \left| \xi^\top \hat{x}\!\left( T(Z,\bar{\theta}) + h'\xi \right) - \mathbb{E}\left[ \xi^\top \hat{x}\!\left( T(Z,\bar{\theta}) + h'\xi \right) \right] \right|, \tag{3.22b}
\end{align}
which follows by adding and subtracting the expectations $\mathbb{E}\left[ \xi^\top \hat{x}\!\left( T(Z,\bar{\theta}) + h'\xi \right) \right]$ for $h' \in \{0,h\}$ and applying the triangle inequality. We bound Eq. (3.22a) in Lemma 3.13.3 and bound Eq. (3.22b) by invoking Lemma 3.13.6 with $\kappa = 1/\sqrt{\nu_{\min}}$ and using the crude bound on the chromatic number given after the theorem. Combining the two bounds gives us our bound on the solution stability.
3.13.3 Bound on Eq. (3.22a)
Lemma 3.13.3 (Expected Solution Stability). Suppose Assumptions 2.2.4, 3.6.2 and 3.6.3 hold. Then, for any $0 \le h \le 1$ with $h S_{\max} \le 1$, there exists a constant $C$ (depending on $\nu_{\min}$, $\nu_{\max}$, $\sigma_{\min}$) such that
\[
\sup_{\theta \in \Theta} \left| \mathbb{E}\left[ \xi^\top \left( \hat{x}\!\left( T(Z,\bar{\theta}) \right) - \hat{x}\!\left( T(Z,\bar{\theta}) + h\xi \right) \right) \right] \right| \;\le\; C \, S_{\max} \, n \, h \sqrt{\log\frac{1}{h}}.
\]
Proof. Proof for Lemma 3.13.3. Fix ¯θ and suppress it in the notation. Rewrite the argument of
the supremum as a sum over blocks Sk
:
K
∑
k=1
E
ξ
k
⊤
xˆ
k
T (Z)
−xˆ
k
T (Z) +hξ
=
K
∑
k=1
E
ξ
k
⊤
xˆ
k
T
k
(Z)
−xˆ
k
T
k
(Z) +hξ
k
=
K
∑
k=1
E
E
ξ
k
⊤
xˆ
k
T
k
(Z)
−xˆ
k
T
k
(Z) +hξ
k
Z
k
where Z
k ∈ R
n−|Sk
|
correspond to the components of Z not in Sk
. The first equality holds since the
k
th block only depends on the objective cost components in the k
th block and the second by law of
iterated expectations.
We now focus on bounding the conditional expectation
E
ξ
k
⊤
xˆ
k
T
k
(Z)
−xˆ
k
T
k
(Z) +hξ
k
Z
k
(3.23)
for a fixed k. We first note
T
k
(Z) = e
k
◦ (LZ +l) = L
kZ
k +L
k Z
k +l
k = L
k
ξ
k +L
kµ
k +L
k Z
k +l
k
where e
k
is a vector where component j is 1 if j ∈ Sk and 0 otherwise. The matrix L
k ∈ R
|Sk
|×|Sk
|
is the square submatrix induced by the rows and columns in Sk
. (Recall, this matrix is invertible
by Assumption 3.6.3.) The matrix L
k ∈ R
|Sk
|×(n−|Sk
|)
is the matrix with rows corresponding Sk and
columns corresponding to S
c
k
. We then fix a k and Z
k
, and temporarily suppress k in the notati
We define y0 = Lξ , yh = Lξ +hξ and g(y) = xˆ(y +Lµ +L Z +l) ∈ R
|Sk
|
. Then, the conditional
expectation Eq. (3.23) is
E
h
L
−1
y0
⊤
g(y0
)
i
−E
h
(L +hI)
−1
yh
⊤
g(yh
)
i
≤
E
hL
−1 −(L +hI)
−1
yh
⊤
g(yh
)
i
| {z }
Term (i)
+
E
h
L
−1
y0
⊤
g(y0
)
i
−E
h
L
−1
yh
⊤
g(yh
)
i
| {z }
Term (ii)
Bounding Term (i)
Since ∥g(y)∥∞ ≤ 1, we have that
hL
−1 −(L +hI)
−1
yh
⊤
g(yh
)
i
≤ ∥
L
−1 −(L +hI)
−1
yh∥1 = h∥L
−1
ξ ∥1
≤ h
p
|Sk
|∥L
−1
ξ ∥2 ≤
h
p
|Sk
|
σmin
∥ξ ∥2
where last inequality holds by definition of the operator norm on the matrix L
−1
.
Hence, we have
E
hL
−1 −(L +hI)
−1
yh
⊤
g(yh
)
i
≤
h
p
|Sk
|
σmin
E[∥ξ ∥2] ≤
h|Sk
|
σmin√
Bounding Term (ii)
To bound the second term, we would like to appeal to a total variation argument, however the terms
in the expectation are not bounded. We thus consider truncating the terms at U (a constant we will
determine later).
E
h
L
−1
y0
⊤
g(y0
)
i
−E
h
L
−1
yh
⊤
g(yh
)
i
≤
E
h
L
−1
yh
⊤
g(yh
)I
∥L
−1
yh∥1 > U
i
(3.24a)
+
E
h
L
−1
y0
⊤
g(y0
)I
∥L
−1
y0∥1 > U
i
(3.24b)
+
E
h
L
−1
y0
⊤
g(y0
)I
∥L
−1
y0∥1 ≤ U
i
−E
h
L
−1
yh
⊤
g(yh
)I
∥L
−1
yh∥1 ≤ U
i
(3.24c)
We first bound Eq. (3.24a). By Holder’s inequality, we have
E
h
L
−1
yh
⊤
g(yh
)I
∥L
−1
yh∥1 > U
i
≤ E
L
−1
yh
⊤
g(yh
)
2
1/2
P
∥L
−1
yh∥1 > U
1/2
(3.25)
after observing E
h
I
∥L
−1
yh∥1 > U
2
i1/2
= P
∥L
−1
yh∥1 > U
1/2
.
To bound Eq. (3.25), we first bound the term inside the expectation. Applying Holder’s inequality again shows
L
−1
yh
⊤
g(yh
)
2
≤ ∥L
−1
yh∥
2
1∥g(yh
)∥
2
∞ ≤ ∥L
−1
yh∥
2
where the second holds since ∥g(yh
)∥∞ ≤ 1. For 0 ≤ h ≤ 1 and σmin ≤ 1, the last term can be
further bounded as follows,
∥L
−1
yh∥
2
1 = ∥L
−1
(L +hI)ξ ∥
2
1
≤ 2∥ξ ∥
2
1 +2h
2
∥L
−1
ξ ∥
2
1
(Triangle Inequality and (a+b)
2 ≤ 2a
2 +2b
2
)
≤ 2|Sk
|∥ξ ∥
2
2 +2|Sk
|h
2
∥L
−1
ξ ∥
2
2
≤ 2|Sk
|∥ξ ∥
2
2 +
2|Sk
|h
2
σ
2
min
∥ξ ∥
2
2
≤
4|Sk
|
σ
2
min
∥ξ ∥
2
2
, (since σmin ≤ 1,h ≤ 1).
Applying this bound to the the probability term as well, we have
P
∥L
−1
yh∥1 > U
1/2
≤ P
(
2
p
|Sk
|
σmin
∥ξ ∥2 > U
)1/2
= P
(
∥ξ ∥2 >
σminU
2
p
|Sk
|
)1/2
.
We can then bound the probability as follows,
P
(
∥ξ ∥2 >
σminU
2
p
|Sk
|
)
= P
∥ξ ∥
2
2 >
σ
2
minU
2
4|Sk
|
≤ 2exp
−σ
2
minνminU
2
C1|Sk
|
2
for some absolute constantC1. The inequality holds by first noting ∥ξ ∥2−E[∥ξ ∥2] is sub-Gaussian
with variance proxy O(
p
2|Sk
|/νmin) by Gupta and P. Rusmevichientong (2021b, Lemma A.1).
Then, via Vershynin (2018, Lemma 2.7.6), this implies ∥ξ ∥
2
2
is sub-exponential satisfying
P
∥ξ ∥
2
2 > t
≤ 2exp
−tνmin
C1|Sk
|
for some absolute constant C1.
Plugging in our bounds to Eq. (3.25), we have
Eq. (3.24a) ≤ E
4|Sk
|
σ
2
min
∥ξ ∥
2
2
1/2
P
(
∥ξ ∥2 >
σminU
2
p
|Sk
|
)1/2
≤
2
√
2|Sk
|
σmin√
νmin
exp
−σ
2
minνminU
2
C2|Sk
|
2
1/2
,
for some absolute constant C2.
To bound Eq. (3.24b), we can repeat the same argument but take h = 0 throughout. Consequently,
Eq. (3.24b) ≤
2|Sk
|
σmin√
νmin
exp
−σ
2
minνminU
2
C2|Sk
|
2
1/2
.
Combining the two bounds, we have for σmin ≤ 1 and 0 ≤ h ≤ 1,
Eq. (3.24a)+Eq. (3.24b) ≤
5|Sk
|
σmin√
νmin
exp
−σ
2
minνminU
2
C2|Sk
|
2
1/2
.
Finally, we bound Eq. (3.24c) using a Total Variation argument. We first note the function
y 7→
L
−1
y
⊤
g(y) is bounded by U for all y as
L
−1
y
⊤
g(y)I
∥L
−1
yh∥1 ≤ U
≤ I
∥L
−1
yh∥1 ≤ U
∥L
−1
y∥1∥g(y)∥∞ ≤ U.
Thus, we have
Eq. (3.24c) ≤ U ·TV(y0
, yh
),
where TV(y0
, yh
) is the total variation distance between y0
and yh
. By Lemma 3.13.4 (c.f. Remark 3.13.5, this term is at most U ·C3|Sk
|h for the constant C3 described in the lemma.
Putting it all together in Eq. (3.24), we have
Term (ii) ≤ UC4|Sk
|h+C4|Sk
| exp
−U
2
C4|Sk
|
2
1/2
.
for some constant C4 (depending on νmin, νmax, σmin). Choosing U =C4 |Sk
|
p
logh−2 and collecting constants shows
Term (ii) ≤ C5
p
log(1/h)|Sk
|
2
h+C5 |Sk
|h ≤ C5 |Sk
|
2
h
p
log(1/h),,
for some C5 (depending on νmin, νmax, σmin).
Combining Term (i) and Term (ii) and summing across k, we have
K
∑
k=1
C6h|Sk
|+C6|Sk
|
2
h
s
log
1
h
≤ 2C6Smaxnhs
log
1
h
,
for a constant C6. This completes the proof.
The next lemma bounds the total variation distance between two random variables of interest. When we invoke the lemma, the role of $L$ will be played by $L^k$. We phrase the lemma without this $k$ notation for simplicity.

Lemma 3.13.4 (Bound on TV distance). Assume $L^{-1}$ exists for $L \in \mathbb{R}^{d \times d}$. Let
\[
\xi \sim \mathcal{N}(0, \Sigma), \quad \Sigma = \operatorname{diag}\left( (1/\nu_j)_{j=1}^d \right), \quad y_0 = L\xi, \quad y_h = (L + hI)\xi.
\]
Suppose the matrix $L\Sigma + \Sigma L^\top$ is positive semi-definite. Then, for $0 \le h \le 1$, there exists a constant $C$ (depending on $\sigma_{\min}$, $\nu_{\min}$, $\nu_{\max}$) such that
\[
\mathrm{TV}(y_0, y_h) \le C \, d \, h,
\]
where $\nu_{\min} = \min_i \nu_i$, $\nu_{\max} = \max_i \nu_i$, and $\sigma_{\min} = \min\{1, \sigma_d(L)\}$ bounds the smallest singular value of $L$.
Proof. Proof for Lemma 3.13.4. Let
Σ0 = LΣL
⊤, Σh = (L +hI)Σ(L +hI)
⊤,
and λi be the eigvenvalues of Σ
−1
0
Σh − I. Since y0
and yh
are multivariate Gaussians, Devroye,
Mehrabian, and Reddad (2018, Theorem 1.1) proves that
TV(y0
, yh
) ≤
vuut
d
∑
i=1
λ
2
i
.
To bound the right side, we will first argue that λi ≥ 0 for all i. Write Σ
−1
0
Σh −I = Σ
−1
0
(Σh −Σ0).
Notice Σh −Σ0 = h(LΣ +ΣL
⊤) +h
2Σ is the sum of two positive semidefinite matrices, and hence
positive semidefinite. It is also symmetric. On the other hand, Σ
−1
0
is symmetric and positive
semidefinite. Finally, the product of two positive semidefinite, symmetric matrices is positive
semidefinite. This proves that λi ≥ 0 for each i.
We next bound the summation inside the square root by first observing,
0 ≤
d
∑
i=1
λ
2
i ≤
d
∑
i=1
λi
!2
(since λi ≥ 0)
= tr
Σ
−1
0
(Σh −Σ0)
2
.
Since Σ
−1
0
(Σh −Σ0) = Σ
−1
0
(hLΣ +hΣL
⊤ +h
2Σ), we have
tr
Σ
−1
0
(Σh −Σ0)
= tr
hΣ
−1
0
LΣ
+tr
hΣ
−1
0
ΣL
⊤
+tr
h
2Σ
−1
0
Σ
= h ·tr
(L
⊤)
−1Σ
−1L
−1LΣ
+h ·tr
(L
⊤)
−1Σ
−1L
−1ΣL
⊤
+h
2
·tr
(L
⊤)
−1Σ
−1L
−1Σ
= h ·tr
(L
⊤)
−1
+h ·tr
L
−1
+h
2
·tr
(L
⊤)
−1Σ
−1L
−1Σ
where the first equality holds since the trace is a linear mapping, the second equality holds by
plugging in the definition of Σ0, the third equality holds by applying the cyclic property of the
trace.
Next, note that multiplying by Σ or Σ
−1
scales the diagonals by at most 1/νmin or νmax, respectively. Hence,
tr
(L
⊤)
−1Σ
−1L
−1Σ
≤
tr
(L
⊤)
−1Σ
−1L
−1
νmin
=
tr
L
−1
(L
⊤)
−1Σ
−1
νmin
≤
νmax
νmin
tr
(L
⊤L)
−1
.
Thus,
tr
Σ
−1
0
(Σh −Σ0)
≤ 2h ·
tr
L
−1
+
h
2νmax
νmin
·tr
(L
⊤L)
−1
.
We can further bound
tr
(L
⊤L)
−1
=
d
∑
i=1
1
σ
2
i
≤
d
σ
2
min
since the trace is the sum of the squared singular values of L
−1 which are inverse of the singular
values of L.
Lastly, we bound
tr(L
−1
)
. Letting λ
L
j
be the eigenvalues of L, we see
tr(L
−1
)
=
d
∑
j=1
1
λ
L
j
≤
d
σmin
where λ
L
min = minj λ|λ
L
j
|. The first equality holds by definition of trace and noting the eigenvalues
of L
−1
are the inverses the eigenvalues of L. The last inequality holds since σmin ≤
λ
L
j
for all j.
In summary, we have shown
TV(y0
, yh
) ≤
2hd
σmin
+
h
2dνmax
νminσ
2
min
≤ dh
2
σmin
+
2νmax
νminσ
2
min
This completes the proof.
Remark 3.13.5 (Applying Lemma 3.13.4). We observe that under Assumptions 2.2.4, 3.6.2 and 3.6.3, Lemma 3.13.4 applies to the random variables induced by $S_k$, i.e.,
\[
\mathrm{TV}\left( L^k \xi^k, \, (L^k + hI)\xi^k \right) \le C |S_k| h.
\]
To see this, we simply verify the assumptions of the lemma. The matrix $L^k$ is invertible by Assumption 3.6.3. By Assumption 2.2.4, the matrix $L\Sigma + \Sigma L^\top \in \mathbb{R}^{n \times n}$ is positive semidefinite. We claim that this implies that $L^k \Sigma^k + \Sigma^k (L^k)^\top \in \mathbb{R}^{|S_k| \times |S_k|}$ is also positive semidefinite, where $\Sigma^k$ is the diagonal covariance matrix of $\xi^k$. To see this, observe that the $ij$th component of $L\Sigma$ can be written as $L_{ij}\sigma_j$ since $\Sigma$ is a diagonal matrix. Consequently, $L^k \Sigma^k$ is a principal submatrix of $L\Sigma$. Similarly, we see $\Sigma^k (L^k)^\top$ is a principal submatrix of $\Sigma L^\top$. Thus, $L^k \Sigma^k + \Sigma^k (L^k)^\top$ is a principal submatrix of $L\Sigma + \Sigma L^\top$. Finally, the principal submatrices of a positive semidefinite matrix are also positive semidefinite. Thus, under Assumptions 2.2.4, 3.6.2 and 3.6.3, we can apply the lemma to the subcomponents pertaining to $S_k$.
3.13.4 Bound on Eq. (3.22b)
To bound Eq. (3.22b) for block decoupled problems (i.e., Problem (3.14) with $m = 0$ and $|\mathcal{X}^0| = 0$), the primary challenge is that for $k \neq \bar{k}$, even though $\xi^k$ and $\xi^{\bar{k}}$ are independent, it may still be that $\hat{x}^k(T(Z,\bar{\theta}))$ and $\hat{x}^{\bar{k}}(T(Z,\bar{\theta}))$ are correlated, since $L(\bar{\theta})$ may smooth across subproblems. Hence, we cannot treat $\sum_{k=1}^K \xi^{k\top} \hat{x}^k(T(Z,\bar{\theta})) - \mathbb{E}\left[ \xi^{k\top} \hat{x}^k(T(Z,\bar{\theta})) \right]$ as a sum of independent random variables.
To circumvent this issue, we will identify groups of subproblems such that subproblems in the same group are independent. We construct these groups by looking at the coloring of a particular graph. We will then sum over subproblems of a particular color (which will be independent by construction), and use a union bound over the colors.
To that end, we introduce the block graph $G(V, E)$. The vertex set $V = \{0, \ldots, K\}$ corresponds to our disjoint partition $\{S_0, \ldots, S_K\}$. An edge $e(k, \bar{k})$ exists between vertices $k$ and $\bar{k}$ if there exist $i \in S_k$, $j \in S_{\bar{k}}$, $l \in \{1, \ldots, n\}$, and $\bar{\theta} \in \bar{\Theta}$ such that $L_{il}(\bar{\theta}) \neq 0$ and $L_{jl}(\bar{\theta}) \neq 0$. In words, this means that $T_i(Z,\bar{\theta})$ and $T_j(Z,\bar{\theta})$ both depend on $\xi_l$, despite $i$ and $j$ belonging to different subproblems.
Let $\chi(G)$ be the chromatic number of $G(V, E)$ and fix a vertex coloring of the graph. We claim that if $k \neq \bar{k}$ have the same color, then $\xi^{k\top} \hat{x}^k(T(Z,\theta))$ and $\xi^{\bar{k}\top} \hat{x}^{\bar{k}}(T(Z,\theta))$ must be independent. Indeed, suppose this was not the case. Then, there must exist an $i \in S_k$ and a $j \in S_{\bar{k}}$ such that $T_i(Z,\bar{\theta})$ and $T_j(Z,\bar{\theta})$ both depend on some component $\xi_l$, which implies the edge $e(k, \bar{k})$ is present, and hence $k$ and $\bar{k}$ cannot have the same color. This is a contradiction.
Let $R_1, \ldots, R_{\chi(G)}$ be the partition of $V$ corresponding to the different colors. Using this partition, we prove the following result under the weaker Assumption 3.3.3 rather than the stronger Assumption 3.6.4 assumed in Theorem 3.6.6.
Lemma 3.13.6 (Convergence of Solution Stability to Expectation). Suppose Assumption 3.3.3 holds. Then, there exists an absolute constant $C$ such that with probability at least $1 - \varepsilon$,
\[
\sup_{\substack{\bar{\theta} \in \bar{\Theta} \\ h' \in \{0,h\}}} \left| \xi^\top \hat{x}\!\left( T(Z,\bar{\theta}) + h'\xi \right) - \mathbb{E}\left[ \xi^\top \hat{x}\!\left( T(Z,\bar{\theta}) + h'\xi \right) \right] \right|
\le C \kappa S_{\max} \sqrt{\chi(G) \, n \, \log(X_{\max}) \log(\chi(G)) \log(1/\varepsilon)}.
\]
Proof. Proof for Lemma 3.13.6. We first decompose Eq. (3.22b) into the sets determined by the
vertex coloring as follows,
sup
¯θ∈Θ¯
h
′∈{0,h}
ξ
⊤
xˆ
T (Z,
¯θ ) +h
′
ξ
−E
h
ξ
⊤
xˆ
T (Z,
¯θ ) +h
′
ξ
i
≤
χ(G)
∑
l=1
sup
¯θ∈Θ¯
h
′∈{0,h}
∑
k∈Rl
(ξ
k
)
⊤xˆ
k
T (Z,
¯θ ) +h
′
ξ
−E
h
(ξ
k
)
⊤xˆ
k
T (Z,
¯θ ) +h
′
ξ
i
.
We bound each term of the outer summation using Theorem A.1 of Gupta, Huang, and P. Rusmevichientong (2022) since fk(Z,
¯θ ,h
′
) = (ξ
k
)
⊤xˆ
k
T (Z,
¯θ ) +h
′ξ
for k ∈ Rl are independent.
Fixing an l, we take
F(Z) =
∥ξ
k
∥1
k∈Rl
∈ R
|Rl
|
We can bound the Orlicz norm as follows,
∥∥F(Z)∥2∥Ψ ≤
p
Smax∥(ξ
k
)k∈Rl
∥2
Ψ
≤
p
Smaxκ
p
2Smax|Rl
| = Smaxκ
p
2|Rl
where the first inequality holds by noting ∥ξ
k
∥1 ≤
√
Smax∥ξ
k
∥2 and the second inequality holds
by Gupta and P. Rusmevichientong (2021b, Lemma A.1).
Note that
(ξ
k
)
⊤xˆ
k
T (Z,θ ) +h
′
ξ
k∈Rl
:
¯θ ∈ Θ¯ ,h
′ ∈ {0,h}
≤ 2Xmax.
Now apply Gupta and P. Rusmevichientong (2021b, Theorem A.1) to show that there exists an
absolute constant C such that probability at least 1−ε,
sup
¯θ∈Θ¯
h
′∈{0,h}
∑
k∈Rl
(ξ
k
)
⊤xˆ
k
T (Z,
¯θ ) +h
′
ξ
−E
h
(ξ
k
)
⊤xˆ
k
T (Z,
¯θ ) +h
′
ξ
i
≤ CSmaxκ
p
|Rl
|log(Xmax)log
1
ε
.
To complete the proof, we take a union bound over l allowing us to show that with probability
1−ε
sup
¯θ∈Θ¯
h
′∈{0,h}
ξ
⊤
xˆ
T (Z,
¯θ ) +h
′
ξ
−E
h
ξ
⊤
xˆ
T (Z,
¯θ ) +h
′
ξ
i
≤
χ(G)
∑
l=1
CSmaxκ
p
|Rl
|log(Xmax)log
χ(G)
ε
.
≤ Cχ(G)Smaxκ
s
K
χ(G)
log(Xmax)log
χ(G)
ε
≤ CκSmaxp
χ(G)nlog(Xmax)log
χ(G)
ε
where the first inequality is the union bound, and the second inequality uses ∑
χ(G)
l=1
|Rl
| = K and
Jensen’s inequality. Collecting constants proves the lemma.
In the main body, we use a slightly looser bound by further bounding $\chi(G)$. Specifically, by Brooks' Theorem, $\chi(G)$ is at most the maximal degree of $G(V, E)$ plus one. Let $T_{\max} = \max_j \sum_{i=1}^n \sup_{\bar{\theta} \in \bar{\Theta}} \mathbb{I}\{ L_{ij}(\bar{\theta}) \neq 0 \}$. In words, $T_{\max}$ is the maximal number of non-zero elements in a column of $L(\bar{\theta})$.
We claim that the maximal degree of $G(V, E)$ is at most $S_{\max} T_{\max}$. To see this, consider $S_1$. For each $k \in S_1$, $T_k(Z,\bar{\theta})$ can depend on at most $T_{\max}$ different components of $\xi$. In the worst case, these components each belong to different subproblems, creating at most $T_{\max}$ outgoing edges from node 1. This is true for each $k \in S_1$, so $S_1$ has an outgoing degree of at most $|S_1| T_{\max}$. This is true for each subproblem, yielding the upper bound $S_{\max} T_{\max}$.
For simplicity of exposition, we use this upper bound in the main text, but the chromatic number bound above is often substantively tighter.
3.13.5 Proof for Block Decoupled Estimation Error, Theorem 3.6.6
We apply Lemma 3.4.2 and use Lemma 3.6.5 to bound the solution stability term. We complete the bound by obtaining $C_T$ for the affine plug-in class. We see
\[
\left\| T(Z,\bar{\theta}) - T(Y,\bar{\theta}) \right\|_1
= \left\| L(\bar{\theta})(Z - Y) \right\|_1
\le \sum_{j=1}^n \left\| L_j(\bar{\theta})(Z_j - Y_j) \right\|_1
\le \max_j \left\| L_j(\bar{\theta}) \right\|_1 \|Z - Y\|_1
\le T_{\max} L_{\max} \|Z - Y\|_1,
\]
so $C_T = T_{\max} L_{\max}$.
3.14 Appendix: Weakly-Coupled Policy Proofs
Proof. Proof for Theorem 3.6.11 We can bound the estimation error for weakly-coupled problems
as follows,
sup
θ∈Θ
ξ
⊤
x(Z,θ )−D
WC(Z,θ ,h)
≤ sup
θ∈Θ
ξ
⊤
x(Z,θ )−x˜(Z,λ (Z,
¯θ , x
0
(Z,θ )), x
0
(Z,θ ),
¯θ )
+ sup
θ∈Θ
ξ
⊤
x˜(Z,λ (Z,
¯θ , x
0
(Z,θ )), x
0
(Z,θ ),
¯θ )−D˜(Z,λ (Z,
¯θ , x
0
(Z,θ )), x
0
(Z,θ ),
¯θ ,h)
≤ sup
θ∈Θ
∥ξ ∥∞
x(Z,θ )−x˜(Z,λ (Z,
¯θ , x
0
(Z,θ )), x
0
(Z,θ ),
¯θ )
1
| {z }
(a)
+ sup
θ∈Θ, x
0∈X 0
ξ
⊤
x˜(Z,λ (Z,
¯θ , x
0
), x
0
,
¯θ )−D˜(Z,λ (Z,
¯θ , x
0
), x
0
,
¯θ ,h)
| {z }
(b)
The first inequality adds and subtracts our approximate policy. The last inequality gives us the key
terms we must bound.
To bound (a), we apply Lemma 3.14.1, showing
sup
θ∈Θ
∥ξ ∥∞
x(Z,θ )−x˜(Z,λ
0
(Z,θ , x
0
(Z,θ )), x
0
(Z,θ ),
¯θ )
1
≤ ∥ξ ∥∞
·m· Smax
Since ξ is a vector of independent sub-Gaussian random variables, with probability at least 1−ε,
∥ξ ∥∞ ≤ C1
s
logn
νmin
·log
1
ε
where C1 is a universal constant (Wainwright, 2019) . Thus, with probability at least 1−ε,
(a) ≤ C1 ·m· Smaxs
logn
νmin
·log
1
ε
.
To bound (b), we first observe by Lemma 3.14.2 ∥λ
0
(Z,
¯θ , x
0
)∥1 ≤
2
s¯
∥T (Z,
¯θ )∥1, for every
¯θ , x
0
. Moreover,
∥T (Z,
¯θ )∥1 ≤ ∥L(
¯θ )∥1∥Z∥1 +∥l(
¯θ )∥1 ≤ ∥µ∥1 +∥ξ ∥1 +1 ≤ Cµ(n+1) +n · ∥ξ ∥∞.
Plugging in our bound on ∥ξ ∥∞, we see with probability at least 1−ε
∥T (Z,
¯θ )∥1 ≤ 2C1Cµn
s
logn
νmin
·log
1
ε
.
Letting ε = e
−n
, we see that for λmax = 4C1Cµs¯
−1ν
−1/2
min n
2
that P
∥λ (Z,θ , x
0
(Z,θ , x
0
))∥1 ≥ λmax
<
e
−n
.
Let Λ =
λ ∈ R
m
+ : ∥λ ∥1 ≤ λmax
. Then, when λ (Z,θ , x
0
) ∈ Λ,
(b) = sup
θ∈Θ, x
0∈X 0
ξ
⊤
x˜(Z,λ
0
(Z,θ , x
0
), x
0
,θ )−D˜(Z,λ
0
(Z,θ , x
0
), x
0
,θ ,h)
≤ sup
x
0∈X 0
sup
θ∈Θ
λ ∈Λ
ξ
⊤
x˜(Z,λ , x0,θ )−D˜(Z,λ , x0,
¯θ ,h)
We complete the bound on (b) by first fixing x
0
in the last line and then applying Corollary 3.6.7
to the block decoupled setting induced by the lifted affine policy class to show
sup
(
¯θ ,λ )∈Θ˜
ξ
⊤
x˜(Z,λ , x0,θ )−D˜(Z,λ , x0,
¯θ ,h)
≤ C2S
3/4
maxn
3/4
logn
q
Tmax logXmax log
Θ˜
0
n
−1/2
·log
SmaxTmax
ε
,
where C2 is the constant from the corollary and (
¯θ ,λ ) ∈ Θ˜ ≡ Θ × Λ. Note that the new plug-in
class T (Z,
¯θ )+λ
⊤
A
0
(x
0
) indexed by Θ˜ is still an affine policy class and satisfies Assumption 3.3.1
with the same Lipschitz constant as the original plug-in class.
We can further bound
Θ˜
0(n
−1/2
)
with respect to the covering Θ0(ε). Letting Λ0(ε) be the
ε-covering of Λ0, we see
Θ˜
0(2ε)
(i)
≤ |Θ0 (ε)| ·
Λ0
ε
∥A
0
∥∞n
(ii)
≤ |Θ0(ε)| ·
3λmax∥A
0
∥∞n
ε
!m
.
The inequality (i) follows because Θ0 (ε)×Λ0
ε
∥A
0
∥∞n
is a valid covering that satisfies Assumption 3.4.1 since for each (
¯θ ,λ ) ∈ Θ˜ there exists (
¯θ 0,λ 0) ∈ Θ˜
0(n
−1/2
) we have
∥T (Z,
¯θ ) +λ
⊤
A
0
(x
0
)−
T (Z,
¯θ 0) +λ
⊤
0 A
0
(x
0
)
∥1 ≤ ∥T (Z,
¯θ )−T (Z,
¯θ 0)∥1 +∥(λ −λ 0)
⊤
A
0
(x
0
)∥1
≤ C(∥Z∥1 +1) ε +
n
∑
j=1
(λ −λ 0)
⊤
A
0
j
(x
0
)
≤ C(∥Z∥1 +1) ε +n∥λ −λ 0∥1
∥A
0
∥∞
≤ C(∥Z∥1 +1) ε +ε
≤ 2C(∥Z∥1 +1) ε.
Inequality (ii) applies Pollard (1990, Lemma 4.1).
We can further simplify our bound for n > 2 by
log
Θ˜
0(2n
−1/2
)
≤ log
Θ0(n
−1/2
)
+log
3λmax∥A
0
∥∞n
1/2
m
≤
log
Θ0(n
−1/2
)
mlog
3λmax∥A
0
∥∞n
1/2
≤ m
log
Θ0(n
−1/2
)
log
3λmax∥A
0
∥∞
+logn
1/2
≤ mlogn ·
log
Θ0(n
−1/2
)
log
3λmax∥A
0
∥∞
+1
.
Redefining 2n
−1/2 → n
−1/2
shows
log
Θ˜
0(n
−1/2
)
≤ C3mlogn ·
log
Θ0(n
−1/2
/2)
for some constant C3 depending on log(λmax) and log(∥A
0
∥∞).
Taking a union bound over x
0 ∈ X 0
, and collecting terms, we have
(b) ≤ sup
x
0∈X 0
sup
θ∈Θ
λ ∈Λ
ξ
⊤
x˜(Z,λ , x0,θ )−D˜(Z,λ , x0,
¯θ ,h)
≤ C4S
3/4
maxn
3/4
log3/2
n
q
mTmax logXmax log
Θ0
n
−1/2/2
·log
SmaxTmax
X 0
ε
!
where C4 depends on logλmax, log∥A
0
∥∞, νmin, νmax, σmin, ∥µ∥∞, Lmax.
Combining bounds on (a) and (b) completes the proof.
Lemma 3.14.1 (Error Approximating Weakly-Coupled-by-Constraints Policies). Assume Assumption 3.6.9 holds. Then,
\[
\left\| x(Z,\theta) - \tilde{x}\!\left( Z, \lambda(Z,\theta, x^0(Z,\theta)), x^0(Z,\theta), \theta \right) \right\|_1 \le m \cdot S_{\max}, \quad \text{a.s.}
\]
Proof. Proof for Lemma 3.14.1 To streamline the proof, we fix a θ ∈ Θ and Z, dropping them
from the notation. We let x¯ and λ represent the optimal primal solution x(Z,θ ) and optimal dual
solution λ (Z,θ , x
0
(Z,θ )), respectively.
Our goal is to bound
∥x¯ −x˜
λ , x¯
0
∥1 =
K
∑
k=0
∥x¯
k −x˜
k
λ , x¯
0
∥1.
Focusing on the k
th term, recall x˜
k
¯λ , x¯
0
solves the optimization problem,
min
x
k≥0
∑
j∈Sk
(Tj +λ
⊤
A
0
j
(x¯
0
))x j
, s.t. x
k ∈ X k
(x¯
0
). (3.26)
We claim that if Problem (3.26) has a unique solution, then x¯
k = x˜
k
(λ , x¯
0
). To prove the claim,
we show first show x¯
k
is the optimal solution of Problem (3.26). Let
g(λ ) = −
b(x¯
0
), λ
+
K
∑
k=1
min
x
k∈X k
D
T
k +λ
⊤
A
0
Sk
(x¯
0
), x
k
E
be the dual problem objective solved by λ . Using strong duality of Problem (3.16) with a fixed x
0
,
we can show
K
∑
k=1
D
T
k
, x¯
k
E
= g(λ ), by strong duality,
= −
D
b(x¯
0
), λ
E
+
K
∑
k=1
min
x
k∈X k
D
T
k +λ
⊤
A
0
Sk
(x¯
0
), x
k
E
(∗)
≤ −D
b(x¯
0
), λ
E
+
K
∑
k=1
D
T
k +λ
⊤
A
0
Sk
(x¯
0
), x¯
k
E
, since x¯
k ∈ X k
(x¯
0
),
=
D
A
0
(x¯
0
)x¯ −b(x¯
0
), λ
E
| {z }
≤0, since x¯∈X WC and λ ≥0
+
K
∑
k=1
D
T
k
, x¯
k
E
≤
K
∑
k=1
D
T
k
, x¯
k
E
.
Here we have shown that equality holds throughout. Since inequality (∗) is an equality, we see that
x¯
k minimizes Problem (3.26) for each k. Since we claim Problem (3.16) has a unique solution, we
have x¯
k = x˜
k
λ , x¯
0
.
Thus, to bound ∥x¯ −x˜
λ , x¯
0
∥1, we count the number of sub-problems that do not have unique
solutions. Let K be the set of k where x˜
k
λ , x¯
0
is not a unique solution. We see k ∈ K if:
1. The subproblem cost vector is 0, or formally,
Tj
A
0
j
0
⊤
1
λ
0
= 0, ∀ j ∈ Sk
2. The subproblem cost vector points in a direction corresponding to a unit vector in Dk
(x¯
0
),
or formally, there exists d ∈ Dk
(x¯
0
) and α
k > 0 such that
Tj
A
0
j
−dj
⊤
1
λ
α
k
= 0, ∀ j ∈ Sk
We let K1 define the set of k that satisfy the first condition.
Combining all the equalities across k ∈ K , j ∈ Sk
, we see they each represent a linear relation
among the vectors of F in Assumption 3.6.9. By the assumption, the maximal number of equalities
we can have among the vectors is at most the number of variables:
m
|{z}
of λ
+ |K |−|K1|
| {z }
of αk
for k∈K \K1
+ 1.
Hence
m+|K |−|K1|+1 ≥ ∑
k∈K
|Sk
| = ∑
k∈K \K1
|Sk
|+ ∑
k∈K1
|Sk
|.
Next, we claim that for any k ∈ K \ K1, we must have |Sk
| ≥ 2. Indeed, if |Sk
| = 1, then X k
is
an interval, and the only way to have multiple solutions is if the subproblem cost vector is 0, i.e.,
k ∈ K1, a contradiction. As a result, we can further lower bound this last quantity by
m+|K |−|K1|+1 ≥ 2(|K | −|K1|) +|K1|.
Rearranging, proves |K | ≤ m+1.
Finally, for k ∈ K , ∥x¯
k −x˜
k
λ , x¯
0
∥1 ≤ |Sk
|. Hence, ∥x¯ −x˜
λ , x¯
0
∥1 ≤ m· Smax. This completes the proof.
Lemma 3.14.2 (Dual Variables Bounded by Plug-in). Under Assumption 3.6.10, we have for all $\theta \in \{0\} \times \bar{\Theta}$, $z \in \mathbb{R}^n$, and $x^0 \in \mathcal{X}^0$ that
\[
\left\| \lambda(z, \bar{\theta}, x^0) \right\|_1 \le \frac{2}{\bar{s}} \left\| T(z, \bar{\theta}) \right\|_1.
\]
Proof. Proof of Lemma 3.14.2 Fix z,θ , x
0
. To reduce notation, let x˜ = x˜
z,λ
z,
¯θ , x
0
, x
0
,
¯θ
and
L (x,λ ) = T (z,
¯θ )
⊤x +λ
⊤
A
0
(x
0
)x −b
0
(x
0
)
,
By optimality,
L
x˜,λ
z,
¯θ , x
0
≥ L (x˜,0) ≥ −∥T (z,
¯θ )∥1
where the last inequality holds since decision variables are bounded between 0 and 1.
Since λ (z,
¯θ , x
0
) ≥ 0, we see ∥λ (z,
¯θ , x
0
)∥1 = e
⊤λ (z,
¯θ , x
0
). Thus,
∥λ (z,
¯θ , x
0
)∥1 ≤ max
λ ≥0
e
⊤λ
s.t. min
x∈X˜
L (x,λ ) ≥ −∥T (Z,
¯θ )∥1,
where X˜ =
x : x
k ∈ X k
(x
0
), ∀k = 1,...,K
. We upper bound the optimization problem by relaxing our one constraint with penalty 1/s¯> 0 to show,
∥λ (z,
¯θ , x
0
)∥1
≤ max
λ ≥0
e
⊤λ +
1
s¯
min
x∈X˜
L (x,λ ) +∥T (z,
¯θ )∥1
= max
λ ≥0
e
⊤λ +
1
s¯
min
x∈X˜
T (z,
¯θ )
⊤x +λ
⊤
A
0
(x
0
)x −b
0
(x
0
)
+∥T (z,
¯θ )∥1
≤ max
λ ≥0
e
⊤λ +
1
s¯
T (z,
¯θ )
⊤x¯ +λ
⊤
A
0
(x
0
)x¯ −b
0
(x
0
)
+∥T (z,
¯θ )∥1
,
= max
λ ≥0
1
s¯
T (z,
¯θ )
⊤x¯ +λ
⊤
A
0
(x
0
)x¯ +se¯ −b
0
(x
0
)
+∥T (z,
¯θ )∥1
where x¯ ∈ X WC satisfies our strict feasibility assumption (Assumption 3.6.10) for x
0
. The first
inequality is our upper-bound from relaxing the one constraint. The first equality expands L (x,λ ).
The second inequality holds since x¯ ∈ X WC ⊆ X˜. The second equality collects terms.
Since A
0
(x
0
)x¯ +se¯ −b
0
(x
0
) ≤ 0 by the strict feasibility assumption,
∥λ (z,
¯θ , x
0
)∥1 ≤
1
s¯
T (z,
¯θ )
⊤x¯ +∥T (z,
¯θ )∥1
≤
2
s¯
∥T (z,
¯θ )∥1
Note the bound holds for any choice of z,θ , x
0
. This completes the proof.
3.15 Appendix: Details for Numerics and Examples
3.15.1 Details for Fig. 3.1
The example focuses on the full-information problem
\[
\max_{x \in \mathcal{X}} \sum_{k=1}^K \sum_{j=1}^3 \mu_{kj} x_{kj},
\quad \text{where} \quad
\mathcal{X} = \left\{ x : \sum_{j=1}^3 x_{kj} = 1 \;\forall k; \; x_{kj} \in \{0,1\} \;\forall k, j \right\}.
\]
We observe $Z_{kj} \sim \mathcal{N}\left( \mu_{kj}, 1/\nu_{kj} \right)$ where $\mu_{1j} = 0$, $\mu_{2j} = 0.5$, $\mu_{3j} = 1$ and $\nu_{1j} = 1$, $\nu_{2j} = 5$, $\nu_{3j} = 3$ for every $j$.
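For concreteness, one way to simulate this toy instance is sketched below (our own illustrative code, not taken from the dissertation's repository). It draws the noisy costs $Z_{kj}$ and solves the decoupled assignment by picking the largest entry in each row; we assume three subproblems here since only three rows of means and precisions are specified above.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([[0.0] * 3, [0.5] * 3, [1.0] * 3])     # mu_{kj}
nu = np.array([[1.0] * 3, [5.0] * 3, [3.0] * 3])     # precisions nu_{kj}

Z = rng.normal(mu, 1.0 / np.sqrt(nu))                 # Z_{kj} ~ N(mu_{kj}, 1/nu_{kj})
plug_in_choice = Z.argmax(axis=1)                     # plug-in policy: best item per subproblem
full_info_value = mu.max(axis=1).sum()                # full-information optimum
```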
3.15.2 Implementation Details for Numerics
The primary challenge of implementing the one-shot VGC is choosing a distribution for $\delta_h$ for each policy and evaluating the expectation
\[
\mathbb{E}\left[ V(T(Z,\bar{\theta}) + \delta_h, \rho) \mid Z \right].
\]
In Section 3.7, for a fixed $L$, we choose $\delta_h \sim \mathcal{N}\left( 0, \hat{\Sigma}^h \right)$ and evaluate the expectation by Monte Carlo simulation with 25 samples of $\delta_h$. To construct $\hat{\Sigma}^h$, consider the matrix $\Sigma^h = h\left( L\Sigma + \Sigma L^\top \right) + h^2 \Sigma$. If $\Sigma^h$ is positive semi-definite, then $\hat{\Sigma}^h = \Sigma^h$. If $\Sigma^h$ is not positive semi-definite, then we form the eigenvalue decomposition $\Sigma^h = Q^h \Lambda^h Q^{h\top}$ and construct
\[
\hat{\Sigma}^h = \rho \, Q^h \Lambda^h_+ Q^{h\top}, \quad \text{where} \quad
\Lambda^h_+ = \operatorname{diag}\left\{ \max(0, \Lambda^h_{11}), \ldots, \max(0, \Lambda^h_{nn}) \right\}
\quad \text{and} \quad
\rho = \frac{\operatorname{tr}\left( \Lambda^h \right)}{\operatorname{tr}\left( \Lambda^h_+ \right)}.
\]
This method of "correcting" a non-positive-definite matrix was also used in G. Chan and Wood (1997) when simulating stationary Gaussian random fields.
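A sketch of this correction in code (ours, for illustration; variable names are placeholders) might look as follows.

```python
import numpy as np

def corrected_covariance(L, Sigma, h):
    """Construct Sigma^h = h(L Sigma + Sigma L^T) + h^2 Sigma and, if needed,
    project it to be positive semi-definite by zeroing negative eigenvalues
    and rescaling by rho = tr(Lambda) / tr(Lambda_+), as described above."""
    Sigma_h = h * (L @ Sigma + Sigma @ L.T) + h**2 * Sigma
    eigvals, Q = np.linalg.eigh(Sigma_h)          # symmetric eigendecomposition
    if np.all(eigvals >= 0):
        return Sigma_h
    eigvals_plus = np.maximum(eigvals, 0.0)
    rho = eigvals.sum() / eigvals_plus.sum()
    return rho * (Q * eigvals_plus) @ Q.T

# Sampling delta_h, e.g.:
# delta_h = rng.multivariate_normal(np.zeros(n), corrected_covariance(L, Sigma, h), size=25)
```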
Chapter 4
Learning Best-in-Class Policies in the Predict-then-Optimize
Framework
This chapter represents some ongoing work extending our ideas from the small-data, large-scale
optimization setting to show how they can be adapted to a contextual linear optimization setting to
improve upon existing methods.
Decision-aware learning is an area of data-driven optimization that combines prediction and
optimization into a single meta-optimization problem. In decision-aware learning, many works
have studied linear and integer programming with uncertain objective costs and proposed methods
broadly categorized as end-to-end predict-then-optimize. These methods often require optimizing
a non-convex, discontinuous loss function. Many existing works utilize computationally inefficient
grid search or propose computationally tractable surrogate loss functions. Among the existing
surrogate loss functions, many have no theoretical guarantees. Those with theoretical guarantees
often require prediction models that are well specified. To address these challenges, we propose a
novel finite-step difference inspired surrogate loss function that is differentiable almost everywhere
and Lipschitz. These properties allow us to prove theoretical guarantees on our surrogate loss
function with relatively mild assumptions while being computationally tractable.
4.1 Introduction
Decision-aware learning broadly categorizes many recent works that introduce methods that combine prediction and decision-making into a single, unified learning task. These approaches have
been shown to empirically outperform traditional predict-then-optimize methods that fit a prediction model and solve the decision-making problem in independent stages. Decision-aware approaches focus on directly optimizing the decision-loss rather than prediction-loss and thus can
perform better when the optimal prediction loss is large, such as in misspecified settings.
While decision-aware methods in these settings produce higher-quality decisions, a variety of theoretical and computational challenges make these methods difficult to analyze and implement. The primary source of these challenges is optimizing the decision loss, which is often non-convex, discontinuous, and has non-informative gradients. To address this challenge, most works construct differentiable and sometimes convex surrogate losses that can be optimized via gradient methods. These range from general losses, like SPO+ in Adam N Elmachtoub and Grigas (2022), which applies a convex relaxation to the decision loss, to specialized losses, such as in Wilder, Dilkina, and Tambe (2019), which applies a continuous relaxation to the objective of discrete optimization problems. Unfortunately, many of these losses lack theoretical guarantees, particularly in settings where the prediction model is misspecified.
To address these challenges, we propose a novel surrogate loss with informative gradients and
prove it has attractive theoretical properties and guarantees for decision-aware learning.
4.1.1 Setting
We focus on the contextual linear optimization setting with uncertain cost coefficients. Define
\[
\pi^*(x) \in \mathcal{Z}^*(x) = \operatorname*{argmin}_{z \in \mathcal{Z}} f^*(x)^\top z, \qquad f^*(x) = \mathbb{E}[Y \mid X = x],
\tag{4.1}
\]
where $\mathcal{Z}$ is a potentially non-convex set of decisions. The random variable $X \in \mathbb{R}^p$ represents contextual features and the random variable $Y \in \mathbb{R}^d$ represents random cost coefficients. The policy $\pi^*(x)$ maps the context $x$ to a decision that optimizes the respective conditional expected costs.
In our setting, we do not observe the distribution of $(X, Y)$ nor have knowledge of $f^*$. Instead, we have data $\mathcal{D} = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, which are $n$ independent draws of $(X, Y)$, and we use the data to identify a policy with low regret, which we define to be
\[
\operatorname{Regret}(\pi) = \mathbb{E}_{\mathcal{D}}\left[ \mathbb{E}_X\left[ f^*(X)^\top \left( \pi(X) - \pi^*(X) \right) \right] \right],
\]
for a policy $\pi$.
We specifically restrict our search to the class of plug-in policies $\Pi_{\mathcal{F}} = \{ \pi_f : f \in \mathcal{F} \}$, where
\[
\pi_f(x) \in \operatorname*{argmin}_{z \in \mathcal{Z}} f(x)^\top z.
\tag{4.2}
\]
These policies plug in $f(x)$ in place of $f^*(x)$ and solve the optimization problem. When Eq. (4.2) admits multiple optima, we tiebreak via a deterministic rule such as using lexicographic order. The target problem we would like to solve to minimize regret is thus
\[
f^{\mathrm{OR}} \in \operatorname*{argmin}_{f \in \mathcal{F}} \mathbb{E}_{\mathcal{D}}\left[ \mathbb{E}_X\left[ f^*(X)^\top \pi_f(X) \right] \right] = \mathbb{E}_{\mathcal{D}}\left[ Y^\top \pi_f(X) \right].
\tag{4.3}
\]
We drop the $\pi^*$ term since it is constant in the optimization problem.
We highlight that our approach encompasses misspecified settings where $f^* \notin \mathcal{F}$. In these settings, $f \in \mathcal{F}$ can be interpreted as scores used for decision rules rather than predictions of $f^*$. This deviates from other decision-aware settings such as Hu, Kallus, and Mao (2022), Adam N Elmachtoub and Grigas (2022), and Liu and Grigas (2021), which look to recover $f^*$ or require $f^* \in \mathcal{F}$ for their theoretical results to hold. Throughout the paper and for our results, we assume the decision space is bounded, $\max_{z \in \mathcal{Z}} \|z\| \le B$, and that the cost coefficients are bounded without loss of generality, $Y \in \mathcal{Y} = \{ y : \|y\| \le 1 \}$.
4.1.2 Related Work
Many works have studied how to computationally optimize the decision loss, particularly for linear
optimization, by proposing surrogate losses that have informative gradients. Wilder, Dilkina, and
Tambe (2019) construct surrogate losses for linear optimization problems via smoothing through
regularizers or randomization. Adam N Elmachtoub and Grigas (2022) utilize duality to construct
a closed-form, convex surrogate loss called SPO+ that has attractive theoretical guarantees if the
hypothesis class of prediction models is well specified (Liu and Grigas, 2021). We provide similar
types of theoretical guarantees for our surrogate loss.
Similar to our work is Vlastelica et al. (2019) which also highlights the issue of zero almost
everywhere gradients and approximates gradients by interpolating between discontinuities of the
decision-loss in a similar fashion to our approach. However, their approach is primarily used to estimate gradients, and thus they do not prove any finite-sample results related to a surrogate loss. Our approach instead draws inspiration from the Variance Gradient Correction from Gupta, Huang, and P.
Rusmevichientong (2022) which leverages insights from Danskin’s theorem to estimate in-sample
bias by differentiating through a maximum. Since key terms in the derivative are unknown they
approximate a directional derivative via a finite step difference and adding noise. More broadly the
finite step difference approach has been used more recently to estimate influence functions (Guo,
Jordan, and Zhou, 2022; Jordan, Y. Wang, and Zhou, 2022). We approximate decision-loss using
a finite step difference as well, but use it because the terms in the approximation have informative
gradients.
4.2 Differentiable Surrogate Loss
To identify the best plug-in policy in $\Pi_{\mathcal{F}}$ in the misspecified setting, we would like to solve the empirical risk minimization (ERM) problem
\[
\hat{f}^{\,\mathrm{ERM}} \in \operatorname*{argmin}_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^n Y_i^\top \hat{\pi}(f(X_i)) = \frac{1}{n} \sum_{i=1}^n l(f(X_i), Y_i),
\tag{4.4}
\]
where $\hat{\pi}(t) \in \operatorname*{argmin}_{z \in \mathcal{Z}} t^\top z$ so that $\pi_f(x) = \hat{\pi}(f(x))$ and $l(f(x), y) = y^\top \hat{\pi}(f(x))$. This loss is an unbiased estimate of the objective of the target problem (Eq. (4.3)), that is,
\[
\mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n l(f(X_i), Y_i) \right] = \mathbb{E}\left[ l(f(X), Y) \right].
\]
Unfortunately, Eq. (4.4) is, both practically and theoretically, a poor representation of Eq. (4.3) when $\mathcal{Z}$ is polyhedral or discrete. Practically, $l(t, y)$ as a function of $t$ is non-convex, discontinuous, and has gradients that are zero almost everywhere. In other words, we cannot use out-of-the-box gradient-based approaches to optimize the ERM objective due to $l$. Theoretically, the gradients of the ERM problem are not a good approximation of the gradients for Eq. (4.3). While it can be shown that $\nabla_f \frac{1}{n}\sum_{i=1}^n l(f(X_i), Y_i) \approx \mathbb{E}\left[ \nabla_f l(f(X), Y) \right]$, we see that
\[
\mathbb{E}\left[ \nabla_f l(f(X), Y) \right] \neq \nabla_f \mathbb{E}\left[ l(f(X), Y) \right],
\]
as the gradient and expectation cannot be interchanged. For example, in the setting where $t \in \mathbb{R}$, the interchange fails since the partial derivative of $l(t, y)$ with respect to $t$ is unbounded for some $t$, as it is the derivative of a step function. In some sense, this theoretical observation is why the gradient of the ERM loss is non-informative.
Instead, we propose a surrogate loss that closely approximates $l(t, y)$ using a function with informative gradients. The key insight comes from Danskin's theorem, which shows for $\lambda \in \mathbb{R}$,
\[
\frac{\partial}{\partial \lambda} \min_{z \in \mathcal{Z}} (t + \lambda Y)^\top z \Big|_{\lambda = 0}
= \frac{\partial}{\partial \lambda} (t + \lambda Y)^\top \hat{\pi}(t + \lambda Y) \Big|_{\lambda = 0}
= Y^\top \hat{\pi}(t).
\]
To that end, we first define
\[
V(t) = \min_{z \in \mathcal{Z}} t^\top z = t^\top \hat{\pi}(t),
\]
which is the plug-in policy objective. We then fix some small user-defined $h > 0$ and propose the following function to approximate $l(t, y)$,
\[
\hat{l}_h(t, y) = \frac{V(t + hy) - V(t - hy)}{2h}.
\tag{4.5}
\]
The function is a finite-step approximation of $\frac{\partial}{\partial \lambda} V(t + \lambda y)\big|_{\lambda=0} = l(t, y)$ with step size $h$. Using the approximation, we replace the ERM loss and instead propose solving the following,
\[
\hat{f}^{\,\mathrm{SL}} \in \operatorname*{argmin}_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^n \hat{l}_h(f(X_i), Y_i).
\tag{4.6}
\]
The surrogate loss objective has informative gradients. Applying Danskin's theorem again, we see $\hat{l}_h(t, y)$ has the following gradient with respect to $t$,
\[
\nabla_t \hat{l}_h(t, y) = \frac{\hat{\pi}(t + hy) - \hat{\pi}(t - hy)}{2h}.
\tag{4.7}
\]
Note the gradient is non-zero if $\hat{\pi}(t + hy) \neq \hat{\pi}(t - hy)$. The gradient is more informative in a practical sense compared to that of $l(t, y)$, which is zero almost everywhere.
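As an illustration, the surrogate loss and its gradient can be computed with two calls to any linear-optimization oracle. The sketch below is a minimal example, not the chapter's implementation; it assumes a user-supplied `solve_lp(t)` that returns $\hat{\pi}(t) \in \operatorname{argmin}_{z\in\mathcal{Z}} t^\top z$, instantiated here for a toy problem where $\mathcal{Z}$ consists of the standard basis vectors (pick one of $d$ items).

```python
import numpy as np

def solve_lp(t):
    """Toy oracle: Z = {e_1, ..., e_d}, so argmin_z t^T z picks the cheapest coordinate."""
    z = np.zeros_like(t)
    z[np.argmin(t)] = 1.0
    return z

def surrogate_loss_and_grad(t, y, h):
    """Finite-difference surrogate l_h(t, y) = (V(t + h y) - V(t - h y)) / (2h)
    and its Danskin-style gradient (pi(t + h y) - pi(t - h y)) / (2h)."""
    z_plus, z_minus = solve_lp(t + h * y), solve_lp(t - h * y)
    loss = ((t + h * y) @ z_plus - (t - h * y) @ z_minus) / (2 * h)
    grad = (z_plus - z_minus) / (2 * h)
    return loss, grad
```

In a training loop, `grad` would be propagated through $f(X_i)$ by the chain rule to update the prediction model's parameters.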
In practice, we can utilize randomized smoothing to further smooth our surrogate loss and make the gradients non-zero almost everywhere with minimal impact on the approximation error. Letting $\varepsilon \sim \mathcal{N}\left( 0, \sigma^2 I_d \right)$, we can show
\[
l(t, y) = \mathbb{E}\left[ (y + \varepsilon)^\top \hat{\pi}(t) \right] = \mathbb{E}\left[ l(t, y + \varepsilon) \right]
\]
and approximate $l(t, y + \varepsilon)$ with $\hat{l}_h(t, y + \varepsilon)$. The resulting gradient is
\[
\nabla_t \mathbb{E}\left[ \hat{l}_h(t, y + \varepsilon) \right]
= \mathbb{E}\left[ \nabla_t \hat{l}_h(t, y + \varepsilon) \right]
= \mathbb{E}\left[ \frac{\hat{\pi}(t + h(y + \varepsilon)) - \hat{\pi}(t - h(y + \varepsilon))}{2h} \right].
\]
With the expectation, the gradients are non-zero almost everywhere.
4.2.1 Properties of Differentiable Surrogate Loss
We highlight a variety of attractive properties of our differentiable surrogate loss before proceeding to our main results, which bound the generalization error of $\pi^{\mathrm{DSL}}$.
The first property we highlight is that our differentiable surrogate loss function is Lipschitz relative to the plug-in.

Lemma 4.2.1 (Lipschitz with Respect to Plug-in). The following holds for the surrogate loss function $\hat{l}_h$ in the contextual linear optimization setting,
\[
\left| \hat{l}_h(t, y) - \hat{l}_h(t', y) \right| \le \frac{B}{h} \left\| t - t' \right\|.
\]
We later use this property to leverage vector contraction inequalities to obtain bounds on the convergence of our differentiable surrogate loss to its expectation that do not depend on the optimization problem.
The second property we highlight is that the differentiable surrogate loss is bounded relative to $Y$. We highlight this property as the function $V(t)$ itself is unbounded if $t$ is unbounded.

Lemma 4.2.2 (Surrogate Loss Function is Bounded). The following holds for the surrogate loss function $\hat{l}_h$ in the contextual linear optimization setting,
\[
\left| \hat{l}_h(t, y) \right| \le B \|y\|.
\]
This property allows us to leverage many standard Rademacher complexity results out of the box, which drastically simplifies our proofs.
Lastly, we show that the sample surrogate loss gradient is an unbiased estimate of the expected surrogate loss gradient.

Lemma 4.2.3 (Unbiased Gradients). Let $\nabla \hat{l}_h(t, y)$ represent the gradient with respect to $t$. Then,
\[
\nabla \mathbb{E}\left[ \hat{l}_h(f(X), Y) \right] = \mathbb{E}\left[ \nabla \hat{l}_h(f(X), Y) \right].
\]
4.3 Generalization Results for Surrogate Loss
We show that learning with our surrogate loss can yield a plug-in policy with best-in-class regret, or formally, a policy whose regret is comparable to that of the policy
\[
\pi^{\mathrm{OR}} \in \operatorname*{argmin}_{\pi \in \Pi_{\mathcal{F}}} \operatorname{Regret}(\pi).
\]
We achieve this by proving a stronger generalization result, which shows our surrogate loss closely approximates the regret loss $\mathbb{E}_{\mathcal{D}}\left[ Y^\top \pi_f(X) \right]$ in Eq. (4.3).
Since El Balghiti, Adam N. Elmachtoub, et al. (2022) and Hu, Kallus, and Mao (2022) have provided generalization bounds for the ERM loss, our main contribution focuses on bounding the difference between the ERM loss and our surrogate loss. To bound the difference between the two losses, we make the following observation, which shows that it depends on a notion of solution stability.

Proposition 4.3.1 (Solution Stability Bounds Estimation Error). For any $t, y, h$, the error between the ERM loss and the surrogate loss can be bounded as follows,
\[
\Big| \underbrace{\hat{l}_h(t, y)}_{\text{Surrogate Loss}} - \underbrace{l(t, y)}_{\text{ERM Loss}} \Big|
\le \frac{1}{2} \sum_{h' \in \{0, h\}} \left| y^\top \left( \hat{\pi}(t - hy + h'y) - \hat{\pi}(t + h'y) \right) \right|.
\]
All of our generalization bounds involve bounding the expectation of the right-hand side, which, in words, is the expected change in the decisions if the plug-in vector $t$ is perturbed by $-hY$ or $hY$. For general choices of $\mathcal{Z}$, we can bound this quantity if the plug-in $t$, or in our setting $f(X)$, admits a smooth density.

Assumption 4.3.2 (Smooth Density). Let $g$ be the probability density function of a random variable $T \in \mathbb{R}^d$. The random variable $T$ satisfies the smooth density assumption if there exist $L > 0$ and $R > 0$ such that for all $t_1, t_2 \in \mathbb{R}^d$,
\[
| g(t_1) - g(t_2) | \le L \|t_1 - t_2\|
\]
and $g(t) = 0$ for $t \notin \left\{ t' : \|t'\| \le \frac{R}{2} \right\}$.
The assumption, while seemingly restrictive, is used to capture the intuition that the density of $f(X)$ is bounded at points where $\hat{\pi}(f(X))$ changes when $f(X)$ is perturbed by $hY$. We capture this intuition formally by showing that for small $h$ the distribution of $y^\top \hat{\pi}(T)$ is similar to that of $y^\top \hat{\pi}(T + hy)$ in the following result.

Lemma 4.3.3 (General Expected Solution Stability Bound). Let $T$ be a random variable that satisfies Assumption 4.3.2 with constants $L$ and $R$, $\|y\| \le 1$, and $0 < h \le 1$. Then, the following bound holds for $h' \in \{0, h\}$,
\[
\mathbb{E}\left[ \left| y^\top \left( \hat{\pi}(T - hy + h'y) - \hat{\pi}(T + h'y) \right) \right| \right] \le B \cdot L \cdot (R+1)^d \cdot h.
\]
Our proof hinges on showing that the total variation (TV) distance between the random variables $T$ and $T + hy$ is $O(h)$. Thus, similar results can be achieved for distributions of $T$ that do not satisfy Assumption 4.3.2 if the TV distance can be bounded. As an example, Arbas, Ashtiani, and Liaw (2023) show that if $T$ is a multivariate Gaussian, then the respective TV distance is $O(h\|y\|)$.
Combining Assumption 4.3.2 with Lemma 4.3.3 shows the expected difference between the ERM loss and the surrogate loss is $O(h)$ since $B, L, R, d$ do not scale with $n$. Using this result, we obtain the generalization results presented in the following section.
4.3.1 Generalization Bounds
We present two generalization bounds in this section. Our first generalization bound applies a standard Rademacher complexity argument which holds for any choice of $\mathcal{Z}$; that is, the bound does not depend on $\mathcal{Z}$ being polyhedral, non-convex, strongly convex, etc. Instead, the result leverages the property that the surrogate loss is Lipschitz (Lemma 4.2.1). This property allows us to apply an elegant vector contraction inequality from Maurer (2016) to bound the Rademacher complexity of our sample surrogate loss by a multivariate Rademacher complexity of the vector-valued hypothesis class induced by $\mathcal{F}$.
The multivariate Rademacher complexity is defined as follows,
\[
\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}\left[ \hat{\mathfrak{R}}_n(\mathcal{F}) \right]
= \mathbb{E}_{\mathcal{D}}\left[ \sup_{f \in \mathcal{F}} \mathbb{E}_{\sigma}\left[ \frac{1}{n} \sum_{i=1}^n \sigma_i^\top f(X_i) \right] \right],
\tag{4.8}
\]
where $\sigma_i = (\sigma_{i1},\ldots,\sigma_{id})$ and the $\sigma_{ij}$ are i.i.d. Rademacher random variables. Note the multivariate Rademacher complexity only depends on the data and $\mathcal{F}$, but not on $\mathcal{Z}$. El Balghiti, Adam N. Elmachtoub, et al. (2022) motivate the use of multivariate Rademacher complexity and show it is of order $\tilde{O}(d/\sqrt{n})$ for a variety of linear function classes.
We now present our first generalization bound.

Theorem 4.3.4 (Generalization Bound). Assume $f(X_i) \mid Y_i$ satisfies Assumption 4.3.2 with constants $L, R$ for every $f \in \mathcal{F}$. Then, for a universal constant $C$, the following holds with probability at least $1 - \delta$,
\[
\sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n \hat{l}_h(f(X_i), Y_i) - \mathbb{E}\left[ l(f(X_i), Y_i) \right] \right|
\le C \left( BL(R+1)^d h + \frac{B^2}{h} \mathfrak{R}_n(\mathcal{F}) + B \sqrt{\frac{1}{2n} \log\frac{1}{\delta}} \right).
\]
For linear function classes, we see the generalization error is O(h + 1/(n^{1/2} h)), which can be optimized to O(n^{−1/4}) by setting h = O(n^{−1/4}). This achieves the same rate as Liu and Grigas (2021), but for the misspecified setting where f* ∉ F.
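To see where this rate comes from, here is a short calculation of ours (with constants suppressed) balancing the two terms of the bound: the map h ↦ h + 1/(n^{1/2} h) is convex in h with first-order condition 1 − 1/(n^{1/2} h^2) = 0, so
\[
h^\star = n^{-1/4},
\qquad
h^\star + \frac{1}{\sqrt{n}\,h^\star} = 2\,n^{-1/4}.
\]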
The first generalization bound implicitly suggests that the tail of the surrogate loss is O(1/(n^{1/2} h)), which is counter-intuitive, as Lemma 4.2.2 shows the surrogate loss is bounded by a constant that does not depend on h. This points to a weakness in our analysis of the generalization error that comes from omitting information about the structure of Z. Thus, we present a second generalization bound that leverages the assumption that Z is polyhedral.
Our second generalization bound leverages results from Hu, Kallus, and Mao (2022) to show that the tails of the surrogate loss do not depend on h. To that end, we present the definition of VC-linear-subgraph dimension from Hu, Kallus, and Mao (2022), which is an alternative to the multivariate Rademacher complexity for quantifying the complexity of the plug-in policy class πf induced by F.
Definition 4.3.5 (VC-Linear-Subgraph Dimension). The VC-linear-subgraph dimension of a class of functions F ⊆ [R^p → R^d] is the VC dimension of the sets
\[
\mathcal{F}^{\circ} = \Bigl\{ \bigl\{ (x, \beta, t) : \beta^\top f(x) \le t \bigr\} : f \in \mathcal{F} \Bigr\}
\]
in R^{p+d+1}, that is, the largest integer ν for which there exist x_1, ..., x_ν ∈ R^p, β_1, ..., β_ν ∈ R^d, t_1, ..., t_ν ∈ R such that
\[
\Bigl\{ \bigl( \mathbb{I}\{\beta_1^\top f(x_1) \le t_1\}, \ldots, \mathbb{I}\{\beta_\nu^\top f(x_\nu) \le t_\nu\} \bigr) : f \in \mathcal{F} \Bigr\} \;=\; \{0,1\}^{\nu}.
\]
We make the following assumption, which bounds the VC-linear-subgraph dimension by a constant.
Assumption 4.3.6 (Bounded VC Dimension). The function class
\[
\bar{\mathcal{F}} = \bigl\{ \bar{f} : \bar{f}(x, y) = f(x) + hy, \text{ for } f \in \mathcal{F},\, h \in \mathbb{R} \bigr\}
\]
has VC-linear-subgraph dimension at most ν.
This assumption differs slightly from the assumption in Hu, Kallus, and Mao (2022) since it incorporates the perturbation term hy into the plug-in class. However, since the hypothesis class is only augmented by an additional parameter h, its VC-linear-subgraph dimension should differ from that of F by at most a universal constant. Using the VC-linear-subgraph dimension, we can obtain the following generalization bound for polyhedral Z. Let Z∠ be the set of extreme points of Z.
Theorem 4.3.7 (Generalization Bound for Polyhedral Z). Assume Assumption 4.3.6 holds and f(X_i) satisfies Assumption 4.3.2 with constants L, R for every f ∈ F. For h > 0 and polyhedral Z such that max_{z∈Z} ∥z∥ ≤ B, there exists a universal constant C such that with probability at least 1 − δ,
\[
\sup_{f \in \mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^{n} \hat{\ell}_h\bigl(f(X_i), Y_i\bigr) - \mathbb{E}\bigl[\ell\bigl(f(X_i), Y_i\bigr)\bigr] \right|
\;\le\;
BL(R+1)^{d}\, h \;+\; C B \sqrt{\frac{\nu \log(|\mathcal{Z}_{\angle}|+1)\log(5/\delta)}{n}}.
\]
The polyhedral bound shows the generalization error is O_p(h + n^{−1/2}), removing the 1/h dependence from the tail portion of the bound. This result suggests that choosing h = 0 optimizes the error bound, achieving an O_p(n^{−1/2}) generalization error rate, which matches the generalization error of the ERM loss proven in Hu, Kallus, and Mao (2022) and El Balghiti, Adam N. Elmachtoub, et al. (2022).
Theorem 4.3.7 raises the question of what the trade-off is in selecting a small h, or even h = 0, for polyhedral Z. In the numerics, we demonstrate that the choice of h induces a bias-variance trade-off in the regret of the learned policy.
4.4 Numerics
For our numerics section, we consider a simple shortest path problem to highlight the performance of different surrogate loss functions. The shortest path problem solves
\[
\min_{z}\; f^*_1(x)\, z_1 + f^*_2(x)\, z_2
\quad \text{s.t.} \quad z_1 + z_2 = 1,\; z_1, z_2 \in \{0,1\},
\]
which is a graph with two edges, and the goal is to identify the edge with the smallest expected travel time given a context x ∈ [0,2]. In our experiments, we let
\[
f^*_1(x) = 4x \quad \text{and} \quad f^*_2(x) = 8x - 3,
\]
and generate X_i ∼ Unif[0,2] and Y_{ij} = f^*_j(X_i) + ε_{ij} for j = 1,2 with ε_{ij} ∼ N(0,1). For our data, we generate n samples D_n = {(X_i, Y_{i1}, Y_{i2}) : i = 1, ..., n}.
We consider a plug-in class of the form
\[
f(x, \beta) = \bigl(\beta_{11} x + \beta_{10},\; \beta_{21} x + \beta_{20}\bigr) \quad \text{for } \beta \in \mathbb{R}^4.
\]
The policy class solves
\[
\min_{z}\; f_1(x, \beta)\, z_1 + f_2(x, \beta)\, z_2
\quad \text{s.t.} \quad z_1 + z_2 = 1,\; z_1, z_2 \in \{0,1\}.
\]
For ease of plotting, we consider the policy class that only varies β10 and fixes the remaining parameters. The choice of the fixed parameters determines whether the plug-in class is well-specified or misspecified. We note that the optimal policy in this setting is z_1 = 1, z_2 = 0 if X_i ≥ 0.75 and z_1 = 0, z_2 = 1 otherwise. Thus, even if the plug-in class is misspecified, the optimal policy can be recovered.
Our experiments primarily compare the performance of our differentiable surrogate loss, which we label DSL, against SPO+ from Adam N. Elmachtoub and Grigas (2022). We compare against SPO+ as it is the main surrogate loss with generalization guarantees. The goal of these numerics is to highlight the quality of the policies learned from the different loss surrogates and to highlight how well the DSL estimates the ERM loss as we vary h and n.
4.4.1 Results
Our results primarily focus on the misspecified setting where f* ∉ F. Our misspecified setting sets β11 = −1, β21 = 0, and β20 = 0; that is, f2(x) = 0 for all x and f1(x) = −x + β10. Thus, our plug-ins are just functions of β10. In our misspecified setting, the optimal policy is β10 = 0.75 since f1(x) < 0 for x ≥ 0.75.
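As a concrete reference for the figures that follow, the sketch below simulates the data-generating process above and evaluates the ERM and DSL losses on a grid of β10 values; the grid, sample size, and random seed are illustrative choices of ours rather than the exact experimental setup.

import numpy as np

rng = np.random.default_rng(0)
n, h = 40, 0.08
X = rng.uniform(0.0, 2.0, size=n)                                          # contexts
Y = np.column_stack([4.0 * X, 8.0 * X - 3.0]) + rng.normal(size=(n, 2))    # noisy edge costs

def pi_hat(t):
    # plug-in policy: pick the edge with the smaller predicted cost
    return np.array([1.0, 0.0]) if t[0] <= t[1] else np.array([0.0, 1.0])

def V(t):
    return min(t[0], t[1])

def plug_in(x, beta10):
    # misspecified plug-in: f1(x) = -x + beta10, f2(x) = 0
    return np.array([-x + beta10, 0.0])

def erm_loss(beta10):
    return np.mean([Y[i] @ pi_hat(plug_in(X[i], beta10)) for i in range(n)])

def dsl_loss(beta10, h):
    return np.mean([(V(plug_in(X[i], beta10) + h * Y[i]) -
                     V(plug_in(X[i], beta10) - h * Y[i])) / (2.0 * h) for i in range(n)])

grid = np.linspace(-1.0, 3.0, 81)
print("ERM minimizer:", grid[np.argmin([erm_loss(b) for b in grid])])
print("DSL minimizer:", grid[np.argmin([dsl_loss(b, h) for b in grid])])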
We first demonstrate visually how DSL constructs a differentiable surrogate of the ERM loss in Fig. 4.1. The figure highlights how the ERM loss has gradients that are zero almost everywhere, as it is a step function. The DSL effectively smooths the ERM loss, with the smoothing level controlled by h. The figure also gives intuition on where approximation error occurs. For h = 0.08, we see the approximation error incurred by DSL is large at points where there is a large jump in the ERM loss. This corresponds to points where the bound in Lemma 4.3.3 is large.
[Figure 4.1: Choice of h. The figure plots loss versus policy (β10) for the ERM loss and the DSL loss with h ∈ {0.08, 0.16, 0.32} for a single simulation where n = 40. The figure highlights how the DSL approach constructs a differentiable surrogate by effectively smoothing the ERM loss. As h → 0, we see the ERM loss is recovered.]
[Figure 4.2: Regret of Learned Policies. The figure plots regret (% of full information) against the number of samples n ∈ {10, 20, 40, 80, 160, 320, 640} for the policies learned using the different loss functions over 100 trials. We see that SPO+ does not learn the best-in-class policy even as the number of samples increases. Additionally, DSL appears to converge to optimal regret at a faster rate than even ERM.]
In Fig. 4.2, we plot the regret of the learned policies using DSL, ERM, and SPO+. The figure shows SPO+ fails to learn the best-in-class policy. As the number of samples increases, the SPO+ regret converges to approximately 10% of full information, while ERM and DSL converge to zero regret. Interestingly, the DSL regret seems to converge slightly faster than the ERM regret. This makes intuitive sense, as the DSL is a smoothed form of ERM and thus may have a more favorable bias-variance trade-off as an estimate of the true loss.
To understand visually why SPO+ fails, we plot the ERM, DSL, and SPO+ losses in Fig. 4.3. The key observation is that the SPO+ loss does not approximate the ERM loss the way the DSL loss does. As a result, the learned policy for SPO+ may be far from the optimal policy, as shown in the figure.
[Figure 4.3: Why SPO+ Fails. The figure plots loss versus policy (β10) for the DSL, ERM, and SPO+ losses in a single simulation with n = 80 in the misspecified setting. The vertical lines mark the choices of β10 that minimize the respective losses. The plot shows that SPO+ does not directly estimate the ERM loss, and thus its minimum can be drastically different from the minimum of ERM. Thus, in misspecified settings SPO+ may learn a suboptimal policy.]
Since the DSL loss closely mirrors the ERM loss, their minima are closer and the performance of the corresponding policies is more similar. In the trial plotted, the DSL loss in fact learns a better policy than the ERM loss, since its minimizer is closer to the optimal policy of β10 = 0.75.
Finally, Fig. 4.4 addresses the question of why we do not take h as close to 0 as possible. The figure plots the regret of the policies learned by the DSL loss for each choice of h over 100 simulations. The plot shows that taking h too small or too big can increase the regret of the learned policies. For small h, the DSL estimate more closely recovers ERM, which can be noisy, as shown in Fig. 4.1. However, as h grows large, we see that it increases the bias in the estimates. As demonstrated, both factors impact the regret of the learned policy. Additionally, this plot suggests that while DSL is an estimate of ERM, it may be a better loss to optimize over if h can be tuned correctly.
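One simple, practical way to tune h is to treat it as a hyperparameter and select it by validation regret. The sketch below is our own illustration, reusing the toy functions from the earlier snippet; the candidate values of h and the train/validation split are our own choices, not the tuning procedure used in the experiments.

# Assumes X, Y, pi_hat, V, and plug_in from the earlier sketch.
import numpy as np

def dsl_objective(beta10, h, X_s, Y_s):
    vals = [(V(plug_in(X_s[i], beta10) + h * Y_s[i]) -
             V(plug_in(X_s[i], beta10) - h * Y_s[i])) / (2.0 * h) for i in range(len(X_s))]
    return np.mean(vals)

def regret(beta10, X_s, Y_s):
    cost = np.mean([Y_s[i] @ pi_hat(plug_in(X_s[i], beta10)) for i in range(len(X_s))])
    return cost - np.mean(Y_s.min(axis=1))               # full-information benchmark

beta_grid = np.linspace(-1.0, 3.0, 81)
m = len(X) // 2                                          # train / validation split
best = (None, np.inf)
for h in (0.04, 0.08, 0.16, 0.32):
    b_hat = beta_grid[np.argmin([dsl_objective(b, h, X[:m], Y[:m]) for b in beta_grid])]
    r = regret(b_hat, X[m:], Y[m:])
    if r < best[1]:
        best = (h, r)
print("selected h:", best[0])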
[Figure 4.4: Choice of h on Regret. The figure plots the regret (% of full information) of the policy learned by DSL for different values of h (from 0.04 to 0.18) across 100 simulations. We plot the results for n = 20 and n = 80 to highlight different noise settings. The figure highlights a bias-variance trade-off for the choice of h.]

4.5 Conclusion

We have proposed a novel surrogate loss for decision-aware learning in the contextual linear optimization setting. Our surrogate loss has attractive theoretical properties, and our numerics show it
performs well empirically, in some cases better than the ERM loss which it looks to approximate.
Future work may focus on how to best optimize over the loss in practice, since it is non-convex and seems to perform better or worse depending on the choice of h. Additionally, our work may provide insight into constructing theoretical guarantees for other surrogate losses for decision-aware problems.
4.6 Appendix: Proofs for Section 4.2.1 Results
Proof. Proof for Lemma 4.2.1.
Expanding, we see
\[
\hat{\ell}_h(t,y) - \hat{\ell}_h(t',y)
= \frac{V(t+hy) - V(t-hy)}{2h} - \frac{V(t'+hy) - V(t'-hy)}{2h}
= \frac{V(t+hy) - V(t'+hy)}{2h} + \frac{V(t'-hy) - V(t-hy)}{2h}.
\]
We observe that
\[
V(t) - V(s) = t^\top\hat{\pi}(t) - s^\top\hat{\pi}(s)
= \underbrace{t^\top\bigl(\hat{\pi}(t) - \hat{\pi}(s)\bigr)}_{\le 0,\ \text{by optimality of } \hat{\pi}(t)} + (t-s)^\top\hat{\pi}(s)
\le \|t-s\|\,\|\hat{\pi}(s)\| \le B\|t-s\|,
\]
where the last inequality follows from our assumption that ∥z∥ ≤ B for all z ∈ Z. Thus, we can show
\[
\hat{\ell}_h(t,y) - \hat{\ell}_h(t',y) \le \frac{B\|t-t'\|}{2h} + \frac{B\|t-t'\|}{2h} = \frac{B\|t-t'\|}{h}.
\]
By symmetry, we have
\[
\bigl|\hat{\ell}_h(t,y) - \hat{\ell}_h(t',y)\bigr| \le \frac{B\|t-t'\|}{h},
\]
completing the proof.
Proof. Proof for Lemma 4.2.2. As observed in the proof of Lemma 4.2.1,
\[
V(t) - V(s) \le B\|t-s\|.
\]
Thus,
\[
\hat{\ell}_h(t,y) = \frac{V(t+hy) - V(t-hy)}{2h} \le \frac{B\|2hy\|}{2h} = B\|y\|.
\]
Using the same argument, we can obtain the matching lower bound. This completes the proof.
Proof. Proof for Lemma 4.2.3. To prove the result, we focus on the i-th component of the gradient ∇lˆ(t, y). Let e_i be the d-dimensional unit vector whose i-th component is 1. Then,
\[
e_i^\top \nabla\, \mathbb{E}\bigl[\hat{\ell}_h(f(X),Y)\bigr]
= \lim_{\varepsilon \to 0} \mathbb{E}\left[ \frac{\hat{\ell}_h(f(X)+\varepsilon e_i, Y) - \hat{\ell}_h(f(X),Y)}{\varepsilon} \right].
\]
We see that
\[
\frac{\hat{\ell}_h(f(X)+\varepsilon e_i, Y) - \hat{\ell}_h(f(X),Y)}{\varepsilon} \;\to\; e_i^\top \nabla \hat{\ell}_h(f(X),Y)
\]
as ε → 0, and that ∇lˆ_h(f(X),Y) exists almost everywhere since lˆ_h(f(X), y) is B/h-Lipschitz with respect to t by Lemma 4.2.1. We also see that
\[
\left| \frac{\hat{\ell}_h(f(X)+\varepsilon e_i, Y) - \hat{\ell}_h(f(X),Y)}{\varepsilon} \right| \le B\|Y\|.
\]
Thus, we can apply the dominated convergence theorem to show
\[
e_i^\top \nabla\, \mathbb{E}\bigl[\hat{\ell}_h(f(X),Y)\bigr] = \mathbb{E}\bigl[ e_i^\top \nabla \hat{\ell}_h(f(X),Y) \bigr]
\]
for every i. This completes the proof.
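As a practical aside (our own sketch, not part of the dissertation's code): since V is a minimum of linear functions, ∇_t lˆ_h(t, y) = (πˆ(t + hy) − πˆ(t − hy))/(2h) almost everywhere, and Lemma 4.2.3 is what licenses averaging such per-sample gradients during training. The snippet below checks this formula against a numerical derivative at a generic point on a toy feasible set.

import numpy as np

Z = np.array([[1.0, 0.0], [0.0, 1.0]])                  # toy feasible set (two extreme points)
pi_hat = lambda t: Z[np.argmin(Z @ t)]
V = lambda t: float(t @ pi_hat(t))
dsl = lambda t, y, h: (V(t + h * y) - V(t - h * y)) / (2.0 * h)

rng = np.random.default_rng(3)
t, y, h = rng.normal(size=2), rng.normal(size=2), 0.1
analytic = (pi_hat(t + h * y) - pi_hat(t - h * y)) / (2.0 * h)   # a.e. gradient in t
eps = 1e-6
numeric = np.array([(dsl(t + eps * e, y, h) - dsl(t - eps * e, y, h)) / (2 * eps)
                    for e in np.eye(2)])
print(np.allclose(analytic, numeric))                    # True away from kinks of V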
4.7 Appendix: Proofs for Section 4.3 Results
Proof. Proof for Proposition 4.3.1.
Expanding the term inside the absolute value, we see
\[
\bigl|\hat{\ell}_h(t,y) - \ell(t,y)\bigr|
= \left| \frac{V(t+hy) - V(t)}{2h} + \frac{V(t) - V(t-hy)}{2h} - y^\top\hat{\pi}(t) \right|
\le \frac{1}{2}\underbrace{\left| \frac{V(t+hy) - V(t)}{h} - y^\top\hat{\pi}(t) \right|}_{(a)}
+ \frac{1}{2}\underbrace{\left| \frac{V(t) - V(t-hy)}{h} - y^\top\hat{\pi}(t) \right|}_{(b)}.
\]
To bound (a), we note
\[
\frac{V(t+hy) - V(t)}{h} - y^\top\hat{\pi}(t)
= \frac{1}{h}(t+hy)^\top\bigl(\hat{\pi}(t+hy) - \hat{\pi}(t)\bigr) \le 0,
\]
where the last inequality holds by optimality of πˆ(t + hy). Thus,
\[
(a) \le \frac{1}{h}(t+hy)^\top\bigl(\hat{\pi}(t) - \hat{\pi}(t+hy)\bigr)
= \frac{1}{h}\underbrace{t^\top\bigl(\hat{\pi}(t) - \hat{\pi}(t+hy)\bigr)}_{\le 0,\ \text{by optimality of } \hat{\pi}(t)}
+ y^\top\bigl(\hat{\pi}(t) - \hat{\pi}(t+hy)\bigr)
\le y^\top\bigl(\hat{\pi}(t) - \hat{\pi}(t+hy)\bigr).
\]
To bound (b), we note
\[
\frac{V(t) - V(t-hy)}{h} - y^\top\hat{\pi}(t)
= \frac{1}{h}(t-hy)^\top\bigl(\hat{\pi}(t) - \hat{\pi}(t-hy)\bigr) \ge 0,
\]
where the last inequality holds by optimality of πˆ(t − hy). Thus,
\[
(b) \le \frac{1}{h}(t-hy)^\top\bigl(\hat{\pi}(t) - \hat{\pi}(t-hy)\bigr)
= \frac{1}{h}\underbrace{t^\top\bigl(\hat{\pi}(t) - \hat{\pi}(t-hy)\bigr)}_{\le 0,\ \text{by optimality of } \hat{\pi}(t)}
- y^\top\bigl(\hat{\pi}(t) - \hat{\pi}(t-hy)\bigr)
\le y^\top\bigl(\hat{\pi}(t-hy) - \hat{\pi}(t)\bigr).
\]
Combining the bounds on (a) and (b) proves the result.
Proof. Proof for Lemma 4.3.3. Let B_R = {s ∈ R^d : ∥s∥ ≤ R} be the ball of radius R. We can write the expectation as
\[
\mathbb{E}\bigl[ y^\top\bigl(\hat{\pi}(T-hy+h'y) - \hat{\pi}(T+h'y)\bigr) \bigr]
= \int_{B_R} y^\top\hat{\pi}(t-hy+h'y)\,g(t)\,dt - \int_{B_R} y^\top\hat{\pi}(t+h'y)\,g(t)\,dt,
\]
where we integrate over the ball of radius R since Assumption 4.3.2 assumes g(t) = 0 for t ∉ B_R. Applying a change of variables, we can write
\[
\mathbb{E}\bigl[ y^\top\bigl(\hat{\pi}(T-hy+h'y) - \hat{\pi}(T+h'y)\bigr) \bigr]
= \int_{B_{R+h\|y\|}} y^\top\hat{\pi}(s+h'y)\,g(s+hy)\,ds - \int_{B_{R+h\|y\|}} y^\top\hat{\pi}(t+h'y)\,g(t)\,dt,
\]
where we integrate over the larger ball B_{R+h∥y∥} to account for non-zero values of g(s + hy) and note g(s + hy) = 0 for s ∉ B_{R+h∥y∥}. We next bound the difference as follows:
\[
\begin{aligned}
\int_{B_{R+h\|y\|}} y^\top\hat{\pi}(s+h'y)\,g(s+hy)\,ds - \int_{B_{R+h\|y\|}} y^\top\hat{\pi}(t+h'y)\,g(t)\,dt
&= \int_{B_{R+h\|y\|}} y^\top\hat{\pi}(t+h'y)\,\bigl(g(t+hy) - g(t)\bigr)\,dt \\
&\le \int_{B_{R+h}} \bigl|y^\top\hat{\pi}(t+h'y)\bigr|\,\bigl|g(t+hy) - g(t)\bigr|\,dt \\
&\le \int_{B_{R+h}} B\,\bigl|g(t+hy) - g(t)\bigr|\,dt \\
&\le \int_{B_{R+h}} B L \|hy\|\,dt \\
&\le B L h \int_{B_{R+h}} 1\,dt,
\end{aligned}
\]
where the second inequality follows from |y^⊤πˆ(t + h′y)| ≤ ∥y∥∥πˆ(t + h′y)∥ ≤ 1 · B and the third inequality holds by Assumption 4.3.2. The integral ∫_{B_{R+h}} 1 dt is the volume of the ball, which can be bounded by (R+h)^d. Applying the assumption h ≤ 1 completes the bound.
Proof. Proof for Theorem 4.3.4. We can upper-bound the generalization error as follows,
\[
\sup_{f\in\mathcal{F}} \left|\frac{1}{n}\sum_{i=1}^n \hat{\ell}_h(f(X_i),Y_i) - \mathbb{E}[\ell(f(X_i),Y_i)]\right|
\le
\underbrace{\sup_{f\in\mathcal{F}} \left|\frac{1}{n}\sum_{i=1}^n \hat{\ell}_h(f(X_i),Y_i) - \mathbb{E}\bigl[\hat{\ell}_h(f(X_i),Y_i)\bigr]\right|}_{(i)}
+
\underbrace{\sup_{f\in\mathcal{F}} \left|\frac{1}{n}\sum_{i=1}^n \mathbb{E}\bigl[\hat{\ell}_h(f(X_i),Y_i) - \ell(f(X_i),Y_i)\bigr]\right|}_{(ii)}.
\]
We first bound (i). Letting
\[
\mathfrak{R}_n^{SL}(\mathcal{F}) = \mathbb{E}\bigl[\widehat{\mathfrak{R}}_n^{SL}(\mathcal{F})\bigr]
= \mathbb{E}\left[ \mathbb{E}_\sigma\left[ \sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i\, \hat{\ell}_h(f(X_i),Y_i) \right] \right]
\]
and noting that 0 ≤ (lˆ_h(f(X_i),Y_i) + B)/(2B) ≤ 1 by Lemma 4.2.2, we can apply the standard Rademacher complexity result (Mohri, Rostamizadeh, and Talwalkar, 2018, Theorem 3.3) to show that for any δ > 0, the following holds for all f ∈ F with probability at least 1 − δ,
\[
\frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\frac{\hat{\ell}_h(f(X_i),Y_i)+B}{2B}\right]
\le
\frac{1}{n}\sum_{i=1}^n \frac{\hat{\ell}_h(f(X_i),Y_i)+B}{2B}
+ 2\,\mathfrak{R}_n^{SL}(\mathcal{F})
+ \sqrt{\frac{1}{n}\log\frac{1}{\delta}}.
\]
To obtain the two-sided bound for (i), we can obtain the same bound with (lˆ_h(f(X_i),Y_i) + B)/(2B) replaced by (−lˆ_h(f(X_i),Y_i) + B)/(2B). Taking the union bound, we see the following holds with probability at least 1 − 2δ,
\[
\left|\frac{1}{n}\sum_{i=1}^n \hat{\ell}_h(f(X_i),Y_i) - \mathbb{E}\bigl[\hat{\ell}_h(f(X_i),Y_i)\bigr]\right|
\le 4B\,\mathfrak{R}_n^{SL}(\mathcal{F}) + 2B\sqrt{\frac{1}{n}\log\frac{1}{\delta}}.
\]
We next bound R_n^{SL}(F) by applying Corollary 4 of Maurer (2016) to show
\[
\mathfrak{R}_n^{SL}(\mathcal{F})
= \mathbb{E}\left[ \sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i\, \hat{\ell}_h(f(X_i),Y_i) \right]
\le \sqrt{2}\,\frac{B}{h}\, \mathbb{E}\left[ \sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i^\top f(X_i) \right]
= \sqrt{2}\,\frac{B}{h}\, \mathfrak{R}_n(\mathcal{F}),
\]
where the inequality holds via the corollary with the Lipschitz constant from Lemma 4.2.1. Combining the bounds, we show with probability at least 1 − δ,
\[
(i) \le \frac{4\sqrt{2}\,B^2}{h}\,\mathfrak{R}_n(\mathcal{F}) + 2B\sqrt{\frac{1}{n}\log\frac{1}{\delta}}.
\]
Next, we bound (ii). Focusing on the term inside the supremum, we see
\[
\left|\frac{1}{n}\sum_{i=1}^n \mathbb{E}\bigl[\hat{\ell}_h(f(X_i),Y_i) - \ell(f(X_i),Y_i)\bigr]\right|
\le \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n \bigl|\hat{\ell}_h(f(X_i),Y_i) - \ell(f(X_i),Y_i)\bigr|\right]
\le \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\sum_{h'\in\{0,h\}}
\mathbb{E}\Bigl[\mathbb{E}\bigl[ Y_i^\top\bigl(\hat{\pi}(f(X_i)-hY_i+h'Y_i) - \hat{\pi}(f(X_i)+h'Y_i)\bigr) \,\big|\, Y_i \bigr]\Bigr]
\le BL(R+1)^{d} h,
\]
where the first inequality applies Jensen's inequality and the triangle inequality, the second inequality applies Proposition 4.3.1 and the law of iterated expectations, and the third inequality applies Lemma 4.3.3 to the conditional expectation. Since the bound holds for all f, we have (ii) ≤ BL(R+1)^d h.
Combining the bounds on (i) and (ii) proves the result.
Proof. Proof for Theorem 4.3.7. We consider the following alternative decomposition of the generalization error:
\[
\left|\frac{1}{n}\sum_{i=1}^n \hat{\ell}_h(f(X_i),Y_i) - \mathbb{E}[\ell(f(X_i),Y_i)]\right|
\le
\underbrace{\left|\frac{1}{n}\sum_{i=1}^n \hat{\ell}_h(f(X_i),Y_i) - \ell(f(X_i),Y_i)\right|}_{(i)}
+
\left|\frac{1}{n}\sum_{i=1}^n \ell(f(X_i),Y_i) - \mathbb{E}[\ell(f(X_i),Y_i)]\right|.
\]
Bounding (i), we have
\[
\begin{aligned}
(i) &\le \frac{1}{n}\sum_{i=1}^n \bigl|\hat{\ell}_h(f(X_i),Y_i) - \ell(f(X_i),Y_i)\bigr|
\le \frac{1}{2n}\sum_{i=1}^n \sum_{h'\in\{0,h\}} Y_i^\top\bigl(\hat{\pi}(f(X_i)-hY_i+h'Y_i) - \hat{\pi}(f(X_i)+h'Y_i)\bigr) \\
&\le \frac{1}{2}\sum_{h'\in\{0,h\}} \left|\frac{1}{n}\sum_{i=1}^n Y_i^\top\hat{\pi}(f(X_i)-hY_i+h'Y_i) - \mathbb{E}\bigl[Y_i^\top\hat{\pi}(f(X_i)-hY_i+h'Y_i)\bigr]\right| \\
&\quad + \frac{1}{2}\sum_{h'\in\{0,h\}} \left|\frac{1}{n}\sum_{i=1}^n Y_i^\top\hat{\pi}(f(X_i)+h'Y_i) - \mathbb{E}\bigl[Y_i^\top\hat{\pi}(f(X_i)+h'Y_i)\bigr]\right| \\
&\quad + \frac{1}{2}\sum_{h'\in\{0,h\}} \frac{1}{n}\sum_{i=1}^n \mathbb{E}\bigl[Y_i^\top\bigl(\hat{\pi}(f(X_i)-hY_i+h'Y_i) - \hat{\pi}(f(X_i)+h'Y_i)\bigr)\bigr] \\
&\le 2\sup_{h}\left|\frac{1}{n}\sum_{i=1}^n Y_i^\top\hat{\pi}(f(X_i)-hY_i) - \mathbb{E}\bigl[Y_i^\top\hat{\pi}(f(X_i)-hY_i)\bigr]\right|
+ \frac{1}{2}\sum_{h'\in\{0,h\}} \frac{1}{n}\sum_{i=1}^n \mathbb{E}\bigl[Y_i^\top\bigl(\hat{\pi}(f(X_i)-hY_i+h'Y_i) - \hat{\pi}(f(X_i)+h'Y_i)\bigr)\bigr],
\end{aligned}
\]
where the first inequality applies the triangle inequality, the second inequality applies Proposition 4.3.1, the third adds and subtracts expectations and applies the triangle inequality, and the last inequality combines similar terms by taking the supremum over h. Applying the bound on (i), we see
\[
\sup_{f\in\mathcal{F}} \left|\frac{1}{n}\sum_{i=1}^n \hat{\ell}_h(f(X_i),Y_i) - \mathbb{E}[\ell(f(X_i),Y_i)]\right|
\le
3\underbrace{\sup_{\bar{f}\in\bar{\mathcal{F}}} \left|\frac{1}{n}\sum_{i=1}^n Y_i^\top\hat{\pi}\bigl(\bar{f}(X_i,Y_i)\bigr) - \mathbb{E}\bigl[Y_i^\top\hat{\pi}\bigl(\bar{f}(X_i,Y_i)\bigr)\bigr]\right|}_{(a)}
+
\underbrace{\sup_{f\in\mathcal{F}} \frac{1}{2}\sum_{h'\in\{0,h\}} \frac{1}{n}\sum_{i=1}^n \mathbb{E}\bigl[Y_i^\top\bigl(\hat{\pi}(f(X_i)-hY_i+h'Y_i) - \hat{\pi}(f(X_i)+h'Y_i)\bigr)\bigr]}_{(b)}.
\]
Component (a) is bounded using Theorem 1 and Theorem 2 of Hu, Kallus, and Mao (2022), showing there exists a universal constant C such that the following holds with probability at least 1 − δ,
\[
(a) \le C B \sqrt{\frac{\nu\log(|\mathcal{Z}_{\angle}|+1)\log(5/\delta)}{n}}.
\]
Component (b) is bounded by Lemma 4.3.3. Combining the bounds on the (a) and (b) components proves the result.
Chapter 5
Conclusion and Future Work
The small-data, large-scale regime highlights a setting where decision-aware approaches are necessary for learning high-quality decision-making policies. This dissertation provides a comprehensive example of how to develop the key components of an effective decision-aware algorithm. Chapter 2 introduced the fundamental concept of using the Variance Gradient Correction to evaluate data-driven policies. Chapter 3 presented a new denoising problem class that expands the scope and application of the Variance Gradient Correction to more practical machine learning settings. Finally, Chapter 4 used insights from the Variance Gradient Correction to show how to improve the implementation of decision-aware learning approaches.
With the rise of more advanced artificial intelligence and machine learning models, such as generative artificial intelligence, this dissertation also acts as a starting point for work on integrating these models into decision-making tasks. Future work may consider the following:
• This dissertation focuses on decision-making problems that can be formulated as linear optimization problems. Can our results be generalized to other settings such as non-linear
optimization problems or sequential decision-making problems?
• Many modern machine learning models like generative AI may not fall into the predict-then-optimize framework. How do we generalize the decision-aware insights of this dissertation to these types of models?
Bibliography
Aouad, Ali et al. (2019). “Market segmentation trees”. In: arXiv preprint arXiv:1906.01174.
Arbas, Jamil, Hassan Ashtiani, and Christopher Liaw (2023). “Polynomial time and private learning of unbounded Gaussian Mixture Models”. In: arXiv preprint arXiv:2303.04288.
Arora, Sanjeev et al. (2020). “Harnessing the Power of Infinitely Wide Deep Nets on Small-data
Tasks”. In: ICLR 2020.
Baardman, Lennart et al. (2019). “Scheduling promotion vehicles to boost profits”. In: Management Science 65.1, pp. 50–70.
Bertsekas, Dimitri P (1997). “Nonlinear programming”. In: Journal of the Operational Research
Society 48.3, pp. 334–334.
Bousquet, Olivier and André Elisseeff (2001). “Algorithmic stability and generalization performance”. In: Advances in Neural Information Processing Systems, pp. 196–202.
Boutilier, Justin J and Timothy CY Chan (2019). “Response time optimization for drone-delivered
automated external defibrillators”. In: arXiv preprint arXiv:1908.00149.
Breiman, Leo (1996). “Bagging predictors”. In: Machine Learning 24.2, pp. 123–140.
Brodley, Carla E and Mark A Friedl (1999). “Identifying mislabeled training data”. In: Journal of
Artificial Intelligence Research 11, pp. 131–167.
Buja, Andreas, Trevor Hastie, and Robert Tibshirani (1989). “Linear smoothers and additive models”. In: The Annals of Statistics, pp. 453–510.
Candes, Emmanuel J, Justin K Romberg, and Terence Tao (2006). “Stable signal recovery from
incomplete and inaccurate measurements”. In: Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences 59.8, pp. 1207–
1223.
Chan, Grace and Andrew TA Wood (1997). “Algorithm AS 312: An Algorithm for simulating
stationary Gaussian random fields”. In: Applied Statistics, pp. 171–181.
Chatterjee, Shubhojeet et al. (2020). “Review of noise removal techniques in ECG signals”. In:
IET Signal Processing 14.9, pp. 569–590.
Cheskes, Sheldon et al. (2020). “Improving access to automated external defibrillators in rural
and remote settings: a drone delivery feasibility study”. In: Journal of the American Heart
Association 9.14, e016687.
Chu, Jamal et al. (2021). “Machine learning-based dispatch of drone-delivered defibrillators for
out-of-hospital cardiac arrest”. In: Resuscitation 162, pp. 120–127.
Chung, Tsai-Hsuan et al. (2022). “Decision-Aware Learning for Optimizing Health Supply Chains”.
In: arXiv preprint arXiv:2211.08507.
Cohen, Maxime C, Jeremy J Kalas, and Georgia Perakis (2021). “Promotion optimization for multiple items in supermarkets”. In: Management Science 67.4, pp. 2340–2364.
Cohen, Maxime C. et al. (2017). “The impact of linear optimization on promotion planning”. In:
Operations Research 65.2, pp. 446–468.
Combes, Richard (2015). “An extension of McDiarmid’s inequality”. In: arXiv preprint arXiv:1511.05240.
Deng, Yunxiao and Suvrajeet Sen (2018). “Learning enabled optimization: Towards a fusion of
statistical learning and stochastic programming”. In: INFORMS Journal on Optimization (submitted).
Devroye, Luc, Abbas Mehrabian, and Tommy Reddad (2018). “The total variation distance between high-dimensional Gaussians”. In: arXiv preprint arXiv:1810.08693 6.
Donti, Priya, Brandon Amos, and J Zico Kolter (2017). “Task-based end-to-end model learning in
stochastic optimization”. In: Advances in Neural Information Processing Systems, pp. 5484–
5494.
Dudík, Miroslav, John Langford, and Lihong Li (2011). “Doubly robust policy evaluation and
learning”. In: Proceedings of the 28th International Conference on International Conference
on Machine Learning, pp. 1097–1104.
El Balghiti, Othman, Adam N Elmachtoub, et al. (2019). “Generalization bounds in the predict-then-optimize framework”. In: Advances in Neural Information Processing Systems, pp. 14412–
14421.
— (2022). “Generalization Bounds in the Predict-Then-Optimize Framework”. In: Mathematics
of Operations Research 0.0, null. DOI: 10.1287/moor.2022.1330.
Elmachtoub, Adam, Jason Cheuk Nam Liang, and Ryan McNellis (2020). “Decision trees for
decision-making under the predict-then-optimize framework”. In: International Conference on
Machine Learning. PMLR, pp. 2858–2867.
Elmachtoub, Adam N and Paul Grigas (2021). “Smart “predict, then optimize””. In: Management
Science.
— (2022). “Smart “predict, then optimize””. In: Management Science 68.1, pp. 9–26.
Elvik, Rune et al. (2009). “The handbook of road safety measures, Bingley”. In: UK: Emerald
Group Publishing Limited 4.5, p. 12.
Friedman, J., T. Hastie, and R. Tibshirani (2001). The Elements of Statistical Learning. Berlin:
Springer.
Gallicchio, Claudio and Simone Scardapane (2020). “Deep randomized neural networks”. In: Recent Trends in Learning From Data: Tutorials from the INNS Big Data and Deep Learning
Conference (INNSBDDL2019). Springer, pp. 43–68.
Gamarnik, David (2013). “Correlation decay method for decision, optimization, and inference in
large-scale networks”. In: Theory Driven by Influential Applications. INFORMS, pp. 108–121.
Gelman, A. et al. (2014). Bayesian Data Analysis. Vol. 2. Chapman & Hall/CRC Boca Raton, FL,
USA.
Gonzalez, Rafael C. and Richard E. Woods (2006). Digital Image Processing (3rd Edition). USA:
Prentice-Hall, Inc. ISBN: 013168728X.
Grigas, Paul, Meng Qi, et al. (2021). “Integrated conditional estimation-optimization”. In: arXiv
preprint arXiv:2110.12351.
Guo, Wenshuo, Michael Jordan, and Angela Zhou (2022). “Off-policy evaluation with policy-dependent optimization response”. In: Advances in Neural Information Processing Systems 35,
pp. 37081–37094.
Gupta, Vishal, Michael Huang, and Paat Rusmevichientong (2022). “Debiasing in-sample policy
performance for small-data, large-scale optimization”. In: Operations Research. Forthcoming.
Gupta, Vishal and Nathan Kallus (2021). “Data Pooling in Stochastic Optimization”. In: Management Science. URL: https://doi.org/10.1287/mnsc.2020.3933.
— (2022). “Data pooling in stochastic optimization”. In: Management Science 68.3, pp. 1595–
1615.
— (2023). “Contextual Data Pooling for Panel Data”. Working paper.
Gupta, Vishal and Paat Rusmevichientong (2021a). “Small-data, large-scale linear optimization
with uncertain objectives”. In: Management Science 67.1, pp. 220–241.
— (2021b). “Small-data, large-scale linear optimization with uncertain objectives”. In: Management Science 67.1, pp. 220–241.
Hastie, Trevor, Robert Tibshirani, and Ryan Tibshirani (2020a). “Best Subset, Forward Stepwise
or Lasso? Analysis and Recommendations Based on Extensive Comparisons”. In: Statistical
Science 35.4, pp. 579–592. DOI: 10.1214/19-STS733. URL: https://doi.org/10.1214/
19-STS733.
— (2020b). “Best Subset, Forward Stepwise or Lasso? Analysis and Recommendations Based on
Extensive Comparisons”. In: Statistical Science 35.4, pp. 579–592.
Hu, Yichun, Nathan Kallus, and Xiaojie Mao (2022). “Fast rates for contextual linear optimization”. In: Management Science 68.6, pp. 4236–4245.
Ignatiadis, Nikolaos and Stefan Wager (2019a). “Covariate-Powered Empirical Bayes Estimation”.
In: Advances in Neural Information Processing Systems. Ed. by H. Wallach et al. Vol. 32.
Curran Associates, Inc.
— (2019b). “Covariate-Powered empirical Bayes estimation”. In: Advances in Neural Information
Processing Systems 32.
Ito, Shinji, Akihiro Yabe, and Ryohei Fujimaki (July 2018a). “Unbiased Objective Estimation in
Predictive Optimization”. In: Proceedings of the 35th International Conference on Machine
Learning. Ed. by Jennifer Dy and Andreas Krause. Vol. 80. Stockholm Sweden, pp. 2176–
2185.
— (2018b). “Unbiased objective estimation in predictive optimization”. In: International Conference on Machine Learning. PMLR, pp. 2176–2185.
Jacot, Arthur, Franck Gabriel, and Clément Hongler (2018). “Neural tangent kernel: Convergence
and generalization in neural networks”. In: Advances in Neural Information Processing Systems
31.
Javanmard, Adel and Andrea Montanari (2014). “Confidence intervals and hypothesis testing for
high-dimensional regression”. In: The Journal of Machine Learning Research 15.1, pp. 2869–
2909.
— (2018). “Debiasing the lasso: Optimal sample size for gaussian designs”. In: Annals of Statistics
46.6A, pp. 2593–2622.
Javanmard, Adel, Hamid Nazerzadeh, and Simeng Shao (2020). “Multi-Product Dynamic Pricing in High-Dimensions with Heterogeneous Price Sensitivity”. In: 2020 IEEE International
Symposium on Information Theory (ISIT), pp. 2652–2657. DOI: 10.1109/ISIT44484.2020.
9174296.
Jordan, Michael, Yixin Wang, and Angela Zhou (2022). “Empirical Gateaux Derivatives for Causal
Inference”. In: Advances in Neural Information Processing Systems 35, pp. 8512–8525.
Kavran, Andrew J and Aaron Clauset (2021). “Denoising large-scale biological data using network
filters”. In: BMC Bioinformatics 22, pp. 1–21.
Kearns, Michael and Dana Ron (1999). “Algorithmic stability and sanity-check bounds for leaveone-out cross-validation”. In: Neural Computation 11.6, pp. 1427–1453.
Kim, Ki-Su et al. (2020). “Prediction of ocean weather based on denoising autoencoder and convolutional LSTM”. In: Journal of Marine Science and Engineering 8.10, p. 805.
Kontorovich, Aryeh (2014). “Concentration in unbounded metric spaces and algorithmic stability”.
In: International Conference on Machine Learning. PMLR, pp. 28–36.
Kuntzman, Gersh (Nov. 2022). stuckatdot: It takes Years (and years) to get a speed hump in this
city. URL: https : / / nyc . streetsblog . org / 2022 / 11 / 11 / stuckatdot - it - takes -
years-and-years-to-get-a-speed-hump-in-this-city/.
LeVeque, Randall J (2007). Finite difference methods for ordinary and partial differential equations: steady-state and time-dependent problems. SIAM.
Li, Xiaocheng and Yinyu Ye (2019). “Online linear programming: Dual convergence, new algorithms, and regret bounds”. In: arXiv preprint arXiv:1909.05499.
Liu, Heyuan and Paul Grigas (2021). “Risk bounds and calibration for a smart predict-then-optimize
method”. In: Advances in Neural Information Processing Systems 34, pp. 22083–22094.
Liyanage, Liwan H and J George Shanthikumar (2005). “A practical inventory control policy using
operational statistics”. In: Operations Research Letters 33.4, pp. 341–348.
Maurer, Andreas (2016). “A vector-contraction inequality for rademacher complexities”. In: Algorithmic Learning Theory: 27th International Conference, ALT 2016, Bari, Italy, October 19-21,
2016, Proceedings 27. Springer, pp. 3–17.
Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar (2018). Foundations of machine
learning. MIT press.
New York, City of (2023). Motor Vehicle Collisions (Crashes) Dataset. https://data.cityofnewyork.
us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95.
Pani, Abhishek, S. Raghavan, and Mustafa Sahin (2017). “Large-Scale Advertising Portfolio Optimization in Online Marketing”. In: Working Paper. URL: http://terpconnect.umd.edu/
~raghavan/preprints/lsoapop.pdf.
Pollard, David (1990). “Empirical processes: Theory and applications”. In: NSF-CBMS Regional
Conference Series in Probability and Statistics. Vol. 2. Institute of Mathematical Statistics,
pp. i–86.
Rabiner, Lawrence R and Bernard Gold (1975). “Theory and application of digital signal processing”. In: Englewood Cliffs: Prentice-Hall.
Rahimi, Ali and Benjamin Recht (2008). “Weighted sums of random kitchen sinks: Replacing
minimization with randomization in learning”. In: Advances in Neural Information Processing
Systems 21.
Rusmevichientong, Paat and David P. Williamson (2006). “An adaptive algorithm for selecting profitable keywords for search-based advertising services”. In: Proceedings of the 7th ACM Conference on Electronic Commerce. ACM, pp. 260–269.
Sanfridsson, J et al. (2019). “Drone delivery of an automated external defibrillator–a mixed method
simulation study of bystander experience”. In: Scandinavian journal of trauma, resuscitation
and emergency medicine 27.1, pp. 1–9.
Schmoys, David and Shujing Wang (Sept. 2019). How to solve a linear optimization problem
on incentive allocation? URL: https : / / eng . lyft . com / how - to - solve - a - linear -
optimization-problem-on-incentive-allocation-5a8fb5d04db1.
Shalev-Shwartz, Shai et al. (2010). “Learnability, stability and uniform convergence”. In: The Journal of Machine Learning Research 11, pp. 2635–2670.
Shao, Jun (1993). “Linear model selection by cross-validation”. In: Journal of the American statistical Association 88.422, pp. 486–494.
Smith, James E and Robert L Winkler (2006). “The Optimizer’s Curse: Skepticism and Postdecision Surprise in Decision Analysis”. In: Management Science 52.3, pp. 311–322.
Tian, Chunwei et al. (2020). “Deep learning on image denoising: An overview”. In: Neural Networks 131, pp. 251–275.
Van der Vaart, A. W. (2000). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic
Mathematics 3. Cambridge, UK: Cambridge University Press.
Vershynin, Roman (2018). High-Dimensional Probability: An Introduction with Applications in
Data Science. Vol. 47. Cambridge University Press.
Vlastelica, Marin et al. (2019). “Differentiation of blackbox combinatorial solvers”. In: arXiv
preprint arXiv:1912.02175.
Wager, Stefan and Susan Athey (2018). “Estimation and inference of heterogeneous treatment
effects using random forests”. In: Journal of the American Statistical Association 113.523,
pp. 1228–1242.
Wainwright, Martin J (2019). High-dimensional statistics: A non-asymptotic viewpoint. Vol. 48.
Cambridge University Press.
Wilder, Bryan, Bistra Dilkina, and Milind Tambe (2019). “Melding the data-decisions pipeline:
Decision-focused learning for combinatorial optimization”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33, pp. 1658–1665.
Wilder, Bryan, Eric Ewing, et al. (2019). “End to end learning and optimization on graphs”. In:
Advances in Neural Information Processing Systems, pp. 4672–4683.
Wilson, Kevin W et al. (2008). “Speech denoising using nonnegative matrix factorization with
priors”. In: 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.
IEEE, pp. 4029–4032.
Yu, Bin (2013). “Stability”. In: Bernoulli, pp. 1484–1500.
Zhang, Cun-Hui and Stephanie S. Zhang (2014). “Confidence intervals for low dimensional parameters in high dimensional linear models”. In: Journal of the Royal Statistical Society: Series B:
Statistical Methodology, pp. 217–242.
Zhu, Xingquan and Xindong Wu (2004). “Class noise vs. attribute noise: A quantitative study”. In:
Artificial Intelligence Review 22, pp. 177–210.
Zhu, Xingquan, Xindong Wu, and Qijun Chen (2003). “Eliminating Class Noise in Large Datasets”.
In: Proceedings of the Twentieth International Conference on International Conference on Machine Learning. ICML’03. Washington, DC, USA: AAAI Press, pp. 920–927. ISBN: 1577351894.
Abstract
Modern decision-making has rapidly evolved to integrate data into the decision-making pipeline via machine learning. While many machine learning algorithms have become more specialized to leverage the intricacies of specific "big data" datasets to make better predictions, fewer algorithms target general decision-making problems. Instead, most practical data-driven approaches use machine learning as an auxiliary step to predict unknown parameters used in decision-making. Unfortunately, most decision-making problems often have "small data," that is, noisy or sparse data. As a result, machine learning predictions often have high uncertainty, which leads to poor decisions.
This dissertation addresses these challenges by proposing new methods and algorithms that combine prediction and decision-making into a single learning procedure to improve data-driven decision-making. The decision information helps machine learning algorithms learn predictions, leading to better decisions, particularly in small-data settings.
The first part of the dissertation focuses on a fundamental issue in decision-aware learning: evaluating policies' performance. The second part of the dissertation introduces a new decision-aware problem class leveraging Variance Gradient Correction (VGC) concepts. It proposes the one-shot VGC, which generalizes its use to evaluate policies that plug in predictions from any machine learning algorithm. Finally, the last chapter focuses on developing theoretically sound yet computationally tractable decision-aware learning algorithms.
Decision-aware learning provides a general framework for integrating the power of modern machine learning models into decision-making. This dissertation offers a starting point for understanding the actual benefits of decision-aware learning.