Artificial Decision Intelligence
Integrating Deep Learning and Combinatorial Optimization
by
Aaron Matthew Ferber
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2023
Copyright 2023 Aaron Matthew Ferber
Dedication
To the constellation of hearts and minds that have shaped this academic and
personal journey. Firstly, to my advisor, Professor Bistra Dilkina, your steadfast
mentorship, guidance, and leadership have been pivotal in shaping this work
and myself as a researcher. This body of work, and my expertise, are thanks
to your tireless dedication and endless enthusiasm. To my committee members,
Professors Yan Liu and Phebe Vayanos. Professor Liu, yours was the first class I
took at USC, and there I would not only discuss research topics in deep learning,
but also meet many of my best friends and roommates during my PhD. Professor
Vayanos, it is always a pleasure to discuss topics in research and everyday life with you
and your students; thank you for inspiring me to pursue the many challenges
in this work. To Yuandong Tian, thank you for mentoring me and pushing me
to search for insights. To David Shmoys, Shane Henderson, Eoin O’Mahony,
and Daniel Freund, thank you for introducing me to the world of solving open
problems by asking basic questions and taking the time to mentor me as I was
just beginning my research journey.
To my collaborators, Brandon Amos, Emily Barbee, Meredith Gore, Umang
Gupta, Taoan Huang, Burcu Keskin, Alexandru Niculescu-Mizil, Jialin Song, Martin Schubert, Benoit Steiner, Milind Tambe, Bryan Wilder, Yisong Yue, Daochen
Zha, and Arman Zharmagambetov, thank you for working with me on these
many interesting projects and enabling these contributions. To my first lab mates,
Amrita Gupta, Elias Khalil, and Caleb Robinson, thank you for your guidance,
mentorship, and introduction to the Dilkina group. To Taoan Huang, Haoming
Li, and Weizhe Chen, thank you for being amazing lab mates and bringing the
Dilkina group to life with intense discussions, fun events, and eagerness to explore. To the many people who have shared the lab and their journeys with me,
Jackson Killian, Lily Xu, Eric Ewing, Elizabeth Ondula, Aditya Mate, Kai Wang,
Han Ching Ou, Bryan Wilder, Elizabeth Bondi-Kelly, Aida Rahmattalabi, Sarah
Cooney, Aaron Schlenker, Sarah McCarthy, Haifeng Xu, Omkar Thakoor, Laksh
Matai, Shivam Goel, Sadegh Farhang, Sadiq Patel, Hsun Ta, Sina Aghaei, Jason
Chen, Andrew Perrault, Shahrzad Gholami, and Amulya Yadav. It was a privilege to get to know you and I couldn’t have done this without your kindness and
friendship.
To my friends and extended family from undergrad who continue to mentor,
encourage, and push me to do more, Bernardo Casares, Shannon Casares, Mateo
Espinosa, Eyvind Niklasson, Joseph Dwyer, Harshil Mattoo, Kaushik Venkataraman, Rainy Niu, Minghui Chen, Nini Shan, Fiona Chin, Tiffany Zheng, Robert
Spiers, Sharon Spiers, and Laya Hiridjee. To the many people who have lived
with me and shared many insane housing adventures, Hrayr Harutyunyan, Bryan
Wilder, Shivam Goel, Aliya Ali, Raffaello, Samad Ali, Jasmine Zain, Robert Spiers,
and Sharon Spiers, thank you for going through the many extraordinary events
that occurred during my studies including earthquakes, fires, and downpours.
Sharing these extraordinary circumstances brought us closer and I really appreciate every moment spent together. To my partner Aliya Ali, thank you for being
there every step of the way and encouraging me to explore new directions and
countries. You have made this time exciting beyond belief and I am so thankful
to have spent this time with you in my life. Your care and kindness have
encouraged me to keep looking forward and stay confident on my path. Thank
you as well for sharing the joy of training our dog Raffaello who is an absolute
ball of fun and happiness.
To my grandparents, aunts, uncles, and cousins in Uruguay, St. Louis, and
China, thank you for sharing unforgettable joy, encouragement, and advice. The
time we spent together in person and on group chats has been grounding and
a constant reminder to keep looking at the bigger picture. To my parents and
sister. Mom and dad, you have been an incredible support throughout my life
and I am so thankful to share my accomplishments with you any way I can. You
have encouraged me and given me so much that I can never repay. Jessi, you
have supported me so much in so many different ways it's hard to count. You are
an inspiration to me and I’m thankful to be your brother.
Acknowledgements
This work was made possible by financial support from the Georgia Tech presidential award, NSF award #1914522, NSF award #1935451, NSF award #2112533,
and internships at NEC Labs and Meta AI Research.
Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxiii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Learning and Optimization . . . . . . . . . . . . . . . . . . . . . 1
1.2 Perspectives on Integrated Machine Learning and Combinatorial
Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Combinatorial Optimization as Deep Learning Modules . 6
1.2.2 Learning-Accelerated Optimizers . . . . . . . . . . . . . 12
Chapter 2: MIPaaL: Mixed Integer Program as a Layer . . . . . . . 16
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Problem description . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 MIPaaL: Encoding MIP in a Neural Network . . . . . . . . . . . 22
2.4 Decision-Focused Learning with MIPaaL . . . . . . . . . . . . . 27
2.5 MIPaaL Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.6 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6.1 Benchmark problems . . . . . . . . . . . . . . . . . . . . 31
2.6.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.6.3 Experimental Results . . . . . . . . . . . . . . . . . . . . 36
2.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Appendix 2.A Portfolio optimization . . . . . . . . . . . . . . . . . . 49
2.A.1 Data specification . . . . . . . . . . . . . . . . . . . . . . 49
2.A.2 Optimization model . . . . . . . . . . . . . . . . . . . . . 54
Appendix 2.B Bipartite matching . . . . . . . . . . . . . . . . . . . . 58
2.B.1 Data specification . . . . . . . . . . . . . . . . . . . . . . 58
2.B.2 Optimization model . . . . . . . . . . . . . . . . . . . . . 59
Chapter 3: Predicting Wildlife Trafficking Airline Routes with Differentiable Shortest Paths . . . . . . . . . . . . . . . . . . . . . . 61
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.3 Flight Itinerary Prediction Formulation . . . . . . . . . . . . . . 71
3.3.1 Predictive Model: Edge Transition Estimator . . . . . . . 72
3.3.2 Model Training: Path-Integrated Learning . . . . . . . . 76
3.3.3 Model Training: Edge-Myopic Learning . . . . . . . . . 80
3.4 Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.5.1 Feature Selection . . . . . . . . . . . . . . . . . . . . . . 84
3.5.2 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.5.3 Results Discussion . . . . . . . . . . . . . . . . . . . . . 87
3.5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Chapter 4: SurCo: Learning Linear Surrogates For Combinatorial
Nonlinear Optimization Problems . . . . . . . . . . . . . . . . . . 95
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.2 Problem Specification . . . . . . . . . . . . . . . . . . . . . . . . 100
4.3 SurCo: Learning Linear Surrogates . . . . . . . . . . . . . . . . . 103
4.3.1 SurCo-zero: on-the-fly optimization . . . . . . . . . . . 103
4.3.2 SurCo-prior: offline surrogate training . . . . . . . . . 105
4.3.3 SurCo-hybrid: fine-tuning a predicted surrogate . . . . 107
4.4 Predicting Surrogate Cost vs Predicting Solution, A Theoretical
Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.4.1 Lipschitz constant and sample complexity . . . . . . . . 110
4.4.2 Difference between Cost and Solution Regression . . . . 111
4.5 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 113
4.5.1 Embedding Table Sharding . . . . . . . . . . . . . . . . . 114
4.5.2 Inverse Photonic Design . . . . . . . . . . . . . . . . . . 116
4.5.3 Nonlinear Route Planning . . . . . . . . . . . . . . . . . 120
4.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Appendix 4.A Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Appendix 4.B Experiment Details . . . . . . . . . . . . . . . . . . . . 132
4.B.1 Setups . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.B.2 Network Architectures . . . . . . . . . . . . . . . . . . . 133
4.B.2.1 Embedding Table Sharding . . . . . . . . . . . 133
4.B.2.2 Inverse Photonic Design . . . . . . . . . . . . . 134
Appendix 4.C Pseudocode . . . . . . . . . . . . . . . . . . . . . . . . 136
Appendix 4.D Additional Failed Baselines . . . . . . . . . . . . . . . 139
Chapter 5: GenCO: Generating Diverse Solutions to Nonlinear Combinatorial Optimization Problems . . . . . . . . . . . . . . . . . . 144
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.2 Generating Diverse Solutions to Nonlinear Combinatorial Problems 148
5.2.1 Mathematical Formulation . . . . . . . . . . . . . . . . . 149
5.2.2 GenCO: Generating Solutions with Combinatorial Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.2.3 GenCO - GAN . . . . . . . . . . . . . . . . . . . . . . . . 151
5.2.4 GenCO - VQVAE . . . . . . . . . . . . . . . . . . . . . . 153
5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
5.3.1 Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
5.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Chapter 6: ML for MIP: Learning Pseudo-Backdoors for Mixed Integer Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.3 Learning Pseudo-Backdoors . . . . . . . . . . . . . . . . . . . . 169
6.3.1 MIP and Backdoor Data Representation . . . . . . . . . . 171
6.3.2 Neural Network Architecture . . . . . . . . . . . . . . . 173
6.3.3 Learning the Scorer Model . . . . . . . . . . . . . . . . . 175
6.3.4 Learning the Classifier Model . . . . . . . . . . . . . . . 177
6.4 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . 178
6.4.1 Problem domains . . . . . . . . . . . . . . . . . . . . . . 178
6.4.2 Data Generation and Model Evaluation . . . . . . . . . . 180
6.4.3 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . 183
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Chapter 7: Conclusions and Future Directions . . . . . . . . . . . . 187
7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
7.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . 189
7.2.1 Constraint Prediction . . . . . . . . . . . . . . . . . . . . 189
7.2.2 Handling Missing Counterfactuals . . . . . . . . . . . . 190
7.2.3 Accessibility of Artificial Decision Intelligence . . . . . . 190
7.2.4 Artificial Decision Intelligence . . . . . . . . . . . . . . . 191
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
List of Tables
2.1 Decision quality. Comparison in terms of realized optimization
objective: monthly percentage increase for portfolio optimization (SP500 and DAX), number of pairs successfully matched for
Matching, and value of items for Knapsack. MIPaaL gives 2x
monthly returns on SP500 and 8x on DAX, and improves the
objective by 40.3% and 1.2% for Matching and Knapsack respectively. MIPaaL outperforms all other variants considered including naive application of previous work [Wilder et al., 2019a] for
differentiating through the root LP. . . . . . . . . . . . . . . . . 36
2.2 ML performance on test set. TwoStage wins on ML metrics used
for training (MSE, CE), whereas MIPaaL has inferior ML metrics
while improving decision quality. In all benchmarks, the predictive problem is hard as evidenced by the ML metrics of all methods. Bolded entries have 95% confidence intervals overlapping
with the best entry. . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.3 Problem statistics and timing results. Timing results are number
of epochs until validation MIP performance convergence, average time per epoch, and average % time taken per epoch to compute Forward and Backward passes through the MIP Layer. . . . 41
2.4 Transfer learning metrics for models trained on SP-30ᵃ and tested
varying the data distribution. Both training and testing problems optimize over 30 stocks. The model is trained on 30-stock instances
and tested on selecting from a different subset of stocks, and on 30
stocks from the DAX, a German stock exchange. . . . . . . . . . 41
2.5 Transfer learning metrics for models trained on SP-30ᵃ and tested
varying the problem size. . . . . . . . . . . . . . . . . . . . . . 42
2.6 Transfer learning monthly rate of return on different problem
sizes: trained on SP-30ᵃ. The decision-focused approaches ensure that the predictions perform well in multiple deployment
settings whereas the TwoStage approach is unable to extract relevant information. . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.7 Transfer learning MSE results on different problem sizes: trained
on SP-30ᵃ. The TwoStage approach clearly does better in approximating the distribution of objective coefficients even when the
data distribution changes. However, the improvement in MSE
does not ensure good deployment quality, as shown in Table 2.6. 54
3.1 Node and edge features of the flight network. Features in Bold
were selected by recursive feature elimination. . . . . . . . . . . 91
3.2 Summary statistics from 10-fold cross-validation of models using either the full set of features or an algorithmically-selected
subset. We evaluate two training methods, edge-predictive myopic learning which aims to correctly predict how often individual edges are used, and path-integrated learning which aims to
identify the complete intended source-destination path. We report the average performance across folds with 95% confidence
intervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.1 Notations used in this work. . . . . . . . . . . . . . . . . . . . . 100
4.2 Conceptual comparison of optimizers (both traditional and ML-guided). Our approach (SurCo) can handle nonlinear objectives
without a predefined analytical form, does not require pre-computed
optimal solutions in its training set, can handle combinatorial
constraints (via commercial solvers it incorporates), and can generalize to unseen instances. . . . . . . . . . . . . . . . . . . . . . 102
4.3 Task randomization of 4 different tasks in inverse photonic design. 135
4.4 Solution prediction results, most methods give infeasible solutions. 141
4.5 Comparison of only feasible solution prediction method against
worst baseline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.1 Game level design results comparison. Here, we compare our
approach with an updating adversary against ours with a fixed
adversary, and lastly, previous work that simply postprocesses
solutions to make the generated game levels valid. All generated levels are valid, but our approach generates more diverse
solutions by training the GAN to generate solutions with the GAN
adversary taking fixed solutions as input. . . . . . . . . . . . . 158
6.1 Variable, constraint, and edge features that encode a MIP instance
to be used as input to machine learning models. The features here
capture problem information and leverage solution information
of the root LP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
6.2 Runtime (secs) of standard Gurobi (grb), the score model (scorer),
and the joint score and classify model (scorer+cls). We report
mean and standard deviation, 25th, 50th, and 75th percentiles.
Finally, we report Win/Tie/Loss vs Gurobi. . . . . . . . . . . . . 185
List of Figures
3.1 Visualization of itineraries with historical seizures in red as well
as a subset of the global flight network in grey. . . . . . . . . . 68
3.2 Visualization of the discrepancies between Path-Integrated predicted itineraries in blue and observed itineraries in red. Additionally, domain experts identified two likely errors in Fig. 3.2d
where our model’s predictions are unrealistic. . . . . . . . . . . 93
3.3 Feature importance boxplot of the model coefficients across 10
training folds. Positive values suggest higher trafficking rates
when the indicator is prevalent and negative values indicate lower
rates when the indicator is prevalent. . . . . . . . . . . . . . . . 94
4.1 Overview of our proposed framework SurCo. . . . . . . . . . . . 96
4.2 Table placement plan latency (left) and solver runtime (right). We
evaluate SurCo against Dreamshard [Zha et al., 2022b], a SoTA offline
RL sharding tool, a domain-heuristic of assigning tables based on dimension, and a greedy heuristic based on the estimated runtime increase. Striped approaches require pre-training. . . . . . . . . . . . . 111
4.3 Left: the solution loss (% of failed instances, where an instance
fails when its design loss is not 0); right: test-time solver runtime in log scale. For
both, lower is better. We compare against the Pass-Through gradient approach proposed in Schubert et al. [2022]. We observe
that SurCo-prior achieves similar success rates to the previous
approach Pass-Through with a substantially improved runtime.
Additionally, SurCo-zero runs comparably or faster, while finding more valid solutions than Pass-Through. SurCo-hybrid obtains valid solutions most often and is faster than SurCo-zero at
the expense of pretraining. Striped approaches use pretraining. . 113
4.4 Inverse photonic design settings from the ceviche challenges [Schubert et al., 2022], along with SurCo-zero solution designs and
wavelength intensities. Light is fed in on the left and is routed
at desired intensities to the output by designing the intermediate
region. In the Wavelength Multiplexer setting, two wavelengths
of interest are visualized as they are routed to different locations. 117
4.5 Inverse photonic design convergence example [Schubert et al.,
2022]. In (a), SurCo-zero smoothly lowers the loss while the
pass-through baseline converges noisily. Also, SurCo-hybrid quickly
fine-tunes an already high-quality solution. (b) visualizes the
SurCo-zero solution and (c) visualizes the two wavelengths of
interest which are successfully routed from the top to the bottom. 118
4.6 Nonlinear route planning visualization. The goal is to route from
the top left to bottom right corner, with the edge weights being
normally distributed, maximizing the probability of arriving before a set deadline. . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.7 Comparison of nonlinear route planning probability of arriving on time.
We compare against a domain heuristic [Nikolova et al., 2006] and SCIP
[Achterberg, 2009]. SurCo-zero outperforms the domain heuristic, and
is similar to SCIP using less time. SCIP-1s fails to find feasible solutions. 123
6.1 The pseudo-backdoor pipeline for a single
MIP instance, consisting of scoring S(P, B; θS) and classification
C(P, B; θC). First, k pseudo-backdoor sets of decision variables
B1, . . . , Bk are sampled according to the decision variables' LP
fractionality. These candidate pseudo-backdoor sets are ranked
according to the scoring module S(P, B; θS) to predict the best
pseudo-backdoor B*. The classification module then determines
whether to run the solver using B* or not, based on the predicted
pseudo-backdoor success C(P, B*; θC). . . . . . . . . . . . . . . 171
Abstract
Artificial Intelligence (AI) has the potential to impact many facets of our society largely due to its ability to quickly make high-quality data-driven decisions
at scale. We consider Artificial Decision Intelligence (ADI) to be a paradigm
for building artificial intelligence methods geared explicitly toward automatic
decision-making. In the rapidly evolving paradigms of machine learning (ML)
and combinatorial optimization (CO), remarkable progress has been made in different directions, revolutionizing how we synthesize insights from data as well
as how to best act on those insights. Machine learning, specifically deep learning,
with its ability to learn intricate patterns from seemingly unstructured data, has
seen profound success across diverse applications. Simultaneously, combinatorial optimization has made significant strides, efficiently performing industrial-scale decision-making by searching for optimal solutions from combinatorially
large and highly-structured search spaces. This thesis explores different perspectives on the tight integration of these two paradigms: machine learning and
combinatorial optimization, developing new tools that demonstrate the strengths
of both approaches for solving complex tasks. Taking different perspectives on
machine learning, combinatorial optimization, and how they can be combined
in a cohesive and complementary manner, we propose new methodologies that
enable end-to-end data-driven decision-making, deep predictive models that respect combinatorial constraints, methods that solve complex problems by learning to formulate simpler surrogate optimization problems, and optimization algorithms that learn from historical data to improve solver performance. The proposed methodologies contribute to the advancement of our capability in handling
new and complex real-world problems. Specifically, we demonstrate the impact
of our methodologies in several domains, such as identifying wildlife trafficking routes, designing photonic devices, large-scale recommendation systems, financial portfolio optimization, generating game levels, and smart energy grid
scheduling. Thus, this thesis serves as a step forward in artificial decision intelligence by solving complex tasks and providing decision-support tools that
leverage machine learning and combinatorial optimization.
Chapter 1
Introduction
“You think because you understand ‘one’ you must also understand ‘two’, because
one and one make two. But you must also understand ‘and’.” - Rumi
1.1 Learning and Optimization
Artificial Intelligence (AI) has gained considerable interest as it has the potential
to support people in a variety of domains, such as finance, transportation, energy
management, and more, by performing complex tasks automatically. AI systems
deployed in real-world decision-making settings frequently leverage different
techniques, such as learning patterns from data and identifying optimal decisions
from large search spaces. Machine learning (ML), and deep learning (DL) specifically, has achieved recent success in a wide range of applications by learning
complex patterns from unstructured data such as images, text, social networks,
spreadsheets, and more. Simultaneously, combinatorial optimization (CO) effectively solves many industrial-scale problems by finding the best solution from
vast but highly structured search spaces to optimize user-specified objectives.
In this thesis, we develop new tools that deeply integrate these paradigms to
yield methods exhibiting the strengths of both approaches. Taking different
perspectives on machine learning, combinatorial optimization, and how they
can be combined in a cohesive and complementary manner, we propose new
methodologies that enable end-to-end data-driven decision-making, deep predictive models that respect combinatorial constraints, methods that solve complex
problems by learning to formulate simpler surrogate optimization problems, and
optimization algorithms that learn from historical data to improve solver performance.
Machine learning enables AI systems to learn from historical data and progressively improve their performance on a given task without being explicitly
told how to perform said task. By deriving insights from copious amounts of
data, ML seeks to automatically detect patterns, relationships, and trends based
on feedback that signals how it performs and what it can do to improve. Furthermore, the goal is to train a model that performs the task generally well on
unseen data, not just the training data. This ability to learn from data allows AI
systems to handle complex tasks that are difficult to program explicitly. ML approaches generally consider training a model, essentially a scaffold or template
for the desired program. Training algorithms then learn the program’s specifics
to ensure that the program performs well on the task based on the available data.
Deep learning represents a modular and multi-modal paradigm where there are
many different types of components that handle different types of tasks, such as
understanding images, text, spreadsheets, and more. These models are designed
to output different data structures and seamlessly connect these components to
perform complex tasks by training the model to mimic its real-world
deployment. At its core, deep learning considers training very flexible models
by learning parameters using gradient descent, which iteratively updates the parameters in the most promising direction to improve the model. This training
procedure requires us to differentiate through the different modules, understanding how changes in the parameters affect changes in the model’s output.
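As a minimal illustration (a sketch with a hypothetical one-parameter model and invented numbers, not code from this thesis), the following composes two modules, a linear predictor and a squared loss, and applies the chain rule by hand to obtain the gradient used in each descent step:

```python
# Sketch only: gradient descent through two composed modules,
# a predictor y = w * x and a loss L = (y - t)^2.
# Backpropagation applies the chain rule dL/dw = (dL/dy) * (dy/dw).

def train(x, t, w=0.0, lr=0.1, steps=50):
    for _ in range(steps):
        y = w * x                   # forward pass through module 1 (predictor)
        loss = (y - t) ** 2         # forward pass through module 2 (loss)
        dL_dy = 2 * (y - t)         # backward: gradient of loss w.r.t. prediction
        dy_dw = x                   # backward: gradient of prediction w.r.t. w
        w = w - lr * dL_dy * dy_dw  # step in the most promising direction
    return w, loss

w, last_loss = train(x=2.0, t=6.0)  # the optimal parameter is t / x = 3.0
```

Deep learning frameworks automate exactly this bookkeeping for millions of parameters; the chapters that follow ask how to supply the analogous gradient when one of the composed modules is a combinatorial solver.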
Combinatorial optimization is a branch of mathematical optimization that
seeks to identify the optimal solution from a large set of possibilities in a highly
structured and explicitly constrained search space. These approaches naturally
arise in settings like operations or logistics, where the practitioner must figure
out the best way to route goods, preserve space for wildlife, or design an electrical
grid, among many other tasks. Underlying each of these settings is the space of
possible decisions, such as whether to purchase a specific plot of land for preservation, the amount of a good to route along a specific leg of the supply chain,
or whether to build a cable between two stations. Here, a given solution is the
complete set of decisions consisting of a fully specified supply chain, a list of the
plots of land to purchase, or the set of cables to build. Additionally, practitioners
define certain constraints a complete solution must satisfy, such as ensuring that
the supply chain meets all customer demand, the lots purchased must be contiguous, or the electrical grid must be connected. Lastly, an objective function
quantifies the quality of a solution, such as the efficiency of a supply chain, the
cost of a conservation plan, or the reliability of an electrical grid. The goal is
to find the complete solution that optimizes the objective function while satisfying the constraints. CO provides various mathematical frameworks to represent problems and often employs algorithmic techniques to explore the solution
space efficiently by dividing up the problem and leveraging mathematical guarantees to rule out certain areas of the search space. Frameworks such as linear
programming, integer programming, and mixed integer linear programming are
common general-purpose tools used to solve industrial-scale problems like those
mentioned above. Fundamentally, combinatorial optimization presents tools to
deal with problems where the task at hand requires searching a well-defined
space for solutions that satisfy a number of intricate constraints and optimize
numerical objectives.
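These three ingredients, a decision space, constraints, and an objective, can be made concrete with a toy 0/1 knapsack (a sketch with made-up data, not a formulation from this thesis). Exhaustive enumeration is shown only for clarity; MIP solvers instead prune the space with techniques such as branch-and-bound and LP relaxation bounds:

```python
from itertools import product

# Illustrative only: a 0/1 knapsack with hypothetical data, showing the
# decision space {0,1}^n, a feasibility constraint, and a linear objective.

values   = [10, 13, 7, 8]  # objective coefficients (hypothetical)
weights  = [4, 5, 3, 2]    # constraint coefficients (hypothetical)
capacity = 9               # constraint right-hand side

best_val, best_x = -1, None
for x in product([0, 1], repeat=len(values)):           # decision space
    weight = sum(w * xi for w, xi in zip(weights, x))
    if weight <= capacity:                              # feasibility check
        val = sum(v * xi for v, xi in zip(values, x))   # objective value
        if val > best_val:                              # keep the incumbent
            best_val, best_x = val, x
```

Here the optimal solution packs items 0, 2, and 3 for a total value of 25; the enumeration over 2^n assignments is exactly the combinatorial explosion that the structured search techniques above are designed to avoid.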
1.2 Perspectives on Integrated Machine Learning
and Combinatorial Optimization
In this thesis, we explore how the integration of machine learning and combinatorial optimization can provide new types of tools that synergistically combine the strengths of both paradigms. We
propose several perspectives on how joint machine learning and combinatorial
optimization models can solve new classes of problems. By following these perspectives, we develop new methodologies that enable end-to-end data-driven
decision-making, deep predictive models that respect combinatorial constraints,
methods that solve complex problems by learning to formulate simpler surrogate
optimization problems, and optimization algorithms that learn from historical
data to improve solver performance. Each approach requires rethinking various
components in both ML and CO to take a unified perspective that draws inspiration from each paradigm individually but requires novel solution methodology.
Ultimately, our goal is to solve complex tasks and provide decision-support tools
requiring new approaches that leverage machine learning and combinatorial optimization.
1.2.1 Combinatorial Optimization as
Deep Learning Modules
Our first perspective is to view combinatorial optimization as modules that can be
integrated into deep learning pipelines. The motivation here is that once we can
use optimization inside deep learning ecosystems, we can tackle a variety of new
tasks that require the structure and constraints of combinatorial optimization, as
well as the ability of deep learning to handle diverse data types.
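A toy numerical example (with invented numbers, and a deliberately trivial "optimizer" that just picks the asset with the highest predicted return) previews the core tension the decision-focused work below addresses: the prediction vector with the lower MSE does not necessarily induce the better decision once it is fed to an optimizer:

```python
# Sketch only: prediction accuracy versus decision quality, with
# hypothetical returns and a trivial argmax "optimizer".

def mse(pred, true):
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

def realized_return(pred, true):
    pick = max(range(len(pred)), key=lambda i: pred[i])  # optimize on predictions
    return true[pick]                                    # evaluate on true returns

true_returns  = [1.0, 1.1, 0.2]
pred_accurate = [0.9, 0.8, 0.3]  # lower MSE, but ranks asset 0 above asset 1
pred_decision = [0.5, 0.9, 0.1]  # higher MSE, but ranks asset 1 first

assert mse(pred_accurate, true_returns) < mse(pred_decision, true_returns)
assert realized_return(pred_decision, true_returns) > realized_return(pred_accurate, true_returns)
```

The less accurate model makes the better decision because it gets the ranking right where it matters; decision-focused learning trains the predictor directly for this downstream objective.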
Decision Focused Learning
Initially, our work in decision-focused learning
is motivated by settings in which we need to make decisions based on the outputs of a predictive model. We propose an approach for training the full prediction and optimization pipeline end-to-end by obtaining gradients through the
optimization model back to the predictive model. To illustrate this, consider the
setting of financial portfolio optimization, where we train a deep learning model
that predicts the next day’s stock prices. The price predictions are then used to
decide how to reallocate stocks to maximize revenue while minimizing risk using standard optimization techniques. Finally, the portfolio is deployed, and then
in the next time step, we observe the actual prices, which determine the realized
returns of the proposed portfolio. The core idea here is that there are a number of
ways to train the initial price prediction model. For instance, a standard approach
would be to train the model so that its predicted stock returns are, on average, as
close as possible to the realized returns. However, the model will invariably have errors, and
minor errors in the price predictions may significantly impact the output of the
downstream optimization problem. Thus, having close predictions on average
may lead to poor revenue. Instead, we consider training the machine learning
model such that when the price predictions are fed into the optimization model,
the resulting portfolio has as high a return as possible when evaluated with the
hidden stock prices. In order to train the predictive model, we need to perform
gradient-based updates of the model parameters, meaning we need to backpropagate from the realized return, to the suggested portfolio, through the optimization model to the predicted prices, and then back to the parameters of the deep
learning pipeline. Both the evaluation of the proposed portfolio and the inference
of the predictive model are readily available as differentiable modules. However,
differentiation through the optimization problem is nontrivial and has not previously been considered for broad classes of optimization problems. In order to
determine actionable gradients, we need to understand how changes in the predicted optimization inputs affect the resulting optimal solution, so that we can
identify the most promising directions in which to modify those predictions.
To enable differentiation of a broad range of combinatorial optimization
problems, we propose MIPaaL: Mixed Integer Program as a Layer in chapter 2,
which provides gradients for mixed integer linear programs to be used in these
sequential prediction and optimization settings. Ultimately, MIPaaL ensures that
the training procedure of sequential prediction and optimization pipelines mimics the real-world deployment and thus is better aligned with the true objective
of the system, whether it is the observed return of the portfolio strategy or the
revenue generated from smart energy management, among others. Overall, this
perspective considers the deep learning and combinatorial optimization models
as two steps in a pipeline that are combined and trained end-to-end.
Constrained Predictive Models In addition to better aligning the training
and deployment of prediction and optimization pipelines, we consider how differentiable combinatorial optimization models can flexibly enforce hard constraints on the outputs of machine learning models in an interpretable manner.
Specifically, we investigate the potential for tightly integrated machine learning
and combinatorial optimization for social impact by predicting likely wildlife
trafficking supply chain routes. The motivation here is that we want to predict
trafficking routes given known wildlife sourcing and consumption locations. In
addition, we want to ensure that the predictions meet certain structural constraints, such as ensuring that the predicted routes are simple paths, meaning
that they do not revisit intermediate stops and that they start and end in prespecified locations. While enforcing this directly would require convoluted deep learning architectures, we can readily achieve it by passing deep learning predictions
through an optimization solver. Specifically, we can consider the deep learning
model as predicting the likelihood of whether an individual edge will be used or
not and then pass these edge probabilities through a solver that finds the most
likely path satisfying the specified constraints. The overall goal is to ensure that
the predicted path matches observed paths as much as possible on the training
data. This approach represents a second perspective on the tight integration of
deep learning and combinatorial optimization. The optimization procedure can
operate as a layer that flexibly but explicitly enforces certain domain-specific
constraints. This perspective looks at the complete pipeline as a single predictive model that outputs structured objects satisfying combinatorial constraints.
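This pipeline can be sketched concretely. In the toy example below, the graph, edge probabilities, and node names are all hypothetical stand-ins for model outputs; the combinatorial layer is played by Dijkstra's algorithm, since maximizing the product of edge probabilities is equivalent to minimizing the sum of their negative logarithms, which also guarantees that the returned path is simple:

```python
import heapq
import math

def most_likely_path(edge_prob, source, target):
    """Return the path from source to target maximizing the product of
    predicted edge probabilities, via Dijkstra on -log(p) edge weights."""
    graph = {}
    for (u, v), p in edge_prob.items():
        graph.setdefault(u, []).append((v, -math.log(p)))
    dist = {source: 0.0}
    queue = [(0.0, source, [source])]
    while queue:
        d, u, path = heapq.heappop(queue)
        if u == target:
            return path
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(queue, (d + w, v, path + [v]))
    return None

# Hypothetical predicted edge probabilities on a four-node network.
probs = {("s", "a"): 0.9, ("a", "t"): 0.5, ("s", "b"): 0.6, ("b", "t"): 0.9}
print(most_likely_path(probs, "s", "t"))
```

Note that the route s→b→t wins (probability 0.54 versus 0.45 for s→a→t) even though its first edge is less likely: the solver reasons about whole paths, which a purely local decoding of the edge probabilities would not.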
Learning Efficient Surrogates for Hard Problems Looking at things from
the optimization point of view, we are solving an optimization problem where
some parameters of the optimization model, such as parameters of the objective
function, are outputs of a flexible learning model. Taking this perspective allows
us to develop learning-based optimizers where we can tune specific components
of the optimization problem to output useful solutions. Here, we are motivated
by settings where we are trying to solve combinatorial nonlinear optimization
problems, which are notoriously difficult to solve at scale using general-purpose
optimization methods. Specifically, we consider solving these nonlinear problems by learning surrogate linear problems that are comparatively easier to solve.
The surrogate linear problem acts as a learning-enabled solver: we learn its
linear objective coefficients such that solving the surrogate
linear optimization yields solutions of high quality when evaluated with the original nonlinear objective. The optimization model ensures
that the solutions meet the combinatorial constraints, and the learning model is
trained to point the surrogate linear solver toward high-quality solutions.
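A scaled-down sketch makes this concrete. Everything below (the item values, the redundancy penalty, and the random-search trainer standing in for gradient-based learning) is illustrative: the nonlinear objective penalizes selecting two items from the same group, while the surrogate is a plain linear objective whose "solver" simply takes the top-K items by learned weight:

```python
import itertools
import random

values = [5.0, 4.8, 3.0, 2.5]   # standalone item values (assumed)
group = [0, 0, 1, 1]            # items in the same group are redundant
K = 2                           # select exactly K items

def nonlinear_objective(chosen):
    """Hard objective: total value minus a redundancy penalty for every
    pair of selected items from the same group."""
    total = sum(values[i] for i in chosen)
    for i, j in itertools.combinations(chosen, 2):
        if group[i] == group[j]:
            total -= 4.0
    return total

def surrogate_solve(w):
    """Easy surrogate: maximize w^T x subject to picking K items, i.e.,
    take the top-K items by learned weight."""
    return tuple(sorted(sorted(range(len(w)), key=lambda i: -w[i])[:K]))

# Using the true values as surrogate coefficients ignores redundancy ...
naive = surrogate_solve(values)
# ... so we tune the coefficients against the nonlinear objective instead
# (random search here stands in for gradient-based training).
random.seed(0)
best_w, best_val = values[:], nonlinear_objective(naive)
for _ in range(200):
    w = [random.uniform(0.0, 5.0) for _ in values]
    val = nonlinear_objective(surrogate_solve(w))
    if val > best_val:
        best_w, best_val = w, val
print(naive, nonlinear_objective(naive), surrogate_solve(best_w), best_val)
```

The naive coefficients select the two redundant high-value items, while the learned coefficients steer the linear solver to the mixed-group selection that is optimal under the nonlinear objective.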
Generating Diverse Combinatorial Solutions Treating the highly structured optimization problem as a differentiable module enables us to use it as
a flexible layer in generative models. Specifically, we consider using these differentiable optimization layers to train a generative model that generates diverse
combinatorial structures. Using a combinatorial optimization model as the last
step in a generative pipeline, we can ensure that the objects being generated meet
certain domain-specific requirements in a tunable manner. For instance, when
generating game levels, we want to ensure not only that the generative model is
aligned with the data distribution but also that the generative model is guaranteed to output solutions that meet certain constraints, such as ensuring that the
game level is solvable, or that the level has a specific number of enemies, walls,
and doors among other constraints. Additionally, since the optimization model
is differentiable, we can train the generative model to generate high-quality solutions with respect to some differentiable objective function, as is the case when
generating photonic devices that must split and route light from one location to
another.
These various perspectives enable novel approaches which integrate machine
learning and combinatorial optimization. The deep integration of these two
paradigms can enhance AI systems’ accuracy, efficiency, and versatility in real-world applications. By approaching the merger from different angles: training
predictive models jointly with downstream decision-making algorithms, utilizing combinatorial optimization to enforce constraints on deep learning systems,
learning efficient surrogates for complex problems, and generating diverse combinatorial solutions, we unleash the potential of hybrid models that exploit the
strengths of both paradigms. Each perspective presents a different yet complementary path to harness the power of machine learning’s pattern recognition
and combinatorial optimization’s structured problem-solving approaches. The
tools and techniques serve as stepping stones towards the broader goal of creating robust and adaptable AI systems capable of tackling multifaceted real-world
problems in a wide range of domains that require a multitude of skill sets.
1.2.2 Learning-Accelerated Optimizers
In addition to viewing combinatorial optimization as modules within deep learning frameworks, we also consider perspectives on improving the efficiency of
existing combinatorial optimization solvers using machine learning. While industrial-scale solvers can handle many real-world optimization problems, several
internal algorithmic decisions, parameters, and methods are determined heuristically. Tuning these components for a specific class of problems often yields
significant performance improvement depending on the task. However, tuning
relevant components requires extensive expert knowledge of the internals of the
solver, how to modify them, and how to understand the solver feedback to guide
tuning. To tackle this, we investigate machine learning approaches to set the
solver internals automatically based on data.
Traditionally, an optimization algorithm follows a predefined set of rules or
steps to find the optimal solution. However, the performance of such an algorithm can significantly depend on the structure and specifics of the problem it
is trying to solve. A significant challenge in optimization is determining which
decisions to make first so that the rest of the problem is comparatively easy to
solve. Intuitively, if we can determine which decisions constitute the “core hardness” of the problem, we can first make those hard decisions to simplify the rest
of the problem.
With its ability to model complex patterns from data, machine learning can be
harnessed to identify these core hard sets. Inspired by the concept of strong backdoors, small sets of decision variables that, when branched on,
guarantee an optimal integral solution and a proof of optimality, we introduce the
concept of pseudo-backdoors. Pseudo-backdoors correspond to a subset of decision variables such that prioritizing branching on them leads to faster solve
times.
While strong backdoors are hard to identify due to their dependence on the
intricate structure of the problem instance, the pseudo-backdoor can be seen
as a relaxation tied to runtime rather than to a strict optimality property. This
makes pseudo-backdoors easier to learn because the supervision signal provides
dense feedback. Additionally, this directly aligns the goal of the pseudo-backdoor
with that of the practitioner, who likely cares more about runtime than other
properties. We propose a method to learn to estimate the solver performance of
a candidate pseudo-backdoor and decide whether or not to use it. This method
can identify high-quality pseudo-backdoors on unseen MIP instances, given a
distribution of MIP problems.
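A minimal sketch of this decision rule follows; the runtime scorer is a stub linear function standing in for the learned model, and the candidate features (subset size and average LP-relaxation fractionality) are assumptions made for illustration:

```python
def predicted_runtime(features):
    """Stub scorer: in the actual method this is a trained model over
    features of the candidate subset; the coefficients here are made up."""
    size, avg_fractionality = features
    return 2.0 + 0.3 * size - 1.5 * avg_fractionality

def choose_backdoor(candidates, default_runtime):
    """Use the candidate pseudo-backdoor with the lowest predicted solve
    time, falling back to plain branching when none looks faster."""
    best = min(candidates, key=lambda c: predicted_runtime(c["features"]))
    if predicted_runtime(best["features"]) < default_runtime:
        return best["vars"]
    return None  # no candidate beats the default solver configuration

# Hypothetical candidate subsets of MIP decision variables.
candidates = [
    {"vars": {"x1", "x4"}, "features": (2, 0.9)},
    {"vars": {"x2", "x3", "x7"}, "features": (3, 0.4)},
]
print(choose_backdoor(candidates, default_runtime=3.0))
```

The second question the text raises, whether to use a candidate at all, is handled by the fallback branch: if even the best-scoring subset is predicted to be slower than the default configuration, the solver branches normally.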
Our research investigates the extent to which machine learning can improve the efficiency of optimization algorithms. The proposed learning-accelerated optimizers both identify high-quality pseudo-backdoors and decide whether to use them, improving overall runtime.
Chapter 2
MIPaaL: Mixed Integer Program as a Layer
“The whole is greater than the sum of its parts.” - Aristotle
Machine learning components commonly appear in larger decision making
pipelines; however, the model training process typically focuses on training to
minimize a loss that measures average accuracy between predicted values and
ground truth values. Decision-focused learning explicitly integrates the downstream decision problem when training the predictive model, in order to optimize
the quality of decisions induced by the predictions. It has been successfully applied to several limited combinatorial problem classes, such as those that can
be expressed as linear programs (LP), and submodular optimization. However,
these previous applications have uniformly focused on problems with simple
constraints. Here, we enable decision-focused learning for the broad class of
problems that can be encoded as a mixed integer linear program (MIP), hence
supporting arbitrary linear constraints over discrete and continuous variables.
We show how to differentiate through a MIP by employing a cutting planes solution approach, an algorithm that iteratively tightens the continuous relaxation
by adding constraints removing fractional solutions. We evaluate our new end-to-end approach on several real-world domains and show that it outperforms the
standard two phase approaches that treat prediction and optimization separately,
as well as a baseline approach of simply applying decision-focused learning to
the LP relaxation of the MIP. Lastly, we demonstrate generalization performance
in several transfer learning tasks.
2.1 Introduction
We propose a method of training predictive models to directly optimize the quality of decisions that are made based on the model’s predictions. We are particularly interested in decision-making problems that take the form of mixed integer
programs (MIPs) because they arise in settings as diverse as electrical grid load
control Mohsenian-Rad and Leon-Garcia [2010], RNA string prediction Sato et al.
[2011], and many other industrial applications Nemhauser [2013]. MIPs naturally
arise in so many settings largely due to their flexibility, computational complexity (ability to capture NP-hard problems), and interpretability. In many practical
situations it is often necessary to predict some component (e.g., the objective)
of the MIP based on historical data, such as estimated demand O’Mahony and
Shmoys [2015], price forecasts Demirović et al. [2019], or patient readmission
rate Chan et al. [2012]. Alternatively, practitioners may use a MIP to enforce that
the outputs of the predictions meet semantically meaningful objectives such as
ensuring predictions result in making fair decisions downstream Benabbou et al.
[2018], Trilling et al. [2006], Warner [1976]. In these settings with a prediction and
optimization component, the predictive models are often trained without regard
for the downstream optimization problem. For example, the mean squared error
treats errors on different examples as equally important. However, in practice it may be more important to distinguish between some values
rather than others, meaning average error metrics may ignore impactful distinctions. Consider predicting item costs for a unit-weight knapsack setting, where
we try to select 5 items minimizing total cost. Training a model to minimize
mean squared error will concentrate model capacity on getting all values correct
on average, and in fact multiplicative errors for large values will dominate the
loss function. However, in this setting, we would rather dedicate capacity to correctly predicting cost for items which are potentially in the cheapest set of items
over capturing the nuances between items that are unlikely to be in our set. As
such, good average predictions may not lead to high quality decisions.
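This argument can be checked numerically. In the scaled-down sketch below (four items with made-up costs, selecting two instead of five), predictor B achieves a lower mean squared error than predictor A yet induces a strictly worse selection, because its small errors happen to swap the ranking of a cheap and a mid-priced item:

```python
def knapsack_pick(costs, k=2):
    """Select the k items with the lowest (predicted) cost."""
    return sorted(sorted(range(len(costs)), key=lambda i: costs[i])[:k])

def mse(pred, true):
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

true_costs = [1.0, 2.0, 3.0, 4.0]
pred_a = [1.5, 2.5, 3.5, 4.5]  # uniform shift: larger MSE, correct ranking
pred_b = [1.0, 2.5, 2.4, 4.0]  # smaller MSE, but items 1 and 2 swap ranks

for name, pred in [("A", pred_a), ("B", pred_b)]:
    picked = knapsack_pick(pred)
    realized = sum(true_costs[i] for i in picked)
    print(name, mse(pred, true_costs), picked, realized)
```

Predictor A (MSE 0.25) still picks the two truly cheapest items at a realized cost of 3.0, while the better-on-average predictor B (MSE 0.1525) pays 4.0.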
Decision-focused learning, first introduced for a financial criterion in Bengio
[1997] and extended to the more general quadratic programs in Amos and Kolter
[2017], and linear programs in Wilder et al. [2019a], explicitly and automatically
integrates the downstream decision problem when training the predictive model,
in order to optimize the quality of decisions induced by the predictive model. In
the commonly used gradient-based predictive models, the central challenge is
in passing gradients back to give the predictive model an indication of how it
should shift its weights in order to improve decision quality of the resulting optimal solution. The discrete and discontinuous solution space that makes the MIP
so widely applicable also prevents us from easily differentiating through it, as
has been done for embedding continuous optimization problems in neural networks Amos and Kolter [2017], Wilder et al. [2019a], de Avila Belbute-Peres et al.
[2018]. Our approach for computing gradients relies on the fact that we can algorithmically generate a continuous surrogate for the original discrete optimization problem. We employ previous work on cutting plane approaches, which
can tighten a MIP relaxation by iteratively solving a continuous relaxation and
cutting off the discovered integer-infeasible solution until an integer-feasible solution is found Gomory [1960]. The final continuous, convex optimization
problem can then be used for backpropagation by differentiating the KKT conditions Karush [1939] of the continuous surrogate, as has been suggested for convex problems Amos and Kolter [2017]. While pure cutting plane approaches are
often slower than alternate branch-and-bound MIP solvers in practice Dash et al.
[2014], we note that our approach only needs the cutting plane methodology for
backpropagation during training. Indeed, at test time, we can make predictions
and find optimal decisions based on those predictions using any state-of-the-art
MIP solver, ensuring the running time in deployment is exactly the same as if
any other training method were used. Due to the computational complexity of backpropagating through large optimization problems, we analyze approaches that
stop cut generation after a fixed number of cuts have been generated, trading off
the tightness of the differentiable solver with improved training runtime. Finally,
we compare against a decoupled learning and optimization approach that simply
relies on training using a relevant classification or regression loss function.
We demonstrate the effectiveness of our approach on three real-world MIP problem domains concerning investment portfolio optimization, bipartite matching with diversity constraints, and energy-related knapsack problems, showing significant improvements in solution quality over the baseline. We then evaluate our method’s ability to generalize to unseen tasks with differing problem
sizes and data distributions.
2.2 Problem description
We consider problems that combine learning with an optimization problem that
can be modeled as a mixed integer program (MIP). Specifically, each instance of
the problem is a triple (ϕ, c, D), where ϕ is a feature vector, c is the objective coefficient vector of a MIP, and D represents additional known data that plays a role
in the downstream optimization. For a given MIP instance, D includes the
constraint coefficients A, the right-hand-side constants b, and the set of integral variables I,
where the constraints encode data about individual
instances and can vary from one problem instance to another. If c were known
a priori, we could simply use branch and bound to solve the corresponding MIP:
min_x { cᵀx | Ax ≤ b, x_i ∈ Z ∀ i ∈ I }; however, we consider the setting where c is
unknown and must be estimated from ϕ. We assume that we observe training instances {(ϕ_1, c_1, D_1), …, (ϕ_m, c_m, D_m)} drawn from some distribution, and we
need to make decisions at test time using only features ϕ and known parameters
D. We will use the training data to train a predictive model fθ (where θ denotes
the internal parameters of the model) which outputs an estimate fθ(ϕ) = ĉ on
a test-time instance. Once we have estimates ĉ, an optimal solution to the estimated problem x*(ĉ; A, b, I) can be obtained. The quality of this solution with
respect to the real coefficients c we observe after the fact is then cᵀx*(ĉ; A, b, I).
The standard two-stage approach in this setting is to train the machine learning model fθ to minimize a loss that is generally an average distance between
predicted values ĉ and ground-truth values c. However, our overall objective is
to find model parameters θ which yield predictions ĉ that directly maximize the
quality of the MIP solution for ĉ, evaluated with respect to the (unknown) ground-truth
objective c.
2.3 MIPaaL: Encoding MIP in a Neural Network
We formulate the MIP as a differentiable layer in a neural network which takes
objective coefficients ĉ as input and outputs the optimal MIP solution. Formally,
we consider the optimal solution x*(ĉ; A, b, I) of the MIP as a function of the
input coefficients ĉ given linear constraints on the feasible region Ax ≤ b and
the set of integral variables I. We write a functional form of the layer as:

x*(ĉ; A, b, I) = argmin_x  ĉᵀx
                 subject to  Ax ≤ b
                             x_i ∈ Z  ∀ i ∈ I        (2.1)
We can perform a forward pass given input objective coefficients, which are
potentially outputs of a neural network, and feasibility parameters of the MIP
using any solver.
Standard practice in this setting is to first train a model fθ to predict the coefficients based on embeddings ϕ_i of the different predicted components such that
on average the model predictions ĉ = fθ(ϕ) are not far from the ground
truth objective coefficients c. Then, decisions are made during deployment based
on the predicted values by finding the optimal solution with respect to the predicted values, x*(ĉ; A, b, I), based on the formulation in Equation 2.1.
We introduce an alternative approach that incorporates the above layer into
the training pipeline, instead of only calling the forward pass at deployment. To
do so, we need to provide methods to compute the forward and backward passes.
While forward propagation in this setting is straightforward using standard MIP
solvers, the highly nonconvex and discrete structure of the MIP, which enables its
flexibility, seems to render gradient computation untenable. Indeed, the optimization problem (2.1) as written is nondifferentiable since even infinitesimal changes
to cˆ can drastically change the optimal solution. Previous work has explored differentiation through optimization, but largely for problem classes which are significantly smoother. Amos and Kolter show how to differentiate through convex
quadratic problems [Amos and Kolter, 2017], which have only continuous variables and whose solutions are natively differentiable. Wilder et al. build on this
work to tackle linear programs, which may have discontinuous solutions (but
lack integer variables), via the addition of quadratic smoothing [Wilder et al.,
2019a]. However, both of these problem classes lack the fundamental representational power of MIPs, which are NP-hard to solve and hence often employed
as a general-purpose engine for hard computational problems (while QPs and
LPs are easily solved in polynomial time). This power comes at a price: MIPs
are fundamentally connected to discrete, nonconvex structures via the integer
variables I, which do not easily admit a differentiable surrogate.
Our solution to this dilemma draws on algorithmic techniques for computing a new continuous relaxation for each individual ĉ which is much closer to the
discrete problem than if we naively relaxed the integer variables. We then use
the continuous-domain techniques developed previously to differentiate through
this relaxation during training time. In particular, we use a pure cutting plane
approach which tightens the LP relaxation (at the expense of potentially generating a large number of cutting planes along the way) [Bodur et al., 2017, Wolsey
and Nemhauser, 2014].
A pure cutting plane approach iteratively solves the linear programming relaxation of the current problem. If the found solution is integral then the algorithm terminates since the found solution is both feasible to the original MIP and
optimal for a relaxation of the original problem. Otherwise, a cut is generated
which removes the discovered fractional solution and leaves all feasible integral
solutions. Since the individual cuts do not remove any integral solutions, the final LP retains all integral solutions. Ideally, we would describe the convex hull of
the integral solutions yielding an exponentially large linear program equivalent
to the original MIP, thus ensuring that all potentially optimal integral solutions
lie on extreme points of the feasible region. In practice, finding the convex hull
of MIP solutions is not just NP-Hard but often numerically intractable. Instead,
we can obtain cutting planes that tighten the feasible region via Gomory cuts or
other globally valid cuts Gomory [1960], Balas et al. [1996]. We then consider the
problem with generated cuts Sx ≤ t, and write out the following linear program:

minimize_x  ĉᵀx
subject to  Ax ≤ b
            Sx ≤ t        (2.2)
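For intuition about where the rows of Sx ≤ t come from, the classic Gomory fractional cut can be read off one row of the optimal simplex tableau: if x_B + Σ_j ā_j x_j = b̄ with b̄ fractional, every integral solution satisfies Σ_j frac(ā_j) x_j ≥ frac(b̄). The sketch below applies this textbook construction to a made-up tableau row (it is not CPLEX's implementation):

```python
import math

def gomory_cut(row_coeffs, rhs):
    """Derive a Gomory fractional cut from one simplex tableau row.

    frac(v) = v - floor(v) is applied coefficient-wise; the resulting cut
    removes the current fractional vertex but no integer-feasible point."""
    frac = lambda v: v - math.floor(v)
    return [frac(a) for a in row_coeffs], frac(rhs)

# Hypothetical tableau row: x_B + 0.5*x3 - 1.25*x4 = 2.75
coeffs, rhs = gomory_cut([0.5, -1.25], 2.75)
print(coeffs, rhs)  # the cut reads 0.5*x3 + 0.75*x4 >= 0.75
```

Because frac(·) maps -1.25 to 0.75, even negative tableau coefficients yield nonnegative cut coefficients, which is what makes the cut globally valid for all integral solutions.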
Given this continuous optimization problem, we can now find the gradient of the
optimal solution with respect to the input parameters by differentiating through
the KKT conditions, i.e., the optimality conditions for Equation 2.2. Differentiation through
the continuous LP is done via the quadratic smoothing approach proposed in
Wilder et al. [2019a] for linear programs based on differentiating the KKT conditions of a quadratic program as shown in Amos and Kolter [2017].
In our setting, we can compute dx/dθ using the gradient backpropagated to the
MIP layer, d∇_x f(x, θ)/dθ, along with the optimal primal solution x̂ and the dual solutions λ and µ corresponding to the constraints Ax ≤ b and Sx ≤ t, respectively. Using
the quadratic smoothing term γ∥x∥²₂ in the objective, as proposed by Wilder et al.
[2019a], we can find the gradient of the solution x with respect to the parameters
θ by solving the following system of equations for dx/dθ:

[ −2γ       Aᵀ            Sᵀ          ]   [ dx/dθ ]   [ d∇_x f(x, θ)/dθ ]
[ D(λ)A     D(Ax̂ − b)     0           ] · [ dλ/dθ ] = [ 0               ]
[ D(µ)S     0             D(Sx̂ − t)   ]   [ dµ/dθ ]   [ 0               ]

with D(·) being shorthand for diag(·).
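A one-dimensional sketch shows both why the smoothing term is needed and what the resulting gradient looks like. Without the quadratic term, the LP solution is piecewise constant in ĉ, so its derivative is zero almost everywhere; with the term γx² added, the solution map is differentiable in the interior and a finite-difference check recovers the analytic derivative −1/(2γ). All constants below are hypothetical:

```python
def smoothed_argmin(c_hat, gamma=0.1, lo=0.0, hi=1.0):
    """Minimize c_hat*x + gamma*x**2 over [lo, hi], in closed form."""
    x = -c_hat / (2.0 * gamma)  # unconstrained minimizer
    return min(max(x, lo), hi)

def analytic_grad(gamma=0.1):
    # In the interior, the (1-D) KKT system gives dx*/dc_hat = -1/(2*gamma).
    return -1.0 / (2.0 * gamma)

c_hat, eps = -0.05, 1e-6
fd = (smoothed_argmin(c_hat + eps) - smoothed_argmin(c_hat - eps)) / (2 * eps)
print(smoothed_argmin(c_hat), fd, analytic_grad())
```

The finite difference and the analytic value agree (both are -5.0 here), confirming that the smoothed layer passes a usable gradient back to the predictive model wherever the clip is inactive.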
2.4 Decision-Focused Learning with MIPaaL
Having developed a differentiable layer to solve MIPs, we can incorporate this
layer into the broader pipeline of a machine learning problem in order to train
end-to-end for high test-time optimization performance. Specifically, we define
the loss function to be the solution quality of the predicted optimal decision
x*(ĉ; A, b, I) output by the model (which now includes the MIPaaL layer) with
respect to the ground truth objective coefficients c. In other words, we can use
the MIPaaL formulation to train the model to directly minimize the deployment
objective. We instantiate this method using a neural network parameterized by
θ where fθ predicts the objective coefficients ĉ based on embeddings ϕ of the
decision variables. The forward pass, shown in Equation 2.3, calls fθ to predict
the objective coefficients, and then uses the MIPaaL cutting plane solver to generate the LP corresponding to the original MIP, along with the resulting decision
x*. We then set the loss to be the solution quality of the returned decision with
respect to the ground truth coefficients c.
ĉ := fθ(ϕ)        (2.3a)
x̂ := x*(ĉ; A, b, I)  via Equation 2.1        (2.3b)
loss(ĉ, c) := cᵀx̂        (2.3c)
Since the dot product in Equation 2.3c and the prediction of the neural network
in Equation 2.3a are differentiable functions of their inputs, the parameters of
the neural network predictor θ can be trained via backpropagation using the KKT
conditions of the computed surrogate LP found in Equation 2.2, relying on the
small quadratic regularization term for the LP proposed in Wilder et al. [2019a]
to enforce strong convexity and perform backpropagation through Equation 2.3b.
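The full loop of Equations 2.3a-2.3c can be mimicked on a one-dimensional toy. Everything here is a hypothetical stand-in: a scalar predictor fθ(ϕ) = θϕ, a clipped closed-form minimizer playing the role of the smoothed optimization layer, and a single hidden coefficient c. Gradient descent through the layer drives the induced decision toward the one that is best under the true c:

```python
GAMMA = 0.5

def decision(c_hat):
    """Smoothed decision layer: argmin of c_hat*x + GAMMA*x**2 on [0, 1]."""
    return min(max(-c_hat / (2 * GAMMA), 0.0), 1.0)

def decision_grad(c_hat):
    """dx*/dc_hat: -1/(2*GAMMA) in the interior, 0 where the clip binds."""
    x = -c_hat / (2 * GAMMA)
    return -1.0 / (2 * GAMMA) if 0.0 < x < 1.0 else 0.0

phi, c_true = 1.0, -1.0  # one feature; hidden objective coefficient
theta, lr = -0.01, 0.1   # scalar predictor: c_hat = theta * phi

losses = []
for _ in range(100):
    c_hat = theta * phi                          # Equation 2.3a
    loss = c_true * decision(c_hat)              # Equations 2.3b-2.3c
    grad = c_true * decision_grad(c_hat) * phi   # chain rule through layer
    theta -= lr * grad
    losses.append(loss)
print(losses[0], losses[-1])
```

The loss falls monotonically from -0.01 to -1.0, the best achievable decision quality on this toy, without the predictor ever being told the hidden coefficient directly; it learns only through the decision layer's gradient.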
2.5 MIPaaL Variants
MIPaaL - k cuts: Given that the cut generation process is time consuming and
must be done for each forward pass, we examine the tradeoff of decision quality
when stopping cut generation after the first k cuts have been generated. We
experiment with two settings (k = 100 and k = 1000) to determine how the
tightness of the cut generation process impacts the decision quality at test time.
Note that in the case where k = 0 cuts are generated, this method is equivalent to
just using the LP relaxation of the original MIP where all integrality constraints
on decision variables are dropped.
MIPaaL-Warm: Warm starting the training process with a pretrained network may enable MIPaaL to take advantage of already learned representations.
The predictive model feeding into MIPaaL can be initialized with any similarly
parametrized model; as a result, we can initialize the predictive input of MIPaaL
with a model trained to predict objective coefficients by minimizing a standard
ML loss. The resulting model may yield better decision quality than the initialization and retain good predictive performance.
MIPaaL-Hybrid: We consider a hybrid loss balancing MIP decision quality
and predictive performance. The hybrid approach is intended to simultaneously
improve decision quality and minimize a standard ML loss, like cross-entropy
or mean squared error, loss_ML(ĉ, c), between the neural network predictions ĉ and
the ground truth coefficients c. We define the hybrid loss as a weighted sum of the MIP-based loss and the ML loss, weighted by a scalar α:

loss_total(ĉ; A, b, I) = cᵀx*(ĉ; A, b, I) + α · loss_ML(ĉ, c).        (2.4)
2.6 Empirical Evaluation
We instantiate MIPaaL for a range of tasks which require prediction followed by
optimization with the overall goal of improving the objective value upon deployment. Specifically, we run experiments on combinatorial portfolio optimization,
diverse bipartite matching and knapsack instances. The portfolio optimization
setting accounts for various combinatorial constraints enforcing a small number
of assets in the portfolio and limiting rebalancing transactions to maximize the
overall predicted return of the portfolio. Diverse bipartite matching enforces diversity in the types of recommended matches to maximize an overall predicted
utility of the suggested pairing. Knapsack instances select times to sell energy
given a limit on when energy can be sold. In each setting, the predictive problem
is nontrivial since the features do not contain much signal for predicting objective coefficients. However, we demonstrate that in these settings, we can train
a predictive model whose outputted objective coefficients yield high-quality decisions after optimization. Experimental details and extended results are in the
appendix∗
.
2.6.1 Benchmark problems
Portfolio Optimization has been heavily studied in financial settings. Given
a set of assets to choose from and predicted price changes for the next period,
the objective is to invest money among assets to maximize return, while limiting
risk from uncertainty. Practitioners may also want to limit overexposure to any
single financial sector or limit order quantity as in Bertsimas et al. [1999]. In this
setting, the prediction problem of forecasting the next period’s returns is generally a difficult learning task. To tackle this, Bengio [1997] proposes training a
model to maximize the objective value obtained by a convex portfolio optimization problem given by Markowitz [1952]. We approach the problem considering
a MIP criterion based on the formulation given by Bertsimas et al. [1999].
∗https://tinyurl.com/aaai2020-mipaal
In the combined prediction and optimization problem, the next period’s returns (i.e., the objective) are unknown and predicted from historical prices and
features from Quandl’s WIKI and FSE datasets Quandl [2019]. We evaluate on
the SP500, a collection of the 505 largest companies representing the American
market, and the DAX, a set of 30 large German companies. We split the data temporally, training, validating, and testing on data from periods Jan ’05-Dec ’10, Jan
’11-Nov ’13, and Dec ’13-Nov ’16 respectively. Details regarding data collection
are in the appendix∗.
Diverse Bipartite Matching: Bipartite matching is used in many applications, where the success probability of a particular match is often estimated from
data. Without additional constraints, bipartite matching can be formulated as an
LP; a setting previously used for decision-focused learning Wilder et al. [2019a]
since the LP relaxation gives exact integral solutions. In practice though, matching problems are often subject to additional constraints. For instance, finding fair
and high-quality housing allocations or kidney matching with additional contingency plans Benabbou et al. [2018], Dickerson et al. [2016] require additional
modeling which makes the integer problem NP-hard.
We use the problem of bipartite matching with added diversity constraints,
which enforce a maximum and minimum bound on the percent of edges selected
with a specified property. We use the experimental setup of Wilder et al. [2019a],
who did not include diversity constraints. Specifically, the matching instances
are constructed from the CORA citation network Sen et al. [2008] by partitioning
the graph into 27 instances on disjoint sets of nodes (split into train, test and
validation). Diversity constraints are added to enforce that at least some percent
of edges selected are between papers in the same field and some percent are
between papers in different fields.
Knapsack concerns choosing from a set of items each with a value and a
weight. The objective is to maximize the total value of items selected, while
not exceeding a budget on total weight. We consider a setting where the item
values are not known at the time of decision-making, and must be predicted
from data. Our knapsack dataset corresponds to unit commitment problems.
In those problems, we plan when to sell energy at half-hour intervals over a day while limiting operating costs: we aim to maximize revenue, but must predict the half-hourly prices. We use real-world energy-aware
scheduling data from Demirović et al. [2019] which consist of historical energy
data and prices from the ICON energy-aware scheduling competition. The time
period weights are drawn uniformly between 0 and 1, and the budget is 10% of
the total weight. In Demirović et al. [2019], the authors describe and analyze an
algorithm for exhaustively searching linear models that yield good knapsack performance; however, those results apply to linear predictive components feeding into knapsack instances, and do not directly extend to neural network models.
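The underlying optimization in this benchmark is a standard 0/1 knapsack, which can be solved exactly by dynamic programming when weights are integral. The sketch below is illustrative only: the values and integer weights are made up (the benchmark's weights are real-valued uniform draws), with predicted prices as item values and a budget of 10% of total weight.

```python
def knapsack(values, weights, budget):
    """0/1 knapsack via dynamic programming (integer weights).

    Returns (best_value, chosen_indices)."""
    n = len(values)
    dp = [0.0] * (budget + 1)            # dp[w] = best value with capacity w
    keep = [[False] * (budget + 1) for _ in range(n)]
    for i in range(n):
        # iterate capacities downward so each item is used at most once
        for w in range(budget, weights[i] - 1, -1):
            cand = dp[w - weights[i]] + values[i]
            if cand > dp[w]:
                dp[w] = cand
                keep[i][w] = True
    # backtrack to recover the selected items
    chosen, w = [], budget
    for i in range(n - 1, -1, -1):
        if keep[i][w]:
            chosen.append(i)
            w -= weights[i]
    return dp[budget], sorted(chosen)

# Toy instance: 6 "half-hour periods" with predicted prices as values,
# integer weights, and budget = 10% of total weight (rounded).
values = [60, 100, 120, 30, 85, 40]
weights = [10, 20, 30, 5, 15, 8]
budget = max(1, round(0.1 * sum(weights)))   # -> 9
best, chosen = knapsack(values, weights, budget)  # -> (40, [5])
```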
2.6.2 Methods
MIPaaL: Following our specified methodology, we use CPLEX's implementation of Gomory cuts Gomory [1960], which we collect through the C API. Since we are only interested in the global cuts generated, we limit solving to the root node and disable heuristics and all cut generation procedures other than Gomory cuts. Termination occurs either when an integral solution is found, in which
case our resulting LP gives us the exact integral optimal solution, or when CPLEX
stops generating viable cuts (since the cut generation based purely on Gomory
cuts is not complete).
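The Gomory fractional cut that this procedure relies on can be derived from a single simplex tableau row. The sketch below is a textbook derivation, not CPLEX's implementation: given a row x_B + Σ_j a_j x_j = b with fractional b, the cut Σ_j frac(a_j) x_j ≥ frac(b) cuts off the current LP solution while remaining valid for all integer-feasible points.

```python
from math import floor

def gomory_fractional_cut(row_coeffs, rhs, eps=1e-9):
    """Derive a Gomory fractional cut from one simplex-tableau row.

    The row reads x_B + sum_j row_coeffs[j] * x_j = rhs, where x_B is a
    basic integer variable with fractional value rhs and the x_j are the
    nonbasic variables (currently zero). Returns (cut_coeffs, cut_rhs)
    for the cut sum_j cut_coeffs[j] * x_j >= cut_rhs, or None if rhs is
    (near-)integral so no cut can be derived from this row.
    """
    frac = lambda v: v - floor(v)
    f0 = frac(rhs)
    if f0 < eps or f0 > 1 - eps:
        return None  # basic value already integral
    return [frac(a) for a in row_coeffs], f0

# Example: tableau row x_B + 0.5*x_1 - 1.25*x_2 = 3.75
cut = gomory_fractional_cut([0.5, -1.25], 3.75)
# cut -> ([0.5, 0.75], 0.75), i.e. the cut 0.5*x_1 + 0.75*x_2 >= 0.75
```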
Two-Stage: We compare against the standard predict-then-optimize approach
which treats prediction and optimization components separately. The predictive
component is trained to minimize a standard loss between predicted objective
coefficients and the ground truth (e.g., mean squared error or cross-entropy).
Afterwards, we solve the MIP to optimality using the predicted coefficients.
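A minimal sketch of this two-stage pipeline on synthetic data; the toy "choose k of n items" problem stands in for the MIP, and all sizes and names here are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

# Two-stage predict-then-optimize: (1) fit a predictor of objective
# coefficients by minimizing MSE, ignoring the downstream problem;
# (2) solve the optimization exactly with the predicted coefficients.
rng = np.random.default_rng(0)
n_train, n_items, n_feat, k = 200, 6, 3, 2

W_true = rng.normal(size=(n_feat,))
X = rng.normal(size=(n_train, n_items, n_feat))
c_true = X @ W_true + 0.1 * rng.normal(size=(n_train, n_items))

# Stage 1: least-squares fit mapping features -> objective coefficients.
W_hat, *_ = np.linalg.lstsq(X.reshape(-1, n_feat), c_true.reshape(-1), rcond=None)

def solve(c):
    """Exact solver for the toy problem: best k-subset by total value."""
    return max(combinations(range(len(c)), k), key=lambda S: sum(c[i] for i in S))

# Stage 2: on a new instance, optimize with predicted coefficients, then
# score the resulting decision under the held-out true coefficients.
X_new = rng.normal(size=(n_items, n_feat))
c_new = X_new @ W_true
decision = solve(X_new @ W_hat)
decision_quality = sum(c_new[i] for i in decision)
```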
RootLP: Next, we compare against a decision-focused learning approach that
uses only the initial LP relaxation of the MIP (which disregards integrality constraints), and hence can be solved using the method proposed by Wilder et al.
[2019a] for linear programs. While the predictive model is trained using the
LP relaxation, at test time we solve the true MIP using the predicted objective
coefficients to obtain an integral decision. This tests the impact of our cutting
plane method, which allows us to fully account for combinatorial constraints in
MIPaaL’s gradients.
Setup: We average over 5 training and testing iterations per problem setting
with different seeds to evaluate the given approaches. This results in 180 portfolio optimization, 55 diverse bipartite matching, and 95 weighted knapsack instances to compute testing metrics. The neural network architectures are fixed
for a given problem distribution, and are selected from grid search using MIP decision quality on the validation set of the TwoStage model. The models are trained
with Adam Kingma and Ba [2014] with a learning rate of 0.01 and an l2 regularization coefficient of 0.01. Details on model architectures and evaluation setup can be found in the appendix∗.
2.6.3 Experimental Results
Table 2.1: Decision quality. Comparison in terms of realized optimization objective: monthly percentage increase for portfolio optimization (SP500 and DAX),
number of pairs successfully matched for Matching, and value of items for Knapsack. MIPaaL gives 2x monthly returns on SP500 and 8x on DAX, and improves
the objective by 40.3% and 1.2% for Matching and Knapsack respectively. MIPaaL
outperforms all other variants considered including naive application of previous work [Wilder et al., 2019a] for differentiating through the root LP.
SP500 DAX Matching Knapsack
MIPaaL 2.79 ± 0.17 5.70 ± 0.68 4.80 ± 0.71 507.70 ± 0.471
MIPaaL-Warm 1.09 ± 0.18 0.68 ± 1.01 2.14 ± 0.51 499.60 ± 0.566
MIPaaL-Hybrid 1.08 ± 0.15 0.74 ± 1.10 3.21 ± 0.73 503.36 ± 0.578
MIPaaL-1000 2.60 ± 0.16 4.39 ± 0.66 3.45 ± 0.71 506.34 ± 0.662
MIPaaL-100 1.25 ± 0.14 0.35 ± 0.63 2.57 ± 0.54 505.99 ± 0.621
RootLP 1.97 ± 0.17 -1.97 ± 0.69 3.17 ± 0.60 501.58 ± 0.662
TwoStage 1.19 ± 0.15 0.70 ± 1.46 3.42 ± 0.78 501.49 ± 0.523
We evaluate the realized decision quality and predictive performance on test
instances. We provide 95% confidence half-widths around the mean evaluation.
Unless otherwise indicated, bolded entries are unbeaten by other methods using
one-sided paired t-tests with significance level of 0.05 to indicate whether we can
reject the hypothesis that observed improvement is due to random chance.
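For reference, the one-sided paired t-statistic behind this bolding rule can be computed as follows. This is a self-contained sketch with made-up per-seed numbers; in practice a statistics library would supply the p-value.

```python
from math import sqrt

def paired_t_stat(a, b):
    """One-sided paired t-statistic for H1: mean(a - b) > 0.

    Returns (t, dof). Compare t against the critical value of a
    t-distribution with dof degrees of freedom (e.g. ~2.132 for dof = 4
    at alpha = 0.05) to decide whether a's improvement over b is
    unlikely to be due to random chance.
    """
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / sqrt(var / n), n - 1

# Example: per-instance decision quality of two methods over 5 seeds.
t, dof = paired_t_stat([2.9, 2.7, 2.8, 2.6, 3.0], [1.2, 1.1, 1.4, 1.0, 1.3])
# t is large and positive, so the improvement is significant at 0.05.
```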
Decision quality: The decision quality of a given model’s outputs is determined by the realized objective value of solutions obtained by using the model’s
Table 2.2: ML performance on test set. TwoStage wins on ML metrics used for
training (MSE, CE), whereas MIPaaL has inferior ML metrics while improving
decision quality. In all benchmarks, the predictive problem is hard as evidenced
by the ML metrics of all methods. Bolded entries have 95% confidence intervals
overlapping with the best entry.
                            SP500                      DAX                        Matching                     Knapsack
                            MSE          Corr          MSE          Corr          CE           AUC             MSE            Corr
MIPaaL                      0.22 ± 0.043 0.15 ± 0.015  0.13 ± 0.017 0.25 ± 0.032  0.66 ± 0.009 0.535 ± 0.004   2774 ± 97.664  0.567 ± 0.002
MIPaaL-Warm                 0.11 ± 0.010 -0.01 ± 0.010 0.09 ± 0.067 0.07 ± 0.030  0.52 ± 0.003 0.509 ± 0.003   4660 ± 72.008  0.593 ± 0.003
MIPaaL-Hybrid               0.09 ± 0.030 0.13 ± 0.013  0.13 ± 0.099 0.26 ± 0.026  0.55 ± 0.002 0.502 ± 0.004   3824 ± 82.828  0.608 ± 0.006
MIPaaL-1000                 0.12 ± 0.020 0.13 ± 0.013  0.35 ± 0.010 0.27 ± 0.035  0.61 ± 0.010 0.506 ± 0.007   5821 ± 154.793 0.590 ± 0.005
MIPaaL-100                  0.98 ± 0.089 0.12 ± 0.013  0.99 ± 0.060 0.26 ± 0.037  0.54 ± 0.013 0.503 ± 0.004   5801 ± 145.331 0.553 ± 0.007
RootLP (Wilder et al. 2019) 0.71 ± 0.178 0.15 ± 0.013  1.06 ± 0.137 0.28 ± 0.032  0.49 ± 0.007 0.513 ± 0.001   6267 ± 212.063 0.574 ± 0.002
TwoStage                    0.09 ± 0.017 0.06 ± 0.011  0.02 ± 0.066 0.13 ± 0.032  0.39 ± 0.004 0.514 ± 0.005   684 ± 15.568   0.649 ± 0.002
predictions. A trained ML model predicts the objective coefficients of the particular decision problem, and we use an exact branch-and-bound solver to compute an optimal solution x̂ with respect to the predicted objective coefficients ĉ. The decision quality is the objective value of that solution under the ground-truth objective coefficients, cᵀx̂. For portfolio optimization, the decision quality is the monthly rate of return, so the 2.79% that MIPaaL achieves in Table 2.1 means that MIPaaL increases the portfolio value by 2.79% on average. In bipartite matching, our goal is to maximize the number of successful matches. In knapsack, we maximize the total value of selected items, which represents gross income from an energy schedule.
We compare MIPaaL against TwoStage and RootLP, as well as variants of MIPaaL discussed above. Our results in Table 2.1 show that MIPaaL significantly
outperforms both baselines in decision quality across all four benchmarks. As
shown in Table 2.1, the full MIPaaL results in more than 2x the average return of the TwoStage or RootLP methods for portfolio optimization, 40.3% more successful matches for bipartite matching, and 1% more gross revenue for knapsack instances. We note that a randomly initialized network achieves an objective value of around 501 for knapsack. These results drive home the importance of integrating the full combinatorial problem into training, as enabled by MIPaaL.
We find that warm starting or using a hybrid loss of decision quality and ML
loss degrade performance below RootLP. Limiting the number of cuts used in
MIPaaL to 100 also results in degraded performance in three of the benchmarks,
but with 1000 cuts, the model outperforms both baselines and is competitive with
the full MIPaaL.
For the full MIPaaL, due to the incompleteness of our cutting plane generator, solving the resulting cutting plane LP might not produce an integer optimal solution to the original MIP (as is also the case with the k-cut-limited versions). Hence, the LP solutions and the corresponding gradients we obtain during the backward pass are sometimes approximate. During full MIPaaL training, the cutting plane LPs of training instances produce integer optimal solutions in 83% of SP500, 94% of DAX, 65% of Matching, and 97% of Knapsack instances. On average, the cutting plane LP solutions during MIPaaL training matched the
0/1 value in the integer optimal solution for 89.20%, 70.40%, 88.35%, and 97.13% of the binary variables for SP500, DAX, Matching, and Knapsack respectively (see the complete set of statistics in the appendix∗). However, the decision quality improvements
of MIPaaL (using only the standard Gomory cut generation procedure) clearly
show that the training is indeed effective and results in a final model that is better than both TwoStage and the k-cut-limited forms of MIPaaL. The poor decision quality of MIPaaL-100 and RootLP highlights the fact that MIPaaL needs an effective cut generation procedure to succeed. For any given MIP domain, using a more comprehensive set of cutting plane generation procedures will improve the tightness of the resulting cutting-plane LP relative to the original MIP instance. In our results, using the full MIPaaL over the versions constrained to k = 100 or 1000 cuts always improved decision quality (Table 2.1), while not significantly increasing training time (see the appendix∗ for a full set of statistics on LP integrality and timing across the different MIPaaL variants).
ML performance: We show the predictive performance in Table 2.2. For portfolio optimization and knapsack (which are regression problems) we report mean
squared error (MSE) and correlation coefficient, while for bipartite matching (a
classification problem) we report cross-entropy and AUC. SP500 asset returns have mean 0.055 and standard deviation 0.199, and DAX returns have mean -0.9 and standard deviation 3.94, so the observed MSEs are large compared to the values themselves. In knapsack instances, the values have mean 64 and standard deviation 35, with the MSE again being very large compared to these values.
Looking at the ML metrics (Table 2.2), we note that the ML performance of the decision-focused methods varies widely. In particular, the testing MSE for portfolio optimization is quite high compared to the two-stage approach. This mismatch between MSE and decision quality exemplifies the need to train with the downstream optimization task in mind: even though the MIPaaL model has worse MSE than TwoStage, it results in much higher-return decisions.
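The following toy example (synthetic numbers, unrelated to the benchmarks) illustrates how this mismatch can arise: a predictor with much worse MSE can still induce the optimal decision if it preserves the ranking of the objective coefficients.

```python
# c_true: true objective coefficients; the decision is to pick the argmax.
c_true = [1.0, 1.1]

def mse(pred):
    return sum((p - t) ** 2 for p, t in zip(pred, c_true)) / len(c_true)

def decision_quality(pred):
    # true value of the item the predictor would select
    return c_true[max(range(len(pred)), key=lambda i: pred[i])]

pred_a = [0.0, 0.2]   # large errors, but the ranking is correct
pred_b = [1.1, 1.0]   # near-perfect values, but the ranking is flipped

assert mse(pred_a) > mse(pred_b)                          # A looks worse by MSE
assert decision_quality(pred_a) > decision_quality(pred_b)  # yet A decides better
```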
For the matching problem, we see that even though TwoStage has better cross-entropy loss at test time, as it was trained with that specific classification loss in mind, it lags in AUC. This both corresponds to the findings of previous work Wilder et al. [2019a] and indicates that the predictions learned by MIPaaL may sometimes also be accurate in a traditional sense.
In knapsack instances, MIPaaL improves MIP solution quality over the other approaches. In addition, its MSE is the best among the decision-focused methods, whereas its correlation is lower than all approaches other than MIPaaL-100.
We note that the performance improvement isn’t as drastic as for the other settings. It is possible that the constraints are not as combinatorial as in the other
domains and thus even straightforward ML losses may still perform well in decision quality. On the other combinatorial problems, our decision quality improves
by anywhere from 40% to 8x compared to the baselines, emphasizing the benefit
of using MIPaaL for problems with combinatorial structure.
Table 2.3: Problem statistics and timing results. Timing results are number of
epochs until validation MIP performance convergence, average time per epoch,
and average % time taken per epoch to compute Forward and Backward passes
through the MIP Layer.
Num Instances Problem Sizes Solve Statistics
Train Val Test Bin Vars Cont Vars Cons #Cuts Epoch (s) Fwd Bwd # Epochs
SP500 72 35 36 1000 3011 5026 2690 486 3.88% 25.46% 28
DAX 72 35 36 60 185 314 1387 47 15.63% 3.34% 20
Matching 16 11 11 2500 0 102 4984 604 2.13% 32.90% 35
Knapsack 56 19 19 48 0 1 1261 208 4.31% 0.36% 31
Table 2.4: Transfer learning metrics for models trained on SP-30a and tested on a varying data distribution. Both training and testing problems optimize over 30 stocks: the model is trained on 30-stock instances and tested on a disjoint subset of SP500 stocks, and on 30 stocks from the DAX, a German stock exchange.

Method             SP-30b         DAX
Decision Quality
  MIPaaL           2.02 ± 0.48    2.77 ± 0.40
  RootLP           1.81 ± 0.44    1.74 ± 0.43
  TwoStage         0.71 ± 0.04    0.82 ± 0.54
ML Loss
  MIPaaL           4.81 ± 8.59    4.59 ± 8.80
  RootLP           5.14 ± 1.02    5.39 ± 1.04
  TwoStage         0.08 ± 0.05    0.07 ± 0.03
Table 2.5: Transfer learning metrics for models trained on SP-30a and tested on varying problem sizes.

Method             SP-50          SP-100         SP-200         SP500
Decision Quality
  MIPaaL           1.93 ± 0.13    2.27 ± 0.11    2.17 ± 0.48    2.26 ± 0.37
  RootLP           1.50 ± 0.09    1.58 ± 0.08    1.82 ± 0.41    1.90 ± 0.29
  TwoStage         1.58 ± 0.13    1.22 ± 0.09    1.50 ± 0.58    1.11 ± 0.35
ML Loss
  MIPaaL           5.42 ± 3.16    5.42 ± 2.37    5.25 ± 1.83    5.43 ± 1.67
  RootLP           4.73 ± 3.17    4.88 ± 2.58    4.81 ± 1.91    4.83 ± 1.56
  TwoStage         0.08 ± 0.02    0.07 ± 0.01    0.08 ± 0.01    0.08 ± 0.01
Runtime: Table 2.3 summarizes our benchmarks as well as running time statistics for components of MIPaaL. It shows that the four benchmarks correspond to MIP instances of varied structure, number of variables, and number of constraints. We report the average number of added cuts per MIP instance generated during training, the average time per epoch, and the percentage of that time dedicated to the forward and backward passes through the MIP layer in particular. The number of added cuts in the forward pass is on the order of a few thousand for all four problem types. SP500 and Matching take longer per epoch than DAX and Knapsack, and for both of these benchmarks a large fraction of the training time is dedicated to the backward pass through the MIP layer rather than the forward Gomory-based solver. This is explained by the large size of the corresponding cutting plane LPs, which the backward pass needs to solve through the KKT conditions. Furthermore, recent GPU-accelerated QP
solvers introduced in Amos and Kolter [2017] would accelerate the backward
pass. On average, the forward pass through MIPaaL takes 0.26, 0.10, 0.80, and
0.16 seconds for SP500, DAX, Matching, and Knapsack instances respectively,
and 1.72, 0.02, 12.42, and 0.14 seconds for the backward passes respectively. Further timing results are in the appendix∗.
Transfer learning: To test generalization performance, we evaluate MIPaaL,
RootLP, and TwoStage on transfer learning tasks for portfolio optimization. In
this transfer learning setting, models are trained on 30 assets randomly drawn from the SP500 (SP-30a), with data from Jan 2005 - Dec 2010. These learned models are then evaluated on data from Dec 2013 - Nov 2016 to test various generalization aspects. To test generalization across the data distribution, we evaluate on 1) SP-30b, a set of 30 randomly drawn SP500 assets disjoint from SP-30a, and 2) the DAX, a separate index comprising 30 companies from a different country. Similarly, we evaluate on instances with a varying number of assets: SP-50, SP-100, and SP-200, which contain 50, 100, and 200 unique assets respectively, each disjoint from SP-30a and SP-30b, as well as SP-500, containing all 505 assets we have data for.
Varying data distribution: The transfer learning results (Table 2.4 and Table 2.5) demonstrate that MIPaaL generalizes to unseen assets and countries. On
SP-30b, MIPaaL doubles the average rate of return over the standard TwoStage approach, and gives a 59% improvement over RootLP. Furthermore, while the transferred model applied to DAX doesn't beat MIPaaL trained on that same task (Table 2.1), it improves over RootLP and TwoStage trained on DAX, showing that MIPaaL learns a model useful for portfolio optimization as a whole. Lastly, MIPaaL's performance improvement over TwoStage occurs despite a much higher MSE (Table 2.4 and Table 2.5).
Varying problem size: MIPaaL predicts effectively for the general task of portfolio optimization, regardless of the set or number of assets (Table 2.4 and Table 2.5). As with DAX, the transferred MIPaaL model trained on SP-30a outperforms on SP500 the TwoStage method trained on SP500 data (Table 2.1). These results indicate that MIPaaL can train predictive models which generalize across data distribution and MIP problem size. Comparing MIPaaL trained on SP-30a, which achieves a 2.26 return on the full SP500, to TwoStage trained on SP500, which achieves 1.19 (Table 2.1), we see that MIPaaL outperforms TwoStage even though MIPaaL never sees most assets in the dataset. Additionally, MIPaaL trained on choosing from 30 assets trains more than an order of magnitude faster than training on 500 assets. In effect,
instances which performs well on larger MIP instances compared to TwoStage,
while training faster than MIPaaL on fully-sized MIPs.
2.7 Related Work
The interaction of machine learning and optimization has provided a diverse
set of tools for efficiently solving a wide variety of problems. Recent work has
framed traditionally heuristic components of optimization algorithms as machine learning tasks such as learning in branch and bound for MIPs Khalil et al.
[2016, 2017b], He et al. [2014], Bengio et al. [2021]. These approaches work inside
an exact solver when the MIP is completely specified. Deep reinforcement learning methods have also been used to generate high-quality data-driven solutions
to hard problems such as graph optimization Khalil et al. [2017a], vehicle routing
Kool et al. [2018], Nazari et al. [2018b], and real-time patrol planning in security games Yu et al. [2019]. These approaches require known inputs, such as the true objective coefficients, and are used to quickly generate high-quality solutions for problems that are computationally intractable for an exact solver.
Several approaches have been proposed which embed optimization components in neural networks. This includes specific problems such as Markowitz
portfolio optimization Bengio [1997] or finding physically feasible state transitions de Avila Belbute-Peres et al. [2018], as well as larger classes which exhibit
properties like convexity or submodularity. Problem classes used in end-to-end
training include polynomial-time solvable frameworks like quadratic programs
Amos and Kolter [2017] and linear programs Wilder et al. [2019a], as well as
zero-sum games Ling et al. [2018], Perrault et al. [2019]. In addition, a solution is proposed for problems encoded as submodular optimization problems in
Wilder et al. [2019a]. In Elmachtoub and Grigas [2022], the authors optimize a
decision-focused regret bound. Kao et al. [2009] instantiate end-to-end learning
with linear models predicting components of a quadratic optimization function,
and Demirović et al. [2019] exhaustively search linear models to perform well on
predict + optimize for knapsack. Lastly, Wang et al. [2019] incorporate a differentiable semidefinite program as a relaxation for MAXSAT. To our knowledge,
MIPaaL is the first approach for imbuing neural networks with the highly flexible mixed integer program, a widely used class of potentially inapproximable NP-hard optimization problems, while providing tight feedback on decision quality.
2.8 Conclusion
We propose MIPaaL, a principled method for incorporating mixed integer programs as a differentiable layer in neural networks. We approach the task of differentiating through this flexible, discrete, and potentially inapproximable problem
by algorithmically generating a potentially large but equivalent continuous optimization problem via cutting planes. We instantiate our proposed approach for
decision-focused learning wherein a predictive model is trained with a loss function that directly corresponds to the quality of the decisions made based on the
predictions. MIPaaL is evaluated on two settings of portfolio optimization, one
setting of diverse bipartite matching, and a knapsack setting, all of which contain modeling techniques widely used in combinatorial optimization that make
the problem more complex but more realistic. Furthermore, we evaluate MIPaaL
in several transfer learning settings. We demonstrate empirically that MIPaaL is
able to outperform the standard approach of decoupling the prediction and decision components, as well as an approach of using a continuous relaxation of
the original combinatorial optimization problem. To better understand the impact of the cutting plane technique, we explore hybrid strategies that stop the
cutting plane generation early, simultaneously optimize ML loss and MIP loss,
and initialize the network feeding into MIPaaL with a model trained via the decoupled approach. Ultimately, we find empirically that our approach can give
high-quality solutions in the investigated settings.
MIPaaL Supplementary Information
2.A Portfolio optimization
2.A.1 Data specification
To simulate our approach on this problem setting, we use historical price and volume data downloaded from the Quandl WIKI dataset Quandl [2019]. Our goal is to generate monthly portfolios of stocks in a given market index which target a portfolio of these stocks weighted by market capitalization, assuming that we start out with a portfolio that is weighted by market capitalization from
the previous month. We train our model on data collected from January 2005 to
December 2010, validate on data collected from January 2011 to November 2013
and test our model based on data from December 2013 to November 2016 resulting in 72 time periods used for training, 35 for validation and 36 for testing. We
split the data temporally to ensure that data from the future isn’t used to inform
predictions in the current time period. The 11 features used are made up of historical price averages and rates of return over different time horizons, which are
commonly used indicators for simple trading strategies. Furthermore, to evaluate the generality of the approach we evaluate on two market indices: the SP500
and DAX. The SP500 is an index of 505 large companies in the United States which
are widely representative of the US trading market. The DAX is composed of 30
large German companies and is much smaller than the SP500, being somewhat
less representative of its respective market as a whole, but still representative of
the central publicly traded companies in terms of trading volume and amount
invested in those companies.
We use the following 11 indicators commonly used in simple momentum-based and mean-reversion trading strategies Conrad and Kaul [1998]:
• percentage increase of price from previous 5 time periods
• percentage increase of price from previous year
• percentage increase of price from previous quarter
• mean of price percentage increase over previous year
• variance of price percentage increase over previous year
• mean of price percentage increase over previous quarter
• variance of price percentage increase over previous quarter
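A sketch of how such indicators might be computed from a daily price series. The window lengths (5 days, ~63 trading days per quarter, ~252 per year) and the synthetic series are illustrative assumptions, not the exact choices used here.

```python
import numpy as np

def momentum_features(prices, short=5, quarter=63, year=252):
    """Compute momentum/mean-reversion style indicators for the most
    recent time step of a daily price series (illustrative windows)."""
    p = np.asarray(prices, dtype=float)
    rets = p[1:] / p[:-1] - 1.0          # one-period percentage increases
    return {
        "pct_5":        p[-1] / p[-1 - short] - 1.0,     # 5-period increase
        "pct_quarter":  p[-1] / p[-1 - quarter] - 1.0,   # quarterly increase
        "pct_year":     p[-1] / p[-1 - year] - 1.0,      # yearly increase
        "mean_year":    rets[-year:].mean(),
        "var_year":     rets[-year:].var(),
        "mean_quarter": rets[-quarter:].mean(),
        "var_quarter":  rets[-quarter:].var(),
    }

# Example on a synthetic geometric-random-walk price series.
rng = np.random.default_rng(7)
prices = 100 * np.cumprod(1 + 0.001 + 0.01 * rng.normal(size=300))
feats = momentum_features(prices)
```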
We randomly sample 60 tickers from the SP500 to form SP-30a and SP-30b; these tickers are:
SP-30a = [LRCX, PHM, WY, SPG, EMN, CME, AEP, F, CAG, FISV, WBA, XOM
, NVDA, ETN, MDT, FL, HBAN, FFIV, BLK, IPG, EXPD, IRM, PH , DLTR, COST,
NBL, INCY, CSX]
SP-30b = [DVA, DE, BAC, KLAC, ADBE, FIS, IT, KR, FMC, HOG, SHW, RE , ETR,
BK, ACN, NWL, ESS, VMC, C, EW, IR, SWKS, SNPS, ARE , SCHW, WEC, IVZ,
SLB]
Indices for the SP-50, SP-100, and SP-200 are:
SP-50 = [VLO, EIX, RSG, PKG, CMI, AMZN, CHRW, ATVI, ADS, RHI, DVN ,
WMB, ILMN, LOW, SIVB, T, NTRS, HCP, ALXN, CPB, MRO, MU, CB , HES, APA,
VFC, CHD, CMCSA, ALK, PBCT, BBY, MNST, UDR, CTXS , AZO, XRX, SBAC,
M, MTD, COP, UTX, KO, MCO, TGT, CELG, HSIC , WYNN, YUM, ECL, ABT]
SP-100 = [RTN, K, ROST, WAB, MMM, WMT, JPM, MMC, AMD, BA, NEM,
JBHT , STI, ITW, NKTR, ETFC, PNW, HSY, EL, MLM, UPS, FRT, ES, ROL , CMS,
MAC, PVH, TSN, ANSS, SJM, DISH, LUV, MOS, TTWO, SEE , PKI, BAX, AMG,
EOG, AMGN, CCI, EA, AGN, HP, HAL, AXP, SRE , INTU, AOS, APH, ROK,
ABMD, CINF, UNM, ZION, HRL, ADSK, MSFT , JNJ, EBAY, ADI, EXC, CVS, PEG,
CVX, PNR, PAYX, CPRT, DHR , JCI, SYK, HST, MO, INTC, GD, MCHP, PFG, PG,
QCOM, STX, HAS , WM, CL, CDNS, BF_B, HIG, REGN, COG, RL, GPS, APD,
APC, GPC , TSCO, JKHY, FAST, NOV, TMK, AVY, PGR]
SP-200 = [SYMC, WDC, CSCO, STZ, NUE, MAA, ARNC, DGX, EQIX, PWR,
ED , LLY, AIV, ISRG, FITB, EQR, BXP, ORLY, MHK, JEC, UNP, TXT , IDXX, WHR,
OMC, KEY, NI, FLIR, SNA, MCK, PXD, MCD, XEC, LEG , KSU, TSS, IP, AVB, RF,
GPN, AFL, UNH, DTE, NEE, ZBH, NDAQ , DRE, MSI, VRSN, A, XEL, MAS, LKQ,
FDX, LMT, WFC, NRG, RHT , COF, RMD, SLG, TROW, ANTM, BRK_B, PSA, RJF,
BSX, AJG, FLR , BLL, AES, ORCL, MAR, HD, GILD, RCL, TIF, PRU, SWK, TMO,
NOC , NKE, IFF, DHI, STT, AMT, TJX, PRGO, GS, NFLX, GLW, PFE, DOV , OKE,
PLD, VZ, LNC, PNC, CTSH, JNPR, LEN, ABC, CI, CCL, LH , FCX, FLS, PCAR,
HRS, EMR, GRMN, AKAM, BBT, PEP, OXY, KSS, MS , URI, AEE, NSC, CAH,
VTR, O, USB, SYY, FE, TXN, BDX, BWA , CMA, COO, NTAP, SBUX, GWW, CNP,
HOLX, CNC, D, TAP, MET, AAP , TFX, IBM, XLNX, LLL, AMAT, HUM, AON,
DIS, KMB, BEN, GIS, ADM , MXIM, PPL, HPQ, ALL, VRTX, L, KIM, DOW, HON,
DRI, AAPL, XRAY , MKC, MRK, HFC, FOX, CTAS, REG, HRB, KMX, SO, MYL,
CLX, CERN , ALB, BIIB, LNT, BMY, MAT, DUK, MTB, AME, LB, GE, VNO, ROP
, EFX, AIG, JWN, CAT, UHS, WAT, ADP, ATO, VAR, FOXA, MDLZ]
Extended Transfer Learning: Index Size Generalization. We evaluate the ability of MIPaaL to perform well when predictions are required in different settings. In particular, we evaluate when a model is trained to perform well on 30 assets but is deployed on a larger number of assets. SP-50, SP-100, and SP-200 are instances drawn on 3 sets of assets which are all disjoint from both SP-30a and SP-30b. Again, we only evaluate on testing time periods to prevent data leakage, and present results on the monthly rates of return in Table 2.6 and MSE values in Table 2.7.
Table 2.6: Transfer learning monthly rate of return on different problem sizes: trained on SP-30a. The decision-focused approaches ensure that the predictions perform well in multiple deployment settings, whereas the TwoStage approach is unable to extract relevant information.
SP-50 SP-100 SP-200 SP500
MIPaaL 1.93 ± 0.13 2.27 ± 0.11 2.17 ± 0.48 2.26 ± 0.37
RootLP 1.50 ± 0.09 1.58 ± 0.08 1.82 ± 0.41 1.90 ± 0.29
TwoStage 1.58 ± 0.13 1.22 ± 0.09 1.50 ± 0.58 1.11 ± 0.35
As shown in Table 2.6, both decision-focused approaches perform well when the predictions are used to find the optimal MIP solution. In this case the performance improvement is not as drastic as when the model is trained directly for the task in question. However, these results demonstrate that MIPaaL is able to extract information relevant to the general task of portfolio optimization, regardless of the set or number of assets in question.
Table 2.7: Transfer learning MSE results on different problem sizes: trained on SP-30a. The TwoStage approach clearly does better at approximating the distribution of objective coefficients even when the data distribution changes. However, the improvement in MSE does not ensure good deployment quality, as shown in Table 2.6.
SP-50 SP-100 SP-200 SP500
MIPaaL 5.421 ± 3.163 5.422 ± 2.368 5.254 ± 1.834 5.431 ± 1.670
RootLP 4.725 ± 3.169 4.876 ± 2.579 4.807 ± 1.906 4.834 ± 1.556
TwoStage 0.078 ± 0.019 0.074 ± 0.012 0.078 ± 0.011 0.076 ± 0.011
2.A.2 Optimization model
We specify the optimization model used for the portfolio optimization task.
For our applications, we set the ticket budget Bticket and the name budget Bname to half the number of securities considered: 200 for SP500, and 15 for DAX and SP-30. We set the sector budget Bsector to 0.1, meaning that for any given sector we can move at most 10% of the portfolio weight into or out of that sector. Overall, we found that this gave us reasonable problems that were not trivial to solve but still had a non-empty feasible region.
Given:
• A set of n assets, denoted i = 1, . . . , n, and a set of sectors S, which partitions the n assets.
• Ticket (trading) limit: Bticket.
• Name (unique item changes) limit: Bname.
• Sector deviation limit: Bsector.
• For each asset i = 1, . . . , n:
– Expected return: αi.
– Trading volume: vol(i).
– Indicator of whether asset i belongs to sector s: Ms(i) ≡ i ∈ s.
– Initial portfolio asset weight: w0(i).
– Target portfolio asset weight: wt(i).
Goal: find final portfolio weights wf(i) that are close to the target portfolio wt(i) in weight, do not incur too much cost to execute starting from w0(i), respect budget constraints on sector exposure, cardinality, and trading limits, and sum to 1.
Decision variables:
• Final weight in asset i: wf(i) ∈ [0, 1] for all i = 1, . . . , n.
• Auxiliary indicator of whether asset i is used: ynames(i).
• Auxiliary indicator of whether asset i changes weight in the portfolio: ytickets(i).
• Auxiliary variable f(i) representing the absolute value of the weight change from the original, keeping track of tickets bought.
• Auxiliary variable y(i) representing the absolute value of the deviation from target.
• Auxiliary variable x(s) representing the absolute value of the sector weight change.
• Auxiliary variables z1(i), z2(i), z3(i) placing the weight change into three compartments of a piecewise linear function related to volume.
Formulation:

maximize    Σ_{i=1..n} αi wf(i)
subject to  Σ_{i=1..n} wf(i) = 1
            Σ_{s∈S} x(s) ≤ Bsector
            Σ_{i=1..n} ynames(i) ≤ Bname
            Σ_{i=1..n} ytickets(i) ≤ Bticket
            wf(i) − wt(i) ≤ y(i)                          ∀i = 1, . . . , n
            −(wf(i) − wt(i)) ≤ y(i)                       ∀i = 1, . . . , n
            wf(i) − w0(i) ≤ f(i)                          ∀i = 1, . . . , n
            −(wf(i) − w0(i)) ≤ f(i)                       ∀i = 1, . . . , n
            Σ_{i=1..n} Ms(i)(wf(i) − wt(i)) ≤ x(s)        ∀s ∈ S
            −Σ_{i=1..n} Ms(i)(wf(i) − wt(i)) ≤ x(s)       ∀s ∈ S
            wf(i) ≤ ynames(i)                             ∀i = 1, . . . , n
            f(i) ≤ ytickets(i)                            ∀i = 1, . . . , n
            f(i) = z1(i) + z2(i) + z3(i)                  ∀i = 1, . . . , n
            0 ≤ z1(i) ≤ 0.1 vol(i)                        ∀i = 1, . . . , n
            0 ≤ z2(i) ≤ 0.2 vol(i)                        ∀i = 1, . . . , n
            0 ≤ z3(i) ≤ 0.2 vol(i)                        ∀i = 1, . . . , n
            wf(i), f(i), y(i) ≥ 0                         ∀i = 1, . . . , n
            z1(i), z2(i), z3(i) ≥ 0                       ∀i = 1, . . . , n
            ynames(i), ytickets(i) ∈ {0, 1}               ∀i = 1, . . . , n
            x(s) ≥ 0                                      ∀s ∈ S
(2.5)
Overall this problem has |S| + 6n continuous decision variables, 2n binary decision variables, and 10n + 2|S| + 4 constraints.
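To make the structure concrete, the sketch below solves a heavily stripped-down version of formulation (2.5) by brute force, keeping only the weight-sum, per-asset cap, and name-budget constraints (all numbers are hypothetical). For a fixed support of selected assets, the continuous part reduces to greedily filling the highest-return assets up to their cap.

```python
from itertools import combinations

def solve_portfolio(alpha, cap, b_name):
    """Maximize sum_i alpha[i]*w[i] subject to sum_i w[i] = 1,
    0 <= w[i] <= cap * y[i], and sum_i y[i] <= b_name, by enumerating
    supports of size b_name (extra allowed assets can take zero weight).

    Returns (objective, weights)."""
    n = len(alpha)
    best = (float("-inf"), None)
    for support in combinations(range(n), b_name):
        if cap * len(support) < 1.0:
            continue  # cannot reach total weight 1: infeasible support
        w = [0.0] * n
        remaining = 1.0
        # greedy: pour weight into the highest-return assets first
        for i in sorted(support, key=lambda i: -alpha[i]):
            w[i] = min(cap, remaining)
            remaining -= w[i]
        obj = sum(alpha[i] * w[i] for i in range(n))
        best = max(best, (obj, w))
    return best

alpha = [0.03, -0.01, 0.05, 0.02, 0.04]   # predicted returns (made up)
obj, weights = solve_portfolio(alpha, cap=0.5, b_name=3)
# best plan caps the two highest-return assets at 0.5 each
```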
2.B Bipartite matching
2.B.1 Data specification
We run experiments on a version of the bipartite matching problem used in
Wilder et al. [2019a] which requires additional diversity constraints on the proposed matching. The matching problem is done on graphs sampled from the
CORA citation network dataset Sen et al. [2008] which consist of nodes and edges
representing publications and citations respectively. In addition, we use information present about a given paper’s field of study. The full network we consider
consists of 2708 nodes which are partitioned into 27 matching instances using
METIS Karypis and Kumar [1998]. Each instance is a bipartite matching problem on a complete bipartite graph with 50 nodes on each side of the graph. The
100 nodes in each instance are divided to maximize the number of edges between
the two partitions. To ensure the diversity of our recommendations, we require that a minimum percentage p = 25% of the suggested pairings belong to distinct fields of study, and similarly impose a constraint that a minimum percentage q = 25%
58
of suggested citations belong to the same field of study. The predicted values in
this case are the edge weights which correspond to whether one paper actually
cites another. The node representations in this case are binary feature vectors
which correspond to whether a given word from a lexicon of 1433 words appears
in the paper. Edge values are predicted based on the concatenation of the node
features on the edge endpoints.
2.B.2 Optimization model
Given: two sets of nodes representing publications N1, N2; weights on pairs of nodes c_{i,j} ∀i ∈ N1, j ∈ N2 corresponding to how likely it is that one paper will cite another; and a same-field indicator m_{i,j}, equal to 1 if i and j are in the same field and 0 otherwise.
Goal: find a matching of nodes such that 1) each node is matched at most once, 2) at least a p ∈ [0, 1] proportion of selected edges connect nodes of the same field (m_{i,j} = 1), and 3) at least a q ∈ [0, 1] proportion of selected edges connect nodes of different fields (m_{i,j} = 0).
Decision Variables: x_{i,j} ∈ {0, 1} ∀(i, j) ∈ N1 × N2, corresponding to whether we use edge (i, j) in the matching or not.
Formulation:

$$
\begin{aligned}
\max_{x} \quad & \sum_{i,j} c_{i,j}\, x_{i,j} \\
\text{subject to} \quad & \sum_{j} x_{i,j} \le 1 & \forall i \in N_1 \\
& \sum_{i} x_{i,j} \le 1 & \forall j \in N_2 \\
& \sum_{i,j} m_{i,j}\, x_{i,j} \ge p \sum_{i,j} x_{i,j} \\
& \sum_{i,j} (1 - m_{i,j})\, x_{i,j} \ge q \sum_{i,j} x_{i,j} \\
& x_{i,j} \in \{0, 1\} & \forall i \in N_1,\ j \in N_2
\end{aligned}
\qquad (2.6)
$$
Overall this problem has |N1| × |N2| decision variables, and |N1| + |N2| + 2 constraints. In our setting, we set p = q = 0.25, as this choice kept the problem from being solved too easily while still yielding nonempty feasible regions.
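On toy instances, the formulation above can be checked by brute-force enumeration (a minimal pure-Python sketch of our own for illustration only; our experiments use a proper ILP solver, and the instance in the usage example is hypothetical):

```python
from itertools import product

def brute_force_diverse_matching(c, m, p=0.25, q=0.25):
    """Enumerate all 0-1 edge selections of formulation (2.6) on a tiny
    bipartite graph and return the best feasible matching (toy sizes only)."""
    n1, n2 = len(c), len(c[0])
    edges = [(i, j) for i in range(n1) for j in range(n2)]
    best_val, best_sel = float("-inf"), None
    for choice in product([0, 1], repeat=len(edges)):
        sel = [e for e, used in zip(edges, choice) if used]
        # each node may be matched at most once
        if len({i for i, _ in sel}) < len(sel) or len({j for _, j in sel}) < len(sel):
            continue
        total = len(sel)
        same = sum(m[i][j] for i, j in sel)
        # diversity constraints: enough same-field and different-field edges
        if total and (same < p * total or (total - same) < q * total):
            continue
        val = sum(c[i][j] for i, j in sel)
        if val > best_val:
            best_val, best_sel = val, sel
    return best_val, best_sel
```

For example, with edge weights `c = [[5, 0.1], [0.1, 4]]` and same-field indicators `m = [[1, 0], [0, 0]]`, the best feasible matching pairs one same-field and one different-field edge.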
Chapter 3
Predicting Wildlife Trafficking Routes
with Differentiable Shortest Paths
“Only if we understand, will we care. Only if we care, will we help. Only if we help
shall all be saved.” - Jane Goodall
Wildlife trafficking (WT), the illegal trade of wild fauna, flora, and their parts,
directly threatens biodiversity and conservation of trafficked species, while also
negatively impacting human health, national security, and economic development. Wildlife traffickers obfuscate their activities in plain sight, leveraging
legal, large, and globally linked transportation networks. To complicate matters, defensive interdiction resources are limited, datasets are fragmented and
rarely interoperable, and interventions like installing checkpoints or machinery place a burden on legal transportation. As a result, interpretable predictions
of which routes wildlife traffickers are likely to take can help target defensive
efforts and understand what wildlife traffickers may be considering when selecting routes. We propose a data-driven model for predicting trafficking routes
on the global commercial flight network, a transportation network for which
we have some historical seizure data and a specification of the possible routes
that traffickers may take. While seizure data has limitations such as data bias
and dependence on the deployed defensive resources, this is a first step towards
predicting wildlife trafficking routes on real-world data. Our seizure data documents the planned commercial flight itinerary of trafficked and successfully
interdicted wildlife. We aim to provide predictions of highly-trafficked flight
paths for known origin-destination pairs with plausible explanations that illuminate how traffickers make decisions based on the presence of criminal actors,
markets, and resilience systems. We propose a model that first predicts, from input features, the likelihood of each commercial flight being taken out of a given airport, and then finds the highest-likelihood flight path from
origin to destination using a differentiable shortest path solver, allowing us to
automatically align our model’s loss with the overall goal of correctly predicting
the full flight itinerary from a given source to a destination. We evaluate the
proposed model’s predictions and interpretations both quantitatively and qualitatively, showing that the predicted paths are aligned with observed held-out
seizures, and can be interpreted by policy-makers.
3.1 Introduction
Wildlife Trafficking (WT) broadly impacts biodiversity, human health, economic
development, and national security UNODC [2017]. It encompasses a wide array of species that originate from, and are transported to, supply and demand
markets around the world. WT spans over 150 countries and includes more
than 37,000 species of fauna and flora UNODC [2017]. Transnational criminal
organizations are known to leverage the increasingly interconnected air transportation network to move illegal wildlife products from source to destination
locations, generating $19 billion annually in black market proceeds IATA [2022],
Utermohlen and Baine [2018], ROUTES [2022]. The massive scope, scale, and diversity of wildlife trafficking networks present a complex and dynamic challenge
for authorities and researchers trying to understand and interrupt the transiting
of illegal wildlife products using detection, interdiction, deterrence, education,
or other activities. Stakeholders working to combat wildlife trafficking also face
limited social, physical, and financial capital compared to other illicit activities
such as drug trafficking. Current practice is to rely heavily on trusted and established personal relationships, “tip-offs” about specific flights, use of specially
trained sniffer dogs, and education of airport personnel; these practices can be
successful in one-off contexts but lack a desired deterrent effect. Network interdiction models can assist in determining the optimal allocation of scarce resources along known trafficking networks but have yet to be systematically applied to the transiting stage of wildlife trafficking supply chains Smith and Song
[2020], Haas and Ferreira [2018]. Data-driven methods for understanding underlying wildlife trafficking patterns could help advance on-the-ground practice
and expand modeling techniques to a novel domain space and are a necessary
first step before targeted interdiction allocation can be applied effectively and
efficiently.
Recognizing the potential for data-driven methods to dramatically enhance
solutions to the problem of wildlife trafficking, multiple sectors have increased
their data collection activities. For example, The Convention on International
Trade in Endangered Species of Wild Fauna and Flora (CITES) is a global agreement among governments to regulate international trade in species under threat
that entered into force in 1975 and is currently signed by 183 countries and the European Union. TRAFFIC is an organization that was established in 1976 by The
World Wide Fund for Nature (WWF) and International Union for Conservation
of Nature (IUCN) as a wildlife trade monitoring network to undertake data collection, analysis, and provision of recommendations to inform decision making
on wildlife trade. In 2015, the U.S. Agency for International Development (USAID)
established the Reducing Opportunities for Unlawful Transport of Endangered
Species (ROUTES) Partnership to bring together transport and logistics companies, government agencies, law enforcement, and conservation organizations to
eliminate wildlife trafficking from the air transport supply chain. Importantly,
these efforts have contributed to collection and synthesis of a limited but growing
global database of illegal wildlife trade seizure data.
Overall, the flight network’s widespread use for moving illegal goods, as well
as the presence of structured data make it a promising setting for data analysis to
help inform defensive measures. Center for Advanced Defense Studies (C4ADS),
a nonprofit that is a member of ROUTES, produced in-depth summary analysis of
the global wildlife trade flight seizure data from 2009-2017 Utermohlen and Baine
[2018] and 2016-2018 Utermohlen [2020] and derived insights based on observed
concentration of illegal activity and outliers. Some studies and reports describe
traffickers’ modus operandi, or factors that may influence their decisions to traffic products through certain ports over others Stringham et al. [2021b], Arroyave
et al. [2020]. Factors, such as larger airports with higher volume, prevalence of
corruption, lower financial costs, and smaller legal penalties, have been shown to
possibly be beneficial for traffickers Gore et al. [2022]. Furthermore, interdisciplinary work is a promising avenue for understanding the drivers of wildlife trafficking Gore et al. [2023a], with operations research presenting several promising directions for allocating defensive resources Keskin et al. [2022]. Optimization
techniques have also been deployed successfully in other planning domains such
as in patrol planning Chen et al. [2021], wildfire mitigation Chen et al. [2022],
seismic resilience Huang and Dilkina [2020], and game theoretic defenses Shen
et al. [2020], Huang et al. [2020b]. These interdisciplinary efforts have led to
the collation of relevant datasets Gore et al. [2023b]. However, there is limited
quantitative research into the factors that impact traffickers’ transit choices and
their relative importance Magliocca et al. [2021], Stringham et al. [2021a,c]. Additionally, some previous work investigates learning models in a strategy-proof
manner, effectively ensuring that the model avoids manipulation by adversaries
Krishnaswamy et al. [2021], Li et al. [2019a]. In fact, to our knowledge, predictive
models have not been applied to the wildlife trafficking domain. Machine learning models can be instrumental in extrapolating the patterns from the limited
seizure data to other airports and routes. They can highlight important factors
and their weights to provide insight into traffickers’ objectives that can be utilized when making interdiction decisions and predicting trafficker responses.
To this end, in this paper, we formulate wildlife trafficking across the global
flight network as a route prediction problem on a graph, synthesize historical
seizure data with data that describes airport nodes and flight edges, and propose a
maximum likelihood machine learning model that exploits recent developments
in differentiable optimization. In particular, we model probabilities of trafficking on each edge in the transportation network as a function of node and edge
features, and train the model by comparing the maximum likelihood path (identified by computing the shortest path in log space) to the ground truth paths. We
demonstrate the predictive power of our model. We analyze our model’s results
to understand the discrepancies between our predictions and the ground truth
seizure data. By utilizing an interpretable linear model with respect to input
features, we are also able to provide feature importance insights.
A key area of concern in combating WT is the convergence of multiple forms
of illicit trade Stringham et al. [2021c], Gore et al. [2019]. Convergence can take
a variety of forms. For instance, revenue from WT activities can fund arms trafficking. Additionally, the people, countries, and transit routes used for various
forms of trafficking can substantially overlap due to factors that are mutually
beneficial. Convergence has long been an area of concern but the amount of scientific, quantitative, evidence for convergence is still limited Gore et al. [2021].
Our work makes a step towards quantifying the scale and impact of convergence
by directly incorporating measures of other illicit activities at given locations as
features when predicting wildlife trafficking paths. Understanding the impact of
other illicit activities on the path probabilities of traffickers provides a quantitative measure of geographic convergence.
Figure 3.1: Visualization of itineraries with historical seizures in red as well as a
subset of the global flight network in grey.
3.2 Related Work
The overall problem of learning route choices may be considered an inverse optimization problem, where we are given “solutions” to optimization problems
and we want to identify which optimization parameters yield those observed
solutions as optimal Ahuja and Orlin [2001]. Indeed, previous work in trajectory prediction has modeled hidden latencies for travel networks by solving an
inverse shortest path problem Zhang and Paschalidis [2018], or learning transportation preferences for a road network which results in a given traffic flow on
the network Fosgerau et al. [2022]. The area of trajectory prediction Gambs et al.
[2012], Feng et al. [2018], Rudenko et al. [2020], Ziebart et al. [2009] aims to predict paths for individuals and thus tends to assume access to the start location, or
continually updating sequence of locations, and try to predict the rest of the trajectory that the person will take. However, in our case, we have generally-known
source and destination pairs and try to understand the most likely paths
that traffickers will take without continuously updating information.
Recent work in the machine learning literature has investigated how to integrate optimization solvers as differentiable components in machine learning
pipelines. This effectively allows the model designer to state that the model predictions will be used downstream by a structured optimization problem which
will output an optimal solution to a problem with given predicted inputs. The
seminal OptNet paper Amos and Kolter [2017] introduces the quadratic optimization program as a differentiable layer for use in deep learning pipelines, by
implicitly differentiating through the KKT optimality conditions, with follow-up
work extending the approach to linear programs Wilder et al. [2019a]. In a different vein, researchers investigated differentiating through blackbox optimizers
Pogančić et al. [2020] and differentiating through maximum likelihood estimation which can represent the optimal solution to a mathematical program Niepert
et al. [2021a]. Our approach directly builds off of Pogančić et al. [2020] and leverages empirical insights in order to speed up gradient computation. Lastly, several
approaches for smart predict-then-optimize have been proposed which compute
subgradients of the optimal solution with respect to the inputs in order to train
the predictive model Elmachtoub and Grigas [2022]. This smart predict-then-optimize line of work includes applicable theoretical guarantees and integration with
decision trees Balghiti et al. [2019], Elmachtoub et al. [2020].
Prior work has successfully used machine learning in the context of wildlife
poaching in conservation areas, but poaching is only the "first" step in the wildlife
trafficking supply chain Xu et al. [2021], Gholami et al. [2018], Nguyen et al.
[2016]. Poaching-oriented approaches consider classification models that predict the likelihood of snare detection at a given spatial location to inform ranger
patrolling efforts at the sourcing of wildlife. While these works demonstrated the
ability to predict poacher behavior at each pixel of a given conservation area, here
we address the global wildlife trade problem of learning trafficker route choices
on the broader international air transportation network.
3.3 Flight Itinerary Prediction Formulation
We formulate the problem of predicting trafficker flight paths connecting a given
source airport s and intended destination airport d as a supervised learning problem of predicting a path from s to d on a flight network represented as a directed
graph G. The flight network G represents airports as nodes and the flights between them as directed edges. We augment the flight network with WT-related
features φ on both nodes (φ^v) and edges (φ^e). We collect N ground-truthed trafficker paths D_π = {π_{s_i, d_i}}_{i=1}^{N} from centralized databases of seizure reports. These reports contain the traffickers’ intended itineraries between fixed sources s_i and destinations d_i. We encode these WT itineraries π_i as paths in the flight network, representing them as either a sequence of airport nodes or flight edges as needed.
Our data sources, collection, and synthesis are described in the “Data Sources”
section. To get a sense of the magnitude of the problem at hand, we visualize the
observed trafficker paths as well as 20% of the full flight network in Fig 3.1. We subsample because the global flight network, consisting of 14,118 flight edges connecting 1,933 airport nodes, is dense enough to render the image unreadable otherwise.
Formally, we aim to train a model that correctly predicts the observed structured path π_i given the input source s_i, destination d_i, flight network G, and features φ.
3.3.1 Predictive Model: Edge Transition Estimator
In order to predict full flight paths from features on just edges and nodes, we cannot simply predict how likely any individual path is, as the number of possible
simple paths is exponentially large in the size of the network. Instead, we consider predicting a probability for each edge which then can be used to compute
path likelihood.
We propose an approach for modeling the path prediction problem by predicting “transition” probabilities, i.e. probabilities over which flight edges a trafficker may take to exit a given “current” node. Since our setting requires a simple model that can be easily handed off to domain experts and deliver actionable insights for interdiction, we forgo complex architectures in favor of a simple predictive model. This models the trafficker as taking a biased random walk
from the source airport to the destination airport on the flight network where
our model learns the biased probabilities given edge and node features. With
this transition probability modeling approach, we can compute the probability
of taking any given source-destination path as being the product of individual
edge probabilities.
Formally, we model the problem as finding the probability P(i, j) of using a directed edge (i, j) to leave a starting node i. Here, probabilities on all edges leaving a given node i sum to 1. We use a parametrized model m, with parameters θ, to obtain probability estimates given the relevant features, i.e. P̂(i, j) = m(φ^e_{i,j}, φ^v_i, φ^v_j; θ). For notational simplicity, we consider the feature vector for a given edge to be the concatenation of edge-specific features, origin features, and destination features: φ_{i,j} = (φ^e_{i,j}, φ^v_i, φ^v_j). The edge probability prediction
model limits the number of trained parameters to prevent overfitting. This parameter sharing means that the same model is used to predict which flights will
be taken out of an airport whether it is Addis Ababa or Charles de Gaulle. Furthermore, by predicting edge probabilities from edge and node features, we can
understand how these features impact our model’s estimates and thus better understand what factors may be driving wildlife trafficking. Hence, in our experiments we use a linear model relating the features to the predicted probabilities
to ensure that the resulting model is interpretable.
We denote the set of edges leaving i as δ(i). Our linear model computes a logit for each edge and uses a softmax to normalize the logits over the edges leaving the flight’s origin node, ensuring that the outgoing probabilities sum to one. Mathematically, our probability prediction model is described in Equation 3.1.
$$
\hat{P}(i, j) = m\!\left( \phi^e_{i,j}, \phi^v_i, \phi^v_j;\ \theta \right) = \frac{\exp\left(\theta^T \phi_{i,j}\right)}{\sum_{j' \in \delta(i)} \exp\left(\theta^T \phi_{i,j'}\right)} \qquad (3.1)
$$
Our formulation ensures that the output probability estimates are a differentiable function of the parameters θ, which can be trained using standard deep learning libraries like PyTorch Paszke et al. [2019]. Additionally, we experimented with a
3-layer multi-layer perceptron (MLP) as well as gradient-boosted decision trees, but found that the MLP generalized poorly and that the gradient-boosted decision trees performed on par with our linear model, so we opted for the linear model as it is interpretable with no drawback in performance.
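Concretely, the per-origin softmax of Equation 3.1 can be sketched as follows (a minimal pure-Python illustration of our own; the destination names, feature vectors, and θ in the usage example are hypothetical stand-ins for the real features):

```python
import math

def edge_transition_probs(out_edges, features, theta):
    """Per-origin softmax over outgoing edges (Equation 3.1).

    out_edges: destination nodes j reachable from the current node i.
    features:  dict mapping j -> concatenated feature vector phi_{i,j}.
    theta:     weight vector of the linear model.
    """
    logits = {j: sum(t * f for t, f in zip(theta, features[j])) for j in out_edges}
    mx = max(logits.values())                     # stabilize the softmax
    exps = {j: math.exp(v - mx) for j, v in logits.items()}
    z = sum(exps.values())
    return {j: e / z for j, e in exps.items()}
```

By construction, the returned probabilities over the edges leaving a node sum to one.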
With the given formulation, the probability of a path P(π) is the product of its individual edge probabilities: P(π) = ∏_{(i,j) ∈ π} P̂(i, j). Furthermore, we can identify
the model’s highest-likelihood path by finding a shortest path with edge weights
corresponding to the negative log probability. A path minimizing the sum of
negative log probabilities is a path that maximizes the sum of log probabilities
which, due to the logarithm’s product rule and monotonicity, is a maximum likelihood path. The goal now is to find model parameters θ such that the observed
trafficking paths π have the highest likelihood.
At deployment time, this edge transition model enables us to easily identify the highest-likelihood path by solving a shortest path problem in log probability space, to obtain other highly-likely paths by identifying near-optimal solutions, and to easily evaluate the likelihood of any alternative path.
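This deployment-time computation can be sketched with a standard Dijkstra over negative log probabilities (a self-contained illustration of our own; the toy transition probabilities in the usage example are hypothetical):

```python
import heapq
import math

def max_likelihood_path(probs, s, d):
    """Shortest path under w(i, j) = -log P(i, j), i.e. the max-likelihood path.

    probs: dict mapping node i -> {j: P(i, j)} with outgoing probs summing to 1.
    Returns (path as a node list, path probability).
    """
    dist, prev = {s: 0.0}, {}
    heap = [(0.0, s)]
    visited = set()
    while heap:
        cost, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == d:
            break
        for v, p in probs.get(u, {}).items():
            nd = cost - math.log(p)  # weights positive since 0 < p < 1
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if d not in dist:
        return None, 0.0
    path = [d]
    while path[-1] != s:
        path.append(prev[path[-1]])
    path.reverse()
    return path, math.exp(-dist[d])
```

For example, with transitions P(s, a) = 0.7, P(a, d) = 0.9 versus P(s, b) = 0.3, the recovered path is s, a, d with probability 0.7 × 0.9.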
3.3.2 Model Training: Path-Integrated Learning
Given that we want to predict full paths in the flight network, we propose training the parameters θ to directly minimize differences between the predicted highest-probability path and observed trafficking paths. We consider a differentiable
pipeline and loss function that directly aligns model training with the problem of
recovering the ground truth path, and can be optimized using gradient descent.
Using the above definition of our edge transition probability estimator, we express model training as solving the optimization problem in Equation 3.2 which
minimizes the expected Hamming loss between a given ground-truth path π_{s,d}, with corresponding source s and destination d, and the highest-likelihood path π̂_{s,d} predicted by the model connecting that source to that destination. The
highest-likelihood path is computed by a Single Source Single Destination shortest
path solver (SSSDSolver) over the negative log of predicted transition probabilities Pˆ. Transition probabilities Pˆ are computed according to Equation (3.1).
Ultimately, to train the model we compute gradients for the model parameters
via backpropagation of the Hamming loss to the predicted highest-probability
path πˆ, back to the predicted transition probabilities Pˆ, and then to the model
parameters θ.
$$
\min_{\theta} \; \mathbb{E}_{\pi_{s,d}} \left[ H\!\left( \pi_{s,d},\ \mathrm{SSSDSolver}\!\left( -\log \hat{P};\ s, d \right) \right) \right] \qquad (3.2)
$$
For completeness, we can define the single source single destination shortest
path solver in Equation (3.3) as finding the path minimizing the sum of weights
on edges used in the path π, which in our case are negative log probabilities.
$$
\mathrm{SSSDSolver}(w; s, d) = \operatorname*{arg\,min}_{\pi_{s,d}} \sum_{(i,j) \in \pi_{s,d}} w_{i,j} \qquad (3.3)
$$
Here we can use any off-the-shelf shortest path solver without worrying about negative edge weights: since the probabilities all lie strictly between 0 and 1, their negative logs are all positive. In practice,
we use Dijkstra’s shortest path algorithm. Note that the forward pass to get predicted path πˆ is the same approach we use for determining the highest-likelihood
path, thus aligning our model’s training with the overall deployment pipeline of
correctly identifying the full path.
In order for us to use gradient descent to train our model parameters, we
need to ensure that all steps from the model predictions to the loss evaluation
are differentiable so that gradients may be easily computed using chain rule. All
of the components except for the SSSDSolver are readily differentiable functions
available in PyTorch Paszke et al. [2019]; as a result, we need to define a backward
pass for the shortest path solver to enable model training.
Using the formulation enabling differentiation of blackbox solvers proposed
in previous work Pogančić et al. [2020], we make our forward and gradient update explicit below. In the forward pass, we simply solve the shortest path problem and cache the solution πˆ := SSSDSolver(w; s, d). The backward pass itself
expects incoming gradients from the loss layer, and returns outgoing gradients
with respect to the input edge costs w. Overall, the intention of the gradient
is to give an indication of what changes in the edge costs w will produce the
desired change in the returned path to minimize the loss and better align the
path with the ground truth solution. The method for differentiating blackbox
solvers introduced in Pogančić et al. [2020] essentially perturbs the input objective coefficients w in the direction of the gradient to find a “locally-improved”
solution. It then computes the gradients as the difference between the resulting
“locally-improved” solution and the previously predicted solution. When used in
conjunction with the Hamming loss, the “locally-improved” objective coefficients
are simply the input objective coefficients with a given amount increased or decreased depending on whether the decision component, such as the edge usage,
should be used or not. In order to specify the degree that the input costs should
be perturbed, the authors use a hyperparameter λ which determines the degree
to which the weights w should be perturbed in the desired direction. Formally,
in the backward pass, we are given input gradients ∇_{π̂}L of the loss with respect to the shortest path π̂. We compute improved edge weights w′ = w + λ∇_{π̂}L. Then we re-solve the problem with the improved edge weights to find a better solution π′ = SSSDSolver(w′; s, d). Finally, we compute the gradients of this layer as −(1/λ)(π̂ − π′).
In our setting, this method corresponds to solving the shortest path problem
with perturbed weights where weight is slightly decreased on edges that should
appear in the ground truth solution and slightly increased on edges that aren’t
in the ground truth solution. The gradient that is passed back to the edge costs is
the difference between the predicted path and the “locally-improved” path. Intuitively, the approach aims to decrease cost on edges that should be in the locally-improved shortest path but aren’t in the predicted path, and increase cost on edges
that are in the outputted shortest path but don’t appear in the locally-improved
path. Additionally, in our initial experiments, we found that performant values of λ were large enough so that the weight perturbation eclipsed the initial
weights themselves, meaning that overall the “locally-improved” solution was
simply the ground truth solution. As such, to cut the number of solves down by
half, we simply used the ground truth solution path π as the “locally-improved”
solution.
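The forward/backward scheme above can be sketched in a few lines (our own illustration of the Pogančić et al. [2020] update; the single-edge-choice solver in the usage example is a hypothetical toy):

```python
def blackbox_solver_grad(solver, w, pi_hat, grad_loss, lam):
    """Backward pass for a blackbox combinatorial solver.

    w:         input edge weights (list of floats).
    pi_hat:    cached forward-pass solution as a 0-1 edge-usage vector.
    grad_loss: dL/d(pi_hat); for Hamming loss against ground truth pi this
               is (1 - 2 * pi) elementwise.
    lam:       perturbation-strength hyperparameter.
    """
    w_prime = [wi + lam * g for wi, g in zip(w, grad_loss)]    # perturbed weights
    pi_prime = solver(w_prime)                                 # "locally-improved" solution
    return [-(a - b) / lam for a, b in zip(pi_hat, pi_prime)]  # -(pi_hat - pi') / lam

def hamming_grad(pi_true):
    """Gradient of the Hamming loss w.r.t. the predicted 0-1 solution."""
    return [1.0 - 2.0 * t for t in pi_true]
```

With a toy solver that picks the cheaper of two parallel edges, a gradient-descent step on the returned gradient lowers the weight of the ground-truth edge and raises the weight of the wrongly predicted one, as described above.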
Note that this gradient update effectively pushes the ground truth solution π to score a better objective value than the predicted solution π̂. Additionally, in this scenario we consider that the path π is encoded
as a 0-1 vector with a given entry indicating whether edge (i, j) is used in the
path or not. As such, the weight vector is updated by the difference between
path solutions.
3.3.3 Model Training: Edge-Myopic Learning
We compare our path-integrated learning method with an approach that is trained
to minimize the Kullback–Leibler (KL) divergence Kullback [1997] between the
edge probabilities P′ computed directly from the training data and the edge probabilities P̂ predicted as a function of features. This approach focuses on correctly
predicting rates at which different edges are used for trafficking in the ground
truth rather than looking at full paths, and is a slight variant of baseline approaches in previous work Choi et al. [2018] that is adapted for our setting where
we have known source and destination locations, as well as network features.
Previous work estimates the transition probabilities between different locations,
and here we estimate these transitions with a logistic regression model to obtain
a model of how the features are related to the observed transitions. Using raw
training data, we estimate transition probabilities P′(e) as the number of times
that a given flight e is used for trafficking divided by the number of times that
the source airport is used for trafficking. The predictive model’s parameters are
then trained to closely match these transition probabilities based on the given
features. Given probability predictions P̂ and data-driven estimates P′(e), the KL divergence is

$$
\mathrm{KL}\left(\hat{P} \,\|\, P'\right) = \sum_{e} \hat{P}_e \log \frac{\hat{P}_e}{P'(e)}.
$$

Overall, the Edge-Myopic
learning trains the parameters θ to minimize this edge-level KL divergence.
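The edge-myopic estimates and objective can be sketched as follows (a minimal illustration with hypothetical paths; in our actual pipeline P̂ comes from the feature-based model rather than from counts):

```python
import math
from collections import Counter

def empirical_transition_probs(paths):
    """Estimate P'(e) = (# times edge e is trafficked) / (# times its source
    airport is departed from), using ground-truth paths given as node lists."""
    edge_counts, source_counts = Counter(), Counter()
    for path in paths:
        for u, v in zip(path, path[1:]):
            edge_counts[(u, v)] += 1
            source_counts[u] += 1
    return {e: c / source_counts[e[0]] for e, c in edge_counts.items()}

def kl_divergence(p_hat, p_emp):
    """Edge-level KL(P_hat || P') summed over the observed edges."""
    return sum(p_hat[e] * math.log(p_hat[e] / p_emp[e]) for e in p_emp)
```

When the predicted probabilities match the empirical estimates exactly, the divergence is zero, which is the minimum the Edge-Myopic objective can reach.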
3.4 Data Sources
Centralized and comprehensive data sources are critical for combating wildlife
trafficking Gore et al. [2022]; however, they are often lacking in practice, complicating the application of models to different domains. In our experiments,
we leverage data regarding wildlife trafficking seizures, flight pricing, available
flights, and indices of general crime prevalence and resilience infrastructure.
The global flight network was collected from OpenFlights.org, which hosts open-source information about airports, routes, and flights. The data was last updated
in January 2017, and we have manually added several airports and routes to ensure that we can place as many seizure records on the flight network as possible.
Overall, this dataset allows us to construct a flight network of 1,933 airport nodes
connected by 14,118 flight edges.
For each flight edge, we record the distance and collect flight pricing estimates using the Skyscanner API skyscanner [2020]. Since flight pricing depends
on several components such as the amount of time before the flight, we collect
prices for all flight routes one month in advance. For each flight edge, we used
the API on October 14, 2021 to request flight quotes for November 2021. The
API did not return valid responses for several airport pairs due to no valid flight
plans existing in the database accessed by the API, which we determined manually by searching Google Flights. Additionally, we note that data was collected
during the coronavirus pandemic impacting flight availability, as historical data
was not available.
Each airport is associated with its country’s metrics reported in the Global
Organized Crime Index for 2021 Crime [2021], the first year the indices were
published by the Global Initiative Against Transnational Organized Crime (GITOC). These indices represent expert opinion of a country’s relationship with
various forms of organized crime, including the prevalence of different criminal
actors, strength of resilience resources, and presence of criminal markets. These
indices score countries from 1 to 10 based on 5 rounds of anonymous and independent expert reviews in 2020. We also add information about whether the
airport’s country is a member of CITES, the city’s population, and the number
of flights that serve the given airport. The node and edge features we collected
are summarized in Table 3.1.
We obtained seizure data from the Wildlife Trade Portal (WTP) TRAFFIC
[2021] through which TRAFFIC provides historical seizure data with detailed
records like intended itinerary (source, destination, transit points), trafficked
wildlife, trafficker details, and legal outcomes. In total, we accessed 1,067 records
between 2017 and 2021 to synthesize a dataset of 454 itineraries of wildlife trafficking. Only 362 of the 1,933 airport nodes in the global flight network are used
by traffickers in the historical seizure data, highlighting the data sparsity. Furthermore, in terms of the paths themselves, the data is biased towards shorter
paths, having 1-hop, 2-hop, 3-hop, and 4-hop paths making up 60.6%, 24.2%, 15%,
and 0.2% of the data respectively.
Seizure data provides a glimpse of how WT networks operate, alerts experts to trends in supply and demand for different species, and points to key locations
for deterring wildlife crime Kurland and Pires [2017]. However, it is important
to understand the biases in seizure data due to being collected by different law
enforcement agencies, using several means of detection, against various criminal
agents CITES, Gore et al. [2021]. As a result, seizure data not only reflects the
criminal network, but also the defensive resources. Nevertheless, seizure data is
one of the few tools we have available to peer into WT networks in a scalable
manner.
3.5 Experiments
3.5.1 Feature Selection
Feature selection identifies the highest-impact features, limits overfitting, and
avoids correlated features. We use recursive feature elimination to iteratively
remove the least-useful feature from the current feature set by testing each of
them and evaluating the change in 10-fold path recall. Since node features appear
twice for a given edge, once for the edge’s head and again for the tail, we drop
both as needed. The full set and selected features in bold are in Table 3.1.
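The elimination loop described above can be sketched in a few lines. The scoring function below is a hypothetical stand-in for retraining the model and measuring 10-fold path recall on a feature subset; the feature names and scores are illustrative only:

```python
def recursive_feature_elimination(features, score_fn, min_features=1):
    """Iteratively drop the feature whose removal hurts the score least,
    stopping once every removal decreases the score. `score_fn(subset)`
    stands in for retraining and measuring 10-fold path recall."""
    selected = list(features)
    history = [(tuple(selected), score_fn(selected))]
    while len(selected) > min_features:
        candidates = [[f for f in selected if f != drop] for drop in selected]
        best = max(candidates, key=score_fn)
        if score_fn(best) < history[-1][1]:
            break                      # every removal hurts: stop eliminating
        selected = best
        history.append((tuple(selected), score_fn(selected)))
    return selected, history

# Toy stand-in scorer: only "price" and "distance" help; extras add noise.
useful = {"price", "distance"}
score = lambda s: len(useful & set(s)) - 0.01 * len(set(s) - useful)

chosen, _ = recursive_feature_elimination(
    ["price", "distance", "population", "flight_count"], score)
print(sorted(chosen))   # ['distance', 'price']
```

Note that in the thesis setting a node feature appears twice per edge (head and tail), so both copies are dropped together; the sketch treats features atomically.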
3.5.2 Metrics
We train the models using the Adam optimizer with amsgrad Reddi et al. [2018],
and evaluate the models using 10-fold cross-validation, splitting the dataset by
source-destination pair. This produces a prediction for every source-destination path using a model that wasn't trained on information from the given source-destination pair. We evaluate using several metrics at the path and edge level.
Path Recall is the percent of the ground truth paths the model completely
predicted correctly. Given the N ground truth paths in the dataset D_π, the path recall is as follows:

path recall = (1/N) Σ_{π_{s,d} ∈ D_π} δ(π_{s,d} = π̂_{s,d}).   (3.4)
Here δ is 1 if the paths are completely equal (taking the same sequence of
edges) and 0 otherwise. Higher values here mean that our model is not likely to
miss out on trafficked paths.
Edge Recall is the percent of trafficked edges that our model predicts to have
trafficking. Mathematically this is
edge recall = [ Σ_{π_{s,d} ∈ D_π} Σ_{e ∈ π_{s,d}} δ(e ∈ π̂_{s,d}) ] / [ Σ_{π_{s,d} ∈ D_π} |π_{s,d}| ].   (3.5)
High values here mean that a large proportion of observed trafficked edges
are picked up by our model.
Edge Precision measures the percent of edges that our model predicts to have trafficking which did in fact exhibit trafficking in the seizure data. Mathematically this is

edge precision = [ Σ_{π_{s,d} ∈ D_π} Σ_{e ∈ π̂_{s,d}} δ(e ∈ π_{s,d}) ] / [ Σ_{π_{s,d} ∈ D_π} |π̂_{s,d}| ].   (3.6)
High values here mean that our model’s predictions are trustworthy and that
domain users can expect that the model’s predictions will likely contain trafficking.
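The three path- and edge-level metrics can be computed directly from predicted and observed itineraries. A minimal sketch, where itineraries are lists of airports and the airport codes are purely illustrative:

```python
def edges(path):
    """Directed flight legs of an itinerary (a list of airport codes)."""
    return list(zip(path, path[1:]))

def path_recall(truth, pred):
    # Eq. 3.4: fraction of ground-truth itineraries predicted exactly.
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)

def edge_recall(truth, pred):
    # Eq. 3.5: fraction of trafficked edges appearing in the predictions.
    hit = sum(e in edges(p) for t, p in zip(truth, pred) for e in edges(t))
    return hit / sum(len(edges(t)) for t in truth)

def edge_precision(truth, pred):
    # Eq. 3.6: fraction of predicted edges appearing in the trafficked paths.
    hit = sum(e in edges(t) for t, p in zip(truth, pred) for e in edges(p))
    return hit / sum(len(edges(p)) for p in pred)

truth = [["JFK", "AMS", "LOS"], ["LAX", "PEK"]]
pred  = [["JFK", "CDG", "LOS"], ["LAX", "PEK"]]
print(path_recall(truth, pred))     # 0.5: one of two paths exactly right
print(edge_recall(truth, pred))     # 1 of 3 true edges predicted
print(edge_precision(truth, pred))  # 1 of 3 predicted edges are true
```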
Edit Distance, or Levenshtein distance Levenshtein et al. [1966], is the smallest number of “edits” (additions, removals, or substitutions) needed to go from the predicted to the ground truth path, considering the itinerary to be a sequence of airports visited. Low edit distance means that the predicted paths are similar to the observed paths.
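The edit distance itself is the classic dynamic program; a minimal sketch over itineraries as airport sequences (codes illustrative):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, or substitutions needed to
    turn sequence `a` into sequence `b` (classic dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute x -> y
        prev = curr
    return prev[-1]

# Itineraries compared as sequences of airports (codes are illustrative).
print(levenshtein(["JFK", "AMS", "LOS"], ["JFK", "CDG", "LOS"]))  # 1
print(levenshtein(["LAX", "PEK"], ["LAX", "HKG", "PEK"]))         # 1
```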
3.5.3 Results Discussion
We present numerical results in Table 3.2, computing the average and standard
deviation in performance with 10-fold cross-validation. Given the data size of
only 454 itineraries, the differences in performance come from only a few predicted paths. Additionally, both models benefit from feature selection, and feature-selected path-integrated learning improves over edge-myopic learning. The
performance of path-integrated learning with feature selection is high in that the
models are able to recall 92.4% of the paths completely, 88.4% of the edges, and
the predicted edges contain trafficking at a rate of 90.5%. Additionally, breaking
the results down by path length, we find that on the 1-hop, 2-hop, 3-hop, and
4-hop paths, path-integrated learning with selected features gets path recalls of
98.9%, 83.1%, 80.7%, and 100%, respectively, and average Levenshtein distances of
0.031, 0.169, 0.192, and 0 respectively. On the other hand, across 1-hop, 2-hop, 3-
hop, and 4-hop paths, the edge-myopic learning with selected features has path
recall of 100%, 83.1%, 58.6%, and 0%, respectively, with Levenshtein distances of 0.0,
0.261, 0.314, and 2.0, respectively. Across all folds, our model disagrees with the ground truth on 31 paths between 25 origin-destination pairs. We note that our performance
improvements aren’t statistically significant for most metrics, except for edge
precision, due to our small sample size. However, path-integrated learning gives fewer errors in our low-data regime, which is promising for future work when more data is available for evaluation. Given the data bias, the
predicted alternate routes may contain wildlife trafficking even though it is not
present in the ground-truth data. We visualize representative discrepancies between predicted and observed paths in Fig. 3.2 with predicted paths in blue and
observed paths in red. We categorize the discrepancies into 10 origin-destination
pairs where our predictions shortcut the observed itinerary by removing stops
(Fig. 3.2a), and 13 cases where our model predicts different layovers than observed (Fig. 3.2b, 3.2c), identified as highly plausible in informal consultations with experts. The two cases where our model predicted additional layovers are visualized in Fig. 3.2d and are likely errors.
We visualize the path-integrated learning model’s feature importance in Fig.
3.3. Here, positive values mean that high feature values induce high estimated
probability, whereas negative values mean that high feature values induce low
estimated probability. Overall, the model considers that traffickers are likely to
travel to locations with high arms trafficking as well as resilience against money
laundering. The convergence between wildlife trafficking and arms trafficking
has been documented and has broad implications for interdiction Spevack [2021].
Additionally, the model predicts that traffickers are less likely to enter regions
with high flora crime, criminal networks, or human trafficking. The negative
value for the flight destination’s flora crimes is interesting and warrants further
investigation, and may reflect seizure data bias, or traffickers wanting to flee
suspicion. Some features have zero weight because features were selected based on edge-myopic learning and because some features are correlated. Ultimately, we propose a
model and a training approach, presenting promising results with the best data
available so far. More in-depth and robust conclusions about wildlife trafficking
route prediction can be made in the future as more complete seizure data and
richer feature sets become available, which can leverage our modeling work.
3.5.4 Conclusion
We approach the problem of predicting wildlife trafficking on the flight transportation network with differentiable optimization. To align our network training with the goal of correctly identifying full paths, we train with a differentiable
highest-probability path solver. We show that a path-integrated learning model
learns over the available airport and flight features with limited training data,
slightly improving over edge-myopic learning, and can likely further improve
as more seizure data is collected. Lastly, we identify several features that may
contribute to traffickers being more likely to take a given path. We hope that our
method will help inform interdiction efforts and the study of wildlife trafficking
networks, and we intend to use our predictions in conjunction with combinatorial interdiction in future work.
NODE FEATURES
METADATA: Population, CITES membership, Flight Count
GITOC - CRIMINAL MARKETS: Criminal Markets (Average), Fauna Crimes, Human Trafficking, Heroin Trade, Human Smuggling, Cocaine Trade, Arms Trafficking, Cannabis Trade, Flora Crimes, Synthetic Drug Trade, Non-Renewable Resource Crimes
GITOC - RESILIENCE: Anti-Money Laundering, Resilience (Average), Political Leadership And Governance, Territorial Integrity, Govt. Transparency & Accountability, Law Enforcement, Economic Regulatory Capacity, International Cooperation, Victim & Witness Support, National Policies & Laws, Judicial System And Detention, Prevention
GITOC - CRIMINAL ACTOR: Criminal Actors (Average), State-Embedded Actors, Mafia-Style Groups, Foreign Actors, Criminal Networks, Non-State Actors
EDGE FEATURES: Price, Distance
Table 3.1: Node and edge features of the flight network. Features in Bold were selected by recursive feature elimination.
Method Features Path recall ↑ Edge recall ↑ Edge precision ↑ Edit distance ↓
Edge-Myopic Selected 89.6% ± 1.2 83.1% ± 2.6 86.1% ± 1.0 0.115 ± 0.04
Path-Integrated Selected 92.4% ± 2.7 88.4% ± 4.3 90.5% ± 2.6 0.088 ± 0.03
Edge-Myopic All 89.2% ± 2.5 82.8% ± 3.5 85.5% ± 3.1 0.113 ± 0.03
Path-Integrated All 89.1% ± 3.1 82.6% ± 4.0 85.4% ± 3.4 0.113 ± 0.03
Table 3.2: Summary statistics from 10-fold cross-validation of models using either the full set of features or an algorithmically-selected subset. We evaluate two training methods: edge-predictive myopic learning, which aims to correctly predict how often individual edges are used, and path-integrated learning, which aims to identify the complete intended source-destination path. We report the average performance across folds with 95% confidence intervals.
Figure 3.2: Visualization of the discrepancies between Path-Integrated predicted itineraries in blue and observed itineraries in red. Additionally, domain experts identified two likely errors in Fig. 3.2d where our model’s predictions are unrealistic.
Figure 3.3: Feature importance boxplot of the model coefficients across 10 training folds. Positive values suggest higher trafficking rates when the indicator is
prevalent and negative values indicate lower rates when the indicator is prevalent.
Chapter 4
SurCo: Learning Linear Surrogates For
Combinatorial Nonlinear Optimization Problems
“Our life is frittered away by detail. Simplify, simplify.” - Henry David Thoreau
Optimization problems with nonlinear cost functions and combinatorial constraints appear in many real-world applications but remain challenging to solve
efficiently compared to their linear counterparts. To bridge this gap, we propose SurCo that learns linear Surrogate costs which can be used in existing
Combinatorial solvers to output good solutions to the original nonlinear combinatorial optimization problem. The surrogate costs are learned end-to-end with nonlinear loss by differentiating through the linear surrogate solver, combining the flexibility of gradient-based methods with the structure of linear combinatorial optimization. We propose three SurCo variants: SurCo-zero for individual nonlinear problems, SurCo-prior for problem distributions, and SurCo-hybrid to combine both distribution and problem-specific information.

Figure 4.1: Overview of our proposed framework SurCo.
We give theoretical intuition motivating SurCo, and evaluate it empirically. Experiments show that SurCo finds better solutions faster than state-of-the-art and
domain expert approaches in real-world optimization problems such as embedding table sharding, inverse photonic design, and nonlinear route planning.
4.1 Introduction
Combinatorial optimization problems with linear objective functions such as
mixed integer linear programming (MILP) [Wolsey, 2007], and occasionally linear programming (LP) [Chvatal et al., 1983], have been extensively studied in operations research (OR). The resulting high-performance solvers like Gurobi [Gurobi
96
Optimization, LLC, 2022] can solve industrial-scale optimization problems with
tens of thousands of variables in a few minutes.
However, even with perfect solvers, one issue remains: the cost functions
f(x) in many practical problems are nonlinear, and the highly-optimized solvers
mainly handle linear or convex formulations while real-world problems have less
constrained objectives. For example, in embedding table sharding [Zha et al.,
2022a] one needs to distribute embedding tables to multiple GPUs for the deployment of recommendation systems. Due to the batching behaviors within a
single GPU and communication cost among different GPUs, the overall latency
(cost function) in this application depends on interactions of multiple tables and
thus can be highly nonlinear [Zha et al., 2022a].
To obtain useful solutions to real-world problems, one may choose to directly optimize the nonlinear cost, which can be the black-box output of a simulator [Gosavi et al., 2015, Ye et al., 2019], or the output of a cost estimator learned
by machine learning techniques (e.g., deep models) from offline data [Steiner
et al., 2021, Koziel et al., 2021, Wang et al., 2021b, Cozad et al., 2014]. However, many of these direct optimization approaches either rely on human-defined
heuristics (e.g., greedy [Korte and Hausmann, 1978, Reingold and Tarjan, 1981,
Wolsey, 1982], local improvement [Voß et al., 2012, Li et al., 2021]), or resort to
general nonlinear optimization techniques like gradient descent [Ruder, 2016],
reinforcement learning [Mazyavkina et al., 2021], or evolutionary algorithms [Simon, 2013]. While these approaches can work in certain settings, they may lead
to a slow optimization process, in particular when the cost function is expensive
to evaluate, and they often ignore the combinatorial nature of most real-world
applications.
In this work, we propose a systematic framework SurCo that leverages existing efficient combinatorial solvers to find solutions to nonlinear combinatorial
optimization problems arising in real-world scenarios. When only one nonlinear
differentiable cost f(x) needs to be minimized, we propose SurCo-zero, which optimizes a linear surrogate cost ĉ so that the surrogate optimizer (SO) min_{x∈Ω} ĉ^⊤x outputs a solution that is expected to be optimal w.r.t. the original nonlinear cost f(x). Due to its linear nature, SO can be solved efficiently with existing solvers, and the surrogate cost ĉ can be optimized in an end-to-end manner by back-propagating through the solver via methods proposed in previous
work [Pogančić et al., 2020, Niepert et al., 2021b, Berthet et al., 2020].
Thus, SurCo is a general-purpose method for solving combinatorial nonlinear
optimization. Off-the-shelf nonlinear optimizers are often not directly applicable
to these problem domains and often require domain-specific solution methodologies to give high-quality solutions in a reasonable amount of time, and solution prediction methods fail to give combinatorially feasible solutions without
problem-specific intervention. Here, learning a linear surrogate problem ensures
that the surrogate solver is practically efficient, yields gradient information for
offline training, and generates solutions that are combinatorially feasible.
When solving a family of nonlinear differentiable functions f(x; y) parameterized by instance description y, the surrogate coefficients ĉ(y; θ) are learned on a set of optimization instances (called the training set {y_i}), by optimizing the parameters θ. For an unseen held-out instance y′, we propose SurCo-prior, which directly optimizes the linear SO: x̂*(y′) := arg min_{x∈Ω(y′)} ĉ^⊤(y′; θ) x to get the solution, avoiding optimizing the cost f(x; y′) from scratch. Based on the solution predicted by SurCo-prior, we also propose SurCo-hybrid, which fine-tunes the surrogate costs ĉ with SurCo-zero to leverage both domain knowledge synthesized offline and information about the specific instance. We provide a comprehensive description of SurCo in Section 4.3.
We evaluate SurCo in three settings: embedding table sharding [Zha et al.,
2022a], photonic inverse design [Schubert et al., 2022], and nonlinear route planning Fan et al. [2005]. In the on-the-fly setting, SurCo-zero achieves higher
Symbol | Description
y | Parametric description of a specific instance.
x | A solution to an instance.
f(x; y) | The nonlinear objective (w.r.t. x) for an instance y.
Ω(y) | The feasible region of an instance y.
x̂*(y) | The optimal SO solution to an instance y.
c(y) | The surrogate coefficients for instance y.
Table 4.1: Notations used in this work.
quality solutions in comparable or less runtime, thanks to the help of an efficient
combinatorial solver. In SurCo-prior, our method obtains better solutions on
held-out problems compared to other methods that require training (e.g., reinforcement learning).
We compare SurCo at a high level with related work integrating learning
and optimization at the end of our paper. We additionally present theoretical
intuition that helps motivate why training a model to predict surrogate linear
coefficients may exhibit better sample complexity than previous approaches that
directly predict the optimal solution [Li et al., 2018, Ban and Rudin, 2019].
4.2 Problem Specification
Our goal is to solve the following nonlinear optimization problem described by y:

min_x f(x; y)  s.t.  x ∈ Ω(y)   (4.1)

where x ∈ R^n are the n variables to be optimized, f(x; y) is the nonlinear differentiable cost function to be minimized, Ω(y) is the feasible region, typically specified by linear (in)equalities and integer constraints, and y ∈ Y are the problem instance parameters drawn from a distribution D over Y. For example, in the traveling salesman problem, y can be the distance matrix among cities.
Differentiable cost function. The nonlinear cost function f(x; y) can either be given analytically, or the result of a simulator made differentiable via
finite differencing (e.g., JAX [Bradbury et al., 2018]). If the cost function f(x; y)
is not differentiable as in one of our experimental settings, we can use a cost
model that is learned from an offline dataset, often generated via sampling multiple feasible solutions within Ω(y), and recording their costs. In this work, we
assume the following property of f(x; y):
Assumption 4.2.1 (Differentiable cost function). During optimization, the cost
function f(x; y) and its partial derivative ∂f/∂x are accessible.
Learning a good nonlinear cost model f is non-trivial for practical applications (e.g., AlphaFold [Jumper et al., 2021], Density Functional Theory [Nagai
et al., 2020], cost model for embedding tables [Zha et al., 2022a]) and is beyond
the scope of this work.
Methods | Nonlinear objectives | Objective can be free form | Training Set | Generalize to unseen instances | Handles constraints
Gradient Descent | Yes | Yes | N/A | No | No
Evolutionary Algorithm | Yes | Yes | N/A | No | No
Nonlinear combinatorial solvers | Yes | No | N/A | No | Yes
Learning direct mapping | Yes | Yes | {y_i, x*_i} | Yes | No
Predict-then-optimize | Limited | No | {y_i, x*_i} | Yes | Yes
SurCo (proposed) | Yes | Yes | {y_i} | Yes | Yes
Table 4.2: Conceptual comparison of optimizers (both traditional and ML-guided). Our approach (SurCo) can handle nonlinear objectives without a predefined analytical form, does not require pre-computed optimal solutions in its training set, can handle combinatorial constraints (via commercial solvers it incorporates), and can generalize to unseen instances.
Evaluation Metric. We mainly focus on two aspects: the solution quality
evaluated by f(x̂; y), and the number of queries of f during optimization to achieve the solution x̂. For both, smaller measurements are favorable, i.e., fewer queries of f to get solutions closer to the global optimum.
When f(x; y) is linear w.r.t x, and the feasible region Ω(y) can be encoded
using mixed integer programs or other mathematical programs, the problem can
be solved efficiently using existing scalable optimization solvers. When f(x; y)
is nonlinear, we propose SurCo, which learns a surrogate linear objective function, which allows us to leverage these existing scalable optimization solvers, and which results in a solution that has high quality with respect to the original hard-to-encode objective function f(x; y).
4.3 SurCo: Learning Linear Surrogates
4.3.1 SurCo-zero: on-the-fly optimization
We start from the simplest case in which we focus on a single instance with
f(x) = f(x; y) and Ω = Ω(y). SurCo-zero aims to optimize the following
objective:
(SurCo-zero):  min_c L_zero(c) := f(g_Ω(c))   (4.2)

where the surrogate optimizer g_Ω : R^n → R^n is the output of certain combinatorial solvers with linear cost weight c ∈ R^n and feasible region Ω ⊆ R^n. For example, g_Ω can be the following:

g_Ω(c) := arg min_x c^⊤x  s.t.  x ∈ Ω := {Ax ≤ b, x ∈ Z^n}   (4.3)
which is the output of a MILP solver. Thanks to previous works [Ferber et al.,
2020, Pogančić et al., 2020], we can efficiently compute the partial derivative
∂gΩ(c)/∂c. Intuitively, this means that gΩ(c) can be backpropagated through.
Since f is also differentiable with respect to the solution it is evaluating, we
thus can optimize Eqn. 4.2 in an end-to-end manner using any gradient-based
optimizer:
c(t + 1) = c(t) − α (∂g_Ω/∂c)(∂f/∂x),   (4.4)
where α is the learning rate. The procedure starts from a randomly initialized
c(0) and converges to a locally optimal solution of c. While Eqn. 4.2 is still a nonlinear optimization and there is no guarantee about the quality of the final solution
c, we argue that optimizing Eqn. 4.2 is better than optimizing the original nonlinear cost minx∈Ω f(x). Furthermore, while we cannot guarantee optimality,
we guarantee feasibility by leveraging a linear combinatorial solver.
Intuitively, instead of optimizing directly over the solution space x, we optimize over the space of surrogate costs c, and delegate the combinatorial feasibility requirements of the nonlinear problem to SoTA combinatorial solvers.
Compared to naive approaches that directly optimize f(x) via general optimization techniques, our method readily handles complex constraints of the feasible
regions, and thus makes the optimization procedure easier. Furthermore, it also
helps escape from local minima, thanks to the embedded search component of
existing combinatorial solvers (e.g., branch-and-bound [Land and Doig, 2010] in
MILP solvers). As we see in the experiments, this is particularly important when
the problem becomes large-scale with more local optima. This approach works
well when we are optimizing individual instances and may not have access to
offline training data or the training time is cost-prohibitive.
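The update in Eqn. 4.4 can be sketched end-to-end on a toy instance. Since the blackbox-differentiation methods cited above are beyond a few lines, this sketch substitutes a softmin relaxation over explicitly enumerated feasible vertices to obtain gradients; the feasible set, nonlinear cost f, and hyperparameters are all illustrative assumptions, not the thesis's actual setup:

```python
import numpy as np

# Feasible region given by its enumerated vertices; the real method instead
# calls a MILP solver g_Omega(c) = argmin_x c.x over {Ax <= b, x integer}.
V = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

def f(x):
    # Illustrative nonlinear cost; its best feasible vertex is (1, 1).
    return (x[0] - 0.9) ** 2 + (x[1] - 1.1) ** 2

def solver(c):
    """Exact surrogate optimizer g_Omega: the vertex minimizing c.x."""
    return V[np.argmin(V @ c)]

def soft_solver(c, temp=1.0):
    """Differentiable stand-in (softmin over vertices) substituted here for
    the blackbox-differentiation methods cited in the text."""
    w = np.exp(-(V @ c) / temp)
    w /= w.sum()
    return w @ V

def grad_c(c, eps=1e-4):
    # Finite-difference gradient of f(soft_solver(c)) w.r.t. c (Eqn. 4.4).
    g = np.zeros_like(c)
    for i in range(len(c)):
        d = np.zeros_like(c)
        d[i] = eps
        g[i] = (f(soft_solver(c + d)) - f(soft_solver(c - d))) / (2 * eps)
    return g

c = np.array([1.0, 1.0])      # initialization; initially solver(c) = (0, 0)
for _ in range(200):          # gradient descent on the surrogate cost c
    c -= 0.5 * grad_c(c)

x_hat = solver(c)             # final answer is always a feasible vertex
print(x_hat, f(x_hat))
```

The key property the sketch illustrates: optimization happens in the space of surrogate costs c, while feasibility is delegated entirely to the (here trivially enumerated) combinatorial solver.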
Limitation. Note that due to the linear surrogate, our approach will always
return a vertex in the feasible region, while the solution to the original nonlinear
objective may be in the interior. We leave this limitation for future work. In
many real-world settings, such as in the three domains we tested, the solutions
are indeed on the vertices of feasible regions.
4.3.2 SurCo-prior: offline surrogate training
We now consider a more general case where we have N optimization instances,
each parameterized by an instance description y_i, i = 1, ..., N, and we want to find their solutions to a collection of nonlinear loss functions f(x; y_i) simultaneously. Here we write D_train := {y_i}_{i=1}^N as the training set. A naive approach is just to apply SurCo-zero N times, which leads to N independent surrogate costs {c_i}_{i=1}^N. However, this approach does not consider two important characteristics. First, it fails to leverage a possible relationship between the instance descriptor y_i and its associated surrogate cost c_i, since every surrogate cost is
independently estimated. Second, it fails to learn any useful knowledge from the
N instances after optimization. As a result, for an unseen instance, the entire
optimization process needs to be conducted again, which is slow. This motivates
us to add a surrogate cost model ĉ(y; θ) into the optimization as a regularizer:

(SurCo-prior-λ):  min_{θ, {c_i}} L_prior(θ, {c_i}; λ) := Σ_{i=1}^N [ f(g_{Ω(y_i)}(c_i); y_i) + λ‖c_i − ĉ(y_i; θ)‖² ]   (4.5)
The regressor model ĉ(y; θ) directly predicts the surrogate cost from the instance description. The form of the regressor can be a neural network, in which case θ is its parameters. Note that when λ = 0, it reduces to N independent optimizations, while when λ > 0, the surrogate costs {c_i} interact with each other. With the regressor, we distill knowledge gained from the optimization procedure into θ, which can be used for an unseen instance y′. Indeed, we use the learned regressor model to predict the surrogate cost c′ = ĉ(y′; θ), and directly solve the surrogate optimization (SO):

x̂*(y′) = arg min_{x ∈ Ω(y′)} ĉ^⊤(y′; θ) x   (4.6)
A special case is when λ → +∞, where we directly learn the network parameters θ instead of individual surrogate costs:

(SurCo-prior):  min_θ L_prior(θ) := Σ_{i=1}^N f(g_{Ω(y_i)}(ĉ(y_i; θ)); y_i)   (4.7)
This approach is useful when the goal is to find high-quality solutions for unseen instances of a problem distribution when the upfront cost of offline training
is acceptable but the cost of optimizing on-the-fly is prohibitive. Here, we require access to a distribution of training optimization problems, but at test time
only require the feasible region and not the nonlinear objective. Different from
predict-then-optimize Elmachtoub and Grigas [2022], Ferber et al. [2020] or ML
optimizers Ban and Rudin [2019], we do not require the optimal solutions {x*_i}_{i=1}^N as part of the training set.
4.3.3 SurCo-hybrid: fine-tuning a predicted surrogate
Naturally, we consider SurCo-hybrid, a hybrid approach which initializes the
coefficients of SurCo-zero with the coefficients predicted from SurCo-prior, which was trained on offline data. This allows SurCo-hybrid to start out optimization from an initial prediction that has good performance for the distribution at large but which is then fine-tuned for the specific instance. Formally, we initialize c(0) = ĉ(y_i; θ) and then continue optimizing c based on the update
from SurCo-zero. This approach is geared towards optimizing the nonlinear
objective using a high-quality initial prediction that is based on the problem distribution and then fine-tuning the objective coefficients based on the specific
problem instance at test time. Here, high performance comes at the runtime cost
of both having to train offline on a problem distribution as well as performing
fine-tuning steps on-the-fly. However, this additional cost is often worthwhile
when the main goal is to find the best possible solutions by leveraging synthesized domain knowledge in combination with individual problem instances as
arises in chip design [Mirhoseini et al., 2021] and compiler optimization [Zhou
et al., 2020].
4.4 Predicting Surrogate Cost vs Predicting Solution,
A Theoretical Analysis
One of the key ingredients of our proposed methods (SurCo-prior and SurCo-hybrid) is to learn a model to predict surrogate cost c from instance description y, which is in contrast with previous solution regression approaches that directly learn a mapping from problem description y to the solution x*(y) [Ban and Rudin, 2019]. A natural question arises: which one is better?
In this section, we give theoretical intuition to compare the two approaches using a simple 1-nearest-neighbor (1-NN) solution regressor [Fix, 1985]. We first relate the number of samples needed to learn any mapping to its Lipschitz constant L, and then show that for the direct mapping y ↦ x*(y), L can be very large. Therefore, there exist fundamental difficulties in learning such a mapping. When this happens, we can still find a surrogate cost mapping y ↦ c*(y) with finite L that leads to the optimal solution x*(y) of the original nonlinear problems.
4.4.1 Lipschitz constant and sample complexity
Formally, consider fitting any mapping ϕ : R^d ⊇ Y → R^m with a dataset C := {(y_i, ϕ_i)}. Here Y is a compact region with finite volume vol(Y). The Lipschitz constant L is the smallest number so that ‖ϕ(y_1) − ϕ(y_2)‖₂ ≤ L‖y_1 − y_2‖₂ holds for any y_1, y_2 ∈ Y. The following shows that if the dataset covers the space Y, we could achieve high-accuracy prediction: ‖ϕ(y) − ϕ̂(y)‖₂ ≤ ϵ for any y ∈ Y.
Definition 4.4.1 (δ-cover). A dataset C := {(y_i, ϕ_i)}_{i=1}^N δ-covers the space Y, if for any y ∈ Y, there exists at least one y_i so that ‖y − y_i‖₂ ≤ δ.
Lemma 4.4.2 (Sufficient condition of prediction with ϵ-accuracy). If the dataset C can (ϵ/L)-cover Y, then for any y ∈ Y, a 1-nearest-neighbor regressor ϕ̂ leads to ‖ϕ̂(y) − ϕ(y)‖₂ ≤ ϵ.
Lemma 4.4.3 (Lower bound of sample complexity for ϵ/L-cover). To achieve an ϵ/L-cover of Y, the size of the dataset must satisfy N ≥ N_0(ϵ) := (vol(Y)/vol_0)(L/ϵ)^d, where vol_0 is the volume of the unit ball in d dimensions.
Please find all proofs in the Appendix. While we do not rule out a more advanced regressor than 1-nearest-neighbor that could lead to better sample complexity, the lemmas demonstrate that the Lipschitz constant L plays an important
role in sample complexity.
Figure 4.2: Table placement plan latency (left) and solver runtime (right) on the DLRM-10 through DLRM-60 settings, comparing Domain Heuristic, Greedy, SurCo-zero, DreamShard, SurCo-prior, and SurCo-hybrid. We evaluate SurCo against Dreamshard [Zha et al., 2022b], a SoTA offline RL sharding tool, a domain heuristic of assigning tables based on dimension, and a greedy heuristic based on the estimated runtime increase. Striped approaches require pre-training.
4.4.2 Difference between Cost and Solution Regression
In the following, we will show that in certain cases, the direct prediction y ↦ x*(y) could have an infinitely large Lipschitz constant L. To show this, let us consider a general mapping ϕ : R^d ⊇ Y → R^m. Let ϕ(Y) be the image of Y under mapping ϕ and κ(Y) be the number of connected components of region Y.
Theorem 4.4.4 (A case of infinite Lipschitz constant). If the minimal distance d_min between different connected components of ϕ(Y) is strictly positive, and κ(ϕ(Y)) > κ(Y), then the Lipschitz constant of the mapping ϕ is infinite.
Note that this theorem applies to a wide variety of combinatorial optimization problems. For example, when Y is a connected region and the optimization problem can be formulated as an integer program, the optimal solution set x*(Y) := {x*(y) : y ∈ Y} is a discrete set of integral vertices, so the theorem applies. Combined with the analysis in Sec. 4.4.1, we know the mapping y ↦ x*(y) is hard to learn even with a lot of samples.
We can see this more clearly with a concrete example in 2D space. Let the 1D instance description be y ∈ [0, π/2], and let the feasible region be the convex hull of 3 vertices {(0, 0), (0, 1), (1, 0)}. The nonlinear objective is simply f(x; y) := (x_1 cos(y) + x_2 sin(y))², in which x = (x_1, x_2) is the 2D solution vector. The direct mapping y ↦ x* maps a continuous region of instance descriptions (i.e., y ∈ [0, π/2]) into 2 disjoint points (x* = (0, 1) and x* = (1, 0)), and thus according to Theorem 4.4.4, its Lipschitz constant must be infinite. In contrast, there exists a surrogate cost mapping c(y) = [cos(y), sin(y)]^⊤, and the mapping y ↦ c has finite Lipschitz constant (actually L ≤ 1) and can be learned easily.
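A few lines verify the jump numerically. To reproduce the stated optima x* = (0, 1) and x* = (1, 0), we take the optimal vertex to be the one maximizing (x_1 cos(y) + x_2 sin(y))², a sign convention we assume here, and compare difference quotients of the two mappings around y = π/4:

```python
import numpy as np

# Vertices of the feasible triangle from the example above.
V = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])

def x_star(y):
    # Optimal vertex, taken (by assumption, to reproduce the stated optima
    # (0,1) and (1,0)) as the maximizer of (x1 cos(y) + x2 sin(y))^2.
    return V[np.argmax((V @ [np.cos(y), np.sin(y)]) ** 2)]

c = lambda y: np.array([np.cos(y), np.sin(y)])   # smooth surrogate cost map

lo, hi = np.pi / 4 - 1e-6, np.pi / 4 + 1e-6
jump = np.linalg.norm(x_star(hi) - x_star(lo)) / (hi - lo)     # blows up
smooth = np.linalg.norm(c(hi) - c(lo)) / (hi - lo)             # stays <= 1
print(f"solution-map slope ~ {jump:.0f}, cost-map slope ~ {smooth:.3f}")
```

The difference quotient of the direct solution mapping grows without bound as the window around π/4 shrinks, while the cost mapping's stays at most 1, matching the Lipschitz argument above.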
4.5 Empirical Evaluation
We evaluate the variants of SurCo on three settings, embedding table sharding,
inverse photonic design, and nonlinear route planning, with the first two being real-world industrial settings. Each setting consists of a family of problem
instances with varying feasible regions and nonlinear objective functions. Additionally, both table sharding and inverse photonic design lack analytical formulations of the objective function, which prevents them from being handled by many off-the-shelf nonlinear solvers like SCIP [Achterberg, 2009].
Figure 4.3: Left: the solution loss (% of failed instances where the design loss is not 0). Right: test-time solver runtime in log scale. For both, lower is better.
We compare against the Pass-Through gradient approach proposed in Schubert
et al. [2022]. We observe that SurCo-prior achieves similar success rates to
the previous approach Pass-Through with a substantially improved runtime.
Additionally, SurCo-zero runs comparably or faster, while finding more valid
solutions than Pass-Through. SurCo-hybrid obtains valid solutions most often
and is faster than SurCo-zero at the expense of pretraining. Striped approaches
use pretraining.
4.5.1 Embedding Table Sharding
The task of sharding embedding tables arises in the deployment of large-scale
neural network models which operate over both sparse and dense inputs (e.g., in
recommendation systems [Zha et al., 2022a,b, 2023, Sethi et al., 2022]). Given T
embedding tables and D homogeneous devices, the goal is to distribute the tables
among the devices such that no device’s memory limit is exceeded, while the
tables are processed efficiently. Formally, let xt,d be the binary variable indicating
whether table t is assigned to device d, and x := {xt,d} ∈ {0, 1}
T D be the
collection of the variables. The optimization problem is minx∈Ω f(x; y) where
Ω(y) := {x : ∀t,P
t
xt,d = 1, ∀d,P
t mtxt,d ≤ M}.
Here the problem description y includes the table memory usages {mt} and the capacity M of each device. Σd xt,d = 1 means each table t should be assigned to exactly one device, and Σt mt·xt,d ≤ M means the memory consumption at each device d should not exceed its capacity. The nonlinear cost function f(x; y) is the latency, i.e., the runtime of the longest-running device. Due to shared computation (e.g., batching) among the group of assigned tables, and communication costs across devices, the objective is highly nonlinear. f(x; y) is well-approximated by the sharding plan runtime estimator proposed by Dreamshard [Zha et al., 2022b]. Note that here, the runtime is approximated by a differentiable function since the real-world deployment runtime isn't differentiable.
SurCo learns to predict T × D surrogate costs ĉt,d, one for each potential table-device assignment. During training, the gradients through the combinatorial solver, ∂g/∂c, are computed via CVXPYLayers [Agrawal et al., 2019a], and the integrality constraints are relaxed. In practice, we obtained mostly integral solutions, in that only one table on any given device was fractional. At test time, we solve for the integer solution using SCIP [Achterberg, 2009], a branch-and-bound MILP solver.
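For concreteness, the relaxed surrogate problem can be sketched with an off-the-shelf LP solver standing in for CVXPYLayers; the instance below (4 tables, 2 devices, random costs) is purely illustrative and not the experimental pipeline:

```python
import numpy as np
from scipy.optimize import linprog

def solve_relaxed_sharding(c_hat, mem, cap):
    """LP relaxation of min_x sum_{t,d} c_hat[t,d] * x[t,d]
    s.t. each table on exactly one device, per-device memory <= cap."""
    T, D = c_hat.shape
    # Equality constraints: for each table t, sum_d x[t,d] = 1.
    A_eq = np.zeros((T, T * D))
    for t in range(T):
        A_eq[t, t * D:(t + 1) * D] = 1.0
    b_eq = np.ones(T)
    # Inequality constraints: for each device d, sum_t mem[t] * x[t,d] <= cap.
    A_ub = np.zeros((D, T * D))
    for d in range(D):
        for t in range(T):
            A_ub[d, t * D + d] = mem[t]
    b_ub = np.full(D, cap)
    res = linprog(c_hat.ravel(), A_ub=A_ub, b_ub=b_ub,
                  A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
    return res.x.reshape(T, D)

# Toy instance: 4 tables with memory [1, 1, 2, 2] on 2 devices of capacity 4.
rng = np.random.default_rng(0)
mem = np.array([1.0, 1.0, 2.0, 2.0])
x = solve_relaxed_sharding(rng.standard_normal((4, 2)), mem=mem, cap=4.0)
print(x)
```

At test time, the same constraint matrices define the MILP handed to an exact solver, with the relaxed bounds replaced by binary variables.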
Settings. We evaluate SurCo on the public Deep Learning Recommendation
Model (DLRM) dataset [Naumov et al., 2019]. We consider 6 settings placing
10, 20, 30, 40, 50, and 60 tables on 4 devices, with a 5GB memory limit on GPU
devices and 100 instances each (50 train, 50 test).
Baselines. For impromptu baselines, Greedy allocates tables to devices based
on predicted latency increase f, and the domain-expert algorithm Domain-Heuristic
balances the aggregate dimension [Zha et al., 2022b]. For SurCo-prior, we evaluate against Dreamshard, the SoTA embedding table sharding algorithm using
offline RL.
Results. As shown in Fig. 4.2, SurCo-zero finds lower-latency sharding plans than the baselines, while it takes slightly longer than Domain-Heuristic and DreamShard due to taking optimization steps rather than building a solution from a heuristic feature or a reinforcement-learned policy. SurCo-prior obtains lower-latency solutions in about the same time as DreamShard, with a slight runtime increase from SCIP. Lastly, SurCo-hybrid obtains the best solutions and has runtime comparable to SurCo-zero. In smaller instances (T ≤ 40), SurCo-prior finds better solutions than its impromptu counterpart, SurCo-zero, likely by escaping local optima through training on a variety of examples. For larger instances with more tables available for placement, SurCo-zero performs better by optimizing for the test instances, as opposed to SurCo-prior, which only uses training data. Using SurCo-hybrid, we obtain the best solutions but incur the upfront pretraining cost and the deployment-time optimization cost.
4.5.2 Inverse Photonic Design
Photonic devices play an essential role in high-speed communication [Marpaung
et al., 2019], quantum computing [Arrazola et al., 2021], and machine learning
hardware acceleration [Wetzstein et al., 2020].
Figure 4.4: Inverse photonic design settings from the Ceviche Challenges [Schubert et al., 2022] (Wavelength Multiplexer, Mode Converter, Waveguide Bend, Beam Splitter; wavelengths 1270nm and 1290nm), along with SurCo-zero solution designs and wavelength intensities. Light is fed in on the left and is routed at the desired intensities to the output by designing the intermediate region. In the Wavelength Multiplexer setting, the two wavelengths of interest are visualized as they are routed to different locations.
The photonic components can be encoded as a binary 2D grid, with each cell either filled or void. There are constraints on which binary patterns are physically manufacturable: only those that can be drawn by a physical brush instrument with a specific cross shape can be manufactured. It remains challenging to find manufacturable designs that satisfy
design specifications like splitting beams of light. An example solution developed
by SurCo is shown in Figure 4.5b: coming from the top, beams are routed to the
left or right, depending on wavelength. The solution is also manufacturable: a
3-by-3 brush cross can fit in all filled and void space. Given the design, existing
work [Hughes et al., 2019] enables differentiation of the design misspecification
cost, evaluated as how far off the transmission intensity of the wavelengths are
from the desired output locations, with zero design loss meaning that the specification is satisfied. Researchers also developed the Ceviche Challenges [Schubert et al., 2022], a standard benchmark of inverse photonic design problems. Formally, a feasible design is a rectangle of pixels, each either filled or void, where both the filled and the void pixels can be expressed as unions of the brush shape. Please see [Schubert et al., 2022] for an in-depth description of the nonlinear objective and feasible region.
[Figure 4.5 panels: (a) design misspecification loss vs. optimization step ("Inverse Photonics Loss Convergence") for Pass-Through, SurCo-zero, and SurCo-hybrid; (b) the device design; (c) the Ez magnitude for the first and second wavelengths.]
Figure 4.5: Inverse photonic design convergence example [Schubert et al., 2022].
In (a), SurCo-zero smoothly lowers the loss while the pass-through baseline converges noisily. Also, SurCo-hybrid quickly fine-tunes an already high-quality
solution. (b) visualizes the SurCo-zero solution and (c) visualizes the two wavelengths of interest which are successfully routed from the top to the bottom.
Settings. We compare our approaches against the Pass-Through method [Schubert et al., 2022] on randomly generated instances of the four problem types from Schubert et al. [2022]: Waveguide Bend, Mode Converter, Wavelength Division Multiplexer, and Beam Splitter. We generate 50 instances in each setting (25 training / 25 test), randomly sampling the locations of the input and output waveguides, or "pipes," where light is taken in and where we desire light to be output. We fix the wavelengths themselves, so the problem description y contains an image description of the problem instance, where each pixel is either "fixed" or "designable". Further generation details are in the appendix. We also evaluated several algorithms described in the appendix, such as genetic algorithms and derivative-free optimization, which failed to find physically feasible solutions. We consider two wavelengths (1270nm / 1290nm), and optimize at a resolution of 40nm, visualizing the test results in Fig. 4.3.
Results. As shown in Fig. 4.3, SurCo-zero consistently finds as many or more valid devices compared to the Pass-Through baseline [Schubert et al., 2022]. Additionally, since the on-the-fly solvers stop when they either find a valid solution or reach a maximum of 200 steps, the runtime of SurCo-zero is slightly lower than that of the Pass-Through baseline. SurCo-prior obtains similar success rates to Pass-Through while taking two orders of magnitude less time, as it does not require expensive impromptu optimization, making SurCo-prior a promising approach for large-scale settings or when solving many slightly-varied instances. Lastly, SurCo-hybrid performs best in terms of solution loss, finding valid solutions more often than the other approaches. It also takes less runtime than the other on-the-fly approaches since it reaches valid solutions faster, although it still requires on-the-fly optimization, so it takes longer than SurCo-prior. We visualize impromptu solver convergence in Fig. 4.5a, where SurCo-zero has smoother and faster convergence than Pass-Through.
4.5.3 Nonlinear Route Planning
Nonlinear route planning arises when one wants to maximize the probability of arrival before a set time in graphs with random edge lengths [Fan et al., 2005, Nikolova et al., 2006, Lim et al., 2013]. These problems occur in risk-aware settings, such as emergency services operators who need to maximize the probability of arriving before a critical time, or settings where driver reward is determined by deadlines.
Figure 4.6: Nonlinear route planning visualization. The goal is to route from the
top left to bottom right corner, with the edge weights being normally distributed,
maximizing the probability of arriving before a set deadline.
Given a graph G with edge lengths drawn from a random distribution, a pair of source and destination nodes s, t, and a time limit T that we would like to arrive before, we select a feasible s–t path Ps,t that maximizes the probability of arriving before the deadline, P[length(Ps,t) ≤ T]. If we assume that the edge times are normally distributed, te ∼ N(µe, σ²e), then we can write the objective as maximizing

f(x; y) = Φ( (T − Σ_{e∈Ps,t} µe) / √(Σ_{e∈Ps,t} σ²e) ),

with Φ being the cumulative distribution function of a standard Gaussian distribution, and with the feasible region Ω(y) being the set of s–t paths in the graph from origin to destination. Explicitly, the problem parameters y are the graph G, the source and destination nodes s, t, the time limit T, and the edge weight distributions specified by the edge means and variances µe, σ²e. We only consider the zero-shot setting without training examples since we need to solve the problem on-the-fly. SurCo trains surrogate edge costs ĉe, finds the shortest path using Bellman–Ford [Bellman, 1958], and differentiates using blackbox differentiation [Pogančić et al., 2020].
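For a fixed path, this objective is cheap to evaluate with the standard-normal CDF written via erf; the edge values below are illustrative:

```python
import math

def on_time_probability(path_edges, mu, var, deadline):
    """P[length(P) <= T] = Phi((T - sum mu_e) / sqrt(sum var_e))
    for independent normally distributed edge times."""
    total_mu = sum(mu[e] for e in path_edges)
    total_var = sum(var[e] for e in path_edges)
    z = (deadline - total_mu) / math.sqrt(total_var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF

# Two edges with mean 1.0 each: a deadline equal to the expected path
# length (the 1.0 * LET setting below) gives on-time probability 0.5.
mu = {"a": 1.0, "b": 1.0}
var = {"a": 0.1, "b": 0.1}
print(on_time_probability(["a", "b"], mu, var, deadline=2.0))  # 0.5
```

Note that the path maximizing this probability generally differs from the least expected time path whenever the deadline is tighter or looser than the expected length, which is what makes the problem nonlinear.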
Settings. We run on a 5×5 grid graph with 25 draws of edge parameters µe ∼ U(0.1, 1) and σ²e ∼ U(0.1, 0.3) · (1 − µe), with U(a, b) being the uniform distribution between a and b. We set deadlines based on the length of the least expected time (LET) path, which is simply the shortest path using µe as weights. We use loose, normal, and tight deadlines of 1.1 LET, 1 LET, and 0.9 LET respectively. The source and destination are opposite corners of the grid graph.
Results. In Fig. 4.7, we compare SurCo-zero against a domain-specific approach that minimizes a linear combination of mean and variance [Nikolova et al., 2006], and against SCIP [Achterberg, 2009]. In this setting, we focus on the zero-shot performance of SurCo, comparing it against two other zero-shot approaches.
Figure 4.7: Comparison of nonlinear route planning probability of arriving on time, across the loose, normal, and tight deadline settings. We compare against a domain heuristic [Nikolova et al., 2006] and SCIP [Achterberg, 2009]. SurCo-zero outperforms the domain heuristic, and is similar to SCIP using less time. SCIP-1s fails to find feasible solutions.
Furthermore, here we are able to encode the objective analytically into SCIP, whereas the objectives of the other settings do not have readily-encodable formulations, relying on neural networks or physical simulation. Since SurCo-zero and the domain approach take much less than 1 second, we use SCIP-1s and find that SCIP cannot find feasible solutions at that time scale. SCIP-30min demonstrates how well a general-purpose method can do given enough time, with SCIP timing out on all instances. We also find that SurCo-zero is able to obtain comparable solutions to SCIP-30min. Furthermore, SurCo-zero consistently outperforms the domain heuristic, finding paths that reach the deadline with 4.5%, 6.5%, and 8.5% higher success rates in the loose, normal, and tight deadline settings respectively. Finally, we found only 2 instances where the domain heuristic beat SurCo-zero.
4.6 Related Work
Differentiable Optimization OptNet [Amos and Kolter, 2017] proposed implicitly differentiating through KKT conditions, a set of linear equations that determine the optimal solution. Follow-up work differentiated through linear programs [Wilder et al., 2019a], submodular optimization problems [Djolonga and Krause, 2017, Wilder et al., 2019a], cone programs [Agrawal et al., 2019a,b], MaxSAT [Wang et al., 2019], mixed integer linear programming [Ferber et al., 2020, Mandi et al., 2020], integer linear programming [Mandi et al., 2020], dynamic programming [Demirovic et al., 2020], blackbox discrete linear optimizers [Pogančić et al., 2020, Rolínek et al., 2020a,b], maximum likelihood estimation [Niepert et al., 2021b], k-means clustering [Wilder et al., 2019b], knapsack [Guler et al., 2022, Demirović et al., 2019], the cross-entropy method [Amos and Yarats, 2020], nonlinear least squares [Pineda et al., 2022], SVM training [Lee et al., 2019], and combining LP variables [Wang et al., 2020a]. SurCo can leverage these differentiable surrogates for different problem domains.
Task-Based Learning Task-based learning solves distributions of linear or quadratic optimization problems with the true objective hidden at test time but available for training [Elmachtoub and Grigas, 2022, Donti et al., 2017, El Balghiti et al., 2019, Liu and Grigas, 2021, Hu et al., 2022]. Donti et al. [2021] predict and correct solutions for continuous nonlinear optimization. Bayesian optimization (BO) [Shahriari et al., 2016] optimizes blackbox functions by approximating the objective with a learned model that can be optimized over. Recent work optimizes individual instances over discrete spaces like hypercubes [Baptista and Poloczek, 2018], graphs [Deshwal et al., 2021], and MILPs [Papalexopoulos et al., 2022]. Data reuse from previous runs has been proposed to optimize multiple correlated instances [Swersky et al., 2013, Feurer et al., 2018]. However, the surrogate Gaussian process (GP) models are memory- and time-intensive in high-dimensional settings. Recent work has addressed GP scalability via gradient updates [Ament and Gomes, 2022]; however, it is unclear whether GPs can scale in conjunction with combinatorial solvers. Machine learning is also used to guide combinatorial algorithms. Several approaches produce combinatorial solutions directly [Zhang and Dietterich, 1995, Khalil et al., 2017a, Kool et al., 2018, Nazari et al., 2018a, Zha et al., 2022a,b]. These approaches are limited to simple feasible regions, iteratively building solutions for problems like routing, assignment, or covering, and fail to handle more complex constraints. Other approaches set parameters that improve solver runtime [Khalil et al., 2016, Bengio et al., 2021]. Similarly, a neural diving approach has been proposed for finding fast MILP solutions [Nair et al., 2020]. This approach requires iteratively solving a subproblem, which in the nonlinear setting may still be hard to solve or even encode.
Learning Latent Space for Optimization We learn latent linear objectives to optimize nonlinear functions, while other approaches learn latent embeddings for faster solving. FastMap [Faloutsos and Lin, 1995] learns latent object embeddings for efficient search, and variants of FastMap are used in graph optimization and shortest path [Cohen et al., 2018, Hu et al., 2022, Li et al., 2019b]. Wang et al. [2020b, 2021a], Yang et al. [2021], and Zhao et al. [2022] use Monte Carlo Tree Search to perform single- and multi-objective optimization by learning to split the search space.
Mixed Integer Nonlinear Programming (MINLP) SurCo-zero operates as a mixed integer nonlinear programming (MINLP) solver, optimizing nonlinear and nonconvex objectives over discrete linear feasible regions. Specialized solvers handle some MINLP variants [Burer and Letchford, 2012, Belotti et al., 2013]; however, scalability in nonconvex settings usually requires problem-specific techniques like piecewise-linear approximation, objective convexification, or exploiting special structure.
4.7 Conclusion
We introduced SurCo, a method for learning linear surrogates for combinatorial nonlinear optimization problems. SurCo learns linear objective coefficients for a surrogate solver, which results in solutions that minimize the nonlinear loss via gradient descent. At its core, SurCo differentiates through the surrogate solver, which maps the predicted coefficients to a combinatorially feasible solution, combining the flexibility of gradient-based optimization with the structure of combinatorial solvers. Our theoretical intuition for SurCo poses promising directions for future work in proving convergence guarantees or generalization bounds. Additionally, improvements to SurCo may enable scalable solving for settings in stochastic optimization, game theory, combinatorial reinforcement learning, and more. We presented three variants of SurCo: SurCo-zero, which optimizes individual instances; SurCo-prior, which trains a coefficient prediction model offline; and SurCo-hybrid, which fine-tunes the coefficients predicted by SurCo-prior on individual test instances. While SurCo's performance is somewhat limited to binary problems due to the lack of interior integer points, we find that many real-world domains operate on binary decision variables. We evaluated variants of SurCo against state-of-the-art approaches on three domains, two of them used in industry, obtaining better solution quality for similar or better runtime in the embedding table sharding domain, quickly identifying viable photonic devices, and finding successful routes in stochastic path planning. Overall, SurCo trains linear surrogate coefficients to point the solver towards high-quality solutions, serving as a general-purpose method that aims to tackle a broad class of combinatorial problems with nonlinear objectives when off-the-shelf solvers fail.
SurCo Supplementary Information
4.A Proofs
Lemma 4.4.2 (Sufficient condition of prediction with ϵ-accuracy). If the dataset C can (ϵ/L)-cover Y, then for any y ∈ Y, a 1-nearest-neighbor regressor ϕ̂ leads to ∥ϕ̂(y) − ϕ(y)∥₂ ≤ ϵ.

Proof. Since the dataset is an (ϵ/L)-cover, for any y ∈ Y, there exists at least one yi so that ∥y − yi∥₂ ≤ ϵ/L. Let ynn be the nearest neighbor of y; then we have

∥y − ynn∥₂ ≤ ∥y − yi∥₂ ≤ ϵ/L. (4.8)

From the Lipschitz condition and the definition of the 1-nearest-neighbor regressor (ϕ̂(y) = ϕ(ynn)), we know that

∥ϕ(y) − ϕ̂(y)∥₂ = ∥ϕ(y) − ϕ(ynn)∥₂ ≤ L∥y − ynn∥₂ ≤ ϵ. (4.9)
Lemma 4.4.3 (Lower bound of sample complexity for ϵ/L-cover). To achieve an ϵ/L-cover of Y, the size of the dataset must satisfy N ≥ N₀(ϵ) := (vol(Y)/vol₀)·(L/ϵ)^d, where vol₀ is the volume of the unit ball in d dimensions.

Proof. We prove by contradiction. If N < N₀(ϵ), then for each training sample (yi, ϕi), we create a ball Bi := B(yi, ϵ/L). Since

vol((⋃_{i=1}^N Bi) ∩ Y) ≤ vol(⋃_{i=1}^N Bi) ≤ Σ_{i=1}^N vol(Bi) = N·vol₀·(ϵ/L)^d < vol(Y), (4.10)

there exists at least one y ∈ Y so that y ∉ Bi for any 1 ≤ i ≤ N. This means that y is not ϵ/L-covered.
Theorem 4.4.4 (A case of infinite Lipschitz constant). If the minimal distance dmin between different connected components of ϕ(Y) is strictly positive, and κ(ϕ(Y)) > κ(Y), then the Lipschitz constant of the mapping ϕ is infinite.

Proof. Let R1, R2, . . . , RK be the K = κ(ϕ(Y)) connected components of ϕ(Y), and Y1, Y2, . . . , YJ be the J = κ(Y) connected components of Y. From the condition, we know that min_{k≠k′} dist(Rk, Rk′) = dmin > 0.

We have Rk ∩ Rk′ = ∅ for k ≠ k′. Each Rk has a pre-image Sk := ϕ⁻¹(Rk) ⊆ Y. These pre-images {Sk}_{k=1}^K form a partition of Y since:
• Sk ∩ Sk′ = ∅ for k ≠ k′, since any y ∈ Y cannot be mapped to more than one connected component;
• ⋃_{k=1}^K Sk = ⋃_{k=1}^K ϕ⁻¹(Rk) = ϕ⁻¹(⋃_{k=1}^K Rk) = ϕ⁻¹(ϕ(Y)) = Y.

Since K = κ(ϕ(Y)) > κ(Y) = J, by the pigeonhole principle, there exists one Yj that contains at least part of two pre-images Sk and Sk′ with k ≠ k′. This means that

Sk ∩ Yj ≠ ∅, Sk′ ∩ Yj ≠ ∅. (4.11)

Then we pick y ∈ Sk ∩ Yj and y′ ∈ Sk′ ∩ Yj. Since y, y′ ∈ Yj and Yj is a connected component, there exists a continuous path γ : [0, 1] → Yj so that γ(0) = y and γ(1) = y′. Therefore, we have ϕ(γ(0)) ∈ Rk and ϕ(γ(1)) ∈ Rk′. Let t0 := sup{t : t ∈ [0, 1], ϕ(γ(t)) ∈ Rk}; then 0 ≤ t0 < 1. For any sufficiently small ϵ > 0, we have:
• By the definition of sup, there exists t′ with t0 − ϵ ≤ t′ ≤ t0 so that ϕ(γ(t′)) ∈ Rk.
• Picking t′′ = t0 + ϵ < 1, we have ϕ(γ(t′′)) ∈ Rk′′ for some k′′ ≠ k.

On the other hand, by continuity of the curve γ, there exists a constant C(t0) so that ∥γ(t′) − γ(t′′)∥₂ ≤ C(t0)|t′ − t′′| ≤ 2C(t0)ϵ. Then we have

L = max_{y,y′∈Y} ∥ϕ(y) − ϕ(y′)∥₂ / ∥y − y′∥₂ ≥ ∥ϕ(γ(t′)) − ϕ(γ(t′′))∥₂ / ∥γ(t′) − γ(t′′)∥₂ ≥ dmin / (2C(t0)ϵ) → +∞. (4.12)
4.B Experiment Details
4.B.1 Setups
Experiments are performed on a cluster of identical machines, each with 4 Nvidia A100 GPUs and 32 CPU cores, with 1TB of RAM and 40GB of GPU memory. Additionally, we perform all operations in Python [Van Rossum and Drake, 2009] using PyTorch [Paszke et al., 2019]. For embedding table placement, the nonlinear cost estimator is trained for 200 iterations, and the offline-trained models of Dreamshard and SurCo-prior are trained against the pretrained cost estimator for 200 iterations. The DLRM dataset [Naumov et al., 2019] is available at https://github.com/facebookresearch/dlrm_datasets, and the Dreamshard [Zha et al., 2022b] code is available at https://github.com/daochenzha/dreamshard. Additional details on Dreamshard's model architecture and features can be found in the paper and codebase. Training time for the networks used in SurCo-prior and SurCo-hybrid is on average 8 hours for the inverse photonic design settings, and 6, 21, 39, 44, 50, and 63 minutes for the DLRM 10, 20, 30, 40, 50, and 60 settings respectively.
4.B.2 Network Architectures
4.B.2.1 Embedding Table Sharding
The table features are the same as those used in Zha et al. [2022b], and sinusoidal positional encoding [Vaswani et al., 2017] is used for the device features so that the learning model is able to break symmetries between the different tables and effectively group them onto homogeneous devices. The table and device features are concatenated and then fed into Dreamshard's initial fully-connected table encoding module to obtain scalar predictions ĉt,d, one for each desired objective coefficient. The architecture is trained with the Adam optimizer with learning rate 0.0005. Here, we use the Dreamshard backbone to predict coefficients for each table-device pair, adding more output dimensions to the backbone to ensure that we output the desired number of coefficients.
4.B.2.2 Inverse Photonic Design
Network architectures. The input design specification (a 2D image) is passed through a 3-layer convolutional neural network with ReLU activations and a final layer composed of filtering with the known brush shape. Then a tanh activation is used to obtain surrogate coefficients ĉ, one component for each binary input variable. The architecture is trained with the Adam optimizer with learning rate 0.001.
This is motivated by previous work [Schubert et al., 2022] that also uses the fixed brush shape filter and tanh operation to transform the latent parameters into a continuous solution, which is projected onto the space of physically feasible solutions.
In each setting, optimization is done on a binary grid of a different size to meet fabrication constraints, namely that a 3-by-3 cross must fit inside each filled and void location. In the beam splitter the design is an 80×60 grid, in the mode converter it is a 40×40 grid, in the waveguide bend it is a 40×40 grid, and in the wavelength division multiplexer it is an 80×80 grid.
Task                               Randomization
mode converter                     right and left waveguide width
waveguide bend                     waveguide width and length
beam splitter                      waveguide separation, width, and length
wavelength division multiplexer    input and output waveguide locations
Table 4.3: Task randomization of the 4 different tasks in inverse photonic design.

Previous work formulated the projection as finding a discrete solution that minimized the dot product of the input continuous solution and the proposed discrete solution. The authors then updated the continuous solution by computing gradients of the loss with respect to the discrete solution and using pass-through gradients to update the continuous solution. By comparison, our approach treats the projection as an optimization problem and updates the objective coefficients so that the resulting projected solution moves in the direction of the desired gradient.
To compute the gradient of this blackbox projection solver, we leverage the approach suggested by Pogančić et al. [2020], which calls the solver twice: once with the original coefficients, and again with coefficients perturbed in the direction of the incoming solution gradient, yielding an "improved solution". The gradient with respect to the input coefficients is then the difference between the "improved solution" and the solution for the current objective coefficients.
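This two-call scheme can be sketched in a few lines; the toy solver and the choice of λ here are illustrative, not the implementation used in the experiments:

```python
def blackbox_grad(solver, c, grad_x, lam=10.0):
    """Two-call gradient scheme of Pogancic et al. [2020] for a blackbox
    solver x = solver(c) that minimizes c^T x: call the solver once on the
    original coefficients and once on coefficients perturbed in the
    direction of the incoming solution gradient."""
    x = solver(c)
    x_improved = solver([ci + lam * gi for ci, gi in zip(c, grad_x)])
    # Gradient w.r.t. c: the (scaled) difference between the "improved"
    # solution and the solution for the current coefficients.
    return [(xi_imp - xi) / lam for xi_imp, xi in zip(x_improved, x)]

def toy_solver(c):
    """Toy combinatorial solver: argmin over x in {0,1}^n of c^T x."""
    return [1.0 if ci < 0 else 0.0 for ci in c]

# A loss gradient asking for x[0] to increase and x[1] to decrease.
g = blackbox_grad(toy_solver, c=[1.0, -1.0], grad_x=[-1.0, 1.0])
print(g)
```

Taking a gradient-descent step on c with this gradient moves the coefficients so that the solver's next output shifts toward the "improved solution".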
4.C Pseudocode
Here is the pseudocode for the different variants of our algorithm. Each variant leverages a differentiable optimization solver to differentiate through the surrogate optimization problem.

Algorithm 1 SurCo-zero
Input: feasible region Ω, data y, objective f
c ← init_surrogate_coefs(y)
while not converged do
    x ← arg min_{x∈Ω(y)} c⊤x
    loss ← f(x; y)
    c ← grad_update(c, ∇c loss)
end while
Return x
Algorithm 2 SurCo-prior Training
Input: feasible region Ω, data Dtrain = {yi}_{i=1}^N, objective f
θ ← init_surrogate_model()
while not converged do
    Sample batch B = {yi}_{i=1}^k ∼ Dtrain
    for y ∈ B do
        ĉ ← ĉ(y; θ)
        x ← arg min_{x∈Ω(y)} ĉ⊤x
        loss += f(x; y)
    end for
    θ ← grad_update(θ, ∇θ loss)
end while
Return θ
Algorithm 3 SurCo-prior Deployment
1: Input: feasible region Ω, data Dtrain = {yi}_{i=1}^N, objective f, test instance ytest
2: θ ← train SurCo-prior(Ω, Dtrain, f)
3: c ← ĉ(ytest; θ)
4: x ← arg min_{x∈Ω(ytest)} c⊤x
5: Return x

Algorithm 4 SurCo-hybrid
1: Input: feasible region Ω, data Dtrain = {yi}_{i=1}^N, objective f, test instance ytest
2: θ ← train SurCo-prior(Ω, Dtrain, f)
3: c ← ĉ(ytest; θ)
4: while not converged do
5:     x ← arg min_{x∈Ω(ytest)} c⊤x
6:     loss ← f(x; ytest)
7:     c ← grad_update(c, ∇c loss)
8: end while
9: Return x
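As a concrete illustration of Algorithm 1, the sketch below instantiates SurCo-zero on the 2D triangle example from the theory section, using the two-call blackbox gradient in place of a differentiable solver layer. The step size, λ, and initialization are illustrative, and the objective is negated so that the example's stated optima are the minimizers:

```python
import math

VERTICES = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]

def solver(c):
    """Surrogate linear solver: arg min over the vertices of c^T x."""
    return min(VERTICES, key=lambda x: c[0] * x[0] + c[1] * x[1])

def loss_grad(x, y):
    """Gradient w.r.t. x of the loss -(x1 cos y + x2 sin y)^2
    (negated so the stated optima (0,1) / (1,0) are minimizers)."""
    a = x[0] * math.cos(y) + x[1] * math.sin(y)
    return (-2 * a * math.cos(y), -2 * a * math.sin(y))

def surco_zero(y, steps=50, lr=0.5, lam=5.0):
    c = [-1.0, -1.0]  # init_surrogate_coefs(y); nonzero to avoid a flat start
    for _ in range(steps):
        x = solver(c)                       # surrogate solve
        gx = loss_grad(x, y)                # nonlinear loss gradient
        # Two-call blackbox gradient of the solver (Pogancic et al., 2020).
        x_imp = solver([c[0] + lam * gx[0], c[1] + lam * gx[1]])
        gc = [(x_imp[0] - x[0]) / lam, (x_imp[1] - x[1]) / lam]
        c = [c[0] - lr * gc[0], c[1] - lr * gc[1]]  # grad_update
    return solver(c)

print(surco_zero(y=0.2))  # (1.0, 0.0): the optimum for y < pi/4
print(surco_zero(y=1.2))  # (0.0, 1.0): the optimum for y > pi/4
```

Every iterate is a vertex of the feasible region, so feasibility is maintained throughout optimization, which is the central property of the surrogate approach.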
4.D Additional Failed Baselines
SOGA - Single Objective Genetic Algorithm Using PyGAD [Gad, 2021], we attempted several approaches for both the table sharding and inverse photonics settings. While we were able to obtain feasible table sharding solutions, they underperformed the greedy baseline by 20%. Additionally, they were unable to find physically feasible inverse photonics solutions. We varied between random, swap, inversion, and scramble mutations, and used all parent selection methods, but were unable to find viable solutions.
DFL - A Derivative-Free Library We could not easily integrate DFLGEN [Liuzzi et al., 2015] into our pipelines since it operates in Fortran, and we needed to specify the feasible region in Python for the Ceviche Challenges. DFLINT works in Python but took more than 24 hours to run on individual instances, reaching the timeout limit. We found that the much longer runtime made this inapplicable for the domains of interest.
Nevergrad We enforced integrality in Nevergrad [Rapin and Teytaud, 2018] using choice variables which select between 0 and 1. This approach was unable to find feasible solutions for inverse photonics in less than 10 hours. For table sharding, we obtained solutions by using a choice variable for each table, selecting one of the available devices. This approach was not able to outperform the greedy baseline and took longer, so it was strictly dominated by the greedy approach.
Solution Prediction We made several attempts at training solution predictors for each of our domains. We label each problem instance with the best-known solution obtained (including those obtained via SurCo). Note that predicting feasible solutions to combinatorial optimization problems is nontrivial in general settings.
We evaluate solution prediction architectures in each setting. The models here match the architecture of SurCo-prior, but the output is fed through a sigmoid transformation to get predictions in [0, 1]. In nonlinear shortest path, we use a GCN architecture and predict in [0, 1] whether each edge is in the shortest s–t path. Not surprisingly, we found that predicting solutions to combinatorial problems is a nontrivial problem, further motivating the use of SurCo, which ensures combinatorial feasibility of the generated solution.
Note that the solutions predicted by the networks may not be binary (and thus not feasible). We then round the individual decision variables to get binary predictions. Empirically, we found that our predictions are very close to binary, indicating that rounding is more a numerical-exactness operation than an algorithmic decision, with the largest distance from any original to rounded value being 0.0008 for inverse photonics, 0.0001 for nonlinear shortest path, and 0.0007 for the assignment problem of table sharding.
We evaluate the results on unseen test instances in Table 4.4 and find that these solution prediction approaches don't yield combinatorially feasible solutions. We present machine learning performance in the table below to verify that the predictive models perform "well" in terms of standard machine learning evaluation even though they fail to generate feasible solutions.
Setting                          Decision Variable    Solution    Solution
                                 Accuracy Average     Accuracy    Feasibility Rate
Inverse Photonics - Sigmoid      87%                  0%          0%
Nonlinear Path - Sigmoid         95%                  0%          0%
Sharding - Sigmoid               92%                  0%          0%
Sharding - Softmax               88%                  0%          0%
Sharding - Softmax + Iterative   70%                  0%          100%
Table 4.4: Solution prediction results; most methods give infeasible solutions.
We also iterate on table sharding to produce two more domain-specific approaches. We evaluate a model variant which assigns each table to one of the 4 devices using softmax, which empirically fails to yield feasible solutions that meet device memory limits for any of our instances. We further develop a method called Softmax + Iterative, which iteratively makes the most likely table-device assignment as long as the device has enough memory to hold the table. Luckily, this Softmax + Iterative method empirically yields feasible solutions in this setting, but we note that this approach is not guaranteed to terminate with feasible solutions, unlike SurCo. To see why Softmax + Iterative does not necessarily guarantee feasible termination, consider assigning 3 tables (2 small and 1 large) to 2 devices, each with a memory limit of 2, where the small tables have memory 1 and the large table has memory 2. If the model's highest assignment probability is on the small tables being evenly distributed across the devices, the algorithm will first assign the small tables to devices 1 and 2, but then stall because it is unable to assign the large table, since neither device has enough remaining capacity. We present results for this Softmax + Iterative approach compared to our domain heuristic, the worst-performing baseline, in Table 4.5.

Setting    % Latency Increase vs Domain Heuristic (worst baseline)
DLRM-10    6%
DLRM-20    5%
DLRM-30    9%
DLRM-40    7%
DLRM-50    3%
DLRM-60    11%
Table 4.5: Comparison of the only feasible solution prediction method against the worst baseline.
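The counterexample above can be reproduced in a few lines; this sketch hard-codes the described probability ranking and simply leaves a table unplaced when it no longer fits anywhere, rather than looping:

```python
def softmax_iterative(table_mem, cap, ranked_assignments):
    """Greedy variant: walk (table, device) pairs in decreasing predicted
    probability, placing each still-unplaced table if memory allows."""
    used = {}    # device -> memory used so far
    placed = {}  # table -> device
    for table, device in ranked_assignments:
        if table in placed:
            continue
        if used.get(device, 0) + table_mem[table] <= cap:
            placed[table] = device
            used[device] = used.get(device, 0) + table_mem[table]
    return placed

# 2 small tables (memory 1), 1 large table (memory 2); 2 devices of capacity 2.
# The model ranks the small tables' even split highest, so they are placed
# first and the large table can no longer fit on either device.
mem = {"small1": 1, "small2": 1, "large": 2}
ranking = [("small1", 0), ("small2", 1), ("large", 0), ("large", 1)]
placed = softmax_iterative(mem, cap=2, ranked_assignments=ranking)
print(placed)  # {'small1': 0, 'small2': 1} -- the large table stalls
```

Placing both small tables on device 1 first would have left room for the large table on device 2, which is exactly the kind of global reasoning the greedy ordering cannot do.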
For each setting, we evaluate three metrics:
• Decision Variable Accuracy Average: the average percent of variables which are correctly predicted.
• Solution Accuracy: the rate of predicting the full solution correctly (all decision variables predicted correctly).
• Solution Feasibility Rate: the percent of instances for which the predicted solution satisfies the constraints.
Chapter 5
GenCO: Generating Diverse Solutions to
Nonlinear Combinatorial Optimization Problems
“The Analytical Engine weaves algebraic patterns, just as the Jacquard loom
weaves flowers and leaves.” - Ada Lovelace
The generation of diverse solutions in non-linear combinatorial optimization
problems presents unique challenges due to combinatorial constraints and nonlinear objectives. Existing generative models and optimization solvers fall short
in ensuring solution diversity and constraint compliance. To this end, we propose
GenCO, which trains deep generative models end-to-end with surrogate linear
combinatorial solvers in order to find high-quality solutions with respect to the
nonlinear objective. This framework generates diverse, feasible solutions and
optimizes non-linear objectives by generating objective coefficients for a linear
surrogate solver, finding a solution satisfying combinatorial constraints, and penalizing the whole generative pipeline using the nonlinear objective. We demonstrate the effectiveness of our approach on video game level generation, showing
that our approach consistently generates diverse and high-quality solutions that
are guaranteed to satisfy the user-specified combinatorial constraints.
5.1 Introduction
Generating diverse combinatorial solutions to optimization problems is an important task with many applications. For example, in video game level design, we
may want to generate a variety of levels that are both realistic and valid [Zhang
et al., 2020]. In inverse photonic design, we want to generate a variety of devices that meet foundry manufacturing constraints and optimize physics-related
objectives [Schubert et al., 2022]. In the design of new molecules, we want to
generate a variety of molecules that are both chemically valid and have good
targeted properties [Pereira et al., 2021]. In all of these examples, the goal is
to generate a variety of combinatorial solutions that are both feasible and high
quality with respect to a given nonlinear objective.
The difficulty in these settings arises from the fact that the generated objects
need to satisfy nontrivial combinatorial constraints, as well as be of high quality
with respect to a nonlinear objective. The combinatorial constraints make it difficult to apply gradient-based methods to train standard generative models, such
as GANs Goodfellow et al. [2014] or VAEs Kingma and Welling [2013], which may
generate examples that violate the user-specified constraints. Additionally, the
nonlinear objective makes it difficult to use traditional general-purpose combinatorial optimization tools as mixed integer nonlinear programming (MINLP) is
often slow in practice compared to the more heavily researched mixed integer
linear programming (MILP), often requiring problem-specific solvers or leveraging specialized structure. Furthermore, traditional optimization solvers aren’t
geared towards generating diverse solutions and don’t have the flexibility of deep
learning models to adapt to slightly modified settings quickly. Recent work has
attempted to address these issues by first training a generative model and then
fixing the output after the fact [Zhang et al., 2020]. However, these approaches
are limited in that they only use the combinatorial solver to fix the output of the
generative model and do not use the combinatorial solver to train the generative
model. As a result, the generative model is not trained with the combinatorial
solver in mind. Thus, the generator may fail to capture the distribution of highquality feasible solutions as many of the generated examples may be “fixed” to a
single solution. Additionally, some work constrains generative model outputs by penalizing the generator based on constraint violation [Chao et al., 2021]. While this approach may work in some settings, it provides no guarantees in settings where the constraints are truly hard and must always be satisfied.
As a result, we propose GenCO, an approach for generating diverse combinatorial solutions to nonlinear combinatorial problems that combines the flexibility
of deep generative models with the combinatorial feasibility guarantees of optimization solvers. Our approach ensures that the output of the generative model
satisfies combinatorial constraints by using a comparatively efficient linear combinatorial solver to fix the output of the generative model. Here, the generator
is trained end-to-end with the downstream combinatorial solver to ensure that
the generator produces feasible solutions and that the generator’s loss is evaluated using only feasible solutions. Furthermore, we can ensure that the objective
function is optimized by penalizing the generator based on the generated solutions’ objective values.
Our main contributions are proposing a framework for generating diverse solutions to nonlinear problems that are guaranteed to satisfy combinatorial constraints, showing how our framework can be used in both the generative adversarial network and variational autoencoder settings, and demonstrating the
effectiveness of our approach on a variety of generative combinatorial optimization problems.
5.2 Generating Diverse Solutions to Nonlinear
Combinatorial Problems
To generate diverse solutions to nonlinear combinatorial problems, we consider
a problem domain where the task is to generate a set of combinatorial solutions
that satisfy a set of constraints and optimize a given nonlinear objective function.
Effectively, our goal is to train a generator that outputs solutions optimizing the
objective function and which natively handles diversity.
5.2.1 Mathematical Formulation
Overall, we are asked to generate a variety of high-quality combinatorial solutions to an optimization problem with a differentiable nonlinear objective function. Formally, each problem has parameters y, which may include parameters of the feasible region, the objective function, and more, a combinatorial feasible region Ω(y), and a nonlinear objective f : Ω(y) → R. We are asked
to generate a variety of high-quality feasible solutions. Here, the combinatorial feasible region may restrict solutions to those that respect physical or logical constraints. Furthermore, since the number of solutions is undetermined
a-priori, we want to provide the end-user a generator that can quickly sample
novel high-quality solutions as needed. Formally, the generator G parameterized by θ will take random noise ϵ and produce solutions x ∈ Ω. The goal is to
ensure that the solutions produced by the generator are combinatorially feasible,
diverse, and have good objective value.
Formally, the goal is to train the generator G to optimize the following biobjective function balancing the generative loss Lgen with the objective quality
of the solutions f. The generative loss may take many forms depending on the
generative model. For instance, a Wasserstein GAN would use the adversarial difference between outputs, or a VAE would use the evidence lower bound
(ELBO). The objective quality of the solutions is the objective value of the solutions produced by the generator.
min_θ E_ϵ [L_gen(G(ϵ; θ)) + f(G(ϵ; θ))]   subject to   G(ϵ; θ) ∈ Ω, ∀ϵ        (5.1)
5.2.2 GenCO: Generating Solutions with
Combinatorial Constraints
To generate solutions that are guaranteed to be feasible, we propose to use a
surrogate linear combinatorial solver to ensure that the solutions satisfy combinatorial constraints and that we have parameters that we can get gradients
through. As recent work has made a variety of combinatorial solvers differentiable [Agrawal et al., 2019a, Pogančić et al., 2020], we can use these methods to
obtain gradients depending on the problem at hand. Note that here, the linear
coefficients are not used to approximate the nonlinear objective f, but rather to
give the generator a continuous and unconstrained object to generate in a tuneable manner such that the resulting combinatorial solution has good objective
value. The key insight here is that in order to ensure that the generator satisfies
feasibility constraints, we decompose it into two parts: an unconstrained generator G̃ : ϵ → R^n and a surrogate combinatorial solver S : R^n → Ω. Below, we
describe our method at a high level in pseudocode and then describe how this
method can be used in the GAN and VAE settings.
Algorithm 5 Generator Training
Output: Trained generator
Initialize generator parameters θ
while not converged do
Sample problem y
Sample linear coefficients c ∼ G(y; θ)
Solve x* = arg max_{x∈Ω} c^T x
Compute loss = L_gen(G(ϵ; θ)) + f(x*)
Backpropagate ∇_θ loss to update θ
end while
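A minimal pure-Python sketch of one such training step follows, under two assumptions not fixed by the text: the surrogate solver is a toy "select the k largest coefficients" problem (a simple arg max of c^T x under a cardinality constraint), and gradients through the solver follow the blackbox perturbation scheme of Pogančić et al. [2020]. Both are stand-ins for whatever solver and differentiation method a deployment would use:

```python
# Toy surrogate solver: arg max_x c.x subject to sum(x) = k, x binary.
def solver(c, k=2):
    top = sorted(range(len(c)), key=lambda i: -c[i])[:k]
    return [1.0 if i in top else 0.0 for i in range(len(c))]

def blackbox_grad(c, dL_dx, lam=1.0, k=2):
    # Blackbox-differentiation idea: re-solve with coefficients perturbed
    # against the loss gradient; the difference of the two solutions gives
    # an informative gradient for c. (Sign conventions vary by implementation;
    # this one is set up for gradient descent on the loss.)
    x = solver(c, k)
    x_pert = solver([ci - lam * gi for ci, gi in zip(c, dL_dx)], k)
    return [(xi - pi) / lam for xi, pi in zip(x, x_pert)]

c = [0.5, 2.0, 1.0, -1.0]          # generator's surrogate coefficients
x_star = solver(c)                 # feasible solution: [0, 1, 1, 0]
target = [1.0, 0.0, 1.0, 0.0]      # pretend f is squared distance to a target
dL_dx = [2 * (xs - t) for xs, t in zip(x_star, target)]
grad_c = blackbox_grad(c, dL_dx)   # gradient passed back to the generator
print(grad_c)                      # -> [-1.0, 1.0, 0.0, 0.0]
```

Descending on `grad_c` raises c[0] and lowers c[1], moving the surrogate coefficients toward ones whose solution matches the target while every intermediate solution stays feasible.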
5.2.3 GenCO - GAN
In the GAN setting, the generative objectives are measured by the quality of a
worst-case adversary, which is trained to distinguish between the generator’s
output and the true data distribution. Here, we use the combinatorial solver to
ensure that the generator’s output is always feasible and that the adversary’s loss
is evaluated using only feasible solutions. This not only aligns the training pipeline more closely with real-world deployment, but also means that the discriminator does not have to dedicate model capacity to treating infeasibility as a sign that a solution is fake; instead, it can devote that capacity to distinguishing between real and fake inputs, all of which are valid. Furthermore, we can ensure
that the objective function is optimized by penalizing the generator based on the
generated solutions’ objective values. Below, we write out the objective explicitly
and give the training algorithm in pseudocode.
L_gen(G(ϵ; θ)) = max_{θ_adv} E_ϵ [log(1 − f(G(ϵ; θ); θ_adv))]        (5.2)
Algorithm 6 GenCO Generator Training in the GAN setting
Initialize generator parameters θgen
Initialize adversary parameters θadv
for epoch e do
for batch b do
Sample problem y
Sample true examples from dataset xtrue ∼ pdata(x)
Sample linear coefficients c ∼ G(y; θgen)
Solve x* = arg max_{x∈Ω} c^T x
Compute f(x*; θ_adv)
Backpropagate ∇_{θ_gen} f(x*; θ_adv) to update θ_gen
Backpropagate ∇_{θ_adv} [f(x_true; θ_adv) − f(x*; θ_adv)] to update θ_adv
end for
end for
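To make the update directions concrete, here is a toy sketch assuming a linear adversary f(x; w) = w·x, a hypothetical stand-in for the discriminator network (the generator's own gradient would flow through x* via the differentiable surrogate solver, which is not modeled here):

```python
# Adversary update in the spirit of Algorithm 6: for a linear adversary,
# the gradient of f(x_true; w) - f(x_star; w) with respect to w is just
# x_true - x_star, so a step widens the real-vs-generated score gap.
def adversary_step(w, x_true, x_star, lr=0.1):
    grad = [t - s for t, s in zip(x_true, x_star)]       # gap gradient
    return [wi + lr * gi for wi, gi in zip(w, grad)]     # gradient ascent

w = [0.0, 0.0, 0.0]
x_true = [1.0, 0.0, 1.0]   # a real level encoding
x_star = [0.0, 1.0, 1.0]   # a solver-fixed generated level
print(adversary_step(w, x_true, x_star))  # -> [0.1, -0.1, 0.0]
```

The adversary's weights move toward features present in real levels and away from features of generated ones; because x_star is always feasible, the adversary never wastes capacity on infeasibility cues.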
5.2.4 GenCO - VQVAE
The procedure below spells out the VQVAE training procedure and is specialized to the inverse photonics coefficient-generation setting. Here, we simply train a VQVAE on a dataset of known objective coefficients that solve the problem at
hand. A variant of this also puts the decision-focused loss on the generated objective coefficients, running optimizer gsolver on the objective coefficients to get a
solution and then computing the objective value of the solution.
We conjecture that this is equivalent to optimizing the ELBO on the surrogate objective coefficients together with the decision-focused loss on the generated objective coefficients.
L_ELBO(c, θ, E) = E_{q_θ(z|c)} [log p_θ(c|z)] − β · D_KL(q_θ(z|c) ∥ p(z)) + γ · ∥sg(e_k) − z_{e,θ}∥_2^2        (5.3)
Here, z is an embedding vector, c is the vector of objective coefficients, log p_θ(c|z) is a reconstruction loss computed as the mean squared error between the decoder output and the original input objective coefficients, q_θ(z|c) is the encoder, p(z) is the prior, sg(·) is the stop-gradient operator, and E is a discrete codebook used to quantize the embedding.
L_optimization = E_{c∼p_θ(c|z)} [f_obj(g_solver(c; y))]        (5.4)
In theory, the procedure below optimizes a combination of the losses in Equation 5.3 and Equation 5.4.
Algorithm 7 Generator Training for Inverse Photonics
Input: Training data distribution D over problem info y and known high-quality objective coefficients c, regularization weights β_1 and β_2, linear surrogate solver g_solver, nonlinear objective f_objective.
Output: Trained encoder f_enc, decoder f_dec, and codebook E
Initialize the parameters of the encoder f_enc, decoder f_dec, and the codebook E = {e_1, e_2, . . . , e_K} with K embedding vectors
for t = 1 to T do
Sample (y, c) from the distribution D
Compute the encoder output z_e = f_enc(y, c)
Find the nearest embedding vector z_q = arg min_{e∈E} ∥z_e − e∥_2^2
Compute the quantization loss L_quant = ∥z_e − z_q∥_2^2
Compute the reconstruction c̃ = f_dec(y, z_q)
Compute the reconstruction loss L_recon = ∥c − c̃∥_2^2
SurCo variant: compute the optimization loss L_opt = f_objective(g_solver(c̃, y))
Compute the total loss L_total = L_recon + β_1 L_quant + β_2 L_opt
Update the parameters of the encoder, decoder, and codebook to minimize L_total
end for
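The quantization and loss terms of Algorithm 7 can be illustrated with stand-in vectors (the numbers below are hypothetical; in the real method the encoder and decoder are neural networks and L_opt comes from the surrogate solver):

```python
# Snap an encoder output to its nearest codebook vector and assemble the
# total loss from Algorithm 7.
def nearest_code(z_e, codebook):
    dists = [sum((a - b) ** 2 for a, b in zip(z_e, e)) for e in codebook]
    k = min(range(len(codebook)), key=lambda i: dists[i])
    return k, codebook[k]

codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]   # K = 3 embeddings
z_e = [0.9, 1.2]                                   # stand-in for f_enc(y, c)
k, z_q = nearest_code(z_e, codebook)               # nearest embedding
L_quant = sum((a - b) ** 2 for a, b in zip(z_e, z_q))

c, c_tilde = [1.0, 2.0, 0.0], [0.8, 2.1, 0.1]      # true vs. decoded coefficients
L_recon = sum((a - b) ** 2 for a, b in zip(c, c_tilde))
beta1, beta2, L_opt = 0.25, 1.0, 0.0   # L_opt would come from g_solver (SurCo variant)
L_total = L_recon + beta1 * L_quant + beta2 * L_opt
print(k, round(L_total, 4))  # -> 1 0.0725
```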
5.3 Experiments
We evaluate our model on the task of generating diverse Zelda game levels. In this setting, we train GenCO using a GAN-style formulation.
5.3.1 Settings
Explicitly Constrained GAN: Game Level Design. We train GenCO on the task of generating Zelda game levels. Here, we are given examples of human-crafted game levels and are asked to generate fun new levels. The generated levels must be playable, in that the player must be able to solve them by moving the character along a route that reaches the destination. Additionally, the levels should be realistic, in that they should be somewhat similar to the real game levels as measured by an adversary trained to distinguish between real and fake images. We use the same dataset as [Zhang et al., 2020], consisting of 50 Zelda game levels, as well as the same network architectures, as we don't tune the hyperparameters for our model in particular but rather compare the different approaches all else being equal.
Here, we evaluate several approaches, including the previous work [Zhang
et al., 2020], which we call GAN + MILP fix. This approach first trains a standard
Wasserstein GAN architecture to generate the game levels. Specifically, they
train the WGAN by alternately training two components: a generator and a discriminator. The generator is updated to "fool" the discriminator as much as possible, in that it tries to maximize the discriminator's loss. The discriminator tries to correctly separate the generated and real game levels into their respective classes. When the practitioner then wants to generate a valid level, this approach generates a level with the generator and then fixes it using a MILP formulation that finds the nearest feasible game level, where proximity is measured by cosine distance, which is equivalent to maximizing a dot product.
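The "fix" step can be illustrated with a toy brute-force version of this nearest-feasible projection (a real implementation solves a MILP over the full combinatorial feasible set; the levels and flat encodings here are hypothetical):

```python
# Project the generator's continuous output onto the nearest feasible
# discrete level, with proximity measured by a dot product. A tiny
# enumerated feasible set stands in for the MILP's feasible region.
def nearest_feasible(gen_soft, feasible_levels):
    def score(level):
        return sum(g * l for g, l in zip(gen_soft, level))
    return max(feasible_levels, key=score)

feasible = [[1.0, 0.0, 0.0, 1.0],    # toy "levels" as flat 0/1 encodings
            [0.0, 1.0, 1.0, 0.0]]
gen_soft = [0.9, 0.1, 0.2, 0.8]      # generator's continuous output
print(nearest_feasible(gen_soft, feasible))  # -> [1.0, 0.0, 0.0, 1.0]
```

This also shows the pitfall the chapter identifies: many distinct continuous outputs project to the same discrete level, so diversity measured before fixing can evaporate after it.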
We evaluate two variants of GenCO: GenCO - Fixed Adversary, which iteratively updates the generative model to fool a fixed pretrained adversary, and GenCO - Updated Adversary, which updates both the generator and adversary during training. The fixed-adversary approach trains to fool a pretrained adversary obtained from the fully trained GAN of previous work [Zhang et al., 2020]. Both GenCO approaches are initialized with the fully trained GAN from previous work [Zhang et al., 2020].
Approach | % unique | GAN adversary (lower better) | GenCO adversary (lower better)
GenCO - Updated Adversary | 0.995 | -10.10 | -4.49
GenCO - Fixed Adversary | 0.22 | -1.45 | -0.85
GAN + MILP fix (previous work) | 0.52 | 0.22 | 0.24
Table 5.1: Game level design results. We compare our approach with an updating adversary, our approach with a fixed adversary, and previous work that simply postprocesses solutions to make them valid game levels. All generated levels are valid, but our approach generates more diverse solutions by training the GAN end-to-end with the adversary taking the fixed (postprocessed) solutions as input.
5.3.2 Results
We present a table of results in Table 5.1, which includes the performance of the
previous approach as well as two variants of our proposed approach. In these
settings, we estimate performance based on sampling 1000 levels. Each level
is made out of a grid, with each grid cell having one of 8 components: wall,
empty, key, exit, 3 enemy types, and the player. A valid level is one that can be
solved by the player in that there is a valid route starting at the player’s location,
collecting the key, and then reaching the exit. We evaluate the performance of
the models using two types of metrics: diversity, as measured by the percentage
of unique levels generated, and fidelity, as measured by the average objective
quality of a fixed GAN adversary. We evaluate using adversaries from both the
previous work and GenCO. As shown in Table 5.1, we find that GenCO with
an updating adversary generates unique solutions at a much higher rate than
previous approaches and also generates solutions that are of higher quality as
measured by both the GAN adversary and its own adversary. The adversary quality demonstrates that the solutions are realistic: neither the adversary from the previous work nor the one from GenCO is able to distinguish the generated examples from the real examples. Furthermore, since the models are trained on only 50 example levels, the generator lets us obtain many more game levels than the original dataset provides. Note that both approaches are guaranteed to give playable levels, as the levels are postprocessed to be valid; however, GenCO generates solutions that are both more diverse and of higher quality.
Uniqueness. As shown in Table 5.1, GenCO with an updated adversary obtains the highest percentage of unique solutions, generating 995 unique solutions out of 1000. This is significantly higher than the previous work, which generated only 520 unique solutions. This is likely because the generator is
trained with the downstream fixing explicitly in the loop. This means that while
the previous work may have been able to “hide” from the adversary by generating
slightly different continuous solutions, these continuous solutions may project
to the same discrete solution. On the other hand, by integrating the fixing into
the training loop, GenCO's generator cannot hide in the continuous solution space and is thus heavily penalized for generating duplicate solutions, since the adversary can easily detect them as originating from the generator. In essence, this makes the adversary's task easier: it only needs to distinguish between valid discrete levels rather than continuous, unconstrained ones. This is also reflected in the adversary quality, where GenCO's adversary is able to distinguish between levels coming from the previous work's generator and the real levels with a much better loss.
5.4 Conclusion
In this paper, we introduce GenCO, a framework for integrating combinatorial
constraints in a variety of generative models, and show how it can be used to generate diverse combinatorial solutions to nonlinear optimization problems. Unlike
existing generative models and optimization solvers, GenCO guarantees that the
generated diverse solutions satisfy combinatorial constraints, and we show empirically that it can optimize nonlinear objectives.
The underlying idea of our framework is to combine the flexibility of deep
generative models with the guarantees of optimization solvers. By training the
generator end-to-end with a surrogate linear combinatorial solver, GenCO generates diverse and combinatorially feasible solutions, with the generative loss
being computed only on feasible solutions.
We have tested GenCO on various combinatorial optimization problems and
generative settings, including GAN in video game level generation and VQVAE
in inverse photonic design, demonstrating the flexibility of our framework for
integrating into different generative paradigms. Our framework consistently
produced diverse and high-quality solutions that satisfy the combinatorial constraints, which can be flexibly encoded using general purpose combinatorial
solvers.
Chapter 6
ML for MIP: Learning Pseudo-Backdoors for
Mixed Integer Programs
“In the middle of difficulty lies opportunity.” - Albert Einstein
We propose a machine learning approach for quickly solving Mixed Integer
Programs (MIPs) by learning to prioritize a subset of decision variables, which
we call pseudo-backdoors, for branching that results in faster solution times.
Approaches based on machine learning have seen success in the area of solving combinatorial optimization problems by being able to flexibly leverage common structures in a given distribution of problems. Our approach takes inspiration from the concept of strong backdoors, which corresponds to a small set
of variables such that only branching on these variables yields an optimal integral solution and a proof of optimality. Our notion of pseudo-backdoors corresponds to a small set of variables such that prioritizing branching on them leads
to faster solve times (which can be solver dependent). A key advantage of pseudo-backdoors over strong backdoors is that they are more amenable to data-driven
identification or prediction. Our proposed method learns to estimate the solver
performance of a candidate pseudo-backdoor and determine whether or not to
use it. This pipeline can be used to identify high-quality pseudo-backdoors on
unseen MIP instances for a given MIP distribution. We evaluate our method
on five problem distributions and find that our approach can efficiently identify
high-quality pseudo-backdoors. In addition, we compare our learned approach
against Gurobi, a state-of-the-art MIP solver, demonstrating that our method can
be used to improve solver performance.
6.1 Introduction
Mixed integer programs (MIPs) are widely used mathematical models for combinatorial optimization problems Conforti et al. [2014], which consist of finding a setting of variables that optimizes a linear objective function subject to
both linear constraints and some integrality constraints on the decision variables.
MIPs can generally be solved via the branch-and-bound algorithm Land and Doig
[2010], which systematically explores the solution space using tree search by
branching on decision variables and pruning provably suboptimal subtrees. MIP
solvers have a wide variety of heuristic components such as how variables are
selected for branching, when heuristics are run, which nodes are selected, and
when cuts are generated. Empirically, branching decisions have a significant impact on the solve time of MIPs Achterberg and Wunderling [2013]. Recently,
a series of papers have explored data-driven approaches to learning branching
heuristics that improve MIP solve time Balcan et al. [2018], Gasse et al. [2019],
Khalil et al. [2016], Nair et al. [2020]. We examine the concept of backdoors, which represent a core set of variables that are key to solving combinatorial problems. Backdoors are first introduced in Williams et al. [2003] in the context of
Constraint Satisfaction Problems (CSPs). “Weak backdoors” are defined to be a
small subset of variables that satisfy the following property: there exists an assignment to this subset of variables such that the remaining unassigned CSP can
be solved in polynomial time. In a “strong backdoor” any setting of the backdoor
variables leads to a fully solvable subproblem. Later, backdoors are generalized to
combinatorial optimization and MIPs Dilkina et al. [2009b,a], where a strong MIP
backdoor is a subset of integer variables such that only branching on them yields
an optimal integral solution and a certificate of optimality. While the contribution of the strong backdoor to CSP and SAT has theoretical and practical limitations Järvisalo and Junttila [2007], Järvisalo and Niemelä [2008], Semenov et al.
[2018], solve time speedup in MIPs has been observed by prioritizing “backdoor”
variables in branching Fischetti and Monaci [2011], Khalil et al. [2021]. Here, the
variables are prioritized at the top of the branch and bound tree, so that even
if the MIP is not completely solved by branching on variables in the backdoor,
the solver can then branch on non-backdoor variables when those in the backdoor are exhausted. As a result, a set of prioritized variables which result in
fast runtimes may not necessarily satisfy the strict definition of strong and weak
backdoors given in Dilkina et al. [2009a] and to make this distinction clear, we
will refer to a high-performance subset of integer variables as a pseudo-backdoor.
Our goal in finding pseudo-backdoors is to quickly solve MIPs. In a MIP, we are asked to find real-valued settings for n decision variables x ∈ R^n that optimize a linear objective function c^T x, subject to m linear constraints Ax ≤ b, and with a subset I ⊆ [n] of the decision variables required to be integral, x_i ∈ Z ∀i ∈ I. Overall, the problem can be written as min{c^T x | Ax ≤ b, x_i ∈ Z ∀i ∈ I}. Given a MIP specified by P = (c, A, b, I), our goal is to find a pseudo-backdoor subset B ⊆ I of the integral decision variables such that prioritizing branching on
these decision variables yields fast solve times. We consider a distributional setting of MIP solving where we are given a training distribution of instances and
want to train a pseudo-backdoor selection model that performs well on unseen
instances from the same distribution.
In this work, we introduce a data-driven approach to identifying pseudo-backdoors for distributions of MIPs by learning a ranking model to identify which subset of variables is most likely to be a pseudo-backdoor, and a classification model to decide whether to use a candidate pseudo-backdoor or to fall back on the default solver. If the classifier decides to use the pseudo-backdoor, we instruct the MIP solver to prioritize branching on variables in the pseudo-backdoor above other integral decision variables. We represent MIPs as bipartite graphs with different node features for variables and constraints as in Gasse et al. [2019], and use graph
attention networks Veličković et al. [2018] with global attention pooling to learn
both models. We conduct empirical evaluations on neural network verification
instances Gowal et al. [2018], facility location Cornuéjols et al. [1991], and the
Generalized Independent Set Problem (GISP) Colombi et al. [2017], showing that
our models can firstly identify high-quality backdoors and secondly that these
backdoors can be used to achieve faster solve times than Gurobi.
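The branching-priority mechanism can be sketched as follows; with gurobipy one would set each variable's BranchPriority attribute, while this solver-free stand-in (hypothetical helper and variable names) just builds the priority map the solver would receive:

```python
# Turn a pseudo-backdoor B into branching priorities: backdoor variables get
# a higher priority, so the solver branches on them at the top of the tree
# and falls back to the remaining integer variables once B is exhausted.
def branch_priorities(int_vars, backdoor, high=10, default=0):
    return {v: (high if v in backdoor else default) for v in int_vars}

prios = branch_priorities(["x1", "x2", "x3", "x4"], backdoor={"x2", "x4"})
print(prios)  # -> {'x1': 0, 'x2': 10, 'x3': 0, 'x4': 10}
```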
6.2 Related Work
Backdoors in SAT & MIP: Backdoors were initially introduced as a concept
to represent the core hardness of a Constraint Satisfaction Problem (CSP) by
Williams et al. [2003]. The authors analyze time complexity for CSP solving
when assuming various sized backdoors exist. Dilkina et al. [2009a] extend the
definition of backdoors for general combinatorial optimization problems. Both
works also study the existence of backdoors in practice and find that small size
backdoors often exist for typical problem instances. Kilby et al. [2005] propose
algorithms for identifying backdoors for SAT by collecting branching variables
from a SAT solver. In MIP solving, Dvořák et al. [2017] study a variant of backdoors for MIPs called fracture backdoors which are variables whose removal
would result in a natural decomposition of MIPs. Additionally, Fischetti and
Monaci [2011] design a method to identify pseudo-backdoors for a given MIP
by solving a set covering problem, and observe that by prioritizing branching on
identified pseudo-backdoor variables, a MIP solver can quickly solve some MIPLIB Koch et al. [2011] instances. This method is designed to identify a pseudo-backdoor for a given single MIP instance rather than provide a methodology
for generalizing to a distribution of instances. Recent work Khalil et al. [2021]
proposes using Monte Carlo Tree Search to sample high-quality backdoors for
a given MIP instance. While previous work has focused on identifying pseudo-backdoors for a given MIP, these methods start from scratch for every new instance and cannot operate on new instances without re-running the entire procedure. Our method is the first data-driven attempt to predict pseudo-backdoors; it front-loads the computational overhead of finding high-quality backdoors onto a training set, so the learned models can be quickly deployed on unseen instances without needing to rerun backdoor search.
Learning in Combinatorial Optimization: Machine learning algorithms
have shown promise in various aspects of combinatorial optimization. In settings concerning exact MIP solving, as we consider here, machine learning can help decide various heuristic components of MIP solving, such as selecting a branching variable at each node of the branch-and-bound tree Balcan et al. [2018], Gasse et al. [2019], Khalil et al. [2016], identifying when to run primal heuristics Khalil et al. [2017b], neural diving heuristics Nair et al. [2020], and learning neighborhoods for large neighborhood search Huang et al. [2023b] based on approximate supervision Huang et al. [2023a]. These algorithms operate inside the branch and
bound solver to make decisions throughout the branch and bound tree, rather
than once at the top of the tree, and may be used in conjunction with our approach which treats the solver as a black box, whether learning-augmented or
not. Several approaches consider learning to configure optimization solver parameters with performance improvements Ansótegui et al. [2015], Hoos [2011],
Hutter et al. [2009]. Outside of MIP, a variety of work has been dedicated to
coordinating multiple agents Huang et al. [2023c, 2019], which is tackled by a variety of learning tasks Huang et al. [2022]. Additionally, previous work investigates using optimization approaches to solve problems when deploying deep learning models Bartan et al. [2023], Huang et al. [2021, 2020a]. Our approach
leverages the structural connection between the MIP formulation and a set of hyperparameters, the variable branching priorities, to improve solve time, rather than predicting a fixed set of global hyperparameters as in previous work.
6.3 Learning Pseudo-Backdoors
Our approach utilizes two learned models: a scoring model that scores subsets
of integer variables according to how they rank in runtime, and a classifier that
decides whether to use a predicted subset in the actual MIP solving. The intuition for including the second model is that some MIPs do not easily admit a
small pseudo-backdoor in practice, and it may be difficult to sample high-quality
pseudo-backdoors for a given MIP. In these situations, it is better to run a solver
in its default setting.
Figure 6.1 illustrates our method's deployment. At test time, given a new MIP instance, we randomly sample fixed-size subsets of integer variables according to their LP fractionality as in Dilkina et al. [2009a], score the sampled subsets, and take the subset with the best score as the predicted pseudo-backdoor. Then the
classifier decides whether the identified pseudo-backdoor would be faster than
the default setting. If the answer is positive, we assign higher branching priorities
to variables in the pseudo-backdoor; otherwise, we run the solver with its default
setting.
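The fractionality-based sampling step can be sketched as follows (an illustrative helper; the variable names and the small epsilon weight are assumptions, not the experiments' code):

```python
import random

# Sample fixed-size candidate backdoors with probability proportional to LP
# fractionality: the distance of each variable's root-LP value from the
# nearest integer, following Dilkina et al. [2009a].
def sample_backdoors(lp_values, size, k, seed=0):
    rng = random.Random(seed)
    frac = {v: min(x % 1.0, 1.0 - x % 1.0) for v, x in lp_values.items()}
    pool = list(frac)
    weights = [frac[v] + 1e-6 for v in pool]  # epsilon keeps integral vars sampleable
    samples = []
    for _ in range(k):
        chosen = set()
        while len(chosen) < size:              # draw until the subset is full
            chosen.add(rng.choices(pool, weights=weights)[0])
        samples.append(chosen)
    return samples

lp = {"x1": 0.5, "x2": 0.98, "x3": 0.0, "x4": 0.4}   # root-LP solution values
print(sample_backdoors(lp, size=2, k=3))
```

Highly fractional variables such as x1 and x4 dominate the draws, matching the intuition that variables the LP cannot decide are good branching candidates.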
Concretely, the scoring model S(P, B; θ_S), parameterized by neural network parameters θ_S, takes as input the MIP specification P and a candidate subset B, and predicts a score that characterizes the quality of B. The classifier C(P, B; θ_C), parameterized by neural network parameters θ_C, takes as input the MIP specification P and a candidate subset B, and makes a binary prediction of whether B will outperform the standard solver. Ultimately, we use a suggested candidate backdoor by setting the branching priority higher for pseudo-backdoor variables than for other variables.
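The deploy-time logic can be sketched with toy stand-ins for the trained scoring and classification models (the lambdas below are placeholders, not the learned GNNs):

```python
# Deploy-time pipeline of Figure 6.1: rank candidates with the scoring
# model, then let the classifier decide between the best pseudo-backdoor
# and the solver's default settings.
def deploy(mip, candidates, score, classify):
    best = max(candidates, key=lambda B: score(mip, B))   # rank by S(P, B)
    # C(P, B*) gate: use the pseudo-backdoor or fall back to the default solver.
    return ("backdoor", best) if classify(mip, best) else ("default", None)

cands = [frozenset({1, 2, 3}), frozenset({4, 5}), frozenset({6})]
# Toy stand-ins: prefer smaller subsets; accept only subsets of size <= 2.
mode, best = deploy("P", cands, lambda m, B: -len(B), lambda m, B: len(B) <= 2)
print(mode, best)  # -> backdoor frozenset({6})
```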
Figure 6.1: The pseudo-backdoor pipeline for a single MIP instance, consisting of scoring S(P, B; θ_S) and classification C(P, B; θ_C). First, k candidate pseudo-backdoor sets of decision variables B_1, . . . , B_k are sampled according to the decision variables' fractionality in the root LP relaxation. These candidate sets are ranked by the scoring module S(P, B; θ_S) to predict the best pseudo-backdoor B*. The classification module C(P, B*; θ_C) then determines whether to solve with Gurobi using B* or with Gurobi's default settings.
6.3.1 MIP and Backdoor Data Representation
We want to ensure that the architecture can operate on unseen MIP instances
of various sizes, yields predictions that are invariant to changes in MIP formulations that do not change the underlying problem (such as permuting the variables
or constraints), and incorporates information about whether a decision variable
is in a predicted candidate backdoor. As a result, we consider
a bipartite graph representation of a MIP as in Gasse et al. [2019]. As noted in
Gasse et al. [2019], the bipartite graph representation ensures the MIP encoding
is invariant to permutation in the decision variables or constraints. Additionally, it allows us to tap into the variety of predictive models that are designed to
Table 6.1: Variable, constraint, and edge features that encode a MIP instance to
be used as input to machine learning models. The features here capture problem
information and leverage solution information of the root LP.
operate on variable-sized graphs, enabling us to train and deploy our model on
problems with varying numbers of decision variables and constraints. The MIP
here is represented as a featurized bipartite graph P = (G, C, E, V), with graph
G containing a set of nodes representing variables and a set of nodes representing
constraints. There is one variable node for each decision variable and one constraint node for each constraint. Additionally, there is an edge (i, j) ∈ E between
a variable node i and a constraint node j iff the variable i appears in constraint
j with nonzero coefficient, i.e., A_{i,j} ≠ 0. Features are represented as matrices
C ∈ R^{m×c}, E ∈ R^{|E|×e}, and V ∈ R^{n×v}, containing the c constraint
features, e edge features, and v variable features, respectively. We enumerate the
features in Table 6.1.
Additionally, to represent a candidate pseudo-backdoor set B and retain the
permutation invariance afforded by the graph representation, we use a binary
encoding of B by including an additional feature for each decision variable node
which takes a value of 1 if the variable is in B and 0 otherwise. Encoding the input
(P, B) as a featurized bipartite graph now allows us to leverage state-of-the-art
techniques in making predictions on graphs which are amenable to variable input
graph sizes and exhibit permutation invariance.
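This encoding can be sketched concretely. The following minimal example is illustrative only: the feature set is a toy stand-in for the much richer features of Table 6.1, and all function and feature names are hypothetical. It builds the bipartite variable-constraint graph for a MIP min cᵀx s.t. Ax ≤ b, including the binary backdoor-membership feature.

```python
import numpy as np

def encode_mip(A, c, b, backdoor):
    """Encode a MIP min c^T x s.t. Ax <= b plus a candidate subset B as a
    featurized bipartite graph (variable features, constraint features,
    edge index, edge features).

    Toy feature set (a stand-in for Table 6.1): variable nodes get their
    objective coefficient and a 0/1 backdoor-membership flag, constraint
    nodes get their right-hand side, edges carry the nonzero coefficient.
    """
    m, n = A.shape
    rows, cols = np.nonzero(A)                  # (constraint j, variable i) pairs
    edge_index = np.stack([cols, rows])         # one edge per nonzero A[j, i]
    edge_feat = A[rows, cols].reshape(-1, 1)    # e = 1 edge feature
    in_backdoor = np.isin(np.arange(n), sorted(backdoor)).astype(float)
    var_feat = np.stack([c, in_backdoor], axis=1)  # v = 2 variable features
    con_feat = b.reshape(-1, 1)                    # c = 1 constraint feature
    return var_feat, con_feat, edge_index, edge_feat
```

Permuting the variables or constraints only permutes rows of the feature matrices and relabels the edge index consistently, which is what makes the downstream graph network's predictions permutation invariant.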
6.3.2 Neural Network Architecture
Given the graph encoding of the MIP and candidate backdoor set, we can leverage a wide variety of graph prediction methods and libraries such as those available in the PyTorch Geometric library Fey and Lenssen [2019]. These approaches
generally rely on generating high-level embeddings of nodes of a graph via message passing, where one message passing iteration embeds a given node as an
aggregation of the embeddings of its neighbors and adjacent edges. Previous
work used the standard Graph Convolutional Network (GCN) Gasse et al. [2019]
to predict branching scores on individual variable nodes at a given branch and
bound node. However, we were unable to train a GCN model to predict scores
and success probabilities. Instead, we found that the Graph Attention Network
(GAT) Veličković et al. [2018] worked in our setting. Here a node's high-level embedding is a weighted aggregation of the embeddings of its neighbors and edges,
where the weights are themselves predicted via a neural network as done in
attention-based models Vaswani et al. [2017], thus allowing the model to attend
to which variable or constraint nodes are most useful for generating the next
layer’s encoding of the given constraint or variable node. Formally, the GAT
performs the message passing iteration in Equation 6.1 to compute the embedding
x′_i for graph node i based on the neighbors N(i) of i, with attention weights
α_{i,j} and neural network weights θ.

x′_i = α_{i,i} θ x_i + Σ_{j ∈ N(i)} α_{i,j} θ x_j    (6.1)
The attention weights themselves are given by Equation 6.2, using learnable attention weights a on node embeddings x and edge features E.

α_{i,j} = exp(LeakyReLU(aᵀ [θx_i ∥ θx_j ∥ θ_e E_{i,j}])) / Σ_{k ∈ N(i)∪{i}} exp(LeakyReLU(aᵀ [θx_i ∥ θx_k ∥ θ_e E_{i,k}]))    (6.2)
Importantly, the attention weights α_{i,·} are non-negative and sum to 1 for a
given source node i, meaning that the embedding x′_i is a weighted average of
the embeddings of the nodes adjacent to i and the embedding of i itself. At the
first message passing iteration, the node features V, C are used.
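Equations 6.1 and 6.2 can be sketched in plain NumPy as follows. This is an illustrative single-head implementation with hypothetical names, not the PyTorch Geometric code used in the thesis; assigning a zero edge feature to the self-loop is a simplifying assumption.

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    # Identity for positive inputs, 0.2 * input for negative inputs.
    return np.where(z > 0, z, slope * z)

def gat_update(x, edge_feat, neighbors, theta, theta_e, a):
    """One single-head GAT message-passing step (Equations 6.1 and 6.2).

    x: (num_nodes, d_in) embeddings; neighbors[i]: iterable of neighbor ids;
    edge_feat[(i, k)]: feature vector of edge i-k; theta: (d_out, d_in);
    theta_e: (d_out, d_e); a: attention vector of length 3 * d_out.
    """
    h = x @ theta.T                                   # theta x_i for every node
    d_out = h.shape[1]
    out = np.zeros_like(h)
    for i in range(len(x)):
        cand = [i] + list(neighbors[i])               # {i} ∪ N(i)
        # Logits a^T [theta x_i || theta x_k || theta_e E_{i,k}]; the
        # self-loop gets a zero edge feature (simplifying assumption).
        logits = np.array([
            float(leaky_relu(a @ np.concatenate(
                [h[i], h[k],
                 theta_e @ edge_feat[(i, k)] if (i, k) in edge_feat
                 else np.zeros(d_out)])))
            for k in cand
        ])
        alpha = np.exp(logits - logits.max())
        alpha /= alpha.sum()                          # softmax over {i} ∪ N(i)
        out[i] = alpha @ h[cand]                      # Equation 6.1
    return out
```

Because the α's form a convex combination, each updated embedding stays inside the convex hull of the projected embeddings of the node and its neighbors, as noted above.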
Once we have run several iterations of message passing along the edges of the bipartite graph, we aggregate all the node embeddings using global attention pooling Li et al. [2016], which similarly computes a weighted average of node embeddings with weights given by a neural network, to yield a single feature vector
representing the whole graph. This fixed-length graph representation vector is
then passed to a LeakyReLU-activated MLP to produce a single scalar output which
will be used as the score for the scoring module or a logit for the classification
module. Performing these steps of obtaining node embeddings followed by aggregation across the entire MIP enables us to use the same network architecture
for MIPs of various sizes. The LeakyReLU transform itself is the identity if the input is positive, and 0.2 times the input if the input is negative. We perform graph
neural network operations using PyTorch Geometric Fey and Lenssen [2019].
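The pooling step can be sketched as a softmax-weighted average of node embeddings. This is a simplified stand-in with a hypothetical linear gate in place of the full gate network of Li et al. [2016].

```python
import numpy as np

def attention_pool(node_emb, gate_w):
    """Global attention pooling: a softmax-weighted average of node
    embeddings, producing one fixed-length vector regardless of graph size.

    gate_w is a hypothetical linear gate standing in for the gate network
    of Li et al. [2016]; it produces one attention logit per node.
    """
    logits = node_emb @ gate_w              # (num_nodes,)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                # softmax over all nodes
    return weights @ node_emb               # (emb_dim,): the graph embedding
```

Because the output dimension depends only on the embedding width, the same downstream MLP can be applied to MIPs with any number of variables and constraints.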
6.3.3 Learning the Scorer Model
We train the scorer model by learning to rank candidate backdoors based on how
quickly the solver finishes when prioritizing branching on variables in the candidate backdoor. For a MIP P and two pseudo-backdoors B1, B2 of P, the model
predicts score estimates s1 = S(P, B1; θS), s2 = S(P, B2; θS). Additionally, we
use a ranking label y which is −1 if solving with B1 is faster than with B2, and 1
otherwise. We train with the margin ranking loss L(s1, s2, y) = max(0, −y(s1 −
s2) + m) Tsochantaridis et al. [2005] for margin value m = 0.1. The ranking loss
focuses on distinguishing relative pseudo-backdoor performance rather than accurately predicting performance as would be the case in a regression formulation. Learning to rank has been pioneered in data-driven information retrieval
systems Burges et al. [2005], Liu et al. [2009], and in our use case it is sufficient
that we rank pseudo-backdoors correctly for each MIP rather than model runtime variability across both backdoors and MIPs. At deployment time, we sample
several candidate backdoors for a MIP and score them, taking the best-scoring
pseudo-backdoor for deployment. This best-scoring pseudo-backdoor can then
be deployed by prioritizing branching on its variables; this procedure constitutes
our scorer model. Initially, we experimented with models that directly predict
standardized runtime, as well as models that predict the probability that the input candidate backdoor will outperform the standard solver, but found that these losses
did not steer the model toward isolating high-quality pseudo-backdoors for
a given MIP and performed poorly on the held-out data used for model validation. Ultimately, the approach of learning to rank closely aligned our model's
loss with the overall deployment problem of trying to identify the fastest-solving
pseudo-backdoor.
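The margin ranking loss as written can be implemented directly (names here are illustrative). Note that under this sign convention, a pair labeled y = −1 (B1 faster) incurs zero loss only once the faster backdoor's score is at least m below the slower one's.

```python
import numpy as np

def margin_ranking_loss(s1, s2, y, m=0.1):
    """Margin ranking loss L(s1, s2, y) = max(0, -y * (s1 - s2) + m).

    y = -1 when solving with B1 was faster than with B2, and +1 otherwise;
    the loss is zero once the scores are ordered correctly with margin m.
    """
    return np.maximum(0.0, -y * (s1 - s2) + m)
```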
6.3.4 Learning the Classifier Model
For a given MIP instance, it may be difficult to sample high-quality pseudo-backdoors. As a result, we learn a classifier to determine whether to use the
best-scoring pseudo-backdoor or the default MIP solver. The classifier has the
same architecture as the scoring model, taking in the MIP and candidate subset
encoding (P, B), and outputting a probability representing how likely the predicted pseudo-backdoor is to outperform the standard MIP solver. We train the
model to minimize a binary cross-entropy loss between the model output logits and labels indicating whether the scorer’s best-scoring pseudo-backdoor beat
the default solver. At test time we use a threshold of 0.5 to get a final binary output determining whether to use the best-scoring pseudo-backdoor or the default
solver. The scorer+cls model represents the combination of the scorer and classifier models identifying both the highest-quality pseudo-backdoor and whether
or not it is likely to beat the standard MIP solver.
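A sketch of the classifier's loss and deployment rule follows; the helper names are hypothetical (the thesis trains with PyTorch's binary cross-entropy on logits), but the math is the standard BCE with the 0.5 threshold described above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(logit, label):
    """Binary cross-entropy between the classifier's output logit and the
    0/1 label recording whether the best-scoring pseudo-backdoor beat the
    default solver."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def use_backdoor(logit, threshold=0.5):
    """Test-time rule: deploy the pseudo-backdoor iff the predicted win
    probability exceeds the threshold; otherwise run default Gurobi."""
    return sigmoid(logit) > threshold
```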
6.4 Experiment Results
6.4.1 Problem domains
Many real-world settings require solving a homogeneous family of problems,
where instances share similar structures, differing slightly in the problem size
or numerical coefficients. We evaluate the two components of our proposed
method on problem instances drawn from three different application domains.∗
We refrained from evaluating our methodology on heterogeneous MIP datasets:
firstly, we found it difficult to sample fast-solving pseudo-backdoors to train
with for a variety of realistic MIP distributions; secondly, we target homogeneous settings where latent problem information, which would usually require domain knowledge to extract, may help automatically identify high-quality pseudo-backdoors.
Neural Network Verification: Neural network verification problems determine the robustness of a fixed network with respect to perturbation of input data.
∗Other domains: We ran initial data collection experiments on two hardness settings each
of set covering Balas and Ho [1980], combinatorial auctions Leyton-Brown et al. [2000], and
maximum independent set Bergman et al. [2016] but found that the best backdoors we sampled
using LP fractionality underperformed Gurobi, meaning that no matter how well our models
performed in selecting among these backdoors, they would be unable to outperform standard
Gurobi.
We consider the formulation of Gowal et al. [2018], which casts the problem as a MIP to determine formal bounds on the amount of perturbation required
to fool a neural network. Our MIP instances are from Nair et al. [2020] which
considers verification of MNIST images against a small convolutional neural network. We ran experiments on a random subset of 100 training, validation, and
test instances.
Facility Location: The Capacitated Facility Location problem is formulated
as selecting which facilities to open and how to route limited facility supply to
satisfy customer demand. The goal is to minimize costs consisting of facility
startup cost and per-unit transportation cost between facilities and customers.
As in Gasse et al. [2019], our problems are generated according to Cornuéjols
et al. [1991] having 100 facilities supplying 200 and 400 customers for the easy
and hard distributions respectively.
Generalized Independent Set Problem: The Generalized Independent Set
Problem (GISP) is a graph optimization problem originally proposed for forestry
management Hochbaum and Pathria [1997]. The input consists of a graph, a
subset of removable edges with deletion cost, and revenues for each vertex upon
selection. The goal is to select a set of vertices and remove removable edges
that maximize the net profit of total vertex revenues minus total edge costs such
that the selected vertices are independent in the modified subgraph. We use randomly generated GISP instances to yield different hardness settings. Erdős-Rényi
graphs Erdős and Rényi [1960] are generated with n nodes and edge probability
p. Edges are randomly added to the removable edge set E′ with probability α. Edges
have fixed costs of c, and nodes have fixed rewards of r. We set node count to
150 (easy), and 175 (hard), edge probability p = 0.3, removable edge probability
α = 0.25, edge cost c = 1, and node reward r = 100.
6.4.2 Data Generation and Model Evaluation
For each problem setting, we use 300 instances in total: 100 instances Ds for
training and validating the score model, 100 Dc for training and validating the
classifier model, and the final 100 Dt for testing the two models.
To train our score and classifier models, we randomly sample subsets of integer variables for each training instance, with probabilities proportional to their LP-based fractionality, as in Dilkina et al. [2009a]. For a given MIP, we solve the LP relaxation
then randomly sample integral decision variables based on their fractionality, or
distance to the nearest integer, with more fractional decision variables having
higher probability. Note that our sampling method is performed only once at the
root node instead of selecting decision variables at each branch and bound node,
which is shown to perform poorly Achterberg and Wunderling [2013]. In contrast, Dilkina et al. [2009a] found that many instances were
solved with a small backdoor size. We randomly sample subsets of size p · |I|
for p = 0.01, i.e., each pseudo-backdoor uses 1% of the integer variables.
From initial experiments sampling subsets of size 1%, 3%, and 5% on NNVerify and
GISP easy, we found that sampling 1% of the decision variables yielded the highest-quality pseudo-backdoors. As a result, we kept pseudo-backdoors at 1% of the decision variables, presuming that smaller pseudo-backdoors yield faster runtimes due to the
smaller search trees limited to the prioritized decision variables. For each MIP in
our distribution, we collect performance metrics on 50 random subsets sampled
according to LP fractionality Dilkina et al. [2009a]. In total this yields a dataset
of 15,000 pseudo-backdoors. We find that the best sampled backdoors yield average runtimes of 3.6s, 19.1s, 39s, 385s, and 1949s for NNVerify, Facilities easy
and hard, and GISP easy and hard respectively, meaning that correctly identifying pseudo-backdoors has the potential for large runtime improvements of 44%,
30%, 16%, 37%, 23% respectively. This large runtime improvement firstly indicates that high-quality pseudo-backdoors exist for these datasets, and secondly
that there is potential benefit for an algorithm which is able to correctly identify
these pseudo-backdoors.
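The fractionality-based sampling step can be sketched as follows. Function and argument names are hypothetical, and the uniform fallback for a fully integral LP solution is an assumption not specified in the text.

```python
import numpy as np

def sample_pseudo_backdoor(lp_solution, int_vars, frac=0.01, rng=None):
    """Sample one candidate pseudo-backdoor of size frac * |I| from the root
    LP relaxation, drawing integer variables without replacement with
    probability proportional to their fractionality (distance to the
    nearest integer), in the spirit of Dilkina et al. [2009a].

    The uniform fallback for an already-integral LP is an assumption.
    """
    if rng is None:
        rng = np.random.default_rng()
    x = np.array([lp_solution[v] for v in int_vars], dtype=float)
    fractionality = np.minimum(x - np.floor(x), np.ceil(x) - x)
    if fractionality.sum() == 0.0:
        probs = np.full(len(int_vars), 1.0 / len(int_vars))
    else:
        probs = fractionality / fractionality.sum()
    size = max(1, int(frac * len(int_vars)))
    chosen = rng.choice(len(int_vars), size=size, replace=False, p=probs)
    return {int_vars[i] for i in chosen}
```

Calling this 50 times per MIP would reproduce the candidate-generation step; in the thesis it is run once at the root node rather than at every branch and bound node.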
We evaluate the pipeline illustrated in Figure 6.1 on the test set Dt by sampling subsets using the LP relaxation, predicting the best pseudo-backdoor with
the scorer, and finally determining whether or not to use the selected pseudo-backdoor with the classifier. The backdoor set is deployed by setting Gurobi's
BranchPriority to be 2 for variables in the pseudo-backdoor and 1 for other variables. As noted in the Gurobi documentation (www.gurobi.com/documentation/9.1/refman/branchpriority.html), BranchPriority is the primary criterion for selecting a fractional variable for branching: higher-priority variables always take
precedence over lower-priority variables, with ties broken using internal
variable selection criteria. We consider the runtime of the full pipeline starting
with the LP-based sampling, running batched inference on sampled backdoors
to obtain a single backdoor, running inference on that backdoor for classification, and finally solving the MIP. As Gurobi does not allow access to the initial LP
relaxation, we must re-solve the LP from scratch, incurring additional runtime
that would not be present with deeper solver integration. We run MIP solving with single-threaded Gurobi (9.1) Gurobi Optimization, LLC [2022] on five
32-core machines with Intel 2.1 GHz CPUs and 264 GB of memory. Deep learning
models are trained on a machine with four GTX 1080 Ti GPUs using PyTorch
Paszke et al. [2019].
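The deployment rule above amounts to a simple priority assignment. The helper below is hypothetical; the corresponding gurobipy calls are shown in comments (Var.BranchPriority is Gurobi's documented attribute, as cited above).

```python
def branch_priorities(var_names, pseudo_backdoor):
    """Map each variable to its branching priority: 2 inside the predicted
    pseudo-backdoor and 1 elsewhere, so pseudo-backdoor variables are
    always branched on first (higher BranchPriority takes precedence)."""
    return {v: 2 if v in pseudo_backdoor else 1 for v in var_names}

# With gurobipy, the priorities would be applied roughly as follows
# (sketch only):
#   prios = branch_priorities([v.VarName for v in model.getVars()], backdoor)
#   for v in model.getVars():
#       v.BranchPriority = prios[v.VarName]
#   model.optimize()
```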
6.4.3 Main Results
We present test results on the 100 MIPs from Dt in Table 6.2. We evaluate both
the standalone score model (scorer), which simply takes the candidate backdoor
with highest scorer predictions, and the full pipeline (scorer+cls) with both the
scoring and classification models. The table contains both absolute runtime in
seconds, and win/tie/loss counts against Gurobi, where the runtime includes the time
for the root LP, sampling, and model inference. We count a “tie” for scorer+cls
when it predicts to use Gurobi rather than the suggested pseudo-backdoor; while
this is in fact a slight loss due to inference overhead, recording it separately
gives insight into the model predictions. Additionally, batched model inference
takes an average of 50 ms to compute scores for all 50 randomly sampled
candidate backdoors and predict the success
probability on the best-scoring pseudo-backdoor.
As we can see in Table 6.2, scorer alone performs well on many instances.
On average runtime, scorer outperforms Gurobi on facilities easy (6%) and hard
GISP (9%). Additionally, scorer outperforms Gurobi at different percentiles for
NNVerify while losing on average. Importantly, the win/tie/loss counts demonstrate
that the scorer can identify performance-improving backdoors. In NNVerify,
facilities hard, and GISP easy, the decreased average performance is partially
explained by the higher standard deviation in runtime. Additionally, in facilities
hard, scorer suffers 72 losses to Gurobi.
scorer+cls outperforms Gurobi on mean solve time across all five MIP distributions and at most solve-time quantiles. The scorer+cls pipeline outperforms Gurobi on mean solve time by 14%, 17%, 5%, 2%, and 8% on NNVerify, facilities easy, facilities hard, GISP easy, and GISP hard respectively. In terms of
win/tie/loss, scorer+cls retains most of scorer's wins and incorrectly selects
the backdoor in only 20 of 100 instances against Gurobi. Furthermore, we can see that the variability on the
NNVerify, facilities hard, and GISP easy instances is greatly reduced, bringing it
much more on par with Gurobi than scorer alone. Overall, the
scorer can identify high-quality backdoors for a given MIP, and the full
scorer+cls pipeline can effectively leverage the proposed backdoors for faster
solving by deciding when to run Gurobi instead.
Table 6.2: Runtime (secs) of standard Gurobi (grb), the score model (scorer), and
the joint score and classify model (scorer+cls). We report mean and standard
deviation, 25th, 50th, and 75th percentiles. Finally, we report Win/Tie/Loss vs
Gurobi.
dataset solver mean stdev 25 pct median 75 pct W/T/L vs grb
nnverify grb 6.5 7.9 2.9 4.3 6.1 0 / 100 / 0
nnverify scorer 7.0 12.8 2.5 3.5 5.1 68 / 0 / 31
nnverify scr+cls 5.6 7.4 2.6 3.3 5.0 62 / 18 / 20
facility easy grb 27.4 21.6 14.0 22.1 32.7 0 / 100 / 0
facility easy scorer 24.9 18.2 12.9 19.7 32.1 65 / 0 / 35
facility easy scr+cls 22.8 18.8 11.7 22.7 29.3 55 / 33 / 12
facility hard grb 46.9 31.6 26.2 37.1 55.2 0 / 100 / 0
facility hard scorer 56.0 42.5 29.3 42.5 69.1 28 / 0 / 72
facility hard scr+cls 44.5 30.9 25.2 35.5 52.7 25 / 74 / 1
gisp easy grb 611 182 488 580 681 0 / 100 / 0
gisp easy scorer 960 755 515 649 915 41 / 0 / 59
gisp easy scr+cls 601 247 481 568 663 24 / 70 / 6
gisp hard grb 2533 939 1840 2521 2976 0 / 100 / 0
gisp hard scorer 2373 855 1721 2262 2926 47 / 0 / 53
gisp hard scr+cls 2326 855 1654 2215 2866 41 / 39 / 20
6.5 Conclusion
In this work, we present a flexible method for learning to identify high-quality
pseudo-backdoors for faster MIP solving. Inspired by previous work demonstrating that prioritizing a small number of decision variables can sometimes improve MIP solving performance, our method comprises two parts: an initial scoring model
which scores candidate pseudo-backdoors according to how fast they can solve
a given MIP, and a subsequent classification module to determine whether the
selected pseudo-backdoor will indeed result in faster runtime compared with a
standard solver. We evaluate the different components of our pipeline on several
realistic MIP distributions and demonstrate that fast-solving pseudo-backdoors
exist for the examined MIP distributions, the scoring module can effectively identify pseudo-backdoors for a given MIP, and that the downstream classifier can
determine when to use these pseudo-backdoors for faster MIP solving. In future
work, we hope to develop techniques to concretely understand how machine
learning is able to improve optimization solvers.
Chapter 7
Conclusions and Future Directions
“If you change the way you look at things, the things you look at change.” – Wayne
Dyer
7.1 Conclusions
Artificial Decision Intelligence (ADI) is a promising new direction that aims to
combine the best of machine learning and combinatorial optimization. We have
shown that this combination can be used to solve a variety of problems, including those that are difficult or impossible to solve with naive applications of either
machine learning or constraint optimization alone, only yielding to methods that
tightly integrate the two. Specifically, in MIPaaL, we show that pushing traditional techniques in operations research into the area of deep learning can enable
end-to-end training of prediction and optimization pipelines. The key insight
here is that we can determine how changes in the predicted components of optimization models impact the downstream decisions, which we use for gradient-based optimization. We further develop these types of differentiable optimization
tools to predict wildlife trafficking paths. By treating the constrained optimization module as enforcing constraints on the output of a predictive model, we
ensure that we only predict valid wildlife trafficking paths. Additionally, we can
specialize these differentiable optimization modules to solve nonlinear combinatorial optimization problems by learning linear surrogate problems such that
when they are solved, we obtain high-quality feasible solutions for a downstream
nonlinear loss. These differentiable modules can be further extended to generate
diverse combinatorially constrained solutions to nonlinear optimization problems by learning a distribution over linear coefficients, which induces a distribution over feasible regions. Finally, we show that we can use deep learning to
tune the internals of industrial-scale solvers to improve their performance on a
variety of combinatorial optimization problems.
7.2 Future Research Directions
There are many directions for future work given the broad potential impact
of artificial decision intelligence. In the area of differentiable optimization that
enables end-to-end training of prediction and optimization pipelines, there are
several open questions regarding how to handle predictions in constraints, how
to handle missing counterfactuals, and how to make these pipelines more accessible to those who lack experience in both deep learning and combinatorial
optimization.
7.2.1 Constraint Prediction
In existing work on differentiating through combinatorial solvers, researchers
have focused on learning objective coefficients. However, in many practical
cases, practitioners may want to learn components that appear in the constraints
of optimization modules, such as budgets, capacity constraints, and possibly logical constraints. Current methods are not geared towards obtaining gradients
with respect to the constraints of a combinatorial optimization module, as slight
modifications of the constraints often do not modify the resulting solution. Secondly, it is unclear how to correctly penalize infeasible solutions in the case that
the predicted solution violates some ground-truth constraints. Lastly, it is unclear how to handle the case where the predicted problem is itself infeasible, i.e.,
there are no valid solutions to the predicted optimization problem.
7.2.2 Handling Missing Counterfactuals
In the prediction and optimization setting it is often assumed that the practitioner can measure the quality of alternative combinatorial solutions. However,
in many cases, this is not possible. For example, in the wildlife trafficking setting, measuring the quality of alternative paths that were not observed in the
data is impossible. In these cases, it is unclear how to train the prediction and
optimization pipeline to avoid biases due to missing counterfactuals in the data.
7.2.3 Accessibility of Artificial Decision Intelligence
Given the recent successes arising from emergent behaviors of large language models (LLMs) at the time of writing, it is clear that LLMs have the potential to make
the capabilities of AI more accessible to the general public. As many techniques
in combinatorial optimization, such as routing, logistics, supply chain, and more,
are locked behind the gate of domain expertise, it would be greatly impactful if
these tools could be made more accessible through the use of LLMs. Furthermore,
artificial decision intelligence has the potential for enabling broader capabilities
for LLMs, including the ability to reason and ensure that the output satisfies certain constraints. In the other direction, many components of LLM tuning are
discrete due to the use of discrete tokens. Thus, there is a potential for artificial decision intelligence to be used to search the space of tokens to optimize the
output of LLMs.
7.2.4 Artificial Decision Intelligence
Artificial Decision Intelligence is a promising new direction with many fruitful areas for future work. As AI becomes more accessible overall, we hope to
ensure that its capabilities are used to have a positive social impact. This not
only requires the development of new techniques but also calls for fostering collaborations
between researchers and practitioners in the field to ensure that solutions are
deployed for solving real-world problems rather than simplified models of the
world. We hope that this work can serve as a starting point for future research
in AI for social impact, especially in settings requiring resource allocation, scientific discovery, large-scale planning, and more.
Bibliography
Tobias Achterberg. Scip: solving constraint integer programs. Mathematical
Programming Computation, 1(1):1–41, 2009.
Tobias Achterberg and Roland Wunderling. Mixed integer programming: Analyzing 12 years of progress. In Facets of combinatorial optimization, pages
449–481. Springer, 2013.
Akshay Agrawal, Brandon Amos, Shane Barratt, Stephen Boyd, Steven Diamond,
and J Zico Kolter. Differentiable convex optimization layers. Advances in neural information processing systems, 32, 2019a.
Akshay Agrawal, Shane Barratt, Stephen Boyd, Enzo Busseti, and Walaa M
Moursi. Differentiating through a cone program. J. Appl. Numer. Optim, 1
(2):107–115, 2019b.
Ravindra K Ahuja and James B Orlin. Inverse optimization. Operations research,
49:771–783, 2001. ISSN 0030-364X. doi: 10.1287/opre.49.5.771.10607.
Sebastian E Ament and Carla P Gomes. Scalable first-order Bayesian optimization via structured automatic differentiation. In International Conference on
Machine Learning, pages 500–516. PMLR, 2022.
Brandon Amos and J. Zico Kolter. OptNet: Differentiable optimization as a layer
in neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages
136–145. PMLR, 2017.
Brandon Amos and Denis Yarats. The differentiable cross-entropy method. In
International Conference on Machine Learning, pages 291–302. PMLR, 2020.
Carlos Ansótegui, Yuri Malitsky, Horst Samulowitz, Meinolf Sellmann, and Kevin
Tierney. Model-based genetic algorithms for algorithm configuration. In International Conference on Artificial Intelligence, pages 733–739, 2015.
Juan M Arrazola, Ville Bergholm, Kamil Brádler, Thomas R Bromley, Matt J
Collins, Ish Dhand, Alberto Fumagalli, Thomas Gerrits, Andrey Goussev,
Lukas G Helt, et al. Quantum circuits with many photons on a programmable
nanophotonic chip. Nature, 591(7848):54–60, 2021.
Felber J Arroyave, Alexander M Petersen, Jeffrey Jenkins, and Rafael Hurtado.
Multiplex networks reveal geographic constraints on illicit wildlife trafficking.
Applied Network Science, 5(1):1–20, 2020.
Egon Balas and Andrew Ho. Set covering algorithms using cutting planes, heuristics, and subgradient optimization: a computational study. In Combinatorial
Optimization, pages 37–60. Springer, 1980.
Egon Balas, Sebastian Ceria, Gérard Cornuéjols, and N Natraj. Gomory cuts
revisited. Operations Research Letters, 19(1), 1996.
Maria-Florina Balcan, Travis Dick, Tuomas Sandholm, and Ellen Vitercik. Learning to branch. In International conference on machine learning (ICML), 2018.
Othman El Balghiti, Adam N Elmachtoub, Paul Grigas, and Ambuj Tewari. Generalization bounds in the predict-then-optimize framework. Advances in Neural Information Processing Systems, 32, 2019.
Gah-Yi Ban and Cynthia Rudin. The big data newsvendor: Practical insights from
machine learning. Operations Research, 67(1):90–108, 2019.
Ricardo Baptista and Matthias Poloczek. Bayesian optimization of combinatorial
structures. In International Conference on Machine Learning, pages 462–471.
PMLR, 2018.
Burak Bartan, Haoming Li, Harris Teague, Christopher Lott, and Bistra Dilkina.
Moccasin: Efficient tensor rematerialization for neural networks. In International Conference on Machine Learning. PMLR, 2023.
Richard Bellman. On a routing problem. Quarterly of applied mathematics, 16(1):
87–90, 1958.
Pietro Belotti, Christian Kirches, Sven Leyffer, Jeff Linderoth, James Luedtke, and
Ashutosh Mahajan. Mixed-integer nonlinear optimization. Acta Numerica, 22:
1–131, 2013.
Nawal Benabbou, Mithun Chakraborty, Xuan-Vinh Ho, Jakub Sliwinski, and Yair
Zick. Diversity constraints in public housing allocation. In AAMAS, pages
973–981, 2018.
Yoshua Bengio. Using a financial training criterion rather than a prediction criterion. International Journal of Neural Systems, 8(4):433–443, 1997.
Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: a methodological tour d’horizon. European Journal of
Operational Research, 290(2):405–421, 2021.
David Bergman, Andre A. Cire, Willem-Jan van Hoeve, and John Hooker. Decision Diagrams for Optimization. Springer Publishing Company, Inc., 1st edition, 2016. ISBN 3319428470.
Quentin Berthet, Mathieu Blondel, Olivier Teboul, Marco Cuturi, Jean-Philippe
Vert, and Francis Bach. Learning with differentiable pertubed optimizers. Advances in neural information processing systems, 33:9508–9519, 2020.
Dimitris Bertsimas, Christopher Darnell, and Robert Soucy. Portfolio construction through mixed-integer programming at grantham, mayo, van otterloo and
company. Interfaces, 29(1):49–66, 1999.
Merve Bodur, Sanjeeb Dash, and Oktay Günlük. Cutting planes from extended
lp formulations. Math. Programming, 161(1-2):159–192, 2017.
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris
Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye
Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of
Python+NumPy programs, 2018. URL http://github.com/google/jax.
Samuel Burer and Adam N Letchford. Non-convex mixed-integer nonlinear programming: A survey. Surveys in Operations Research and Management Science,
17(2):97–106, 2012.
Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning, pages 89–96,
2005.
Carri W Chan, Vivek F Farias, Nicholas Bambos, and Gabriel J Escobar. Optimizing intensive care unit discharge decisions with patient readmissions. Operations research, 60(6):1323–1341, 2012.
Xiaopeng Chao, Jiangzhong Cao, Yuqin Lu, Qingyun Dai, and Shangsong Liang.
Constrained generative adversarial networks. IEEE Access, 9:19208–19218,
2021. doi: 10.1109/ACCESS.2021.3054822.
Weizhe Chen, Weinan Zhang, Duo Liu, Weiping Li, Xiaojun Shi, and Fei Fang.
Data-driven multimodal patrol planning for anti-poaching. In Proceedings of
the AAAI Conference on Artificial Intelligence, volume 35, pages 15270–15277,
2021.
Weizhe Chen, Eshwar Prasad Sivaramakrishnan, and Bistra Dilkina. Landscape
optimization for prescribed burns in wildfire mitigation planning. In ACM SIGCAS/SIGCHI Conference on Computing and Sustainable Societies (COMPASS),
pages 429–438, 2022.
Seongjin Choi, Hwasoo Yeo, and Jiwon Kim. Network-wide vehicle trajectory
prediction in urban traffic networks using deep learning. Transportation Research Record, 2672(45):173–184, 2018.
Vasek Chvatal. Linear programming. Macmillan, 1983.
CITES. https://cites.org/eng.
Liron Cohen, Tansel Uras, Shiva Jahangiri, Aliyah Arunasalam, Sven Koenig, and TK Satish Kumar. The FastMap algorithm for shortest path computations. In IJCAI, 2018.
Marco Colombi, Renata Mansini, and Martin Savelsbergh. The generalized independent set problem: Polyhedral analysis and solution approaches. European
Journal of Operational Research, 260(1):41–55, 2017.
Michele Conforti, Gérard Cornuéjols, Giacomo Zambelli, et al. Integer programming. Springer, Cham, 2014. ISBN 9783319110073.
Jennifer Conrad and Gautam Kaul. An anatomy of trading strategies. The Review
of Financial Studies, 11(3):489–519, 1998.
Gérard Cornuéjols, Ranjani Sridharan, and Jean-Michel Thizy. A comparison of
heuristics and relaxations for the capacitated plant location problem. European
Journal of Operational Research, 50(3):280–297, 1991.
Alison Cozad, Nikolaos V Sahinidis, and David C Miller. Learning surrogate
models for simulation-based optimization. AIChE Journal, 60(6):2211–2227,
2014.
Global Initiative Against Transnational Organized Crime. The global organized crime index. Available at https://globalinitiative.net/analysis/ocindex-2021/, 2021.
Sanjeeb Dash, Neil B. Dobbs, Oktay Günlük, Tomasz J. Nowicki, and Grzegorz M. Świrszcz. Lattice-free sets, multi-branch split disjunctions, and mixed-integer programming. Math. Programming, 145(1):483–508, Jun 2014. ISSN 1436-4646. doi: 10.1007/s10107-013-0654-z. URL https://doi.org/10.1007/s10107-013-0654-z.
Filipe de Avila Belbute-Peres, Kevin Smith, Kelsey Allen, Josh Tenenbaum, and
J Zico Kolter. End-to-end differentiable physics for learning and control. In
NeurIPS, pages 7178–7189, 2018.
Emir Demirović, Peter J Stuckey, James Bailey, Jeffrey Chan, Christopher Leckie, Kotagiri Ramamohanarao, and Tias Guns. Predict+optimise with ranking objectives: Exhaustively learning linear functions. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pages 1078–1085. International Joint Conferences on Artificial Intelligence, 2019.
Emir Demirovic, Peter J Stuckey, Tias Guns, James Bailey, Christopher Leckie, Kotagiri Ramamohanarao, and Jeffrey Chan. Dynamic programming for predict+optimise. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 1444–1451. AAAI Press, 2020.
Aryan Deshwal, Syrine Belakaria, and Janardhan Rao Doppa. Mercer features for
efficient combinatorial bayesian optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 35(8):7210–7218, May 2021. doi: 10.1609/aaai.
v35i8.16886. URL https://ojs.aaai.org/index.php/AAAI/article/view/16886.
John P Dickerson, David F Manlove, Benjamin Plaut, Tuomas Sandholm, and
James Trimble. Position-indexed formulations for kidney exchange. In ACM
Conference on Economics and Computation, pages 25–42, 2016.
Bistra Dilkina, Carla P Gomes, Yuri Malitsky, Ashish Sabharwal, and Meinolf
Sellmann. Backdoors to combinatorial optimization: Feasibility and optimality. In International Conference on Integration of Constraint Programming, Artificial Intelligence, and Operations Research (CPAIOR), pages 56–70. Springer,
2009a.
Bistra Dilkina, Carla P Gomes, and Ashish Sabharwal. Backdoors in the context
of learning. In International Conference on Theory and Applications of Satisfiability Testing (SAT), pages 73–79. Springer, 2009b.
Josip Djolonga and Andreas Krause. Differentiable learning of submodular models. Advances in Neural Information Processing Systems, 30, 2017.
Priya Donti, Brandon Amos, and J Zico Kolter. Task-based end-to-end model
learning in stochastic optimization. Advances in neural information processing
systems, 30, 2017.
Priya L. Donti, David Rolnick, and J Zico Kolter. DC3: A learning method for
optimization with hard constraints. In International Conference on Learning
Representations, 2021. URL https://openreview.net/forum?id=V1ZHVxJ6dSS.
Pavel Dvořák, Eduard Eiben, Robert Ganian, Dušan Knop, and Sebastian Ordyniak. Solving integer linear programs with a small number of global variables
and constraints. In International Joint Conference on Artificial Intelligence (IJCAI), pages 607–613, 2017.
Othman El Balghiti, Adam N Elmachtoub, Paul Grigas, and Ambuj Tewari. Generalization bounds in the predict-then-optimize framework. Advances in neural information processing systems, 32, 2019.
Adam Elmachtoub, Jason Cheuk Nam Liang, and Ryan McNellis. Decision trees
for decision-making under the predict-then-optimize framework. In International Conference on Machine Learning, pages 2858–2867. PMLR, 2020. URL
https://github.com/rtm2130/SPOTree.
Adam N Elmachtoub and Paul Grigas. Smart “predict, then optimize”. Management Science, 68(1):9–26, 2022.
Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math.
Inst. Hung. Acad. Sci, 5(1):17–60, 1960.
Christos Faloutsos and King-Ip Lin. FastMap: A fast algorithm for indexing,
data-mining and visualization of traditional and multimedia datasets. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of
Data, SIGMOD ’95, page 163–174, New York, NY, USA, 1995. Association for
Computing Machinery. ISBN 0897917316. doi: 10.1145/223784.223812. URL
https://doi.org/10.1145/223784.223812.
YY Fan, Robert E Kalaba, and James E Moore. Arriving on time. Journal of
Optimization Theory and Applications, 127:497–513, 2005.
Jie Feng, Yong Li, Chao Zhang, Funing Sun, Fanchao Meng, Ang Guo, and Depeng Jin. Deepmove: Predicting human mobility with attentional recurrent
networks. In Proceedings of the 2018 world wide web conference, pages 1459–
1468, 2018.
Aaron Ferber, Bryan Wilder, Bistra Dilkina, and Milind Tambe. MIPaaL: Mixed integer program as a layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 1504–1511, 2020.
Matthias Feurer, Benjamin Letham, and Eytan Bakshy. Scalable meta-learning
for bayesian optimization. stat, 1050(6), 2018.
Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and
Manifolds, 2019.
Matteo Fischetti and Michele Monaci. Backdoor branching. In International Conference on Integer Programming and Combinatorial Optimization (IPCO), pages
183–191. Springer, 2011.
Evelyn Fix. Discriminatory analysis: nonparametric discrimination, consistency properties, volume 1. USAF School of Aviation Medicine, 1985.
Mogens Fosgerau, Mads Paulsen, and Thomas Kjær Rasmussen. A perturbed
utility route choice model. Transportation Research Part C: Emerging Technologies, 136:103514, 2022. ISSN 0968-090X. doi: https://doi.org/10.1016/
j.trc.2021.103514. URL https://www.sciencedirect.com/science/article/pii/
S0968090X21004976.
Ahmed Fawzy Gad. PyGAD: An intuitive genetic algorithm Python library, 2021.
Sébastien Gambs, Marc-Olivier Killijian, and Miguel Núñez del Prado Cortez.
Next place prediction using mobility markov chains. In Proceedings of the first
workshop on measurement, privacy, and mobility, pages 1–6, 2012.
Maxime Gasse, Didier Chételat, Nicola Ferroni, Laurent Charlin, and Andrea
Lodi. Exact combinatorial optimization with graph convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
Shahrzad Gholami, Sara Mc Carthy, Bistra Dilkina, Andrew Plumptre, Milind
Tambe, Margaret Driciru, Fred Wanyama, Aggrey Rwetsiba, Mustapha Nsubaga, Joshua Mabonga, et al. Adversary models account for imperfect crime
data: Forecasting and planning against real-world poachers (corrected version). In 17th International Conference on Autonomous Agents and Multiagent
Systems, 2018.
Ralph Gomory. An algorithm for the mixed integer problem. Technical report, RAND Corporation, 1960.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial
nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27.
Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper_
files/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.
Meredith L Gore, Patrick Braszak, James Brown, Phillip Cassey, Rosaleen Duffy,
Judith Fisher, Jessica Graham, Ronit Justo-Hanani, Andrea E Kirkwood, Elizabeth Lunstrum, et al. Transnational environmental crime threatens sustainable
development. Nature Sustainability, 2(9):784–786, 2019.
Meredith L Gore, Robert Mwinyihali, Luc Mayet, Gavinet Duclair Makaya Baku-Bumb, Christian Plowman, and Michelle Wieland. Typologies of urban wildlife traffickers and sellers. Global Ecology and Conservation, page e01557, 2021.
Meredith L Gore, Lee R Schwartz, Kofi Amponsah-Mensah, Emily Barbee, Susan
Canney, Maria Carbo-Penche, Drew Cronin, Rowan Hilend, Melinda Laituri,
David Luna, et al. Voluntary consensus based geospatial data standards for
the global illegal trade in wild fauna and flora. Scientific Data, 9(1):1–8, 2022.
199
Meredith L Gore, Emily Griffin, Bistra Dilkina, Aaron Ferber, Stanley E Griffis,
Burcu B Keskin, and John Macdonald. Advancing interdisciplinary science for
disrupting wildlife trafficking networks. Proceedings of the National Academy
of Sciences, 120(10):e2208268120, 2023a.
Meredith L Gore, Rowan Hilend, Jonathan O Prell, Emily Griffin, John R Macdonald, Burcu B Keskin, Aaron Ferber, and Bistra Dilkina. A data directory
to facilitate investigations on worldwide wildlife trafficking. Big Earth Data, 7
(2):338–348, 2023b.
Abhijit Gosavi et al. Simulation-based optimization. Springer, 2015.
Sven Gowal, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Chongli
Qin, Jonathan Uesato, Relja Arandjelovic, Timothy Mann, and Pushmeet
Kohli. On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715, 2018.
Ali Ugur Guler, Emir Demirović, Jeffrey Chan, James Bailey, Christopher Leckie,
and Peter J Stuckey. A divide and conquer algorithm for predict+ optimize
with non-convex problems. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 36, pages 3749–3757, 2022.
Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2022. URL
https://www.gurobi.com.
Timothy C Haas and Sam M Ferreira. Finding politically feasible conservation
policies: the case of wildlife trafficking. Ecological Applications, 28(2):473–494,
2018.
He He, Hal Daume III, and Jason M Eisner. Learning to search in branch and
bound algorithms. In NeurIPS, 2014.
Dorit S Hochbaum and Anu Pathria. Forest harvesting and minimum cuts: a new
approach to handling spatial constraints. Forest Science, 43(4):544–554, 1997.
Holger H Hoos. Automated algorithm configuration and parameter tuning. In
Autonomous search, pages 37–71. Springer, 2011.
Yichun Hu, Nathan Kallus, and Xiaojie Mao. Fast rates for contextual linear
optimization. Management Science, 2022.
200
Taoan Huang and Bistra Dilkina. Enhancing seismic resilience of water pipe
networks. In Proceedings of the 3rd ACM SIGCAS Conference on Computing
and Sustainable Societies, pages 44–52, 2020.
Taoan Huang, Bohui Fang, Hoon Oh, Xiaohui Bei, and Fei Fang. Optimal trip-vehicle dispatch with multi-type requests. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 2024–2026, 2019.
Taoan Huang, Bohui Fang, Xiaohui Bei, and Fei Fang. Dynamic trip-vehicle dispatch with scheduled and on-demand requests. In Uncertainty in Artificial
Intelligence, pages 250–260. PMLR, 2020a.
Taoan Huang, Weiran Shen, David Zeng, Tianyu Gu, Rohit Singh, and Fei Fang.
Green security game with community engagement. In Proceedings of the 19th
International Conference on Autonomous Agents and MultiAgent Systems, pages
529–537, 2020b.
Taoan Huang, Bistra Dilkina, and Sven Koenig. Learning node-selection strategies in bounded suboptimal conflict-based search for multi-agent path finding.
In International Joint Conference on Autonomous Agents and Multiagent Systems
(AAMAS), 2021.
Taoan Huang, Jiaoyang Li, Sven Koenig, and Bistra Dilkina. Anytime multi-agent path finding via machine learning-guided large neighborhood search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 9368–9376, 2022.
Taoan Huang, Aaron Ferber, Yuandong Tian, Bistra Dilkina, and Benoit Steiner.
Local branching relaxation heuristics for integer linear programs. In International Conference on Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pages 96–113. Springer, 2023a.
Taoan Huang, Aaron M Ferber, Yuandong Tian, Bistra Dilkina, and Benoit
Steiner. Searching large neighborhoods for integer linear programs with
contrastive learning. In International Conference on Machine Learning, pages
13869–13890. PMLR, 2023b.
Taoan Huang, Vikas Shivashankar, Michael Caldara, Joseph Durham, Jiaoyang
Li, Bistra Dilkina, and Sven Koenig. Deadline-aware multi-agent tour planning. In Proceedings of the International Conference on Automated Planning
and Scheduling, volume 33, pages 189–197, 2023c.
Tyler W Hughes, Ian AD Williamson, Momchil Minkov, and Shanhui Fan.
Forward-mode differentiation of maxwell’s equations. ACS Photonics, 6(11):
3010–3016, 2019.
Frank Hutter, Holger H Hoos, Kevin Leyton-Brown, and Thomas Stützle. ParamILS: an automatic algorithm configuration framework. Journal of Artificial Intelligence Research, 36:267–306, 2009.
IATA. Combating wildlife trafficking. https://www.iata.org/en/programs/
environment/wildlife-trafficking/, 2022. Accessed: 2022-08-15.
Matti Järvisalo and Tommi Junttila. Limitations of restricted branching in clause
learning. In International Conference on Principles and Practice of Constraint
Programming (CP), pages 348–363. Springer, 2007.
Matti Järvisalo and Ilkka Niemelä. The effect of structural branching on the
efficiency of clause learning sat solving: An experimental study. Journal of
Algorithms, 63(1-3):90–113, 2008.
John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
Yihao Kao, Benjamin Van Roy, and Xiang Yan. Directed regression. In NeurIPS, 2009. URL http://papers.nips.cc/paper/3686-directed-regression.pdf.
William Karush. Minima of functions of several variables with inequalities as
side constraints. M. Sc. Dissertation. Dept. of Mathematics, Univ. of Chicago,
1939.
George Karypis and Vipin Kumar. A fast and high quality multilevel scheme
for partitioning irregular graphs. SIAM Journal on scientific Computing, 20(1):
359–392, 1998.
Burcu B Keskin, Emily C Griffin, Jonathan O Prell, Bistra Dilkina, Aaron Ferber,
John MacDonald, Rowan Hilend, Stanley Griffis, and Meredith L Gore. Quantitative investigation of wildlife trafficking supply chains: A review. Omega,
page 102780, 2022.
Elias Khalil, Pierre Le Bodic, Le Song, George Nemhauser, and Bistra Dilkina.
Learning to branch in mixed integer programming. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 30, 2016.
Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning
combinatorial optimization algorithms over graphs. Advances in neural information processing systems, 30, 2017a.
Elias B. Khalil, Bistra Dilkina, George Nemhauser, Shabbir Ahmed, and Yufen Shao. Learning to run heuristics in tree search. In International Joint Conference on Artificial Intelligence (IJCAI), 2017b.
Elias B. Khalil, Pashootan Vaezipoor, and Bistra Dilkina. Finding backdoors to integer programs: A Monte Carlo tree search framework, 2021.
Philip Kilby, John Slaney, Sylvie Thiébaux, and Toby Walsh. Backbones and
backdoors in satisfiability. In AAAI, pages 1368–1373, 2005.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
arXiv preprint arXiv:1412.6980, 2014.
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv
preprint arXiv:1312.6114, 2013.
Thorsten Koch, Tobias Achterberg, Erling Andersen, Oliver Bastert, Timo
Berthold, Robert E Bixby, Emilie Danna, Gerald Gamrath, Ambros M Gleixner,
Stefan Heinz, et al. Miplib 2010. Mathematical Programming Computation, 3
(2):103–163, 2011.
Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing
problems! In International Conference on Learning Representations, 2018.
Bernhard Korte and Dirk Hausmann. An analysis of the greedy heuristic for
independence systems. In Annals of Discrete Mathematics, volume 2, pages
65–74. Elsevier, 1978.
Slawomir Koziel, Nurullah Çalık, Peyman Mahouti, and Mehmet A Belen. Accurate modeling of antenna structures by means of domain confinement and
pyramidal deep neural networks. IEEE Transactions on Antennas and Propagation, 70(3):2174–2188, 2021.
Anilesh K Krishnaswamy, Haoming Li, David Rein, Hanrui Zhang, and Vincent
Conitzer. Classification with strategically withheld data. In Proceedings of the
AAAI Conference on Artificial Intelligence, volume 35, pages 5514–5522, 2021.
Solomon Kullback. Information theory and statistics. Courier Corporation, 1997.
Justin Kurland and Stephen F Pires. Assessing US wildlife trafficking patterns: How criminology and conservation science can guide strategies to reduce the illegal wildlife trade. Deviant Behavior, 38(4):375–391, 2017.
Ailsa H Land and Alison G Doig. An automatic method for solving discrete
programming problems. In 50 Years of Integer Programming 1958-2008, pages
105–132. Springer, 2010.
Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto.
Meta-learning with differentiable convex optimization. In Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition, pages 10657–
10665, 2019.
Vladimir I Levenshtein et al. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, volume 10, pages 707–710.
Soviet Union, 1966.
Kevin Leyton-Brown, Mark Pearson, and Yoav Shoham. Towards a universal test
suite for combinatorial auction algorithms. In ACM conference on Electronic
commerce, pages 66–76, 2000.
Haoming Li, Sujoy Sikdar, Rohit Vaish, Junming Wang, Lirong Xia, and Chaonan
Ye. Minimizing time-to-rank: a learning and recommendation approach. In
Proceedings of the 28th International Joint Conference on Artificial Intelligence,
pages 1408–1414, 2019a.
Jiaoyang Li, Ariel Felner, Sven Koenig, and TK Satish Kumar. Using FastMap to solve graph problems in a Euclidean space. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 29, pages 273–278, 2019b.
Sirui Li, Zhongxia Yan, and Cathy Wu. Learning to delegate for large-scale vehicle routing. Advances in Neural Information Processing Systems, 34:26198–
26211, 2021.
Yujia Li, Richard Zemel, Marc Brockschmidt, and Daniel Tarlow. Gated graph
sequence neural networks. In ICLR, 2016.
Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Combinatorial optimization with
graph convolutional networks and guided tree search. Advances in neural information processing systems, 31, 2018.
Sejoon Lim, Christian Sommer, Evdokia Nikolova, and Daniela Rus. Practical
route planning under delay uncertainty: Stochastic shortest path queries. In
Robotics: Science and Systems, volume 8, pages 249–256. United States, 2013.
Chun Kai Ling, Fei Fang, and J Zico Kolter. What game are we playing? End-to-end learning in normal and extensive form games. IJCAI, 2018.
Heyuan Liu and Paul Grigas. Risk bounds and calibration for a smart predict-then-optimize method. Advances in Neural Information Processing Systems, 34:
22083–22094, 2021.
Tie-Yan Liu et al. Learning to rank for information retrieval. Foundations and
Trends® in Information Retrieval, 3(3):225–331, 2009.
Giampaolo Liuzzi, Stefano Lucidi, and Francesco Rinaldi. Derivative-free methods for mixed-integer constrained optimization problems. Journal of Optimization Theory and Applications, 164(3):933–965, 2015.
Nicholas Magliocca, Aurora Torres, Jared Margulies, Kendra McSweeney, Inés
Arroyo-Quiroz, Neil Carter, Kevin Curtin, Tara Easter, Meredith Gore, Annette Hübschle, et al. Comparative analysis of illicit supply network structure
and operations: cocaine, wildlife, and sand. Journal of Illicit Economies and
Development, 3(1), 2021.
Jayanta Mandi, Peter J Stuckey, Tias Guns, et al. Smart predict-and-optimize for
hard combinatorial optimization problems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 1603–1610, 2020.
Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952.
David Marpaung, Jianping Yao, and José Capmany. Integrated microwave photonics. Nature photonics, 13(2):80–90, 2019.
Nina Mazyavkina, Sergey Sviridov, Sergei Ivanov, and Evgeny Burnaev. Reinforcement learning for combinatorial optimization: A survey. Computers &
Operations Research, 134:105400, 2021.
Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, Joe Wenjie Jiang, Ebrahim
Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Azade
Nazi, et al. A graph placement methodology for fast chip design. Nature, 594
(7862):207–212, 2021.
Amir-Hamed Mohsenian-Rad and Alberto Leon-Garcia. Optimal residential load
control with price prediction in real-time electricity pricing environments.
IEEE Trans. Smart Grid, 1(2):120–133, 2010.
Ryo Nagai, Ryosuke Akashi, and Osamu Sugino. Completing density functional
theory by machine learning hidden messages from molecules. npj Computational Materials, 6(1):1–8, 2020.
Vinod Nair, Sergey Bartunov, Felix Gimeno, Ingrid von Glehn, Pawel Lichocki, Ivan Lobov, Brendan O’Donoghue, Nicolas Sonnerat, Christian Tjandraatmadja, Pengming Wang, Ravichandra Addanki, Tharindi Hapuarachchi,
Thomas Keck, James Keeling, Pushmeet Kohli, Ira Ktena, Yujia Li, Oriol
Vinyals, and Yori Zwols. Solving mixed integer programs using neural networks, 2020.
Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G. Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu, Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira, Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong, and Misha Smelyanskiy. Deep learning recommendation model for personalization and recommendation systems. CoRR, abs/1906.00091, 2019. URL https://arxiv.org/abs/1906.00091.
Mohammadreza Nazari, Afshin Oroojlooy, Lawrence Snyder, and Martin Takác.
Reinforcement learning for solving the vehicle routing problem. Advances in
neural information processing systems, 31, 2018a.
Mohammadreza Nazari, Afshin Oroojlooy, Lawrence Snyder, and Martin Takác.
Reinforcement learning for solving the vehicle routing problem. In NeurIPS,
pages 9839–9849, 2018b.
George L Nemhauser. Integer programming: The global impact. ISyE DOS Optimization Seminar presentation at Georgia Tech, 2013. URL http://hdl.handle.net/1853/49829.
Thanh H Nguyen, Arunesh Sinha, Shahrzad Gholami, Andrew Plumptre, Lucas
Joppa, Milind Tambe, Margaret Driciru, Fred Wanyama, Aggrey Rwetsiba, Rob
Critchlow, et al. Capture: A new predictive anti-poaching tool for wildlife
protection. In Proceedings of the 2016 International Conference on Autonomous
Agents & Multiagent Systems, pages 767–775, 2016.
Mathias Niepert, Pasquale Minervini, and Luca Franceschi. Implicit MLE: Backpropagating through discrete exponential family distributions. Advances in Neural Information Processing Systems, 34, 12 2021a. URL https://github.com/nec-research/tf-imle.
Mathias Niepert, Pasquale Minervini, and Luca Franceschi. Implicit MLE: Backpropagating through discrete exponential family distributions. Advances in Neural Information Processing Systems, 34:14567–14579, 2021b.
Evdokia Nikolova, Jonathan A Kelner, Matthew Brand, and Michael Mitzenmacher. Stochastic shortest paths via quasi-convex maximization. In European
Symposium on Algorithms, pages 552–563. Springer, 2006.
Eoin O’Mahony and David B Shmoys. Data analysis and optimization for (Citi)Bike sharing. In AAAI, 2015.
Theodore P Papalexopoulos, Christian Tjandraatmadja, Ross Anderson,
Juan Pablo Vielma, and David Belanger. Constrained discrete black-box optimization using mixed-integer programming. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors,
International Conference on Machine Learning, volume 162 of Proceedings of
Machine Learning Research, pages 17295–17322. PMLR, 17–23 Jul 2022. URL
https://proceedings.mlr.press/v162/papalexopoulos22a.html.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga,
Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai,
and Soumith Chintala. Pytorch: An imperative style, high-performance deep
learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc,
E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
Tiago Pereira, Maryam Abbasi, Bernardete Ribeiro, and Joel P Arrais. Diversity
oriented deep reinforcement learning for targeted molecule generation. Journal of cheminformatics, 13(1):21, 2021.
Andrew Perrault, Bryan Wilder, Eric Ewing, Aditya Mate, Bistra Dilkina, and
Milind Tambe. Decision-focused learning of adversary behavior in security
games. arXiv preprint arXiv:1903.00958, 2019.
Luis Pineda, Taosha Fan, Maurizio Monge, Shobha Venkataraman, Paloma Sodhi,
Ricky TQ Chen, Joseph Ortiz, Daniel DeTone, Austin Wang, Stuart Anderson,
et al. Theseus: A library for differentiable nonlinear optimization. Advances
in Neural Information Processing Systems, 35:3801–3818, 2022.
Marin Vlastelica Pogančić, Anselm Paulus, Vit Musil, Georg Martius, and Michal
Rolinek. Differentiation of blackbox combinatorial solvers. In International
Conference on Learning Representations, 2020. URL https://openreview.net/
forum?id=BkevoJSYPB.
Quandl. Various end-of-day data. https://www.quandl.com/, 2019.
J. Rapin and O. Teytaud. Nevergrad - A gradient-free optimization platform.
https://GitHub.com/FacebookResearch/Nevergrad, 2018.
Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam
and beyond. In International Conference on Learning Representations, 2018. URL
https://openreview.net/forum?id=ryQu7f-RZ.
Edward M Reingold and Robert E Tarjan. On a greedy heuristic for complete
matching. SIAM Journal on Computing, 10(4):676–681, 1981.
Michal Rolínek, Vít Musil, Anselm Paulus, Marin Vlastelica, Claudio Michaelis,
and Georg Martius. Optimizing rank-based metrics with blackbox differentiation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 7620–7630, 2020a.
Michal Rolínek, Paul Swoboda, Dominik Zietlow, Anselm Paulus, Vít Musil, and
Georg Martius. Deep graph matching via blackbox differentiation of combinatorial solvers. In European Conference on Computer Vision, pages 407–424.
Springer, 2020b.
ROUTES. How the aviation industry transformed to combat wildlife trafficking. https://www.internationalairportreview.com/article/173456/
how-the-aviation-industry-transformed-to-combat-wildlife-trafficking/,
2022. Accessed: 2022-08-15.
Andrey Rudenko, Luigi Palmieri, Michael Herman, Kris M Kitani, Dariu M
Gavrila, and Kai O Arras. Human motion trajectory prediction: A survey.
The International Journal of Robotics Research, 39(8):895–935, 2020.
Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv
preprint arXiv:1609.04747, 2016.
Kengo Sato, Yuki Kato, Michiaki Hamada, Tatsuya Akutsu, and Kiyoshi Asai. IPknot: fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming. Bioinformatics, 2011.
Martin F. Schubert, Alfred K. C. Cheung, Ian A. D. Williamson, Aleksandra Spyra,
and David H. Alexander. Inverse design of photonic devices with strict foundry
fabrication constraints. ACS Photonics, 9(7):2327–2336, 2022. doi: 10.1021/
acsphotonics.2c00313.
Alexander Semenov, Oleg Zaikin, Ilya Otpuschennikov, Stepan Kochemazov, and Alexey Ignatiev. On cryptographic attacks using backdoors for SAT. In AAAI Conference on Artificial Intelligence, 2018.
Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and
Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):
93–93, 2008.
Geet Sethi, Bilge Acun, Niket Agarwal, Christos Kozyrakis, Caroline Trippel, and Carole-Jean Wu. RecShard: statistical feature-based memory optimization for industry-scale neural recommendation. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 344–358, 2022.
Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016. doi: 10.1109/JPROC.2015.2494218.
Weiran Shen, Weizhe Chen, Taoan Huang, Rohit Singh, and Fei Fang. When to
follow the tip: Security games with strategic informants. In Proceedings of the
Twenty-Ninth International Joint Conference on Artificial Intelligence, 2020.
Dan Simon. Evolutionary optimization algorithms. John Wiley & Sons, 2013.
Skyscanner. https://skyscanner.github.io/slate/, 2020.
J Cole Smith and Yongjia Song. A survey of network interdiction models and
algorithms. European Journal of Operational Research, 283(3):797–811, 2020.
Ben Spevack. Shared skies: Convergence of wildlife trafficking with other illicit activities in the aviation industry. C4ADS, 2021. URL www.routespartnership.org.
Benoit Steiner, Chris Cummins, Horace He, and Hugh Leather. Value learning for
throughput optimization of deep learning workloads. In A. Smola, A. Dimakis,
and I. Stoica, editors, Proceedings of Machine Learning and Systems, volume 3,
pages 323–334, 2021. URL https://proceedings.mlsys.org/paper/2021/file/
73278a4a86960eeb576a8fd4c9ec6997-Paper.pdf.
Oliver C Stringham, Stephanie Moncayo, Katherine GW Hill, Adam Toomes,
Lewis Mitchell, Joshua V Ross, and Phillip Cassey. Text classification to
streamline online wildlife trade analyses. PloS one, 16(7):e0254007, 2021a.
Oliver C. Stringham, Stephanie Moncayo, Eilish Thomas, Sarah Heinrich, Adam
Toomes, Jacob Maher, Katherine G.W. Hill, Lewis Mitchell, Joshua V. Ross,
Chris R. Shepherd, and Phillip Cassey. Dataset of seized wildlife and their
intended uses. Data in Brief, 39:107531, 2021b. ISSN 2352-3409. doi: https://doi.
org/10.1016/j.dib.2021.107531. URL https://www.sciencedirect.com/science/
article/pii/S2352340921008076.
Oliver C Stringham, Adam Toomes, Aurelie M Kanishka, Lewis Mitchell, Sarah
Heinrich, Joshua V Ross, and Phillip Cassey. A guide to using the internet to
monitor and quantify the wildlife trade. Conservation Biology, 35(4):1130–1139,
2021c.
Kevin Swersky, Jasper Snoek, and Ryan P Adams. Multi-task bayesian optimization. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26.
Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper/
2013/file/f33ba15effa5c10e873bf3842afb46a6-Paper.pdf.
TRAFFIC. Wildlife trade portal. www.wildlifetradeportal.org, 2021.
Lorraine Trilling, Alain Guinet, and Dominique Le Magny. Nurse scheduling using integer linear programming and constraint programming. IFAC Proceedings Volumes, 39(3):671–676, 2006.
Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun,
and Yoram Singer. Large margin methods for structured and interdependent
output variables. Journal of Machine Learning Research, 6(9), 2005.
UNODC. Enhancing the Detection, Investigation, and Disruption of Illicit Financial Flows from Wildlife Crime, 2017. URL http://www.unodc.org/unodc/
en/data-and-analysis/wildlife.html.
M. Utermohlen and P. Baine. In plane sight: Wildlife trafficking in the air transport sector. https://www.traffic.org/publications/reports/in-plane-sight/,
2018. Accessed: 2022-08-15.
Mary Utermohlen. Runway to extinction: Wildlife trafficking in the air transport
sector. https://routespartnership.org/industry-resources/publications/
runway-to-extinction-report, 2020. Accessed: 2022-08-15.
Guido Van Rossum and Fred L. Drake. Python 3 Reference Manual. CreateSpace,
Scotts Valley, CA, 2009. ISBN 1441412697.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.
Advances in neural information processing systems, 30, 2017.
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro
Liò, and Yoshua Bengio. Graph Attention Networks. International Conference
on Learning Representations (ICLR), 2018. URL https://openreview.net/forum?
id=rJXMpikCZ. accepted as poster.
211
Stefan Voß, Silvano Martello, Ibrahim H Osman, and Catherine Roucairol. Metaheuristics: Advances and trends in local search paradigms for optimization.
Springer Science & Business Media, 2012.
Kai Wang, Bryan Wilder, Andrew Perrault, and Milind Tambe. Automatically
learning compact quality-aware surrogates for optimization problems. Advances in Neural Information Processing Systems, 33:9586–9596, 2020a.
Linnan Wang, Rodrigo Fonseca, and Yuandong Tian. Learning search space partition for black-box optimization using monte carlo tree search. Advances in
Neural Information Processing Systems, 33:19511–19522, 2020b.
Linnan Wang, Saining Xie, Teng Li, Rodrigo Fonseca, and Yuandong Tian.
Sample-efficient neural architecture search by learning actions for monte carlo
tree search. IEEE Transactions on Pattern Analysis and Machine Intelligence,
2021a.
Po-Wei Wang, Priya Donti, Bryan Wilder, and Zico Kolter. Satnet: Bridging deep
learning and logical reasoning using a differentiable satisfiability solver. In
International Conference on Machine Learning, pages 6545–6554. PMLR, 2019.
Xiaodi Wang, Youbo Liu, Junbo Zhao, Chang Liu, Junyong Liu, and Jinyue Yan.
Surrogate model enabled deep reinforcement learning for hybrid energy community operation. Applied Energy, 289:116722, 2021b.
D Michael Warner. Scheduling nursing personnel according to nursing preference: A mathematical programming approach. Operations Research, 24(5):
842–856, 1976.
Gordon Wetzstein, Aydogan Ozcan, Sylvain Gigan, Shanhui Fan, Dirk Englund,
Marin Soljačić, Cornelia Denz, David AB Miller, and Demetri Psaltis. Inference
in artificial intelligence with deep optics and photonics. Nature, 588(7836):39–
47, 2020.
Bryan Wilder, Bistra Dilkina, and Milind Tambe. Melding the data-decisions
pipeline: Decision-focused learning for combinatorial optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages
1658–1665, 2019a.
Bryan Wilder, Eric Ewing, Bistra Dilkina, and Milind Tambe. End to end learning and optimization on graphs. Advances in Neural Information Processing
Systems, 32, 2019b.
212
Ryan Williams, Carla P Gomes, and Bart Selman. Backdoors to typical case complexity. In International Joint Conference on Artificial Intelligence (IJCAI), pages
1173–1178, 2003.
Laurence A Wolsey. An analysis of the greedy algorithm for the submodular set
covering problem. Combinatorica, 2(4):385–393, 1982.
Laurence A Wolsey. Mixed integer programming. Wiley Encyclopedia of Computer Science and Engineering, pages 1–10, 2007.
Laurence A Wolsey and George L Nemhauser. Integer and combinatorial optimization. John Wiley & Sons, 2014.
Lily Xu, Elizabeth Bondi, Fei Fang, Andrew Perrault, Kai Wang, and Milind
Tambe. Dual-mandate patrols: Multi-armed bandits for green security. In
Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), 2021.
Kevin Yang, Tianjun Zhang, Chris Cummins, Brandon Cui, Benoit Steiner, Linnan Wang, Joseph E Gonzalez, Dan Klein, and Yuandong Tian. Learning space
partitions for path planning. Advances in Neural Information Processing Systems, 34:378–391, 2021.
Yingjun Ye, Xiaohui Zhang, and Jian Sun. Automated vehicle’s behavior decision
making using deep reinforcement learning and high-fidelity simulation environment. Transportation Research Part C: Emerging Technologies, 107:155–170,
2019.
Lantao Yu, Yi Wu, Rohit Singh, Lucas Joppa, and Fei Fang. Deep reinforcement
learning for green security game with online information. In AAAI, 2019.
Daochen Zha, Louis Feng, Bhargav Bhushanam, Dhruv Choudhary, Jade Nie,
Yuandong Tian, Jay Chae, Yinbin Ma, Arun Kejariwal, and Xia Hu. Autoshard:
Automated embedding table sharding for recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data
Mining, pages 4461–4471, 2022a.
Daochen Zha, Louis Feng, Qiaoyu Tan, Zirui Liu, Kwei-Herng Lai, Bhushanam
Bhargav, Yuandong Tian, Arun Kejariwal, and Xia Hu. Dreamshard: Generalizable embedding table placement for recommender systems. In Advances in
Neural Information Processing Systems, 2022b.
213
Daochen Zha, Louis Feng, Liang Luo, Bhargav Bhushanam, Zirui Liu, Yusuo Hu,
Jade Nie, Yuzhen Huang, Yuandong Tian, Arun Kejariwal, et al. Pre-train and
search: Efficient embedding table sharding with pre-trained neural cost models. In Sixth Conference on Machine Learning and Systems, 2023.
Hejia Zhang, Matthew Fontaine, Amy Hoover, Julian Togelius, Bistra Dilkina,
and Stefanos Nikolaidis. Video game level repair via mixed integer linear programming. In Proceedings of the AAAI Conference on Artificial Intelligence and
Interactive Digital Entertainment, volume 16, pages 151–158, 2020.
Jing Zhang and Ioannis Ch Paschalidis. Data-driven estimation of travel latency
cost functions via inverse optimization in multi-class transportation networks.
2017 IEEE 56th Annual Conference on Decision and Control, CDC 2017, 2018-
January:6295–6300, 1 2018. doi: 10.1109/CDC.2017.8264608.
Wei Zhang and Thomas G Dietterich. A reinforcement learning approach to
job-shop scheduling. In IJCAI, volume 95, pages 1114–1120. Citeseer, 1995.
Yiyang Zhao, Linnan Wang, Kevin Yang, Tianjun Zhang, Tian Guo, and Yuandong Tian. Multi-objective optimization by learning space partition. In
International Conference on Learning Representations, 2022. URL https://
openreview.net/forum?id=FlwzVjfMryn.
Yanqi Zhou, Sudip Roy, Amirali Abdolrashidi, Daniel Wong, Peter Ma, Qiumin
Xu, Hanxiao Liu, Phitchaya Phothilimtha, Shen Wang, Anna Goldie, et al.
Transferable graph optimizers for ml compilers. Advances in Neural Information Processing Systems, 33:13844–13855, 2020.
Brian D Ziebart, Nathan Ratliff, Garratt Gallagher, Christoph Mertz, Kevin Peterson, J Andrew Bagnell, Martial Hebert, Anind K Dey, and Siddhartha Srinivasa.
Planning-based prediction for pedestrians. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3931–3936. IEEE, 2009.
214
Abstract
Artificial Intelligence (AI) has the potential to impact many facets of our society, largely due to its ability to quickly make high-quality data-driven decisions at scale. We consider Artificial Decision Intelligence (ADI) to be a paradigm for building artificial intelligence methods geared explicitly toward automated decision-making. In the rapidly evolving fields of machine learning (ML) and combinatorial optimization (CO), remarkable progress has been made in different directions, revolutionizing how we synthesize insights from data and how we best act on those insights. Machine learning, specifically deep learning, with its ability to learn intricate patterns from seemingly unstructured data, has seen profound success across diverse applications. Simultaneously, combinatorial optimization has made significant strides, efficiently performing industrial-scale decision-making by searching for optimal solutions in combinatorially large, highly structured search spaces. This thesis explores the tight integration of these two paradigms, developing new tools that combine the strengths of both for solving complex tasks. By treating machine learning and combinatorial optimization in a cohesive and complementary manner, we propose new methodologies that enable end-to-end data-driven decision-making, deep predictive models that respect combinatorial constraints, methods that solve complex problems by learning to formulate simpler surrogate optimization problems, and optimization algorithms that learn from historical data to improve solver performance. The proposed methodologies advance our capability to handle new and complex real-world problems.
Specifically, we demonstrate the impact of our methodologies in several domains, such as identifying wildlife trafficking routes, designing photonic devices, large-scale recommendation systems, financial portfolio optimization, generating game levels, and smart energy grid scheduling. Thus, this thesis serves as a step forward in artificial decision intelligence by solving complex tasks and providing decision-support tools that leverage machine learning and combinatorial optimization.
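The decision-focused learning idea referenced above evaluates a predictive model by the quality of the decisions it induces, rather than by prediction error alone. A minimal toy sketch (hypothetical illustration, not code from the thesis) makes the distinction concrete: two predictions with similar error magnitudes can induce decisions of very different quality.

```python
# Hypothetical sketch of the predict-then-optimize / decision-regret idea.
# Task: pick k items to maximize total TRUE value, given only PREDICTED values.

def optimize(values, k):
    """Toy combinatorial 'solver': select the k highest-valued item indices."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]

def decision_regret(pred, true, k):
    """Regret = optimal true value minus the true value of the decision
    induced by the predictions -- the loss decision-focused learning targets."""
    best = sum(true[i] for i in optimize(true, k))
    achieved = sum(true[i] for i in optimize(pred, k))
    return best - achieved

true_vals = [5.0, 4.0, 1.0]
pred_a = [4.9, 4.1, 1.2]   # small prediction error, induces the optimal decision
pred_b = [3.0, 4.0, 4.5]   # comparable error magnitude, induces a poor decision

print(decision_regret(pred_a, true_vals, k=2))  # 0.0 -> no decision cost
print(decision_regret(pred_b, true_vals, k=2))  # 4.0 -> costly decision error
```

Training against a (differentiable surrogate of) this regret, instead of squared prediction error, is what distinguishes the end-to-end methodologies above from a conventional two-stage predict-then-optimize pipeline.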
Asset Metadata
Creator: Ferber, Aaron Matthew (author)
Core Title: Artificial Decision Intelligence: integrating deep learning and combinatorial optimization
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Degree Conferral Date: 2023-12
Publication Date: 10/09/2023
Defense Date: 08/11/2023
Publisher: University of Southern California, Los Angeles, California (original); University of Southern California. Libraries (digital)
Tag: Artificial Decision Intelligence, branch and bound, combinatorial optimization, cutting planes, data-driven decision making, decision support tools, decision-focused learning, deep learning, discrete optimization, game level design, inverse photonic design, learning for combinatorial optimization, machine learning, mixed integer linear programming, OAI-PMH Harvest, portfolio optimization, predictive models, reasoning, recommendation systems, surrogate optimization problems, table sharding, wildlife trafficking
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Dilkina, Bistra (committee chair); Liu, Yan (committee member); Vayanos, Phebe (committee member)
Creator Email: aferber@usc.edu, aferber6174@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC113758744
Unique identifier: UC113758744
Identifier: etd-FerberAaro-12418.pdf (filename)
Legacy Identifier: etd-FerberAaro-12418
Document Type: Thesis
Rights: Ferber, Aaron Matthew
Internet Media Type: application/pdf
Type: texts
Source: 20231013-usctheses-batch-1101 (batch); University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu