IMPROVING DECISION-MAKING IN SEARCH ALGORITHMS
FOR COMBINATORIAL OPTIMIZATION WITH MACHINE LEARNING
by
Taoan Huang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2024
Copyright 2024 Taoan Huang
Dedication
To my grandma.
Acknowledgements
First and foremost, I would like to express my deepest gratitude to my advisors, Bistra Dilkina and Sven
Koenig, for their unwavering patience and guidance throughout my research journey, for their constructive
advice whenever I needed it and for the freedom they offered me to explore diverse research problems.
Their enthusiasm, curiosity and dedication to research have profoundly inspired me as a junior researcher.
I am also deeply grateful to the members of my dissertation committee, Lars Lindemann, Meisam
Razaviyayn, and Peter Stuckey, for their generous time, insightful feedback and support. A special thanks
to Lars for accepting the invitation to join my dissertation committee on short notice.
I would like to extend my heartfelt thanks to Fei Fang, my former undergraduate research advisor,
for her pivotal role in shaping my early research career. Fei took me as an intern at Carnegie Mellon
University in 2018 when I had no prior research experience. She guided me through my first couple of
research projects with extraordinary patience.
I have been fortunate to collaborate with many talented researchers. To Yuandong Tian from Meta
AI Research, thank you for many brilliant suggestions for many research problems and our productive
collaboration. To Vikas Shivashankar, Michael Caldara and Joseph Durham from Amazon Robotics, thank
you for your support and for providing me with the opportunity to work on my first real-world warehouse
planning problem. To Jiaoyang Li, thank you for the productive discussions and valuable insights into the
multi-agent path finding problem. I would also like to thank my other collaborators, Brandon Amos,
Xiaohui Bei, Vadim Bulitko, Junyang Cai, Weizhe Chen, Bohui Fang, Tianyu Gu, Minbiao Han, Weimin
Huang, Thomy Phan, Sumedh Pendurkar, Martin Schubert, Guni Sharon, Weiran Shen, Rohit Singh, Benoit
Steiner, Roni Stern, Shuwei Wang, Haifeng Xu, David Zeng, Daochen Zha, Shuyang Zhang and Arman
Zharmagambetov, for our fruitful collaborations.
I have thoroughly enjoyed my time with my labmates from the IDM lab and CAIS. To Sina Aghaei,
Junyang Cai, Shao-Hung Chan, Weizhe Chen, Aaron Ferber, Amrita Gupta, Weimin Huang, Qing Jin, Caroline Johnston, Nathan Justin, Christopher Leet, Haoming Li, Jiaoyang Li, Laksh Matai, Hannah Murray,
Thomy Phan, Caleb Robinson, Qingshi Sun, Bill Tang, Yimin Tang, Yingxiao Ye, Han Zhang and Yi Zheng,
thank you for the casual but interesting conversations we had and the many fun lab events.
Finally, I would like to express my deepest gratitude to my parents and my sister for their unconditional
love and support. My heartfelt thanks also go to Zixin Huang for her companionship and for being by my
side every step of the way.
This dissertation reports on research supported by the National Science Foundation (NSF) under grant
numbers 1409987, 1724392, 1817189, 1837779, 1935712 and 2112533, the U.S. Department of Homeland Security under grant number 2015-ST-061-CIRC01 as well as a gift from Amazon. The views and conclusions
contained in this dissertation should not be interpreted as necessarily representing the official policies,
either expressed or implied, of the sponsoring organizations, agencies or the U.S. government. I would
also like to thank Shuyang Zhang, an amazing undergraduate student I co-advised with Jiaoyang Li, for
contributing to the empirical evaluation and part of the writing in Section 2.9.
Table of Contents
Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1 Combinatorial Optimization Problems (COPs)
1.1.1 Multi-Agent Path Finding (MAPF)
1.1.2 Mixed Integer Linear Program (MILP)
1.2 Search Algorithms for COPs
1.2.1 Machine Learning (ML) for COPs
1.3 Contributions
Chapter 2: Improving Decision-Making in MAPF Search Algorithms
2.1 Introduction
2.2 Multi-Agent Path Finding
2.3 Background
2.3.1 Conflict-Based Search (CBS)
2.3.2 Enhanced CBS (ECBS)
2.3.3 MAPF-LNS
2.3.4 Prioritized Planning (PP)
2.3.5 MAPF Instances Used in the Empirical Evaluation
2.4 Related Work
2.4.1 MAPF Search Algorithms
2.4.2 ML for MAPF
2.4.3 ML for Other COPs that Inspires Our Work
2.5 An Imitation Learning Framework for Learning Decision-Making Strategies
2.6 Learning to Select Conflicts for CBS
2.6.1 Machine Learning Methodology
2.6.1.1 Experts for Conflict Selection
2.6.1.2 Data Collection
2.6.1.3 Model Learning
2.6.1.4 ML-Guided Search
2.6.2 Empirical Evaluation
2.6.2.1 Setup
2.6.2.2 Results
2.7 Learning to Select Nodes for ECBS
2.7.1 Machine Learning Methodology
2.7.1.1 Expert for Node Selection
2.7.1.2 Data Collection
2.7.1.3 Model Learning
2.7.1.4 ML-Guided Search
2.7.2 Empirical Evaluation
2.7.2.1 Setup
2.7.2.2 Results
2.8 Learning to Select Agent Sets for MAPF-LNS
2.8.1 Machine Learning Methodology
2.8.1.1 Expert for Agent-Set Selection
2.8.1.2 Data Collection
2.8.1.3 Model Learning
2.8.1.4 ML-Guided Search
2.8.2 Empirical Evaluation
2.8.2.1 Setup
2.8.2.2 Results
2.9 Learning to Prioritize Agents for PP
2.9.1 Machine Learning Methodology
2.9.1.1 Expert for Assigning Agents’ Priorities
2.9.1.2 Data Collection
2.9.1.3 Model Learning
2.9.1.4 ML-Guided Search
2.9.2 Empirical Evaluation
2.9.2.1 Setup
2.9.2.2 Results
2.10 Summary
Chapter 3: Improving Decision-Making in MILP Search Algorithms
3.1 Introduction
3.2 Mixed Integer Linear Programs
3.3 Background
3.3.1 LNS for MILP Solving
3.3.1.1 Local Branching Heuristic
3.3.1.2 Local Branching Relaxation Heuristic
3.3.2 Neural Diving
3.3.3 Predict-and-Search
3.4 Related Work
3.4.1 LNS for MILPs and Other COPs
3.4.2 LNS-Based Primal Heuristics in BnB
3.4.3 Learning to Solve MILPs with BnB
3.4.4 Solution Predictions for COPs
3.4.5 Contrastive Learning for COPs
3.5 A Contrastive Learning Framework for Learning Decision-Making Strategies
3.6 Contrastive Large Neighborhood Search
3.6.1 Machine Learning Methodology
3.6.1.1 Data Collection
3.6.1.2 Neural Network Architecture
3.6.1.3 Model Learning with a Contrastive Loss
3.6.1.4 ML-Guided Search
3.6.2 Empirical Evaluation
3.6.2.1 Setup
3.6.2.2 Results
3.7 Contrastive Predict-and-Search
3.7.1 Machine Learning Methodology
3.7.1.1 Data Collection
3.7.1.2 Neural Network Architecture
3.7.1.3 Model Learning with a Contrastive Loss
3.7.1.4 ML-Guided Search
3.7.2 Empirical Evaluation
3.7.2.1 Setup
3.7.2.2 Results
3.8 Summary
Chapter 4: Conclusions
Bibliography
Appendices
A Supplementary Materials to Chapter 3
A.1 Additional Details of MILP Instance Generation
A.2 Supplementary Materials to Section 3.6
A.2.1 Neural Network Architecture for CL-LNS
A.2.2 Hyperparameter Tuning
A.2.3 Additional Experimental Results
A.3 Supplementary Materials to Section 3.7
A.3.1 Neural Network Architecture for ConPaS
A.3.2 Hyperparameter Tuning
A.3.3 Additional Experimental Results
List of Tables
1.1 Acronyms and their meanings.
2.1 Performance of CBSH2 with different experts and our method CBS+ML. Expert time is
the runtime of the expert. Search time is the runtime minus the expert time. All entries
are averaged over the MAPF instances that are solved by all methods.
2.2 Features of a conflict $c = \langle a_i, a_j, u, t \rangle$ ($\langle a_i, a_j, u, v, t \rangle$) of a CT node $N$. Given
the underlying graph $G = (V, E)$, let $V_T = \{(v, t) : v \in V, t \in \mathbb{Z}_{\geq 0}\}$,
$E_T = \{((u, t), (v, t + 1)) : t \in \mathbb{Z}_{\geq 0} \wedge (u = v \vee (u, v) \in E)\}$, and define the time-expanded
graph as an unweighted graph $G_T = (V_T, E_T)$. Let $d_{u,v}$ be the cost of the cost-minimal
path between vertices $u$ and $v$ in $G$ and $d_{(u',t'),(u,t)}$ be the distance from
$(u', t')$ to $(u, t)$ in $G_T$ if $t' \leq t$ or from $(u, t)$ to $(u', t')$, otherwise. For a conflict
$c' = \langle a'_i, a'_j, u', t' \rangle$ ($\langle a'_i, a'_j, u', v', t' \rangle$) in $N_{\mathrm{Conf}}$, define $V_{c'} = \{u'\}$ ($V_{c'} = \{u', v'\}$)
and $V^T_{c'} = \{(u', t')\}$ ($V^T_{c'} = \{(u', t'), (v', t')\}$). For an agent $a$, define
$V_a = \{(u, t) : \text{agent } a \text{ is at vertex } u \text{ at time step } t \text{ following its path}\}$. The counts are the numbers of
features contributed by the corresponding entries, which add up to $p = 67$.
2.3 Numbers of agents in MAPF instances in $I_{\mathrm{Train}}$ and $I_{\mathrm{Valid}}$, validation losses and accuracies.
The swapped pairs are the percentages of swapped pairs averaged over all test CT nodes,
and the top pick accuracy is the accuracy of the ranking function selecting one of the
conflicts labeled as 1 in the test dataset.
2.4 Success rates and the average runtimes and CT sizes of MAPF instances solved by
all methods (ML-S and ML-O stand for CBS+ML-S and CBS+ML-O, respectively) for
different numbers of agents $k$ on five maps. For the success rates of ML-S and ML-O,
the percentages of MAPF instances solved by both our methods and CBSH2 are given in
parentheses (bolded if they solve all MAPF instances that CBSH2 solves). For each grid
map, we report the percentages of our improvement over CBSH2 on the runtime and CT
size on MAPF instances solved by all methods.
2.5 Parameters for each grid map. $w$ is the suboptimality factor, $m$ is the number of different
numbers of agents we train and test on, $k_1$ is the smallest number of agents that we
train and test on, $k_m$ is the largest number of agents that we train and test on, and $|V|$ is
the number of unblocked cells on the grid map. $k_2, \ldots, k_{m-1}$ are evenly distributed on
$[k_1, k_m]$, i.e., $k_i = (i - 1)(k_m - k_1)/(m - 1) + k_1$.
2.6 Loss $l_i \in [0, 1]$ of ranking function $\pi_i$ for $k_i$ agents evaluated by Equation (2.1) averaged
over all CTs in the training data.
2.7 Agent $a_i$’s features with respect to instance $I$ and incumbent solution $P_I = \{p_i : i \in [k]\}$.
The counts are the numbers of features contributed by the corresponding entries.
2.8 Validation results for the learned ranking function π. “Training k” is the number of agents
of the training instances. “Average ranking” is the average rank of the first agent set
selected by π among the S = 20 agent sets. “Improving choice” is the fraction of times
π selects an agent set that results in a positive cost improvement. “Regret” is calculated
as the average of 100% minus the cost improvement achieved by π as a percentage of the
cost improvement achieved by the expert.
2.9 The average ratios of the AUCs of MAPF-LNS and variants of MAPF-ML-LNS (ML-S
and ML-O) with their standard deviations, the win/loss counts with respect to the AUCs
and the average sums of delays with the average suboptimalities for a runtime limit
of 60 seconds. All entries take only the solved MAPF instances into account. We bold
the number of agents k on which ML-S is trained and the entries where a variant of
MAPF-ML-LNS outperforms MAPF-LNS.
2.10 The average number of replans of MAPF-LNS and ML-S for a runtime limit of 60 seconds.
2.11 $p = 26$ features for agent $a_i$. Column “Count” reports the numbers of features contributed
by the corresponding entries. We consider an MDD $\mathit{MDD}_i$ for agent $a_i$ that consists
of all individually cost-minimal paths from $s_i$ to $t_i$, i.e., the MDD that would have been
computed at the root CT node in CBS.
2.12 Success rate and solution rank for deterministic ranking. The best results achieved among
all algorithms are shown in bold. The results are obtained by training and testing on the
same map with the same number of agents k, except for maps lak303d and ost003d with
k > 500, where the results are obtained by training on the same map with k = 500.
2.13 Success rate and runtime to the first solution for stochastic ranking with random restarts.
The best results achieved among all algorithms are shown in bold. The results are obtained
by training and testing on the same grid map with the same number of agents k, except for
grid maps lak303d and ost003d with k > 500, where the results are obtained by training
on the same map with k = 500.
3.1 Names and the average numbers of variables and constraints of the test instances.
3.2 Primal gap (PG) (in percent), primal integral (PI) at 60-minute runtime cutoff, averaged
over 100 test instances and their standard deviations. “↓” means the lower, the better. For
ML methods, the policies are trained only on small training instances but are tested on
both small and large test instances.
3.3 Ablation study: Primal gap (PG) (in percent) and primal integral (PI) at 60-minute runtime
cutoff, averaged over 100 small test instances and their standard deviations. “↓” means the
lower, the better.
3.4 The average numbers of variables and constraints in the test instances.
3.5 Comparison of different loss functions. We report the primal gaps (PG) and the primal
integrals (PI) at the 1,000-second runtime cutoff averaged over 100 instances.
3.6 The primal gap and primal integral at the 1,000-second runtime cutoff on the CA instances
with different $k_0$ averaged over 100 instances.
A.1 Hyperparameters with their notations and values used.
A.2 Test results on small instances: Primal bound (PB), primal gap (PG) (in percent), primal
integral (PI) at 30 minutes time cutoff, averaged over 100 instances and their standard
deviations.
A.3 Test results on small instances: Primal bound (PB), primal gap (PG) (in percent), primal
integral (PI) at 60 minutes time cutoff, averaged over 100 instances and their standard
deviations.
A.4 Generalization results on large instances: Primal bound (PB), primal gap (PG) (in percent),
primal integral (PI) at 30 minutes time cutoff, averaged over 100 instances and their
standard deviations.
A.5 Generalization results on large instances: Primal bound (PB), primal gap (PG) (in percent)
and primal integral (PI) at 60 minutes time cutoff, averaged over 100 instances and their
standard deviations.
A.6 Hyperparameters $(k_0, k_1, \Delta)$ used for PaS and ConPaS.
A.7 Tabular representation of the primal integral plots in Figures 3.10 and 3.11: The primal
integral and the standard deviation at 1,000 seconds runtime cutoff averaged over 100
instances.
A.8 Comparisons with Gurobi: Hyperparameters $(k_0, k_1, \Delta)$ used for PaS and ConPaS-LQ.
A.9 Prediction accuracy and AUROC on 100 validation instances.
List of Figures
1.1 An ML method that applies to a search algorithm for a COP.
1.2 An ML method that applies to a search algorithm for multiple COPs.
1.3 An ML framework: An ML method that applies to multiple search algorithms for a COP.
1.4 Contribution 1: An ML framework that applies imitation learning to multiple MAPF
search algorithms.
1.5 Contribution 2: An ML framework that applies contrastive learning to multiple MILP
search algorithms.
2.1 Success rates for a runtime limit of 5 minutes as a function of the number of agents for
each grid map. The values of w and the numbers of agents are listed in Table 2.5. For
ECBS+ML, ECBS+ML(ES) and ECBS+IL, the vertical line of the same color indicates the
number of agents in the last iteration where a ranking function is learned in the training
algorithm. In the figure for the warehouse map, the graph of ECBS+h1 coincides entirely
with the one of ECBS+h2.
2.2 Success rates for a fixed number of agents as a function of the runtime limit for each grid
map.
2.3 Success rates for a runtime limit of 5 minutes as a function of the suboptimality factor w
on the random map for 95 agents. The vertical brown line indicates the value of w in the
last iteration where a ranking function is learned for ECBS+ML(w).
2.4 Feature importance plots. We restate the definitions of some atomic features here (see
Section 2.7.1.2 for the full list): f1 is the number of conflicts, f2 is the number of pairs of
agents that have at least one conflict with each other, f3 is the number of agents that have
at least one conflict with other agents, and f9 is the depth of the CT node.
2.5 Evolution of the solution quality as a function of the number of replans for MAPF-LNS,
MAPF-ML-LNS and MAPF-LNS with the expert.
2.6 Evolutions of the sum of costs (solid curves with the y-axis on the left side, smaller is
better) from 1 second to 60 seconds for MAPF-LNS, ML-S and ML-O, averaged over all
solved instances, and the average ratio of the AUCs of MAPF-LNS and one of ML-S and
ML-O (dotted curves with the y-axis on the right side, greater than 1 is better), also
averaged over all solved instances, as a function of the runtime. The error bars represent
the standard deviation.
2.7 Normalized sum of costs (i.e., we normalize them by taking the ratio of the sum of costs
of the solution over the sum of the lengths of the individually cost-minimal paths of
all agents) of 100 PP runs with different random priority orderings on MAPF instance
“room-32-32-4-random-1.scen” from [181] with 20 agents, sorted in increasing order of
their normalized sums of costs. PP runs that fail to find a solution are shown on the top of
the plot.
2.8 An example of Op. Assume we have a MAPF instance with k = 4 agents on an empty
4 × 5 grid map. The start and goal vertices of agents a1 and a2 are shown in (a).
2.9 Normalized sum of costs for deterministic ranking on the random map. Unsolved MAPF
instances are shown on top of the plot.
2.10 Solution rank for stochastic ranking with random restarts.
3.1 An overview of training and data collection for CL-LNS. For each MILP instance for
training, we run several LNS iterations with LB. In each iteration, we collect both positive
and negative neighborhood samples and add them to the training dataset, which is used
in downstream supervised contrastive learning for neighborhood selections.
3.2 The primal gap (the lower, the better) as a function of runtime averaged over 100 test
instances. For ML methods, the policies are trained only on small training instances but
are tested on both small and large test instances.
3.3 The survival rate (the higher, the better) over 100 test instances as a function of runtime
to meet the primal gap threshold 1.00%. For ML methods, the policies are trained only on
small training instances but are tested on both small and large test instances.
3.4 The best performing rate (the higher, the better) as a function of runtime on 100 test
instances. The best performing rates at a given runtime might sum to more
than 1 since ties are counted multiple times.
3.5 The primal bound (the lower, the better) as a function of the number of iterations averaged
over 100 small test instances. LB and LB (data collection) are LNS with LB using the
neighborhood sizes fine-tuned for CL-LNS and data collection, respectively. The table
shows the neighborhood size (NH size) and the average runtime in seconds (with standard
deviations) per iteration.
3.6 Ablation study: The primal gap (the lower, the better) as a function of time averaged over
100 small test instances.
3.7 Overview of ConPaS. For training, we collect data from a set of MILP instances, including
positive samples that are optimal and near-optimal solutions and negative samples that
are low-quality or infeasible solutions. We use the data in supervised CL to predict optimal
solutions. During testing, the predictions are used in Predict-and-Search [70].
3.8 The primal gap (the lower, the better) as a function of runtime, averaged over 100 test
instances.
3.9 The survival rate (the higher, the better) to meet a certain primal gap threshold over 100
test instances as a function of runtime. The primal gap threshold is set to the median of
the average primal gaps at the 1,000-second runtime cutoff among all methods rounded to
the nearest 0.50%.
3.10 The primal integral (the lower, the better) at the 1,000-second runtime cutoff, averaged
over 100 test instances. The error bars represent the standard deviation. A tabular
representation is provided in Appendix Table A.7.
3.11 Generalization to 100 large instances: The primal gap as a function of runtime, the survival
rate as a function of runtime and the primal integral at the 1,000-second runtime cutoff.
The primal gap threshold for the survival rate is chosen as the median of the average
primal gaps at the 1,000-second runtime cutoff among all methods rounded to the nearest
0.50%. A tabular representation for the primal integral plots is provided in the Appendix.
3.12 Training on different fractions of training instances: The primal gap as a function of
runtime and the primal integral at the 1,000-second runtime cutoff. ConPaS-LQ-50% and
ConPaS-LQ-25% denote the versions of ConPaS trained with only 50% and 25% of the
training instances, respectively (similarly for PaS).
A.1 The primal gap (the lower, the better) as a function of time, averaged over 100 instances.
For ML approaches, the policies are trained on only small training instances but tested on
both small and large test instances.
A.2 The survival rate (the higher, the better) over 100 instances as a function of time to meet
primal gap threshold 1.00%. For ML approaches, the policies are trained on only small
training instances but tested on both small and large test instances.
A.3 The primal bound (the lower, the better) as a function of time, averaged over 100 instances.
For ML approaches, the policies are trained on only small training instances but tested on
both small and large test instances.
A.4 The best performing rate (the higher, the better) as a function of runtime over 100 test
instances. For ML approaches, the policies are trained on only small training instances
but tested on both small and large test instances.
A.5 The gap to virtual best (the lower, the better) as a function of runtime, averaged over 100
test instances. For ML approaches, the policies are trained on only small training instances
but tested on both small and large test instances.
A.6 The primal gap as a function of runtime and the primal integral at the 1,000-second runtime
cutoff. Note that the curves of PaS and ConPaS largely overlap with each other.
A.7 Comparisons with Gurobi: The primal gap (the lower, the better) as a function of runtime,
averaged over 100 test instances.
A.8 Comparisons with Gurobi: The primal integral (the lower, the better) at the 1,000-second
runtime cutoff, averaged over 100 test instances. The error bars represent the standard
deviation.
Abstract
Combinatorial optimization is fundamental in computer science and operations research, focusing on finding high-quality solutions in structured solution spaces. It encompasses a wide range of real-world problems, including those in logistics, manufacturing, transportation and finance. Many search algorithms have
been proposed to solve combinatorial optimization problems (COPs). The high complexity of COPs makes
effective decision-making strategies crucial in these algorithms. These strategies guide the search process by navigating the search space. Effective decision-making strategies enhance the efficiency of finding
optimal or near-optimal solutions. Despite the significant advancements in search algorithms, human-designed strategies have a few limitations. They often rely on domain-specific knowledge that may not
generalize well to different instances and may lead to suboptimal performance. In this dissertation, we
show that one can use machine learning (ML) to improve human-designed decision-making strategies for
different search algorithms for COPs. Specifically, we focus on two important COPs, namely multi-agent
path finding and mixed integer linear programs.
Multi-agent path finding (MAPF) is the problem of finding conflict-free (i.e., collision-free) paths for
agents in a shared environment that minimize their total travel time. It is an NP-hard problem that has
important applications for distribution centers, traffic management and computer games. Various search
algorithms have been proposed to solve MAPF. Optimal search algorithms, such as Conflict-Based Search (CBS),
are guaranteed to find optimal solutions. To trade off runtime with solution quality, bounded-suboptimal
search algorithms, such as Enhanced CBS (ECBS), have been proposed to find solutions with a guaranteed
approximation ratio and are more scalable than optimal search algorithms. Unbounded-suboptimal search
algorithms, such as Prioritized Planning (PP) and Large Neighborhood Search (LNS), drop optimality guarantees to find solutions even faster. There are a handful of decisions in these search algorithms that concern
partitioning the search space, prioritizing search space exploration and pruning the search space. Thus,
they typically have a big impact on the efficiency and/or effectiveness of the search. In the past decade
of research on MAPF, these decisions have mainly been made manually by humans. In this dissertation,
we show that one can leverage ML to improve decision-making in various types of search algorithms for
MAPF and introduce CBS+ML, ECBS+ML, MAPF-ML-LNS and PP+ML. Specifically, we apply general ML
techniques to learn (1) which conflict to resolve next in CBS, (2) which search tree node to expand next in
ECBS, (3) which part of the solution to improve next in LNS and (4) which priority to assign to agents in
PP. In these four settings, we deploy imitation learning to imitate slow but effective experts and reduce the
ML task to learning-to-rank problems where the ML models rank available decision options. Empirically,
we demonstrate that our ML-guided search algorithms show substantial improvement in terms of the success rates, runtimes and/or solution qualities over their state-of-the-art non-ML-guided counterparts on
several different types of grid maps from a popular MAPF benchmark.
Mixed integer linear programs (MILPs) are flexible and powerful tools for modeling and solving many
difficult real-world COPs. In the past decades, research efforts have been dedicated to improving Branch-and-Bound (BnB), an optimal MILP search algorithm. It is a tree search algorithm that repeatedly breaks the
MILP down into smaller subproblems and maintains upper and lower bounds to eliminate subproblems that
cannot contain an optimal solution. Unlike BnB, metaheuristic search algorithms, such as Large Neighborhood Search (LNS) and Predict-and-Search (PaS), are popular unbounded-suboptimal MILP search algorithms that can find high-quality solutions to MILPs much faster without having to prove their optimality.
There are important decisions to make in both LNS and PaS: LNS improves the solution by iteratively
reoptimizing a subset of variables, and deciding which subset of variables to reoptimize is a challenging
decision to make; PaS predicts which values to assign to a subset of variables based on the input MILP to
get a reduced-size MILP that is much faster to solve, and deciding which variables to fix to which values
is also important. In previous works, those decisions have mainly been made manually by humans or
learned by imitation learning algorithms. In this dissertation, we show that one can leverage contrastive
learning to improve decision-making in both LNS and PaS and introduce CL-LNS and ConPaS. In these
two settings, we develop novel data collection techniques to collect both positive and negative samples,
which are crucial to the success of contrastive learning, and use supervised contrastive losses to train the
ML models. Empirically, we demonstrate that both CL-LNS and ConPaS significantly outperform their
non-ML-guided and ML-guided counterparts in terms of both runtime and solution quality.
Chapter 1
Introduction
Combinatorial optimization [153] is a pivotal area in computer science and operations research, focusing
on selecting the best solutions from vast but structured solution spaces. Such problems are characterized
by decisions that are discrete in nature, involving yes/no or select-from-many options. These challenges
are prevalent across numerous sectors, including logistics, manufacturing, telecommunications, finance
and healthcare, underscoring their significance in both theoretical research and practical applications.
Over the past decades, search algorithms for combinatorial optimization problems (COPs) have played a crucial
role by helping to navigate the solution spaces efficiently and/or effectively. Here, efficiency refers to small
runtimes and effectiveness refers to high-quality solutions. Meanwhile, machine learning (ML) has
profoundly advanced various domains within computer science, revolutionizing traditional approaches
and enabling new paradigms of innovation. In this dissertation, we explore leveraging ML to improve
search algorithms for COPs and validate the following hypothesis:
One can leverage general ML frameworks to improve decision-making strategies in different types of
search algorithms for combinatorial optimization problems.
To explain the hypothesis, we first provide an introduction to COPs in Section 1.1. We then provide an
overview of search algorithms for COPs in Section 1.2, where we also discuss decision-making strategies
in search algorithms and what has or has not been done in using ML to improve those strategies. Finally,
we present the contribution of this dissertation in Section 1.3.
1.1 Combinatorial Optimization Problems (COPs)
The significance of combinatorial optimization extends across various practical applications. In logistics
and transportation, it is employed to optimize routing and scheduling, reducing costs and improving efficiency [189]. In manufacturing, it aids in resource allocation and production scheduling [165, 109]. In
network design and management, it is applied to optimizing the operation of transportation and utility
networks to enhance the networks’ throughput or reliability [84, 67].
The primary objective in COPs is to optimize a particular objective function, such as cost, time or resource usage, subject to a set of constraints. This could involve, for example, finding the most cost-effective
route in a delivery network that adheres to constraints such as distance and capacity limits. Problems in
this field are inherently complex, often requiring sophisticated mathematical models and algorithms. The
complexity arises from the exponential growth of possible solutions as the instance size increases, making
many of these problems NP-hard and thus challenging to solve [153].
In this dissertation, we focus on two specific NP-hard COPs, namely multi-agent path finding (MAPF)
and mixed integer linear programs (MILPs). Next, we introduce both MAPF and MILPs.
1.1.1 Multi-Agent Path Finding (MAPF)
MAPF is a COP that involves finding optimal paths for multiple agents to navigate from their respective
start locations to their target locations without colliding with each other or with static obstacles in a potentially
congested shared environment. The objective of MAPF is typically to minimize the sum of all agents’
travel times.
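To make the informal description above concrete, the following minimal Python sketch checks a candidate set of paths for collisions and computes the sum of travel times. It is only an illustration under stated assumptions, not the formal definition used in this dissertation (which appears in Section 2.2): the grid coordinates, the two conflict types (two agents in the same cell, or two agents swapping cells), the convention that an agent waits at its last location, and the function names are all hypothetical choices made for this example.

```python
from itertools import combinations


def location_at(path, t):
    """Assumed convention: an agent waits at its final location after arriving."""
    return path[min(t, len(path) - 1)]


def find_conflicts(paths):
    """Return a list of (agent_i, agent_j, timestep, kind) collisions."""
    conflicts = []
    horizon = max(len(p) for p in paths)
    for (i, p), (j, q) in combinations(enumerate(paths), 2):
        for t in range(horizon):
            if location_at(p, t) == location_at(q, t):
                # Both agents occupy the same cell at time t.
                conflicts.append((i, j, t, "vertex"))
            elif (t + 1 < horizon
                  and location_at(p, t) == location_at(q, t + 1)
                  and location_at(p, t + 1) == location_at(q, t)):
                # The agents exchange cells between t and t + 1.
                conflicts.append((i, j, t, "edge"))
    return conflicts


def sum_of_travel_times(paths):
    """Each path lists one location per timestep, so its travel time is len - 1."""
    return sum(len(p) - 1 for p in paths)


# Two agents on a small grid; the second waits one step to let the first pass.
paths_ok = [[(0, 0), (0, 1), (0, 2)], [(1, 1), (1, 1), (0, 1)]]
# Two agents that try to swap cells, which counts as a collision here.
paths_bad = [[(0, 0), (0, 1)], [(0, 1), (0, 0)]]

print(find_conflicts(paths_ok), sum_of_travel_times(paths_ok))
print(find_conflicts(paths_bad))
```

A set of paths is a feasible solution under these assumptions exactly when `find_conflicts` returns an empty list; the objective then compares feasible solutions by `sum_of_travel_times`.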
We focus on MAPF in this dissertation since it has significant relevance across multiple real-world
domains: In robotics, it is crucial to coordinate the movement of autonomous robots in warehouses or
factories, where efficient path planning can drastically improve effectiveness and safety [199]. In computer
games, MAPF algorithms enhance realism by moving non-player characters in certain
formations [140, 125], which contributes to more enjoyable gameplay. In transportation management and
logistics, MAPF algorithms benefit applications like drone delivery systems [81] and autonomous vehicle
coordination [45, 123, 33], which also help to reduce traffic congestion and improve safety. In disaster
response scenarios, managing the paths of multiple agents efficiently is also pivotal since search and rescue
operations rely on coordinated movements of responders and evacuees to maximize coverage and minimize
response times [161, 157]. Overall, the significance of MAPF lies in its ability to enhance efficiency and
effectiveness in systems involving multiple agents that are not allowed to collide. As the deployment
of multi-agent systems continues to grow, the ability to solve MAPF problems efficiently and effectively
becomes increasingly critical.
However, solving MAPF optimally is NP-hard on general graphs [208] and on some specific
types of graphs, such as planar graphs [207] and grid graphs [9], meaning that finding the optimal solution can require computational time that grows exponentially with the problem instance size. Even
approximate solutions can be challenging to obtain within a reasonable runtime [139]. The complexity
arises from the need to account for the interactions between agents, which leads to exponential growth
in the search space as the number of agents increases. This difficulty is further exacerbated in dynamic environments where agents must adapt to changing conditions in real time [126] or when path planning is
coupled with other tasks such as target assignments [138].
1.1.2 Mixed Integer Linear Program (MILP)
In addition to MAPF, we also focus on mixed integer linear programs (MILPs) since they are a powerful
mathematical modeling technique, and a wide range of COPs can be formulated as MILPs. MILPs extend
linear programs by allowing some or all of the decision variables to be restricted to integer values. The
general form of a MILP involves an objective function, which is to be maximized or minimized, subject
to a set of linear equality and/or inequality constraints. Including integer variables in MILPs makes them
particularly useful for solving problems requiring discrete decisions. MILPs have applications in numerous
domains due to their flexibility in modeling complex decision-making: In operations research, MILPs are
used to solve scheduling [109, 146], resource allocation [165] and supply chain management problems [204,
93, 171]. In finance, they are applied to portfolio optimization [13, 145], where investment decisions must
comply with risk control and budget constraints. In the energy sector, MILPs are used to optimize power
generation and distribution [29, 152, 129]. In logistics and transportation, MILPs help in vehicle routing
and network design [134, 65]. The versatility of MILPs makes them an essential tool for industries where
optimal decision-making is crucial.
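For concreteness, the general form described above can be written as follows. This is one common way of stating a MILP; the symbols c, A, b, n and the index set of integer variables are notation assumed here for illustration and may differ from the notation used later in this dissertation.

```latex
\begin{align*}
\min_{x \in \mathbb{R}^n} \quad & c^\top x \\
\text{s.t.} \quad & A x \le b, \\
& x_j \in \mathbb{Z} \quad \forall j \in \mathcal{I} \subseteq \{1, \dots, n\}.
\end{align*}
```

In this form, an equality constraint can be expressed as a pair of opposite inequalities, and a maximization objective can be handled by negating c. Dropping the integrality requirement on the variables indexed by the set I yields an ordinary linear program.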
However, MILPs are generally NP-hard to solve. This complexity arises from the integer constraints,
which create a combinatorial explosion of possible solutions. Unlike linear programs, where efficient
polynomial-time algorithms exist, large-scale MILP instances with numerous variables and constraints
are much harder to solve, requiring significant computational resources and sophisticated algorithms.
1.2 Search Algorithms for COPs
A variety of search algorithms for COPs have been developed, and they play a crucial role by helping to navigate the solution
spaces efficiently and effectively. To name a few, there are optimal search algorithms, such as A* search
[72] and Branch-and-Bound (BnB) [113], bounded-suboptimal search algorithms, such as focal search [155]
and explicit estimation search [186], and unbounded-suboptimal search algorithms, such as local search
[97] and genetic algorithms [79]. Many solvers for COPs based on these search algorithms have evolved
considerably over the past decades and can handle large-scale problem instances thanks to algorithmic
advances. There are many decisions to make in search algorithms that are crucial to their success. These
decisions concern the exploration and exploitation of the search space and include partitioning
the search space into multiple parts, determining the order in which to explore these parts and deciding
which part of the search space should be pruned.
In the context of MAPF, recent research has focused on developing efficient MAPF search algorithms
to tackle its inherent complexity. MAPF search algorithms can be broadly categorized into centralized and
decentralized methods. Centralized methods, such as Conflict-Based Search (CBS) [174] and its variants
[10, 120, 124], provide optimal or bounded-suboptimal solutions by considering the joint state space of
all agents, which includes information on all agents’ locations at every time step. Decentralized methods
[78, 172], on the other hand, allow agents to plan independently while coordinating with other agents
to avoid conflicts (i.e., collisions), offering scalability benefits. However, decentralized methods typically
find lower-quality solutions than centralized methods. Thus, we focus on centralized methods in this
dissertation.
Research on MILPs is a dynamic and evolving field, with ongoing efforts to develop more efficient
algorithms. In the past decades, innovations in optimal MILP search algorithms, such as BnB [113] and
Branch-and-Cut [64], have significantly improved the efficiency of solving MILPs and become the cores
of many state-of-the-art solvers, such as SCIP [21], Gurobi [69] and CPLEX [37]. On the other hand,
popular unbounded-suboptimal search algorithms, such as greedy algorithms [27] and local search [114,
71, 177], provide solutions faster and are often good enough for practical purposes. Other unbounded-suboptimal search algorithms, such as genetic algorithms [136], simulated annealing [185] and particle
swarm optimization [105], have also been explored to solve MILP problems.
There are notable similarities between some MAPF and MILP search algorithms. For example, CBS in
MAPF and BnB in MILP are both tree search algorithms that repeatedly break problems down into smaller
subproblems. CBS selects conflicts to resolve, while BnB selects variables to branch on. These decisions
significantly impact the efficiency of the search algorithms [102, 22]. Another important decision shared by
both BnB and the bounded-suboptimal variants of CBS∗ is selecting which search tree nodes to expand.
Large Neighborhood Search (LNS) is another search algorithm applied to both MAPF and MILP. LNS begins
with a feasible solution and iteratively improves it by reoptimizing a part of the solution. For both MAPF
and MILP, selecting which part of the solution to reoptimize is a crucial decision in LNS [117, 177, 198,
179].
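The LNS loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions rather than the MAPF or MILP implementations discussed in this dissertation: the problem is a small knapsack instance with binary variables, the part of the solution to reoptimize is chosen uniformly at random, and the reoptimization step is brute-force enumeration standing in for a real solver. The names `lns`, `cost` and `is_feasible` are hypothetical.

```python
import itertools
import random


def lns(cost, is_feasible, x0, subset_size, iterations, seed=0):
    """Toy Large Neighborhood Search over binary vectors (minimization)."""
    rng = random.Random(seed)
    best = list(x0)          # LNS starts from a feasible solution
    best_cost = cost(best)
    n = len(best)
    for _ in range(iterations):
        # Decision point: which part of the solution to reoptimize.
        # Here it is random; a hand-crafted or learned rule would go here.
        subset = rng.sample(range(n), subset_size)
        # Reoptimize the chosen variables while all others stay fixed.
        for assignment in itertools.product((0, 1), repeat=subset_size):
            candidate = list(best)
            for index, value in zip(subset, assignment):
                candidate[index] = value
            if is_feasible(candidate) and cost(candidate) < best_cost:
                best, best_cost = candidate, cost(candidate)
    return best, best_cost


# Toy knapsack: maximize total value subject to a capacity limit,
# written as minimizing the negated value.
weights = [3, 4, 5, 6]
values = [4, 5, 6, 7]
capacity = 9


def cost(x):
    return -sum(v * xi for v, xi in zip(values, x))


def is_feasible(x):
    return sum(w * xi for w, xi in zip(weights, x)) <= capacity


best, best_cost = lns(cost, is_feasible, [0, 0, 0, 0], subset_size=2, iterations=50)
print(best, best_cost)
```

The only line that changes between selection strategies is the one that builds `subset`, which is why the quality of that decision largely determines how quickly the loop improves the solution.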
The idea of combining research efforts in MAPF and MILP search algorithms has been explored, particularly in formulating MAPF as a MILP. [209] models MAPF as a multi-commodity flow problem using
MILPs. This approach involves a time-expanded graph where vertices are indexed by location and time,
with binary variables for each pair of agents and edges in the graph. For small MAPF instances, these
MILPs can be solved using off-the-shelf MILP solvers. However, this formulation does not scale well to
large instances due to its inefficient representation of MAPF. The Branch-and-Cut-and-Price algorithm
[112] is a MILP-based MAPF search algorithm that addresses this issue by incrementally constructing the
necessary variables and constraints to resolve conflicts in agents’ paths.
1.2.1 Machine Learning (ML) for COPs
Despite the advances in search algorithms and their ability to tackle large-scale COPs, many state-of-the-art search algorithms still rely on handcrafted strategies to make decisions that are otherwise too expensive to compute or not well-defined mathematically. These handcrafted strategies inherently face several
limitations since they often rely on domain-specific knowledge and intuition, which, while valuable, can
be subjective and may not generalize well across different problem instances or problem sizes. Moreover,
manually tuning parameters and designing effective strategies for balancing exploration and exploitation
in the search space is a complex and time-consuming task, prone to human errors and bias. This can lead to
suboptimal performance, especially in new environments where generalizable decision-making is crucial.
∗BnB maintains both upper and lower bounds on the solution. Thus, the order in which search tree nodes are expanded does
not affect the proof of optimality. CBS and its variants maintain only the lower bounds. Thus, CBS has little flexibility in selecting
search tree nodes since it always selects the ones with the lowest bounds, whereas its bounded-suboptimal variants are not restricted in this way.
Figure 1.1: An ML method that applies to a search algorithm for a COP. [Diagram labels: An ML Method; A Search Algorithm; A COP.]
Figure 1.2: An ML method that applies to a search algorithm for multiple COPs. [Diagram labels: An ML Method; A Search Algorithm; COP 1; COP 2; COP 3.]
Given these limitations, ML presents a promising avenue for enhancing decision-making in search
algorithms for COPs [14]. ML can learn from vast amounts of data, identifying patterns and strategies
that may not be apparent to human designers. By leveraging techniques such as supervised learning and
imitation learning, it is possible to develop decision-making models that make decisions more efficiently
and effectively. These models can dynamically adjust parameters, balance exploration and exploitation,
and make decisions based on the state of the search that includes, for example, information on the solutions
found and statistics on the search process. Consequently, integrating ML into combinatorial optimization
has the potential to significantly improve the efficiency and effectiveness of solving complex real-world
problems.
ML has been applied to improve search algorithms for various COPs. Much of the previous research
has focused on developing ML methods tailored to specific problems, including routing problems such as
the traveling salesman problem (TSP) [12, 42] and the vehicle routing problem (VRP) [149, 107, 132, 127], as
well as resource allocation problems such as the bin packing problem [83, 47] and the scheduling problem
[108, 215]. These studies specialize in one COP and propose ML methods tailored to one search algorithm,
as illustrated by Figure 1.1. However, this narrow focus presents challenges in generalization: ML methods
developed for a particular search algorithm in one COP cannot easily be adapted to other search algorithms
for the same COP or the same search algorithm for different COPs. For example, applying ML methods
designed for local search in VRP [132, 127] to greedy algorithms for VRP [12, 107] is not straightforward,
as these methods use different features, network architectures and learning algorithms. Similarly, ML
methods for local search in VRP may not be easily transferable to local search for other COPs, as they are
often tailored to the specific features and structures of VRP.
There are studies that partially address this issue by developing an ML method for a specific search
algorithm that can be applied to multiple COPs, as illustrated by Figure 1.2. For example, one of the
earliest works on ML-guided COP solving [99] improves a greedy algorithm to solve multiple graph optimization
problems, including TSP, the maximum cut problem and the maximum independent set problem; [28]
learns to improve a search algorithm based on constraint programming and dynamic programming for
TSP, the portfolio optimization problem and the packing problem; [32] learns to perform reoptimization in
local search for expression simplification, the scheduling problem and VRP. However, these approaches are
limited by their focus on specific search algorithms, restricting their applicability to only those COPs that
the search algorithms can address. For example, the greedy algorithm in [99, 143] constructs solutions by
sequentially adding vertices, which works well for graph optimization problems where feasible solutions
are easy to obtain but may not generalize to other problems with more complex constraints.
MILP itself is a COP and serves as a general technique for solving various COPs, partially addressing
the limitations of earlier studies by modeling and solving a broader range of problems. Recent research
has integrated ML with MILP search algorithms to enhance decision-making strategies. These efforts
are motivated by the observation that MILP instances in certain applications often share structural similarities [6]. A growing body of literature has emerged to enhance MILP search algorithms, particularly
Branch-and-Bound [102, 59] and Large Neighborhood Search [177, 198, 179], by incorporating adaptable
ML components that leverage data and training.
Figure 1.3: An ML framework: An ML method that applies to multiple search algorithms for a COP. [Diagram labels: An ML Method; Search Algorithm 1; Search Algorithm 2; Search Algorithm 3; A COP.]
In the context of MAPF, there has been progress as well in integrating ML techniques to enhance
decentralized MAPF algorithms. For instance, reinforcement learning and imitation learning have been
applied to construct agents’ paths in warehouse environments with limited communication capabilities
[172, 38]. However, to the best of our knowledge, there has been little progress in advancing centralized
MAPF search algorithms with ML.
Several gaps remain in the current literature:
1. Lack of ML Methods for Centralized MAPF Search Algorithms. Despite the similarities in
decision-making between MAPF and MILP search algorithms, there has been little progress on applying ML to improve decision-making in centralized MAPF search algorithms. Leveraging insights
from ML-guided MILP algorithms is a promising research direction to address this gap.
2. Limited Exploration of ML Techniques. Most studies in ML for COPs have focused on imitation learning and reinforcement learning, leaving opportunities for other emerging ML techniques.
For example, contrastive learning, which has seen success in computer vision [77, 74, 30], natural
language processing [62, 164] and graph representation learning [205, 188], could be explored to
improve decision-making strategies in search algorithms for COPs.
3. Need for General ML Frameworks. Most crucially, a significant gap is the absence of general ML
frameworks that can be applied to multiple search algorithms for a specific COP, as illustrated in
Figure 1.3. While previous studies have developed ML methods for a specific search algorithm for
one or multiple COPs, there is no existing work that formulates a general ML framework capable of
improving decision-making strategies for multiple search algorithms for a single COP. This is important because (1) different types of search algorithms are needed since the desired trade-offs between
optimality requirements of the solutions and the computation budgets to solve a COP change when
solving different instances under different circumstances; (2) valuable common insights into a COP
and engineering techniques can be utilized to improve multiple search algorithms; and (3) a general
ML framework could enable the reuse of ML implementations, making tools more accessible to users,
including those without extensive ML expertise. Developing such a framework is challenging, as it
requires not just an ML method but a general ML framework that can be applied to diverse search
algorithms designed to function differently.
1.3 Contributions
In this dissertation, we fill the gaps in the current literature and make two major contributions. The first
major contribution validates the hypothesis for MAPF, addressing the first and third gaps. The second
major contribution validates the hypothesis for MILP, addressing the second and third gaps. Addressing
the second gap for MAPF is left for future work and discussed in Chapter 4.
• Contribution 1. To validate the hypothesis for MAPF, we first formulate a general imitation-learning framework to improve decision-making strategies for MAPF search algorithms. Building
on existing imitation learning methods for MILP solving [102, 73], we tailor our framework specifically for MAPF. We utilize domain knowledge from MAPF to design the data collection processes
and features, which are crucial for imitation learning. We then apply this framework to four different state-of-the-art MAPF search algorithms, namely (1) Conflict-Based Search (CBS), (2) Enhanced
Conflict-Based Search (ECBS), (3) Large Neighborhood Search (MAPF-LNS) and (4) Prioritized Planning (PP), as illustrated in Figure 1.4. CBS is an optimal tree search algorithm that repeatedly detects
conflicts between agents and resolves one of them by splitting the current problem into two subproblems. For CBS, we propose CBS+ML, which learns to select which conflict to resolve. ECBS is a
bounded-suboptimal variant of CBS that expands the search tree by repeatedly selecting search tree
nodes from a list of candidate nodes. For ECBS, we propose ECBS+ML, which learns to select which
search tree node to expand next. MAPF-LNS is an anytime MAPF search algorithm that iteratively
selects a subset of agents’ paths to reoptimize. For MAPF-LNS, we propose MAPF-ML-LNS, which
learns to select promising subsets. PP is a greedy search algorithm that plans agents’ paths sequentially in descending order of their preassigned priorities. For PP, we propose PP+ML, which learns to
assign priorities to agents. Finally, we empirically show that CBS+ML, ECBS+ML, MAPF-ML-LNS
and PP+ML significantly outperform their respective vanilla versions in terms of runtime and/or
solution quality. Further details are provided in Chapter 2.
Figure 1.4: Contribution 1: An ML framework that applies imitation learning to multiple MAPF search
algorithms. [Diagram labels: Imitation Learning; CBS; ECBS; MAPF-LNS; PP; MAPF.]
Figure 1.5: Contribution 2: An ML framework that applies contrastive learning to multiple MILP search
algorithms. [Diagram labels: Contrastive Learning; LNS; PaS; MILP.]
• Contribution 2. To validate the hypothesis for MILP, we first formulate a general contrastive-learning framework to improve decision-making strategies for MILP search algorithms. We then
apply this framework to two different state-of-the-art MILP search algorithms as illustrated in Figure 1.5, namely (1) Large Neighborhood Search (LNS) and (2) Predict-and-Search (PaS). LNS is an
anytime MILP search algorithm that iteratively selects a subset of variables to reoptimize. For LNS,
we propose CL-LNS, which learns to select promising subsets. PaS is a greedy search algorithm that
first greedily fixes values for a subset of variables and then solves a reduced-size MILP. For PaS, we
propose ConPaS, which learns to predict which values to fix for which subset of variables. Finally, we
empirically show that both CL-LNS and ConPaS significantly outperform their respective ML-guided
and non-ML-guided counterparts in terms of both runtime and solution quality. Further details are
provided in Chapter 3.
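As a rough illustration of the fix-then-solve idea behind PaS mentioned in Contribution 2, the sketch below fixes the binary variables with the most confident predictions and solves the remaining reduced problem exactly. It is a toy example under loud assumptions, not the method of this dissertation: the problem is a small knapsack instance, the predicted probabilities in `probs` are made-up numbers standing in for the output of a trained model, brute-force enumeration stands in for a MILP solver, and the parameters `k0` and `k1` (how many variables to fix to 0 and to 1, with `k0 + k1` at most the number of variables) are hypothetical names.

```python
import itertools


def predict_and_search(values, weights, capacity, probs, k0, k1):
    """Fix k0 variables to 0 and k1 variables to 1, then solve the rest exactly."""
    n = len(values)
    order = sorted(range(n), key=lambda i: probs[i])
    # Variables least likely to be 1 are fixed to 0, the most likely ones to 1.
    fixed = {i: 0 for i in order[:k0]}
    fixed.update({i: 1 for i in order[n - k1:]})
    free = [i for i in range(n) if i not in fixed]

    best_x, best_value = None, float("-inf")
    # Reduced problem: enumerate only the variables that were not fixed.
    for assignment in itertools.product((0, 1), repeat=len(free)):
        x = [0] * n
        for i, v in fixed.items():
            x[i] = v
        for i, v in zip(free, assignment):
            x[i] = v
        if sum(w * xi for w, xi in zip(weights, x)) <= capacity:
            value = sum(c * xi for c, xi in zip(values, x))
            if value > best_value:
                best_x, best_value = x, value
    return best_x, best_value


# Hypothetical predictions: variables 0 and 3 are likely 1, variable 2 likely 0.
solution, value = predict_and_search(
    values=[4, 5, 6, 7],
    weights=[3, 4, 5, 6],
    capacity=9,
    probs=[0.9, 0.2, 0.1, 0.8],
    k0=1,
    k1=1,
)
print(solution, value)
```

Fixing variables shrinks the search from 2^n to 2^(n - k0 - k1) assignments in this toy setting, but a wrong prediction can exclude every good solution or even make the reduced problem infeasible (in which case the function returns `None` for the solution), which is why deciding which variables to fix to which values matters.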
To summarize, we introduce two general ML frameworks aimed at improving human-designed decision-making strategies in different search algorithms for MAPF and MILP, respectively. They are the first ML
frameworks capable of improving multiple search algorithms for a single COP. Although MAPF cannot be
solved efficiently when using its MILP formulation, as discussed in Section 1.2, we leverage insights from
ML methods tailored for MILP search algorithms to formulate an ML framework that improves four different MAPF search algorithms. They are the first works that use ML techniques to enhance MAPF search
algorithms and the first ML framework that provides systematic guidance on improving these algorithms
with ML.
For MILP, previous research has applied imitation learning and reinforcement learning to improve
decision-making strategies in various MILP search algorithms. However, no prior work has demonstrated
how to systematically apply a general ML framework across multiple search algorithms, nor has any used
contrastive learning, an ML technique shown to be more effective than imitation learning and reinforcement
learning in other domains. We address these gaps by proposing the first contrastive learning-based ML
framework for MILP and applying it to improve two MILP search algorithms.
Finally, we suggest that the contrastive learning framework developed for MILP can also be generalized
to MAPF search algorithms. Moreover, we propose that both ML frameworks could be generalized to other
COPs within the context of multi-task learning. These possibilities, along with the realized impact of our
two major contributions, are discussed in Chapter 4.
Acronym   Meaning
MAPF      Multi-Agent Path Finding
MILP      Mixed Integer Linear Program
COP       Combinatorial Optimization Problem
ML        Machine Learning
CBS       Conflict-Based Search
ECBS      Enhanced Conflict-Based Search
LNS       Large Neighborhood Search
PP        Prioritized Planning
BnB       Branch-and-Bound
PaS       Predict-and-Search
TSP       Traveling Salesman Problem
VRP       Vehicle Routing Problem
CT        Constraint Tree
PPS       Parallel-Push-and-Swap
WDG       Weighted Dependency Graph
ICBS      Improved Conflict-Based Search
MDD       Multi-Valued Decision Diagram
CG        Conflict Graph
EECBS     Explicit Estimation Conflict-Based Search
SVM       Support Vector Machine
LB        Local Branching
ND        Neural Diving
CL        Contrastive Learning
LP        Linear Program
GAT       Graph Attention Network
PG        Primal Gap
PI        Primal Integral
DNN       Deep Neural Network
AUC       Area Under the Curve
GCN       Graph Convolutional Network
MLP       Multi-Layer Perceptron
MVC       Minimum Vertex Cover
MIS       Maximum Independent Set
CA        Combinatorial Auction
SC        Set Covering
IP        Item Placement
Table 1.1: Acronyms and their meanings.
Chapter 2
Improving Decision-Making in MAPF Search Algorithms
In this chapter, we present the first major contribution of this dissertation. Specifically, we formulate a
general imitation learning framework to improve decision-making strategies for MAPF search algorithms.
We identify important decisions that are typically made by human-designed strategies in four different
state-of-the-art MAPF search algorithms, namely Conflict-Based Search (CBS), Enhanced Conflict-Based
Search (ECBS), Large Neighborhood Search (MAPF-LNS) and Prioritized Planning (PP), and then apply
the framework to improve them. Empirically, the machine learning-guided versions of the MAPF search
algorithms substantially outperform their non-ML-guided counterparts in terms of runtime and/or solution
quality. These results validate the hypothesis that one can leverage a general ML framework to
improve human-designed decision-making strategies in different types of MAPF search algorithms.
The remainder of this chapter is structured as follows. In Section 2.1, we state the motivation behind
using machine learning (ML) for MAPF and provide an overview of our contributions. In Section 2.2, we
formally define MAPF. In Section 2.3, we introduce CBS, ECBS, MAPF-LNS and PP. In Section 2.4, we
summarize related work. In Section 2.5, we introduce the framework. In Sections 2.6-2.9, we introduce
CBS+ML, ECBS+ML, MAPF-ML-LNS and PP+ML, respectively, and evaluate them empirically. Finally, in
Section 2.10, we summarize the contributions of this chapter.
2.1 Introduction
Multi-Agent Path Finding (MAPF) is the problem of finding a set of conflict-free (that is, collision-free)
paths for a team of agents moving on a given underlying graph such that the sum of path costs is minimized.
MAPF has practical applications in distribution centers [138, 80], traffic management [45] and video games
[140]. For these applications, a MAPF instance can involve hundreds and sometimes thousands of agents.
MAPF is NP-hard to solve optimally [209, 9]. However, given its importance in various applications,
different MAPF search algorithms have been proposed. One of the leading categories of MAPF search
algorithms is optimal and bounded-suboptimal algorithms, which are guaranteed to return a solution whose
cost is optimal or at most a user-specified multiplicative factor w ≥ 1 times the optimal cost, respectively.
The state of the art in this category includes Conflict-Based Search (CBS) [174], Branch-and-Cut-and-Price
[112], Enhanced CBS (ECBS) [10] and Explicit Estimation CBS (EECBS) [124]. CBS is an optimal MAPF
search algorithm and the backbone of most of these algorithms. It uses a single-agent path-finding
algorithm to plan a path for each agent first and resolves conflicts afterward. The key idea behind CBS is to
use a bi-level search that resolves conflicts by adding constraints at the high level and replans paths for
agents respecting these constraints at the low level. The high level of CBS performs a best-first search on
a binary search tree called the constraint tree (CT). A CT node consists of a set of paths, one for each agent,
and a set of constraints on these paths. The cost of a CT node is the sum of costs of all agents’ paths. CBS
maintains an open list that sorts all CT nodes that have not been expanded in increasing order of their
costs. CBS always expands the first CT node in the open list. To expand a CT node, CBS chooses a conflict
between two agents’ paths to resolve and adds constraints that prevent this conflict in the child CT nodes.
ECBS is a bounded-suboptimal version of CBS. It uses focal searches [155] instead of best-first searches for
both the high-level and the low-level searches to guarantee bounded suboptimality. The high-level search
of ECBS maintains a focal list that contains the subset of CT nodes in the open list whose costs are at most
w times the lowest cost of any CT node in the open list and can select an arbitrary one in the focal list for
expansion.
The other category of MAPF search algorithms is unbounded-suboptimal algorithms, which can solve
very large MAPF instances but usually find low-quality solutions. These algorithms include greedy
algorithms, such as prioritized planning (PP) [49], rule-based algorithms, such as Parallel-Push-and-Swap (PPS)
[170] and Priority Inheritance with Backtracking [150], and anytime algorithms∗, such as Large
Neighborhood Search for MAPF (MAPF-LNS) [117]. MAPF-LNS is one of the state-of-the-art MAPF search
algorithms in this category. MAPF-LNS first finds a set of conflict-free paths quickly using an existing MAPF
search algorithm, such as EECBS, PP or PPS. It then improves the sum of costs of the conflict-free paths
to near-optimal over time by iteratively destroying subsets of the paths chosen by agent-set selection
strategies and replanning them using a repair operator while leaving the remaining paths unchanged. On
the other hand, PP is one of the fastest algorithms for solving MAPF suboptimally. It is based on a simple
planning scheme [49] that assigns each agent a unique priority and computes, in descending order of priority,
each agent’s individually cost-minimal path that avoids conflicts with both static obstacles and the paths
of the already-planned agents (which are treated as moving obstacles).
There is a lot of important decision-making in MAPF search algorithms that concerns, for example,
how to partition the search space into two or more parts, which part of the search space to explore next
and how to prune the search space. In the past, decision-making was often done by handcrafted strategies,
which require domain knowledge and a thorough understanding of the algorithms. In this chapter, we
apply a general ML framework to learn such strategies and demonstrate that the performance of MAPF
search algorithms, i.e., their runtime and/or solution quality, can be improved with ML-guided strategies.
In particular, we introduce CBS+ML, ECBS+ML, MAPF-ML-LNS and PP+ML to show that the framework
is applicable to state-of-the-art MAPF search algorithms of different types: the optimal algorithm CBS with
improved heuristics [120], the bounded-suboptimal algorithm ECBS [10] and the unbounded-suboptimal
algorithms MAPF-LNS [117] and PP [49]. To apply this ML framework to improve a MAPF search
algorithm, we first identify an important decision to make in the search. For example, there are strategies that
decide which conflict to resolve next in CBS, which node in the focal list to expand next in ECBS, which
subset of agents to replan next in MAPF-LNS and which priorities to assign to agents in PP. We then learn
to imitate effective decisions from an expert: for CBS, we propose an expert to select the next conflict to
resolve based on a weighted dependency graph (WDG) heuristic [120] that solves a weighted vertex cover
problem for each conflict on a graph that captures interaction among agents; for ECBS, the expert selects
the next node to expand by retrospectively computing the complete search tree; for MAPF-LNS, the expert
samples agent subsets and replans them to select the best one; for PP, the expert samples random priority
orderings and plans agents’ paths with each of them to select the best one. These experts are too slow to
be directly useful in the search but provide effective guidance for the search. By observing and recording
the features and decisions of the expert, we deploy imitation learning to learn strategies that predict
decisions as similar as possible to those of the expert without the expert's exhaustive computation. Empirically, we
show that variants of MAPF search algorithms with the learned strategies substantially outperform their
non-ML-guided counterparts in terms of runtime and/or solution quality. The results demonstrate how
a general ML framework can be applied to advance some state-of-the-art MAPF search algorithms and
possibly others.
∗An anytime algorithm can be stopped at any point after a feasible solution is found during its execution and still provides
a valid solution to the problem.
2.2 MultiAgent Path Finding
The Multi-Agent Path Finding (MAPF) problem is to find a set of conflict-free paths for a set of agents
$\{a_1, \ldots, a_k\}$ on a given 2D four-neighbor grid map with blocked cells that is represented as an undirected
unweighted graph $G = (V, E)$. Each agent $a_i$ has a start vertex $s_i \in V$ and a goal vertex $t_i \in V$. A
path $p_i = (p_{i,0}, \ldots, p_{i,l(p_i)})$ for agent $a_i$ is a sequence of vertices, where $p_{i,0} = s_i$, $p_{i,l(p_i)} = t_i$ and $l(p_i)$
is the length of the path. Time is discretized into time steps, and, at each time step $t$, every agent takes
an action: It either moves to an adjacent vertex, i.e., $(p_{i,t}, p_{i,t+1}) \in E$, or waits at its current vertex, i.e.,
$p_{i,t} = p_{i,t+1} \in V$. Two types of conflicts are considered: i) A vertex conflict $\langle a_i, a_j, v, t \rangle$ occurs when
agents $a_i$ and $a_j$ are at the same vertex $v$ at time step $t$; and ii) an edge conflict $\langle a_i, a_j, u, v, t \rangle$ occurs when
agents $a_i$ and $a_j$ traverse the same edge $(u, v)$ in opposite directions from time step $t$ to time step $t + 1$.
The cost of agent $a_i$ is defined as $l(p_i)$, which is the number of time steps until it reaches its goal vertex
$t_i$ and remains there. The delay of agent $a_i$ is defined as the difference between $l(p_i)$ and the distance
between its start and goal vertices. A solution is a set of conflict-free paths that move all agents from their
start vertices to their goal vertices. The sum of costs (and delays) of a solution is the sum of all agent costs
$\sum_{i=1}^{k} l(p_i)$ (and their delays, respectively). Our goal is to find a solution with the minimum sum of costs.
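The conflict definitions above can be illustrated with a minimal Python sketch. It assumes that a path is a list of vertices indexed by time step and that an agent rests at its goal after arriving; the function names are hypothetical and not taken from the dissertation's code.

```python
from itertools import combinations

def position(path, t):
    """Vertex occupied at time step t; the agent rests at its goal after arriving."""
    return path[min(t, len(path) - 1)]

def find_conflicts(paths):
    """Return every vertex and edge conflict between each pair of paths."""
    conflicts = []
    horizon = max(len(p) for p in paths)
    for (i, pi), (j, pj) in combinations(enumerate(paths), 2):
        for t in range(horizon):
            # Vertex conflict: both agents occupy the same vertex at time t.
            if position(pi, t) == position(pj, t):
                conflicts.append(("vertex", i, j, position(pi, t), t))
            # Edge conflict: the agents swap vertices between t and t + 1.
            u, v = position(pi, t), position(pi, t + 1)
            if u != v and position(pj, t) == v and position(pj, t + 1) == u:
                conflicts.append(("edge", i, j, u, v, t))
    return conflicts

def sum_of_costs(paths):
    """l(p) is the number of time steps, i.e. the number of vertices minus one."""
    return sum(len(p) - 1 for p in paths)
```

A set of paths is a solution in the sense above exactly when `find_conflicts` returns an empty list.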
2.3 Background
In this section, we provide a brief introduction to the MAPF search algorithms that we focus on in this
chapter: CBS, ECBS, MAPF-LNS and PP. At the end, we introduce the MAPF instances used in the empirical
evaluation.
2.3.1 Conflict-Based Search (CBS)
CBS is a bi-level tree search algorithm. It records the following information for each CT node $N$:
1. $N_{\text{Con}}$: The set of constraints imposed so far in the search. There are two types of constraints: i) a
vertex constraint $\langle a_i, v, t \rangle$, corresponding to a vertex conflict, prohibits agent $a_i$ from being at vertex
$v$ at time step $t$; and ii) an edge constraint $\langle a_i, u, v, t \rangle$, corresponding to an edge conflict, prohibits
agent $a_i$ from moving from vertex $u$ to vertex $v$ between time steps $t$ and $t + 1$.
2. $N_{\text{Sol}}$: A set of individually cost-minimal paths for all agents respecting the constraints in $N_{\text{Con}}$ that
are potentially not conflict-free. An individually cost-minimal path for an agent is a cost-minimal
path between its start and goal vertices under the assumption that it is the only agent in the graph.
3. $N_{\text{Cost}}$: The cost of $N$, calculated as the sum of costs of all agents in $N_{\text{Sol}}$.
4. $N_{\text{Conf}}$: The set of conflicts between any two paths in $N_{\text{Sol}}$.
On the high level, CBS starts with a CT with only one CT node whose set of constraints is empty and then
expands the CT in a best-first manner by always expanding a CT node with the lowest $N_{\text{Cost}}$. After
choosing a CT node $N$ to expand, CBS identifies the set of conflicts $N_{\text{Conf}}$ in $N_{\text{Sol}}$. If there are none, CBS terminates
and returns $N_{\text{Sol}}$. Otherwise, CBS randomly (by default) selects one of the conflicts to resolve and adds
two child CT nodes to $N$. Depending on the type of conflict, it imposes an edge or vertex constraint on
one of the two conflicting agents and adds this constraint to $N_{\text{Con}}$ of one child node; it does the same
for the other conflicting agent and $N_{\text{Con}}$ of the other child node. On the low level, it replans the paths in
$N_{\text{Sol}}$ to accommodate the newly-added constraints, if necessary. CBS guarantees optimality by performing
best-first searches on both of its high and low levels. CBS itself does not guarantee completeness, but [210]
has an algorithm to detect whether a solution exists for a MAPF instance which can be run to guarantee
completeness prior to running CBS. We do not implement such a component in our empirical evaluation.
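As an illustration of how a chosen conflict splits a CT node, the following Python sketch maps one conflict to the two constraints described above and builds the constraint sets of the two children. The `CTNode` fields and tuple encodings are assumptions made for this example; low-level replanning is omitted.

```python
from dataclasses import dataclass

@dataclass
class CTNode:
    constraints: frozenset = frozenset()  # N_Con
    paths: tuple = ()                     # N_Sol
    cost: int = 0                         # N_Cost
    conflicts: tuple = ()                 # N_Conf

def constraints_for(conflict):
    """Map one conflict to two constraints, one per conflicting agent."""
    if conflict[0] == "vertex":
        _, ai, aj, v, t = conflict
        return ("vertex", ai, v, t), ("vertex", aj, v, t)
    _, ai, aj, u, v, t = conflict
    # a_i moves u -> v while a_j moves v -> u between time steps t and t + 1.
    return ("edge", ai, u, v, t), ("edge", aj, v, u, t)

def child_constraint_sets(node, conflict):
    """Constraint sets of the two child CT nodes created by resolving the conflict."""
    return [node.constraints | {c} for c in constraints_for(conflict)]
```

Each child would then call the low-level search for the single agent named in its new constraint.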
2.3.2 Enhanced CBS (ECBS)
ECBS is a bounded-suboptimal version of CBS [10] and is summarized in Algorithm 1. Given a
suboptimality factor $w \geq 1$ (Line 1), ECBS is guaranteed to find a $w$-approximate solution. Both the high-level
and low-level searches of ECBS use focal searches [155] instead of best-first searches. Consider a CT node
$N$. On the low level, ECBS runs a focal search for each agent $a_i$ (Line 12) such that the cost of the path
found is at most $w N_{\text{LB},i}$, where $N_{\text{LB},i}$ is the lower bound on the cost of the individually cost-minimal
path for $a_i$ that respects the set of constraints of CT node $N$ and is computed by the focal search. Let
$N_{\text{LB}} = \sum_{i=1}^{k} N_{\text{LB},i}$. On the high level, ECBS performs a focal search (Lines 5-17) with a focal list $\mathcal{F}$ that
contains all CT nodes $N$ in $\mathcal{N}$ such that $N_{\text{Cost}} \leq w \cdot LB$ (Lines 4, 14 and 17), where $\mathcal{N}$ is the open list and
$LB = \min_{N \in \mathcal{N}} N_{\text{LB}}$ (Lines 4 and 16). Since $LB$ is a lower bound on the sum of costs of any solution, once
a solution is found by always expanding a CT node in the focal list, it is guaranteed to be a $w$-approximate
solution. Selecting CT nodes from the focal list for expansion (Line 6) in the high-level search of ECBS is
an important decision to make. The common practice is to select a CT node with the minimum d-value,
where the d-value is typically a handcrafted value that is computed for each CT node when it is generated.
The d-value of a CT node is an estimate of the effort required to find a solution in the CT subtree rooted
at that CT node.
Algorithm 1 ECBS
1: Input: A MAPF instance and suboptimality factor $w$
2: Generate the root CT node $R$ and calculate $R_{\text{Sol}}$, $R_{\text{Cost}}$ and $R_{\text{Conf}}$
3: Initialize open list $\mathcal{N} \leftarrow \{R\}$
4: $LB \leftarrow R_{\text{LB}}$, and initialize focal list $\mathcal{F} \leftarrow \{R\}$
5: while $\mathcal{N}$ is not empty do
6:     $N \leftarrow$ a CT node with the minimum d-value in $\mathcal{F}$
7:     if $N_{\text{Conf}} = \emptyset$ then
8:         return $N_{\text{Sol}}$
9:     Delete $N$ from the open and focal lists
10:    Pick a conflict in $N_{\text{Conf}}$
11:    Generate 2 child CT nodes $N^1$ and $N^2$ of $N$, and update $N^1_{\text{Con}}$ and $N^2_{\text{Con}}$
12:    Call low-level search for $N^i$ to calculate $N^i_{\text{Sol}}$, $N^i_{\text{Cost}}$ and $N^i_{\text{Conf}}$ for $i = 1, 2$
13:    Add $N^i$ to $\mathcal{N}$ if $N^i_{\text{Sol}}$ exists for $i = 1, 2$
14:    Add $N^i$ to $\mathcal{F}$ if $N^i_{\text{Sol}}$ exists and $N^i_{\text{Cost}} \leq w \cdot LB$ for $i = 1, 2$
15:    if $\min_{N \in \mathcal{N}} N_{\text{LB}} > LB$ then
16:        $LB \leftarrow \min_{N \in \mathcal{N}} N_{\text{LB}}$
17:        $\mathcal{F} \leftarrow \{N \in \mathcal{N} : N_{\text{Cost}} \leq w \cdot LB\}$
18: return No solution
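The high-level selection rule can be sketched in a few lines of Python. This is an illustrative reconstruction under assumptions: the `CTNode` fields are hypothetical, and the focal list is rebuilt from scratch at each call rather than maintained incrementally as in Algorithm 1.

```python
from dataclasses import dataclass

@dataclass
class CTNode:
    name: str
    cost: float  # N_Cost
    lb: float    # N_LB
    d: float     # d-value: estimated effort to find a solution below this node

def select_from_focal(open_list, w):
    """Return the focal node with the minimum d-value and the current LB."""
    lower_bound = min(node.lb for node in open_list)
    # Focal list: all open nodes whose cost is within a factor w of LB.
    focal = [node for node in open_list if node.cost <= w * lower_bound]
    return min(focal, key=lambda node: node.d), lower_bound
```

The sketch relies on the low-level guarantee that each node's cost is at most $w$ times its own lower bound, so the node that attains $LB$ is always in the focal list and the list is never empty.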
Algorithm 2 MAPF-LNS
1: Input: A MAPF instance $I$
2: $P = \{p_i : i \in [k]\} \leftarrow$ runInitialSolver($I$)
3: Initialize the weights $\omega$ of the agent-set selection strategies
4: while runtime limit not exceeded do
5:     $H \leftarrow$ selectDestroyHeuristic($\omega$)
6:     $B \leftarrow$ selectAgentSet($I$, $H$)
7:     $P^- \leftarrow \{p_i \in P : a_i \in B\}$
8:     $P^+ \leftarrow$ runReplanSolver($I$, $B$, $P \setminus P^-$)
9:     Update the weights $\omega$ of the agent-set selection strategies
10:    if $\sum_{p \in P^+} l(p) < \sum_{p \in P^-} l(p)$ then
11:        $P \leftarrow (P \setminus P^-) \cup P^+$
12: return $P$
2.3.3 MAPF-LNS
MAPF-LNS [117] is the state-of-the-art anytime MAPF search algorithm. It is able to solve large MAPF
instances that most existing MAPF search algorithms fail to either solve or provide high-quality solutions
for.
MAPF-LNS, shown in Algorithm 2, takes a MAPF instance as input and calls an efficient initial search
algorithm to compute a solution $P$ (Line 2). In each iteration, it selects an agent set $B$ using an agent-set
selection strategy $H$ (Lines 5-6), deletes the paths $P^-$ of the agents in $B$ from $P$ (Line 7) and calls a replan
search algorithm to replan new paths $P^+$ for them that conflict with neither each other nor the paths in
$P \setminus P^-$ (Line 8). If $P^+$ decreases the sum of costs (Line 10), then MAPF-LNS replaces $P^-$ with $P^+$ (Line
11). The initial search algorithm could be any off-the-shelf MAPF search algorithm, and the replan search
algorithm could be any off-the-shelf MAPF search algorithm that can handle moving obstacles.
MAPF-LNS uses two randomized agent-set selection strategies, namely an agent-based heuristic and a
map-based heuristic, to generate the agent sets $B$. The agent-based heuristic generates the agent set $B$ by
including the agent $a_i$ with the largest delay and other agents (found via a random walk procedure) whose
paths prevent it from achieving a lower agent cost. The map-based heuristic randomly chooses a vertex
with a degree greater than 2 in graph $G$ and generates the agent set $B$ by including some of the agents
whose paths visit the chosen vertex. Both the agent-based and the map-based heuristics impose a limit
on the cardinality of the agent set. MAPF-LNS uses Adaptive LNS [166], essentially an online learning
algorithm, to select one of the two agent-set selection strategies by maintaining a weight for each of them.
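A small Python sketch may help make the adaptive selection concrete. The text above only states that one weight is maintained per strategy, so both the roulette-wheel selection and the exponential-smoothing update with reaction factor `gamma` are assumptions in the style of Adaptive LNS, not a transcription of the MAPF-LNS implementation.

```python
import random

def select_strategy(weights, rng):
    """Roulette-wheel selection: index i is chosen with probability weights[i] / sum(weights)."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def update_weight(weights, i, improvement, gamma=0.1):
    """Assumed update rule: blend the old weight with the non-negative cost improvement."""
    weights[i] = gamma * max(0.0, improvement) + (1.0 - gamma) * weights[i]

def accept(new_cost, old_cost):
    """Acceptance test of Algorithm 2 (Line 10): keep the replanned paths only if they are cheaper."""
    return new_cost < old_cost
```

In one iteration, `improvement` would be the sum of costs of $P^-$ minus that of $P^+$, so strategies that recently produced larger improvements are selected more often.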
2.3.4 Prioritized Planning (PP)
Definition 2.3.1 (Priority ordering). A priority ordering $\prec$ is a strict partial order on $\{a_1, \ldots, a_k\}$: $a_i \prec a_j$
iff agent $a_i$ has higher priority than agent $a_j$ [137]. $\prec$ is a total priority ordering iff any two agents in
$\{a_1, \ldots, a_k\}$ are comparable (i.e., either $a_i \prec a_j$ or $a_j \prec a_i$ for all $a_i \neq a_j$) and a partial priority ordering
otherwise.
Prioritized Planning (PP) [49] is an unbounded-suboptimal MAPF search algorithm. In PP, we arrange
all agents into a predefined total priority ordering. Then, we plan paths for all agents one by one in
descending order according to the priority ordering. The path of each agent is the individually cost-minimal
path from its start vertex to its goal vertex that has no conflicts with the paths of all higher-priority agents.
So, instead of planning paths for all agents at once, PP decouples the planning process and plans for the
agents sequentially. PP does not guarantee completeness or optimality, but it is popular because of its
efficiency and simplicity. A key consideration in PP is how to determine the predefined total priority
ordering. It is typically determined either randomly or via manually designed heuristics. We introduce a few
of them in the following.
Query-Distance Heuristic [17] proposes the query-distance heuristic, which measures the start-goal
graph distance $\mathrm{dist}(s_i, t_i)$ of each agent $a_i$ and assigns higher priority to agents with longer distances.
The motivation behind this heuristic was to prioritize agents that need to travel longer distances and thus
minimize the makespan (i.e., the largest cost of all agents). An opposite version of the query-distance
heuristic, which assigns higher priority to agents with shorter start-goal graph distances, has been used
in [137].
Least-Option Heuristic Building on the idea behind the most-constrained-variable heuristic for solving
constraint satisfaction problems, [191] and [196] propose the least-option heuristic, which assigns higher
priority to agents with fewer path options, where the number of path options for an agent is defined as
the number of paths that do not have conflicts with the paths of already-planned agents within a given
number of time steps in [191] or the number of homology classes of paths in [196].
Start-and-Goal-Conflict Heuristic [24] proposes prioritization rules that consider the potential
conflicts at the start and goal vertices of the agents. Intuitively, if the individually cost-minimal path of agent
$a_i$ visits the start vertex of another agent $a_j$, then $a_j$ needs to be planned prior to $a_i$; if the individually
cost-minimal path of agent $a_i$ visits the goal vertex of another agent $a_j$, then $a_i$ needs to be planned prior
to $a_j$. This heuristic tends to reduce the runtime of PP [24] and increase its success rate [18].
Random Restarts When the priority ordering is assigned randomly, random restarts are often applied
to improve the performance of PP [16]. When PP with a particular priority ordering fails to find a solution
for a MAPF instance, we can “restart” it with a new randomized priority ordering.
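The following Python sketch illustrates PP with random restarts on a small four-neighbor grid. It is a simplified illustration under stated assumptions: the grid is a list of strings with `#` marking blocked cells, the single-agent planner is a space-time breadth-first search with a fixed time horizon, and all function names are hypothetical.

```python
import random
from collections import deque

def plan_path(grid, start, goal, reserved, horizon):
    """Space-time BFS for a shortest path that avoids the paths of higher-priority agents."""
    def at(path, t):
        return path[min(t, len(path) - 1)]  # agents rest at their goals

    def blocked(u, v, t):
        # Would moving u -> v between t and t + 1 conflict with a reserved path?
        return any(at(p, t + 1) == v or (at(p, t) == v and at(p, t + 1) == u)
                   for p in reserved)

    last_busy = max((len(p) - 1 for p in reserved), default=0)
    queue, seen = deque([(start, 0, (start,))]), {(start, 0)}
    while queue:
        v, t, path = queue.popleft()
        # The agent may rest at its goal only if no reserved path visits the goal later.
        if v == goal and all(at(p, s) != goal
                             for p in reserved for s in range(t, last_busy + 1)):
            return list(path)
        if t >= horizon:
            continue
        r, c = v
        for nxt in [(r, c), (r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]:
            nr, nc = nxt
            if not (0 <= nr < len(grid) and 0 <= nc < len(grid[0])) or grid[nr][nc] == "#":
                continue
            if (nxt, t + 1) in seen or blocked(v, nxt, t):
                continue
            seen.add((nxt, t + 1))
            queue.append((nxt, t + 1, path + (nxt,)))
    return None

def prioritized_planning(grid, agents, order, horizon=50):
    """Plan agents one by one in descending priority; return None if any agent fails."""
    paths = {}
    for i in order:
        start, goal = agents[i]
        path = plan_path(grid, start, goal, list(paths.values()), horizon)
        if path is None:
            return None
        paths[i] = path
    return [paths[i] for i in range(len(agents))]

def pp_with_random_restarts(grid, agents, restarts=10, seed=0):
    """Retry PP with a fresh random total priority ordering until it succeeds."""
    rng = random.Random(seed)
    order = list(range(len(agents)))
    for _ in range(restarts):
        rng.shuffle(order)
        solution = prioritized_planning(grid, agents, order)
        if solution is not None:
            return solution
    return None
```

A heuristic ordering, such as sorting agents by start-goal distance for the query-distance heuristic, could be passed as `order` in place of the random shuffle.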
2.3.5 MAPF Instances Used in the Empirical Evaluation
In our empirical evaluation, we use grid maps from the following sources: (1) grid maps from the MAPF
benchmark [181]; (2) grid map "lak503d" from the 2D pathfinding benchmarks [182]; and (3) random
maps and warehouse maps that we generate following [120], where random maps are grid maps with
randomly blocked cells and warehouse maps are grid maps with rectangular obstacles (clusters of blocked
cells).
We now describe the training instances, validation instances and test instances used in our empirical
evaluation. For each grid map from the MAPF benchmark [181], we use the 25 random scenarios provided. A
scenario is a list of randomly created pairs of start and goal vertices for a given grid map. Given a grid
map $M$ and a number of agents $k$, we generate 25 test instances $\mathcal{I}^{(M)}_{\text{Test}}$, one from each scenario, by using
the first $k$ pairs of start and goal vertices. In order to generate training instances and validation instances
that follow a similar distribution as the test instances, given a scenario with a map $M \in \mathcal{M}$ and a number
of agents $k$, we generate a training instance $I \in \mathcal{I}^{(M)}_{\text{Train}}$ by randomly selecting $k$ start vertices from all start
vertices in the scenario, randomly selecting $k$ goal vertices from all goal vertices in the scenario and then
randomly combining them into $k$ pairs of start and goal vertices. For grid maps that are not from the MAPF
benchmark, we generate the MAPF instances in the same way.
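The instance-generation procedure can be sketched as follows. This Python sketch assumes that a scenario is a list of (start, goal) pairs; the function names are hypothetical.

```python
import random

def make_test_instance(scenario, k):
    """Test instance: the first k (start, goal) pairs of the scenario."""
    return list(scenario[:k])

def make_training_instance(scenario, k, rng):
    """Training instance: k sampled starts and k sampled goals, randomly paired."""
    starts = rng.sample([start for start, _ in scenario], k)
    goals = rng.sample([goal for _, goal in scenario], k)
    # Both samples are already in random order, so zipping them pairs them at random.
    return list(zip(starts, goals))
```

Because starts and goals are drawn from the same scenario as the test instances, the training instances follow a similar distribution without repeating the test pairs exactly.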
The numbers of agents and the types of grid maps used in the MAPF instances and the runtime limits in
our empirical evaluation are mainly based on the scalability of the MAPF search algorithms. For example,
CBS is an optimal MAPF search algorithm, which is slow and does not scale compared to ECBS,
MAPF-LNS or PP. Thus, for CBS, we use MAPF instances on smaller grid maps with lower agent and/or obstacle densities
and use longer runtime limits. In contrast, for MAPF-LNS and PP, we use MAPF instances on larger maps
with higher agent and/or obstacle densities (that is, harder MAPF instances) and use shorter runtime limits.
These MAPF instance configurations are also impacted by the amount of computation resources we had
access to, which changed from time to time. When we had limited computation resources available, we
tended to use shorter runtime limits and easier MAPF instances.
All empirical evaluations in this chapter follow the setup described above unless stated otherwise.
2.4 Related Work
In this section, we summarize related work on MAPF search algorithms and on using ML for MAPF. Finally,
we discuss ML for other combinatorial optimization problems (COPs) that inspired our work.
2.4.1 MAPF Search Algorithms
For optimal MAPF search algorithms, there has been a huge effort to improve CBS. Selecting which conflict
to resolve next has been explored in Improved CBS (ICBS) [22]. ICBS categorizes conflicts into three types
to prioritize them: (i) A conflict is cardinal iff, when CBS uses the conflict to split CT node $N$, the costs of
both resulting child nodes are strictly larger than $N_{\text{Cost}}$; (ii) a conflict is semi-cardinal iff the cost of one of
the child nodes is strictly larger than $N_{\text{Cost}}$ and the cost of the other child node is the same as $N_{\text{Cost}}$; and (iii)
a conflict is non-cardinal otherwise. By first resolving cardinal conflicts, then semi-cardinal conflicts and
finally non-cardinal conflicts, CBS is able to improve its efficiency since it increases the lower bound on the
optimal cost more quickly by generating child nodes with larger costs. ICBS uses Multi-Valued Decision
Diagrams (MDDs) to classify conflicts. An MDD for agent $a_i$ is a directed acyclic graph consisting of all
cost-minimal paths from $s_i$ to $t_i$ that respect the current constraints $N_{\text{Con}}$. The vertices at level $t$ of the
MDD are exactly the vertices that agent $a_i$ could be at when following one of its cost-minimal paths. The
width of an MDD level is the number of vertices at that level and a singleton is an MDD level with width
one. A vertex conflict $\langle a_i, a_j, v, t \rangle$ (edge conflict $\langle a_i, a_j, u, v, t \rangle$) is cardinal iff vertex $v$ (edge $(u, v)$) is the
only vertex at depth $t$ (the only edge from depth $t$ to depth $t + 1$) in the MDDs of both agents. [122]
proposes to split the search space into two disjoint ones when expanding a CT node in CBS and prioritizes
conflicts based on the number of singletons in or the widths of level $t$ of the MDDs of both agents. To the
best of our knowledge, other than selecting conflicts using MDDs, conflict selection for CBS has not yet
been explored. In this chapter, we show how one can apply ML to learn an improved conflict-selection
strategy in CBSH2 [120] for CBS-based optimal MAPF search algorithms. CBSH2 is the state-of-the-art
version of CBS and uses the same conflict-selection strategy as ICBS.
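The conflict categories and the MDD-based test for cardinal vertex conflicts can be expressed compactly. In this Python sketch, an MDD is assumed to be a list whose entry at index $t$ is the set of vertices at level $t$; this representation and the function names are illustrative assumptions.

```python
def classify_by_child_costs(parent_cost, child_costs):
    """Label a conflict by how many of the two child CT nodes cost more than the parent."""
    increased = sum(cost > parent_cost for cost in child_costs)
    return ("non-cardinal", "semi-cardinal", "cardinal")[increased]

def is_cardinal_vertex_conflict(mdd_i, mdd_j, v, t):
    """Cardinal iff v is the only vertex at level t in the MDDs of both agents."""
    return mdd_i[t] == {v} and mdd_j[t] == {v}

def order_conflicts(labelled_conflicts):
    """Sort (conflict, label) pairs: cardinal first, then semi-cardinal, then non-cardinal."""
    rank = {"cardinal": 0, "semi-cardinal": 1, "non-cardinal": 2}
    return sorted(labelled_conflicts, key=lambda item: rank[item[1]])
```

The MDD test avoids generating the child nodes, which is why ICBS classifies conflicts with MDDs rather than by comparing child costs directly.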
Another line of research focuses on speeding up CBS by calculating a tighter lower bound on the
optimal cost to guide the high-level search. When expanding a CT node $N$, CBSH [54] uses the CG heuristic,
which builds a conflict graph (CG) whose vertices represent agents and whose edges represent cardinal
conflicts in $N_{\text{Sol}}$. Then, the lower bound on the optimal cost within the subtree rooted at $N$ is guaranteed
to increase at least by the size of the minimum vertex cover of this CG. We refer to this increment as the
h-value of the CT node. Based on CBSH, CBSH2 [120] uses the DG and WDG heuristics that
generalize CG and compute h-values for CT nodes using (weighted) pairwise dependency graphs that take into
account semi-cardinal and non-cardinal conflicts besides cardinal ones. CBSH2 with the WDG heuristic
is the current state-of-the-art CBS-based optimal MAPF search algorithm [120]. CBSH2 uses the same
conflict-selection strategy proposed in Improved CBS [22].
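To make the CG heuristic concrete, the following Python sketch computes the size of a minimum vertex cover of a conflict graph by brute force. The exhaustive enumeration is an assumption chosen for clarity; it is exponential in the number of agents and is not how CBSH computes the cover.

```python
from itertools import combinations

def min_vertex_cover_size(num_agents, edges):
    """Smallest number of agents that touches every cardinal-conflict edge."""
    for size in range(num_agents + 1):
        for cover in combinations(range(num_agents), size):
            chosen = set(cover)
            if all(u in chosen or v in chosen for u, v in edges):
                return size
    return num_agents

def cg_h_value(num_agents, cardinal_conflict_pairs):
    """h-value of a CT node under the CG heuristic: the minimum vertex cover size."""
    return min_vertex_cover_size(num_agents, cardinal_conflict_pairs)
```

For three agents that are pairwise in cardinal conflict, the cover needs two agents, so the lower bound on the optimal cost in that subtree increases by at least 2.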
There are optimal compilation-based MAPF search algorithms that reduce MAPF to other COPs.
Branch-and-cut-and-price [112, 111] is a bi-level MAPF search algorithm based on MILP solvers. MAPF has also
been encoded as satisfiability [183], constraint programming [57] and answer set programming problems
[63].
For bounded-suboptimal MAPF search algorithms, there are A*-based algorithms, such as Enhanced
Partial-Expansion A* [53], A* with operator decomposition [180] and M* [190], and CBS-based algorithms,
such as ECBS [10] and EECBS [124]. Variants of ECBS, such as ECBS with the highway heuristic [35]
and Improved ECBS [36], have been proposed to speed up ECBS by generating highways (paths for the
agents that include edges from user-provided sets of edges) in environments such as warehouses. However,
these approaches are not generalizable to environments with open areas or environments without straight
corridors. EECBS [124] is the current state-of-the-art bounded-suboptimal MAPF search algorithm. It
replaces focal search in ECBS with explicit estimation search [186], which selects the next CT node to
expand from three different lists based on certain rules. In contrast, ECBS selects nodes only from the
focal list, which is a simpler rule. For demonstration purposes, in this chapter, we use ECBS as an example
to show how one can apply ML to learn an improved node-selection strategy for CBS-based
bounded-suboptimal MAPF search algorithms.
For unbounded-suboptimal MAPF search algorithms, there are prioritized planning-based algorithms,
such as prioritized planning, and rule-based algorithms, such as Push-and-Swap [135], PPS [170] and
Priority Inheritance with Backtracking [150]. MAPF-LNS [117] and MAPF-LNS2 [119] are the state-of-the-art
unbounded-suboptimal MAPF search algorithms based on LNS. MAPF-LNS starts with finding an initial
solution fast and then iteratively improves it over time. MAPF-LNS2 starts with a set of paths with conflicts,
then iteratively reduces the number of conflicts to find a solution and finally uses MAPF-LNS to optimize
this solution. For demonstration purposes, in this chapter, we use MAPF-LNS to show how one can apply
ML to learn an improved agent-set selection strategy for LNS-based unbounded-suboptimal MAPF search
algorithms. Although the solution qualities of rule-based and prioritized planning-based algorithms are
often worse than those of other types of MAPF search algorithms, they run in polynomial time. Therefore,
they can compute a solution quickly and are quite popular. In this chapter, we also show how one can
apply ML to learn a priority-assignment strategy for PP.
2.4.2 ML for MAPF
Our work is one of the first to use ML to improve decision-making within MAPF search algorithms. ML has
been applied to decentralized MAPF, where agents coordinate in a decentralized fashion. [172] proposes a
framework that combines reinforcement learning and imitation learning to learn decentralized policies for
agents to avoid expensive centralized planning. An enhanced version [38] is later proposed that resolves
deadlocks in congested environments using symmetry-breaking techniques. [141, 128, 193] show that the
communication capabilities of agents help further resolve deadlocks and congestion. ML has also been
used to select the best MAPF search algorithms for solving MAPF optimally [98, 163] and suboptimally
[31].
2.4.3 ML for other COPs that Inspires Our Work
Using ML to improve combinatorial search has been studied extensively for other COPs. Mixed integer
linear programs (MILP) are powerful tools for modeling and solving a wide variety of COPs. There is a
huge body of studies that use ML to improve decision-making in Branch-and-Bound (BnB) search [194]
for MILPs, and our work has been inspired by some of them. BnB is a tree search, and, as part of BnB,
nodes in the search tree that contain unassigned variables must be expanded into two child nodes by
selecting one of the unassigned variables and splitting its domain by adding new constraints. There has
been a line of work on learning how to select variables to branch on for BnB [102, 59, 68, 211], where
the main goal is commonly to imitate the effective but expensive Strong Branching heuristic [7]. Conflict
selection in CBS is similar to variable selection in BnB, but previous methods are not directly applicable
since the Strong Branching heuristic does not apply to CBS and features used in those works are based
on variables and constraints that are specific to MILPs. We thus leverage insights from MAPF to craft
our methods. In addition, learning to select nodes to expand [73, 178, 110] in BnB and learning to select
variables to re-optimize in LNS [177] have been explored for solving MILPs. For node selection in BnB,
[73] uses imitation learning to learn node-selection and node-pruning strategies for solving MILPs. [178]
scales up this approach by progressively increasing the instance sizes in the form of curriculum learning.
For variable selection in LNS, both [177] and [179] use imitation learning, where [177] learns to imitate an
expert based on random sampling and [179] learns to imitate the Local Branching heuristic [56], which is
a more effective one. Inspired by these early works, we seek to improve similar decisions for ECBS and
MAPF-LNS and develop methods that work for our specific tasks in the context of MAPF. We show how we
achieve those tasks by leveraging domain expertise and designing engineering techniques that suit MAPF
search algorithms.
2.5 An Imitation Learning Framework for Learning Decision-Making Strategies
In this section, we propose a general ML framework based on imitation learning to learn decision-making
strategies for MAPF search algorithms. As will be shown in Sections 2.6-2.9, based on our framework, we
develop the first ML methods to improve centralized MAPF search algorithms. It is also the first ML
framework in the literature that has been successfully applied to improving multiple MAPF search algorithms.
It consists of the following steps:
1. Identify a Decision to Improve. Given a MAPF search algorithm, identify a decision that is crucial
to its performance. The goal is to learn a strategy to improve making this decision.
2. Find an Expert. Since our framework is based on imitation learning, this step identifies an expert
to provide high-quality demonstrations for decision-making so that we can imitate it via supervised
learning. Formally, we define a state as a snapshot of the search whenever a decision needs to be
made and the set of actions $A(s)$ at state $s$ as the set of possible decisions at $s$. An expert evaluates all
actions in $A(s)$ and selects the optimal or suboptimal action $a^* \in A(s)$ as the decision. Compared to a
trivial strategy (for example, a greedy strategy or a random choice strategy), an ideal expert makes
more effective decisions at a slightly higher but reasonable computational cost. The computational
cost should not be too low, otherwise one could potentially simply deploy the expert in the search
to get good solutions with low runtime. It should not be too high either since the expert will be
used to collect demonstrations as labels for supervised learning and we need a sufficient amount of
them to train an ML model. Since the expert evaluates multiple actions, it provides not only the best
decisions but also information on low-quality decisions.
3. Data Collection. We obtain a training dataset $D$, which is a set of states. At each state $s$ of the
search, we can compute features and labels for each available action $a \in A(s)$. By observing and
recording the features and the decisions given by the expert, we then learn to make predictions as
similar as possible to the expert's decisions without actually querying the expert. Features serve as signals to inform
predictions and should be fast to compute. The features depend on the choice of ML model. For
example, image-like features are more suitable for convolutional neural networks, and graph
representations are needed for graph neural networks. Lightweight models, such as linear regression
models, support vector machines (SVM) and decision trees, are sometimes more favorable for
repeated decision-making since they have much lower computational overhead. For each decision,
labels serve as the learning target and are derived from the quality of actions determined by the
expert. Given the expert, one could label the actions as good or bad actions based on a performance
metric, such as the solution quality or runtime they lead to, or a proxy for the metric.
4. Model Learning We reduce the ML problem to an imitation learning task, that is, to imitate the
expert. There are a few methods for imitation learning. One could learn to classify actions based on
their labels or predict the labels. In our work, we deploy a learningtorank method, where we learn
to rank the actions based on their rankings derived from the labels. The main benefit of learningtorank is that it learns to predict the actions of the expert from the differences among actions, which
has been shown effective [73, 102].
5. MLGuided Search Once we have a trained ML model, we plug it into the MAPF search algorithm
as a decisionmaking strategy.
In the rest of this section, we formulate the imitation-learning task in Step 4 as a learning-to-rank task.
For a state $s$ and an action $a \in A(s)$, let $\phi_s(a) \in \mathbb{R}^p$ be the feature vector of action $a$ and $y_s(a)$ be
the ground-truth label for $a$. The goal is to learn a ranking function $\pi : \mathbb{R}^p \to \mathbb{R}$ that serves as a scoring
function for each action, where better actions receive higher scores. $\pi$ takes as input the $p$-dimensional
features $\phi_s(a)$ of action $a$ at state $s$ and outputs the predicted score $\hat{y}_s(a) = \pi(\phi_s(a))$. The action
$a^* = \arg\max_{a \in A(s)} \hat{y}_s(a)$ is then predicted as the best action. $\pi$ could be learned by regression on the labels or by classifying the
labels. However, such methods learn to score each action independently. In contrast to a regression or
classification task, we adopt a formulation of learning-to-rank that predicts the score based on the relative
differences between pairs of actions instead of based on the action itself. The labels must be numerical
values that indicate the qualities of the actions, such as runtimes or solution qualities. The loss function
for training is based on a strict partial order on $A(s)$ derived from the labels. Formally, given a set of states
as the training data $D$, we minimize the following loss function
\[
L(w) = \sum_{s \in D} l(y_s, \hat{y}_s) + \frac{C}{2} \|w\|_2^2,
\]
where $w$ are the learnable weights of $\pi$, $C$ is a regularization parameter and $l(y_s, \hat{y}_s)$ is a loss function
based on a pairwise loss between the ground-truth labels $y_s$ and the predicted scores $\hat{y}_s$. To compute
$l(y_s, \hat{y}_s)$, we consider the set of ordered pairs of actions at state $s$ where the label of the first action is greater than
the label of the second:
\[
P_s = \{(a', a'') : y_s(a') > y_s(a'') \wedge a', a'' \in A(s)\}.
\]
$l(y_s, \hat{y}_s)$ is the weighted fraction of swapped pairs in the predictions, defined as
\[
l(y_s, \hat{y}_s) = \frac{\sum_{(a', a'') \in P_s : \hat{y}_s(a') \le \hat{y}_s(a'')} \tilde{w}_{a', a''}}{\sum_{(a', a'') \in P_s} \tilde{w}_{a', a''}},
\]
where $\tilde{w}_{a', a''}$ is a weight associated with each pair of actions in $P_s$. One could simply set $\tilde{w}_{a', a''}$ to a
constant for an unweighted version of $l(y_s, \hat{y}_s)$.
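As an illustration of the loss just defined, the following minimal Python sketch computes the weighted fraction of swapped pairs for a single state. The function name, the `pair_weight` callback and the convention of returning 0 when $P_s$ is empty are assumptions made for this example and are not taken from the thesis implementation.

```python
def weighted_swapped_fraction(labels, scores, pair_weight=None):
    """Weighted fraction of swapped pairs for one state s.

    labels[i] is the ground-truth label y_s(a_i) and scores[i] is the
    predicted score for the same action. pair_weight(i, j) is a
    hypothetical callback returning the weight of the ordered pair
    (a_i, a_j); if omitted, every pair has weight 1 (unweighted loss).
    """
    if pair_weight is None:
        pair_weight = lambda i, j: 1.0
    total = 0.0
    swapped = 0.0
    n = len(labels)
    for i in range(n):
        for j in range(n):
            if labels[i] > labels[j]:          # (a_i, a_j) is in P_s
                w = pair_weight(i, j)
                total += w
                if scores[i] <= scores[j]:     # prediction swaps the pair
                    swapped += w
    # Assumption: a state without any ordered pair contributes zero loss.
    return swapped / total if total > 0 else 0.0


if __name__ == "__main__":
    print(weighted_swapped_fraction([1, 0, 0], [0.5, 0.7, 0.1]))
```

Note that a tie in the predicted scores counts as a swap, matching the non-strict inequality in the definition.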
To learn $\pi$, one could train a deep neural network (DNN), like most of the existing works in the literature, such as RankNet [25] and LambdaRank [162]. However, DNNs introduce an undesirably large
computational overhead if used for repeated decision-making in the MAPF search algorithms. We thus
learn a linear ranking function
\[
\pi : \mathbb{R}^p \to \mathbb{R}, \quad \pi(\phi_s(a)) = w^\top \phi_s(a)
\]
using SVMrank [94] instead, which has been shown to be efficient and effective in previous work on learning
to rank [102, 73] in the context of search algorithms for other COPs. In practice, one can train the linear
ranking function with open-source solvers such as those developed for SVMrank [95] and LIBLINEAR [52].
These solvers minimize an upper bound on the loss, since the loss itself is NP-hard to minimize. In our
implementation, we use SVMrank [94] for the unweighted version of $l(y_s, \hat{y}_s)$ and use LIBLINEAR [52]
otherwise, since SVMrank is simpler to use but does not allow customizing the weights.
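To make the idea of minimizing an upper bound on the pairwise loss concrete, the sketch below fits a linear ranking function by plain subgradient descent on a pairwise hinge surrogate plus the $\frac{C}{2}\|w\|_2^2$ term. It does not call SVMrank or LIBLINEAR and is not the solver used in the thesis; the function name, the per-state averaging, the step size and the number of epochs are all assumptions chosen for illustration.

```python
def train_linear_ranker(states, C=0.01, epochs=200, lr=0.1):
    """Fit w for the linear ranking function pi(phi) = w . phi.

    states is a list of (features, labels) pairs, one per state, where
    features[i] is the p-dimensional feature vector of action i and
    labels[i] is its numerical label. The hinge max(0, 1 - w.(phi_i - phi_j))
    upper-bounds the 0/1 "swapped pair" indicator for every pair with
    labels[i] > labels[j]. Hypothetical helper, for illustration only.
    """
    p = len(states[0][0][0])
    w = [0.0] * p
    for _ in range(epochs):
        grad = [C * wk for wk in w]            # gradient of (C/2)*||w||^2
        for feats, labels in states:
            n = len(labels)
            pairs = [(i, j) for i in range(n) for j in range(n)
                     if labels[i] > labels[j]]
            for i, j in pairs:
                diff = [a - b for a, b in zip(feats[i], feats[j])]
                margin = sum(wk * dk for wk, dk in zip(w, diff))
                if margin < 1.0:               # hinge is active for this pair
                    for k in range(p):
                        grad[k] -= diff[k] / len(pairs)
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w


def score(w, phi):
    """Predicted score of an action with feature vector phi."""
    return sum(wk * xk for wk, xk in zip(w, phi))


if __name__ == "__main__":
    toy = [([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [1, 0, 0])]
    w = train_linear_ranker(toy)
    print([round(score(w, phi), 2) for phi in toy[0][0]])
```

A dedicated solver would be preferable in practice; the point here is only that the learned object is a single weight vector, so inference costs one dot product per action.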
2.6 Learning to Select Conflicts for CBS
In this section, we introduce CBS+ML to show how the framework can be applied to improve conflict
selection in CBS [174]. CBS is one of the leading algorithms for solving MAPF optimally, and a number
of enhancements to CBS have been developed [22, 120, 54, 10]. Picking good conflicts is important, and a
good strategy for conflict selection could have a big impact on the efficiency of CBS by reducing both the
size of the CT and its runtime. We refer the readers to the example in Figure 1 in [22] that demonstrates
why the size of the CT can be impacted by conflict selections. To the best of our knowledge, other than
prioritizing conflicts using MDDs in ICBS [22], conflict prioritization has not yet been explored much. In
the rest of this section, we apply the framework introduced in Section 2.5 and propose CBS+ML tailored
to conflict selection in CBS. We then empirically demonstrate the effectiveness and efficiency of CBS+ML.
2.6.1 Machine Learning Methodology
Our goal is to learn a conflict-selection strategy that reduces the runtime of CBS. The conflict-selection
strategy is applied when expanding a CT node $N$. Thus, we represent the state of the search with the
CT node, $s = N$. The state contains information about not only CT node $N$ but also its ancestors and
siblings that are generated before expanding $N$. CBS can select any conflict from $N_{\text{Conf}}$ to resolve. Thus,
the available actions at CT node $N$ are $A(N) = N_{\text{Conf}}$. To apply the framework, we first propose an
expert for conflict selection that results in smaller CT sizes than the ones used in previous work. However,
the expert is much more computationally expensive since it has to compute the WDG heuristic for each
conflict, as will be explained in Section 2.6.1.1. Next, given the expert, we explain how we use ML to
imitate its decisions.
2.6.1.1 Experts for Conflict Selection
Given a MAPF instance, an expert for conflict selection at a particular CT node $N$ is a ranking function
that takes the set of conflicts $N_{\text{Conf}}$ as input, calculates a real-valued score for each conflict and outputs
the ranks determined by the scores. We say that CBS follows an expert for conflict selection iff CBS builds
the CT by always resolving the conflict with the highest rank. We define expert O0 as the one proposed
in ICBS [22], which uses MDDs to rank conflicts.
Definition 2.6.1. Given a CT node $N$, expert O0 ranks the conflicts in $N_{\text{Conf}}$ in the order of cardinal conflicts,
semi-cardinal conflicts and non-cardinal conflicts, breaking ties in favor of conflicts at the smallest time step
and remaining ties randomly.
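The ordering in Definition 2.6.1 can be sketched as a lexicographic sort. The `Conflict` record and the function name below are hypothetical and assume that the conflict type has already been determined (for example, from the MDDs); the classification itself is not shown.

```python
import random
from dataclasses import dataclass


@dataclass
class Conflict:
    kind: str       # "cardinal", "semi-cardinal" or "non-cardinal" (assumed precomputed)
    timestep: int   # time step t of the conflict


# Smaller value means higher rank.
_TYPE_PRIORITY = {"cardinal": 0, "semi-cardinal": 1, "non-cardinal": 2}


def rank_conflicts_o0(conflicts, rng=None):
    """Return the conflicts ordered from highest to lowest rank.

    Sort by conflict type first, then by smallest time step, and break
    the remaining ties with a random key drawn once per conflict.
    """
    rng = rng or random.Random()
    return sorted(
        conflicts,
        key=lambda c: (_TYPE_PRIORITY[c.kind], c.timestep, rng.random()),
    )


if __name__ == "__main__":
    conflicts = [
        Conflict("non-cardinal", 1),
        Conflict("cardinal", 5),
        Conflict("semi-cardinal", 0),
        Conflict("cardinal", 2),
    ]
    for c in rank_conflicts_o0(conflicts, random.Random(0)):
        print(c)
```

CBS following this expert would resolve the first conflict of the returned list.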
The Random Map
            Runtime    CT Size        Expert Time    Search Time
CBSH2+O0    9.95s      2,362 nodes    0.00s          9.95s
CBSH2+O1    24.89s     746 nodes      21.34s         3.55s
CBSH2+O2    12.13s     632 nodes      9.52s          2.61s
CBS+ML      6.19s      998 nodes      0.88s          5.31s

The Game Map
            Runtime    CT Size        Expert Time    Search Time
CBSH2+O0    2.3min     952 nodes      0.0min         2.3min
CBSH2+O1    19.8min    565 nodes      19.0min        0.8min
CBSH2+O2    27.4min    2,252 nodes    23.4min        4.0min
CBS+ML      1.6min     754 nodes      0.2min         1.4min

Table 2.1: Performance of CBSH2 with different experts and our method CBS+ML. Expert time is the
runtime of the expert. Search time is the runtime minus the expert time. All entries are averaged over the
MAPF instances that are solved by all methods.

Next, we define experts O1 and O2, which both calculate 1-step lookahead scores by using, for each
conflict, the two child nodes of $N$ that would result if the conflict were resolved at $N$.
Definition 2.6.2. Given a CT node $N$, expert O1 computes the score $v_c = \min\{g_c^l + h_c^l, g_c^r + h_c^r\}$ for each
conflict $c \in N_{\text{Conf}}$, where $g_c^l$ and $g_c^r$ would be the costs of the two child nodes of $N$ and $h_c^l$ and $h_c^r$ would be
the h-values given by the WDG heuristic for the two child nodes of $N$ if conflict $c$ were resolved at $N$. Then, it
outputs the ranks determined by the decreasing order of the scores (i.e., the highest rank for the highest score).
Expert O1 selects the conflict that results in the tightest lower bound on the optimal cost estimated
by the WDG heuristic in the child nodes. Inspired by CBSH2, we use the WDG heuristic in expert O1 to
compute the h-values since it is the state of the art. The intuition behind using this expert is that the sum of
the cost and the h-value of a node is a lower bound on the cost of any solution found in the subtree rooted
in the node, and, sometimes, the lower bound maintained by CBS does not increase for most of the conflicts
that could be selected. Thus, we want CBS to increase the lower bound as much as possible to find a solution quickly
by selecting the right conflict to resolve next.
Definition 2.6.3. Given a CT node $N$, expert O2 computes the score $v_c = \min\{m_c^l, m_c^r\}$ for each conflict
$c \in N_{\text{Conf}}$, where $m_c^l$ and $m_c^r$ would be the numbers of remaining conflicts in the two child nodes of $N$ if
conflict $c$ were resolved at $N$. Then, it outputs the ranks determined by the increasing order of the scores (i.e.,
the highest rank for the lowest score).
Expert O2 selects the conflict that results in the least number of conflicts in the child nodes.
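The two lookahead scores can be sketched as follows, assuming that the quantities of the two hypothetical child nodes (costs, WDG h-values and numbers of remaining conflicts) have already been computed for each conflict. Generating the child nodes and evaluating the WDG heuristic, which dominate the experts' runtime, are outside this sketch, and all names are hypothetical.

```python
def o1_score(left, right):
    """Score v_c of expert O1: min over both children of cost + h-value.

    left and right are (g, h) tuples of the two child nodes that would
    result from resolving conflict c.
    """
    return min(left[0] + left[1], right[0] + right[1])


def o2_score(left_conflicts, right_conflicts):
    """Score v_c of expert O2: min number of remaining conflicts."""
    return min(left_conflicts, right_conflicts)


def select_o1(children):
    """children maps each conflict to ((g_l, h_l), (g_r, h_r)).

    O1 gives the highest rank to the highest score.
    """
    return max(children, key=lambda c: o1_score(*children[c]))


def select_o2(remaining):
    """remaining maps each conflict to (m_l, m_r).

    O2 gives the highest rank to the lowest score.
    """
    return min(remaining, key=lambda c: o2_score(*remaining[c]))


if __name__ == "__main__":
    children = {"c1": ((10, 2), (11, 0)), "c2": ((11, 2), (12, 1)), "c3": ((10, 0), (10, 3))}
    remaining = {"c1": (4, 6), "c2": (3, 9), "c3": (5, 5)}
    print(select_o1(children), select_o2(remaining))
```

Ties are broken here by dictionary order, which is an arbitrary choice of this sketch.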
We use CBSH2 with the WDG heuristic as our search algorithm and run it with experts O0, O1 and
O2 on (1) the random map, which is a 20 × 20 four-neighbor grid map with 25% randomly generated
blocked cells [120], and (2) the game map “lak503d” [182], which is a 192 × 192 four-neighbor grid map
with 51% blocked cells from the video game Dragon Age: Origins. The maps are shown in Table 2.4. The
experiments are conducted on 2.4 GHz Intel Core i7 CPUs with 16 GB RAM. We set the runtime limit to
20 minutes for the random map and 1 hour for the game map. We set the number of agents to $k = 18$ for
the random map and $k = 100$ for the game map and run each variant of CBSH2 on 50 MAPF instances for
each grid map. The MAPF instances are generated in the same way as the training instances in Section 2.3.5.
In Table 2.1, we present the performance of the three experts as well as our method CBS+ML. All entries
are averaged over the MAPF instances that are solved by all methods. We evaluate the experts according
to the resulting CT sizes since they determine the runtime when the calculation of the experts is not taken
into account (and everything else being equal). We first look at the CT sizes of CBSH2 with each of the
three experts. Expert O2 is best for the random map, followed closely by expert O1. Expert O1 is best
for the game map. Overall, expert O1 is best. Therefore, in the rest of the chapter, we mainly focus on
learning a ranking function that imitates expert O1. Table 2.1 shows that, by learning to imitate expert
O1, our method CBS+ML† achieves the best runtime, even though it induces a larger CT than CBSH2+O1.
Next, we introduce our ML methodology.
†This is the CBS+MLS variant that will be introduced in Section 2.6.2.
2.6.1.2 Data Collection
The next step in our framework is to construct a training dataset from which we can learn a model that
imitates the expert’s output. First, we fix the graph underlying the MAPF instances that we want to solve
and the number of agents. The number of agents is only fixed during the data collection and model learning steps. Later in Section 2.6.2, we show that the models can generalize to MAPF instances with larger
numbers of agents during testing. We obtain a set of MAPF instances $\mathcal{I}_{\text{Train}}$ for training. A dataset $D_I$
is obtained for each MAPF instance $I \in \mathcal{I}_{\text{Train}}$, and the final training dataset is obtained by taking the union
of these datasets, $D = \bigcup_{I \in \mathcal{I}_{\text{Train}}} D_I$. To obtain dataset $D_I$, we run CBSH2 on $I$ and run expert O1 at
each CT node $N$ to produce the ranking for $N_{\text{Conf}}$. $D_I$ consists of the CT nodes $N$ that are
expanded during the search. For each node $N \in D_I$, the conflicts in $N_{\text{Conf}}$ are the available
actions $A(N)$. For each $c \in N_{\text{Conf}}$, we compute a binary label $y_N(c) \in \{0, 1\}$ derived from the expert’s
ranking of the conflicts and a $p$-dimensional feature vector $\phi_N(c)$ that describes the characteristics of the
conflict $c$ at CT node $N$. We also collect a validation dataset on another set of MAPF instances $\mathcal{I}_{\text{Valid}}$
to evaluate the prediction accuracy of the learned model.
Features. We collect a $p$-dimensional feature vector $\phi_N(c)$ that describes a conflict $c \in N_{\text{Conf}}$ in CT node
$N$. The $p = 67$ features of a conflict $\langle a_i, a_j, v, t \rangle$ ($\langle a_i, a_j, u, v, t \rangle$) in our implementation are summarized
in Table 2.2. They consist of (1) the properties of the conflict, (2) statistics of CT node $N$, the conflicting
agents $a_i$ and $a_j$ and the contested vertex or edge with respect to $N_{\text{Sol}}$, (3) the number of conflicts that
have been resolved for a vertex‡ or an agent, and (4) features of the MDD and the WDG. We perform a
linear transformation to normalize the value of each feature to the range of $[0, 1]$ across all conflicts in
$N_{\text{Conf}}$, where the minimum value of that feature gets transformed into a 0 and the maximum value gets
transformed into a 1. All features of a given conflict $c \in N_{\text{Conf}}$ can be computed in $O(|N_{\text{Conf}}| + k)$ time.
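The per-node normalization can be sketched as a column-wise min-max rescaling over the conflicts of one CT node. The function name is hypothetical, and mapping a feature that takes the same value for all conflicts to 0 is an assumption of this sketch, since the text does not specify that case.

```python
def normalize_features(rows):
    """Min-max normalize each feature across the conflicts of one CT node.

    rows[i] is the raw feature vector of the i-th conflict in N_Conf.
    For every feature, the minimum over the conflicts maps to 0 and the
    maximum maps to 1. Assumption: a constant feature maps to 0.
    """
    if not rows:
        return []
    p = len(rows[0])
    lo = [min(r[k] for r in rows) for k in range(p)]
    hi = [max(r[k] for r in rows) for k in range(p)]
    return [
        [(r[k] - lo[k]) / (hi[k] - lo[k]) if hi[k] > lo[k] else 0.0
         for k in range(p)]
        for r in rows
    ]


if __name__ == "__main__":
    print(normalize_features([[2, 5], [4, 5], [3, 5]]))
```

Because the rescaling is done per node, the same raw value can map to different normalized values at different CT nodes.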
‡An edge conflict is considered to be resolved for both vertices on the edge.

Feature Descriptions                                                                                   Count
Types of the conflict: binary indicators for edge conflicts, vertex conflicts, cardinal conflicts, semi-cardinal conflicts and non-cardinal conflicts.    5
Number of conflicts involving agent $a_i$ ($a_j$) that have been selected and resolved so far during the search: their minimum, maximum and sum.    3
Number of conflicts that have been selected and resolved so far during the search at vertex $u$ ($v$): their minimum, maximum and sum.    3
Number of conflicts that agent $a_i$ ($a_j$) is involved in: their minimum, maximum and sum.    3
Time step $t$ of the conflict.    1
Ratio of $t$ and the makespan of $N_{\text{Sol}}$.    1
Cost of the path of agent $a_i$ ($a_j$) in $N_{\text{Sol}}$: their minimum, maximum, sum, absolute difference and ratio of their maximum and minimum.    5
Delay of agent $a_i$ ($a_j$): their minimum and maximum.    2
Ratio of the costs of the path of agent $a_i$ ($a_j$) and its individually cost-minimal path: their minimum and maximum.    2
Difference of the cost of the path of agent $a_i$ ($a_j$) and $t$: their minimum and maximum.    2
Ratio of the cost of the path of agent $a_i$ ($a_j$) and $t$: their minimum and maximum.    2
Ratio of the cost of the path of agent $a_i$ ($a_j$) and $N_{\text{Cost}}$: their minimum and maximum.    2
Binary indicator whether none (at least one) of agents $a_i$ and $a_j$ has reached its goal vertex by time step $t$.    2
Number of conflicts $c' \in N_{\text{Conf}}$ such that $\min\{d_{q,q'} : q \in V^T_c, q' \in V^T_{c'}\} = w$ ($0 \le w \le 5$).    6
Number of agents $a$ such that there exist $q' \in V_a$ and $q \in V^T_c$ such that $d_{q,q'} = w$ ($0 \le w \le 5$).    6
Number of conflicts $c' \in N_{\text{Conf}}$ such that $\min\{d_{q,q'} : q \in V_c, q' \in V_{c'}\} = w$ ($0 \le w \le 5$).    6
Width of level $w$ ($|w - t| \le 2$) of the MDD for agent $a_i$ ($a_j$) (we use zero as the width of a level that does not exist): their minimum and maximum [122].    10
Weight of the edge between agents $a_i$ and $a_j$ in the weighted dependency graph [120].    1
Number of vertices $q'$ in graph $G$ such that $\min\{d_{q',q} : q \in V_c\} = w$ ($1 \le w \le 5$).    5

Table 2.2: Features of a conflict $c = \langle a_i, a_j, u, t \rangle$ ($\langle a_i, a_j, u, v, t \rangle$) of a CT node $N$. Given the underlying
graph $G = (V, E)$, let $V^T = \{(v, t) : v \in V, t \in \mathbb{Z}_{\ge 0}\}$, $E^T = \{((u, t), (v, t + 1)) : t \in \mathbb{Z}_{\ge 0} \wedge (u = v \vee (u, v) \in E)\}$,
and define the time-expanded graph as an unweighted graph $G^T = (V^T, E^T)$. Let $d_{u,v}$ be the
cost of the cost-minimal path between vertices $u$ and $v$ in $G$ and $d_{(u',t'),(u,t)}$ be the distance from $(u', t')$ to
$(u, t)$ in $G^T$ if $t' \le t$ or from $(u, t)$ to $(u', t')$ otherwise. For a conflict $c' = \langle a'_i, a'_j, u', t' \rangle$ ($\langle a'_i, a'_j, u', v', t' \rangle$)
in $N_{\text{Conf}}$, define $V_{c'} = \{u'\}$ ($V_{c'} = \{u', v'\}$) and $V^T_{c'} = \{(u', t')\}$ ($V^T_{c'} = \{(u', t'), (v', t')\}$). For an agent $a$,
define $V_a = \{(u, t) : \text{agent } a \text{ is at vertex } u \text{ at time step } t \text{ following its path}\}$. The counts are the numbers
of features contributed by the corresponding entries, which add up to $p = 67$.

Labels. We label each conflict in $N_{\text{Conf}}$ such that conflicts with higher ranks determined by the expert
have larger labels. Instead of using the full ranking provided by expert O1, we use a binary labeling scheme
similar to the one proposed by [102]. We assign label 1 to a conflict if no more than 20% of the conflicts
in $N_{\text{Conf}}$ have the same or a higher score; otherwise, we assign label 0 to it. When more than 20% of
the conflicts have the same highest O1 score, we assign label 1 to those conflicts and label 0 to the rest.
By doing so, we ensure that at least one conflict is labeled 1 and conflicts with the same score have the
same label. This labeling scheme relaxes the definition of “top” conflicts, which allows the learning algorithm
to focus on only high-ranking conflicts and avoids the irrelevant task of learning the correct ranking of
conflicts with low scores. We tried directly using the O1 scores of the conflicts as their labels but did not get as good
performance as with the scheme described.
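The binary labeling scheme can be sketched as below. The function name and the `top_fraction` parameter are hypothetical, and the sketch assumes that a conflict counts itself among the conflicts with "the same or a higher score", which the text leaves implicit.

```python
def binary_labels(scores, top_fraction=0.2):
    """Derive 0/1 labels from expert scores (higher score = higher rank).

    A conflict gets label 1 if at most top_fraction of all conflicts have
    the same or a higher score (counting the conflict itself, which is an
    assumption of this sketch). If this labels nothing, because the
    top-scoring conflicts alone exceed the fraction, the conflicts that
    share the highest score get label 1 instead.
    """
    n = len(scores)
    if n == 0:
        return []
    labels = []
    for s in scores:
        at_least = sum(1 for t in scores if t >= s)
        labels.append(1 if at_least <= top_fraction * n else 0)
    if not any(labels):
        best = max(scores)
        labels = [1 if s == best else 0 for s in scores]
    return labels


if __name__ == "__main__":
    print(binary_labels([10, 9, 8, 7, 6, 5, 4, 3, 2, 1]))
    print(binary_labels([3, 3, 3, 1]))
```

Since the rule depends only on the score value, conflicts with equal scores always receive equal labels, and the fallback guarantees at least one label 1 per node.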
2.6.1.3 Model Learning
Given the training dataset $D = \bigcup_{I \in \mathcal{I}_{\text{Train}}} D_I$, we follow the formulation in Section 2.5 to learn a linear
ranking function with parameter $w \in \mathbb{R}^p$,
\[
\pi : \mathbb{R}^p \to \mathbb{R}, \quad \pi(\phi_N(c)) = w^\top \phi_N(c),
\]
that minimizes the loss function
\[
L(w) = \sum_{N \in D} l(y_N, \hat{y}_N) + \frac{C}{2} \|w\|_2^2.
\]
To compute $l(y_N, \hat{y}_N)$, we consider the set of pairs $P_N = \{(c_i, c_j) : c_i, c_j \in N_{\text{Conf}} \wedge y_N(c_i) > y_N(c_j)\}$,
where $\hat{y}_N(c) = \pi(\phi_N(c))$ is the predicted score for conflict $c$. The loss function $l(\cdot, \cdot)$ is the fraction of
swapped pairs, computed as
\[
l(y_N, \hat{y}_N) = \frac{1}{|P_N|} \left| \{(c_i, c_j) \in P_N : \hat{y}_N(c_i) \le \hat{y}_N(c_j)\} \right|.
\]
2.6.1.4 ML-Guided Search
After data collection and model learning, we replace expert O1 for conflict selection in CBS with the
learned ranking function $\pi(\cdot)$. At each CT node $N$, we first compute the feature vector $\phi_N(c)$ for each
conflict $c \in N_{\text{Conf}}$ and pick the conflict with the maximum score, $c^* = \arg\max_{c \in N_{\text{Conf}}} \pi(\phi_N(c))$. The time
complexity of conflict selection at node $N$ is $O(|N_{\text{Conf}}|(|N_{\text{Conf}}| + k))$§. Even though the complexity of
conflict selection with expert O0 is only $O(|N_{\text{Conf}}|)$, we will show in our experiments that we are able to
outperform CBSH2+O0 in terms of both the CT size and the runtime.
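At search time, the selection step reduces to one dot product per conflict followed by an arg max. A minimal sketch, assuming the normalized feature vectors are already available in a dictionary keyed by conflict (the names are hypothetical):

```python
def select_conflict(weights, features_by_conflict):
    """Pick c* = arg max over conflicts of w . phi_N(c).

    weights is the learned weight vector w and features_by_conflict maps
    each conflict in N_Conf to its feature vector phi_N(c).
    """
    def score(phi):
        return sum(w * x for w, x in zip(weights, phi))

    return max(features_by_conflict,
               key=lambda c: score(features_by_conflict[c]))


if __name__ == "__main__":
    w = [1.0, -2.0]
    feats = {"c1": [0.5, 0.5], "c2": [1.0, 0.0], "c3": [0.0, 1.0]}
    print(select_conflict(w, feats))
```

The scoring itself is linear in the number of conflicts for a fixed number of features, so the stated complexity is driven by feature computation rather than by the model.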
Discussion. Improving conflict-selection strategies with ML in CBS is inspired by previous works [102,
59] on improving variable-selection strategies with ML in BnB for MILPs. We leverage several techniques
and insights from previous works and tailor them for CBS. First, we leverage imitation learning to imitate
decisions made by an expert. For MILP, [102, 59] imitate the Strong Branching [7] heuristic, which solves a
linear program relaxation for each candidate variable to estimate the increase in the lower bound in the
resulting child nodes if it is branched on. For MAPF, one of the main contributions is that we design expert
O1 similarly to estimate the increase in the lower bound in the resulting child CT nodes for each conflict
if it is selected. Second, we leverage a linear ranking function as our ML model. For MILP, both linear
ML models [102] and deep neural networks [59] have been used. For MAPF, we choose to use the linear
model since MAPF search algorithms are a lot more sensitive to the inference runtime (i.e., the runtime
for computing the features and predictions) than MILP search algorithms. Therefore, we leverage domain
expertise to craft features for the linear model that suit MAPF search algorithms. Third, we leverage the
labeling scheme that relaxes the definition of “top” conflicts. Such a scheme is more effective than labeling
each conflict with its score given by the expert, which aligns with the observation in MILP solving by
[102].
2.6.2 Empirical Evaluation
In this subsection, we demonstrate the efficiency of CBS+ML through experiments. In the following, we
introduce our evaluation setup and then present the results.
§We exclude the time complexity of building the MDDs for both CBS+ML and CBSH2+O0.
2.6.2.1 Setup
We use the C++ code for CBSH2 with the WDG heuristic [120] as our CBS version. We compare against
CBSH2+O0 as a baseline since O0 is the most commonly used conflict-selection expert. The reason why
we choose CBSH2 with the WDG heuristic over CBS, ICBS and CBSH2 with the CG or DG heuristics is
that it performs best, as demonstrated in [120]. We use the same compute resources as described in Section
2.6.1.1.
Our experiments provide answers to the following questions:
1. If the graph underlying the MAPF instances is known in advance, can we learn a model that performs
well on unseen MAPF instances on the same graph with different numbers of agents?
2. If the graph underlying the MAPF instances is unknown in advance, can we learn a model from
other graphs that performs well on MAPF instances on that graph?
We use a set of five four-neighbor grid maps $\mathcal{M}$ of different sizes and structures as the graphs underlying the MAPF instances and evaluate our algorithms on them. $\mathcal{M}$ includes (1) a warehouse map [121],
which is a 79 × 31 grid map with 100 rectangular obstacles of size 6 × 2; (2) the room map “room-32-32-4” [181],
which is a 32 × 32 grid map with 64 rooms of size 3 × 3 connected by single-cell doors; (3) the random map; (4)
the city map “Paris_1_256” [181], which is a 256 × 256 grid map of Paris; (5) the game map. The figures of
the maps are shown in Table 2.4. For each grid map $M \in \mathcal{M}$, we collect data for training instances $\mathcal{I}^{(M)}_{\text{Train}}$
and validation instances $\mathcal{I}^{(M)}_{\text{Valid}}$ on $M$ with a fixed number of agents, where $|\mathcal{I}^{(M)}_{\text{Train}}| = 30$ and $|\mathcal{I}^{(M)}_{\text{Valid}}| = 20$.
We learn two ranking functions for grid map $M$: one ranking function that is trained using 5,000 CT nodes
i.i.d. sampled from the training dataset collected by solving training instances $\mathcal{I}^{(M)}_{\text{Train}}$ on the same grid map
and another one that is trained using 5,000 CT nodes sampled from the training dataset collected by solving training instances $\bigcup_{M' \in \mathcal{M}} \mathcal{I}^{(M')}_{\text{Train}} \setminus \mathcal{I}^{(M)}_{\text{Train}}$ on the other maps, namely an equal number of i.i.d. CT nodes
sampled from each of the four other maps. We use only 30 training instances since they are sufficient for
collecting 5,000 CT nodes for each grid map.

                                                           Warehouse   Room    Random   City    Game
Number of agents in MAPF instances in ITrain and IValid    30          22      18       180     100
Training on the same map
  Swapped pairs (%)                                        5.78        12.58   10.89    2.89    4.40
  Top pick accuracy (%)                                    84.93       67.56   69.03    83.05   60.16
Training on the other maps
  Swapped pairs (%)                                        6.08        15.24   19.64    7.66    7.45
  Top pick accuracy (%)                                    86.85       66.80   50.44    78.57   53.13

Table 2.3: Numbers of agents in MAPF instances in ITrain and IValid, validation losses and accuracies.
The swapped pairs are the percentages of swapped pairs averaged over all test CT nodes, and the top
pick accuracy is the accuracy of the ranking function selecting one of the conflicts labeled as 1 in the test
dataset.

For each grid map $M \in \mathcal{M}$, we denote CBS+ML that uses
the ranking function trained on the same grid map by CBS+MLS and CBS+ML that uses the one trained
on the other maps by CBS+MLO. We set the regularization parameter $C = 1/100$ to train an SVMrank
[94] with a linear kernel to obtain each of the ranking functions. We varied $C \in \{1/10, 1/100, 1/1000\}$
and achieved similar results. We test the learned ranking functions on the validation dataset collected by
solving $\mathcal{I}^{(M)}_{\text{Valid}}$. The numbers of agents in the training instances used for data collection, the validation
losses and the accuracies of selecting one of the conflicts labeled as 1 are reported in Table 2.3. We varied
the number of agents for data collection and found that the different numbers led to similar performance. In general, the
losses of the ranking functions for CBS+MLO are larger and their accuracies of selecting “good” conflicts
are lower than those for CBS+MLS.
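The top pick accuracy reported in Table 2.3 can be sketched as the fraction of CT nodes at which the highest-scoring conflict carries label 1. The function name and the data layout are hypothetical, and breaking ties among equal predicted scores by the first index is an assumption of this sketch.

```python
def top_pick_accuracy(nodes):
    """Fraction of CT nodes whose top-scored conflict is labeled 1.

    nodes is a list of (labels, scores) pairs, one per CT node, where
    labels[i] is the binary label and scores[i] the predicted score of
    the i-th conflict. Ties in the scores go to the first index.
    """
    hits = 0
    for labels, scores in nodes:
        best = max(range(len(scores)), key=scores.__getitem__)
        if labels[best] == 1:
            hits += 1
    return hits / len(nodes)


if __name__ == "__main__":
    validation = [
        ([1, 0, 0], [0.9, 0.1, 0.2]),   # top pick has label 1
        ([0, 1, 0], [0.9, 0.1, 0.2]),   # top pick has label 0
    ]
    print(top_pick_accuracy(validation))
```

Unlike the swapped-pair percentage, this metric ignores how the remaining conflicts are ordered, which matches how the ranking function is used during the search.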
2.6.2.2 Results
Success Rate, Runtime and Tree Size. We run CBSH2, CBS+MLS and CBS+MLO on 25 test instances
on each of the five maps and vary the number of agents. The runtime limits are set to 60 minutes for the
two largest maps (the city and game maps) and 10 minutes for the other maps. In Table 2.4, we report
the success rates together with the average runtimes and the average CT sizes of the test instances solved
by all methods for different numbers of agents on each grid map. CBS+MLS and CBS+MLO dominate
CBSH2 in all metrics on all maps in almost all cases. For CBS+MLS, even though we learn the ranking
function from data collected on instances with a fixed number of agents (listed in Table 2.3), the learned
function generalizes to instances with larger numbers of agents on the same grid map and outperforms
CBSH2. CBS+MLO, without seeing the actual grid map being tested on during training, is competitive
with CBS+MLS. The results suggest that our approach, when focusing on solving instances on a particular
grid map, can outperform CBSH2 substantially and, when faced with new maps, still has an advantage.
CBS+MLO even outperforms CBS+MLS sometimes on the warehouse, city, and game maps (similarly in
Table 2.3), which suggests that learning from the expert’s demonstration on multiple grid maps benefits
CBS+ML since the effectiveness of the expert varies across grid maps.

Grid Map    k     Success Rate (%)                Runtime (min)             CT Size (nodes)
                  CBSH2   MLS         MLO         CBSH2   MLS     MLO       CBSH2   MLS     MLO
Warehouse   30    100     100 (100)   100 (100)   0.15    0.05    0.05      541     131     152
            36    92      100 (92)    100 (92)    0.42    0.06    0.07      1992    268     321
            42    52      68 (52)     68 (52)     1.97    0.63    0.32      11243   3533    1503
            45    28      44 (28)     48 (28)     3.41    1.33    0.50      17348   5320    1997
            48    16      36 (16)     40 (16)     0.23    0.10    0.12      1328    517     676
            54    4       20 (4)      20 (4)      0.49    1.11    0.20      2808    5633    1087
            Improvement over CBSH2                0       73.5%   76.1%     0       77.0%   81.9%
Room        22    96      96 (96)     96 (96)     0.08    0.07    0.07      313     228     227
            26    80      80 (76)     80 (76)     0.94    0.44    0.42      10983   4373    3859
            28    60      72 (60)     68 (60)     1.43    1.02    1.01      17551   10505   9968
            30    40      44 (40)     44 (40)     1.76    0.98    1.04      23250   11348   11987
            32    24      24 (20)     24 (20)     3.16    2.26    2.26      42041   27035   25986
            34    4       8 (4)       4 (4)       2.68    0.67    0.80      36137   7925    9090
            Improvement over CBSH2                0       34.3%   33.1%     0       47.7%   47.2%
Random      18    92      92 (92)     92 (92)     0.16    0.15    0.15      2609    2366    2302
            20    88      88 (88)     88 (88)     0.90    0.88    0.89      8779    7897    7742
            23    60      72 (56)     68 (56)     1.60    1.37    1.40      27628   23970   24257
            26    40      40 (40)     40 (40)     3.01    2.85    2.51      47297   44770   43094
            29    24      36 (24)     28 (24)     4.44    3.38    3.46      64965   52105   52463
            Improvement over CBSH2                0       27.4%   25.2%     0       25.3%   22.8%
City        180   76      84 (76)     84 (19)     4.73    3.37    3.45      578     280     285
            200   72      76 (72)     76 (72)     7.92    4.62    4.76      878     288     297
            220   48      64 (48)     68 (48)     6.04    4.47    3.73      934     445     402
            240   36      44 (36)     48 (36)     9.25    5.91    5.71      790     510     507
            260   24      24 (24)     32 (24)     10.59   8.80    8.62      1363    1088    1074
            280   24      24 (24)     28 (24)     12.72   14.27   10.16     1529    1650    1414
            Improvement over CBSH2                0       23.6%   29.8%     0       40.9%   42.4%
Game        100   72      76 (72)     76 (72)     6.47    4.88    3.87      3418    1729    1470
            110   44      56 (44)     52 (44)     7.13    4.47    4.44      3157    1312    1366
            115   44      48 (44)     48 (44)     8.06    4.96    4.95      3990    1753    1805
            120   36      36 (36)     36 (36)     12.48   6.59    6.47      6176    3664    3536
            130   24      32 (24)     28 (24)     14.03   16.52   13.89     6649    6700    6621
            135   20      28 (20)     24 (20)     8.88    11.68   8.43      2537    2598    2502
            Improvement over CBSH2                0       27.2%   37.6%     0       45.7%   48.7%

Table 2.4: Success rates and the average runtimes and CT sizes of MAPF instances solved by all methods
(MLS and MLO stand for CBS+MLS and CBS+MLO, respectively) for different numbers of agents $k$ on
five maps. For the success rates of MLS and MLO, the percentages of MAPF instances solved by both our
methods and CBSH2 are given in parentheses (bolded if they solve all MAPF instances that CBSH2 solves).
For each grid map, we report the percentages of our improvement over CBSH2 on the runtime and CT size
on MAPF instances solved by all methods.
Feature Importance. Next, we look at the feature importance of the learned ranking functions. For
CBS+MLO, the five ranking functions have nine features in common among their eleven features with
the largest absolute weights. Thus, they are similar when looking at the important features. We take the
average of each weight and sort them in decreasing order of their absolute values. The top eight features
are (1) the weight of the edge between agents $a_i$ and $a_j$ in the WDG; (2) the binary indicator for non-cardinal conflicts; (3) the maximum of the differences of the cost of the path of agent $a_i$ ($a_j$) and $t$; (4) the
binary indicator for cardinal conflicts; (5) the minimum of the numbers of conflicts that agent $a_i$ ($a_j$) is
involved in; and (6-8) the minimum, the maximum and the sum of the numbers of conflicts involving agent
$a_i$ ($a_j$) that have been selected and resolved. Those features mainly belong to three categories: features
related to the conflict type, the WDG and the number of conflicts having been resolved for agents, where
the first one is commonly used in previous work on CBS and the third one is an analogue of the branching
variable pseudocosts in Branch-and-Bound for MILP solving [2].
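The averaging-and-sorting step used for this analysis can be sketched as follows. The function name, the feature names and the toy weights are hypothetical placeholders and do not reproduce the learned models.

```python
def rank_features_by_mean_weight(weight_vectors, names):
    """Average each feature weight over several linear ranking functions
    and sort the features by decreasing absolute mean weight.

    weight_vectors is a list of weight vectors (one per learned ranking
    function) and names[k] is a label for feature k.
    """
    p = len(names)
    m = len(weight_vectors)
    mean = [sum(w[k] for w in weight_vectors) / m for k in range(p)]
    order = sorted(range(p), key=lambda k: abs(mean[k]), reverse=True)
    return [(names[k], mean[k]) for k in order]


if __name__ == "__main__":
    models = [[1.0, -3.0, 0.5], [3.0, -5.0, 0.5]]   # toy weights, not real data
    for name, weight in rank_features_by_mean_weight(models, ["f0", "f1", "f2"]):
        print(name, weight)
```

Comparing absolute weights across features is meaningful here only because all features are normalized to the same range before training.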
2.7 Learning to Select Nodes for ECBS
In this section, we introduce ECBS+ML to show how the framework can be applied to learn a heuristic
that improves node selection in Enhanced Conflict-Based Search (ECBS) [174]. MAPF is NP-hard to solve
optimally [209, 9] and, therefore, optimal MAPF search algorithms, such as CBS, do not scale to many
agents. ECBS and its variants [35, 36] are guaranteed to find solutions whose sums of costs of the paths
are at most $w \ge 1$ times the minimum ones and run faster than optimal MAPF search algorithms. Which
CT node to select from the focal list for expansion is an important decision in ECBS. A generic
node-selection strategy for ECBS assigns a d-value to each CT node and always selects a CT node with
the minimum d-value in the focal list for expansion. The most commonly used node-selection strategy
in previous work uses the number of conflicts $|N_{\text{Conf}}|$ as the d-value of a CT node $N$. We refer to this
strategy as h1. [10] propose other strategies that use the number of pairs of agents that have at least one
conflict with each other and the number of agents that have at least one conflict with other agents as the
d-values. We refer to these two strategies as h2 and h3, respectively. We implement and experiment with
h1, h2 and h3 in Section 2.7.2.
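The three hand-crafted d-values and the generic selection rule can be sketched as below. The sketch assumes that each conflict of a CT node is reduced to its pair of agent identifiers and that the focal list is given as a plain dictionary; both are simplifications made for illustration, and all names are hypothetical.

```python
def d_values(conflicts):
    """Return (h1, h2, h3) for one CT node.

    conflicts is a list of (agent_i, agent_j) pairs, one per conflict.
    h1: number of conflicts.
    h2: number of pairs of agents with at least one conflict with each other.
    h3: number of agents with at least one conflict with another agent.
    """
    h1 = len(conflicts)
    h2 = len({frozenset(pair) for pair in conflicts})
    h3 = len({agent for pair in conflicts for agent in pair})
    return h1, h2, h3


def select_from_focal(focal, strategy):
    """Select the CT node in the focal list with the minimum d-value.

    focal maps a node identifier to its list of conflicts; strategy is
    0, 1 or 2 for h1, h2 or h3.
    """
    return min(focal, key=lambda node: d_values(focal[node])[strategy])


if __name__ == "__main__":
    focal = {"A": [(1, 2), (1, 2), (1, 2)], "B": [(1, 2), (3, 4)]}
    print(select_from_focal(focal, 0), select_from_focal(focal, 1))
```

The toy focal list shows that the strategies can disagree: node B has fewer conflicts, while node A has fewer conflicting pairs of agents.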
Instead of manually defining the d-values, we introduce ECBS+ML to show how our framework can be
applied to improving node selection in ECBS. We borrow tools such as imitation learning and curriculum
learning from the machine learning literature [167, 168, 178, 15] and propose a novel method for learning
node-selection strategies for the high-level focal search to speed up ECBS. This method could also be
applied to the low-level focal search but we focus on the high-level one since the low-level focal search
runs in polynomial time and does not have as big an impact on the efficiency of ECBS as the high-level
one. We then empirically demonstrate the effectiveness and efficiency of ECBS+ML.
2.7.1 Machine Learning Methodology
For training, we do not directly learn the $d$-values but rather a ranking function that differentiates CT nodes
that have shorter distances to leaf nodes in the CT with $w$-approximate solutions from those CT nodes that
have longer distances to one. During the search, the ranking function takes a CT node's features as input
and calculates a real-valued $d$-value. Our goal is to learn a ranking function such that its $d$-values allow
ECBS to get closer to a $w$-approximate solution every time it expands a CT node and, therefore, help it to
find a solution more quickly. The node-selection strategies are applied when selecting a node to expand
the CT. Thus, we represent the state of the search with the CT $s = \mathcal{T}'$ that ECBS has built prior to selecting the
node. ECBS can select any CT node from the focal list $F(\mathcal{T}')$ when $\mathcal{T}'$ is built; thus, the available actions
are all CT nodes in the list, i.e., $A(\mathcal{T}') = F(\mathcal{T}')$. To apply the framework, we first propose an expert that
retrospectively computes the complete CT to select CT nodes. The complete CT is defined as the CT we
get when ECBS terminates, which is expensive to compute. Next, given the expert, we explain how we use
it to collect data and apply ML to imitate its decisions. During training, we fix the underlying graph and
learn node-selection strategies from solving the training instances on that graph, where the start and goal
vertices of the agents are drawn from a given distribution. We start with a small number of agents and
use imitation learning [40, 167, 168] to learn a node-selection strategy for that number of agents. We then
continue learning node-selection strategies for larger and larger numbers of agents. Instead of learning
from scratch every time the number of agents increases, we use curriculum learning [15] to learn more
efficiently by using previously learned node-selection strategies as starting points.
2.7.1.1 Expert for Node Selection
Given a MAPF instance, an expert for node selection at a given state $\mathcal{T}'$ is a ranking function $\pi'$ that takes
the focal list $F(\mathcal{T}')$ as input, calculates a real-valued score per CT node $N \in F(\mathcal{T}')$ as its $d$-value and
outputs the ranks determined by the scores. We say that ECBS follows an expert for node selection iff
ECBS builds the CT by always selecting the CT node with the highest rank to expand.
The expert we propose retrospectively computes the complete CT to find all $w$-approximate solutions
to the MAPF instance. For each CT node $N \in F(\mathcal{T}')$, we define $N_d$ as the minimum distance between $N$ and any
$w$-approximate solution found within the subtree rooted at $N$. We assign $N_d = \infty$ if no solution was
found within its subtree. The expert uses $N_d$ as the $d$-value of $N$ and outputs the ranks determined by the
increasing order of the $d$-values (i.e., the highest rank for the smallest $d$-value). However, retrospectively
computing the complete CT and finding all $w$-approximate solutions is prohibitively expensive computationally.
Therefore, in practice, we run the expert until it exceeds a runtime limit, $T$ $w$-approximate solutions have
been found or the expert terminates, whichever happens first.
2.7.1.2 Data Collection
Given an instance $I$ and the expert's ranking function $\pi'$, we describe a subroutine CollectData($I$, $\pi'$) for
data collection that is used in our learning algorithm. CollectData($I$, $\pi'$) runs ECBS following the expert
using the node-selection strategy $\pi'$ and returns the complete CT $\mathcal{T}$.
Features For each CT node $N$ in the focal list $F(\mathcal{T}')$ at state $\mathcal{T}'$, CollectData computes the following atomic
features $f_1, \ldots, f_9$:
1. features related to the conflicts: the number of conflicts $N_{\mathrm{Conf}}$ ($f_1$), the number of pairs of agents
that have at least one conflict with each other ($f_2$) and the number of agents that have at least one
conflict with other agents ($f_3$);
2. features related to $N_{\mathrm{Cost}}$: $f_4 := N_{\mathrm{Cost}}$, $f_5 := N_{\mathrm{Cost}} / \mathrm{LB}$, $f_6 := N_{\mathrm{Cost}} - \mathrm{LB}$, $f_7 := N_{\mathrm{Cost}} - S$ and
$f_8 := N_{\mathrm{Cost}} / S$, where $S$ is the sum of costs of the individually cost-minimal paths of all agents; and
3. the depth of $N$ in the CT ($f_9$).
From these atomic features, we obtain interaction features $f_i f_j$ (for $i \leq j$), which are the pairwise products
of the atomic features. The final feature vector $\phi_{\mathcal{T}'}(N) \in \mathbb{R}^p$ ($p = 54$, i.e., 9 atomic features plus 45
interaction features) is obtained by concatenating all atomic features and interaction features, resulting in the
degree-2 polynomial kernel in the space of atomic features. Features $f_2$ and $f_3$ can be computed in time
$O(N_{\mathrm{Conf}})$, and the other features can be computed in time $O(1)$. Therefore, the overall time complexity for
computing all 54 features is $O(N_{\mathrm{Conf}})$.
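The construction of the 54-dimensional feature vector can be sketched as follows. This is a minimal
illustration under stated assumptions: the function names and the argument layout are hypothetical, the
conflict-related counts are assumed to be precomputed, and `lb` and `s` stand for LB and $S$ from the list above.

```python
from itertools import combinations_with_replacement
from typing import List, Sequence


def atomic_features(n_conf: int, n_pairs: int, n_agents: int,
                    cost: float, lb: float, s: float, depth: int) -> List[float]:
    """Atomic features f1..f9 of one CT node (lb and s are assumed positive)."""
    return [
        float(n_conf),    # f1: number of conflicts
        float(n_pairs),   # f2: agent pairs with at least one conflict
        float(n_agents),  # f3: agents with at least one conflict
        cost,             # f4: N_Cost
        cost / lb,        # f5: N_Cost / LB
        cost - lb,        # f6: N_Cost - LB
        cost - s,         # f7: N_Cost - S
        cost / s,         # f8: N_Cost / S
        float(depth),     # f9: depth of the node in the CT
    ]


def feature_vector(atomic: Sequence[float]) -> List[float]:
    """Concatenate the atomic features with all products f_i * f_j for i <= j."""
    index_pairs = combinations_with_replacement(range(len(atomic)), 2)
    products = [atomic[i] * atomic[j] for i, j in index_pairs]
    return list(atomic) + products


phi = feature_vector(atomic_features(3, 2, 3, 110.0, 100.0, 95.0, 4))
```

With nine atomic features there are 9 · 10 / 2 = 45 products with $i \leq j$, which gives the 54 entries stated above.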
The interaction features form a richer set of features than the atomic features alone. Using them corresponds
to using a polynomial kernel of degree two in an SVM. However, using them increases the
runtime for training an SVMrank. This is not an issue for learning to select nodes in ECBS since we have only
nine atomic features, in contrast to the 67 atomic features in the previous section, where we learn to select
which conflict to resolve next in CBS. For the same reason, we do not use interaction features in Sections 2.8
and 2.9 either.
Labels During data collection, we run the expert until either $T$ solutions are found, the search exceeds
the runtime limit or the expert terminates. If the expert exceeds the runtime limit without finding any
solution, we return an empty set of training data for instance $I$. Otherwise, we return the set of states
encountered during the search. We assign a label $y_{\mathcal{T}'}(N)$ to each CT node $N \in F(\mathcal{T}')$ based on the
minimum distance $N_d$ between $N$ and any $w$-approximate solution found within the subtree rooted at $N$.
We assign $N_d = \infty$ if no solution was found within its subtree. Since we want to assign smaller $d$-values
to CT nodes that are closer to a solution, we label $N$ in a way such that the closer $N$ is to a solution, the
smaller $y_{\mathcal{T}'}(N)$ is:
$$y_{\mathcal{T}'}(N) = \begin{cases}
0, & \text{if } N_d < \tau_0, \\
1, & \text{if } \tau_0 \leq N_d < \tau_1, \\
2, & \text{if } \tau_1 \leq N_d < \tau_2, \\
3, & \text{if } \tau_2 \leq N_d < \infty, \\
\infty, & \text{otherwise},
\end{cases}$$
where $\tau_2 > \tau_1 > \tau_0 > 0$ are three thresholds. Our labeling scheme allows us to focus on pairs of CT
nodes that have large differences in $N_d$ when the labels are used to learn a ranking function. In contrast
to using $y_{\mathcal{T}'}(N) = N_d$, it avoids having to correctly rank CT nodes that are almost equally good or bad,
which is irrelevant for making good node selections. We did try using $y_{\mathcal{T}'}(N) = N_d$ but did not get
better performance than ECBS without ML-guided node selection.
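The labeling scheme can be written as a short Python sketch. The function name `label` is hypothetical; the
default thresholds 10, 30 and 60 are the values reported in the experimental setup (Section 2.7.2.1), and
`math.inf` encodes the case in which no solution was found in the subtree.

```python
import math


def label(distance: float, tau0: float = 10, tau1: float = 30, tau2: float = 60) -> float:
    """Map the distance N_d of a CT node to its bucketed label y(N)."""
    if distance < tau0:
        return 0
    if distance < tau1:
        return 1
    if distance < tau2:
        return 2
    if distance < math.inf:
        return 3
    return math.inf  # no w-approximate solution found below this node


labels = [label(d) for d in (5, 10, 45, 60, math.inf)]
```

Two nodes whose distances fall into the same bucket receive the same label, so the ranking loss never asks the model to order them.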
2.7.1.3 Model Learning
We want to learn node-selection strategies for instances with different numbers of agents on a fixed
underlying graph $G = (V, E)$ with a fixed suboptimality factor $w$. The idea central to our training algorithm
is that we start learning a node-selection strategy by solving easy instances with a small number of agents
and iteratively increase the number of agents to learn another strategy based on the previous one. In
particular, we want to learn to solve instances with increasing difficulty, i.e., with $m$ different numbers
of agents $k_1, \ldots, k_m$ where $k_1 < \ldots < k_m$. For each $k_i$, we learn a node-selection strategy that assigns
$\pi_i(\phi_{\mathcal{T}'}(N))$ to CT node $N$ as its $d$-value, where $\pi_i$ is a learned ranking function. Therefore, a desirable
ranking function is one that assigns smaller $d$-values to CT nodes that are closer to a $w$-approximate
solution and larger $d$-values to those CT nodes that are farther away from one.
Algorithm 3 Training Algorithm: Curriculum Learning
1: Input: $\{k_1, \ldots, k_m\}$ and $m$ sets of training instances $\{\mathcal{I}_1, \ldots, \mathcal{I}_m\}$
2: $\pi_0 \leftarrow \pi^*$
3: for $i = 1$ to $m$ do
4:     $\pi_i \leftarrow$ DAgger($\pi_{i-1}$, $\mathcal{I}_i$)    ▷ Call Algorithm 4
5:     if $\pi_i = \pi_{i-1}$ then    ▷ Stopping criterion met
6:         $\forall i < j \leq m$, $\pi_j \leftarrow \pi_i$
7:         break
8: return $\{\pi_1, \ldots, \pi_m\}$
One of the main challenges is that, as $k_i$ increases, it becomes increasingly hard to collect a sufficient
amount of training data due to the increased difficulty of the MAPF instances and, thus, the increased
runtime to collect data. To overcome this challenge, we propose a training algorithm based on curriculum
learning, as shown in Algorithm 3. Curriculum learning is a machine learning technique that trains an ML
model using examples (in our case, MAPF instances) of increasing difficulty. The ML model is first trained
on simple tasks (in our case, node selection in ECBS on easier MAPF instances) and then knowledge from
those tasks is transferred to the difficult task. Algorithm 3 takes $\{k_1, \ldots, k_m\}$ and $m$ sets of training
instances $\{\mathcal{I}_1, \ldots, \mathcal{I}_m\}$ as input and outputs $\{\pi_1, \ldots, \pi_m\}$. Each instance in $\mathcal{I}_i$ includes $k_i$ agents, where
the start and goal vertices of the agents are drawn i.i.d. from a given distribution. $\pi_0$ is set to a ranking
function $\pi^*$ that corresponds to an initial node-selection strategy (e.g., one of the node-selection strategies $h_1$,
$h_2$ and $h_3$) (Line 2). To obtain $\pi_1$ for instances with $k_1$ agents, we use DAgger($\pi^*$, $\mathcal{I}_1$) [168] (see Algorithm
4) as a training algorithm that learns a ranking function from solving the training instances in $\mathcal{I}_1$ using $\pi^*$
as a starting point. To obtain $\pi_i$ (for $i > 1$), instead of starting from $\pi^*$ again, we start learning from $\pi_{i-1}$.
We obtain $\pi_i$ ($1 < i \leq m$) by calling DAgger($\pi_{i-1}$, $\mathcal{I}_i$), which learns a ranking function starting from $\pi_{i-1}$
as the ranking function of the node-selection strategy (Lines 3-4) until a stopping criterion is met (Lines
5-7) or $i > m$. If the stopping criterion is met before $i > m$, we terminate training (Line 7) and simply set
$\pi_j = \pi_i$ for all $i < j \leq m$ (Line 6). If DAgger($\pi_{i-1}$, $\mathcal{I}_i$) returns $\pi_{i-1}$, then it cannot find a better ranking
function than $\pi_{i-1}$. This situation typically occurs at some point during training for hard instances
with many agents since only data collected from solved instances during data collection contributes to the
training data, and, for hard instances, it is difficult to collect a sufficient amount of training data, which
makes it difficult to improve on $\pi_{i-1}$. When the training algorithm observes this situation (Line 5), it stops
training and uses the last obtained ranking function for all instances with larger numbers of agents than
the one for which it could not improve the ranking function.
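The control flow of Algorithm 3 can be sketched in Python as follows. This is a minimal sketch, not the
implementation used in the experiments: `dagger` is a placeholder callable standing in for Algorithm 4, and
ranking functions are treated as opaque objects that can be compared with `==`.

```python
from typing import Callable, List, Sequence, TypeVar

Policy = TypeVar("Policy")
Instances = TypeVar("Instances")


def curriculum(initial: Policy,
               instance_sets: Sequence[Instances],
               dagger: Callable[[Policy, Instances], Policy]) -> List[Policy]:
    """Learn one ranking function per instance set, easiest set first."""
    policies: List[Policy] = []
    previous = initial                      # pi_0 <- pi^*
    for idx, instances in enumerate(instance_sets):
        current = dagger(previous, instances)
        if current == previous:
            # Stopping criterion: no improvement, so reuse the last ranking
            # function for this and all remaining numbers of agents.
            policies.extend([previous] * (len(instance_sets) - idx))
            break
        policies.append(current)
        previous = current
    return policies


# Toy stand-in for Algorithm 4: "improves" an integer policy until it reaches 2.
toy = curriculum(0, ["I1", "I2", "I3", "I4"], lambda p, _inst: p + 1 if p < 2 else p)
```

In the toy run, the third call returns its input unchanged, so the second ranking function is reused for the last two instance sets.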
DAgger($\pi^{(0)}$, $\mathcal{I}_i$), shown in Algorithm 4, is an imitation learning algorithm. The inputs $\pi^{(0)}$ and $\mathcal{I}_i$ are
the ranking function of the initial node-selection strategy and the set of training instances, respectively.
DAgger repeatedly determines a ranking function that makes better decisions in those situations that
were encountered when running ECBS with the previous version of the ranking function. Initially, the
training data $D$ is set to $\emptyset$ (Line 1). Let $R$ be the number of iterations for which the algorithm runs (Line
4). In iteration $j$, it collects training data by solving the instances in $\mathcal{I}_i$ with the ranking function $\pi^{(j-1)}$
obtained in iteration $j - 1$, aggregates it with $D$ (Line 6) and learns a new ranking function $\pi^{(j)}$
that minimizes a loss function over $D$ (Line 7). When collecting training data using $\pi^{(j-1)}$ in ECBS, we set
a runtime limit for each instance. We record the success rate (i.e., the fraction of instances solved within
the given runtime limit) on $\mathcal{I}_i$ (Line 8) and the average runtime for the solved instances in $\mathcal{I}_i$ (Line 9).
Finally, DAgger returns the ranking function that achieves the highest success rate on the instances in $\mathcal{I}_i$
in all $R$ iterations (Line 10), breaking ties in favor of the lowest average runtime for the solved instances
(Line 11).
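The following Python sketch mirrors the structure of this DAgger variant, including the final selection by
success rate with ties broken by average runtime. All callables (`collect_data`, `train`, `evaluate`) are
hypothetical placeholders for running ECBS, fitting the ranking function and measuring performance.

```python
from typing import Callable, List, Sequence, Tuple


def dagger(pi0, instances: Sequence,
           collect_data: Callable, train: Callable,
           evaluate: Callable[..., Tuple[float, float]],
           rounds: int = 10):
    """Return the best ranking function among pi^(0), ..., pi^(R)."""
    data: List = []                                  # aggregated training data D
    candidates = [pi0]
    scores = [evaluate(pi0, instances)]              # (success rate, avg runtime)
    for _ in range(rounds):
        previous = candidates[-1]
        for instance in instances:
            data.extend(collect_data(instance, previous))  # run ECBS, aggregate
        current = train(data)                        # fit on all data so far
        candidates.append(current)
        scores.append(evaluate(current, instances))
    # Highest success rate first; among ties, lowest average runtime.
    best = max(range(len(candidates)), key=lambda l: (scores[l][0], -scores[l][1]))
    return candidates[best]


# Toy run: policies are integers and the "model" is the size of the data set.
table = {0: (0.5, 10.0), 2: (0.8, 20.0), 4: (0.8, 15.0), 6: (0.7, 5.0)}
chosen = dagger(0, [1, 2],
                collect_data=lambda inst, pi: [(inst, pi)],
                train=len,
                evaluate=lambda pi, _insts: table[pi],
                rounds=3)
```

In the toy run, two candidates share the highest success rate of 0.8, and the one with the lower average runtime is returned.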
Learning a Ranking Function One could follow the learning-to-rank formulation in Section 2.5 to
train a ranking function that minimizes the loss function $l(y_{\mathcal{T}'}, \hat{y}_{\mathcal{T}'})$ across all states. However, $l(y_{\mathcal{T}'}, \hat{y}_{\mathcal{T}'})$
needs to consider $P_{\mathcal{T}'}$, which consists of all ordered CT node pairs in the focal list $F(\mathcal{T}')$ and has a quadratic
number of CT node pairs. Thus, the total number of CT node pairs that need to be considered to compute
the loss function for a single CT $\mathcal{T}$ is $O(|\mathcal{T}|^3)$ for just a single training instance, where $|\mathcal{T}|$ is the number
of CT nodes in $\mathcal{T}$, and it would be prohibitively expensive to compute.
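Algorithm 4 DAgger($\pi^{(0)}$, $\mathcal{I}_i$)
1: $D = \emptyset$
2: $r_0 \leftarrow$ success rate on $\mathcal{I}_i$ using $\pi^{(0)}$ in ECBS
3: $c_0 \leftarrow$ average runtime on solved instances in $\mathcal{I}_i$ using $\pi^{(0)}$ in ECBS
4: for $j = 1$ to $R$ do
5:     for $I$ in training instance set $\mathcal{I}_i$ do
6:         $D \leftarrow D \cup$ CollectData($I$, $\pi^{(j-1)}$)    ▷ Call ECBS
7:     $\pi^{(j)} \leftarrow$ train a ranking function using $D$
8:     $r_j \leftarrow$ success rate on $\mathcal{I}_i$ using $\pi^{(j)}$ in ECBS
9:     $c_j \leftarrow$ average runtime on solved instances in $\mathcal{I}_i$ using $\pi^{(j)}$ in ECBS
10: $L \leftarrow \arg\max_{0 \leq l' \leq R} \{r_{l'}\}$
11: $l \leftarrow$ an element from $\arg\min_{l' \in L} \{c_{l'}\}$
12: return $\pi^{(l)}$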
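Notice that the value of label $y_{\mathcal{T}'}(N)$ depends solely on the complete CT $\mathcal{T}$, i.e., $y_{\mathcal{T}'}(N) = y_{\mathcal{T}}(N)$ for
any $\mathcal{T}'$. Also, notice that the features $\phi_{\mathcal{T}'}(N)$ depend only on the information of CT node $N$ and, thus,
$\phi_{\mathcal{T}'}(N) = \phi_{\mathcal{T}}(N)$ for any $\mathcal{T}'$.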
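Based on the above observations, we propose to consolidate the states $\mathcal{T}'$ of a MAPF instance into a
single state represented by the complete CT $\mathcal{T}$ and call $\mathcal{T}' \subseteq \mathcal{T}$ a substate of $\mathcal{T}$. To compute the loss for
the consolidated state $\mathcal{T}$, we consider all ordered CT node pairs in every substate, i.e., we let
$$\tilde{P}_{\mathcal{T}} = \bigcup_{\mathcal{T}' \subseteq \mathcal{T}} P_{\mathcal{T}'}.$$
$\tilde{P}_{\mathcal{T}}$ consists of all ordered pairs of CT nodes that occur in the focal list during the search. After the
consolidation, the training dataset $D$ is a set of complete CTs, one for each training instance. We train a
linear ranking function with parameter $\mathbf{w} \in \mathbb{R}^p$
$$\pi : \mathbb{R}^p \to \mathbb{R} : \pi(\phi_{\mathcal{T}}(N)) = \mathbf{w}^{\mathsf{T}} \phi_{\mathcal{T}}(N)$$
and minimize the loss function
$$L(\mathbf{w}) = \sum_{\mathcal{T} \in D} l(y_{\mathcal{T}}, \hat{y}_{\mathcal{T}}) + \frac{C}{2} \|\mathbf{w}\|_2^2$$
over the training data $D$, where $y_{\mathcal{T}}$ is the ground-truth label vector of all CT nodes that appear in $\mathcal{T}$, $\hat{y}_{\mathcal{T}}$ is
the corresponding vector of predicted values resulting from applying $\pi$ to the feature vector $\phi_{\mathcal{T}}(N)$, and
$l(y_{\mathcal{T}}, \hat{y}_{\mathcal{T}})$ is computed as follows:
$$l(y_{\mathcal{T}}, \hat{y}_{\mathcal{T}}) = \frac{\sum_{(N_i, N_j) \in \tilde{P}_{\mathcal{T}} : \hat{y}_{\mathcal{T}}(N_i) \leq \hat{y}_{\mathcal{T}}(N_j)} \tilde{w}_{N_i, N_j}}{\sum_{(N_i, N_j) \in \tilde{P}_{\mathcal{T}}} \tilde{w}_{N_i, N_j}}. \qquad (2.1)$$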
The weight $\tilde{w}_{N_i, N_j}$ of each pair $(N_i, N_j) \in \tilde{P}_{\mathcal{T}}$ is set to $e^{-(d_i + d_j)/(r d_{\max})}$, where $d_i$ and $d_j$ are the depths
of $N_i$ and $N_j$ in $\mathcal{T}$, respectively, $d_{\max}$ is the depth of $\mathcal{T}$, and $r$ is a damping factor. The weight $\tilde{w}_{N_i, N_j}$
takes into account the fact that the CT grows exponentially and can be understood as the product of the
weights of $N_i$ and $N_j$, where the weight of $N_i$ is $e^{-d_i/(r d_{\max})}$. We use the weighted version of the ranking
loss to help focus on making accurate predictions for CT nodes that are close to the root early in the search
since expanding a CT node that contains no $w$-approximate solution would bring an extra computation
cost that is exponential in the depth of the sub-CT rooted at that CT node in the worst case. Alternatively,
one could argue for setting the weight to $e^{-\min(d_i, d_j)/(r d_{\max})}$ since the importance of ranking a pair of CT nodes
$(N_i, N_j)$ correctly depends on the closest solution to either $N_i$ or $N_j$. We did not try this in our empirical
evaluation but do not want to rule out the possibility that this could also be a good choice for setting the
weights.
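A minimal Python sketch of the weighted ranking loss of Equation (2.1) follows. It assumes, since Section 2.5
is not repeated here, that each ordered pair lists first the CT node that should receive the larger predicted
value, so that a pair counts as mis-ranked when the first prediction is not larger than the second. The function
names and the list-based data layout are illustrative only.

```python
import math
from typing import List, Sequence, Tuple


def pair_weight(d_i: int, d_j: int, d_max: int, r: float = 0.3727) -> float:
    """Weight exp(-(d_i + d_j) / (r * d_max)) of a pair of CT nodes."""
    return math.exp(-(d_i + d_j) / (r * d_max))


def weighted_ranking_loss(pairs: Sequence[Tuple[int, int]],
                          y_hat: List[float],
                          depth: List[int],
                          d_max: int,
                          r: float = 0.3727) -> float:
    """Weighted fraction of mis-ranked pairs for one complete CT."""
    total = 0.0
    wrong = 0.0
    for i, j in pairs:
        weight = pair_weight(depth[i], depth[j], d_max, r)
        total += weight
        if y_hat[i] <= y_hat[j]:      # pair is ranked incorrectly
            wrong += weight
    return wrong / total if total > 0 else 0.0


# Three nodes at the root depth: one of the two pairs is mis-ranked.
loss = weighted_ranking_loss([(0, 1), (2, 1)], [2.0, 1.0, 0.5], [0, 0, 0], d_max=10)
```

Because the loss is normalized by the total weight, it always lies in [0, 1], which matches the range reported in Table 2.6.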
2.7.1.4 ML-Guided Search
After learning the ranking functions $\{\pi_1, \ldots, \pi_m\}$, we deploy them in ECBS. Given an instance with $k$
agents and the same underlying graph as used during training, we run ECBS with ranking function $\pi_j$,
where $j \in \arg\min_{i \in [m]} \{|k - k_i|\}$, i.e., the one trained on the most similar number of agents. When a CT
node $N$ is generated, we compute its feature vector $\phi_{\mathcal{T}}(N)$ and set its $d$-value to $\pi_j(\phi_{\mathcal{T}}(N))$. The overall
time complexity of computing the $d$-value is $O(N_{\mathrm{Conf}})$ because of the time complexity of computing
the features. Even though the time complexity of computing the $d$-value for node-selection strategy $h_1$ is
$O(1)$, we will show experimentally that ECBS+ML outperforms ECBS with $h_1$ in terms of both the success
rate and the runtime.
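Deployment thus amounts to one lookup and one dot product per generated CT node, as in the following
sketch. The function names are hypothetical, and the linear form of the ranking function is the one described
in Section 2.7.1.3.

```python
from typing import Sequence


def pick_policy_index(k: int, trained_ks: Sequence[int]) -> int:
    """Index of the ranking function trained on the closest number of agents."""
    return min(range(len(trained_ks)), key=lambda i: abs(k - trained_ks[i]))


def d_value(weights: Sequence[float], phi: Sequence[float]) -> float:
    """d-value of a CT node under a linear ranking function w^T phi."""
    return sum(w * x for w, x in zip(weights, phi))


index = pick_policy_index(92, [75, 85, 95])
value = d_value([1.0, 2.0], [3.0, 4.0])
```

If two trained numbers of agents are equally close to $k$, this sketch picks the smaller one; the text leaves the tie-breaking rule open.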
Discussion Our main motivation for training multiple ranking functions for different numbers of agents is
that, as will be shown in Section 2.7.2, the ranking functions learned with DAgger(·, ·) do not generalize
well to instances with different numbers of agents, especially when those numbers are substantially larger
than the one we train on. There are two reasons for this issue: (1) We are not able to normalize the feature
values based on their minimums and maximums as we did for CBS+ML since we need to compute the
$d$-value of a CT node immediately when it is generated during the search, but the minimums and maximums
are not known until ECBS terminates. (2) Different features are important for instances with different
numbers of agents, which will also be shown in Section 2.7.2. Therefore, we learn node-selection strategies
specific to the number of agents, and we use curriculum learning to learn them efficiently. In contrast, we
did not use curriculum learning in Section 2.6 since we observed that the ML models generalized well to
MAPF instances with larger numbers of agents than the training instances. However, it is possible that
curriculum learning can further improve the performance of CBS+ML.
There are two heuristic components in our training algorithm. The first component is the design of the
stopping criterion for curriculum learning (Line 5 in Algorithm 3). It assumes that, if we cannot improve on
$\pi_{i-1}$ in iteration $i$ of the training algorithm, we are not able to improve on it in subsequent iterations either.
The other component is the criterion for choosing the best ranking function in DAgger. One could argue that the
best ranking function should be chosen based on the performance on a set of validation instances drawn
from the same distribution as for training (Lines 8-9 in Algorithm 4). However, we do not use validation
instances since this would approximately double the runtime of DAgger and, thus, also of the training
algorithm if the numbers of instances for validation and training were the same. Our criterion allows the
training algorithm to select a good ranking function more efficiently.
Grid Map    Random    Warehouse    Maze      Game      City
$w$         1.1       1.05         1.01      1.005     1.005
$m$         10        11           9         16        13
$k_1$       75        140          45        80        160
$k_m$       125       240          125       305       400
$|V|$       819       5,699        14,818    28,178    47,240
Table 2.5: Parameters for each grid map. $w$ is the suboptimality factor, $m$ is the number of different
numbers of agents we train and test on, $k_1$ is the smallest number of agents that we train and test on, $k_m$
is the largest number of agents that we train and test on, and $|V|$ is the number of unblocked cells on the
grid map. $k_2, \ldots, k_{m-1}$ are evenly distributed on $[k_1, k_m]$, i.e., $k_i = (i - 1)(k_m - k_1)/(m - 1) + k_1$.
Grid Map                   Random    Warehouse    Maze      Game      City
$l_1$                      0.0075    0.0088       0.0092    0.0085    0.0051
$l_{\lfloor m/2 \rfloor}$  0.0330    0.0166       0.0192    0.0131    0.0068
$l_m$                      0.0653    0.0283       0.0318    0.0204    0.0107
Table 2.6: Loss $l_i \in [0, 1]$ of ranking function $\pi_i$ for $k_i$ agents evaluated by Equation (2.1) averaged over
all CTs in the training data.
We learn the ranking function differently from Section 2.6 in two respects. First, we use a weighted
version of the loss function since the unweighted version did not perform well on any of the grid maps
that we tested in our empirical evaluation in the next subsection except the random map. Second, we consolidate
the training dataset by considering the pairs of CT nodes that occur in multiple $P_{\mathcal{T}'}$ only once in the loss
function to reduce the runtime for learning the ranking function. We did not do the same thing in Section
2.6 since the number of pairs of actions (i.e., conflicts at a CT node) for conflict selection in CBS was much
smaller than the number of pairs of actions (i.e., CT nodes in the focal list) for node selection in ECBS.
2.7.2 Empirical Evaluation
In this subsection, we demonstrate the efficiency of ECBS+ML through experiments. In the following, we
introduce our evaluation setup and then present the results.
2.7.2.1 Setup
We implement ECBS+ML in C++ and conduct our experiments on a 2.4 GHz Intel Core i7 CPU with 16 GB
RAM. During testing, we compare against ECBS with the node-selection strategies $h_1$, $h_2$ and $h_3$, denoted
by ECBS+$h_1$, ECBS+$h_2$ and ECBS+$h_3$, respectively. In our ablation study, we also compare against two
variants of ECBS+ML: one that stops training early, denoted by ECBS+ML(ES), and one that uses only imitation
learning without curriculum learning, denoted by ECBS+IL. ECBS+ML(ES) uses the same training algorithm as
ECBS+ML except that it stops training earlier than ECBS+ML. Specifically, the number of agents in its last
training iteration is the one where the success rate of ECBS+$h_1$ first drops below 60%. ECBS+IL uses the same
training algorithm as ECBS+ML except that, for each number of agents, it learns a ranking function starting
from the given initial ranking function without relying on the previously learned one. To this end, we replace
Line 4 in Algorithm 3 with "$\pi_i \leftarrow$ DAgger($\pi^*$, $\mathcal{I}_i$)" and "$\pi_i = \pi_{i-1}$" on Line 5 with "$\pi_i = \pi^*$". We set
the runtime limit to 5 minutes per instance for running ECBS for both data collection and testing. The number
of solutions $T$ collected during data collection is set to 10. The thresholds $\tau_0$, $\tau_1$ and $\tau_2$ that determine
the labels are set to 10, 30 and 60, respectively. The number of iterations $R$ for DAgger is set to 10. The
damping factor $r$ for weight $\tilde{w}_{N_i, N_j}$ is set to 0.3727. $r$ is chosen so that a CT node at depth $0.6 d_{\max}$ has
weight $e^{-0.6/r} = 0.2$. Since we are using a pairwise loss, we suffer from a quadratic time complexity
($O(|\mathcal{T}|^2)$) for the loss computation. Therefore, we record only the first 10,000 CT nodes generated for each
instance during data collection. We use the default values for all parameters in LIBLINEAR, including the
regularization parameter $C$, which is set to 1. We did not try out many other values for the hyperparameters
since the improvement of ECBS+ML over ECBS is already substantial with these values.
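The choice of the damping factor can be checked numerically. Solving $e^{-0.6/r} = 0.2$ for $r$ gives
$r = 0.6 / \ln 5$; the short sketch below evaluates this expression and the node weight obtained with the
reported value 0.3727. The variable names are illustrative.

```python
import math

depth_fraction = 0.6   # depth of the reference node as a fraction of d_max
target_weight = 0.2    # desired weight of a node at that depth

# Solve exp(-depth_fraction / r) = target_weight for r.
r_exact = -depth_fraction / math.log(target_weight)

# Weight that the reported damping factor assigns at depth 0.6 * d_max.
weight_with_reported_r = math.exp(-depth_fraction / 0.3727)
```

The exact solution agrees with the reported value of 0.3727 to within about $10^{-4}$.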
We evaluate ECBS+ML on five grid maps of different sizes and structures from the MAPF benchmark
[181], including (1) a random map "random-32-32-20", which is a 32 × 32 grid map with 20% randomly
blocked cells; (2) a warehouse map "warehouse-10-20-10-2-1", which is a 163 × 63 grid map with 200
10 × 2 rectangular obstacles; (3) a maze map "maze-128-128-10", which is a 128 × 128 grid map with
ten-cell-wide corridors; (4) a game map "den520d", which is a 257 × 256 grid map from the video game
Dragon Age: Origins; and (5) a city map "Paris_1_256", which is a 256 × 256 grid map of Paris. Since
ECBS scales better than CBS, we use MAPF instances with larger sizes and higher obstacle and agent
densities than the MAPF instances used in Section 2.6.2. For example, we increase the sizes of
the random map and the warehouse map. We also increase the number of agents in MAPF instances on
the same city map. We use 25 MAPF instances for both training and testing. The parameters related to
each grid map are listed in Table 2.5. $k_1$ is chosen such that at least one of ECBS+$h_i$ ($i = 1, 2, 3$) has a success
rate of 88% or higher. We fix the increment between $k_i$ and $k_{i+1}$ for each grid map, and $k_m$ is chosen such
that either the success rate of ECBS+ML falls below 20% or all ECBS+$h_i$ ($i = 1, 2, 3$) have 0% success rates. We
fix the suboptimality factor $w$, following the reasoning in previous work [10], where small $w$ values are
chosen for large grid maps and large $w$ values are chosen for small grid maps. Our objective is to obtain a
ranking function $\pi_i$ for each number of agents $k_i$ in $\{k_1, \ldots, k_m\}$. The training losses of the learned ranking
functions, shown in Table 2.6, are small. We now test the node-selection strategies that correspond to
those ranking functions on unseen instances with $k_1, \ldots, k_m$ agents.
2.7.2.2 Results
Success Rate and Runtime Figure 2.1 plots the success rates on all grid maps. Overall, ECBS+ML
substantially outperforms the three baselines, ECBS+$h_1$, ECBS+$h_2$ and ECBS+$h_3$, on all grid maps. On the
game map, in particular, the success rates of the baselines drop below 20% when the number of agents
increases to 170, and the baselines can hardly solve instances with more than 245 agents, while the success
rates of ECBS+ML stay above 76% for up to 245 agents and ECBS+ML can still solve 16% of the instances
with 305 agents. Overall, the success rates of ECBS+ML are 52% to 80% when those of the baselines begin
to drop below 20%. When the success rates of the baselines are all below 8%, ECBS+ML can still solve
instances with 9% to 17% more agents with success rates around 12% to 20%. To demonstrate the efficiency
of ECBS+ML further, we show the success rates for different runtime limits in Figure 2.2. We show one
figure for each grid map with a fixed number of agents, namely the smallest number of agents $k_i$ where
the baseline with the weakest heuristic has a success rate below 50% for a runtime limit of 5 minutes. In
these cases, ECBS+ML has a success rate above 80% and still substantially outperforms the baselines for
shorter runtime limits, e.g., of 1 or 2 minutes.
Figure 2.1: Success rates for a runtime limit of 5 minutes as a function of the number of agents for each
grid map. The values of $w$ and the numbers of agents are listed in Table 2.5. For ECBS+ML, ECBS+ML(ES)
and ECBS+IL, the vertical line of the same color indicates the number of agents in the last iteration where
a ranking function is learned in the training algorithm. In the figure for the warehouse map, the graph of
ECBS+$h_1$ coincides entirely with the one of ECBS+$h_2$.
Figure 2.2: Success rates for a fixed number of agents as a function of the runtime limit for each grid map.
Figure 2.3: Success rates for a runtime limit of 5 minutes as a function of the suboptimality factor $w$ on
the random map for 95 agents. The vertical brown line indicates the value of $w$ in the last iteration where
a ranking function is learned for ECBS+ML($w$).
We have applied curriculum learning (Algorithm 3) to learn node-selection strategies for MAPF instances
with increasing difficulty in terms of the number of agents. Next, we demonstrate that we can
do the same for other measures of difficulty, such as the suboptimality factor $w$. When $w$ decreases,
the difficulty increases. Figure 2.3 shows the success rates of ECBS+ML and the baselines for a runtime
limit of 5 minutes as a function of the suboptimality factor $w$ on the random map for 95 agents. We use
ECBS+ML with the ranking function obtained in the experiment described in the previous paragraph that
is trained on 95 agents and $w = 1.1$. To show that the success rates of ECBS+ML can be improved with
curriculum learning, we use the same training algorithm (Algorithm 3) but, instead of using a fixed value
for $w$ and different numbers of agents, we use a fixed number of agents and different values of $w$, namely,
$w_i \in \{1.09, 1.08, \ldots, 1.05\}$. We then obtain a ranking function for each $w_i$. Figure 2.3 shows the success
rates of the resulting variant ECBS+ML($w$). By applying curriculum learning over different values of $w$,
ECBS+ML($w$) achieves higher success rates than ECBS+ML, which just generalizes the ranking function for
$w = 1.1$ to other values of $w$.
Ablation Analysis To assess the effect of curriculum learning, we perform two ablation analyses. First,
we experiment with ECBS+ML(ES). The success rates of ECBS+ML(ES) are shown in Figure 2.1. ECBS+ML(ES)
is competitive with ECBS+ML on the random, warehouse and maze maps and outperforms all baselines
on the random and warehouse maps, but its success rates on the city and game maps drop dramatically
beyond the number of agents at which ECBS+ML(ES) stopped training. The results imply that the learned
node-selection strategy does not generalize well to larger numbers of agents on some grid maps and that
curriculum learning helps to learn better strategies in those cases.
Second, we experiment with ECBS+IL. The success rates of ECBS+IL are shown in Figure 2.1. ECBS+IL
outperforms the baselines but not as substantially as ECBS+ML. The results show two further advantages
of curriculum learning: (1) It enables learning for one to three more iterations (see the gaps between
the vertical lines for ECBS+ML and ECBS+IL in Figure 2.1) than ECBS+IL by enabling DAgger to collect
more data for training due to being provided with better node-selection strategies for this purpose; and (2)
it obtains better node-selection strategies based on the previously learned strategies than ECBS+IL, which
learns the node-selection strategy from the given initial ranking function in every iteration.
Figure 2.4: Feature importance plots. (a) Permutation feature importance of the learned ranking functions
for different numbers of agents on the maze map. (b) Permutation feature importance of the learned ranking
functions for different grid maps. We restate the definitions of some atomic features here (see Section
2.7.1.2 for the full list): $f_1$ is the number of conflicts, $f_2$ is the number of pairs of agents that have at least
one conflict with each other, $f_3$ is the number of agents that have at least one conflict with other agents,
and $f_9$ is the depth of the CT node.
Feature Importance Next, we study the feature importance of the learned ranking functions of ECBS+ML,
measured by the permutation feature importance [4] of each feature, which is the increase in the loss on
the training data after randomly permuting the values of that feature across all CT nodes for each CT in the
training data. In Figure 2.4, we plot the normalized permutation feature importance of the top 12 features
of the ranking functions for some numbers of agents and some grid maps. We first study the important
features of the ranking functions when varying the numbers of agents for a single grid map, as shown in
Figure 2.4a. We choose the maze map as a representative example to show that the learned node-selection
strategies change as the number of agents increases. For 45 agents, the most important features are related
to $f_1$ (the number of conflicts), followed by some features related to $f_2$ (the number of pairs of agents that
have at least one conflict with each other). For both 65 and 85 agents, the top 6 features are related to $f_2$,
followed by some features related to $f_3$ (the number of agents that have at least one conflict with other
agents) for 65 agents and $f_1$ for 85 agents. For 105 agents, the most important features are related to $f_3$,
followed by some features related to $f_1$. To show that the set of important features varies across grid maps,
we study the feature importance of the ranking functions for the random, warehouse, game and city maps,
as shown in Figure 2.4b. The ranking functions are for the numbers of agents used in Figure 2.2. For the
random and warehouse maps, the most important features are related to $f_3$, and the feature importance
drops after the 4th feature. For the city map, the most important features include five features related to
$f_9$ (the depth of the CT node). For the game map, the two most important features are also related to $f_9$,
followed by some features related to $f_3$.
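The permutation feature importance used in this analysis can be sketched for a single CT as follows. The
function name, the row-based data layout and the generic `loss_fn` are assumptions made for illustration; in
the analysis above, the loss is Equation (2.1) and the permutation is repeated for each CT in the training data.

```python
import random
from typing import Callable, List


def permutation_importance(loss_fn: Callable[[List[List[float]]], float],
                           rows: List[List[float]],
                           feature_idx: int,
                           rng: random.Random) -> float:
    """Increase in loss after shuffling one feature column across all rows."""
    baseline = loss_fn(rows)
    column = [row[feature_idx] for row in rows]
    rng.shuffle(column)                      # permute the values of one feature
    permuted = [row[:feature_idx] + [value] + row[feature_idx + 1:]
                for row, value in zip(rows, column)]
    return loss_fn(permuted) - baseline


# Toy loss that depends only on feature 0, so feature 1 has zero importance.
targets = [1, 2, 3, 4]
rows = [[1, 10], [2, 20], [3, 30], [4, 40]]


def toy_loss(rs: List[List[float]]) -> float:
    return sum(abs(r[0] - t) for r, t in zip(rs, targets))


unused = permutation_importance(toy_loss, rows, 1, random.Random(0))
used = permutation_importance(toy_loss, rows, 0, random.Random(0))
```

A feature that the loss never reads has an importance of exactly zero, whereas permuting a feature the model relies on can only keep or increase this toy loss.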
2.8 Learning to Select Agent Sets for MAPF-LNS
In this section, we introduce MAPF-ML-LNS to show how our framework can be applied to improve
agent-set selection in MAPF-LNS. We have introduced how our framework can be applied to improve
optimal and bounded-suboptimal MAPF search algorithms. However, both CBS and ECBS often run too
slowly due to proving (sub)optimality during the search, especially when solving large MAPF instances
with high agent or obstacle densities. To tackle these issues, researchers have studied anytime
unbounded-suboptimal MAPF search algorithms. The appeal of an anytime MAPF search algorithm is that it first finds
an initial solution quickly using any existing MAPF search algorithm and, if more runtime is available,
then improves the solution quality over time. MAPF-LNS [117] is a state-of-the-art anytime MAPF search
algorithm that uses Large Neighborhood Search (LNS).
MAPF-LNS uses an agent-based and a map-based heuristic to select agent sets to destroy. The number
of agent sets that could be generated by these (randomized) agent-set selection strategies can be exponential
in the cardinality of the agent sets, and MAPF-LNS randomly selects one of them (namely the one that
is first randomly generated). However, some agent sets might not improve the solution as much as other
agent sets or might even result in no improvement at all, even if they are all generated by the same agent-set
selection strategy. We apply the framework introduced in Section 2.5 and tailor it for agent-set selection
in MAPF-LNS. We then empirically demonstrate the effectiveness and efficiency of MAPF-ML-LNS.
2.8.1 Machine Learning Methodology
Our goal is to learn an agentset selection strategy to improve the solution faster than the existing ones.
The agentset selection strategy is applied in every iteration of MAPFLNS. Thus, we represent the state
of the search with the incumbent solution s = P (the solution with the lowest sum of costs found so far
in the search), and we let the set of actions A(P) = B(P) be the set of agent sets that can be selected
by the strategy. The size of B(P) is exponential in the cardinality of the agent sets. We first propose a
samplingbased expert for agentset selection. The expert reduces the size of A(P) by downsampling a
collection of agent sets using one of the two agentset selection strategies in MAPFLNS. It then replans
the paths of all agents in the sampled agent sets and selects the agent set that reduces the sum of costs the
most. However, the expert is timeconsuming to compute. We therefore learn to imitate the expert with a
linear ranking function. Finally, we use the learned ranking function to guide agentset selection during
the search.
Figure 2.5: Evolution of the solution quality as a function of the number of replans for MAPFLNS, MAPFMLLNS and MAPFLNS with the expert.
2.8.1.1 Expert for AgentSet Selection
Given a MAPF instance and its incumbent solution P, the expert for agentset selection first calls the
agentset selection strategies to sample a collection of S agent sets B(P), where S is a constant that is set
to 20 throughout the experiments. Each agentset sample is generated by a randomized agentset selection
strategy chosen from the agentbased and mapbased heuristics¶ with uniform probability, and its size is
chosen uniformly at random from 5 to 16. For each of the S agent sets, the expert replans the paths of the
agents in it and records the cost improvement, i.e., the resulting decrease in the sum of costs. Finally, the
expert outputs the agent set with the highest rank, i.e., the one with the largest cost improvement.
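The sampling procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: `expert_select`, `sample_agent_set` and `cost_improvement` are hypothetical names, and the two callbacks stand in for the destroy heuristics and the replanning step of MAPF-LNS, which are not reproduced here.

```python
import random
from typing import Callable, FrozenSet, List, Tuple

AgentSet = FrozenSet[int]


def expert_select(
    sample_agent_set: Callable[[str, int, random.Random], AgentSet],
    cost_improvement: Callable[[AgentSet], float],
    rng: random.Random,
    num_samples: int = 20,   # S in the text
    min_size: int = 5,
    max_size: int = 16,
) -> Tuple[AgentSet, List[Tuple[AgentSet, float]]]:
    """Sample agent sets, replan each one, and return the best set plus all results."""
    results: List[Tuple[AgentSet, float]] = []
    for _ in range(num_samples):
        # Pick one of the two heuristics with uniform probability.
        heuristic = rng.choice(["agent-based", "map-based"])
        # randint is inclusive on both ends, matching "from 5 to 16".
        size = rng.randint(min_size, max_size)
        agents = sample_agent_set(heuristic, size, rng)
        # The improvement is the decrease in the sum of costs after replanning.
        results.append((agents, cost_improvement(agents)))
    best = max(results, key=lambda pair: pair[1])[0]
    return best, results


# Toy stand-ins (assumptions for illustration only).
def toy_sampler(heuristic: str, size: int, rng: random.Random) -> AgentSet:
    return frozenset(rng.sample(range(100), size))


def toy_improvement(agents: AgentSet) -> float:
    return float(sum(1 for a in agents if a % 7 == 0))
```

The returned list of (agent set, improvement) pairs is what a data-collection routine would turn into features and labels; the expert itself only needs the arg max.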
We replace the agent-set selection in MAPF-LNS (Lines 5-6 in Algorithm 2) with the expert and compare the resulting version of MAPF-LNS with the expert against MAPF-LNS for 100 agents on the random
map “random-32-32-10”, which is a 32 × 32 grid map from the MAPF benchmark set [181] with 10% randomly blocked cells. The grid map is shown in Table 2.9. We follow the experimental setup introduced in
Section 2.8.2. We allocate a budget of 100 replans to each algorithm (instead of a runtime limit). Figure
2.5 shows how the average sum of costs changes after each replan. The average runtime of MAPF-LNS is
0.8 seconds, while the one of MAPF-LNS with the expert is more than 16 seconds, which is too slow to be
useful for MAPF solving. However, the huge difference between the curves of MAPF-LNS (red) and MAPF-LNS with the expert (blue) in Figure 2.5 suggests that, if we could learn an ML model that approximates
the expert accurately with a small computational overhead during MAPF solving, then a version of MAPF-LNS with ML-guided LNS might be able to improve the solution quality faster early in the search than
MAPF-LNS. The curves of MAPF-ML-LNS (green) in Figures 2.5 and 2.6 show that this is indeed possible.
¶We started this work when an earlier version [118] of MAPF-LNS came out that uses only the two heuristics. MAPF-LNS
[117] actually uses a third heuristic that randomly generates agent sets. We tried adding this heuristic to the expert but saw little
improvement in the results.
| Feature Descriptions | Count |
|---|---|
| Static Features | 6 |
| Distance between $a_i$'s start and goal vertices. | 1 |
| Row and column numbers of $a_i$'s start and goal vertices. | 4 |
| Degree of $a_i$'s goal vertex. | 1 |
| Dynamic Features | 10 |
| Delay of $a_i$. | 1 |
| Ratio between the delay of $a_i$ and the distance between $a_i$'s start and goal vertices. | 1 |
| Minimum, maximum, sum and average of the heat values of the vertices on $a_i$'s path $p_i$: The heat value of vertex $v \in V$ is the number of time steps that $v$ is occupied by an agent. The heat value of a vertex counts multiple times in the sum and average if the vertex is visited by the agent multiple times until it no longer leaves the goal vertex. | 4 |
| Number of time steps that $a_i$ is on a vertex with degree $j$ ($1 \le j \le 4$) until it no longer leaves the goal vertex. | 4 |
Table 2.7: Agent $a_i$'s features with respect to instance $I$ and incumbent solution $P_I = \{p_i : i \in [k]\}$. The
counts are the numbers of features contributed by the corresponding entries.
2.8.1.2 Data Collection
Given an instance I, the incumbent solution P and the number S of agent sets to sample, we describe
the subroutine collectData(I, P, S) that will be used to collect features and labels for P in our learning
algorithm.
For incumbent solution P, we sample a collection B(P) of S agent sets using the expert. For each
B ∈ B(P), we compute a feature vector ϕP (B) and a groundtruth label yP (B) transformed from the
expert’s ranking.
Features To compute the feature vector ϕP (B) of a given agent set B ∈ B(P), we first compute a set
of 16 agent features for each agent ai ∈ {a1, . . . , ak}, which are summarized in Table 2.7. We then divide
the set of agents into two subsets, B and {a1, . . . , ak} \ B. For each subset, we compute the minimum,
maximum, sum and average of the value of each of the 16 agent features over all agents in the subset,
resulting in 4 × 16 = 64 features for the subset and p = 2 × 64 = 128 features for both subsets. We
perform a linear transformation to normalize the value of each feature to the range of [0, 1] across all
agent sets in B(P), where the minimum value of that feature gets transformed into a 0 and the maximum
value gets transformed into a 1. We then concatenate them to obtain the feature vector ϕP (B).
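The feature construction above can be illustrated with a short, dependency-free Python sketch. The function names, the ordering of the statistics inside the vector and the handling of constant features (mapped to 0) are assumptions made for this example; the text does not specify them.

```python
from typing import List, Sequence, Set


def _stats(rows: List[Sequence[float]], num_features: int) -> List[float]:
    """Minimum, maximum, sum and average of every agent feature over one subset."""
    if not rows:  # assumption: an empty subset contributes zeros
        return [0.0] * (4 * num_features)
    out: List[float] = []
    for j in range(num_features):
        column = [row[j] for row in rows]
        out.extend([min(column), max(column), sum(column), sum(column) / len(column)])
    return out


def raw_features(agent_features: List[Sequence[float]], agent_set: Set[int]) -> List[float]:
    """Concatenate the statistics over B and over the remaining agents."""
    num_features = len(agent_features[0])
    inside = [f for i, f in enumerate(agent_features) if i in agent_set]
    outside = [f for i, f in enumerate(agent_features) if i not in agent_set]
    return _stats(inside, num_features) + _stats(outside, num_features)


def normalized_features(
    agent_features: List[Sequence[float]], agent_sets: List[Set[int]]
) -> List[List[float]]:
    """Min-max normalize every feature to [0, 1] across all sampled agent sets."""
    raw = [raw_features(agent_features, b) for b in agent_sets]
    dim = len(raw[0])
    lo = [min(v[j] for v in raw) for j in range(dim)]
    hi = [max(v[j] for v in raw) for j in range(dim)]
    return [
        [(v[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0 for j in range(dim)]
        for v in raw
    ]
```

With 16 agent features per agent, `raw_features` returns 2 × 4 × 16 = 128 values, matching p = 128 in the text.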
Labels A ground-truth label $y_P(B)$ is a value assigned to each agent set $B \in \mathcal{B}(P)$, such that agent sets
that result in higher cost improvements have larger values. We use a simple and intuitive soft labeling
scheme following previous work [102]: Let $\alpha$ and $\beta$ ($\alpha \ge \beta$) be the cost improvements of the agent sets
ranked at the 75th and 50th percentiles by the expert, respectively, and set $y_P(B) = \mathbb{1}[\Delta_B \ge \alpha] + \mathbb{1}[\Delta_B \ge \beta]$,
where $\Delta_B$ is the cost improvement of $B$ (in our study, we achieved similar results when labeling with the 75th,
50th and 25th percentiles as well as the 80th and 50th percentiles). This labeling scheme assigns label 2 to the agent sets
ranked in the top 25%, label 1 to the ones ranked in the top 50% but not the top 25% (i.e., the ones better
than a choice at random) and label 0 to the rest. Our labeling scheme relaxes the definition of the best
agent set and allows us to learn a ranking function that focuses on selecting only high-ranking agent sets
with respect to their cost improvements and avoids having to correctly rank agent sets with small or no
cost improvements. We also tried binary labels, e.g., $y_P(B) = \mathbb{1}[\Delta_B \ge \alpha]$, and using the cost improvements
$\Delta_B$ directly as the labels, but neither performed as well as the proposed scheme.
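As a small worked example of the soft labeling scheme, the following Python sketch assigns labels 2, 1 and 0 from a list of cost improvements. The function name `soft_labels` and the exact percentile convention (the threshold is the improvement of the agent set at position ⌈f · n⌉ from the top) are assumptions; with ties, more than the nominal fraction of agent sets can receive a given label.

```python
import math
from typing import List, Sequence


def soft_labels(improvements: Sequence[float], top_fractions=(0.25, 0.5)) -> List[int]:
    """Label = number of thresholds (alpha, beta) that the cost improvement reaches."""
    ranked = sorted(improvements, reverse=True)
    n = len(ranked)
    # alpha is the improvement at the top-25% cut, beta the one at the top-50% cut.
    thresholds = [ranked[max(0, math.ceil(f * n) - 1)] for f in top_fractions]
    return [sum(1 for t in thresholds if delta >= t) for delta in improvements]
```

For S = 20 agent sets with distinct improvements, this yields five agent sets with label 2, five with label 1 and ten with label 0.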
2.8.1.3 Model Learning
We use imitation learning to learn a strategy for agentset selection. We adapt the dataaggregation algorithm [168] combined with the forward training algorithm [167] to our use case.
Algorithm 5 Training Algorithm
 1: Input: Training instance set $\mathcal{I}_{\text{Train}}$, number $R$ of iterations and number $S$ of agent set samples
 2: for $I \in \mathcal{I}_{\text{Train}}$ do
 3:     $P \leftarrow$ runInitialSolver($I$)
 4:     Record $P$ as the incumbent solution of $I$
 5: $D = \emptyset$
 6: for $r = 1$ to $R$ do
 7:     for $I \in \mathcal{I}_{\text{Train}}$ do
 8:         $P \leftarrow$ incumbent solution of $I$
 9:         collectData($I, P, S$)  ▷ Sample $S$ agent sets for instance $I$ and collect their features and labels using the expert.
10:         $D \leftarrow D \cup \{P\}$  ▷ Then, add the state $P$ to the training dataset.
11:     Train $\pi^{(r)}$ with $D$
12:     for $I \in \mathcal{I}_{\text{Train}}$ do
13:         $P \leftarrow$ incumbent solution of $I$
14:         $\mathcal{B}(P) \leftarrow \emptyset$
15:         for $i = 1$ to $S$ do
16:             $H \leftarrow$ uniformly select one of the two heuristics
17:             $\mathcal{B}(P) \leftarrow \mathcal{B}(P) \cup$ selectAgentSet($I, H$)
18:         $B \leftarrow \arg\max_{B' \in \mathcal{B}(P)} \pi^{(r)}(\phi_P(B'))$
19:         $P^- \leftarrow \{p_i \in P : a_i \in B\}$
20:         $P^+ \leftarrow$ runReplanSolver($I, B, P \setminus P^-$)
21:         if $\sum_{p \in P^+} l(p) < \sum_{p \in P^-} l(p)$ then
22:             $P \leftarrow (P \setminus P^-) \cup P^+$
23:             Update $P$ as the incumbent solution of $I$
24: $\pi \leftarrow$ validate($\{\pi^{(1)}, \ldots, \pi^{(R)}\}$)
25: return $\pi$
The training algorithm, shown in Algorithm 5, takes as input a set $\mathcal{I}_{\text{Train}}$ of training instances and
runs for $R$ iterations. We fix the grid map and the number of agents for the training instances, where
the start and goal vertices of the agents are drawn i.i.d. from a given distribution. The training algorithm
first computes an initial solution $P$ for each $I \in \mathcal{I}_{\text{Train}}$ (Lines 2-4). In each iteration $r$ ($1 \le r \le R$), it
collects training data for each $I \in \mathcal{I}_{\text{Train}}$ by probing the expert and recording its decision with respect to
the incumbent solution $P$ of $I$ as well as the features of the agent sets sampled by the expert (Lines 7-10).
Then, it trains a ranking function $\pi^{(r)}$ that minimizes a loss function over the aggregated training data
$D$ (Line 11). To improve $P$, it evaluates all agent sets in $\mathcal{B}(P)$ to select an agent set $B$ using $\pi^{(r)}$ (Line 18),
replans the paths of all agents in $B$ (Line 20) and updates $P$ if the solution improves (Lines 21-23). After
$R$ iterations, it returns the ranking function that performs best during validation (Lines 24-25). Algorithm
5 repeatedly determines a ranking function that makes good decisions in those situations encountered in
previous iterations when using the previously learned ranking functions to guide agent-set selection.
Algorithm 6 MAPF-ML-LNS
 1: Input: MAPF instance $I$, ranking function $\pi$ and number $S$ of agent set samples
 2: $P = \{p_i : i \in [k]\} \leftarrow$ runInitialSolver($I$)
 3: Initialize the weights $\omega$ of the agent-set selection strategies
 4: while runtime limit not exceeded do
 5:     $\mathcal{B}(P) \leftarrow \emptyset$
 6:     for $i = 1$ to $S$ do
 7:         $H \leftarrow$ selectDestroyHeuristic($\omega$)
 8:         $\mathcal{B}(P) \leftarrow \mathcal{B}(P) \cup$ selectAgentSet($I, H$)
 9:     Compute $\pi(\phi_P(B))$ for all $B \in \mathcal{B}(P)$
10:     for $B \in \mathcal{B}(P)$ in descending order of $\pi(\phi_P(B))$ do
11:         $P^- \leftarrow \{p_i : a_i \in B\}$
12:         $P^+ \leftarrow$ runReplanSolver($I, B, P \setminus P^-$)
13:         Update the weights $\omega$ of the agent-set selection strategies
14:         if $\sum_{p \in P^+} l(p) < \sum_{p \in P^-} l(p)$ then
15:             $P \leftarrow (P \setminus P^-) \cup P^+$
16:             break
17: return $P$
Given the dataset $D$ collected during training, we follow the formulation in Section 2.5 to learn a linear
ranking function
$$\pi : \mathbb{R}^p \to \mathbb{R} : \pi(\phi_P(B)) = w^{\mathsf{T}} \phi_P(B)$$
with parameter $w \in \mathbb{R}^p$ that minimizes the loss function
$$L(w) = \sum_{P \in D} l(y_P, \hat{y}_P) + \frac{C}{2} \lVert w \rVert_2^2.$$
To compute $l(y_P, \hat{y}_P)$, we consider the set of pairs $\mathcal{P}_P = \{(B', B'') : B', B'' \in \mathcal{B}(P) \wedge y_P(B') > y_P(B'')\}$
and calculate it as the fraction of swapped pairs
$$l(y_P, \hat{y}_P) = \frac{\lvert \{(B', B'') \in \mathcal{P}_P : \hat{y}_{B'} \le \hat{y}_{B''}\} \rvert}{\lvert \mathcal{P}_P \rvert}.$$
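The pairwise loss for a single state can be computed directly from the labels and the predicted scores. The sketch below is a plain quadratic-time illustration; the function name is hypothetical, and returning 0 when no pair has distinct labels is an assumption, since the text does not cover that case.

```python
from typing import Sequence


def swapped_pair_fraction(labels: Sequence[float], scores: Sequence[float]) -> float:
    """Fraction of pairs with labels[i] > labels[j] whose scores are not in that order."""
    pairs = 0
    swapped = 0
    for i in range(len(labels)):
        for j in range(len(labels)):
            if labels[i] > labels[j]:        # (B', B'') is in the pair set
                pairs += 1
                if scores[i] <= scores[j]:   # predicted order disagrees (ties count)
                    swapped += 1
    return swapped / pairs if pairs else 0.0
```

A perfect ranking gives 0, a fully reversed ranking gives 1, and ties in the predicted scores count as swapped because the comparison is non-strict.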
2.8.1.4 MLGuided Search
After learning the ranking function π, we deploy it in MAPF-ML-LNS. MAPF-ML-LNS is summarized
in Algorithm 6. In each iteration, given the incumbent solution P, MAPF-ML-LNS samples a collection
B(P) of S agent sets using the two agent-set selection strategies (Lines 6-8) and computes the predicted
score π(ϕP (B)) for each agent set B ∈ B(P) (Line 9). The agent-set selection strategies are chosen from
the agent-based and map-based heuristics with probabilities according to the weights ω maintained by
adaptive LNS. MAPF-ML-LNS replans the paths for the agents in agent sets (Line 12) in descending order
of the predicted scores of the agent sets. If a new incumbent solution is found, it discards the remaining
agent sets, recomputes the agent features and continues to the next iteration (Lines 14-16).
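One iteration of the ranking-guided selection can be sketched as follows. This is an illustrative simplification: `ranked_replan` and `try_replan` are hypothetical names, `try_replan` is assumed to replan the given agent set and return True only if the sum of costs decreases, and the adaptive weight update of LNS is omitted.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def ranked_replan(
    agent_sets: Sequence[T],
    features: Sequence[Sequence[float]],
    weights: Sequence[float],
    try_replan: Callable[[T], bool],
) -> Tuple[Optional[T], int]:
    """Try agent sets in descending order of w^T phi and stop at the first improvement."""
    scores: List[float] = [sum(w * x for w, x in zip(weights, phi)) for phi in features]
    order = sorted(range(len(agent_sets)), key=lambda i: scores[i], reverse=True)
    for attempts, i in enumerate(order, start=1):
        if try_replan(agent_sets[i]):
            # A new incumbent was found: the remaining agent sets are discarded.
            return agent_sets[i], attempts
    return None, len(order)
```

The second return value is the number of replans spent in the iteration, which is the quantity a good ranking function keeps small.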
Given an instance $I$ and its incumbent solution $P = \{p_i : i \in [k]\}$, the time complexity of computing
the 16 agent features is bounded by $O(k + \sum_{i \in [k]} l(p_i))$, which is linear in the number of agents and the
sum of costs. The agent features need to be recomputed only if a new incumbent solution is found. For each
$B \in \mathcal{B}(P)$, $\phi_P(B)$ can be computed in time $O(|B| \cdot p)$. This could be done by preprocessing the largest 16
values, the smallest 16 values and the sum of feature values for each agent feature when computing them.
Discussion One of the main contributions in this section is the expert for agentset selection. Previous
works have applied imitation learning to improve LNS for MILP solving [179, 177]. For MILP, we will
see in Chapter 3 that there is an existing expert called Local Branching [56] to guide selecting which
subset of variables to reoptimize next and [179] learns to predict the subset given by Local Branching.
However, for MAPF, there is no existing expert. Therefore, we design one that leverages spatiotemporal
information by using two heuristics and frame our ML problem as learning an agentset selection strategy
to guide destroying a part of the solution in LNS. Subsequently, this allows us to use a lightweight linear
ML model, such as SVMrank, that is easy to train and fast to evaluate during MAPF solving. We do not
learn how to construct agent sets or predict the cost improvement of given agent sets since these are much
more complicated ML problems that require using larger ML models, such as deep neural networks. We
experimented with graph convolutional networks for these tasks on an agent dependency graph [120] and
ended up with good ML performance but an undesirably large computational overhead due to their high
model complexity, rendering them useless without further indepth engineering.
2.8.2 Empirical Evaluation
In this subsection, we demonstrate the efficiency and effectiveness of MAPFMLLNS through experiments.
In the following, we introduce our evaluation setup and then present the results.
2.8.2.1 Setup
We implement MAPF-ML-LNS in C++ and conduct our experiments on a 2.4 GHz Intel Core i7 CPU with
16 GB RAM. We compare against MAPF-LNS on five grid maps of different sizes and structures from the
MAPF benchmark set [181]: (1) the random map “random-32-32-10”; (2) the game map “den520d”, which
is a 257 × 256 grid map from the video game Dragon Age: Origins; (3) the city map “Paris_1_256”, which is
a 256 × 256 grid map of Paris; (4) the game map “ost003d”, which is a 194 × 194 grid map from the video
game Dragon Age: Origins; and (5) the warehouse map “warehouse-10-20-10-2-1”, which is a 163 × 63 grid
map with 200 10 × 2 rectangular obstacles. The five grid maps are shown in Table 2.9. Compared to the
MAPF instances used in Section 2.7.2, we use grid maps of similar sizes and obstacle densities but increase
the agent densities since MAPF-LNS scales better than ECBS.
MAPF-LNS and MAPF-ML-LNS use the same setup. For the initial search algorithms and each grid
map, we follow [117] and select the MAPF search algorithm from PP, PPS and EECBS that has the highest
success rate on the instances with the largest number of agents within a runtime limit of 10 seconds as
reported by them. We use that MAPF search algorithm consistently for training and MAPF solving. That
is, we use PP as the initial search algorithm for the city and both game maps (den520d and ost003d), and
PPS for the random and warehouse maps. We use PP as the replan search algorithm for all grid maps, since
PP dominates the other MAPF search algorithms, namely CBS and EECBS [117].
| Grid Map | Training k | Average Ranking | Improving Choice | Regret |
|---|---|---|---|---|
| random | 100 | 6.5/20 | 90% | 25% |
| den520d | 200 | 7.0/20 | 96% | 33% |
| city | 250 | 6.7/20 | 99% | 19% |
| ost003d | 100 | 5.4/20 | 91% | 26% |
| warehouse | 100 | 6.0/20 | 90% | 28% |
Table 2.8: Validation results for the learned ranking function π. “Training k” is the number of agents of
the training instances. “Average ranking” is the average rank of the first agent set selected by π among
the S = 20 agent sets. “Improving choice” is the fraction of times π selects an agent set that results in
a positive cost improvement. “Regret” is calculated as the average of 100% minus the cost improvement
achieved by π as a percentage of the cost improvement achieved by the expert.
During training, we run Algorithm 5 for $R = 100$ iterations. For each grid map, we use $|\mathcal{I}_{\text{Train}}| = 16$
instances with a fixed number of agents. The number of agents $k$ of the training instances is reported in
Table 2.8. Since we use a randomized version of PP that uses random agent priorities, the cost improvement of each agent set used for creating its label is the average taken over 6 runs. We use regularization
parameter $C = 0.1$ and the default values for the other parameters in the SVMrank solver. We also tried
$C \in \{0.01, 0.001\}$ and achieved similar results. It takes 2 to 8 hours, depending on the grid map, to run
Algorithm 5 on a single CPU. If collecting training data for the 16 instances were done in parallel on 16
CPUs in each iteration (Line 7 in Algorithm 5), the training time could be reduced to less than 1 hour.
During validation, we evaluate $\pi^{(1)}, \ldots, \pi^{(R)}$ on the validation data and return the ranking function $\pi$ that
selects agent sets with the highest average ranking. We run MAPF-LNS with the expert for 100 iterations
on 4 MAPF instances from the same distribution as the training instances. The validation results for $\pi$ are
summarized in Table 2.8. During testing, we use 25 MAPF instances and set a runtime limit of 60 seconds
per instance. For both MAPF-LNS and MAPF-ML-LNS, the runtime limit for finding the initial solution
is set to 10 seconds. Those instances for which they fail to find an initial solution within 10 seconds are
considered unsolvable and not included in our results. We use the same random seed to ensure that both
methods compute the same instances and initial solutions. The runtime limit of PP per replan is set to 2
seconds for the warehouse map and 0.6 seconds for the other grid maps initially and then adaptively set
to twice the average runtime of all successful replans so far after the first 30 successful replans. We use
the adaptive runtime limits for replanning since we observe that the runtime for unsuccessful replans is
longer than that for successful replans, and the runtime for replans is different on different maps. During
both training and MAPF solving, when generating an agent set using the agentset selection strategies in
LNS, we draw its cardinality uniformly from 5 to 16. We sample S = 20 agent sets in each iteration of
MAPFMLLNS.
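The adaptive runtime limit for replanning described above amounts to a small piece of bookkeeping. The following Python sketch shows one way to express it; the class and method names are hypothetical, and the default of 0.6 seconds corresponds to the non-warehouse maps.

```python
class ReplanTimeLimit:
    """Fixed limit during warm-up, then a multiple of the mean successful-replan runtime."""

    def __init__(self, initial_limit: float = 0.6, warmup: int = 30, factor: float = 2.0):
        self.initial_limit = initial_limit  # seconds (2.0 would be used for the warehouse map)
        self.warmup = warmup                # number of successful replans before adapting
        self.factor = factor                # "twice the average runtime"
        self.num_successes = 0
        self.total_runtime = 0.0

    def record_success(self, runtime: float) -> None:
        """Call after every successful replan with its runtime in seconds."""
        self.num_successes += 1
        self.total_runtime += runtime

    def current(self) -> float:
        """Runtime limit to use for the next replan."""
        if self.num_successes < self.warmup:
            return self.initial_limit
        return self.factor * self.total_runtime / self.num_successes
```

Only successful replans feed the average, which is consistent with the observation that unsuccessful replans tend to run longer.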
2.8.2.2 Results
Our results provide answers to the following questions:
1. If the grid map is known in advance, can we learn a ranking function that performs well on the same
grid map with the same and different numbers of agents?
2. If the grid map is unknown in advance, can we learn a ranking function from other grid maps that
performs well on the unknown one?
We therefore learn two ranking functions with SVMrank for each grid map, namely a ranking function
trained on MAPF instances on that grid map (resulting in MAPF search algorithm MLS) and a ranking
function trained on MAPF instances from the other four grid maps (resulting in MAPF search algorithm
MLO).
| Grid Map | k | AUC Ratio (ML-S) | AUC Ratio (ML-O) | Win/Loss (ML-S) | Win/Loss (ML-O) | Sum of Agents' Delay (Suboptimality), MAPF-LNS | Sum of Agents' Delay (Suboptimality), ML-S | Sum of Agents' Delay (Suboptimality), ML-O |
|---|---|---|---|---|---|---|---|---|
| random | 100 | 1.15±0.23 | 1.12±0.20 | 20/5 | 20/5 | 30 (1.01) | 28 (1.01) | 28 (1.01) |
| random | 150 | 1.14±0.12 | 1.07±0.12 | 22/3 | 21/4 | 105 (1.03) | 96 (1.03) | 96 (1.03) |
| random | 200 | 1.03±0.10 | 1.07±0.19 | 15/9 | 15/9 | 309 (1.07) | 275 (1.06) | 270 (1.06) |
| random | 250 | 0.98±0.17 | 0.95±0.12 | 10/15 | 8/17 | 806 (1.15) | 843 (1.15) | 845 (1.15) |
| random | 300 | 1.13±0.14 | 1.06±0.15 | 18/6 | 13/11 | 4,460 (1.67) | 3,754 (1.56) | 4,301 (1.61) |
| random | 350 | 0.99±0.08 | 0.94±0.08 | 11/12 | 6/17 | 21,310 (3.78) | 22,234 (3.90) | 23,674 (4.08) |
| den520d | 200 | 1.97±0.56 | 1.75±0.53 | 23/2 | 24/1 | 64 (1.00) | 65 (1.00) | 66 (1.00) |
| den520d | 300 | 1.62±0.55 | 1.45±0.43 | 21/4 | 20/5 | 400 (1.01) | 298 (1.00) | 328 (1.01) |
| den520d | 400 | 1.65±0.54 | 1.31±0.30 | 25/0 | 22/3 | 1,327 (1.02) | 778 (1.01) | 1,121 (1.01) |
| den520d | 500 | 1.25±0.35 | 1.13±0.22 | 19/6 | 18/7 | 3,616 (1.04) | 2,676 (1.03) | 3,281 (1.03) |
| den520d | 600 | 1.10±0.15 | 1.10±0.08 | 18/7 | 24/1 | 8,134 (1.08) | 6,654 (1.06) | 6,967 (1.07) |
| den520d | 700 | 1.07±0.06 | 1.05±0.06 | 22/3 | 20/5 | 12,558 (1.10) | 11,785 (1.10) | 11,535 (1.09) |
| city | 250 | 1.75±0.41 | 1.14±0.32 | 22/3 | 14/11 | 229 (1.00) | 110 (1.00) | 128 (1.00) |
| city | 350 | 1.12±0.34 | 1.02±0.24 | 19/6 | 14/11 | 469 (1.01) | 372 (1.01) | 368 (1.01) |
| city | 450 | 1.30±0.35 | 1.01±0.22 | 19/6 | 13/12 | 763 (1.01) | 629 (1.01) | 753 (1.01) |
| city | 550 | 1.05±0.18 | 1.06±0.24 | 16/9 | 14/11 | 1,932 (1.02) | 2,056 (1.02) | 1,536 (1.01) |
| city | 650 | 1.08±0.13 | 1.10±0.25 | 17/8 | 17/8 | 3,274 (1.03) | 3,041 (1.02) | 3,033 (1.02) |
| city | 750 | 1.07±0.14 | 1.09±0.08 | 17/6 | 19/4 | 8,371 (1.06) | 8,363 (1.06) | 7,413 (1.05) |
| ost003d | 100 | 1.28±0.33 | 1.17±0.28 | 21/4 | 15/10 | 42 (1.00) | 42 (1.00) | 42 (1.00) |
| ost003d | 200 | 1.43±0.36 | 1.20±0.27 | 19/4 | 17/6 | 458 (1.01) | 332 (1.01) | 372 (1.01) |
| ost003d | 300 | 1.14±0.19 | 1.16±0.16 | 16/8 | 20/4 | 2,509 (1.05) | 2,379 (1.05) | 2,152 (1.04) |
| ost003d | 400 | 1.05±0.08 | 1.06±0.08 | 17/6 | 17/6 | 6,907 (1.11) | 6,584 (1.10) | 6,417 (1.10) |
| ost003d | 500 | 1.02±0.03 | 1.04±0.05 | 15/7 | 16/6 | 14,750 (1.19) | 14,431 (1.19) | 14,251 (1.18) |
| ost003d | 600 | 1.02±0.03 | 1.03±0.04 | 14/6 | 16/4 | 24,684 (1.27) | 24,468 (1.27) | 24,401 (1.27) |
| warehouse | 100 | 1.35±0.33 | 1.25±0.30 | 20/5 | 20/5 | 57 (1.01) | 37 (1.00) | 37 (1.00) |
| warehouse | 150 | 1.21±0.24 | 1.14±0.22 | 18/7 | 16/9 | 295 (1.02) | 195 (1.01) | 217 (1.02) |
| warehouse | 200 | 1.19±0.22 | 1.05±0.13 | 21/4 | 15/10 | 925 (1.06) | 736 (1.05) | 842 (1.05) |
| warehouse | 250 | 1.17±0.20 | 1.11±0.18 | 17/8 | 16/9 | 1,817 (1.09) | 1,595 (1.08) | 1,805 (1.09) |
| warehouse | 300 | 1.18±0.21 | 1.13±0.19 | 17/8 | 18/7 | 4,719 (1.20) | 3,852 (1.16) | 3,547 (1.15) |
| warehouse | 350 | 1.07±0.10 | 1.02±0.07 | 15/9 | 13/11 | 12,004 (1.43) | 10,191 (1.36) | 12,143 (1.43) |
Table 2.9: The average ratios of the AUCs of MAPF-LNS and variants of MAPF-ML-LNS (ML-S and ML-O)
with their standard deviations, the win/loss counts with respect to the AUCs and the average sums of
delays with the average suboptimalities for a runtime limit of 60 seconds. All entries take only the solved
MAPF instances into account. We bold the number of agents k on which ML-S is trained and the entries
where a variant of MAPF-ML-LNS outperforms MAPF-LNS.
Figure 2.6: Evolutions of the sum of costs (solid curves with the y-axis on the left side, smaller is better)
from 1 second to 60 seconds for MAPF-LNS, ML-S and ML-O, averaged over all solved instances, and the
average ratio of the AUCs of MAPF-LNS and one of ML-S and ML-O (dotted curves with the y-axis on the
right side, greater than 1 is better), also averaged over all solved instances, as a function of the runtime.
The error bars represent the standard deviation.
Solution Quality and the Speed of Improving the Solution An important metric for evaluating the
performance of an anytime MAPF search algorithm is its speed of improving the solution. Let $\mathcal{I}_{\text{Test}}$ be the
set of test instances and, for each $I \in \mathcal{I}_{\text{Test}}$, let $t^S_{I,\text{init}}$, $\text{SOC}^S_I(t)$ and $\text{SOD}^S_I(t)$ be the runtime needed to
find the initial solution, the sum of costs and the sum of delays of the solution at runtime $t$, respectively,
when solving instance $I$ using MAPF search algorithm $S$. Following [117], we compute the Area Under
the Curve (AUC) of the sum of delays as a function of the runtime of MAPF search algorithm $S$ on each
instance $I$, which is formally defined as $\text{AUC}^S_I(t_{\text{limit}}) = \int_{t^S_{I,\text{init}}}^{t_{\text{limit}}} \text{SOD}^S_I(t)\,dt$, where $t_{\text{limit}}$ is the runtime
limit (60 seconds). The smaller the AUC, the higher the speed of improving the solution is. In Table 2.9, we
report the average ratios of the AUCs of MAPF-LNS and our MAPF search algorithms, the win/loss counts
with respect to the AUC and the average sums of delays with the average suboptimalities over all solved
test instances∥. The win/loss counts are the numbers of instances where the AUCs of ML-S or ML-O are
smaller/larger than those of MAPF-LNS. The suboptimalities are overestimated values calculated as the
ratio between the final sum of costs and the sum of distances between the agents’ start and goal vertices.
On the city and both game maps (den520d and ost003d), the AUCs of MAPFLNS are 43% to 97% worse than
the ones of MLS. On these three maps, MLS substantially outperforms MAPFLNS also with respect to
the win/loss counts and, for almost all tested numbers of agents, with respect to the final solution qualities.
On the random and warehouse maps, MLS outperforms MAPFLNS with respect to all metrics, except for
a few cases with large numbers of agents (250 and 350 agents on the random map). Even though MLS
learns the ranking functions on MAPF instances with a fixed number of agents, they generalize well to
MAPF instances with larger numbers of agents on the same grid map and outperform MAPFLNS in almost
all cases. MLO also substantially outperforms MAPFLNS. MLO, without seeing the test grid map during
training, is competitive with MLS and even outperforms it sometimes on both game maps and the city
map. For the random map, the improvement of MAPFMLLNS over MAPFLNS is not as substantial as
for the other grid maps, especially on MAPF instances with large numbers of agents. We tried retraining
the ranking functions on MAPF instances with larger numbers of agents (e.g., 250 agents for the random
map) but achieved similar results. It is future work to improve the effectiveness of MAPFMLLNS on this
grid map.
∥All search algorithms have the same set of solved instances since they use the same initial search algorithm with the same
random seeds.
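Because the incumbent solution changes only when a replan succeeds, the sum of delays is a step function of the runtime, and the AUC used in this evaluation reduces to a sum of rectangles. The sketch below illustrates this under that piecewise-constant assumption; `auc_step` and the event-list format are hypothetical and not taken from the thesis code.

```python
from typing import List, Tuple


def auc_step(events: List[Tuple[float, float]], t_limit: float) -> float:
    """Area under a step function of the sum of delays.

    events: (time, sum_of_delays) pairs sorted by time; the first pair is the
    initial solution, and each value holds until the next event or t_limit.
    """
    area = 0.0
    next_times = [t for t, _ in events[1:]] + [t_limit]
    for (t, value), t_next in zip(events, next_times):
        if t >= t_limit:
            break  # improvements found after the limit do not count
        area += value * (min(t_next, t_limit) - t)
    return area


def auc_ratio(baseline: List[Tuple[float, float]],
              ours: List[Tuple[float, float]], t_limit: float) -> float:
    """Ratio of the baseline AUC to ours; values above 1 favor our algorithm."""
    return auc_step(baseline, t_limit) / auc_step(ours, t_limit)
```

For example, an algorithm that finds an initial solution with 100 delays at 1 second, improves to 60 at 3 seconds and to 50 at 10 seconds has an AUC of 100·2 + 60·7 + 50·50 = 3120 for a 60-second limit.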
To demonstrate the effectiveness of our MAPF search algorithms further, we show the average sum
of costs for MAPF-LNS, ML-S and ML-O in Figure 2.6 together with the average ratios between the AUCs of
MAPF-LNS and one of ML-S and ML-O as functions of the runtime limit $t_{\text{limit}}$, i.e.,
$$\frac{1}{|\mathcal{I}_{\text{Test}}|} \sum_{I \in \mathcal{I}_{\text{Test}}} \text{SOC}^S_I(t_{\text{limit}}) \quad \text{and} \quad \frac{1}{|\mathcal{I}_{\text{Test}}|} \sum_{I \in \mathcal{I}_{\text{Test}}} \frac{\text{AUC}^{\text{MAPF-LNS}}_I(t_{\text{limit}})}{\text{AUC}^S_I(t_{\text{limit}})}$$
for each $S \in \{\text{ML-S}, \text{ML-O}\}$. In these cases, ML-S and ML-O establish
advantages early in the search and substantially outperform MAPF-LNS for several shorter runtime limits,
e.g., 20 or 30 seconds.
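| Grid Map | k | Number of Replans (MAPF-LNS) | Number of Replans (ML-S) |
|---|---|---|---|
| random | 100 | 19,075 | 15,892 |
| random | 200 | 6,398 | 5,673 |
| random | 300 | 1,002 | 711 |
| den520d | 200 | 1,138 | 932 |
| den520d | 400 | 633 | 620 |
| den520d | 600 | 401 | 374 |
| city | 250 | 1,978 | 1,452 |
| city | 450 | 1,314 | 1,101 |
| city | 650 | 794 | 783 |
| ost003d | 100 | 1,398 | 1,044 |
| ost003d | 300 | 419 | 317 |
| ost003d | 500 | 168 | 138 |
| warehouse | 100 | 3,152 | 2,706 |
| warehouse | 200 | 874 | 695 |
| warehouse | 300 | 241 | 188 |
Table 2.10: The average number of replans of MAPF-LNS and ML-S for a runtime limit of 60 seconds.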
The runtime overhead of MAPFMLLNS induced by computing the features and evaluating the ranking function is small. Table 2.10 shows the average number of replans of MAPFLNS and MLS for a
runtime limit of 60 seconds. MAPFMLLNS performs fewer replans than MAPFLNS on average. These
results suggest that our learned ranking functions select agent sets more effectively since they improve
the solutions faster and achieve better solution qualities than MAPFLNS with fewer replans.
Feature Importance Finally, we study the feature importance of the learned ranking function for MLS
for each grid map, measured by the absolute values of the learned feature weights. It makes sense to do
so since the features are normalized. Features related to the delays are the most important ones for all five
grid maps. The other important features are related to the costs of the paths, the ratios between the delays
and costs, the sums of the heat values on the paths and the numbers of time steps that the agents are on a
vertex with degree 2 or 3 (see Table 2.7 for definitions).
2.9 Learning to Prioritize Agents for PP
In this section, we introduce PP+ML to show how our framework can be applied to improve the assignment
of priorities to agents in prioritized planning (PP) [175]. PP is one of the fastest algorithms for solving
MAPF suboptimally. It is based on a simple planning scheme [49]: It assigns each agent a unique priority
and computes, in descending priority ordering, each agent’s costminimal path that avoids conflicts with
both static obstacles and the alreadyplanned agents (moving obstacles). Because of its computational
efficiency and simplicity, PP remains the most commonlyadopted MAPF algorithm in practice [197]. For
example, PP is commonly used to find the initial solution in LNSbased MAPF search algorithms [117, 91].
However, its solution quality is sensitive to the predetermined priority ordering. Good priority orderings
can yield (near)optimal solutions, whereas bad priority orderings can lead to solutions with large sums of
costs or even failures to find any solution for solvable MAPF instances, as shown in Figure 2.7.
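The planning scheme of PP can be made concrete with a compact Python sketch on a 4-connected grid. It is a simplified illustration rather than the implementation used in the experiments: it uses a breadth-first search over (cell, time step) pairs instead of space-time A*, assumes unit move and wait costs, checks vertex and swap conflicts only, and cuts the search off at a hypothetical `horizon`. All function names are illustrative.

```python
from collections import deque
from typing import Dict, List, Optional, Sequence, Tuple

Cell = Tuple[int, int]
Path = List[Cell]


def position(path: Path, t: int) -> Cell:
    """An agent stays at the last cell of its path (its goal) forever."""
    return path[min(t, len(path) - 1)]


def plan_path(grid: Sequence[str], start: Cell, goal: Cell,
              planned: List[Path], horizon: int) -> Optional[Path]:
    """Shortest path in space-time that avoids the already-planned paths."""
    rows, cols = len(grid), len(grid[0])
    # The agent may only stop at its goal after every planned visit to that cell.
    last_use = max((t for p in planned for t, c in enumerate(p) if c == goal), default=-1)
    parent = {(start, 0): None}
    queue = deque([(start, 0)])
    while queue:
        cell, t = queue.popleft()
        if cell == goal and t > last_use:
            path, node = [], (cell, t)
            while node is not None:
                path.append(node[0])
                node = parent[node]
            return path[::-1]
        if t >= horizon:
            continue
        r, c = cell
        for nxt in ((r, c), (r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == "#":
                continue
            if (nxt, t + 1) in parent:
                continue
            vertex = any(position(p, t + 1) == nxt for p in planned)
            swap = any(position(p, t) == nxt and position(p, t + 1) == cell for p in planned)
            if vertex or swap:
                continue
            parent[(nxt, t + 1)] = (cell, t)
            queue.append((nxt, t + 1))
    return None


def prioritized_planning(grid: Sequence[str], starts: Dict[int, Cell], goals: Dict[int, Cell],
                         order: Sequence[int], horizon: int = 50) -> Optional[Dict[int, Path]]:
    """Plan agents one by one in the given priority order; None if some agent fails."""
    planned: List[Path] = []
    paths: Dict[int, Path] = {}
    for agent in order:
        path = plan_path(grid, starts[agent], goals[agent], planned, horizon)
        if path is None:
            return None
        planned.append(path)
        paths[agent] = path
    return paths
```

On a 2 × 3 empty grid with two agents that swap the ends of the top row, the higher-priority agent takes the direct path of cost 2 and the lower-priority agent detours through the bottom row at cost 4, which illustrates how the ordering shapes the sum of costs.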
Existing PP algorithms use either randomized assignments or greedy heuristics to determine the priority ordering, such as the querydistance heuristic [17], leastoption heuristic [191, 196] and startandgoalconflict heuristic [24, 116]. However, these handcrafted heuristics have been developed in the context of
specific usage scenarios, and none of them dominates the others in all cases in terms of the success rate
and solution quality (measured by the sum of costs). We apply our framework introduced in Section 2.5
to this task and tailor it for priority assignments for agents in PP. We then empirically demonstrate the
effectiveness and efficiency of PP+ML.
Figure 2.7: Normalized sum of costs (i.e., we normalize them by taking the ratio of the sum of costs of the
solution over the sum of the lengths of the individually costminimal paths of all agents) of 100 PP runs with
different random priority orderings on MAPF instance “room32324random1.scen” from [181] with 20
agents, sorted in increasing order of their normalized sums of costs. PP runs that fail to find a solution are
shown on the top of the plot.
2.9.1 Machine Learning Methodology
Our goal is to learn a priorityassignment strategy that increases the success rate of PP and its solution
quality compared to a humandesigned strategy. In general, a priorityassignment strategy takes the MAPF
instance and the alreadyplanned paths as input and outputs the agent that will be planned next (i.e., it
assigns the highest priority to this agent among the remaining ones). The state of the search is represented
by the MAPF instance and the alreadyplanned paths. The actions are the agents that have not been
planned yet. For example, the leastoption heuristic recalculates the number of path options for each agent
every time a new path is planned [191]. On the other hand, the total priority ordering does not necessarily
have to be either determined online during the planning process or based on the agents that have already
been assigned priorities. Many previous works simplify the priorityassignment strategy to consider only
the MAPF instance and, therefore, determine a total priority ordering, i.e., select all actions, before starting
to plan the first costminimal path. Such a strategy is simple and easy to implement, and does not require
extra overhead for computing the next agent to plan a path for during the planning process, in contrast to
an online strategy. In the following, we learn such a strategy. Thus, we represent the state of the search
with the MAPF instance s = I and let the set of actions A(I) = {a1, . . . , ak} be the set of agents. The
learned priorityassignment strategy, represented by a ranking function, sequentially selects an agent with
the highest rank without replacement to produce a total priority ordering in descending order. We first
propose a samplingbased expert for assigning agents’ priorities. The expert randomly samples a set of
sequences of k agents without replacement to form a set of total priority orderings. It then uses PP to
plan the paths for all agents for each total priority ordering and outputs a total or partial priority ordering
based on the resulting sums of costs. However, the expert is timeconsuming to compute. We, therefore,
learn to imitate the expert with a linear ranking function. Finally, we use the learned ranking function to
assign the agents’ priorities in PP.
2.9.1.1 Expert for Assigning Agents’ Priorities
Given a MAPF instance I, the expert for assigning agents’ priorities outputs a priority ordering ≺I instead
of a single decision, from which a sequence of decisions to be made at state I can be derived. To obtain
the priority ordering ≺I , the expert first runs PP repeatedly for x times with randomly generated total
priority orderings on the agents in instance I. There are two variants of the expert to construct ≺I . One
outputs a total priority ordering and the other one outputs a partial priority ordering. We denote the two
variants by OT and OP .
OT sets ≺I to the total priority ordering that generates the solution with the smallest sum of costs
among the x runs. It is simple and straightforward but has two drawbacks. First, the total priority ordering
may be arbitrary in places. For example, if agents ai and aj are located far away from each other and do
not collide with each other, then it does not matter which agent has the higher priority. Second, the total
priority ordering is based on a single example, which may not be sufficiently robust.
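The total-ordering expert described above can be sketched as follows. This is a minimal illustration rather than our implementation: `run_pp` is a hypothetical callable standing in for a PP solver, assumed to take the agents listed from highest to lowest priority and to return the sum of costs of the resulting solution, or `None` if PP fails.

```python
from __future__ import annotations

import random
from typing import Callable, Optional, Sequence


def expert_total_ordering(
    agents: Sequence[int],
    run_pp: Callable[[Sequence[int]], Optional[float]],
    x: int = 100,
    seed: int = 0,
) -> Optional[list[int]]:
    """Return the sampled total priority ordering with the smallest sum of costs.

    `run_pp` is a hypothetical stand-in for a PP solver (an assumption of this
    sketch): it receives agents from highest to lowest priority and returns the
    sum of costs, or None when PP finds no solution.
    """
    rng = random.Random(seed)
    best_order, best_cost = None, float("inf")
    for _ in range(x):
        # A uniformly random permutation of the agents is one total ordering.
        order = rng.sample(list(agents), k=len(agents))
        cost = run_pp(order)
        if cost is not None and cost < best_cost:
            best_order, best_cost = order, cost
    return best_order  # None if every sampled run failed
```

Because only the single best run is kept, the returned ordering inherits both drawbacks noted above: it fixes the relative order of agents that never interact, and it rests on one sample.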
Motivated by these drawbacks, we propose $O_P$ to collect the $x' \geq 1$ samples that result in the smallest
sums of costs from the $x$ runs and generate a partial priority ordering by imposing an ordering on two
agents only if swapping their priorities can decrease the sum of costs substantially. It works as follows:
For each PP run $p = 1, \ldots, x$, it starts with an empty partial priority ordering $\prec_I^p$. Each iteration of
PP calls space-time A* [175] (i.e., A* that searches the space-time space, whose states are vertex-timestep
pairs) to plan an individually cost-minimal path for a single agent $a_i$ that avoids conflicts with the
already-planned paths. When this A* search generates an A* node $n$ with an f-value of $f_n$ that moves $a_i$ from
one vertex to another, it checks if this move action leads to a conflict with an already-planned path, say,
that of agent $a_j$, and, if so, prunes node $n$. The A* search records such pruned nodes, i.e., the pair $(a_j, f_n)$.
When the A* search terminates and returns a path $p_i$ of length $l(p_i)$ for $a_i$, we collect the set of agents $B_H$
in the recorded pairs whose $f_n$ values are smaller than $l(p_i)$ and add $a_i \prec a_j$ for all $a_j \in B_H$ to $\prec_I^p$, for
the following reason: If any agent in $B_H$ had lower priority than $a_i$, then A* might find a path of length
within $[f_n, l(p_i))$ for $a_i$, i.e., it might find a shorter path than the current one. In contrast, even if all agents
not in $B_H$ had a lower priority than $a_i$, then A* still cannot find a shorter path.
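The bookkeeping for a single agent can be sketched as below, assuming the A* search has already produced a list of recorded pairs (blocking agent, f-value of the pruned node). The function names are hypothetical and chosen only for this illustration.

```python
def blocking_agents(pruned, path_length):
    """Agents whose pruned nodes had an f-value below the final path length.

    `pruned` is a list of (agent, f_value) pairs recorded during one A* search.
    """
    return {agent for agent, f_value in pruned if f_value < path_length}


def ordering_pairs(agent, pruned, path_length):
    """Pairs (agent, other), read as 'agent should have higher priority than other'."""
    return {(agent, other) for other in blocking_agents(pruned, path_length)}
```

For instance, if the search for a1 pruned a node with f-value 4 because of a2 and finally returned a path of length 6, then `ordering_pairs("a1", [("a2", 4)], 6)` yields the single pair `("a1", "a2")`.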
When we select the top $x'$ samples, we collect the associated partial priority orderings $\prec_I^p$ for
$p = 1, \ldots, x'$ and combine them into a joint partial priority ordering $\prec_I$. To do so, we first find all pairs
of agents in each $\prec_I^p$. Specifically, we convert $\prec_I^p$ to a directed acyclic graph (DAG) $H_I^p$, where node
$i$ represents agent $a_i$ and each directed edge $i \to j$ represents $a_i \prec_I^p a_j$. We run the Floyd-Warshall
algorithm on $H_I^p$ to find all connected agent pairs. We then sort the agent pairs in descending order of
their occurrences in the top $x'$ samples ($a_i \prec a_j$ and $a_j \prec a_i$ are treated as two different agent pairs)
and add them one by one to $\prec_I$ whenever possible. That is, if the agent pair is $a_i \prec a_j$, we add it to $\prec_I$
iff agents $a_i$ and $a_j$ are not comparable in $\prec_I$. We also record the occurrences and use $\#(a_i \prec a_j)$ to
represent how often $a_i \prec a_j$ occurs in the $x'$ priority orderings.
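The combination step can be sketched as follows. This is an illustrative sketch under stated assumptions, not our C++ implementation: each sample is given as a set of pairs (i, j) meaning that agent i should have higher priority than agent j, the joint ordering is kept transitively closed so that comparability is a set lookup, and ties between equally frequent pairs are broken arbitrarily.

```python
from collections import Counter
from itertools import product


def transitive_closure(pairs, agents):
    """Floyd-Warshall-style reachability: all (i, j) with a path i -> ... -> j."""
    reach = set(pairs)
    # product() varies its first element slowest, so k is the outermost loop.
    for k, i, j in product(agents, repeat=3):
        if (i, k) in reach and (k, j) in reach:
            reach.add((i, j))
    return reach


def combine_orderings(samples, agents):
    """Greedily merge per-sample partial orderings into one joint ordering."""
    counts = Counter()
    for pairs in samples:
        counts.update(transitive_closure(pairs, agents))
    joint = set()  # kept transitively closed
    for (i, j), _ in counts.most_common():  # descending number of occurrences
        if (i, j) in joint or (j, i) in joint:
            continue  # the two agents are already comparable
        joint.add((i, j))
        joint = transitive_closure(joint, agents)
    return joint, counts


# The five samples of the example in Figure 2.8b.
samples = [{(1, 2)}, {(1, 2), (3, 4)}, {(2, 1)}, {(1, 2), (1, 4)}, {(3, 4)}]
joint, counts = combine_orderings(samples, agents=[1, 2, 3, 4])
```

On these samples, the pair (1, 2) occurs three times, (3, 4) twice and (1, 4) and (2, 1) once each, and the pair (2, 1) is rejected because agents 1 and 2 are already comparable when it is considered.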
(a) Assume in the top 1 sample of the PP runs, agent $a_2$ is planned first and agent $a_1$ is planned second.
The solid arrows represent the paths planned by space-time A*. The dashed arrow shows a conflict between
$a_1$ and $a_2$ if $a_1$ also takes its individually cost-minimal path. $n$ represents the node that space-time A*
pruned away to avoid such a conflict.
(b) An example of agent pairs added from the top 5 PP samples assuming $x' = 5$:
    PP Run Sample    Agent Pairs Added
    Top 1 sample     $a_1 \prec a_2$
    Top 2 sample     $a_1 \prec a_2$, $a_3 \prec a_4$
    Top 3 sample     $a_2 \prec a_1$
    Top 4 sample     $a_1 \prec a_2$, $a_1 \prec a_4$
    Top 5 sample     $a_3 \prec a_4$
(c) Occurrences of agent pairs added from the top 5 PP samples. $a_2 \prec a_1$ is not added to $\prec_I$ because it
contradicts $a_1 \prec a_2$, which has a higher occurrence:
    Occurrence    Agent Pairs
    3             $a_1 \prec a_2$
    2             $a_3 \prec a_4$
    1             $a_1 \prec a_4$, $a_2 \prec a_1$
(d) The DAG representation of $\prec_I$.
Figure 2.8: An example of $O_P$. Assume we have a MAPF instance with $k = 4$ agents on an empty $4 \times 5$
grid map. The start and goal vertices of agents $a_1$ and $a_2$ are shown in (a).
Figure 2.8 shows an example of how $O_P$ works. Assume we have a MAPF instance $I$ on an empty
$4 \times 5$ grid map with four agents, and we select the top $x' = 5$ samples of PP runs. The start and goal
vertices of agents $a_1$ and $a_2$ are shown in Figure 2.8a. Assume that in the top 1 sample, $a_2$ and $a_1$ have the
highest and second highest priorities, respectively. Therefore, $a_2$ takes one of its individually cost-minimal
paths, as shown in Figure 2.8a (the solid blue arrow). When planning the path for $a_1$, space-time A*
finds that $a_1$ would have a conflict with $a_2$ if $a_1$ also took its individually cost-minimal path (the dashed
red arrow). Therefore, space-time A* prunes away the corresponding node $n$ where $f_n = 4$. At the end,
space-time A* finds a path $p_1$ where $l(p_1) = 6$ (the solid red arrows). The expert then compares $l(p_1)$ with
$f_n$. Since $l(p_1) > f_n$, $a_1$ would have found a path with a lower cost if it had not had to avoid the
conflict with $a_2$. Therefore, the expert adds $a_1 \prec a_2$ to $\prec_I^1$. Figure 2.8b shows an example of the agent
pairs added to $\prec_I^p$ from the top $p \leq 5$ samples. Figure 2.8c counts the occurrences of each agent pair in
$\prec_I^p$ ($p \leq 5$). Finally, the expert constructs the partial ordering $\prec_I$ greedily by adding agent pairs with the
highest occurrences. It does not add $a_2 \prec a_1$ to $\prec_I$ since $a_1 \prec a_2$ is added before $a_2 \prec a_1$. Figure 2.8d
shows the DAG representation of $\prec_I$.
2.9.1.2 Data Collection
The next step in our framework is to construct a training dataset from which we can learn a model that
imitates the expert’s output. First, we fix the graph underlying the MAPF instances that we want to solve
and the number of agents. Then, we obtain a set of MAPF instances $\mathcal{I}_{\mathrm{Train}}$ for training. The training dataset
is $D = \mathcal{I}_{\mathrm{Train}}$ since the states of the search are represented by the instances themselves. For each $I \in D$,
we run one of the experts $O_T$ and $O_P$ on $I$ and derive a label $y_I(a_i)$ for each available action $a_i \in A(I)$
from the expert’s priority ordering. We also compute a $p$-dimensional feature vector $\phi_I(a_i)$ that describes
agent $a_i$.
Features We collect a $p$-dimensional feature vector $\phi_I(a_i)$ for each agent $a_i$ and each MAPF instance
$I \in \mathcal{I}_{\mathrm{Train}}$. The $p = 26$ features in our implementation are summarized in Table 2.11 and can be classified
into four categories:
1. Start-Goal Distances Motivated by the query-distance heuristic [17] (see Section 2.3.4), we design 4
features about the graph and Manhattan distances between the start and goal vertices of $a_i$ (Feature
Group 1). We also generalize this idea to looking at the graph distances between the start/goal
vertices of $a_i$ and those of the other agents (Feature Groups 2 and 3).
2. MDD We consider an MDD $\mathrm{MDD}_i$ for agent $a_i$ that consists of all individually cost-minimal paths
from $s_i$ to $t_i$, i.e., the MDD computed at the root CT node in CBS. Motivated by the least-option
heuristic (see Section 2.3.4), we design 5 features about $\mathrm{MDD}_i$ (Feature Groups 4-6) because $\mathrm{MDD}_i$
captures information about the path options of $a_i$.
Feature Group  Description                                                                        Count
1              Graph and Manhattan distances between $s_i$ and $g_i$: their respective values,     4
               absolute difference and the ratio of the graph distance over the Manhattan
               distance.
2              Graph distance between $s_i$ and the start vertices of the other agents: their      3
               maximum, minimum, and mean.
3              Graph distance between $g_i$ and the goal vertices of the other agents: their       3
               maximum, minimum, and mean.
4              Sum of the widths of all levels of $\mathrm{MDD}_i$.                                1
5              Width of each level (excluding the first and the last levels) of $\mathrm{MDD}_i$:  3
               their maximum, minimum, and mean.
6              Number of levels of $\mathrm{MDD}_i$ with width one.                                1
7              Number of the other agents whose MDDs contain $s_i$.                                1
8              Number of the other agents whose MDDs contain $g_i$.                                1
9              Number of the other agents whose start vertices are in $\mathrm{MDD}_i$.            1
10             Number of the other agents whose goal vertices are in $\mathrm{MDD}_i$.             1
11             Number of vertex, edge and cardinal conflicts between any cost-minimal path of      6
               $a_i$ and any cost-minimal path of one of the other agents: counted once for each
               agent pair and counted once for each conflict.
12             Number of vertices in $\mathrm{MDD}_i$ that are also in the MDD of at least one     1
               other agent.
Table 2.11: $p = 26$ features for agent $a_i$. Column “Count” reports the numbers of features contributed
by the corresponding entries. We consider an MDD $\mathrm{MDD}_i$ for agent $a_i$ that consists of all individually
cost-minimal paths from $s_i$ to $t_i$, i.e., the MDD that would have been computed at the root CT node in
CBS.
3. Start and Goal Vertices Motivated by the start-and-goal-conflict heuristic (see Section 2.3.4), we
design 4 features about the potential conflicts at the start or goal vertices of the other agents that
$a_i$ might be involved in and vice versa, namely potential conflicts between $a_i$ and another agent
$a_j$ when $a_i$ is at its start or goal vertex and $a_j$ follows (one of) its individually cost-minimal paths
(Feature Groups 7 and 8) or when $a_j$ is at its start or goal vertex and $a_i$ follows (one of) its individually
cost-minimal paths (Feature Groups 9 and 10).
4. Conflicts We finally design 7 features about conflicts (Feature Group 11) and potential conflicts
(Feature Group 12) of different types that $a_i$ might be involved in. In particular, Feature Group 11 counts
the number of each type of conflict that $a_i$ can be involved in if all agents follow their individually
cost-minimal paths. Feature Group 12 counts the number of potential vertex conflicts that $a_i$ can be involved
in if all agents follow their individually cost-minimal paths but can wait for some time steps along
their paths. We use the numbers of these conflicts as features because they can be easily computed
by reasoning about the MDDs of the agents.
We perform a linear transformation to normalize the value of each feature to the range of [0, 1], where
the minimum value of that feature gets transformed into a 0 and the maximum value gets transformed
into a 1.
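A minimal sketch of this normalization is given below. It assumes that the feature vectors are stored as rows of a list and that the minimum and maximum are taken over the rows passed in. Mapping a constant feature to 0 is our assumption for this sketch, since the text does not specify that case.

```python
def min_max_normalize(rows):
    """Linearly rescale every feature column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [
            # Assumption: a constant column (hi == lo) is mapped to 0.0.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]
```

For example, `min_max_normalize([[2, 10], [4, 10], [6, 10]])` maps the first column to 0, 0.5 and 1 and the constant second column to 0.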
Labels Depending on the expert used in data collection, we use different labels derived from $\prec_I$.
For expert $O_T$, we group the agents into $\lfloor k/m \rfloor + 1$ priority groups (where $m \in \mathbb{N}$ is a hyperparameter)
by setting $y_I(a_i) = \lfloor r_i/m \rfloor$, where $r_i$ represents that agent $a_i$ has the $r_i$-th lowest priority among all
agents. That is, agents with larger labels are in higher priority groups and agents with the same label are
in the same priority group.
For expert $O_P$, we assign labels such that $y_I(a_i) > y_I(a_j)$ if $a_i \prec_I a_j$. Such assignments always exist
according to Definition 2.3.1 (see Section 2.3.4) and can be found by performing a topological sort on the
DAG that captures $\prec_I$.
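Both labeling schemes can be sketched as follows. The function names are hypothetical. For the partial ordering, the sketch labels each agent with the length of the longest chain of lower-priority agents below it, which is only one of many assignments that satisfy the stated condition and is not necessarily the one used in our experiments.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def labels_total(order_high_to_low, m):
    """Labels floor(r_i / m), where r_i = 1 for the lowest-priority agent."""
    k = len(order_high_to_low)
    return {agent: (k - pos) // m for pos, agent in enumerate(order_high_to_low)}


def labels_partial(pairs, agents):
    """Labels with y[hi] > y[lo] for every pair (hi, lo) of the partial ordering."""
    lower = {agent: set() for agent in agents}
    for hi, lo in pairs:
        lower[hi].add(lo)
    labels = {}
    # TopologicalSorter treats the mapped values as predecessors, so passing the
    # lower-priority agents makes static_order() emit them before `agent`.
    for agent in TopologicalSorter(lower).static_order():
        labels[agent] = 1 + max((labels[b] for b in lower[agent]), default=-1)
    return labels
```

With five agents ordered a3, a1, a2, a5, a4 from highest to lowest priority and m = 2, `labels_total` assigns the labels 2, 2, 1, 1 and 0, respectively, i.e., three priority groups.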
2.9.1.3 Model Learning
Given the training dataset $D = \mathcal{I}_{\mathrm{Train}}$, we follow the formulation in Section 2.5 to learn a linear ranking
function with parameters $w \in \mathbb{R}^p$
$$\pi : \mathbb{R}^p \to \mathbb{R} : \pi(\phi_I(a_i)) = w^{\top} \phi_I(a_i)$$
that minimizes the loss function
$$L(w) = \sum_{I \in \mathcal{I}_{\mathrm{Train}}} l(y_I, \hat{y}_I) + \frac{C}{2} \lVert w \rVert_2^2,$$
where $\hat{y}_I(a_i) = \pi(\phi_I(a_i))$ is the predicted score for agent $a_i$. To compute $l(y_I, \hat{y}_I)$, we consider the set
of pairs $P_I = \{(a_i, a_j) : y_I(a_i) > y_I(a_j) \wedge a_i, a_j \in A(I)\}$. $l(y_I, \hat{y}_I)$ is computed as follows:
$$l(y_I, \hat{y}_I) = \frac{\sum_{(a_i, a_j) \in P_I : \hat{y}_I(a_i) \leq \hat{y}_I(a_j)} \tilde{w}_{a_i, a_j}}{\sum_{(a_i, a_j) \in P_I} \tilde{w}_{a_i, a_j}}. \tag{2.2}$$
We train two variants of PP+ML, namely MLT and MLP, that are trained with data collected based on
OT and OP , respectively. For MLT, we set w˜ai,aj = 1 to assign uniform weights to discordant pairs in
the loss function. For MLP, we set w˜ai,aj = #(ai ≺ aj ) to assign weights to discordant pairs in the loss
function based on the number of occurrences of the agent pairs in ≺I .
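The per-instance loss of Equation (2.2) can be sketched as below. The sketch takes the labels and predicted scores as dictionaries keyed by agent and an optional weight function. These data structures are assumptions made for illustration, and the regularization term of L(w) is not included.

```python
def pairwise_ranking_loss(labels, scores, weight=None):
    """Weighted fraction of discordant pairs, following Equation (2.2).

    `labels` and `scores` map each agent to y_I(a) and the predicted score.
    `weight(hi, lo)` returns the pair weight; uniform weights are the default.
    """
    if weight is None:
        weight = lambda hi, lo: 1.0
    total = discordant = 0.0
    for hi in labels:
        for lo in labels:
            if labels[hi] > labels[lo]:  # (hi, lo) is in P_I
                w = weight(hi, lo)
                total += w
                if scores[hi] <= scores[lo]:  # the model ranks the pair wrongly
                    discordant += w
    return discordant / total if total > 0 else 0.0
```

With uniform weights, the loss is simply the fraction of label-ordered pairs that the predicted scores order incorrectly. Passing the occurrence counts as weights makes frequently observed pairs more expensive to violate.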
2.9.1.4 ML-Guided Search
After data collection and model learning, we apply the learned ranking function $\pi$ to the feature vectors
for each test instance $I \in \mathcal{I}_{\mathrm{Test}}$. Based on the predicted scores $\hat{y}_I : \{a_1, \ldots, a_k\} \to \mathbb{R}^k$ returned by $\pi$, we
propose two different methods to produce a total priority ordering.
Deterministic ranking. We rank the agents by their predicted scores, namely $a_i \prec a_j$ iff $\hat{y}_I(a_i) > \hat{y}_I(a_j)$.
Stochastic ranking. We use the predicted scores to produce a probability distribution and generate a
total priority ordering sequentially from agents with high priority to agents with low priority. Specifically,
we normalize the predicted scores $\hat{y}_I$ using the softmax function
$$\sigma : \mathbb{R}^k \to [0, 1]^k : \sigma(\hat{y}_I) = \frac{\left(e^{\gamma \hat{y}_I(a_1)}, \ldots, e^{\gamma \hat{y}_I(a_k)}\right)}{\sum_{j=1}^{k} e^{\gamma \hat{y}_I(a_j)}}, \tag{2.3}$$
where $\gamma \in \mathbb{R}^+$ is a hyperparameter. We then repeatedly assign the next highest priority to an agent
that is selected with a probability proportional to its normalized predicted score (where, of course, every
agent can be selected only once). Agents with higher normalized scores have higher probabilities of being
selected earlier and thus assigned higher priority. This adds randomness to the total priority orderings and
allows us to leverage the random restart scheme when experimenting with PP+ML.
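The two ranking methods can be sketched as follows, with the predicted scores given as a dictionary keyed by agent (an assumption of this sketch). Subtracting the largest remaining score before exponentiating leaves the selection probabilities unchanged and only guards against numerical overflow. It is an implementation detail of the sketch, not part of Equation (2.3).

```python
import math
import random


def deterministic_ranking(scores):
    """Agents sorted from highest to lowest predicted score."""
    return sorted(scores, key=scores.get, reverse=True)


def stochastic_ranking(scores, gamma=0.5, rng=None):
    """Sample a total priority ordering from highest to lowest priority."""
    rng = rng or random.Random()
    remaining = dict(scores)
    order = []
    while remaining:
        agents = list(remaining)
        top = max(remaining.values())
        # Softmax numerators over the agents that have not been selected yet.
        weights = [math.exp(gamma * (remaining[a] - top)) for a in agents]
        pick = rng.choices(agents, weights=weights, k=1)[0]
        order.append(pick)
        del remaining[pick]  # every agent can be selected only once
    return order
```

Calling `stochastic_ranking` repeatedly with different random seeds yields different total priority orderings, which is what the random restart scheme relies on.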
Discussion We address the limitations of PP+ML here. We will show in Section 2.9.2 that PP+ML does
not outperform the baselines in some cases. First, our expert for assigning agents’ priorities is based on the
best of 100 random samples. Thus, there might not be enough effective samples. Especially for large grid
maps, it is hard to get enough samples since evaluating each sample with the expert is slow. We thought
about using the least-option heuristic as the expert since it is slow but effective. However, computing the
number of path options requires high-precision computing for large grid maps, which causes significant
runtime overhead in the expert and makes it difficult to implement. One way to reduce such runtime
overhead is to compute the number of path options modulo some large prime number (that fits in
32 bits) and repeat it with different prime numbers. We need to ensure that the number of path options
is bounded by the product of all prime numbers we use. We then use the Chinese remainder theorem
to compute the actual number of path options using high-precision computing. This method allows most
arithmetic operations to be done with 64-bit integers. However, this is even more complicated to implement
and we did not invest time in it. Second, the ML model does not generalize well across grid maps or
different numbers of agents on the same grid map. We did not get good results when we tried curriculum
learning to address this limitation similar to what we did in Section 2.7. It is also slow to collect training
data on large grid maps. Therefore, for those maps, we train only one model for a fixed number of agents.
Putting more effort into feature engineering and/or using a more expressive ML model than SVMrank might
mitigate some of these limitations.
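The modular counting idea mentioned above, which we did not implement, can be illustrated with a small sketch. Counting monotone lattice paths on an obstacle-free grid is a toy stand-in for counting path options in an MDD, and the two primes are arbitrary choices that fit in 32 bits. Python integers are arbitrary-precision, so the sketch only illustrates the arithmetic and not the runtime savings.

```python
import math


def count_paths_mod(rows, cols, p):
    """Number of right/down lattice paths across a rows x cols grid, modulo p."""
    counts = [1] * cols  # one path to every vertex in the first row
    for _ in range(rows - 1):
        for c in range(1, cols):
            counts[c] = (counts[c] + counts[c - 1]) % p
    return counts[-1]


def crt(residues, primes):
    """Recover x in [0, product of primes) from its residues (distinct primes)."""
    x, modulus = 0, 1
    for r, p in zip(residues, primes):
        # Choose t such that x + modulus * t is congruent to r modulo p.
        t = ((r - x) * pow(modulus, -1, p)) % p
        x += modulus * t
        modulus *= p
    return x


primes = [1_000_000_007, 998_244_353]  # both below 2**31
residues = [count_paths_mod(21, 21, p) for p in primes]
total = crt(residues, primes)
# The closed form for this toy grid is a binomial coefficient.
assert total == math.comb(40, 20)
```

The reconstruction is exact only because the true count is smaller than the product of the primes, which is the bound that has to be ensured in advance.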
2.9.2 Empirical Evaluation
In this subsection, we demonstrate the efficiency and effectiveness of PP+ML through experiments. In the
following, we introduce our evaluation setup and then present the results.
2.9.2.1 Setup
We compare the two variants of PP+ML, namely MLT and MLP, against three non-ML-guided variants
of PP: (1) LH, a query-distance heuristic where agents with longer individually cost-minimal paths have
higher priority [17], breaking ties uniformly at random; (2) SH, a query-distance heuristic where agents
with shorter individually cost-minimal paths have higher priority [137], breaking ties uniformly at random;
and (3) RND, a heuristic that generates a random total priority ordering [16]. We implement all algorithms
in C++ with the same PP code base and run experiments on Ubuntu 20.04 LTS on an Intel Xeon 8175M
processor with 8 GB of memory.
We evaluate PP+ML on a set of six grid maps M, illustrated in Table 2.13, with different sizes and
structures from the MAPF benchmark [181]: (1) the random map “random-32-32-20”, which is a 32 × 32
grid map with 20% randomly blocked cells; (2) the room map “room-32-32-4”, which is a 32 × 32 grid
map with 64 square rooms connected by single-cell doors; (3) the maze map “maze-32-32-2”, which is a
32 × 32 grid map with two-cell-wide corridors; (4) the warehouse map “warehouse-10-20-10-2-1”, which
is a 161 × 63 grid map with 200 10 × 2 rectangular obstacles; (5) the first game map “lak303d” and (6)
the second game map “ost003d”, which are both 194 × 194 grid maps from the video game Dragon Age:
Origins. We refer to the first four grid maps as small maps and to the last two as large maps.
We generate a set $\mathcal{I}^{(M)}_{\mathrm{Train}}$ of training instances for each grid map $M$. For MLT, we generate 99 training
instances from each of the 25 scenarios, so $|\mathcal{I}^{(M)}_{\mathrm{Train}}| = 2{,}475$. For MLP, we generate one training instance
from each scenario since the training loss converges already for a small training dataset, so $|\mathcal{I}^{(M)}_{\mathrm{Train}}| = 25$.
To collect training data, we run PP $x = 100$ times, once with LH, once with SH and 98 times with RND,
to solve each MAPF instance $I \in \mathcal{I}^{(M)}_{\mathrm{Train}}$. We pick the PP run with the least sum of costs for MLT and the
top $x' = 5$ PP runs with the least sums of costs for MLP to generate the outputs of the experts. However,
when we use small maps with large numbers of agents $k$, most of the 100 PP runs fail to find any solutions.
We show in Section 2.9.2.2 that our ML models often have higher success rates than LH, SH and RND on
small maps with small numbers of agents $k$. Therefore, when the success rate of the 100 PP runs is less
than 5% for the given MAPF instances (i.e., on the random, room and maze maps with $k \geq 200$, $k \geq 125$
and $k \geq 90$, respectively), we replace 10 of the 98 RND runs with PP+ML trained on a smaller number of
agents on the same map (i.e., the number of agents shown in the row of Tables 2.12 and 2.13 immediately
above the row that corresponds to the map and the number of agents of the given MAPF instance). Specifically,
we run PP+ML with random restarts with a runtime limit of 3 seconds (i.e., repeatedly run PP using the
stochastic ranking method until a solution is found or the runtime limit is reached) in each run and always
use the same ML model for training and testing (i.e., train MLT with datasets partially generated by MLT
and train MLP with datasets partially generated by MLP). This is effective in gathering training data for
large numbers of agents on all small maps except for the warehouse map, for which we did not use PP+ML
to generate training datasets (because it did not result in higher success rates). We varied the group size
$m \in \{1, 5, 10\}$ for MLT and picked $m = 5$ as it leads to the best results. We varied the regularization
parameter $C \in \{0.1, 1, 10, 20, 100\}$ and picked $C = 20$ to train MLT since there was no significant
difference in the test results. We used the built-in cross-validation function in LIBLINEAR [52] to obtain
the value of $C = 128$ to train MLP.
We always train and test on the same grid map. For small grid maps, we train and test with the same
number of agents. For large grid maps, we are only able to gather training datasets for MAPF instances
with k ≤ 500 because the runtime for a single PP run with a larger k is too high. We, therefore, train and
test with the same number of agents when k ≤ 500 and use the ML models trained on MAPF instances
                       Success rate (%)                 Solution rank
Grid Map      k     LH   SH  RND  MLT  MLP      LH    SH   RND   MLT   MLP
random       50     96   16   76   84   56    2.00  2.72  1.24  1.52  1.24
            100    100   20   60   32   24    1.20  1.68  1.16  1.32  1.68
            150     68    4   20   64    8    0.72  1.44  1.28  0.68  1.40
            200     24    0    0   32   24    0.44  0.80  0.80  0.28  0.44
            250      0    0    0    0    0    0.00  0.00  0.00  0.00  0.00
room         50     88   12   52   76   16    1.48  2.00  1.28  0.72  1.72
             75     92    0    8    0   44    0.28  1.44  1.36  1.44  0.72
            100     60    0    0   56   60    0.92  1.76  1.76  0.64  0.56
            125     24    0    0   16   20    0.28  0.60  0.60  0.44  0.32
            150      4    0    0    0    0    0.00  0.04  0.04  0.04  0.04
maze         50     84    0   12   76   68    1.32  2.40  2.12  0.68  0.96
             70     76    0    0   84   88    0.84  2.48  2.48  0.88  0.88
             90     64    0    0   52   80    0.96  1.96  1.96  0.84  0.68
            110     44    0    0   32   44    0.56  1.20  1.20  0.52  0.56
            130     16    0    0    8   16    0.20  0.40  0.40  0.32  0.20
warehouse   100     92   32   80   92   80    2.72  2.64  1.40  1.96  0.64
            200     80   28   56   36   60    1.80  1.36  1.56  1.44  0.88
            300     52   16   12   24   20    0.72  0.68  1.04  0.68  0.84
            400     32    4    8   12   24    0.44  0.64  0.64  0.60  0.44
            500     12    0    4    0    0    0.04  0.16  0.08  0.16  0.16
lak303d     300    100   28   96   96   76    2.64  2.64  1.96  1.16  1.24
            400    100   36   88   88   72    2.84  2.24  1.64  1.52  1.08
            500    100   44   88   84   80    2.76  1.76  1.48  2.24  0.96
            600     88    8   36   80   12    1.24  1.84  1.36  0.72  1.80
            700     16    0    0   68    0    0.64  0.84  0.84  0.16  0.84
ost003d     300    100   28   96   92   80    2.72  2.68  1.20  2.04  1.04
            400     88   32   92   84   92    2.84  2.44  2.04  1.04  0.96
            500     96   28   84   96   64    2.32  2.40  1.52  1.76  1.20
            600     92   16   60   84   40    1.76  2.16  1.28  1.32  1.44
            700     72    8   40   68   12    1.08  1.60  0.88  0.88  1.64
Table 2.12: Success rate and solution rank for deterministic ranking. The best results achieved among all
algorithms are shown in bold. The results are obtained by training and testing on the same map with
the same number of agents k, except for maps lak303d and ost003d with k > 500, where the results are
obtained by training on the same map with k = 500.
with k = 500 to test on MAPF instances with k > 500. The runtime limit for testing is set to 1 minute for
small grid maps and 10 minutes for large grid maps. We precompute the graph distances from each goal
vertex to all vertices on the grid map and use them as the admissible heuristics for the space-time A* search
for all variants of PP. Unlike Sections 2.6 and 2.8, we do not test on grid maps that differ from the
one used for training since the ML models did not generalize well to unseen grid maps. For the same reason, we
do not test on MAPF instances with numbers of agents that are different from the training instances
(except on the large grid maps with k > 500, as noted above).
We evaluate all variants of PP with four metrics on 25 test instances for each pair of number of agents
and grid map. Success rate is the percentage of solved test instances within the runtime limit. Runtime to
first solution is the runtime needed to find the first solution, averaged over all test instances, where the
runtime limit is used for unsolved instances. Here, we consider only the PP runtime and ignore the runtime
overhead of generating the total priority orderings for PP because such runtime overhead for SH and LH
is negligible as the start-goal graph distances are precomputed and that for MLT and MLP is also small
due to the small ML runtime overhead.∗∗ Normalized sum of costs is the ratio of the sum of costs and the
sum of the start-goal graph distances of all agents. Solution rank evaluates the relative solution quality as
follows: For each test instance, we rank the variants of PP in ascending order of the sums of costs of their
solutions. The lower the sum of costs, the lower the numerical value of the rank. The lowest numerical
value of the rank is 0. Algorithms that lead to the same sum of costs have the same rank, which is set to
the numerical value of the lowest rank in the tie. For example, if the sums of costs of the 5 algorithms are
101, 101, 102, 103 and 103, then their ranks are 0, 0, 2, 3 and 3, respectively. Variants of PP that fail to solve
the MAPF instances have the largest numerical value of the rank. Solution rank is the average numerical
value of the rank over the test instances.
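The per-instance ranking can be sketched as below. The sketch assumes that a failed run is marked with `None` and treated as having an infinite sum of costs, so that all failed variants tie at the largest rank. This is our reading of the definition above for this sketch.

```python
import math


def solution_ranks(costs):
    """Rank the algorithms on one test instance; `None` marks a failed run."""
    values = [math.inf if cost is None else cost for cost in costs]
    # The rank is the number of algorithms with a strictly lower sum of costs,
    # so tied algorithms share the lowest rank in the tie.
    return [sum(other < value for other in values) for value in values]
```

For the example above, `solution_ranks([101, 101, 102, 103, 103])` returns `[0, 0, 2, 3, 3]`. The solution rank reported for an algorithm is then the mean of its per-instance ranks over the test instances.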
∗∗The ML runtime overhead mainly comes from the runtime for collecting features, which, for example, is 0.03 seconds, 0.01
seconds, 0.02 seconds, 0.7 seconds, 4.37 seconds and 3.34 seconds per MAPF instance for the six grid maps with their respective
largest numbers of agents tested in Table 2.13. Moreover, the features need to be collected only once for each MAPF instance
even if we run PP multiple times via random restarts.
Figure 2.9: Normalized sum of costs for deterministic ranking on the random map. Unsolved MAPF instances are shown on top of the plot.
2.9.2.2 Results
Deterministic Ranking We first experiment with deterministic ranking for the baseline PP algorithms,
LH, SH and RND and two variants of PP+ML, MLT and MLP, on test instances on each of the six grid
maps and vary the number of agents. Here, each variant of PP generates one total priority ordering and
runs exactly once for each MAPF instance. (RND uses the first total priority ordering that the randomized
algorithm generates.) We report, in Table 2.12, the success rate and the solution rank for all grid maps and,
in Figure 2.9, the normalized sum of costs for the random map.
In terms of success rate, both variants of PP+ML, MLT and MLP, achieve comparable results but do
not completely dominate the non-ML-guided variants of PP. MLT generally has a higher success rate than
MLP, but both are often more prone to failure than LH. In terms of solution quality, MLT achieves results
comparable to RND. Although MLP often fails to find a solution, when it does find one, it often finds a
solution with lower sum of costs than the other algorithms, as shown in Figure 2.9. In other words, MLP
suffers from low success rates but yields good solution qualities.
                       Success rate (%)            Runtime to the first solution (seconds)
Grid Map      k     LH   SH  RND  MLT  MLP        LH      SH     RND     MLT     MLP
random      150    100  100  100  100  100      0.08    0.65    0.42    0.24    1.37
            175    100  100  100  100  100      3.24    2.54    6.22    1.06    1.49
            200     88   80   88  100  100     15.60   20.56   18.25    2.13    2.86
            225     16   20   28   88   92     51.09   49.70   46.10    8.82   13.17
            250      0    0    0   44   52     60.00   60.00   60.00   40.88   33.79
room         50    100  100  100  100  100      0.24    0.17    0.15    0.11    0.65
             75    100  100  100  100  100      0.66    1.23    1.13    1.47    0.52
            100     84   80   76  100  100     12.56   22.70   23.07    3.35    0.70
            125     20    8    4   80   88     49.50   59.60   58.21   16.21   10.18
            150      0    0    0   24   32     60.00   60.00   60.00   51.53   44.57
maze         50    100  100  100  100  100      0.61    3.86    2.19    1.30    1.28
             70    100   68   68   96  100      3.17   25.48   28.58    3.24    0.49
             90     68   16   16  100  100     22.81   55.27   54.18    2.84    0.72
            110     44    0    0   92   96     33.67   60.02   60.01   15.22    9.90
            130     12    0    0   36   52     52.85   60.01   60.02   42.78   33.90
warehouse   350     96   96   96  100   92      9.67   13.62   13.18   13.45   13.43
            400     80   84   72   84   76     26.00   25.24   26.80   24.39   29.06
            450     68   48   52   60   48     35.80   39.67   39.45   37.10   40.91
            500     24   28   20   20   32     52.86   49.96   52.50   53.06   49.11
            550     12    8    8   24   12     58.29   56.73   59.54   56.11   56.63
lak303d     500    100   96  100  100  100     26.74   65.87   62.98   43.78   98.97
            600    100   96   92   96   88     58.18  117.80  135.97  100.26  174.64
            700    100   96   88   88   84     99.51  140.22  230.52  257.10  270.98
            800    100   68   76   68   56    192.71  397.20  395.42  381.47  451.95
            900     68   32   32   36   16    423.82  565.70  536.69  541.53  584.13
ost003d     500    100  100   96   96   92     15.68   49.15   60.60   53.10   87.81
            600    100   96   96   96   92     43.07   85.94   93.92   94.57  136.83
            700     96   92   92   92   88     77.55  163.39  205.10  468.27  232.78
            800    100   84   84   92   72    126.35  287.04  299.36  263.53  369.30
            900     88   64   64   72   40    273.00  462.16  456.27  420.71  525.51
Table 2.13: Success rate and runtime to the first solution for stochastic ranking with random restarts. The
best results achieved among all algorithms are shown in bold. The results are obtained by training and
testing on the same grid map with the same number of agents k, except for grid maps lak303d and ost003d
with k > 500, where the results are obtained by training on the same map with k = 500.
Stochastic Ranking with Random Restarts The random restart technique has been shown to improve
the success rate of PP by trying multiple priority assignments [16]. Therefore, we use it to boost the success
rate of both variants of PP+ML. We now illustrate stochastic ranking in conjunction with random restarts,
which we apply to all five algorithms to ensure a fair comparison. To make random restarts possible, we add
randomness to the deterministic algorithms LH and SH. LH relies on the start-goal graph distances of all
agents to determine the total priority ordering. Therefore, for LH, we use the stochastic ranking method in
Section 2.9.1.4 with $\mathrm{dist}(s_i, g_i)$ replacing $\hat{y}_I(a_i)$ as agent $a_i$’s score. For SH, since it is the reversed version
of LH, we use the same stochastic ranking method as for LH but generate a total priority ordering from low to
high (instead of high to low). We varied parameter $\gamma \in \{0.1, 0.5, 1.0, 1.5\}$ in the softmax function and
picked $\gamma = 0.5$ for SH, LH, MLT and MLP as it leads to the best results. RND is directly used with random
restarts. We keep restarting each algorithm with a new random seed until the runtime limit is reached. We
report, in Table 2.13, the success rate and the runtime to the first solution and, in Figure 2.10, the solution
rank, where the solution, which we refer to as the final solution, is the one with the least sum of costs
found within the runtime limit.
MLT and MLP outperform all non-ML-guided variants of PP in terms of the success rate, runtime
to the first solution and sum of costs of the final solution on the random, room and maze maps. MLP
has a slightly higher success rate and a better solution rank than MLT. The advantage of PP+ML is most
apparent on these grid maps when the number of agents is large and a solution is hard to find with the
non-ML-guided variants of PP. On the warehouse map, MLT and MLP achieve results comparable to the
baseline algorithms. On the large grid maps, MLT achieves success rates and solution ranks comparable to
the baseline algorithms, while MLP has a marginally lower success rate but a better solution rank. These
results, to some degree, are consistent with the difficulty of obtaining high-quality training datasets: As
we described in the experimental setup at the beginning of this section, it is difficult to get good training
datasets on the warehouse map and the large grid maps due to both the low success rates of existing PP
Figure 2.10: Solution rank for stochastic ranking with random restarts.
variants and their long runtimes. This is the reason why we train and test using different numbers of
agents on the lak303d and ost003d maps. Our results on these grid maps demonstrate the limited ability
of MLP to generalize to a higher number of agents.
Feature Importance We now analyze the feature importance of the learned ranking functions with
good success rates and solution ranks, i.e., on the random, room and maze maps, each with the largest
number of agents, because these ranking functions substantially outperform the baseline algorithms. We
sort the feature weights w in decreasing order of their absolute values. Since the features are normalized,
we use the absolute values of the feature weights to represent their importance.
The three ranking functions for MLT, one for each grid map, have nine features in common among
their top ten features with the largest absolute values: the graph and Manhattan distances between $s_i$ and
$g_i$ and their absolute difference in Feature Group 1 (three features, definition in Section 2.9.1.2), the number
of vertex conflicts counted by agent pair in Feature Group 11 (one feature) and Feature Groups 4, 6, 9, 10
and 12 (five features).
The three ranking functions for MLP, one for each grid map, have five features in common among their
top ten features with the largest absolute values: the graph distance between $s_i$ and $g_i$ and the absolute
difference between the graph and Manhattan distances between $s_i$ and $g_i$ in Feature Group 1 (two features) and
Feature Groups 4, 6 and 10 (three features).
Taking the intersection between the most important features for MLT and MLP, we determine the
most important features to be Feature Groups 1, 4, 6 and 10, which correspond to the query-distance
heuristic (Feature Group 1), the least-option heuristic (Feature Groups 4 and 6) and the
start-and-goal-conflict heuristic (Feature Group 10). This indicates that our learned ranking functions combine
the strengths of the existing heuristic methods.
2.10 Summary
In this chapter, we validated the hypothesis that one can leverage a general ML framework to improve
human-designed decision-making strategies in different types of MAPF search algorithms. We first proposed
a general ML framework based on imitation learning. To apply the framework, we find an expert to
provide high-quality demonstrations of decisions that we are interested in improving and use the expert
to collect data. Then, we learn an ML model to imitate the expert’s decisions using imitation learning to
speed up decision-making since the expert is slow. Finally, the learned ML model replaces the expert’s
decisions during the search.
We identified important decisions in CBS, ECBS, MAPF-LNS and PP, which are optimal,
bounded-suboptimal and unbounded-suboptimal MAPF search algorithms, and then demonstrated the applicability
of the framework to these algorithms. We introduced CBS+ML, ECBS+ML, MAPF-ML-LNS and PP+ML,
where we learned an improved conflict-selection strategy for CBS, a node-selection strategy for ECBS, an
agent-set selection strategy for MAPF-LNS and a priority-assignment strategy for PP that showed substantial
improvement in empirical performance in terms of efficiency and/or effectiveness over their non-ML
counterparts. Specifically, for CBS and ECBS, we improved their efficiency, and for MAPF-LNS and PP, we
improved both their efficiency and effectiveness. With the imitation learning framework, we also showed
how imitation learning with the same loss function, the same ML model and similar features can be reused
in improving different MAPF search algorithms.
Chapter 3
Improving Decision-Making in MILP Search Algorithms
In this chapter, we present the second major contribution of this dissertation. Unlike for MAPF, imitation learning and reinforcement learning have already been applied to improving MILP search algorithms. However, there are machine learning (ML) techniques, such as contrastive learning, that have shown success in various domains within computer science, such as computer vision [74] and natural language processing [66], but have not been applied to solving combinatorial optimization problems (COPs). To fill this gap, we formulate a general contrastive learning framework to improve decision-making strategies for MILP search algorithms. We identify important decisions to make in two different state-of-the-art MILP search algorithms, namely Large Neighborhood Search (LNS) and Predict-and-Search (PaS), and then apply the framework to improve them. Different from the imitation learning framework for MAPF introduced in Chapter 2, contrastive learning learns to make discriminative predictions based on the expert's demonstrations (that is, positive samples) and bad examples of demonstrations (that is, negative samples). One of the main challenges is to design algorithms to calculate both positive and negative samples, which is similar to finding an expert in the imitation learning framework. In this chapter, we will again see how the same ML algorithm, the same ML model, the same loss function and similar features can be reused in improving different MILP search algorithms. Empirically, the ML-guided versions of the MILP search algorithms substantially outperform their non-ML-guided and other ML-guided counterparts in terms of both runtime and solution quality. Therefore, these results validate the hypothesis that one can leverage a general ML framework to improve human-designed decision-making strategies in different types of MILP search algorithms.
The remainder of this chapter is structured as follows. In Section 3.1, we state the motivation behind
using ML for MILP solving and provide an overview of our contributions. In Section 3.2, we formally define
MILPs. In Section 3.3, we introduce MILP search algorithms, including LNS and PaS. In Section 3.4, we
summarize related work. In Section 3.5, we introduce the framework. In Sections 3.6 and 3.7, we introduce
CL-LNS and ConPaS, respectively, and evaluate them empirically. Finally, in Section 3.8, we summarize
the contributions of this chapter.
3.1 Introduction
Designing algorithms for COPs is an important and challenging task. A wide variety of real-world problems are COPs, such as vehicle routing [189], path planning [158] and resource allocation [144] problems, and a majority of them are NP-hard to solve. In the past few decades, algorithms, including optimal algorithms, approximation algorithms and heuristic algorithms, have been studied extensively due to the importance of COPs. Those algorithms are mostly designed by humans through costly processes that often require a deep understanding of the problem domains and their underlying structures as well as considerable time and effort.
Recently, there has been an increased interest in automating algorithm designs for COPs with ML. Many ML methods learn to either construct or improve solutions or improve decision-making within an algorithmic framework, such as greedy search, local search or tree search, for a specific COP, such as MAPF, for which we demonstrate concrete examples in Chapter 2. Other examples include the traveling salesman problem (TSP) [200, 214], the vehicle routing problem (VRP) [107] and the independent set problem [130]. ML methods designed for one of those COPs are often not easily applicable to the others.
In contrast, Mixed Integer Linear Programs (MILPs) can flexibly encode a broad family of COPs, such as network design problems [96, 43, 84], mechanism design problems [41] and facility location problems [76, 5]. MILPs can be solved by Branch and Bound (BnB) [113], an optimal tree search algorithm that achieves state-of-the-art performance for MILPs. Over the past decades, BnB has been improved tremendously to become the core of many popular MILP solvers such as SCIP [21], CPLEX [37] and Gurobi [69]. However, due to its exhaustive search nature, it is hard for BnB to scale to large instances [102, 59].
On the other hand, metaheuristic algorithms are MILP search algorithms that can find high-quality solutions much faster than BnB for large MILP instances. One of them is Large Neighborhood Search (LNS) [177, 198, 179, 86]. LNS starts from an initial solution (i.e., a feasible assignment of values to variables) and then improves the current best solution by iteratively selecting a subset of variables to reoptimize while leaving others fixed. Selecting which subset to reoptimize, i.e., the destroy heuristic, is a critical component in LNS. Hand-crafted destroy heuristics, such as the randomized heuristic [177, 179] and the Local Branching (LB) heuristic [56], are often either inefficient (slow to find good subsets) or ineffective (find subsets of bad quality). ML-based destroy heuristics have also been proposed and have outperformed hand-crafted ones. State-of-the-art methods include IL-LNS [179], which uses imitation learning to imitate the LB heuristic, and RL-LNS [198], which uses a similar framework to IL-LNS but is trained with reinforcement learning.
Another line of research on metaheuristic algorithms focuses on primal heuristics that generate high-quality solutions to MILPs. In particular, they focus on generating full or partial high-quality feasible assignments of values to variables. Diving is one of the most popular primal heuristics. In BnB, diving typically explores the BnB search tree to sequentially fix the values of the variables via depth-first search. Recently, there has been an increased interest in data-driven primal heuristic designs for MILPs since MILPs from the same application domain often share similar structures and characteristics. Among them, variants of diving [148, 70] have been proposed with two main differences from diving: First, diving can be performed at any search tree node in BnB and descends into the search tree to make assignments for variables sequentially, but the variants we discuss here make assignments for multiple variables all at once at the root node. Second, diving typically makes assignments for all variables, but these variants make assignments for only a subset of variables and then solve for the remaining variables with a MILP solver. One of these variants is Neural Diving (ND) [148], which learns to partially assign values to integer variables via imitation learning and delegates the reduced sub-MILP to a MILP solver, e.g., SCIP. The fraction of variables to assign values to is controlled by a hyperparameter called the coverage rate. A SelectiveNet [60] is trained for each coverage rate that jointly decides which variables to fix and the values to fix them to during testing. The two main disadvantages of ND are that (1) forcing variables to fixed values leads to low-quality or infeasible solutions if the predictions are not accurate enough and (2) it requires training multiple SelectiveNets to obtain the appropriate coverage rate, which is computationally expensive. To mitigate these issues, [70] propose another variant called Predict-and-Search (PaS) that deploys a search inspired by the trust region method. Instead of fixing variables, PaS searches for high-quality solutions within a predefined proximity of the predicted partial assignment, which improves feasibility and finds higher-quality solutions than ND. For both ND and PaS, the crucial decisions to make are which variables to assign values to and what values they should be assigned. Their effectiveness (i.e., the quality of the solution found) and efficiency (i.e., the speed at which high-quality solutions are found) depend on the accuracy of the machine learning prediction and the number of variables (controlled by hyperparameters) whose values are fixed.
We have mentioned important decisions in two MILP search algorithms, namely, which subset of variables to select to reoptimize in LNS and what values to assign to which subset of variables in PaS. In the past, ML methods that have been applied to improve them are mostly based on imitation learning [179, 177, 148, 70] or reinforcement learning [198, 177]. In this chapter, we propose a general contrastive learning (CL) [30, 103] framework to learn such strategies and demonstrate that the performance of MILP search algorithms, i.e., the runtime and solution quality, can be improved with ML-guided strategies. CL is an ML method that enhances the performance of ML-guided strategies by contrasting good and bad decision samples to learn attributes that are common among good decisions and attributes that set apart good decisions from bad ones. In particular, we introduce CL-LNS and ConPaS to show that the framework is applicable to both LNS and PaS. To apply this CL framework to algorithms for MILP solving, we first identify an important decision to make in the search. Then, we collect training data. The crucial step in data collection for CL is to design both positive and negative samples representing good and bad decisions, respectively. By contrasting positive and negative samples, CL learns to make discriminative predictions of the decisions. Empirically, we show that variants of MILP search algorithms with contrastive-learned strategies substantially outperform their imitation-learned and/or reinforcement-learned counterparts in terms of both runtime and solution quality. The results also demonstrate how a general CL framework can be applied to advance state-of-the-art MILP search algorithms, which provides useful guidance for improving ML-guided MILP solving.
3.2 Mixed Integer Linear Programs
A mixed integer linear program (MILP) $M = (A, b, c, q)$ is defined as

$$\min\; c^{\mathsf T} x \quad \text{s.t.} \quad Ax \le b, \quad x \in \{0, 1\}^q \times \mathbb{R}^{n-q}, \tag{3.1}$$

where $x = (x_1, \ldots, x_n)^{\mathsf T}$ denotes the $q$ binary variables and $n - q$ continuous variables to be optimized, $c \in \mathbb{R}^n$ is the vector of objective coefficients, and $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$ specify $m$ linear constraints. A solution $x$ is feasible if it satisfies all the constraints. Finding an optimal solution to the MILP is NP-hard. In this chapter, for the purpose of demonstrating our methodologies, we focus on the mixed-binary formulation above. However, both our methods CL-LNS and ConPaS can also handle general integers using the same engineering techniques introduced in [148] and [179].

Algorithm 7 LNS for MILPs
Input: A MILP $M$.
$x^0 \leftarrow$ Find an initial solution to $M$
$t \leftarrow 0$
while time limit not exceeded do
    $X^t \leftarrow$ Select a subset of binary variables to destroy
    $x^{t+1} \leftarrow$ Solve the MILP $M$ with additional constraints $\{x_i = x_i^t : i \le q \wedge x_i \notin X^t\}$
    $t \leftarrow t + 1$
return $x^t$
Linear Program (LP) Relaxation of a MILP. If we replace the integer constraints in Equation (3.1) with $x \in [0, 1]^q \times \mathbb{R}^{n-q}$, we obtain the linear program (LP) relaxation of the MILP. Finding an optimal solution to the LP relaxation takes polynomial time. The optimal objective value of the LP relaxation is a lower bound on the optimal objective value of the MILP. If the optimal solution to the LP relaxation satisfies the integer constraints, it is also an optimal solution to the MILP.
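To make the lower-bound property concrete, consider the following small instance, which is constructed here purely for illustration and does not come from the experiments:

```latex
\min\; -x_1 - x_2 \quad \text{s.t.} \quad 2x_1 + 2x_2 \le 3, \quad x \in \{0,1\}^2 .
% Feasible binary points: (0,0), (1,0), (0,1); the optimal MILP value is -1.
% LP relaxation (x \in [0,1]^2): the constraint reads x_1 + x_2 \le 1.5,
% so the optimal LP value is -1.5, attained for example at x = (1, 0.5).
% Hence -1.5 \le -1: the LP value bounds the MILP value from below, and the
% LP optimum (1, 0.5) is fractional, so it is not a solution to the MILP.
```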
3.3 Background
In this section, we provide detailed introductions to LNS for MILP solving, Neural Diving [148] and Predict-and-Search [70].
3.3.1 LNS for MILP solving
LNS is a heuristic algorithm that starts with an initial solution and then iteratively destroys and reoptimizes a part of the solution until a runtime limit is exceeded or some stopping condition is met. Let $M = (A, b, c, q)$ be the input MILP, where $A$, $b$ and $c$ are the coefficients and $q$ is the number of binary variables defined in Equation (3.1), and $x^0$ be the initial solution (typically found by running BnB for a short runtime). In iteration $t \ge 0$ of LNS, given the incumbent solution $x^t$, defined as the best solution found so far, a destroy heuristic selects a subset of $k^t$ binary variables $X^t = \{x_{i_1}, \ldots, x_{i_{k^t}}\}$. The reoptimization is done by solving a sub-MILP with $X^t$ being the variables while fixing the values of the binary variables $x_j \notin X^t$ to the same values as in $x^t$. The solution to the sub-MILP is the new incumbent solution $x^{t+1}$, and then LNS proceeds to iteration $t + 1$. Compared to BnB, LNS is more effective in improving the objective value $c^{\mathsf T} x$, especially on difficult and large-scale instances [177, 179, 198]. Compared to other local search methods, LNS explores a large neighborhood in each step and thus is more effective in avoiding local minima. LNS for MILPs is summarized in Algorithm 7.
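The loop in Algorithm 7 can be sketched on a toy problem as follows. This is a minimal illustration under several assumptions that are not part of the dissertation: the MILP is purely binary, the sub-MILP is solved by brute-force enumeration rather than by a MILP solver, the destroy heuristic is uniformly random, and an iteration budget stands in for the time limit. All function names are hypothetical.

```python
import itertools
import random


def objective(c, x):
    """Objective value c^T x."""
    return sum(ci * xi for ci, xi in zip(c, x))


def is_feasible(A, b, x):
    """Check Ax <= b row by row."""
    return all(sum(a * v for a, v in zip(row, x)) <= rhs for row, rhs in zip(A, b))


def solve_sub_milp(A, b, c, x, destroyed):
    """Re-optimize only the destroyed variables; all others keep their values in x.

    Brute force over 2^k assignments (assumption: k is tiny in this toy setting).
    """
    best = list(x)
    for values in itertools.product((0, 1), repeat=len(destroyed)):
        cand = list(x)
        for i, v in zip(destroyed, values):
            cand[i] = v
        if is_feasible(A, b, cand) and objective(c, cand) < objective(c, best):
            best = cand
    return best


def lns(A, b, c, x0, k, iterations, rng):
    """LNS with a random destroy heuristic and a fixed neighborhood size k."""
    x = list(x0)
    for _ in range(iterations):
        destroyed = rng.sample(range(len(x)), k)  # destroy heuristic
        x = solve_sub_milp(A, b, c, x, destroyed)  # new incumbent
    return x


# Toy knapsack-like instance: min -3x0 - 4x1 - 2x2 - 5x3  s.t.  2x0 + 3x1 + x2 + 4x3 <= 5.
A = [[2, 3, 1, 4]]
b = [5]
c = [-3, -4, -2, -5]
x_final = lns(A, b, c, x0=[0, 0, 0, 0], k=2, iterations=20, rng=random.Random(0))
```

Because the incumbent itself is always among the candidates of the sub-MILP, the objective value never gets worse from one iteration to the next.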
Adaptive Neighborhood Size. Adaptive methods are commonly used to set the neighborhood size $k^t$ in previous work [179, 86]. The initial neighborhood size $k^0$ is set to a constant or a fraction of the number of binary variables. In this chapter, we consider the following adaptive method [86]: in iteration $t$, if LNS finds an improved solution, we let $k^{t+1} = k^t$, otherwise $k^{t+1} = \min\{\gamma \cdot k^t, \beta \cdot n\}$, where $\gamma > 1$ is a constant and we upper bound $k^t$ to a constant fraction $\beta < 1$ of the number of binary variables to make sure the sub-MILP is not too large (and thus too difficult) to solve. Adaptively setting $k^t$ helps LNS escape local minima by expanding the search neighborhood when it fails to improve the solution.
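The update rule above is small enough to state directly in code. The sketch below follows the formula as written; the default values of `gamma` and `beta` and the truncation to an integer are assumptions made for illustration, not values reported in this section.

```python
def next_neighborhood_size(k, improved, n, gamma=1.5, beta=0.5):
    """Return k^{t+1} given k^t.

    k        : current neighborhood size k^t
    improved : whether LNS found an improved solution in iteration t
    n        : the count used in the cap beta * n, as in the formula above
    gamma    : growth factor (> 1); the default is a hypothetical value
    beta     : cap fraction (< 1); the default is a hypothetical value
    """
    if improved:
        return k
    # Truncating to an integer is an assumption; the formula itself does not round.
    return int(min(gamma * k, beta * n))
```

For example, with `n = 1000`, a failed iteration at `k = 100` grows the neighborhood to 150, while a failed iteration at `k = 400` is capped at 500.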
3.3.1.1 Local Branching Heuristic
The LB heuristic [56] was originally proposed as a primal heuristic in BnB but is also applicable in LNS for MILP solving [179, 131]. Given the incumbent solution $x^t$ in iteration $t$ of LNS, LB aims to find the subset of binary variables to destroy $X^t$ such that it leads to the optimal $x^{t+1}$ that differs from $x^t$ on at most $k^t$ variables, i.e., it computes the optimal solution $x^{t+1}$ that sits within a given Hamming ball of radius $k^t$ centered around $x^t$. To find $x^{t+1}$, the LB heuristic solves the LB MILP, which is exactly the same MILP as the input but with one additional constraint that limits the distance between $x^t$ and $x^{t+1}$:

$$\sum_{i \le q :\, x_i^t = 0} x_i^{t+1} + \sum_{i \le q :\, x_i^t = 1} \left(1 - x_i^{t+1}\right) \le k^t.$$

The LB MILP is of the same size as the input MILP (i.e., it has the same number of variables and one more constraint); therefore, it is often too slow to be useful in practice.
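The local branching constraint is linear in $x^{t+1}$: moving the constants to the right-hand side gives coefficient $+1$ for binary variables that are 0 in the incumbent, $-1$ for those that are 1, and right-hand side $k^t$ minus the number of ones. The following sketch builds that row; the function names are hypothetical, and the binary variables are assumed to occupy the first `q` positions, as in Equation (3.1).

```python
def local_branching_constraint(x_inc, q, k):
    """Return (coeffs, rhs) so that sum(coeffs[i] * x[i]) <= rhs encodes
    'x differs from x_inc on at most k of the first q (binary) variables'."""
    coeffs = [0] * len(x_inc)
    ones = 0
    for i in range(q):
        if round(x_inc[i]) == 0:
            coeffs[i] = 1       # term x_i
        else:
            coeffs[i] = -1      # term (1 - x_i); the constant 1 moves to the rhs
            ones += 1
    return coeffs, k - ones


def hamming_distance(x_inc, x_new, q):
    """Number of binary variables on which the two solutions differ."""
    return sum(1 for i in range(q) if round(x_inc[i]) != round(x_new[i]))


def satisfies(coeffs, rhs, x):
    return sum(a * v for a, v in zip(coeffs, x)) <= rhs


# Four binary variables followed by one continuous variable.
x_inc = [1, 0, 1, 0, 2.5]
coeffs, rhs = local_branching_constraint(x_inc, q=4, k=2)
```

Here `coeffs` is `[-1, 1, -1, 1, 0]` and `rhs` is `0`; a candidate such as `[1, 1, 0, 0, 7.0]` (Hamming distance 2) satisfies the row, while `[0, 1, 0, 0, 7.0]` (distance 3) violates it.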
3.3.1.2 Local Branching Relaxation Heuristic
We propose the Local Branching Relaxation (LB-RELAX) heuristic in [86], which first solves the LP relaxation of the LB MILP and then selects variables $X^t$ to destroy based on the LP relaxation solution. Specifically, given a MILP and the incumbent solution $x^t$ in iteration $t$, we construct the LB MILP with neighborhood size $k^t$ and solve its LP relaxation. Let $\bar{x}^{t+1}$ be the LP relaxation solution to the LB MILP. Also, let $\Delta_i^t = |\bar{x}_i^{t+1} - x_i^t|$ and $\bar{X}^t = \{x_i : \Delta_i^t > 0, i \le q\}$. To construct $X^t$ (the set of variables to destroy), LB-RELAX greedily selects $k^t$ variables with the largest $\Delta_i^t$ from $\bar{X}^t$ and breaks ties uniformly at random. If $\bar{X}^t$ has fewer than $k^t$ variables, we select all of the variables in $\bar{X}^t$ and $k^t - |\bar{X}^t|$ from the rest of the binary variables uniformly at random. Intuitively, LB-RELAX greedily selects the variables whose values are more likely to change in the incumbent solution $x^t$ after solving the LB MILP. In [86], we propose two other variants of LB-RELAX with randomization. Empirically, LB-RELAX runs much faster than the LB heuristic. It also improves solutions faster than several state-of-the-art methods on some problems but not on others.
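The selection step of LB-RELAX, given an already computed LP relaxation solution, might look as follows. This sketch covers only the variable selection; solving the LP relaxation is left to an LP solver and is not shown. The function name is hypothetical, `k <= q` is assumed, and the exact comparison `delta > 0` mirrors the definition above (a practical implementation would likely use a small numerical tolerance).

```python
import random


def lb_relax_select(x_inc, x_lp, q, k, rng):
    """Pick k binary variable indices to destroy from the LP relaxation solution.

    x_inc : incumbent solution x^t
    x_lp  : LP relaxation solution of the LB MILP
    q     : number of binary variables (assumed to be indices 0..q-1)
    k     : neighborhood size k^t, assumed to satisfy k <= q
    """
    delta = [abs(x_lp[i] - x_inc[i]) for i in range(q)]
    changed = [i for i in range(q) if delta[i] > 0]
    if len(changed) >= k:
        # Shuffle first, then stable-sort by delta: ties are broken uniformly at random.
        rng.shuffle(changed)
        changed.sort(key=lambda i: delta[i], reverse=True)
        return sorted(changed[:k])
    # Too few changed variables: take all of them and fill up at random.
    rest = [i for i in range(q) if delta[i] == 0]
    return sorted(changed + rng.sample(rest, k - len(changed)))


x_inc = [1, 0, 1, 0, 0]
x_lp = [0.2, 0.0, 1.0, 0.7, 0.1]
top_two = lb_relax_select(x_inc, x_lp, q=5, k=2, rng=random.Random(0))
top_four = lb_relax_select(x_inc, x_lp, q=5, k=4, rng=random.Random(0))
```

In this example the differences are `[0.8, 0, 0, 0.7, 0.1]`, so `top_two` contains variables 0 and 3, and `top_four` contains variables 0, 3 and 4 plus one randomly chosen unchanged variable.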
3.3.2 Neural Diving
ND [148] learns to generate a Bernoulli distribution for the solution values of binary variables. It learns the conditional distribution of the solution $x$ given a MILP $M = (A, b, c, q)$ defined as

$$p(x \mid M) = \frac{\exp(-E(x \mid M))}{\sum_{x' \in S_p^M} \exp(-E(x' \mid M))},$$

where $S_p^M$ is a set of optimal or near-optimal solutions to $M$ and $E(x \mid M)$ is an energy function of a solution $x$ defined as $c^{\mathsf T} x$ if $x$ is feasible or $\infty$ otherwise. ND learns $p_\theta(x \mid M)$ parameterized by a graph convolutional network (GCN) to approximate $p(x \mid M)$, assuming conditional independence between variables, $p(x \mid M) \approx \prod_{i \le q} p_\theta(x_i \mid M)$. Since the full prediction $p_\theta(x \mid M)$ might not give a feasible solution, ND predicts only a partial solution controlled by the coverage rate and employs SelectiveNet [60] to learn which variables' values to predict for each coverage rate. ND uses the binary cross-entropy loss combined with the loss function for SelectiveNet to train the neural network. During testing, the input MILP $M$ is then reduced to a smaller MILP by fixing the selected variables.
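The two distributions above can be illustrated numerically. The sketch below computes the energy-based target distribution over a given solution set and the factorized probability of a binary assignment; it assumes every solution in the set is feasible (so its energy is simply $c^{\mathsf T} x$), and the function names are hypothetical.

```python
import math


def target_distribution(c, solutions):
    """p(x | M) over a set of feasible solutions: a softmax of the negative energies."""
    energies = [sum(ci * xi for ci, xi in zip(c, x)) for x in solutions]
    lowest = min(energies)  # subtracting the minimum keeps exp() numerically stable
    weights = [math.exp(-(e - lowest)) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


def factorized_probability(p, x):
    """prod_i p_theta(x_i | M), where p[i] is the predicted probability that x_i = 1."""
    prob = 1.0
    for pi, xi in zip(p, x):
        prob *= pi if xi == 1 else 1.0 - pi
    return prob


c = [-1, -2]
probs = target_distribution(c, [[1, 1], [0, 1]])   # energies -3 and -2
joint = factorized_probability([0.9, 0.2], [1, 0])  # 0.9 * (1 - 0.2)
```

The solution with the lower objective value receives the higher target probability, which is what makes the distribution a sensible training target for a minimization problem.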
3.3.3 PredictandSearch
Predict-and-Search (PaS) [70] uses the same framework as ND to learn to predict $p(x \mid M)$. Instead of using SelectiveNet to learn to fix variables, PaS searches for near-optimal solutions within a neighborhood based on the prediction. Specifically, given the prediction $p_\theta(x_i \mid M)$ for each binary variable, PaS greedily selects $k_0$ binary variables $X_0$ with the smallest $p_\theta(x_i \mid M)$ and $k_1$ binary variables $X_1$ with the largest $p_\theta(x_i \mid M)$, such that $X_0$ and $X_1$ are disjoint ($k_0 + k_1 \le q$). PaS fixes all variables in $X_0$ to 0 and all variables in $X_1$ to 1 in the sub-MILP, but also allows $\Delta \ge 0$ of the fixed variables to be flipped when solving it. Formally, let $B(X_0, X_1, \Delta) = \{x : \sum_{x_i \in X_0} x_i + \sum_{x_i \in X_1} (1 - x_i) \le \Delta\}$ and $D$ be the feasible region of the original MILP. PaS solves the following optimization problem:

$$\min\; c^{\mathsf T} x \quad \text{s.t.} \quad x \in D \cap B(X_0, X_1, \Delta). \tag{3.2}$$

Restricting the solution space to $B(X_0, X_1, \Delta)$ can be seen as a generalization of the fixing strategy employed in ND, which corresponds to $\Delta = 0$. In ND, however, $X_0$ and $X_1$ are constructed using sampling methods based on the neural network output.
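The greedy selection of $X_0$ and $X_1$ and the membership test for $B(X_0, X_1, \Delta)$ can be sketched as follows. The function names are hypothetical, the predictions are given as a plain list of probabilities, and how ties between equal predictions are broken is an assumption (here: by index order).

```python
def select_fixings(p, k0, k1):
    """Return (X0, X1): indices with the k0 smallest and the k1 largest predictions.

    Assumes k0 + k1 <= len(p), which guarantees that X0 and X1 are disjoint.
    """
    order = sorted(range(len(p)), key=lambda i: p[i])
    x0 = order[:k0]
    x1 = order[len(order) - k1:] if k1 > 0 else []
    return x0, x1


def in_neighborhood(x, x0, x1, delta):
    """Check x in B(X0, X1, delta): at most delta of the predicted values are flipped."""
    flips = sum(x[i] for i in x0) + sum(1 - x[i] for i in x1)
    return flips <= delta


p = [0.9, 0.1, 0.5, 0.05, 0.8]
x0, x1 = select_fixings(p, k0=2, k1=2)  # X0 = {3, 1}, X1 = {4, 0}
```

With `delta = 0` the test accepts only assignments that agree with the prediction on all of `x0` and `x1`, which is the hard-fixing special case; larger values of `delta` enlarge the search region.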
3.4 Related Work
In this section, we summarize related work on LNS for MILPs and other COPs, LNS-based primal heuristics
in BnB, learning to solve MILPs with BnB, solution predictions for COPs and contrastive learning for COPs.
3.4.1 LNS for MILPs and Other COPs
A huge effort has been made to improve BnB for MILPs in the past decades, but LNS for MILPs has not been studied extensively. Recently, [177] show that even a randomized destroy heuristic in LNS can outperform state-of-the-art BnB. They also show that an ML-guided decomposition-based LNS can achieve even better performance, where they apply reinforcement learning and imitation learning to learn destroy heuristics that decompose the set of variables into equally-sized subsets using a classification loss. [179] learn to select variables by imitating LB. RL-LNS [198] uses a similar framework but is trained with reinforcement learning and outperforms [177]. Both [198] and [179] use the bipartite graph representations of MILPs to learn the destroy heuristics represented by GCNs. Another line of related work focuses on improving the LB heuristic. [131] use ML to tune the runtime limit and neighborhood sizes for LB. [86] propose LB-RELAX to select variables by solving the LP relaxation of LB.
Besides MILPs, LNS has been applied to solve many COPs, such as VRP [166, 8], TSP [176], scheduling [108, 215] and MAPF [119, 117, 91]. ML methods have also been applied to improve LNS for those applications [32, 133, 82, 127, 90].
3.4.2 LNS-Based Primal Heuristics in BnB
LNS-based primal heuristics are a family of primal heuristics in BnB and have been studied extensively. With the same purpose of improving primal bounds, the main differences between the LNS-based primal heuristics in BnB and LNS for MILPs are: (1) LNS-based primal heuristics are executed periodically at different search tree nodes during the search and the execution schedule is itself dynamic because they are often more expensive to run than the other primal heuristics in BnB; (2) the destroy heuristics in LNS-based primal heuristics are often designed to use information specific to BnB, such as the dual bound and the LP relaxation at a search tree node, and they are not directly applicable in LNS for MILPs in our setting.
Next, we briefly summarize the destroy heuristics in LNS-based primal heuristics:
• Crossover Heuristics [169]: It destroys variables that have different values in a set of selected known solutions (typically two).
• Mutation Heuristics [169]: It destroys a random subset of variables.
• Relaxation Induced Neighborhood Search [39]: It destroys variables whose values disagree in the solution of the LP relaxation at the search tree node and the incumbent solution.
• Relaxation Enforced Neighborhood Search [20]: It restricts the neighborhood to be the feasible roundings of the LP relaxation at the current search tree node.
• Local Branching [56]: It restricts the neighborhood to a ball around the current incumbent solution.
• Distance Induced Neighborhood Search [61]: It takes the intersection of the neighborhoods of the Crossover, Local Branching and Relaxation Induced Neighborhood Search heuristics.
• Graph-Induced Neighborhood Search [142]: It destroys the breadth-first-search neighborhood of a variable in the bipartite graph representation of the MILP.
Recently, an adaptive LNS primal heuristic [75] has been proposed to combine the power of these heuristics, where it essentially solves a multi-armed bandit problem to choose which heuristic to apply.
3.4.3 Learning to Solve MILPs with BnB
Several studies have applied ML to improve BnB. The majority of works focus on learning to either select
variables to branch on [102, 59, 68, 211] or select nodes to expand [73, 110]. There are also works on
learning to schedule and run primal heuristics [100, 34] and to select cutting planes [184, 154, 92].
3.4.4 Solution Predictions for COPs
There are other works on learning to predict solutions to MILPs in addition to ND and PaS. [44] learn to predict backbone variables [48], whose values stay unchanged across different optimal and near-optimal solutions, and then search for optimal solutions based on the predicted backbone variables. However, this method is not applicable to many COPs since backbone variables do not necessarily exist for them. Recently, [203] propose threshold-aware learning to optimize the coverage rate in ND, which is one of the state-of-the-art methods. However, this method also fixes variables when solving the sub-MILP. [101] and [130] learn to guide decision-making, such as warm-starting and node selection, in COP solvers, such as MIP solvers and local search, via solution predictions.
3.4.5 Contrastive Learning for COPs
While contrastive learning of visual representations [77, 74, 30] and graph representations [205, 188] has been studied extensively, it has not been explored much for COPs. [147] derive a contrastive loss for decision-focused learning to solve COPs with uncertain inputs that can be learned from historical data, where they view non-optimal solutions as negative samples. [46] use contrastive pre-training to learn good representations for the Boolean satisfiability problem.
3.5 A Contrastive Learning Framework for Learning Decision-Making
Strategies
In this section, we introduce a general ML framework based on contrastive learning to learn decision-making strategies for MILP solving. As will be shown in Sections 3.6 and 3.7, this framework has been successfully applied to different tasks for MILP solving. We employ CL rather than other learning methods, such as imitation learning and reinforcement learning, because it has been theoretically demonstrated to be effective [187] and has empirically outperformed them in combinatorial optimization problems [46, 147] and other problem domains [51]. The framework consists of the following steps:
1. Identify a Decision to Improve. Given a MILP search algorithm, identify a decision that is crucial to its performance. The goal is to learn a strategy that improves how this decision is made.
2. Data Collection. One of the crucial steps in CL is to design both positive and negative samples. Similar to imitation learning, CL learns from positive samples, which are high-quality demonstrations of the decisions and can be acquired from an expert. Unlike imitation learning, CL requires negative samples, which are low-quality or infeasible demonstrations of the decisions. It is encouraged to find negative samples that are deceptively similar to positive ones since this has been shown analytically to be beneficial for CL [187]. For features, one of the popular techniques to featurize MILPs is using their bipartite graph representation [59], which is often used with a graph neural network. In this chapter, we use this engineering technique, but the framework is compatible with others, such as those introduced in [102] and [177].
3. Model Learning with a Contrastive Loss. The goal is to learn a model to predict decisions that are as similar to the positive samples as possible and, at the same time, dissimilar to the negative samples. A contrastive loss is a function whose value is low when this holds true. In this chapter, we utilize a form of supervised contrastive loss, called InfoNCE [151, 74], but this framework is compatible with other contrastive losses, such as the margin loss [195] and the triplet loss [160].
4. ML-Guided Search. Once we have a trained ML model, we plug it into the MILP search algorithm as a decision-making strategy.

Figure 3.1: An overview of training and data collection for CL-LNS. For each MILP instance for training, we run several LNS iterations with LB. In each iteration, we collect both positive and negative neighborhood samples and add them to the training dataset, which is used in downstream supervised contrastive learning for neighborhood selections.
3.6 Contrastive Large Neighborhood Search
In this section, we introduce CL-LNS to show how the framework can be applied to learn efficient and effective destroy heuristics. Similar to IL-LNS [179], we learn to imitate the LB heuristic, a destroy heuristic that selects the optimal subset of variables within the Hamming ball of the incumbent solution. LB requires solving another MILP with the same size as the original problem and thus is computationally expensive. We not only use the optimal subsets provided by LB as the expert demonstration (as in IL-LNS) but also leverage intermediate solutions and perturbations. When solving the MILP for LB, intermediate solutions are found, and those that are close to optimal in terms of effectiveness become positive samples. We also collect negative samples by randomly perturbing the optimal subsets. With both positive and negative samples, instead of a classification loss as in IL-LNS, we use a contrastive loss that encourages the model to predict subsets that are similar to the positive samples but dissimilar to the negative ones, with similarity measured by dot products [151, 74]. Finally, we also use a richer set of features and graph attention networks (GAT) instead of GCN to further boost performance. Empirically, we show that CL-LNS outperforms state-of-the-art ML-guided and non-ML-guided versions of LNS at different runtime cutoffs ranging from a few minutes to an hour in terms of multiple metrics, including the primal gap, the primal integral, the best performing rate and the survival rate, demonstrating the effectiveness and efficiency of CL-LNS. In addition, CL-LNS shows great generalization performance on test instances 100% larger than training instances.
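As a rough illustration of a contrastive loss with dot-product similarity, the sketch below implements a generic supervised InfoNCE-style loss: for each positive sample, the negative log of its softmax weight among itself and all negative samples, averaged over the positives. This is a common textbook form and only an assumption about the general shape of the loss; the temperature `tau`, the averaging and the function names are hypothetical, and the exact loss used for CL-LNS may differ.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(pred, positives, negatives, tau=1.0):
    """Generic InfoNCE-style loss with dot-product similarity.

    pred      : predicted score vector (one entry per binary variable)
    positives : list of positive action vectors
    negatives : list of negative action vectors
    tau       : temperature (hypothetical default)
    """
    total = 0.0
    for pos in positives:
        logits = [dot(pred, pos) / tau] + [dot(pred, neg) / tau for neg in negatives]
        peak = max(logits)  # log-sum-exp trick for numerical stability
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[0]  # equals -log softmax(logits)[0]
    return total / len(positives)


positive = [1, 1, 0, 0]
negative = [0, 0, 1, 1]
loss_good = info_nce([1, 1, 0, 0], [positive], [negative])  # prediction matches the positive
loss_bad = info_nce([0, 0, 1, 1], [positive], [negative])   # prediction matches the negative
```

The loss is small when the prediction aligns with the positive sample and large when it aligns with the negative one, which is the behavior a contrastive loss is meant to reward and penalize, respectively.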
3.6.1 Machine Learning Methodology
Our goal is to learn a policy, a destroy heuristic represented by an ML model, that selects a subset of variables to destroy and reoptimize in each LNS iteration. Specifically, let $s^t = (M, x^t)$ be the current state in iteration $t$ of LNS, where $M = (A, b, c, q)$ is the MILP and $x^t$ is the incumbent solution. The policy predicts an action $a^t = (a_1^t, \ldots, a_q^t) \in \{0, 1\}^q$, a binary representation of the selected binary variables $X^t$ indicating whether $x_i$ is selected ($a_i^t = 1$) or not ($a_i^t = 0$). We use contrastive learning to learn to predict high-quality $a^t$ such that, after solving the sub-MILP derived from $a^t$ (or $X^t$), the resulting incumbent solution $x^{t+1}$ is improved as much as possible. Next, we describe our novel data collection process, the policy network and the contrastive loss used in training. An overview of our training and data collection pipeline is shown in Figure 3.1. Finally, we introduce how the learned policy is used in CL-LNS.
3.6.1.1 Data Collection
Following previous work [179], we use LB as the expert policy to collect good demonstrations to learn to imitate. Formally, for a given state $s^t = (M, x^t)$, we use LB to find the optimal action $a^t$ that leads to the minimum $c^{\mathsf T} x^{t+1}$ after solving the sub-MILP. Different from previous work [179, 177], we use contrastive learning to learn to make discriminative predictions of $a^t$ by contrasting positive and negative samples (i.e., good and bad examples of actions $a^t$). In the following, we describe how we collect the positive sample set $S_p^t$ and the negative sample set $S_n^t$.
Collecting Positive Samples $S_p^t$. During data collection, given $s^t = (M, x^t)$, we solve the LB MILP with the incumbent solution $x^t$ and neighborhood size $k^t$ to find the optimal $x^{t+1}$. LNS proceeds to iteration $t + 1$ with $x^{t+1}$ until no improving solution $x^{t+1}$ can be found by the LB MILP within a runtime limit. In experiments, the LB MILP is solved with SCIP 8.0.1 [21] with a one-hour runtime limit, and $k^t$ is fine-tuned for each type of instance. After each solve of the LB MILP, in addition to the best solution found, SCIP records all intermediate solutions found during the solve. We look for intermediate solutions $x'$ whose resulting improvements on the objective value are at least $0 < \alpha_p \le 1$ times the best improvement (i.e., $c^{\mathsf T}(x^t - x') \ge \alpha_p \cdot c^{\mathsf T}(x^t - x^{t+1})$) and consider their corresponding actions as positive samples. We limit the number of positive samples $|S_p^t|$ to $u_p$. If more than $u_p$ positive samples are available, we record the top $u_p$ ones to avoid large computational overhead with too many samples when computing the contrastive loss (see Section 3.6.1.3). $\alpha_p$ and $u_p$ are set to 0.5 and 10, respectively, in experiments.
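The filtering rule for positive samples can be sketched as follows. Two points in this sketch are assumptions rather than statements from the text: "top $u_p$" is interpreted as the $u_p$ solutions with the best objective values, and the action corresponding to a solution $x'$ is taken to mark exactly those binary variables whose values differ from the incumbent. All function names are hypothetical.

```python
def objective(c, x):
    return sum(ci * xi for ci, xi in zip(c, x))


def select_positive_solutions(c, x_inc, x_best, solutions, alpha_p=0.5, u_p=10):
    """Keep solutions whose improvement is at least alpha_p times the best improvement."""
    obj_inc = objective(c, x_inc)
    best_gain = obj_inc - objective(c, x_best)
    good = [x for x in solutions if obj_inc - objective(c, x) >= alpha_p * best_gain]
    good.sort(key=lambda x: objective(c, x))  # assumption: "top" means best objective first
    return good[:u_p]


def action_from_solution(x_inc, x_new, q):
    """Assumed mapping: a_i = 1 iff binary variable i changed relative to the incumbent."""
    return [1 if x_new[i] != x_inc[i] else 0 for i in range(q)]


c = [1, 1, 1, 1]
x_inc = [1, 1, 1, 1]               # objective 4
x_best = [0, 0, 0, 1]              # objective 1, best improvement 3
found = [[0, 1, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]]
positives = select_positive_solutions(c, x_inc, x_best, found)
```

With `alpha_p = 0.5` the threshold is an improvement of 1.5, so the first intermediate solution (improvement 1) is dropped and the other two are kept, best first.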
Collecting Negative Samples $S_n^t$. Negative samples are critical parts of contrastive learning to help distinguish between good and bad demonstrations. We collect a set of $c_n^t$ negative samples $S_n^t$, where $c_n^t = \kappa |S_p^t|$ and $\kappa$ is a hyperparameter to control the ratio between the numbers of positive and negative samples. Suppose $X^t$ is the optimal set of variables selected by LB. We then perturb $X^t$ to get $\hat{X}^t$ by replacing 5% of the variables in $X^t$ with the same number of binary variables not in $X^t$, chosen uniformly at random. We then solve the corresponding sub-MILP derived from $\hat{X}^t$ to get a new incumbent solution $\hat{x}^{t+1}$. If the resulting improvement of $\hat{x}^{t+1}$ is at most $0 \le \alpha_n < 1$ times the best improvement (i.e., $c^{\mathsf T}(x^t - \hat{x}^{t+1}) \le \alpha_n \cdot c^{\mathsf T}(x^t - x^{t+1})$), we consider its corresponding action as a negative sample. We repeat this $c_n^t$ times to collect negative samples. If fewer than $c_n^t$ negative samples are collected, we increase the perturbation rate from 5% to 10% and generate another $c_n^t$ samples. We keep increasing the perturbation rate in increments of 5% until $c_n^t$ negative samples are found or it reaches 100%. In experiments, we set $\kappa = 9$ and $\alpha_n = 0.05$, and it takes less than 5 minutes to collect negative samples for each state.
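The perturbation schedule can be sketched as below. The sub-MILP solve is abstracted as a callable `solve_sub_milp(subset)` that returns the objective value of the resulting incumbent; this callable, the function names and the rule of replacing at least one variable per perturbation are assumptions for illustration. Duplicate candidates are not filtered, since the text does not specify how they are handled.

```python
import random


def perturb(subset, q, rate, rng):
    """Replace a `rate` fraction of `subset` with binary variables outside of it."""
    inside = list(subset)
    chosen = set(inside)
    outside = [i for i in range(q) if i not in chosen]
    # Assumption: at least one variable is replaced whenever that is possible.
    m = min(max(1, round(rate * len(inside))), len(inside), len(outside))
    removed = set(rng.sample(inside, m))
    added = rng.sample(outside, m)
    return sorted([i for i in inside if i not in removed] + added)


def collect_negative_samples(subset, q, obj_inc, obj_best, solve_sub_milp,
                             target, alpha_n=0.05, rng=None):
    """Collect up to `target` perturbed subsets whose improvement is at most
    alpha_n times the best improvement, raising the perturbation rate in 5% steps."""
    rng = rng or random.Random()
    best_gain = obj_inc - obj_best
    negatives = []
    for percent in range(5, 101, 5):
        for _ in range(target):
            candidate = perturb(subset, q, percent / 100, rng)
            gain = obj_inc - solve_sub_milp(candidate)
            if gain <= alpha_n * best_gain:
                negatives.append(candidate)
        if len(negatives) >= target:
            break
    return negatives[:target]


lb_subset = list(range(20))
# Mock solver for illustration: every perturbed subset fails to improve the incumbent.
negatives = collect_negative_samples(lb_subset, 40, 10.0, 4.0, lambda s: 10.0,
                                     target=9, rng=random.Random(1))
```

In a real pipeline, `solve_sub_milp` would call a MILP solver on the sub-MILP induced by the candidate subset, which is where most of the data collection time is spent.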
3.6.1.2 Neural Network Architecture
Following previous work on learning for MILPs [59, 179, 198], we use a bipartite graph representation of the MILP to encode a state $s^t$. The bipartite graph consists of $n + m$ nodes representing the $n$ variables and $m$ constraints on two sides, respectively, with an edge connecting a variable and a constraint if the variable has a nonzero coefficient in the constraint. Following [179], we use the features proposed in [59] for node features and edge features in the bipartite graph and also include a fixed-size window of the most recent incumbent values as variable node features, with the window size set to 3 in experiments. In addition to the features used in [179], we include features proposed in [102], computed at the root node of BnB, to obtain a richer set of variable node features.
We learn a policy $\pi_\theta(\cdot)$ represented by a GAT [23] parameterized by learnable weights $\theta$. The policy takes as input the state $s^t$ and outputs a score vector $\pi_\theta(s^t) \in [0, 1]^q$, one score per variable. To increase the modeling capacity and to manipulate node interactions proposed by our architecture, we use embedding layers to map each node feature and edge feature to the space $\mathbb{R}^d$. Let $v_j, c_i, e_{i,j} \in \mathbb{R}^d$ be the embeddings of the $j$-th variable, the $i$-th constraint and the edge connecting them output by the embedding layers. Since our graph is bipartite, following previous work [59], we perform two rounds of message passing through the GAT. In the first round, each constraint node $c_i$ attends to its neighbors $N_i$ using an attention structure with $H$ attention heads to get updated constraint embeddings $c'_i$ (computed as a function of $v_j, c_i, e_{i,j}$). In the second round, similarly, each variable node attends to its neighbors to get updated variable embeddings $v'$ (computed as a function of $v_j, c'_i, e_{i,j}$) with another set of attention weights. After the two rounds of message passing, the final representations of variables $v'$ are passed through a multilayer perceptron (MLP) to obtain a scalar value for each variable and, finally, we apply the sigmoid function to get a score between 0 and 1. Full details of the network architecture are provided in Appendix. In experiments, $d$ and $H$ are set to 64 and 8, respectively.
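To make the bipartite encoding concrete, the sketch below builds the edge list and the two neighbor maps from a dense constraint matrix. It covers only the graph structure (no features, no attention), and the function name `build_bipartite_graph` and the dense list-of-lists input are illustrative assumptions.

```python
def build_bipartite_graph(A):
    """Build the variable-constraint bipartite graph of a MILP.

    A is an m x n constraint matrix given as a list of rows. An edge (i, j)
    exists when variable j has a nonzero coefficient in constraint i.
    """
    m = len(A)
    n = len(A[0]) if m > 0 else 0
    edges = [(i, j, A[i][j]) for i in range(m) for j in range(n) if A[i][j] != 0]
    constraint_neighbors = {i: [] for i in range(m)}  # read in the first round
    variable_neighbors = {j: [] for j in range(n)}    # read in the second round
    for i, j, _ in edges:
        constraint_neighbors[i].append(j)
        variable_neighbors[j].append(i)
    return edges, constraint_neighbors, variable_neighbors
```

The first round of message passing aggregates over `constraint_neighbors` (variables into constraints) and the second over `variable_neighbors` (constraints back into variables), with the coefficient stored on each edge available as an edge feature.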
3.6.1.3 Model Learning with a Contrastive Loss
Given a set of MILP instances for training, we follow the expert’s trajectory to collect training data. Let $D_{\text{CL-LNS}} = \{(s, S_p, S_n)\}$ be the set of states with their corresponding sets of positive and negative samples in the training data. A contrastive loss is a function whose value is low when the predicted action $\pi_\theta(s)$ is similar to the positive samples $S_p$ and dissimilar to the negative samples $S_n$. With similarity measured by dot products, a form of supervised contrastive loss, called InfoNCE [151, 74], is used for CL-LNS:
$$\mathcal{L}^{\text{CL-LNS}}(\theta) = \sum_{(s, S_p, S_n) \in D_{\text{CL-LNS}}} \frac{-1}{|S_p|} \sum_{a \in S_p} \log \frac{\exp(a^T \pi_\theta(s) / \tau)}{\sum_{a' \in S_n \cup \{a\}} \exp(a'^T \pi_\theta(s) / \tau)}$$
where $\tau$ is a temperature hyperparameter set to 0.07 [74] in experiments.
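As a sanity check on the loss, the following sketch evaluates the InfoNCE term for a single state in plain Python. It assumes that actions are encoded as 0/1 vectors over the variables (an assumption for illustration) and uses a log-sum-exp shift for numerical stability; it is not the training code.

```python
import math

def info_nce_loss(scores, positives, negatives, tau=0.07):
    """InfoNCE term for one state: `scores` plays the role of pi_theta(s),
    `positives` and `negatives` are lists of action vectors."""
    def similarity(a):
        return sum(ai * si for ai, si in zip(a, scores)) / tau

    negative_logits = [similarity(a) for a in negatives]
    total = 0.0
    for a in positives:
        positive_logit = similarity(a)
        logits = negative_logits + [positive_logit]
        shift = max(logits)  # log-sum-exp trick to avoid overflow
        log_denominator = shift + math.log(sum(math.exp(l - shift) for l in logits))
        total += positive_logit - log_denominator
    return -total / len(positives)
```

The loss is close to zero when the score vector has a much larger dot product with the positive actions than with the negative ones, and it grows roughly linearly in the dot-product margin (divided by $\tau$) when the ordering is reversed.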
3.6.1.4 ML-Guided Search
During testing, we apply the learned policy $\pi_\theta$ in LNS. In iteration $t$, let $(v_1, \cdots, v_q) := \pi_\theta(s^t)$ be the variable scores output by the policy. To select $k^t$ variables, CL-LNS greedily selects those with the highest scores. Previous works [179, 198] commonly use sampling methods to select the variables, but those sampling methods are empirically worse than our greedy method in CL-LNS. However, when the adaptive neighborhood size $k^t$ reaches its upper bound $\beta \cdot q$, CL-LNS may repeat the same prediction due to the deterministic selection process. When this happens, we switch to the sampling method introduced in [179]. The sampling method selects variables sequentially: at each step, a variable $x_i$ that has not been selected yet is selected with probability proportional to $v_i^\eta$, where $\eta$ is a temperature parameter set to 0.5 in experiments.
                 Small Instances                      Large Instances
Name             MVC-S    MIS-S    CA-S    SC-S       MVC-L     MIS-L    CA-L    SC-L
#Variables       1,000    6,000    4,000   4,000      2,000     12,000   8,000   8,000
#Constraints     65,100   23,977   2,675   5,000      135,100   48,027   5,353   5,000
Table 3.1: Names and the average numbers of variables and constraints of the test instances.
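The greedy rule and the sequential sampling fallback of subsection 3.6.1.4 can be sketched as follows. The function names are hypothetical, and the uniform fallback when all remaining weights are zero is an added safeguard rather than something stated in the text.

```python
import random

def greedy_select(scores, k):
    """Pick the k variables with the highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def sample_select(scores, k, eta=0.5, rng=None):
    """Pick k variables one at a time, each with probability proportional
    to score ** eta among the variables not selected yet."""
    rng = rng or random.Random(0)
    remaining = list(range(len(scores)))
    chosen = []
    for _ in range(min(k, len(remaining))):
        weights = [scores[i] ** eta for i in remaining]
        if sum(weights) <= 0:
            weights = [1.0] * len(remaining)  # safeguard: uniform if all zero
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen
```

An exponent $\eta < 1$ flattens the score distribution, so the sampling variant explores more than the greedy rule while still favoring highly scored variables.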
3.6.2 Empirical Evaluation
In this subsection, we demonstrate the efficiency and effectiveness of CL-LNS through experiments. In the following, we introduce our evaluation setup and then present the results.
3.6.2.1 Setup
Instance Generation We evaluate on four NP-hard MILP problems that are widely used in existing studies [198, 177, 173], which consist of two graph optimization problems, namely the minimum vertex cover (MVC) and maximum independent set (MIS) problems, and two non-graph optimization problems, namely the combinatorial auction (CA) and set covering (SC) problems. We first generate 100 small test instances for each MILP problem, namely MVC-S, MIS-S, CA-S and SC-S. MVC-S instances are generated according to the Barabási-Albert random graph model [3], with 1,000 nodes and an average degree of 70 following [177]. MIS-S instances are generated according to the Erdős-Rényi random graph model [50], with 6,000 nodes and an average degree of 5 following [177]. CA-S instances are generated with 2,000 items and 4,000 bids according to the arbitrary relations in [115]. SC-S instances are generated with 4,000 variables and 5,000 constraints following [198]. We then generate another 100 large test instances for each MILP problem by doubling the number of variables, namely MVC-L, MIS-L, CA-L and SC-L. For each set of test instances, Table 3.1 shows its average numbers of variables and constraints. More details of instance generation are included in Appendix.
For data collection and training, we generate another set of 1,024 small instances for each MILP problem. We split these instances into 896 training instances and 128 validation instances.
Baselines We compare CL-LNS with five baselines: (1) BnB: using SCIP (v8.0.1), the state-of-the-art open-source MILP solver, with the aggressive mode fine-tuned to focus on improving the objective value; (2) RANDOM: LNS which selects the neighborhood by uniformly sampling $k^t$ variables without replacement; (3) LB-RELAX [86]: LNS which selects the neighborhood with the LB-RELAX heuristics; (4) IL-LNS [179]; (5) RL-LNS [198]. We compare with two more baselines in Appendix. For each ML method, a separate model is trained for each MILP problem on the small training instances and tested on both small and large test instances. We implement IL-LNS and fine-tune its hyperparameters for each MILP problem since the authors do not fully open-source the code. For RL-LNS, we use the code and hyperparameters provided by the authors and train the models with five random seeds to select the one with the best performance on the validation instances. We do not compare to the method by [177] since it performs worse than RL-LNS on multiple MILP problems [198]. For both IL-LNS and RL-LNS, we also test their generalization performance on the large instances.
Metrics We use the following metrics to evaluate all methods:
1. The primal bound is the objective value of the MILP;
2. The primal gap [19] is the normalized difference between the primal bound $v$ and a precomputed best known objective value $v^*$, defined as $\frac{|v - v^*|}{\max(|v|, |v^*|, \epsilon)}$ if $v$ exists and $v \cdot v^* \ge 0$, or 1 otherwise. We use $\epsilon = 10^{-8}$ to avoid division by zero; $v^*$ is the best primal bound found within 60 minutes by any method in the portfolio for comparison.
3. The primal integral [1] at time $z$ is the integral on $[0, z]$ of the primal gap as a function of runtime. It captures the quality of and the speed at which solutions are found. This is similar to the area under the curve that we use for MAPF-LNS in Section 2.8.2;
4. The survival rate to meet a certain primal gap threshold is the fraction of instances with primal gaps below the threshold [179];
5. The best performing rate of a method is the fraction of instances on which it achieves the best primal gap (including ties) compared to all methods at a given runtime cutoff.
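The primal gap, primal integral and best performing rate can be computed as in the sketch below. It assumes that a run is logged as a time-sorted list of `(time, primal_bound)` events and that the gap is 1 before the first solution is found; the function names and the logging format are illustrative assumptions.

```python
def primal_gap(v, v_star, eps=1e-8):
    """Normalized gap between a primal bound v and the best known value v_star."""
    if v is None or v * v_star < 0:
        return 1.0
    return abs(v - v_star) / max(abs(v), abs(v_star), eps)

def primal_integral(events, v_star, horizon):
    """Integrate the piecewise-constant primal gap over [0, horizon]."""
    integral, last_time, last_gap = 0.0, 0.0, 1.0
    for time, bound in events:
        if time >= horizon:
            break
        integral += last_gap * (time - last_time)
        last_time, last_gap = time, primal_gap(bound, v_star)
    return integral + last_gap * (horizon - last_time)

def best_performing_rate(gaps_by_method, method):
    """Fraction of instances on which `method` attains the smallest gap (ties count)."""
    own = gaps_by_method[method]
    wins = sum(1 for i, g in enumerate(own)
               if g <= min(gaps[i] for gaps in gaps_by_method.values()))
    return wins / len(own)
```

Because ties are credited to every tied method, the best performing rates of all methods at one cutoff can add up to more than 1.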
Hyperparameters We conduct experiments on 2.5 GHz Intel Xeon Platinum 8259CL CPUs with 32 GB memory. Training is done on an NVIDIA A100 GPU with 40 GB memory. All experiments use the hyperparameters described below unless stated otherwise. We use SCIP (v8.0.1) [21] to solve the sub-MILP in every iteration of LNS. To run LNS, we find an initial solution by running SCIP for 10 seconds. We set the time limit to 60 minutes to solve each instance and 2 minutes to solve the sub-MILP in every LNS iteration. All methods require a neighborhood size $k^t$ in LNS, except for BnB and RL-LNS ($k^t$ in RL-LNS is defined implicitly by how the policy is used). For LB-RELAX, IL-LNS and CL-LNS, the initial neighborhood size $k^0$ is set to 100, 3000, 1000 and 100 for MVC, MIS, CA and SC, respectively, except that $k^0$ is set to 150 for SC for IL-LNS; for RANDOM, it is set to 200, 3000, 1500 and 200 for MVC, MIS, CA and SC, respectively. All methods use adaptive neighborhood sizes with $\gamma = 1.02$ and $\beta = 0.5$, except for BnB and RL-LNS. For IL-LNS, when applying its learned policies, we use the sampling methods on MVC and CA instances and the greedy method on SC and MIS instances. For CL-LNS, the greedy method is used on all instances. Additional details on hyperparameter tuning are provided in Appendix.
For data collection, we use different neighborhood sizes $k^0 = 50, 500, 200$ and $50$ for MVC, MIS, CA and SC, respectively, which we justify in subsection 3.6.2.2. We set $\gamma = 1$ and run LNS with LB until no new incumbent solution is found (i.e., we do not adaptively update neighborhood sizes during data collection). The runtime limit for solving LB in every iteration is set to 1 hour. For training, we use the Adam optimizer [104] with learning rate $10^{-3}$. We use a batch size of 32 and train for 30 epochs (the training typically converges in less than 20 epochs and 24 hours).
Since BnB and LNS are both anytime algorithms, we show these metrics as a function of runtime or the number of iterations in LNS (when applicable) to demonstrate their anytime performance.
3.6.2.2 Results
Figure 3.2 shows the primal gap as a function of runtime. Table 3.2 presents the average primal gap and primal integral at the 60-minute runtime cutoff on small and large instances, respectively (see results at the 30-minute runtime cutoff in Appendix). Note that we were not able to reproduce the results on CA-S and CA-L reported in [198] for RL-LNS despite using their code and repeating training with five random seeds. CL-LNS shows better anytime performance than all baselines on all MILP problems. On the small instances, it achieves 32% to 42% lower average primal gaps and 26% to 59% lower average primal integrals than the second-best method at the 60-minute runtime cutoff. It also demonstrates strong generalization performance on large instances unseen during training, reducing the second-best average primal gap and average primal integral by up to 94.4% and 57.1%, respectively. Figure 3.3 shows the survival rate to meet the 1.00% primal gap threshold. CL-LNS achieves the best survival rate at the 60-minute runtime cutoff on all instances, except that, on SC-L, its final survival rate is slightly worse than that of RL-LNS, but it achieves the rate with a much shorter runtime. On MVC-L, MIS-S and MIS-L instances, several baselines achieve the same survival rate as CL-LNS, but CL-LNS always achieves the rate with the shortest runtime. Figure 3.4 shows the best performing rate. CL-LNS consistently performs best on 50% to 100% of the small instances and has the highest best performing rate in most cases on the large instances. In Appendix, we present strong results in comparison with two more baselines and on one more performance metric.
[Figure 3.2 panels: (a) MVC-S (left) and MVC-L (right); (b) MIS-S (left) and MIS-L (right); (c) CA-S (left) and CA-L (right); (d) SC-S (left) and SC-L (right).]
Figure 3.2: The primal gap (the lower, the better) as a function of runtime averaged over 100 test instances. For ML methods, the policies are trained only on small training instances but are tested on both small and large test instances.
              PG (%) ↓     PI ↓          PG (%) ↓     PI ↓
              MVC-S                      MIS-S
BnB           1.32±0.43    66.1±13.1     5.10±0.69    222.8±25.9
RANDOM        0.96±1.26    38.0±44.8     0.24±0.14    22.1±5.0
LB-RELAX      1.38±1.51    57.0±51.2     0.65±0.20    46.9±6.5
IL-LNS        0.29±0.23    19.2±10.2     0.22±0.17    19.4±5.8
RL-LNS        0.61±0.34    29.6±11.5     0.22±0.14    17.2±5.2
CL-LNS        0.17±0.09    8.7±6.7       0.15±0.15    12.8±5.4
              CA-S                       SC-S
BnB           2.28±0.59    137.4±25.9    1.13±0.95    86.7±37.9
RANDOM        5.90±1.02    235.6±34.9    2.67±1.29    124.3±45.4
LB-RELAX      1.65±0.57    140.5±18.3    0.86±0.83    63.2±31.6
IL-LNS        1.09±0.51    90.0±20.8     1.33±0.97    63.2±34.3
RL-LNS        6.32±1.03    249.2±35.9    1.10±0.77    77.8±28.9
CL-LNS        0.65±0.32    50.7±22.7     0.50±0.58    26.2±12.8
              MVC-L                      MIS-L
BnB           2.41±0.40    130.2±11.1    6.29±1.62    285.1±18.2
RANDOM        0.38±0.24    22.7±8.0      0.11±0.08    19.0±3.1
LB-RELAX      0.46±0.23    48.4±7.5      0.91±0.16    68.6±5.5
IL-LNS        0.27±0.23    21.2±8.1      0.29±0.15    27.1±5.5
RL-LNS        0.59±0.30    37.3±9.6      0.14±0.12    18.9±4.1
CL-LNS        0.05±0.04    9.1±3.4       0.12±0.11    12.9±4.4
              CA-L                       SC-L
BnB           2.74±1.87    320.9±83.1    1.54±1.33    115.0±42.5
RANDOM        5.37±0.75    229.2±24.4    3.31±1.79    166.4±61.3
LB-RELAX      1.61±1.50    153.0±50.3    1.91±1.42    88.3±48.9
IL-LNS        4.56±0.98    254.2±33.4    1.72±1.19    79.1±42.4
RL-LNS        4.91±0.81    197.0±28.5    0.66±0.72    116.2±27.1
CL-LNS        0.09±0.10    116.1±18.0    0.58±0.45    39.2±23.2
Table 3.2: Primal gap (PG) (in percent) and primal integral (PI) at the 60-minute runtime cutoff, averaged over 100 test instances, with their standard deviations. “↓” means the lower, the better. For ML methods, the policies are trained only on small training instances but are tested on both small and large test instances.
[Figure 3.3 panels: (a) MVC-S (left) and MVC-L (right); (b) MIS-S (left) and MIS-L (right); (c) CA-S (left) and CA-L (right); (d) SC-S (left) and SC-L (right).]
Figure 3.3: The survival rate (the higher, the better) over 100 test instances as a function of runtime to meet the primal gap threshold 1.00%. For ML methods, the policies are trained only on small training instances but are tested on both small and large test instances.
[Figure 3.4 panels: (a) MVC-S (left) and MVC-L (right); (b) MIS-S (left) and MIS-L (right); (c) CA-S (left) and CA-L (right); (d) SC-S (left) and SC-L (right).]
Figure 3.4: The best performing rate (the higher, the better) as a function of runtime on 100 test instances. The best performing rates at a given runtime might sum to more than 1 since ties are counted multiple times.
Comparison with LB (the Expert) Both IL-LNS and CL-LNS learn to imitate LB. On the small test instances, we run LB with two different neighborhood sizes, one that is fine-tuned in data collection and the other the same as CL-LNS, for 10 iterations and compare its per-iteration performance with IL-LNS and CL-LNS. This allows us to compare the quality of the learned policies to the expert independently of their speed. The runtime limit per iteration for LB is set to 1 hour. Figure 3.5 shows the primal bound as a function of the number of iterations. The table in the figure summarizes the neighborhood sizes and the average runtime per iteration. For LB, the result shows that the neighborhood size affects the overall performance. Intuitively, using a larger neighborhood size in LB allows LNS to find better incumbent solutions by exploring larger neighborhoods. However, in practice, LB becomes less efficient in finding good incumbent solutions as the neighborhood size increases and sometimes even performs worse than using a smaller neighborhood size (the one for data collection). The neighborhood size for data collection is fine-tuned on validation instances to achieve the best primal bound upon convergence, allowing the ML models to observe demonstrations that lead to as good primal bounds as possible in training. However, when using the ML models in testing, we have the incentive to use a larger neighborhood size and fine-tune it since we no longer suffer from the bottleneck of LB. Therefore, we fine-tune the neighborhood sizes for IL-LNS and CL-LNS separately on validation instances. CL-LNS has a strong per-iteration performance that is consistently better than IL-LNS. With the fine-tuned neighborhood size, CL-LNS even outperforms the expert that it learns from (LB for data collection) on MIS-S and CA-S.
                        MVC-S                MIS-S                CA-S                  SC-S
                        NH size   Runtime    NH size   Runtime    NH size   Runtime     NH size   Runtime
LB                      100       3600±0     3,000     3600±0     1,000     3600±0      100       3600±0
LB (data collection)    50        3600±0     500       3600±0     200       3600±0      50        3600±0
IL-LNS                  100       2.1±0.1    3,000     1.3±0.2    1,000     20.8±13.1   150       120.9±1.3
CL-LNS                  100       2.2±0.1    3,000     1.3±0.1    1,000     25.1±15.3   100       50.1±10.4
[Figure 3.5 panels: (a) MVC-S; (b) MIS-S; (c) CA-S; (d) SC-S.]
Figure 3.5: The primal bound (the lower, the better) as a function of the number of iterations averaged over 100 small test instances. LB and LB (data collection) are LNS with LB using the neighborhood sizes fine-tuned for CL-LNS and data collection, respectively. The table shows the neighborhood size (NH size) and the average runtime in seconds (with standard deviations) per iteration.
Ablation Study We evaluate how contrastive learning and two enhancements contribute to CL-LNS’s performance. Compared to IL-LNS, CL-LNS uses (1) additional features from [102] and (2) GAT instead of GCN. We denote by “FF” the full feature set used in CL-LNS and “PF” the partial feature set in IL-LNS. In addition to IL-LNS and CL-LNS, we evaluate the performance of IL-LNS with FF and GAT (denoted by IL-LNS-GAT-FF), CL-LNS with GCN and PF (denoted by CL-LNS-GCN-PF) as well as CL-LNS with GAT and PF (denoted by CL-LNS-GAT-PF) on MVC-S and CA-S. Figure 3.6 shows the primal gap as a function of runtime. Table 3.3 presents the primal gap and primal integral at the 60-minute runtime cutoff. The result shows that IL-LNS-GAT-FF, imitation learning with the two enhancements, still performs worse than CL-LNS-GCN-PF without any enhancements. CL-LNS-GCN-PF and CL-LNS-GAT-PF perform similarly in terms of the primal gaps, but CL-LNS-GAT-PF has better primal integrals, showing the benefit of replacing GCN with GAT. On MVC-S, CL-LNS and its other two variants have similar average primal gaps. On CA-S, CL-LNS has a better average primal gap than the other two variants. In both cases, adding the two enhancements helps improve the primal integral, leading to the overall best performance of CL-LNS on both MVC-S and CA-S.
[Figure 3.6 panels: (a) MVC-S; (b) CA-S.]
Figure 3.6: Ablation study: The primal gap (the lower, the better) as a function of time averaged over 100 small test instances.
                    PG (%) ↓     PI ↓          PG (%) ↓     PI ↓
                    MVC-S                      CA-S
IL-LNS (GCN-PF)     0.29±0.23    19.2±10.2     1.09±0.51    90.0±20.8
IL-LNS-GAT-FF       0.24±0.17    15.3±7.3      1.13±0.63    78.9±22.7
CL-LNS-GCN-PF       0.17±0.10    11.4±8.8      0.75±0.40    57.9±21.2
CL-LNS-GAT-PF       0.16±0.09    10.1±0.6      0.76±0.39    53.8±22.1
CL-LNS (GAT-FF)     0.17±0.09    8.7±6.7       0.65±0.32    50.7±22.7
Table 3.3: Ablation study: Primal gap (PG) (in percent) and primal integral (PI) at the 60-minute runtime cutoff, averaged over 100 small test instances, with their standard deviations. “↓” means the lower, the better.
3.7 Contrastive Predict-and-Search
In this section, we introduce ConPaS, Contrastive Predict-and-Search for MILPs, to show how the framework can be applied to improve the predictions of partial assignments of values to variables in PaS. ConPaS leverages CL in the important task of learning to construct high-quality (partial) solutions to MILPs. A key to adapting the framework to this task is devising an appropriate and effective way of collecting positive and negative samples in this new context. Similar to both ND [148] and PaS [70], we collect a set of optimal and near-optimal solutions as positive samples; but different from ND and PaS, we additionally collect negative samples for CL. We propose to collect two types of negative samples (infeasible solutions and low-quality solutions that are similar to the positive samples) with novel approaches tailored to our task. For infeasible solutions, we use a sampling approach that randomly perturbs a small fraction of the positive samples. For low-quality solutions, we formulate the task as a maximin optimization. During training, instead of using a binary cross entropy loss to penalize the inaccurate predictions for each variable separately, we use a contrastive loss that encourages the model to predict solutions that are similar to the positive samples but dissimilar to the negative ones. Empirically, we test ConPaS on a variety of MILP problems, including problems from the NeurIPS Machine Learning for Combinatorial Optimization competition [58]. We show that ConPaS achieves state-of-the-art anytime performance on finding high-quality solutions to MILPs, substantially outperforming other learning-based methods such as ND and PaS in terms of solution quality and speed. In addition, ConPaS shows great generalization performance on test instances that are 50% larger than the training instances.
3.7.1 Machine Learning Methodology
For a given MILP $M$, our goal is to use CL to predict the conditional distribution of the solution $p(x|M)$, such that it leads to high-quality solutions fast when it is used to guide downstream MILP solving. In this chapter, we mainly focus on using the prediction in Predict-and-Search (optimization problems (3.2)) following [70]. However, such predictions can also be used to decompose the feasible regions of the input MILP for exact solving [44] or to seed LNS with a better primal solution for heuristic solving [179]. Figure 3.7 gives an overview of ConPaS. Next, we describe our novel data collection, our supervised CL and how we apply solution predictions in the search.
[Figure 3.7 diagram: MILP instances for training → training data collection for each instance (positive samples: solve the instance to obtain optimal and near-optimal solutions; negative samples: obtain infeasible or low-quality solutions that are similar to each positive sample) → dataset → supervised contrastive learning to predict optimal solutions → testing with Predict-and-Search (Han et al., 2022): 1. predict scores for variables; 2. fix some variables greedily based on scores; 3. search for the unfixed variables while allowing a few fixed ones to change.]
Figure 3.7: Overview of ConPaS. For training, we collect data from a set of MILP instances, including positive samples that are optimal and near-optimal solutions and negative samples that are low-quality or infeasible solutions. We use the data in supervised CL to predict optimal solutions. During testing, the predictions are used in Predict-and-Search [70].
3.7.1.1 Data Collection
In ConPaS, we use CL to learn to make discriminative predictions of optimal solutions by contrasting positive and negative samples. Since finding good assignments for integer variables is essentially the most challenging part of solving a MILP, we follow previous work [148] to learn $p(x|M)$ approximately as $\prod_{i \le q} p_\theta(x_i|M)$, where we mainly focus on predicting $p_\theta(x_i|M)$ for binary variables ($i \le q$). Therefore, our definition of positive and negative samples of solutions mainly concerns the partial solutions on binary variables (since the optimal solutions for continuous variables can be computed in polynomial time once the binary ones are fixed). Now, we describe how we collect positive and negative samples.
Positive Samples Collection For a given MILP $M$, we collect a set of optimal or near-optimal solutions $S^M_p$ as our positive samples following previous works [148, 70]. This is done by solving $M$ exhaustively with a MILP solver and collecting up to $u_p$ solutions with the minimum objective values. In experiments, $u_p$ is set to 50.
Negative Samples Collection Negative samples are a critical part of CL, helping to distinguish between high-quality and low-quality (or even infeasible) solutions. We propose to collect negative samples that are similar to the positive ones. From a theoretical point of view, the InfoNCE loss [151, 74] we use for training later can automatically focus on hard negative pairs (i.e., samples with similar representations but of very different qualities) and learn representations that separate them [187].
Given a MILP $M$, we collect a set of $u_n$ negative samples $S^M_n$, where $u_n = \beta_n |S^M_p|$ and $\beta_n$ is a hyperparameter that controls the ratio between the numbers of positive and negative samples. In experiments, $\beta_n$ is set to 10. We propose two novel approaches to collect them: (1) a sampling approach to collect infeasible solutions and (2) an optimization-based approach to collect low-quality solutions.
• Infeasible Solutions as Negative Samples We introduce a sampling approach. For each positive sample $x \in S^M_p$, we collect $\beta_n$ infeasible solutions as negative samples. We randomly perturb 10% of the binary variable values in $x$ (i.e., flipping from 0 to 1 or from 1 to 0). If the MILP $M$ contains only binary variables, we validate that a perturbed solution is indeed infeasible by checking that it violates at least one constraint in $M$. If $M$ contains both binary and continuous variables, we fix the binary variables to the values in the perturbed solution and use a MILP solver to ensure that no feasible assignment of the continuous variables exists. If fewer than $\beta_n$ negative samples are found after validating $2\beta_n$ perturbed samples, we increase the perturbation rate by 5% and repeat the same process until we have $\beta_n$ samples.
• Low-Quality Solutions as Negative Samples We introduce an optimization-based approach. For each positive sample $x = (x_1, \ldots, x_n) \in S^M_p$, we find the worst $\beta_n$ feasible solutions that differ from $x$ in at most 10% of the binary variables. If the MILP $M = (A, b, c, q)$ contains only binary variables, we find negative samples $x'$ by solving the following Local Branching [56] MILP:
$$\begin{aligned} \max\ & c^T x' \\ \text{s.t.}\ & A x' \le b,\quad x' \in \{0, 1\}^q \times \mathbb{R}^{n-q}, \\ & \sum_{i \le q:\, x_i = 0} x'_i + \sum_{i \le q:\, x_i = 1} (1 - x'_i) \le k. \end{aligned} \tag{3.3}$$
The above MILP is essentially solving the same problem as $M$, but with a negated objective function that tries to find a solution $x'$ of as low quality as possible and a constraint that allows changing at most $k$ of the binary variables. After solving it, we consider solutions as negative samples only if they are worse than a given threshold. $k$ is initially set to $10\% \times q$, but if fewer than $\beta_n$ negative samples are found with the current $k$, we increase it by 5% and re-solve optimization problem (3.3). We repeat the same process until we have $\beta_n$ negative samples.
If $M$ contains continuous variables, the goal is to find partial solutions on binary variables, such that we get solutions $x'$ of as low quality as possible when we fix the binary values and optimize for the rest of the continuous variables. Formally, solving for the partial solutions on binary variables $x'_1, \ldots, x'_q$ can be written as a maximin optimization:
$$\begin{aligned} \max_{x'_1, \ldots, x'_q}\ \min_{x'_{q+1}, \ldots, x'_n}\ & c^T x' \\ \text{s.t.}\ & A x' \le b,\quad x' \in \{0, 1\}^q \times \mathbb{R}^{n-q}, \\ & \sum_{i \le q:\, x_i = 0} x'_i + \sum_{i \le q:\, x_i = 1} (1 - x'_i) \le k. \end{aligned} \tag{3.4}$$
Solving the above maximin optimization exactly is prohibitively hard and, to the best of our knowledge, there are no general-purpose solvers for it [11, Chapter 7]. Therefore, we use a heuristic approach where we iteratively solve the inner minimization problem and add a constraint $c^T x' > c^T x^*$ to enforce that the next solution found is strictly better than the current best-found solution $x^*$ to the maximin problem. The procedure terminates when no better solution can be found. For faster convergence, we sometimes enforce the next solution found to be at least $\epsilon > 0$ better than $x^*$, i.e., we add $c^T x' \ge c^T x^* + \epsilon$, where $\epsilon$ is a hyperparameter tuned adaptively in a binary search manner. If we find fewer than $\beta_n$ samples, we adjust $k$ the same way as in the previous case.
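For the pure-binary case, the sampling approach for infeasible negatives and the distance term of the Local Branching constraint can be sketched without a solver. The sketch assumes constraints in the form $Ax \le b$ with dense rows, keeps infeasible candidates found at earlier perturbation rates, and does not deduplicate; these choices and all function names are illustrative assumptions.

```python
import random

def is_feasible(A, b, x, tol=1e-9):
    """Check A x <= b row by row."""
    return all(sum(a * xi for a, xi in zip(row, x)) <= bi + tol
               for row, bi in zip(A, b))

def flip_binaries(x, q, percent, rng):
    """Flip `percent`% (at least one) of the first q entries of x."""
    flipped = list(x)
    for i in rng.sample(range(q), max(1, percent * q // 100)):
        flipped[i] = 1 - flipped[i]
    return flipped

def local_branching_lhs(x, x_prime, q):
    """Number of binary variables on which x_prime differs from x."""
    return sum(x_prime[i] if x[i] == 0 else 1 - x_prime[i] for i in range(q))

def sample_infeasible(A, b, x, q, num_needed, rng=None):
    """Validate 2 * num_needed perturbations per rate, starting at 10%."""
    rng = rng or random.Random(0)
    found, percent = [], 10
    while len(found) < num_needed and percent <= 100:
        for _ in range(2 * num_needed):
            candidate = flip_binaries(x, q, percent, rng)
            if not is_feasible(A, b, candidate):
                found.append(candidate)
                if len(found) == num_needed:
                    break
        percent += 5
    return found
```

`local_branching_lhs` is the left-hand side of the last constraint in (3.3) and (3.4); bounding it by $k$ keeps the low-quality negatives within Hamming distance $k$ of the positive sample on the binary variables.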
3.7.1.2 Neural Network Architecture
Following previous work [70], we use a bipartite graph to represent the input MILP $M$. The bipartite graph has $n$ variables and $m$ constraints on two sides, respectively, with an edge connecting a variable and a constraint if the variable has a nonzero coefficient in the constraint. Following [148] and [70], we use the node and edge features in the bipartite graph proposed by [59]. We learn $p_\theta(x|M)$ represented by a graph convolutional network (GCN) parameterized by learnable weights $\theta$. The GCN takes the bipartite graph representation of $M$ and the features as input. We perform two rounds of message passing through the GCN to obtain an embedding of the variables, which is then passed through a multilayer perceptron (MLP) followed by a sigmoid activation layer to obtain the final output $p_\theta(x_i|M)$. Details of the GCN architecture are included in Appendix.
3.7.1.3 Model Learning with a Contrastive Loss
Given a set of MILP instances $\mathcal{M}$ for training, let $D_{\text{ConPaS}} = \{(S^M_p, S^M_n) : M \in \mathcal{M}\}$ be the set of positive and negative samples for all training instances. A contrastive loss is a function whose value is low when the predicted $p_\theta(x|M)$ is similar to the positive samples $S^M_p$ and dissimilar to the negative samples $S^M_n$. With similarity measured by dot products, we use an alternative form of InfoNCE, a supervised contrastive loss, that takes into account the solution qualities of both positive and negative samples:
$$\mathcal{L}^{\text{ConPaS}}(\theta) = \sum_{(S^M_p, S^M_n) \in D_{\text{ConPaS}}} \frac{-1}{|S^M_p|} \sum_{x_p \in S^M_p} \log \frac{\exp(x_p^T p_\theta(x|M) / \tau(x_p|M))}{\sum_{x' \in S^M_n \cup \{x_p\}} \exp(x'^T p_\theta(x|M) / \tau(x'|M))}$$
where we let $\frac{1}{\tau(x|M)} \propto -E(x|M)$ if $x$ is feasible to $M$, where $E(x|M)$ is the same energy function used in previous works [70, 148]; otherwise $\tau(x|M)$ is set to a constant $\tau'$ ($\tau' = 1$ in experiments). Intuitively, setting $\tau(x|M)$ in this manner encourages the predictions $p_\theta(x|M)$ to be more similar to positive samples $x_p$ with better objectives.
MILP Problem             MVC      MIS      CA      IP
#Binary Variables        6,000    6,000    4,000   1,050
#Continuous Variables    0        0        0       33
#Constraints             29,975   29,975   2,675   195
Table 3.4: The average numbers of variables and constraints in the test instances.
3.7.1.4 ML-Guided Search
We apply the predicted solution to reduce the search space of the input MILP in the same way as Predict-and-Search [70]. We greedily select $X_0$ and $X_1$ based on the prediction and solve the optimization problem defined by Equation (3.2) given hyperparameters $k_0$, $k_1$ and $\Delta$.
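One plausible reading of the greedy selection is sketched below: the $k_0$ variables with the lowest predicted values form $X_0$ (to be fixed to 0) and the $k_1$ variables with the highest predicted values form $X_1$ (to be fixed to 1). This reading, the helper that counts how many fixings a solution changes (the quantity that $\Delta$ is assumed to bound in Equation (3.2)), and both function names are assumptions for illustration.

```python
def select_fixings(probs, k0, k1):
    """Return (X0, X1): indices to fix to 0 and indices to fix to 1."""
    if k0 + k1 > len(probs):
        raise ValueError("k0 + k1 must not exceed the number of binary variables")
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    x0 = order[:k0]        # lowest predicted probabilities of being 1
    x1 = order[::-1][:k1]  # highest predicted probabilities of being 1
    return x0, x1

def num_changed_fixings(x, x0, x1):
    """Count fixed variables whose value in x deviates from the fixing."""
    return sum(x[i] for i in x0) + sum(1 - x[i] for i in x1)
```

For example, with predictions `[0.9, 0.05, 0.5, 0.99, 0.1]` and $k_0 = k_1 = 2$, variables 1 and 4 would be fixed to 0 and variables 3 and 0 to 1, leaving variable 2 free for the search.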
3.7.2 Empirical Evaluation
In this subsection, we demonstrate the efficiency of ConPaS through experiments. In the following, we introduce our evaluation setup and then present the results.
3.7.2.1 Setup
MILP Problems We evaluate on four NP-hard MILP problems that are widely used in existing studies [59, 70], which consist of two graph optimization problems, namely the minimum vertex cover (MVC) and maximum independent set (MIS) problems, and two non-graph optimization problems, namely the combinatorial auction (CA) and item placement (IP) problems. Both MVC and MIS instances are generated according to the Barabási-Albert random graph model [3], with 6,000 nodes and an average degree of 5. CA instances are generated with 2,000 items and 4,000 bids according to the arbitrary relations in [115]. IP instances are provided by the NeurIPS Machine Learning for Combinatorial Optimization competition [58]. The workload appointment problem is another MILP problem from the competition. However, its instances are not challenging enough for the baselines and our method. Therefore, we exclude the results on the workload appointment problem from the main content and report them in Appendix. For each problem, we have 400 training instances, 100 validation instances and 100 test instances. For each set of test instances, Table 3.4 shows its average numbers of variables and constraints. More details of instance generation are included in Appendix.
Baselines We compare ConPaS with three baselines: (1) SCIP (v8.0.1) [21], the stateoftheart opensource ILP solver. We allow restart and presolving with the aggressive mode turned on for primal heuristics to focus on improving the objective value; (2) ND [148]; and (3) PredictandSearch (PaS) [70]. We
have considered another version of PaS where we replace the neural network output with the LP relaxation solutions of the MILP. However, this method causes very high infeasibility rates when solving the
optimization problem defined by Equation (3.2). We also compare ConPaS with Gurobi (v10.0.0) [69] and
present the results in Appendix.
Figure 3.8: The primal gap (the lower, the better) as a function of runtime, averaged over 100 test instances. Panels: (a) MVC, (b) MIS, (c) CA, (d) IP.
For ML-based methods, a separate model is trained for each MILP problem. For PaS, we train the
models with the code provided by [70]. For ND, we implement it ourselves and fine-tune its hyperparameters for each
MILP problem since its code is not available.
Metrics We use the following metrics to evaluate all methods: (1) The primal gap [19] is the normalized
difference between the primal bound $v$ and a precomputed best known objective value $v^*$, defined as
$\frac{|v - v^*|}{\max(|v|, |v^*|, \epsilon)}$ if $v$ exists and $v \cdot v^* \geq 0$, or 1 otherwise. We use
$\epsilon = 10^{-8}$ to avoid division by zero; $v^*$ is the
best primal bound found within 60 minutes by any method in the portfolio for comparison; (2) The primal
integral [1] at runtime cutoff $t$ is the integral on $[0, t]$ of the primal gap as a function of runtime. It captures
the quality of the solutions found and the speed at which they are found; and (3) The survival rate [179]
to meet a certain primal gap threshold is the fraction of instances with primal gaps below the threshold.
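The three metrics can be illustrated with a short Python sketch. It is not the evaluation code used in this chapter: the function names, the step-function treatment of the primal gap between improving solutions, and counting a gap equal to the threshold as meeting it are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

EPS = 1e-8  # epsilon from the primal gap definition


def primal_gap(v: Optional[float], v_star: float, eps: float = EPS) -> float:
    """Primal gap of the primal bound v against the best known value v_star."""
    # No solution yet, or objectives of opposite sign: the gap is 1.
    if v is None or v * v_star < 0:
        return 1.0
    return abs(v - v_star) / max(abs(v), abs(v_star), eps)


def primal_integral(events: Sequence[Tuple[float, float]],
                    v_star: float, cutoff: float) -> float:
    """Integrate the primal gap over [0, cutoff].

    events holds (time, objective) pairs for each improving solution, sorted
    by time. Assumption: the gap is piecewise constant between improvements
    and equals 1 before the first solution is found.
    """
    total, last_t, gap = 0.0, 0.0, 1.0
    for t, v in events:
        if t >= cutoff:
            break
        total += gap * (t - last_t)
        last_t, gap = t, primal_gap(v, v_star)
    return total + gap * (cutoff - last_t)


def survival_rate(gaps: Sequence[float], threshold: float) -> float:
    """Fraction of instances whose primal gap meets the threshold.

    Assumption: a gap exactly equal to the threshold counts as meeting it.
    """
    return sum(g <= threshold for g in gaps) / len(gaps)
```

For instance, a solver that finds objective 120 after 10 seconds and the best known value 100 after 50 seconds accumulates a gap of 1 for the first 10 seconds and a gap of 20/120 for the next 40 seconds, so an early mediocre solution still lowers the primal integral relative to finding nothing until second 50.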
Hyperparameters We conduct experiments on 2.4 GHz Intel Core i7 CPUs with 16 GB memory. Training is done on an NVIDIA P100 GPU with 32 GB memory. For data collection, we collect the 50 best solutions found
for each training instance within a one-hour runtime limit using Gurobi (v10.0.0). For training, we use
the Adam optimizer [104] with learning rate $10^{-3}$. We use a batch size of 8 and train for 100 epochs (the
training typically converges in less than 50 epochs and 5 hours). For testing, we set the runtime cutoff to
1,000 seconds to solve the reduced MILP of each test instance with SCIP (v8.0.1).∗ To tune (k0, k1, ∆) (see
definition in subsection 3.3.3) for both PaS and ConPaS, we first fix ∆ = 5 or 10 and vary k0, k1 over
0%, 10%, . . . , 50% of the number of binary variables, testing each setting on the validation instances
to obtain initial values. We then adjust ∆, k0, k1 around their initial values to find the best ones. The
fine-tuned values are reported in the Appendix.
∗Note that our method is agnostic to the solver for the reduced MILP. The test results with Gurobi are reported in Appendix.
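The role of (k0, k1, ∆) in the tuning procedure can be sketched in Python. The definition in subsection 3.3.3 is not repeated here, so the sketch assumes the usual Predict-and-Search reading: the k0 binary variables with the lowest predicted values are tentatively fixed to 0, the k1 with the highest predicted values are tentatively fixed to 1, and at most ∆ of these fixings may be violated. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def select_fixings(pred: Sequence[float], k0: int, k1: int) -> Tuple[List[int], List[int]]:
    """Pick variable indices to fix to 0 (lowest scores) and to 1 (highest scores)."""
    assert k0 + k1 <= len(pred), "the two index sets must not overlap"
    order = sorted(range(len(pred)), key=lambda i: pred[i])
    x0 = order[:k0]               # predicted most likely to be 0
    x1 = order[len(order) - k1:]  # predicted most likely to be 1
    return x0, x1


def within_trust_region(x: Sequence[int], x0: Sequence[int],
                        x1: Sequence[int], delta: int) -> bool:
    """Check that a 0/1 assignment x deviates from the fixings in at most delta variables."""
    flips = sum(x[i] for i in x0) + sum(1 - x[i] for i in x1)
    return flips <= delta


def tuning_grid(n_binary: int) -> List[Tuple[int, int]]:
    """Candidate (k0, k1) pairs at 0%, 10%, ..., 50% of the binary variables."""
    levels = [n_binary * pct // 100 for pct in range(0, 60, 10)]
    return [(k0, k1) for k0 in levels for k1 in levels]
```

In a solver, the trust-region check would be posted as a single linear constraint on the reduced MILP rather than evaluated after the fact; the function above only illustrates which assignments that constraint admits.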
Figure 3.9: The survival rate (the higher, the better) to meet a certain primal gap threshold over 100 test
instances as a function of runtime. The primal gap threshold is set to the median of the average primal
gaps at the 1,000-second runtime cutoff among all methods, rounded to the nearest 0.50%. Panels: (a) MVC, (b) MIS, (c) CA, (d) IP.
Figure 3.10: The primal integral (the lower, the better) at the 1,000-second runtime cutoff, averaged over
100 test instances. The error bars represent the standard deviation. A tabular representation is provided
in Appendix Table A.7. Panels: (a) MVC, (b) MIS, (c) CA, (d) IP.
Figure 3.11: Generalization to 100 large instances: The primal gap as a function of runtime, the survival
rate as a function of runtime and the primal integral at the 1,000-second runtime cutoff. The primal gap
threshold for the survival rate is chosen as the median of the average primal gaps at the 1,000-second
runtime cutoff among all methods, rounded to the nearest 0.50%. A tabular representation for the primal
integral plots is provided in the Appendix. Panels: (a) MVC (large instances), (b) CA (large instances).
Figure 3.12: Training on different fractions of training instances: The primal gap as a function of runtime
and the primal integral at the 1,000-second runtime cutoff. ConPaS-LQ-50% and ConPaS-LQ-25% denote
the versions of ConPaS trained with only 50% and 25% of the training instances, respectively (similarly for
PaS).
3.7.2.2 Results
We test two variants of ConPaS, denoted by ConPaS-Inf and ConPaS-LQ, that use infeasible solutions and
low-quality solutions as negative samples, respectively. Figure 3.8 shows the primal gap as a function of
runtime. Overall, SCIP performs the worst. PaS achieves lower average primal gaps than ND on three
of the MILP problems at the 1,000-second runtime cutoff. Both ConPaS-Inf and ConPaS-LQ show better
anytime performance than all baselines on all MILP problems. ConPaS-LQ performs slightly better
than ConPaS-Inf. At the 1,000-second runtime cutoff, ConPaS-Inf achieves 3.54% to 52.83% lower average
primal gaps and ConPaS-LQ achieves 9.82% to 86.02% lower average primal gaps than the best baseline.
Figure 3.9 shows the survival rate to meet a certain primal gap threshold. The primal gap threshold
is chosen as the median of the average primal gaps at the 1,000-second runtime cutoff among all methods,
rounded to the nearest 0.50%. ND surprisingly has the lowest survival rate (even lower than SCIP) on the
CA instances, indicating high variance in the performance of both SCIP and ND†, but ND is better than both
SCIP and PaS on the two graph optimization problems. PaS has higher survival rates on the CA and
IP instances. ConPaS-Inf and ConPaS-LQ have the best survival rates at the 1,000-second runtime cutoff on
all instances. Specifically, on the MVC and MIS instances, at the runtime cutoffs when they both first reach
100% survival rates, the best baseline only achieves survival rates of about 10% to 80%. These results indicate
that ConPaS not only finds better solutions on average but also finds them on more instances. Figure 3.10
shows the average primal integral at the 1,000-second runtime cutoff. The result demonstrates that both
ConPaS-Inf and ConPaS-LQ not only find better solutions than the other methods but also find them
faster.
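The rule for choosing the survival-rate threshold can be written as a small Python sketch. The function name and the example gap values are made up for illustration and are not results from this chapter. Python's built-in `round` rounds exact halves to the nearest even integer, which may differ from the tie-breaking used in the experiments.

```python
from statistics import median
from typing import Mapping


def survival_threshold(avg_gap_pct: Mapping[str, float], step: float = 0.5) -> float:
    """Median of the per-method average primal gaps (in %), rounded to a multiple of step."""
    m = median(avg_gap_pct.values())
    return round(m / step) * step


# Hypothetical average primal gaps (in %) at the runtime cutoff.
example = {"SCIP": 3.1, "ND": 1.2, "PaS": 0.9, "ConPaS-Inf": 0.4, "ConPaS-LQ": 0.3}
threshold = survival_threshold(example)  # median is 0.9, which rounds to 1.0
```

Using the median across methods places the threshold where roughly half of the methods meet it on average, so the survival curves separate the methods rather than saturating at 0% or 100% for all of them.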
Next, we test the generalization performance and conduct an ablation study on the loss functions.
Given the large computation overhead, we focus on two representative MILP problems: MVC, a graph optimization problem, and CA, a non-graph optimization problem.
†When the primal gap threshold is set to 5.00%, ND has a 98% survival rate whereas SCIP has only 56%.
Method                  MVC PG   MVC PI   CA PG    CA PI
PaS                     0.17%    13.9     1.16%    28.9
ConPaS-LQ-unweighted    0.12%    3.3      0.57%    24.3
ConPaS-LQ               0.10%    2.9      0.16%    19.7
Table 3.5: Comparison of different loss functions. We report the primal gaps (PG) and the primal integrals
(PI) at the 1,000-second runtime cutoff averaged over 100 instances.
Generalization to Larger Instances We test the generalization performance of the trained models on
larger instances. We generate 100 large MVC instances according to the Barabási-Albert random graph
model [3], with 9,000 nodes and an average degree of 5. We also generate 100 large CA instances with
3,000 items and 6,000 bids according to the arbitrary relations in [115]. These larger instances have 50%
more variables and constraints than the previous test instances. In Figure 3.11, we show the
average primal gaps, survival rates and average primal integrals over 100 test instances. All ML-based
methods demonstrate good generalizability. On large MVC instances, ND, PaS and ConPaS-Inf perform
similarly in terms of the primal gap, while ConPaS-Inf improves the primal gap faster than the other
methods. On large CA instances, both ConPaS-Inf and ConPaS-LQ are substantially better than the
baselines in terms of all performance metrics. Overall, on both large MVC and CA instances, ConPaS-LQ
is the best, and its primal integral at the 1,000-second runtime cutoff is 57.9% to 70.3% lower than that of the best
baseline, PaS. It also reaches 100% survival rates fastest for the given thresholds.
Ablation Study We conduct an ablation study on ConPaS-LQ to assess the effectiveness of the alternate
form of the InfoNCE loss. The results are shown in Table 3.5, where ConPaS-LQ-unweighted refers to training
with the original InfoNCE loss, which does not consider the different qualities of the samples; for it, we fine-tune and set τ (xM) to the constant 1. ConPaS-LQ refers to the variant that takes the solution
qualities into account. ConPaS-LQ-unweighted is still able to outperform PaS, and performance further improves when the modified
loss function is used.
        Primal Gap (%)          Primal Integral
k0      PaS      ConPaS-LQ      PaS      ConPaS-LQ
800     6.28     6.59           114.4    117.5
1200    5.45     5.05           104.3    97.3
1600    2.91     2.06           75.6     70.4
2000    1.17     0.55           28.9     19.7
2400    2.19     1.40           27.5     22.9
2700    5.63     4.58           58.0     47.4
3000    12.74    11.56          127.8    115.8
Table 3.6: The primal gap and primal integral at the 1,000-second runtime cutoff on the CA instances with
different k0 averaged over 100 instances.
The Effect of Hyperparameters We study the effect of hyperparameters. Specifically, we focus our
study on PaS and ConPaS-LQ on the CA instances. We first empirically study how many training instances
are needed for each method. We train separate models with 50% and 25% of the training instances and
test their performance on the test instances. Figure 3.12 shows the results on the primal gap and primal
integral. The two models for ConPaS-LQ trained with 50% and 100% of the instances perform similarly
to each other. This is also true for PaS, but its two models are both worse than ConPaS-LQ. When we
use 25% of the training instances, we observe a drop in performance for both methods. However, in this
case, ConPaS-LQ performs much better than PaS and only slightly worse than PaS trained on 100% or 50% of the
instances. These empirical results indicate that CL can achieve better performance using fewer training
instances than other learning methods.
We also study the effect of different (k0, k1, ∆) for PaS and ConPaS-LQ on the CA instances. For
CA instances, fixing both k1 and ∆ to 0 always gives better primal gaps and primal integrals than other
values. Therefore, we vary only k0. We present the results on primal gaps and primal integrals in Table 3.6.
Overall, setting k0 = 2,000 gives the best performance for both PaS and ConPaS-LQ. Either increasing
or decreasing k0 from 2,000 hurts their performance. However, if we increase k0 from 2,000, both of them
converge to their eventual solutions fast and therefore have primal integrals comparable to those with small k0, even
though their primal gaps are sometimes worse. In general, a smaller k0 requires the search to determine
the values of more variables; therefore, it converges more slowly and has a larger primal integral. On the
other hand, a larger k0 reduces the search space more; therefore, it converges faster but to a worse
solution.
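The trade-off can be read directly from Table 3.6. The short Python sketch below copies the table values and finds the k0 that minimizes each metric; the variable and function names are chosen here for illustration only.

```python
# Values copied from Table 3.6 (CA instances, 1,000-second runtime cutoff).
K0 = [800, 1200, 1600, 2000, 2400, 2700, 3000]
PRIMAL_GAP = {
    "PaS": [6.28, 5.45, 2.91, 1.17, 2.19, 5.63, 12.74],
    "ConPaS-LQ": [6.59, 5.05, 2.06, 0.55, 1.40, 4.58, 11.56],
}
PRIMAL_INTEGRAL = {
    "PaS": [114.4, 104.3, 75.6, 28.9, 27.5, 58.0, 127.8],
    "ConPaS-LQ": [117.5, 97.3, 70.4, 19.7, 22.9, 47.4, 115.8],
}


def best_k0(values):
    """Return the k0 whose metric value is smallest (both metrics are lower-is-better)."""
    best_index = min(range(len(K0)), key=lambda i: values[i])
    return K0[best_index]


best_by_gap = {method: best_k0(v) for method, v in PRIMAL_GAP.items()}
best_by_integral = {method: best_k0(v) for method, v in PRIMAL_INTEGRAL.items()}
```

By primal gap, both methods are best at k0 = 2,000. By primal integral, ConPaS-LQ is also best at 2,000, while PaS is marginally lower at 2,400 (27.5 versus 28.9), which matches the observation that a larger k0 converges quickly even when its final primal gap is worse.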
3.8 Summary
In this chapter, we validated the hypothesis that one can leverage a general ML framework to improve
human-designed decision-making strategies in different types of MILP search algorithms. We proposed a
general ML framework based on contrastive learning. To apply this framework, we identify decisions in
MILP search algorithms that we want to improve. Then, we collect training data for supervised CL. The
training data includes positive and negative samples that are demonstrations of the decisions. Then, we
train an ML model with a contrastive loss to predict decisions that are similar to the positive samples and
dissimilar to negative ones. Finally, we use the learned ML model to make decisions during the search.
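The training step of this framework can be illustrated with a minimal Python sketch of an InfoNCE-style contrastive loss. It is the standard form averaged over positive samples, not the exact loss defined earlier in this chapter. The similarity scores are assumed to have been computed already between the model's prediction and each sample, and the function name and the default temperature of 0.07 are assumptions.

```python
import math
from typing import Sequence


def info_nce(pos_scores: Sequence[float], neg_scores: Sequence[float],
             temperature: float = 0.07) -> float:
    """InfoNCE loss averaged over positive samples.

    Each positive is contrasted against all negatives:
    -log( exp(s_p / T) / (exp(s_p / T) + sum_n exp(s_n / T)) ).
    """
    neg_logits = [s / temperature for s in neg_scores]
    losses = []
    for s in pos_scores:
        pos_logit = s / temperature
        logits = [pos_logit] + neg_logits
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - pos_logit)
    return sum(losses) / len(losses)
```

The loss is small when the prediction scores higher against the positive samples than against the negative ones, and it grows when the ordering is reversed, which is what pushes the model towards decisions similar to the positive samples.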
We first applied the framework to LNS and proposed CL-LNS, which learned efficient and effective destroy
heuristics in LNS. We presented a novel data collection process tailored for CL-LNS and used GAT with a
richer set of features to further improve its performance. Empirically, CL-LNS substantially outperformed
state-of-the-art methods on four MILP problems with respect to the primal gap, the primal integral, the
best performing rate and the survival rate. CL-LNS achieved good generalization performance on out-of-distribution instances that are 100% larger than those used in training.
We then applied the framework to PaS and proposed ConPaS, which learned to predict high-quality solutions by contrasting optimal and near-optimal solutions with infeasible or low-quality solutions. We
presented a novel data collection process tailored for ConPaS, proposing a novel sampling-based approach
and a novel optimization-based approach to collect negative samples. In testing, we solved a reduced-size
MILP by restricting the search space to the proximity of the predicted solutions. Empirically, we showed
that ConPaS found better solutions faster than the baselines, which include two state-of-the-art ML-guided MILP search algorithms. ConPaS achieved good generalization performance on out-of-distribution
instances that are 50% larger than those used in training.
Chapter 4
Conclusions
In today’s rapidly evolving society and economy, the scale, pace and variety of tasks related to resource
allocation, design, planning and operations are expanding. These tasks are often subject to stringent resource constraints, high quality expectations and increasingly complex environments. Central to meeting these challenges is solving complex combinatorial optimization problems (COPs). In
the past decades, search algorithms have been proposed to solve COPs. Many decisions in these search algorithms are made by
human-designed strategies that are crucial to the algorithms' success.
However, handcrafting those strategies is a complicated task that is prone to human error and bias. On the
other hand, machine learning (ML) has been the major force behind the successful advancement of many
real-world applications. In this dissertation, we show that one can leverage general machine
learning frameworks to improve human-designed decision-making strategies in different types of search
algorithms for COPs. Specifically, we focus on two important COPs, namely multi-agent path finding
(MAPF) and mixed integer linear programs (MILPs).
In Chapter 2, we presented our first major contribution to using ML to improve decision-making
strategies in MAPF search algorithms. We contributed a general ML framework based on imitation learning
and implemented the framework on four different MAPF search algorithms, namely CBS, ECBS, MAPF-LNS and PP, which substantially improved their performance in terms of runtime and/or solution quality.
The main contributions of this chapter were published in major artificial intelligence conferences individually in 2021 and 2022 [85, 89, 90, 213]. They are the first works that use ML techniques to enhance MAPF
search algorithms by improving the quality of decision-making within the search process. Our works have
inspired other works in the community since they were published [206, 192, 201, 156], where [206] uses
a graph transformer architecture to improve node-selection strategies for ECBS, [192] uses genetic algorithms to learn priority-assignment strategies that can be expressed as arithmetic formulae for PP, [201]
proposes a new deep neural network architecture to improve agent set-selection strategies for MAPF-LNS,
and [156] uses online learning to learn to configure agent set-selection strategies. Both [192] and [201]
are built upon our proposed ML framework. We believe that, by formulating a general ML framework and
providing four example implementations for different use cases, the contribution in Chapter 2 will
serve as important guidance on how to improve MAPF search algorithms systematically.
Next, we discuss the limitations of the contribution in Chapter 2 and future work for improving
decision-making strategies in MAPF search algorithms. The ML framework proposed in Chapter 2 uses
imitation learning. Thus, one of its limitations is the need for computationally expensive data collection and an effective expert. Finding such an expert might require a good understanding of MAPF itself.
For future work, it would be interesting to design unsupervised learning methods, such as reinforcement
learning (RL) methods, to learn decision-making strategies in MAPF search algorithms without the need
for data collection or an expert. RL has been applied to MAPF before but is mostly used to learn policies
to construct conflict-free solutions for the agents [172, 38]. Applying RL to improving search algorithms
poses a unique challenge since it is not straightforward to model some search algorithms, such as tree
search, as a Markov decision process and the rewards are typically sparse, especially for difficult MAPF
instances [173]. It would also be interesting future work to integrate deep learning techniques into the
proposed ML framework. Though deep learning has been successfully used to improve solving MILPs and
other COPs [14, 212], it is challenging for MAPF due to the non-negligible computational overhead introduced by deep neural networks (DNNs) and the highly optimized nature of state-of-the-art MAPF search
algorithms. Recent work [201] has applied knowledge-distillation techniques to reduce the complexity
of a DNN for MAPF-LNS. We believe designing engineering techniques to overcome such challenges is
important and promising.
In Chapter 3, we presented our second major contribution to using ML to improve decision-making
strategies in MILP search algorithms. We contributed a general ML framework based on contrastive learning (CL) and implemented the framework on two different MILP search algorithms, namely LNS and PaS,
which substantially improved their performance in terms of both runtime and solution quality when compared to imitation learning and/or RL methods. The main contributions of this chapter were published in
the International Conference on Machine Learning in 2023 and 2024 [87, 88]. Despite a lot of success in
ML-guided MILP solving, they are the first works that apply CL techniques to improve MILP search algorithms. Our works have inspired other works in the ML-guided MILP-solving community since they were
published [26, 55], where [26] applies the same CL framework to predict a subset of variables to prioritize
branching on and [55, 106] use generative models to learn destroy heuristics for LNS. We believe that the
contribution in Chapter 3 is valuable and will further facilitate the use of CL for ML-guided MILP solving.
Next, we discuss the limitations of the contribution in Chapter 3 and future work for improving
decision-making strategies in MILP search algorithms. Solving MILPs based on solution predictions, such
as ConPaS, does not guarantee completeness or optimality. CL-LNS does not either since it is based on
LNS. Therefore, it would be interesting and important future work to integrate each of them into optimal
MILP search algorithms such as Branch-and-Bound (BnB). For example, CL-LNS can be implemented as
a primal heuristic, a class of heuristics that are capable of finding high-quality feasible solutions to the
MILP fast in BnB. For this direction, it will be important to craft and utilize features related to the search
tree of BnB, since BnB is a tree search that provides dynamic information, such as the dual bound of the
solution, cutting planes generated to prune the search space and branching decisions for partitioning the
search space. On the other hand, ConPaS can also be incorporated in BnB, where the predicted solutions
can be used to assign branching priorities to the variables or to select a linear combination of variables to
branch on for generalized strong branching [202]. Furthermore, it is also promising future work to apply
the CL framework to improve the performance of imitation learning or RL methods for decision-making
in BnB, such as selecting variables to branch on and selecting nodes to expand.
To summarize, we presented two general ML frameworks to improve human-designed decision-making
strategies in different search algorithms for MAPF and MILP, respectively. They are the first ML frameworks in the literature that provide guidance on how one could improve different search algorithms for a
COP systematically. These frameworks are useful because one needs different search algorithms when the
optimality requirements for the solutions and/or the computation budget to solve the COP change from
time to time. Finally, we discuss how our contributions can be generalized to other COPs. First, it is a natural idea to apply the CL framework for MILP to MAPF. We discuss how this could be done for MAPF-LNS
and PP and identify potential challenges. In Chapter 3, we showed that CL learned joint representations
of variables given MILP instances that were useful for guiding decision-making for both LNS and PaS. For
MAPF, we could leverage it to learn useful representations of agents given MAPF instances. For MAPF-LNS, instead of learning to select agent sets proposed by an expert in MAPF-ML-LNS, CL can be applied
to learn to directly construct the agent sets, similarly to CL-LNS for MILPs. For PP, instead of predicting
priorities for agents individually (even though we use features that capture information of other agents),
we could leverage CL to learn to jointly predict a score for each agent to derive their priorities. To train
the model, one could use scores or ranks for agents derived from good and bad total priority orderings as
positive and negative samples, respectively. This approach would eliminate the need to assign labels using
a heuristic method to account for multiple good total priority orderings, as we did for PP+ML. One important open question is to design features and ML model architectures for MAPF that capture dependencies
among agents, similar to the bipartite graph representations and graph neural networks for MILPs that
capture dependencies between variables and constraints. Second, it is future work to apply either of the
two ML frameworks to other COPs. Both frameworks used the same ML models, ML methods and loss
functions as well as similar features for a COP, demonstrating good generalizability. We focused on training an ML model to make a single decision and evaluated that trained model only for making that decision.
A promising way to apply the frameworks to COPs, in general, is to learn an ML model that generalizes
across different COPs and/or decision-making tasks. In this direction, our ML frameworks can serve as
the foundation for multi-task learning, where we can put together the data collected for different tasks
with the frameworks to form a larger training dataset. We then use this larger training dataset to train a
foundational ML model capable of performing various decision-making tasks in search algorithms for one
or multiple COPs, with minimal or no fine-tuning required for each task.
Bibliography
[1] Tobias Achterberg, Timo Berthold, and Gregor Hendel. “Rounding and propagation heuristics for
mixed integer programming”. In: Operations Research Proceedings. Springer, 2012, pp. 71–76.
[2] Tobias Achterberg, Thorsten Koch, and Alexander Martin. “Branching rules revisited”. In:
Operations Research Letters 33.1 (2005), pp. 42–54.
[3] Réka Albert and Albert-László Barabási. “Statistical mechanics of complex networks”. In: Reviews
of modern physics 74.1 (2002), p. 47.
[4] André Altmann, Laura Toloşi, Oliver Sander, and Thomas Lengauer. “Permutation importance: a
corrected feature importance measure”. In: Bioinformatics 26.10 (2010), pp. 1340–1347.
[5] Andre RS Amaral. “An exact approach to the one-dimensional facility layout problem”. In:
Operations Research 56.4 (2008), pp. 1026–1033.
[6] Brandon Amos. “Tutorial on amortized optimization”. In: Foundations and Trends in Machine
Learning 16.5 (2023), pp. 592–732.
[7] David Applegate, Robert Bixby, Vašek Chvátal, and William Cook. Finding cuts in the TSP (A
preliminary report). Vol. 95. Citeseer, 1995.
[8] Nabila Azi, Michel Gendreau, and Jean-Yves Potvin. “An adaptive large neighborhood search for a
vehicle routing problem with multiple routes”. In: Computers & Operations Research 41 (2014),
pp. 167–173.
[9] Jacopo Banfi, Nicola Basilico, and Francesco Amigoni. “Intractability of time-optimal multirobot
path planning on 2d grid graphs with holes”. In: IEEE Robotics and Automation Letters 2.4 (2017),
pp. 1941–1947.
[10] Max Barer, Guni Sharon, Roni Stern, and Ariel Felner. “Suboptimal variants of the conflict-based
search algorithm for the multi-agent pathfinding problem”. In: Symposium on Combinatorial
Search. 2014, pp. 19–27.
[11] Yasmine Beck and Martin Schmidt. “A gentle and incomplete introduction to bilevel
optimization”. In: (2021). url: https://optimization-online.org/?p=17182.
[12] Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. “Neural
combinatorial optimization with reinforcement learning”. In: arXiv preprint arXiv:1611.09940
(2016).
[13] Stefano Benati and Romeo Rizzi. “A mixed integer linear programming formulation of the
optimal mean/value-at-risk portfolio problem”. In: European Journal of Operational Research 176.1
(2007), pp. 423–434.
[14] Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. “Machine learning for combinatorial
optimization: a methodological tour d’horizon”. In: European Journal of Operational Research
290.2 (2021), pp. 405–421.
[15] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. “Curriculum learning”.
In: International Conference on Machine Learning. 2009, pp. 41–48.
[16] Maren Bennewitz, Wolfram Burgard, and Sebastian Thrun. “Finding and optimizing solvable
priority schemes for decoupled path planning techniques for teams of mobile robots”. In: Robotics
and Autonomous Systems 41.2-3 (2002), pp. 89–99.
[17] Jur P. van den Berg and Mark H. Overmars. “Prioritized motion planning for multiple robots”. In:
IEEE/RSJ International Conference on Intelligent Robots and Systems. 2005, pp. 430–435.
[18] Jur P. van den Berg, Jack Snoeyink, Ming C. Lin, and Dinesh Manocha. “Centralized path
planning for multiple robots: Optimal decoupling into sequential plans”. In: Robotics: Science and
Systems V. 2009, pp. 2–3.
[19] Timo Berthold. “Primal heuristics for mixed integer programs”. PhD thesis. Zuse Institute Berlin
(ZIB), 2006.
[20] Timo Berthold. “RENS”. In: Mathematical Programming Computation 6.1 (2014), pp. 33–54.
[21] Ksenia Bestuzheva, Mathieu Besançon, Wei-Kun Chen, Antonia Chmiela, Tim Donkiewicz,
Jasper van Doornmalen, Leon Eifler, Oliver Gaul, Gerald Gamrath, Ambros Gleixner,
Leona Gottwald, Christoph Graczyk, Katrin Halbig, Alexander Hoen, Christopher Hojny,
Rolf van der Hulst, Thorsten Koch, Marco Lübbecke, Stephen J. Maher, Frederic Matter,
Erik Mühmer, Benjamin Müller, Marc E. Pfetsch, Daniel Rehfeldt, Steffan Schlein,
Franziska Schlösser, Felipe Serrano, Yuji Shinano, Boro Sofranac, Mark Turner, Stefan Vigerske,
Fabian Wegscheider, Philipp Wellner, Dieter Weninger, and Jakob Witzig. The SCIP optimization
suite 8.0. Technical Report. Optimization Online, Dec. 2021. url:
http://www.optimization-online.org/DB_HTML/2021/12/8728.html.
[22] Eli Boyarski, Ariel Felner, Roni Stern, Guni Sharon, David Tolpin, Oded Betzalel, and
Eyal Shimony. “ICBS: Improved conflict-based search algorithm for multi-agent pathfinding”. In:
International Joint Conference on Artificial Intelligence. 2015, pp. 442–449.
[23] Shaked Brody, Uri Alon, and Eran Yahav. “How attentive are graph attention networks?” In:
International Conference on Learning Representations (2022).
[24] Stephen J Buckley. “Fast motion planning for multiple moving robots”. In: IEEE International
Conference on Robotics and Automation. 1989, pp. 322–326.
[25] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and
Greg Hullender. “Learning to rank using gradient descent”. In: International Conference on
Machine Learning. 2005, pp. 89–96.
[26] Junyang Cai, Taoan Huang, and Bistra Dilkina. “Learning backdoors for mixed integer programs
with contrastive learning”. In: arXiv preprint arXiv:2401.10467 (2024).
[27] Yi Cao, Sivakumar Rathinam, and Dengfeng Sun. “Greedy-heuristic-aided mixed-integer linear
programming approach for arrival scheduling”. In: Journal of Aerospace Information Systems 10.7
(2013), pp. 323–336.
[28] Quentin Cappart, Thierry Moisan, Louis-Martin Rousseau, Isabeau Prémont-Schwarz, and
Andre A Cire. “Combining reinforcement learning and constraint programming for combinatorial
optimization”. In: AAAI Conference on Artificial Intelligence. Vol. 35. 5. 2021, pp. 3677–3687.
[29] Gary W Chang, YD Tsai, CY Lai, and JS Chung. “A practical mixed integer linear programming
based approach for unit commitment”. In: IEEE Power Engineering Society General Meeting, 2004.
IEEE. 2004, pp. 221–225.
[30] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. “A simple framework
for contrastive learning of visual representations”. In: International Conference on Machine
Learning. PMLR. 2020, pp. 1597–1607.
[31] Weizhe Chen, Zhihan Wang, Jiaoyang Li, Sven Koenig, and Bistra Dilkina. “No panacea in
planning: Algorithm selection for suboptimal multi-agent path finding”. In: arXiv preprint
arXiv:2404.03554 (2024).
[32] Xinyun Chen and Yuandong Tian. “Learning to perform local rewriting for combinatorial
optimization”. In: Advances in Neural Information Processing Systems 32 (2019).
[33] Zhe Chen, Daniel Harabor, Jiaoyang Li, and Peter J Stuckey. “Traffic flow optimisation for
lifelong multi-agent path finding”. In: AAAI Conference on Artificial Intelligence. Vol. 38. 18. 2024,
pp. 20674–20682.
[34] Antonia Chmiela, Elias Khalil, Ambros Gleixner, Andrea Lodi, and Sebastian Pokutta. “Learning
to schedule heuristics in branch and bound”. In: Advances in Neural Information Processing
Systems 34 (2021), pp. 24235–24246.
[35] Liron Cohen, Tansel Uras, and Sven Koenig. “Feasibility study: Using highways for
bounded-suboptimal multi-agent path finding”. In: Symposium on Combinatorial Search. 2015.
[36] Liron Cohen, Tansel Uras, TK Satish Kumar, Hong Xu, Nora Ayanian, and Sven Koenig.
“Improved solvers for bounded-suboptimal multi-agent path finding.” In: International Joint
Conference on Artificial Intelligence. 2016, pp. 3067–3074.
[37] IBM ILOG Cplex. “V12.1: User’s manual for CPLEX”. In: International Business Machines
Corporation 46.53 (2009), p. 157.
[38] Mehul Damani, Zhiyao Luo, Emerson Wenzel, and Guillaume Sartoretti. “PRIMAL_2: Pathfinding
via reinforcement and imitation multi-agent learning-lifelong”. In: IEEE Robotics and Automation
Letters 6.2 (2021), pp. 2666–2673.
[39] Emilie Danna, Edward Rothberg, and Claude Le Pape. “Exploring relaxation induced
neighborhoods to improve MIP solutions”. In: Mathematical Programming 102.1 (2005), pp. 71–90.
[40] Hal Daumé, John Langford, and Daniel Marcu. “Search-based structured prediction”. In: Machine
Learning 75.3 (2009), pp. 297–325.
[41] Sven De Vries and Rakesh V Vohra. “Combinatorial auctions: A survey”. In: INFORMS Journal on
computing 15.3 (2003), pp. 284–309.
[42] Michel Deudon, Pierre Cournut, Alexandre Lacoste, Yossiri Adulyasak, and
Louis-Martin Rousseau. “Learning heuristics for the TSP by policy gradient”. In: Integration of
Constraint Programming, Artificial Intelligence, and Operations Research. Springer. 2018,
pp. 170–181.
[43] Bistra Dilkina and Carla P Gomes. “Solving connected subgraph problems in wildlife
conservation.” In: International Conference on Integration of Constraint Programming, Artificial
Intelligence, and Operations Research. Vol. 6140. Springer. 2010, pp. 102–116.
[44] Jian-Ya Ding, Chao Zhang, Lei Shen, Shengyin Li, Bing Wang, Yinghui Xu, and Le Song.
“Accelerating primal solution findings for mixed integer programs based on solution prediction”.
In: AAAI Conference on Artificial Intelligence. Vol. 34. 02. 2020, pp. 1452–1459.
[45] Kurt Dresner and Peter Stone. “A multiagent approach to autonomous intersection management”.
In: Journal of Artificial Intelligence Research 31 (2008), pp. 591–656.
[46] Haonan Duan, Pashootan Vaezipoor, Max B Paulus, Yangjun Ruan, and Chris Maddison.
“Augment with care: Contrastive learning for combinatorial problems”. In: International
Conference on Machine Learning. PMLR. 2022, pp. 5627–5642.
[47] Lu Duan, Haoyuan Hu, Yu Qian, Yu Gong, Xiaodong Zhang, Jiangwen Wei, and Yinghui Xu. “A
multitask selected learning approach for solving 3D flexible bin packing problem”. In:
International Conference on Autonomous Agents and MultiAgent Systems. 2019, pp. 1386–1394.
[48] Olivier Dubois and Gilles Dequen. “A backbone-search heuristic for efficient solving of hard
3-SAT formulae”. In: International Joint Conference on Artificial Intelligence. Vol. 1. 2001,
pp. 248–253.
[49] Michael Erdmann and Tomas Lozano-Perez. “On multiple moving objects”. In: Algorithmica 2
(1987), pp. 477–521.
[50] Paul Erdos, Alfréd Rényi, et al. “On the evolution of random graphs”. In: Publ. Math. Inst. Hung.
Acad. Sci 5.1 (1960), pp. 17–60.
[51] Benjamin Eysenbach, Tianjun Zhang, Sergey Levine, and Russ R Salakhutdinov. “Contrastive
learning as goal-conditioned reinforcement learning”. In: Advances in Neural Information
Processing Systems 35 (2022), pp. 35603–35620.
[52] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. “LIBLINEAR: A
library for large linear classification”. In: Journal of Machine Learning Research 9.Aug (2008),
pp. 1871–1874.
[53] Ariel Felner, Meir Goldenberg, Guni Sharon, Roni Stern, Tal Beja, Nathan R Sturtevant,
Jonathan Schaeffer, and Robert Holte. “Partial-Expansion A* with Selective Node Generation.” In:
AAAI Conference on Artificial Intelligence. 2012, pp. 180–181.
[54] Ariel Felner, Jiaoyang Li, Eli Boyarski, Hang Ma, Liron Cohen, TK Satish Kumar, and
Sven Koenig. “Adding heuristics to conflict-based search for multi-agent path finding”. In:
International Conference on Automated Planning and Scheduling. 2018, pp. 83–87.
[55] Shengyu Feng, Zhiqing Sun, and Yiming Yang. “DIFUSCO-LNS: Diffusion-guided large
neighborhood search for integer linear programming”. In: (2023).
[56] Matteo Fischetti and Andrea Lodi. “Local branching”. In: Mathematical programming 98.1 (2003),
pp. 23–47.
[57] Graeme Gange, Daniel Harabor, and Peter J Stuckey. “Lazy CBS: implicit conflict-based search
using lazy clause generation”. In: International Conference on Automated Planning and Scheduling.
Vol. 29. 2019, pp. 155–162.
[58] Maxime Gasse, Simon Bowly, Quentin Cappart, Charfreitag, et al. “The machine learning for
combinatorial optimization competition (ml4co): Results and insights”. In: NeurIPS 2021
Competitions and Demonstrations Track. PMLR. 2022, pp. 220–231.
[59] Maxime Gasse, Didier Chételat, Nicola Ferroni, Laurent Charlin, and Andrea Lodi. “Exact
combinatorial optimization with graph convolutional neural networks”. In: Advances in Neural
Information Processing Systems 32 (2019).
[60] Yonatan Geifman and Ran El-Yaniv. “Selectivenet: A deep neural network with an integrated
reject option”. In: International Conference on Machine Learning. PMLR. 2019, pp. 2151–2159.
[61] Shubhashis Ghosh. “DINS, a MIP improvement heuristic”. In: International Conference on Integer
Programming and Combinatorial Optimization. Springer. 2007, pp. 310–323.
[62] John Giorgi, Osvald Nitski, Bo Wang, and Gary Bader. “DeCLUTR: Deep contrastive learning for
unsupervised textual representations”. In: Annual Meeting of the Association for Computational
Linguistics and International Joint Conference on Natural Language Processing (Volume 1: Long
Papers). 2021, pp. 879–895.
[63] Rodrigo N Gómez, Carlos Hernández, and Jorge A Baier. “A compact answer set programming
encoding of multi-agent pathfinding”. In: IEEE Access 9 (2021), pp. 26886–26901.
[64] Ralph E Gomory. Outline of an algorithm for integer solutions to linear programs and an algorithm
for the mixed integer problem. Springer, 2010.
[65] Lacy M Greening, Mathieu Dahan, and Alan L Erera. “Lead-time-constrained middle-mile
consolidation network design with fixed origins and destinations”. In: Transportation Research
Part B: Methodological 174 (2023), p. 102782.
[66] Beliz Gunel, Jingfei Du, Alexis Conneau, and Veselin Stoyanov. “Supervised contrastive learning
for pre-trained language model fine-tuning”. In: International Conference on Learning
Representations. 2021.
[67] Amrita Gupta and Bistra Dilkina. “Budget-constrained demand-weighted network design for
resilient infrastructure”. In: 2019 IEEE 31st International Conference on Tools with Artificial
Intelligence. IEEE. 2019, pp. 456–463.
[68] Prateek Gupta, Maxime Gasse, Elias Khalil, Pawan Mudigonda, Andrea Lodi, and Yoshua Bengio.
“Hybrid models for learning to branch”. In: Advances in Neural Information Processing Systems 33
(2020), pp. 18087–18097.
[69] Gurobi Optimization, LLC. Gurobi optimizer reference manual. 2022. url: https://www.gurobi.com.
[70] Qingyu Han, Linxin Yang, Qian Chen, et al. “A GNN-guided predict-and-search framework for
mixed-integer linear programming”. In: International Conference on Learning Representations.
2022.
[71] Pierre Hansen, Nenad Mladenović, Jack Brimberg, and José A Moreno Pérez. Variable
neighborhood search. Springer, 2019.
[72] Peter E Hart, Nils J Nilsson, and Bertram Raphael. “A formal basis for the heuristic determination
of minimum cost paths”. In: IEEE transactions on Systems Science and Cybernetics 4.2 (1968),
pp. 100–107.
[73] He He, Hal Daume III, and Jason M. Eisner. “Learning to search in branch and bound algorithms”.
In: Advances in Neural Information Processing Systems. 2014, pp. 3293–3301.
[74] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. “Momentum contrast for
unsupervised visual representation learning”. In: IEEE/CVF Conference on Computer Vision and
Pattern Recognition. 2020, pp. 9729–9738.
[75] Gregor Hendel. “Adaptive large neighborhood search for mixed integer programming”. In:
Mathematical Programming Computation 14.2 (2022), pp. 185–221.
[76] Sunderesh S Heragu and Andrew Kusiak. “Efficient models for the facility layout problem”. In:
European Journal of Operational Research 53.1 (1991), pp. 1–13.
[77] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman,
Adam Trischler, and Yoshua Bengio. “Learning deep representations by mutual information
estimation and maximization”. In: International Conference on Learning Representations (2019).
[78] Florence Ho, Rúben Geraldes, Artur Gonçalves, Bastien Rigault, Benjamin Sportich,
Daisuke Kubo, Marc Cavazza, and Helmut Prendinger. “Decentralized multi-agent path finding
for UAV traffic management”. In: IEEE Transactions on Intelligent Transportation Systems 23.2
(2020), pp. 997–1008.
[79] John H Holland. “Genetic algorithms”. In: Scientific American 267.1 (1992), pp. 66–73.
[80] Wolfgang Hönig, Scott Kiesel, Andrew Tinka, Joseph W. Durham, and Nora Ayanian. “Persistent
and robust execution of MAPF schedules in warehouses”. In: IEEE Robotics and Automation Letters
4.2 (2019), pp. 1125–1131.
[81] Wolfgang Hönig, James A Preiss, TK Satish Kumar, Gaurav S Sukhatme, and Nora Ayanian.
“Trajectory planning for quadrotor swarms”. In: IEEE Transactions on Robotics 34.4 (2018),
pp. 856–869.
[82] André Hottung and Kevin Tierney. “Neural large neighborhood search for the capacitated vehicle
routing problem”. In: European Conference on Artificial Intelligence. IOS Press, 2020, pp. 443–450.
[83] Haoyuan Hu, Xiaodong Zhang, Xiaowei Yan, Longfei Wang, and Yinghui Xu. “Solving a new 3d
bin packing problem with deep reinforcement learning method”. In: arXiv preprint
arXiv:1708.05930 (2017).
[84] Taoan Huang and Bistra Dilkina. “Enhancing seismic resilience of water pipe networks”. In: ACM
SIGCAS Conference on Computing and Sustainable Societies. 2020, pp. 44–52.
[85] Taoan Huang, Bistra Dilkina, and Sven Koenig. “Learning node-selection strategies in
bounded-suboptimal conflict-based search for multi-agent path finding”. In: International Joint Conference
on Autonomous Agents and Multiagent Systems (AAMAS). 2021.
[86] Taoan Huang, Aaron Ferber, Yuandong Tian, Bistra Dilkina, and Benoit Steiner. “Local branching
relaxation heuristics for integer linear programs”. In: International Conference on Integration of
Constraint Programming, Artificial Intelligence, and Operations Research. Springer. 2023,
pp. 96–113.
[87] Taoan Huang, Aaron M Ferber, Yuandong Tian, Bistra Dilkina, and Benoit Steiner. “Searching
large neighborhoods for integer linear programs with contrastive learning”. In: International
Conference on Machine Learning. PMLR. 2023, pp. 13869–13890.
[88] Taoan Huang, Aaron M Ferber, Arman Zharmagambetov, Yuandong Tian, and Bistra Dilkina.
“Contrastive predict-and-search for mixed integer linear programs”. In: International Conference
on Machine Learning. PMLR. 2024.
[89] Taoan Huang, Sven Koenig, and Bistra Dilkina. “Learning to resolve conflicts for multi-agent path
finding with conflict-based search”. In: AAAI Conference on Artificial Intelligence. Vol. 35. 13. 2021,
pp. 11246–11253.
[90] Taoan Huang, Jiaoyang Li, Sven Koenig, and Bistra Dilkina. “Anytime multi-agent path finding
via machine learning-guided large neighborhood search”. In: AAAI Conference on Artificial
Intelligence. Vol. 36. 9. 2022, pp. 9368–9376.
[91] Taoan Huang, Vikas Shivashankar, Michael Caldara, Joseph Durham, Jiaoyang Li, Bistra Dilkina,
and Sven Koenig. “Deadline-aware multi-agent tour planning”. In: International Conference on
Automated Planning and Scheduling. 2023.
[92] Zeren Huang, Kerong Wang, Furui Liu, Hui-Ling Zhen, Weinan Zhang, Mingxuan Yuan,
Jianye Hao, Yong Yu, and Jun Wang. “Learning to select cuts for efficient mixed-integer
programming”. In: Pattern Recognition 123 (2022), p. 108353.
[93] Anil Jindal and Kuldip Singh Sangwan. “Closed loop supply chain network design and
optimisation using fuzzy mixed integer linear programming model”. In: International Journal of
Production Research 52.14 (2014), pp. 4156–4173.
[94] Thorsten Joachims. “Optimizing search engines using clickthrough data”. In: ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining. 2002, pp. 133–142.
[95] Thorsten Joachims. “Training linear SVMs in linear time”. In: ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. 2006, pp. 217–226.
[96] David S Johnson, Jan Karel Lenstra, and AHG Rinnooy Kan. “The complexity of the network
design problem”. In: Networks 8.4 (1978), pp. 279–285.
[97] David S Johnson, Christos H Papadimitriou, and Mihalis Yannakakis. “How easy is local search?”
In: Journal of computer and system sciences 37.1 (1988), pp. 79–100.
[98] Omri Kaduri, Eli Boyarski, and Roni Stern. “Algorithm selection for optimal multi-agent
pathfinding”. In: International Conference on Automated Planning and Scheduling. 2020,
pp. 161–165.
[99] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. “Learning combinatorial
optimization algorithms over graphs”. In: Advances in Neural Information Processing Systems 30
(2017).
[100] Elias B Khalil, Bistra Dilkina, George L Nemhauser, Shabbir Ahmed, and Yufen Shao. “Learning to
Run Heuristics in Tree Search.” In: International Joint Conference on Artificial Intelligence. 2017,
pp. 659–666.
[101] Elias B Khalil, Christopher Morris, and Andrea Lodi. “MIP-GNN: A data-driven framework for
guiding combinatorial solvers”. In: AAAI Conference on Artificial Intelligence. Vol. 36. 9. 2022,
pp. 10219–10227.
[102] Elias Boutros Khalil, Pierre Le Bodic, Le Song, George L. Nemhauser, and Bistra Dilkina.
“Learning to branch in mixed integer programming”. In: AAAI Conference on Artificial
Intelligence. 2016, pp. 724–731.
[103] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola,
Aaron Maschinot, Ce Liu, and Dilip Krishnan. “Supervised contrastive learning”. In: Advances in
Neural Information Processing Systems 33 (2020), pp. 18661–18673.
[104] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In:
International Conference on Learning Representations. 2015.
[105] Satoshi Kitayama and Keiichiro Yasuda. “A method for mixed integer programming problems by
particle swarm optimization”. In: Electrical Engineering in Japan 157.2 (2006), pp. 40–49.
[106] Shufeng Kong, Caihua Liu, and Carla P Gomes. “ILP-FORMER: Solving Integer Linear
Programming with Sequence to Multi-Label Learning”. In: Uncertainty in Artificial Intelligence.
2024.
[107] Wouter Kool, Herke Van Hoof, and Max Welling. “Attention, learn to solve routing problems!” In:
International Conference on Learning Representations. 2018.
[108] Attila A Kovacs, Sophie N Parragh, Karl F Doerner, and Richard F Hartl. “Adaptive large
neighborhood search for service technician routing and scheduling problems”. In: Journal of
Scheduling 15.5 (2012), pp. 579–600.
[109] Wen-Yang Ku and J Christopher Beck. “Mixed integer programming models for job shop
scheduling: A computational analysis”. In: Computers & Operations Research 73 (2016),
pp. 165–173.
[110] Abdel Ghani Labassi, Didier Chételat, and Andrea Lodi. “Learning to compare nodes in branch
and bound with graph neural networks”. In: Advances in Neural Information Processing Systems
(2022).
[111] Edward Lam and Pierre Le Bodic. “New valid inequalities in branch-and-cut-and-price for
multi-agent path finding”. In: International Conference on Automated Planning and Scheduling.
2020, pp. 184–192.
[112] Edward Lam, Pierre Le Bodic, Daniel Damir Harabor, and Peter J Stuckey.
“Branch-and-cut-and-price for multi-agent pathfinding.” In: International Joint Conference on
Artificial Intelligence. 2019, pp. 1289–1296.
[113] Ailsa H Land and Alison G Doig. “An automatic method for solving discrete programming
problems”. In: 50 Years of Integer Programming 1958–2008. Springer, 2010, pp. 105–132.
[114] Jasmina Lazić, Saïd Hanafi, Nenad Mladenović, and Dragan Urošević. “Variable neighbourhood
decomposition search for 0–1 mixed integer programs”. In: Computers & Operations Research 37.6
(2010), pp. 1055–1067.
[115] Kevin Leyton-Brown, Mark Pearson, and Yoav Shoham. “Towards a universal test suite for
combinatorial auction algorithms”. In: ACM conference on Electronic Commerce. 2000, pp. 66–76.
[116] Hui Li, Teng Long, Guangtong Xu, and Yangjie Wang. “Coupling-degree-based heuristic
prioritized planning method for UAV swarm path generation”. In: Chinese Automation Congress.
2019, pp. 3636–3641.
[117] Jiaoyang Li, Zhe Chen, Daniel Harabor, Peter J Stuckey, and Sven Koenig. “Anytime multi-agent
path finding via large neighborhood search”. In: International Joint Conference on Artificial
Intelligence. 2021, pp. 4127–4135.
[118] Jiaoyang Li, Zhe Chen, Daniel Harabor, Peter J. Stuckey, and Sven Koenig. “Anytime multi-agent
path finding via large neighborhood search: Extended abstract”. In: International Joint Conference
on Autonomous Agents and Multiagent Systems. 2021.
[119] Jiaoyang Li, Zhe Chen, Daniel Harabor, Peter J. Stuckey, and Sven Koenig. “MAPF-LNS2: Fast
Repairing for Multi-Agent Path Finding via Large Neighborhood Search”. In: AAAI Conference on
Artificial Intelligence. 2022, pp. 10256–10265.
[120] Jiaoyang Li, Ariel Felner, Eli Boyarski, Hang Ma, and Sven Koenig. “Improved heuristics for
multi-agent path finding with conflict-based search.” In: International Joint Conference on
Artificial Intelligence. 2019, pp. 442–449.
[121] Jiaoyang Li, Graeme Gange, Daniel Harabor, Peter J Stuckey, Hang Ma, and Sven Koenig. “New
techniques for pairwise symmetry breaking in multi-agent path finding”. In: Proceedings of the
International Conference on Automated Planning and Scheduling. 2020, pp. 193–201.
[122] Jiaoyang Li, Daniel Harabor, Peter J. Stuckey, Hang Ma, and Sven Koenig. “Disjoint splitting for
multi-agent path finding with conflict-based search”. In: International Conference on Automated
Planning and Scheduling. 2019, pp. 279–283.
[123] Jiaoyang Li, Eugene Lin, Hai L Vu, Sven Koenig, et al. “Intersection coordination with
priority-based search for autonomous vehicles”. In: AAAI Conference on Artificial Intelligence.
Vol. 37. 10. 2023, pp. 11578–11585.
[124] Jiaoyang Li, Wheeler Ruml, and Sven Koenig. “EECBS: A bounded-suboptimal search for
multi-agent path finding”. In: AAAI Conference on Artificial Intelligence. Vol. 35. 14. 2021,
pp. 12353–12362.
[125] Jiaoyang Li, Kexuan Sun, Hang Ma, Ariel Felner, TK Kumar, and Sven Koenig. “Moving agents in
formation in congested environments”. In: Symposium on Combinatorial Search. Vol. 11. 1. 2020,
pp. 131–132.
[126] Jiaoyang Li, Andrew Tinka, Scott Kiesel, Joseph W Durham, TK Satish Kumar, and Sven Koenig.
“Lifelong multi-agent path finding in large-scale warehouses”. In: AAAI Conference on Artificial
Intelligence. Vol. 35. 13. 2021, pp. 11272–11281.
[127] Sirui Li, Zhongxia Yan, and Cathy Wu. “Learning to delegate for large-scale vehicle routing”. In:
Advances in Neural Information Processing Systems 34 (2021), pp. 26198–26211.
[128] Wenhao Li, Hongjun Chen, Bo Jin, Wenzhe Tan, Hongyuan Zha, and Xiangfeng Wang.
“Multi-agent path finding with prioritized communication learning”. In: International Conference
on Robotics and Automation. IEEE. 2022, pp. 10695–10701.
[129] Xiang Li, Tiejian Li, Jiahua Wei, Guangqian Wang, and William WG Yeh. “Hydro unit
commitment via mixed integer linear programming: A case study of the three gorges project,
China”. In: IEEE Transactions on Power Systems 29.3 (2013), pp. 1232–1241.
[130] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. “Combinatorial optimization with graph
convolutional networks and guided tree search”. In: Advances in Neural Information Processing
Systems 31 (2018).
[131] Defeng Liu, Matteo Fischetti, and Andrea Lodi. “Learning to search in local branching”. In: AAAI
Conference on Artificial Intelligence. 2022.
[132] Hao Lu, Xingwen Zhang, and Shuang Yang. “A learning-based iterative method for solving
vehicle routing problems”. In: International Conference on Learning Representations. 2019.
[133] Hao Lu, Xingwen Zhang, and Shuang Yang. “A learning-based iterative method for solving
vehicle routing problems”. In: International Conference on Learning Representations. 2020.
[134] Paramet Luathep, Agachai Sumalee, William HK Lam, Zhi-Chun Li, and Hong K Lo. “Global
optimization method for mixed transportation network design problem: a mixed-integer linear
programming approach”. In: Transportation Research Part B: Methodological 45.5 (2011),
pp. 808–827.
[135] Ryan Luna and Kostas E Bekris. “Push and swap: Fast cooperative pathfinding with completeness
guarantees”. In: International Joint Conference on Artificial Intelligence. 2011, pp. 294–300.
[136] Yuh-Chyun Luo, Monique Guignard, and Chun-Hung Chen. “A hybrid approach for integer
programming combining genetic algorithms, linear programming and ordinal optimization”. In:
Journal of Intelligent Manufacturing 12 (2001), pp. 509–519.
[137] Hang Ma, Daniel Harabor, Peter J. Stuckey, Jiaoyang Li, and Sven Koenig. “Searching with
consistent prioritization for multi-agent path finding”. In: AAAI Conference on Artificial
Intelligence. 2019, pp. 7643–7650.
[138] Hang Ma, Jiaoyang Li, TK Satish Kumar, and Sven Koenig. “Lifelong multi-agent path finding for
online pickup and delivery tasks”. In: International Conference on Autonomous Agents and
MultiAgent Systems. 2017, pp. 837–845.
[139] Hang Ma, Craig Tovey, Guni Sharon, TK Satish Kumar, and Sven Koenig. “Multi-agent path
finding with payload transfers and the package-exchange robot-routing problem”. In: AAAI
Conference on Artificial Intelligence. 2016.
[140] Hang Ma, Jingxing Yang, Liron Cohen, TK Satish Kumar, and Sven Koenig. “Feasibility study:
Moving non-homogeneous teams in congested video game environments”. In: Artificial
Intelligence and Interactive Digital Entertainment Conference. 2017, pp. 270–272.
[141] Ziyuan Ma, Yudong Luo, and Hang Ma. “Distributed heuristic multi-agent path finding with
communication”. In: 2021 IEEE International Conference on Robotics and Automation. IEEE. 2021,
pp. 8699–8705.
[142] Stephen J Maher, Tobias Fischer, Tristan Gally, Gerald Gamrath, Ambros Gleixner,
Robert Lion Gottwald, Gregor Hendel, Thorsten Koch, Marco Lübbecke, Matthias Miltenberger,
et al. “The SCIP optimization suite 4.0”. In: (2017).
[143] Sahil Manchanda, Akash Mittal, Anuj Dhawan, Sourav Medya, Sayan Ranu, and Ambuj Singh.
“Learning heuristics over large graphs via deep reinforcement learning”. In: arXiv preprint
arXiv:1903.03332 (2019).
[144] Alan S Manne. “On the job-shop scheduling problem”. In: Operations Research 8.2 (1960),
pp. 219–223.
[145] Renata Mansini, Wlodzimierz Ogryczak, and M Grazia Speranza. “Twenty years of linear
programming based portfolio optimization”. In: European Journal of Operational Research 234.2
(2014), pp. 518–535.
[146] Leilei Meng, Chaoyong Zhang, Yaping Ren, Biao Zhang, and Chang Lv. “Mixed-integer linear
programming and constraint programming formulations for solving distributed flexible job shop
scheduling problem”. In: Computers & Industrial Engineering 142 (2020), p. 106347.
[147] Maxime Mulamba, Jayanta Mandi, Michelangelo Diligenti, Michele Lombardi,
Victor Bucarey Lopez, and Tias Guns. “Contrastive losses and solution caching for
predict-and-optimize”. In: International Joint Conference on Artificial Intelligence. 2021, p. 2833.
[148] Vinod Nair, Sergey Bartunov, Felix Gimeno, et al. “Solving mixed integer programs using neural
networks”. In: arXiv preprint arXiv:2012.13349 (2020).
[149] Mohammadreza Nazari, Afshin Oroojlooy, Lawrence Snyder, and Martin Takác. “Reinforcement
learning for solving the vehicle routing problem”. In: Advances in Neural Information Processing
Systems 31 (2018).
[150] Keisuke Okumura, Manao Machida, Xavier Défago, and Yasumasa Tamura. “Priority inheritance
with backtracking for iterative multi-agent path finding”. In: Artificial Intelligence 310 (2022),
p. 103752.
[151] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. “Representation learning with contrastive
predictive coding”. In: arXiv preprint arXiv:1807.03748 (2018).
[152] James Ostrowski, Miguel F Anjos, and Anthony Vannelli. “Tight mixed integer linear
programming formulations for the unit commitment problem”. In: IEEE Transactions on Power
Systems 27.1 (2011), pp. 39–46.
[153] Christos H Papadimitriou and Kenneth Steiglitz. Combinatorial optimization: algorithms and
complexity. Courier Corporation, 1998.
[154] Max B Paulus, Giulia Zarpellon, Andreas Krause, Laurent Charlin, and Chris Maddison. “Learning
to cut by looking ahead: Cutting plane selection via imitation learning”. In: International
Conference on Machine Learning. PMLR. 2022, pp. 17584–17600.
[155] Judea Pearl and Jin H. Kim. “Studies in semi-admissible heuristics”. In: IEEE Transactions on
Pattern Analysis and Machine Intelligence 4 (1982), pp. 392–399.
[156] Thomy Phan, Taoan Huang, Bistra Dilkina, and Sven Koenig. “Adaptive anytime multi-agent path
finding using bandit-based large neighborhood search”. In: AAAI Conference on Artificial
Intelligence. Vol. 38. 16. 2024, pp. 17514–17522.
[157] Victor Pillac, Pascal Van Hentenryck, and Caroline Even. “A conflict-based path-generation
heuristic for evacuation planning”. In: Transportation Research Part B: Methodological 83 (2016),
pp. 136–150.
[158] Ira Pohl. “Heuristic search viewed as path finding in a graph”. In: Artificial intelligence 1.3–4
(1970), pp. 193–204.
[159] Antoine Prouvost, Justin Dumouchelle, Lara Scavuzzo, Maxime Gasse, Didier Chételat, and
Andrea Lodi. “Ecole: A gym-like library for machine learning in combinatorial optimization
solvers”. In: Learning Meets Combinatorial Algorithms at NeurIPS2020. 2020. url:
https://openreview.net/forum?id=IVc9hqgibyB.
[160] Qi Qian, Lei Shang, Baigui Sun, Juhua Hu, Hao Li, and Rong Jin. “Softtriple loss: Deep metric
learning without triplet sampling”. In: IEEE/CVF International Conference on Computer Vision.
2019, pp. 6450–6458.
[161] Arthur Queffelec, Ocan Sankur, and François Schwarzentruber. “Conflict-based search for
connected multi-agent path finding”. In: arXiv preprint arXiv:2006.03280 (2020).
[162] Christopher JC Burges, Robert Ragno, and Quoc Viet Le. “Learning to rank with nonsmooth cost functions”. In: Advances in Neural
Information Processing Systems. 2007, pp. 193–200.
[163] Jingyao Ren, Vikraman Sathiyanarayanan, Eric Ewing, Baskin Senbaslar, and Nora Ayanian.
“MAPFAST: A deep algorithm selector for multi agent path finding using shortest path
embeddings”. In: International Conference on Autonomous Agents and MultiAgent Systems. 2021.
[164] Nils Rethmeier and Isabelle Augenstein. “A primer on contrastive pretraining in language
processing: Methods, lessons learned, and perspectives”. In: ACM Computing Surveys 55.10 (2023),
pp. 1–17.
[165] Julia Rieck, Juergen Zimmermann, and Thorsten Gather. “Mixed-integer linear programming for
resource leveling problems”. In: European Journal of Operational Research 221.1 (2012), pp. 27–37.
[166] Stefan Ropke and David Pisinger. “An adaptive large neighborhood search heuristic for the pickup
and delivery problem with time windows”. In: Transportation science 40.4 (2006), pp. 455–472.
[167] Stéphane Ross and Drew Bagnell. “Efficient reductions for imitation learning”. In: International
Conference on Artificial Intelligence and Statistics. 2010, pp. 661–668.
[168] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. “A reduction of imitation learning and
structured prediction to no-regret online learning”. In: International Conference on Artificial
Intelligence and Statistics. 2011, pp. 627–635.
[169] Edward Rothberg. “An evolutionary algorithm for polishing mixed integer programming
solutions”. In: INFORMS Journal on Computing 19.4 (2007), pp. 534–541.
[170] Qandeel Sajid, Ryan Luna, and Kostas Bekris. “Multi-agent pathfinding with simultaneous
execution of single-agent primitives”. In: Symposium on Combinatorial Search. Vol. 3. 1. 2012,
pp. 88–96.
[171] Arun Kumar Sangaiah, Erfan Babaee Tirkolaee, Alireza Goli, and Saeed Dehnavi-Arani. “Robust
optimization and mixed-integer linear programming model for LNG supply chain planning
problem”. In: Soft Computing 24 (2020), pp. 7885–7905.
[172] Guillaume Sartoretti, Justin Kerr, Yunfei Shi, Glenn Wagner, TK Satish Kumar, Sven Koenig, and
Howie Choset. “PRIMAL: Pathfinding via reinforcement and imitation multi-agent learning”. In:
IEEE Robotics and Automation Letters 4.3 (2019), pp. 2378–2385.
[173] Lara Scavuzzo, Feng Chen, Didier Chételat, Maxime Gasse, Andrea Lodi, Neil Yorke-Smith, and
Karen Aardal. “Learning to branch with tree MDPs”. In: Advances in Neural Information Processing
Systems 35 (2022), pp. 18514–18526.
[174] Guni Sharon, Roni Stern, Ariel Felner, and Nathan R. Sturtevant. “Conflict-based search for
optimal multi-agent pathfinding”. In: Artificial Intelligence 219 (2015), pp. 40–66.
[175] David Silver. “Cooperative pathfinding”. In: Artificial Intelligence and Interactive Digital
Entertainment Conference. 2005, pp. 117–122.
[176] Stephen L Smith and Frank Imeson. “GLNS: An effective large neighborhood search heuristic for
the generalized traveling salesman problem”. In: Computers & Operations Research 87 (2017),
pp. 1–19.
[177] Jialin Song, Ravi Lanka, Yisong Yue, and Bistra Dilkina. “A general large neighborhood search
framework for solving integer linear programs”. In: Advances in Neural Information Processing
Systems. Vol. 33. 2020.
[178] Jialin Song, Ravi Lanka, Albert Zhao, Aadyot Bhatnagar, Yisong Yue, and Masahiro Ono.
“Learning to search via retrospective imitation”. In: arXiv preprint arXiv:1804.00846 (2018).
[179] Nicolas Sonnerat, Pengming Wang, Ira Ktena, Sergey Bartunov, and Vinod Nair. “Learning a large
neighborhood search algorithm for mixed integer programs”. In: arXiv preprint arXiv:2107.10201
(2021).
[180] Trevor Scott Standley. “Finding optimal solutions to cooperative pathfinding problems.” In: AAAI
Conference on Artificial Intelligence. 2010, pp. 28–29.
[181] Roni Stern, Nathan R. Sturtevant, Ariel Felner, Sven Koenig, Hang Ma, Thayne Walker,
Jiaoyang Li, Dor Atzmon, Liron Cohen, TK Satish Kumar, Eli Boyarski, and Roman Bartak.
“Multi-agent pathfinding: Definitions, variants, and benchmarks”. In: Symposium on
Combinatorial Search. 2019, pp. 151–158.
[182] Nathan R. Sturtevant. “Benchmarks for grid-based pathfinding”. In: IEEE Transactions on
Computational Intelligence and AI in Games 4.2 (2012), pp. 144–148.
[183] Pavel Surynek. “Unifying search-based and compilation-based approaches to multi-agent path
finding through satisfiability modulo theories”. In: Symposium on Combinatorial Search. Vol. 10. 1.
2019, pp. 202–203.
[184] Yunhao Tang, Shipra Agrawal, and Yuri Faenza. “Reinforcement learning for integer
programming: Learning to cut”. In: International Conference on Machine Learning. PMLR. 2020,
pp. 9367–9376.
[185] J Teghem, M Pirlot, and C Antoniadis. “Embedding of linear programming in a simulated
annealing algorithm for solving a mixed integer production planning problem”. In: Journal of
Computational and Applied Mathematics 64.1–2 (1995), pp. 91–102.
[186] Jordan Tyler Thayer and Wheeler Ruml. “Bounded suboptimal search: A direct approach using
inadmissible estimates”. In: International Joint Conference on Artificial Intelligence. Vol. 2011. 2011,
pp. 674–679.
[187] Yuandong Tian. “Understanding Deep Contrastive Learning via Coordinate-wise Optimization”.
In: Advances in Neural Information Processing Systems. 2022.
[188] Zekun Tong, Yuxuan Liang, Henghui Ding, Yongxing Dai, Xinke Li, and Changhu Wang.
“Directed graph contrastive learning”. In: Advances in Neural Information Processing Systems 34
(2021), pp. 19580–19593.
[189] Paolo Toth and Daniele Vigo. The vehicle routing problem. SIAM, 2002.
[190] Glenn Wagner and Howie Choset. “M*: A complete multirobot path planning algorithm with
performance bounds”. In: IEEE/RSJ International Conference on Intelligent Robots and Systems.
2011, pp. 3260–3267.
[191] Jiangxing Wang, Jiaoyang Li, Hang Ma, Sven Koenig, and S Kumar. “A new constraint
satisfaction perspective on multi-agent path finding: Preliminary results”. In: International
Conference on Autonomous Agents and Multiagent Systems. 2019, pp. 2253–2255.
[192] Shuwei Wang, Vadim Bulitko, Taoan Huang, Sven Koenig, and Roni Stern. “Synthesizing priority
planning formulae for multi-agent pathfinding”. In: AAAI Conference on Artificial Intelligence and
Interactive Digital Entertainment. Vol. 19. 1. 2023, pp. 360–369.
[193] Yutong Wang, Bairan Xiang, Shinan Huang, and Guillaume Sartoretti. “SCRIMP: Scalable
communication for reinforcement- and imitation-learning-based multi-agent pathfinding”. In:
International Conference on Autonomous Agents and Multiagent Systems. 2023, pp. 2598–2600.
[194] Laurence A. Wolsey and George L. Nemhauser. Integer and combinatorial optimization. Vol. 55.
John Wiley & Sons, 1999.
[195] Chao-Yuan Wu, R Manmatha, Alexander J Smola, and Philipp Krahenbuhl. “Sampling matters in
deep embedding learning”. In: IEEE International Conference on Computer Vision. 2017,
pp. 2840–2848.
[196] Wenying Wu, Subhrajit Bhattacharya, and Amanda Prorok. “Multi-robot path deconfliction
through prioritization by path prospects”. In: IEEE International Conference on Robotics and
Automation. 2020, pp. 9809–9815.
[197] Wenying Wu, Subhrajit Bhattacharya, and Amanda Prorok. “Multi-robot path deconfliction
through prioritization by path prospects”. In: IEEE international conference on robotics and
automation. IEEE. 2020, pp. 9809–9815.
[198] Yaoxin Wu, Wen Song, Zhiguang Cao, and Jie Zhang. “Learning large neighborhood search policy
for integer programming”. In: Advances in Neural Information Processing Systems 34 (2021),
pp. 30075–30087.
[199] Peter R Wurman, Raffaello D’Andrea, and Mick Mountz. “Coordinating hundreds of cooperative,
autonomous vehicles in warehouses”. In: AI Magazine 29.1 (2008), pp. 9–9.
[200] Liang Xin, Wen Song, Zhiguang Cao, and Jie Zhang. “NeuroLKH: Combining deep learning model
with Lin-Kernighan-Helsgaun heuristic for solving the traveling salesman problem”. In: Advances
in Neural Information Processing Systems 34 (2021), pp. 7472–7483.
[201] Zhongxia Yan and Cathy Wu. “Neural neighborhood search for multi-agent path finding”. In:
International Conference on Learning Representations. 2024.
[202] Yu Yang, Natashia Boland, Bistra Dilkina, and Martin Savelsbergh. “Learning generalized strong
branching for set covering, set packing, and 0–1 knapsack problems”. In: European Journal of
Operational Research 301.3 (2022), pp. 828–840.
[203] Taehyun Yoon, Jinwon Choi, Hyokun Yun, and Sungbin Lim. “Threshold-aware Learning to
Generate Feasible Solutions for Mixed Integer Programs”. In: arXiv preprint arXiv:2308.00327
(2023).
[204] Fengqi You and Ignacio E Grossmann. “Mixed-integer nonlinear programming models and
algorithms for large-scale supply chain design with stochastic inventory management”. In:
Industrial & Engineering Chemistry Research 47.20 (2008), pp. 7802–7817.
[205] Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. “Graph
contrastive learning with augmentations”. In: Advances in Neural Information Processing Systems
33 (2020), pp. 5812–5823.
[206] Chenning Yu, Qingbiao Li, Sicun Gao, and Amanda Prorok. “Accelerating multi-agent planning
using graph transformers with bounded suboptimality”. In: IEEE International Conference on
Robotics and Automation. 2023, pp. 3432–3439.
[207] Jingjin Yu. “Intractability of optimal multi-robot path planning on planar graphs”. In: IEEE Robotics
and Automation Letters 1.1 (2015), pp. 33–40.
[208] Jingjin Yu and Steven M LaValle. “Structure and intractability of optimal multi-robot path
planning on graphs”. In: AAAI Conference on Artificial Intelligence. 2013, pp. 1443–1449.
[209] Jingjin Yu and Steven M. LaValle. “Planning optimal paths for multiple robots on graphs”. In: IEEE
International Conference on Robotics and Automation. 2013, pp. 3612–3617.
[210] Jingjin Yu and Daniela Rus. “Pebble motion on graphs with rotations: Efficient feasibility tests
and planning algorithms”. In: Algorithmic Foundations of Robotics XI: Selected Contributions of the
Eleventh International Workshop on the Algorithmic Foundations of Robotics. Springer. 2015,
pp. 729–746.
[211] Giulia Zarpellon, Jason Jo, Andrea Lodi, and Yoshua Bengio. “Parameterizing branch-and-bound
search trees to learn branching policies”. In: AAAI Conference on Artificial Intelligence. 2021.
[212] Jiayi Zhang, Chang Liu, Xijun Li, Hui-Ling Zhen, Mingxuan Yuan, Yawen Li, and Junchi Yan. “A
survey for solving mixed integer programming via machine learning”. In: Neurocomputing 519
(2023), pp. 205–217.
[213] Shuyang Zhang, Jiaoyang Li, Taoan Huang, Sven Koenig, and Bistra Dilkina. “Learning a priority
ordering for prioritized planning in multi-agent path finding”. In: Symposium on Combinatorial
Search. 2022.
[214] Jiongzhi Zheng, Kun He, Jianrong Zhou, Yan Jin, and Chu-Min Li. “Combining reinforcement
learning with Lin-Kernighan-Helsgaun algorithm for the traveling salesman problem”. In: AAAI
Conference on Artificial Intelligence. 2021.
[215] Ivan Žulj, Sergej Kramer, and Michael Schneider. “A hybrid of adaptive large neighborhood
search and tabu search for the order-batching problem”. In: European Journal of Operational
Research 264.2 (2018), pp. 653–664.
Appendix
A Supplementary Materials to Chapter 3
A.1 Additional Details of MILP Instance Generation
We present the MILP formulations for the minimum vertex cover (MVC), maximum independent set (MIS),
set covering (SC) and combinatorial auction (CA) problems. The descriptions and formulations for the item
placement and workload appointment problems can be found at the ML4CO competition [58] website∗.
In an MVC instance, we are given an undirected graph $G = (V, E)$. The goal is to select the smallest
subset of nodes such that at least one endpoint of every edge in the graph is selected:
$$
\begin{aligned}
\min \quad & \sum_{v \in V} x_v \\
\text{s.t.} \quad & x_u + x_v \ge 1, && \forall (u, v) \in E, \\
& x_v \in \{0, 1\}, && \forall v \in V.
\end{aligned}
$$
∗ML4CO Competition Website: https://github.com/ds4dm/ml4co-competition/blob/main/DATA.md
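To make the MVC formulation concrete, the following is a minimal brute-force sketch in plain Python. It is not the instance generator or solver used in the experiments; the function names and the toy path graph are hypothetical and serve only to show the objective and the edge-covering constraints on a tiny instance.

```python
from itertools import product


def is_vertex_cover(x, edges):
    """Check the MVC constraints x_u + x_v >= 1 for every edge (u, v)."""
    return all(x[u] + x[v] >= 1 for u, v in edges)


def solve_mvc_brute_force(num_nodes, edges):
    """Enumerate all binary assignments and return one with the fewest selected nodes.

    Exponential in num_nodes, so only suitable for toy instances.
    """
    best = None
    for x in product((0, 1), repeat=num_nodes):
        if is_vertex_cover(x, edges) and (best is None or sum(x) < sum(best)):
            best = x
    return best


# Hypothetical toy instance: the path graph 0 - 1 - 2 - 3.
edges = [(0, 1), (1, 2), (2, 3)]
cover = solve_mvc_brute_force(4, edges)
```

On this path graph any minimum cover contains two nodes, which the sketch finds by enumeration; a MILP solver reaches the same optimum from the formulation above without enumerating all assignments.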
In an MIS instance, we are given an undirected graph $G = (V, E)$. The goal is to select the largest
subset of nodes such that no two nodes in the subset are connected by an edge in $G$:
$$
\begin{aligned}
\min \quad & -\sum_{v \in V} x_v \\
\text{s.t.} \quad & x_u + x_v \le 1, && \forall (u, v) \in E, \\
& x_v \in \{0, 1\}, && \forall v \in V.
\end{aligned}
$$
In an SC instance, we are given $m$ elements and a collection $S$ of sets whose union is the set of all elements.
The goal is to select a minimum number of sets from $S$ such that the union of the selected sets is the set of
all elements:
$$
\begin{aligned}
\min \quad & \sum_{s \in S} x_s \\
\text{s.t.} \quad & \sum_{s \in S : i \in s} x_s \ge 1, && \forall i \in [m], \\
& x_s \in \{0, 1\}, && \forall s \in S.
\end{aligned}
$$
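In a CA instance, we are given $\tilde{n}$ bids $\{(B_i, p_i) : i \in [\tilde{n}]\}$ for $\tilde{m}$ items, where $B_i$ is a subset of items
and $p_i$ is its associated bidding price. The objective is to allocate items to bids such that the total revenue
is maximized:
$$
\begin{aligned}
\min \quad & -\sum_{i \in [\tilde{n}]} p_i x_i \\
\text{s.t.} \quad & \sum_{i : j \in B_i} x_i \le 1, && \forall j \in [\tilde{m}], \\
& x_i \in \{0, 1\}, && \forall i \in [\tilde{n}].
\end{aligned}
$$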
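As an illustration of the CA formulation, the sketch below enumerates bid selections for a tiny, made-up auction and keeps the feasible selection with the highest revenue (equivalently, the lowest value of the minimization objective). The function name and the three bids are hypothetical; this is a didactic sketch, not the generator or solver used in the experiments.

```python
from itertools import product


def solve_ca_brute_force(bids, num_items):
    """bids: list of (set_of_items, price). Returns (x, revenue) for the best allocation.

    Feasibility follows the CA constraint: every item j appears in at most
    one accepted bid. Exponential in the number of bids.
    """
    best_x, best_revenue = None, float("-inf")
    for x in product((0, 1), repeat=len(bids)):
        feasible = all(
            sum(x[i] for i, (items, _) in enumerate(bids) if j in items) <= 1
            for j in range(num_items)
        )
        if not feasible:
            continue
        revenue = sum(price * x[i] for i, (_, price) in enumerate(bids))
        if revenue > best_revenue:
            best_x, best_revenue = x, revenue
    return best_x, best_revenue


# Hypothetical toy auction with 3 items and 3 bids.
bids = [({0, 1}, 5.0), ({1, 2}, 4.0), ({2}, 3.0)]
allocation, revenue = solve_ca_brute_force(bids, num_items=3)
```

Here the first and third bids are compatible (they share no item), while the second bid conflicts with both, so accepting the first and third bids is optimal.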
A.2 Supplementary Materials to Section 3.6
A.2.1 Neural Network Architecture for CLLNS
We give full details of the GAT architecture described in subsection 3.6.1.2. The policy takes as input the
state $s^t$ and outputs a score vector $\pi_\theta(s^t) \in [0, 1]^n$, one score per variable. We use 2-layer MLPs with 64
hidden units per layer and ReLU as the activation function to map each node feature and edge feature to
$\mathbb{R}^d$ where $d = 64$.
Let $v_j, c_i, e_{i,j} \in \mathbb{R}^d$ be the embeddings of the $j$-th variable, $i$-th constraint and the edge connecting
them output by the embedding layers. We perform two rounds of message passing through the GAT. In the
first round, each constraint node $c_i$ attends to its neighbors $\mathcal{N}_i$ using an attention structure with $H = 8$
attention heads:
$$
c'_i = \frac{1}{H} \sum_{h=1}^{H} \Big( \alpha^{(h)}_{ii,1} \theta^{(h)}_{c,1} c_i + \sum_{j \in \mathcal{N}_i} \alpha^{(h)}_{ij,1} \theta^{(h)}_{v,1} v_j \Big)
$$
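where $\theta^{(h)}_{c,1} \in \mathbb{R}^{d \times d}$ and $\theta^{(h)}_{v,1} \in \mathbb{R}^{d \times d}$ are learnable weights. The updated constraint embeddings $c'_i$ are
averaged across $H$ attention heads using attention weights [23]
$$
\alpha^{(h)}_{ij,1} = \frac{\exp\big(w_1^T \rho([\theta^{(h)}_{c,1} c_i, \theta^{(h)}_{v,1} v_j, \theta^{(h)}_{e,1} e_{i,j}])\big)}{\sum_{k \in \mathcal{N}_i} \exp\big(w_1^T \rho([\theta^{(h)}_{c,1} c_i, \theta^{(h)}_{v,1} v_k, \theta^{(h)}_{e,1} e_{i,k}])\big)}
$$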
where the attention coefficients $w_1 \in \mathbb{R}^{3d}$ and $\theta^{(h)}_{e,1} \in \mathbb{R}^{d \times d}$ are both learnable weights and $\rho(\cdot)$ refers to
the LeakyReLU activation function with negative slope 0.2. In the second round, similarly, each variable
node attends to its neighbors to get updated variable node embeddings
$$
v'_j = \frac{1}{H} \sum_{h=1}^{H} \Big( \alpha^{(h)}_{jj,2} \theta^{(h)}_{v,2} v_j + \sum_{i \in \mathcal{N}_j} \alpha^{(h)}_{ji,2} \theta^{(h)}_{c,2} c'_i \Big)
$$
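with attention weights
$$
\alpha^{(h)}_{ji,2} = \frac{\exp\big(w_2^T \rho([\theta^{(h)}_{c,2} c'_i, \theta^{(h)}_{v,2} v_j, \theta^{(h)}_{e,2} e_{i,j}])\big)}{\sum_{k \in \mathcal{N}_j} \exp\big(w_2^T \rho([\theta^{(h)}_{c,2} c'_i, \theta^{(h)}_{v,2} v_j, \theta^{(h)}_{e,2} e_{i,k}])\big)}
$$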
where $w_2 \in \mathbb{R}^{3d}$ and $\theta^{(h)}_{c,2}, \theta^{(h)}_{v,2}, \theta^{(h)}_{e,2} \in \mathbb{R}^{d \times d}$ are learnable weights. After the two rounds of message
passing, the final representations of variables $v'$ are passed through a 2-layer MLP with 64 hidden units
per layer to obtain a scalar value for each variable. Finally, we apply the sigmoid function to get a score
between 0 and 1.
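The attention weights used in the message passing can be illustrated with a small single-head sketch in plain Python. It assumes the embeddings have already been multiplied by the learnable projection matrices, ignores the self-attention term, and uses hypothetical function and argument names; it only shows the softmax over neighbors of the score w^T LeakyReLU([c, v_j, e_ij]) and is not the implementation used in the thesis.

```python
import math


def leaky_relu(z, negative_slope=0.2):
    """LeakyReLU with the negative slope 0.2 stated in the text."""
    return z if z > 0 else negative_slope * z


def attention_weights(c_proj, v_projs, e_projs, w):
    """Attention weights of one constraint node over its neighbors (one head).

    c_proj:  projected constraint embedding (length d)
    v_projs: projected embeddings of the neighboring variables (each length d)
    e_projs: projected embeddings of the connecting edges (each length d)
    w:       attention coefficient vector (length 3d)
    """
    logits = []
    for v, e in zip(v_projs, e_projs):
        concat = list(c_proj) + list(v) + list(e)
        logits.append(sum(wk * leaky_relu(zk) for wk, zk in zip(w, concat)))
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - shift) for logit in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical example with d = 1 and two neighbors.
alphas = attention_weights([1.0], [[1.0], [-1.0]], [[0.0], [0.0]], [1.0, 1.0, 1.0])
```

The weights are non-negative and sum to one, and the neighbor with the larger score receives the larger weight. A multi-head layer would repeat this per head with separate projections and average the resulting messages.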
Features. We use the features proposed in [59] for node features and edge features in the bipartite graph
and also include a fixed-size window of the most recent incumbent values as variable node features, with the
window size set to 3 in experiments. In addition, we include features proposed in [102] computed at the
root node of BnB to obtain a richer set of variable node features. The full list of features can be found in
Table 2 in the Appendix of [59] and Table 1 in [102]. In our implementation, we compute them using the APIs
provided by the Ecole library [159]†.
A.2.2 Hyperparameter Tuning
For RLLNS, we use all the hyperparameters provided in their code [198] in our experiments. For the
other LNS methods, all hyperparameters used in experiments are fine-tuned on the validation set, and the
tuning is described below.
For β, which upper bounds the neighborhood size, we tried values from {0.25, 0.5, 0.6, 0.7}. β = 0.25
is the worst for all approaches, resulting in the highest gap. For LBRELAX, ILLNS and CLLNS, all values
perform similarly (because they select effective neighborhoods early in the search and their neighborhood
sizes either do not reach the upper bound or they already converge to good solutions before reaching it).
For RANDOM and GRAPH, β = 0.5 is the best. So, we set β = 0.5 consistently for all approaches.
†More details and the source code can be found at https://doc.ecole.ai/py/en/stable/reference/observations.html.
For the initial neighborhood size $k^0$, we observe that performance is sensitive to its value for approaches
that need longer runtime to select variables, such as LBRELAX, ILLNS and CLLNS. They need the right
$k^0$ from the beginning, so we fine-tune it for them. For RANDOM and GRAPH, the runtime for selecting
variables is short, and with the adaptive neighborhood size mechanism, they can very quickly find the
right neighborhood size and are insensitive to $k^0$. They converge to the same primal gaps (< 1% relative
differences) with similar primal integrals (< 2% relative differences) using different $k^0$. Despite the
differences being small, we still use the best $k^0$ for them.
For γ, which controls the rate at which $k^t$ increases, we tried values from {1, 1.01, 1.02, 1.05}. Overall,
γ does not greatly impact the performance if γ > 1; however, γ = 1 is far worse than the others.
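The roles of γ and β can be illustrated with a short sketch. The update rule below is an assumption made for illustration (grow the neighborhood by a factor γ when a repair does not improve the incumbent, and cap it at a fraction β of the number of variables); this appendix does not restate the exact rule, and the function name and toy numbers are hypothetical.

```python
def next_neighborhood_size(k_t, improved, n_vars, gamma=1.02, beta=0.5):
    """Assumed adaptive update of the neighborhood size.

    Keeps k_t after an improving repair; otherwise multiplies it by gamma,
    never exceeding beta * n_vars.
    """
    if improved:
        return k_t
    return min(gamma * k_t, beta * n_vars)


# Hypothetical run: 1,000 variables, initial size 400, no improvements found.
k = 400.0
sizes = []
for _ in range(15):
    k = next_neighborhood_size(k, improved=False, n_vars=1000)
    sizes.append(k)
```

Under this assumed rule, γ = 1 never enlarges the neighborhood, so the search cannot escape a neighborhood size that has stopped yielding improvements, which is consistent with the observation that γ = 1 performs far worse than values above 1.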
For the runtime limit for each repair operation, we tried limits of 0.5, 1, 2 and 5 minutes. Most
approaches are not sensitive to it since most repairs finish within 20 seconds. The exception is ILLNS
on the SC instances: it selects neighborhoods that require a longer time to repair, and a 2-minute runtime
limit is necessary. Therefore, we use 2 minutes consistently.
For BnB, the aggressive mode is fine-tuned for each problem on the validation set. With the aggressive
mode turned on, BnB (SCIP) does not always deliver better anytime performance compared to when it is
turned off. Based on the validation results, the aggressive mode is turned on for MVC and SC instances
and turned off for CA and MIS instances.
ILLNS uses the same training dataset as CLLNS but only the positive samples. We fine-tune
its hyperparameters for each problem on the validation set, resulting in a different $k^0$ on the SC instances
from CLLNS. In [179], they use sampling methods to select variables when using the learned policy. For
the temperature parameter η in the sampling method, we tried values from {1/2, 2/3, 1} and η = 0.5
performs the best overall. However, in our experiment, we observe that our greedy method described in
subsection 3.6.1.4 works better for ILLNS on SC and MIS instances. Thus, CLLNS is compared against
the corresponding (greedy) ILLNS results on SC and MIS instances.

Table A.1: Hyperparameters with their notations and values used.

| Hyperparameter | Notation | Value |
|---|---|---|
| Suboptimality threshold to determine positive samples | $\alpha_p$ | 0.5 |
| Upper bound on the number of positive samples | $u_p$ | 10 |
| Suboptimality threshold to determine negative samples | $\alpha_n$ | 0.05 |
| Ratio between the numbers of positive and negative samples | $\kappa$ | 9 |
| Feature embedding dimension | $d$ | 64 |
| Window size of the most recent incumbent values in variable features | | 3 |
| Number of attention heads in the GAT | $H$ | 8 |
| Temperature parameter in the contrastive loss | $\tau$ | 0.07 |
| Rate at which $k^t$ increases | $\gamma$ | 1.02 |
| Upper bound on $k^t$ as a fraction of number of variables | $\beta$ | 0.5 |
| Temperature parameter for sampling variables in ILLNS | $\eta$ | 0.5 |
| Initial neighborhood size | $k^0$ | Fine-tuned for each case |
| Runtime for finding initial solution | | 10 seconds |
| Runtime limit for each reoptimization | | 2 minutes |
| Learning rate for CLLNS and ILLNS | | $10^{-3}$ |
| Batch size for CLLNS and ILLNS | | 32 |
| Number of training epochs for CLLNS and ILLNS | | 30 |
For LBRELAX, three variants are presented in [86]. For simplicity, we present only the best of the
three variants for each problem in the paper.
In Table A.1, we summarize all the hyperparameters with their notations and values used in our experiments.
A.2.3 Additional Experimental Results
In this subsection, we add two more baselines and evaluate all approaches on one more metric. We show
that CLLNS outperforms all approaches in terms of all metrics.
We establish two additional baselines:
• LB LNS, which selects the neighborhood with the LB heuristic. We set the time limit to 10 minutes
for solving the LB ILP in each iteration;
• GRAPH LNS, which selects the neighborhood based on the bipartite graph representation of the ILP,
similar to GINS [142]. A bipartite graph representation consists of nodes representing the variables
and constraints on two sides, respectively, with an edge connecting a variable and a constraint if the
variable has a non-zero coefficient in the constraint. It runs a breadth-first search starting from a
random variable node in the bipartite graph and selects the first $k^t$ variable nodes expanded.
Figure A.1 shows the full results on the primal gap as a function of runtime. Figure A.2 shows the
full results on the survival rate as a function of runtime. Figure A.3 shows the full results on the primal
bound as a function of runtime. Tables A.2 and A.3 present the average primal bound, primal gap and
primal integral at 30 and 60 minutes runtime cutoff, respectively, on the small instances. Tables A.4 and
A.5 present the average primal bound, primal gap and primal integral at 30 and 60 minutes runtime cutoff,
respectively, on the large instances.
Next, we evaluate the performance with one additional metric: The gap to virtual best at time z for
an approach is the normalized difference between its best primal bound found up to time z and the best
primal bound found up to time z by any approach in the portfolio.
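The gap to virtual best can be illustrated with a short sketch. The normalization used below (dividing by the larger of the two absolute bounds, in the style of a primal gap) is an assumption, since the text only says the difference is normalized; the function name and the numbers are hypothetical. All problems are treated as minimization, as in the formulations of Appendix A.1.

```python
def gap_to_virtual_best(primal_bounds, eps=1e-12):
    """primal_bounds: dict approach -> best primal bound found up to time z.

    Returns a dict approach -> normalized difference to the best bound found
    by any approach in the portfolio (the virtual best).
    """
    virtual_best = min(primal_bounds.values())
    gaps = {}
    for name, bound in primal_bounds.items():
        denominator = max(abs(bound), abs(virtual_best), eps)
        gaps[name] = abs(bound - virtual_best) / denominator
    return gaps


# Hypothetical primal bounds of three approaches at some time z.
gaps = gap_to_virtual_best({"A": 100.0, "B": 110.0, "C": 100.0})
```

An approach that matches the virtual best at time z has gap 0, so several approaches can have a zero gap at the same time.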
Figure A.4 shows the full results on the best performing rate as a function of runtime. Figure A.5 shows
the full results on the gap to virtual best as a function of runtime.
A.3 Supplementary Materials to Section 3.7
A.3.1 Neural Network Architecture for ConPaS
We follow previous work [59, 70] to use a bipartite graph representation to encode a MILP $M$. For the
node (variable and constraint) and edge features of the bipartite graph, we use the same features as [70].
We use the same GCN architecture as previous work [70]. The GCN takes as input the bipartite graph
representation of a MILP $M$ with its features and outputs $p_\theta(x \mid M)$, a $[0, 1]$-score vector for the binary
variables. For node features, we use 2-layer multilayer perceptrons (MLP) with 64 hidden units per layer
Figure A.1: The primal gap (the lower the better) as a function of time, averaged over 100 instances. For
ML approaches, the policies are trained on only small training instances but tested on both small and large
test instances. Panels: (a) MVCS (left) and MVCL (right); (b) MISS (left) and MISL (right); (c) CAS (left)
and CAL (right); (d) SCS (left) and SCL (right).
Figure A.2: The survival rate (the higher the better) over 100 instances as a function of time to meet primal
gap threshold 1.00%. For ML approaches, the policies are trained on only small training instances but tested
on both small and large test instances. Panels: (a) MVCS (left) and MVCL (right); (b) MISS (left) and MISL
(right); (c) CAS (left) and CAL (right); (d) SCS (left) and SCL (right).
Figure A.3: The primal bound (the lower the better) as a function of time, averaged over 100 instances. For
ML approaches, the policies are trained on only small training instances but tested on both small and large
test instances. Panels: (a) MVCS (left) and MVCL (right); (b) MISS (left) and MISL (right); (c) CAS (left)
and CAL (right); (d) SCS (left) and SCL (right).
Figure A.4: The best performing rate (the higher the better) as a function of runtime over 100 test instances.
For ML approaches, the policies are trained on only small training instances but tested on both small and
large test instances. Panels: (a) MVCS (left) and MVCL (right); (b) MISS (left) and MISL (right); (c) CAS
(left) and CAL (right); (d) SCS (left) and SCL (right).
Figure A.5: The gap to virtual best (the lower the better) as a function of runtime, averaged over 100 test
instances. For ML approaches, the policies are trained on only small training instances but tested on both
small and large test instances. Panels: (a) MVCS (left) and MVCL (right); (b) MISS (left) and MISL (right);
(c) CAS (left) and CAL (right); (d) SCS (left) and SCL (right).
| Method | MVC: PB | MVC: PG (%) | MVC: PI | MIS: PB | MIS: PG (%) | MIS: PI |
|---|---|---|---|---|---|---|
| BnB | 449.67±9.69 | 1.55±0.44 | 40.2±6.6 | 2,004.24±26.21 | 5.60±1.00 | 127.1±12.4 |
| LB | 454.89±11.55 | 2.66±1.16 | 58.2±14.1 | 2,064.30±16.40 | 2.77±0.51 | 89.9±7.3 |
| RANDOM | 447.16±11.22 | 0.98±1.26 | 20.6±22.5 | 2,115.23±11.82 | 0.37±0.16 | 16.9±2.7 |
| GRAPH | 447.75±11.39 | 1.11±1.30 | 24.2±22.1 | 2,111.84±12.06 | 0.53±0.16 | 24.4±2.7 |
| LBRELAX | 449.02±11.53 | 1.38±1.51 | 32.1±24.2 | 2,102.85±11.97 | 0.95±0.19 | 33.0±3.6 |
| ILLNS | 444.27±9.61 | 0.35±0.25 | 13.5±6.9 | 2,115.30±12.04 | 0.36±0.18 | 14.4±3.2 |
| RLLNS | 445.71±9.98 | 0.67±0.35 | 18.2±5.7 | 2,116.64±11.53 | 0.30±0.15 | 12.7±2.9 |
| CLLNS | 443.48±9.56 | 0.17±0.09 | 5.5±3.6 | 2,117.58±11.86 | 0.26±0.17 | 9.3±3.0 |

| Method | CA: PB | CA: PG (%) | CA: PI | SC: PB | SC: PG (%) | SC: PI |
|---|---|---|---|---|---|---|
| BnB | 113,068±1,595 | 2.75±0.62 | 93.5±18.6 | 172.09±12.65 | 1.63±1.20 | 62.9±22.5 |
| LB | 110,303±2,001 | 5.13±1.08 | 191.6±16.9 | 172.37±12.71 | 1.79±1.11 | 89.4±22.3 |
| RANDOM | 109,040±1,685 | 6.21±1.05 | 126.8±17.6 | 174.70±12.75 | 3.10±1.38 | 73.4±24.6 |
| GRAPH | 107,802±1,892 | 7.28±1.07 | 152.2±18.9 | 186.79±14.13 | 9.33±2.28 | 175.7±38.8 |
| LBRELAX | 114,103±1,521 | 1.86±0.57 | 109.5±9.4 | 171.60±12.43 | 1.36±1.02 | 44.6±19.3 |
| ILLNS | 114,621±1,638 | 1.41±0.58 | 68.1±13.9 | 171.59±12.45 | 1.35±1.00 | 39.3±17.4 |
| RLLNS | 108,562±1,854 | 6.63±1.05 | 132.9±18.2 | 171.70±12.30 | 1.42±0.88 | 55.7±15.6 |
| CLLNS | 115,513±1,621 | 0.65±0.32 | 39.1±11.6 | 170.16±12.13 | 0.53±0.63 | 16.7±12.3 |

Table A.2: Test results on small instances: Primal bound (PB), primal gap (PG) (in percent), primal integral
(PI) at 30 minutes time cutoff, averaged over 100 instances and their standard deviations.
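For readers unfamiliar with the primal integral, the sketch below computes it as the area under a piecewise-constant primal gap curve up to a runtime cutoff, which is the usual definition of this metric. The event-list representation, the function name and the numbers are assumptions made for illustration, and the units (gap as a fraction, time in seconds) are not taken from the tables.

```python
def primal_integral(events, initial_gap, time_limit):
    """Integrate a piecewise-constant primal gap over [0, time_limit].

    events: list of (time, gap) pairs sorted by time; each gap holds from its
            time until the next event (a new incumbent was found at that time).
    initial_gap: gap in effect from time 0 until the first event.
    """
    total, prev_time, prev_gap = 0.0, 0.0, initial_gap
    for time, gap in events:
        if time >= time_limit:
            break
        total += prev_gap * (time - prev_time)
        prev_time, prev_gap = time, gap
    total += prev_gap * (time_limit - prev_time)
    return total


# Hypothetical run: gap 1.0 until t = 10, then 0.5 until t = 20, then 0.1.
area = primal_integral([(10.0, 0.5), (20.0, 0.1)], initial_gap=1.0, time_limit=30.0)
```

Because the integral accumulates the gap over time, an approach that finds good solutions early gets a smaller value than one that reaches the same final gap late, which is why the tables report it alongside the final primal gap.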
| Method | MVCS: PB | MVCS: PG (%) | MVCS: PI | MISS: PB | MISS: PG (%) | MISS: PI |
|---|---|---|---|---|---|---|
| BnB | 448.63±9.58 | 1.32±0.43 | 66.1±13.1 | 2,014.85±20.04 | 5.10±0.69 | 222.8±25.9 |
| LB | 453.45±11.81 | 2.35±1.30 | 102.2±35.9 | 2,079.07±14.34 | 2.07±0.44 | 130.9±13.6 |
| RANDOM | 447.06±11.21 | 0.96±1.26 | 38.0±44.8 | 2,117.92±11.31 | 0.24±0.14 | 22.1±5.0 |
| GRAPH | 447.14±10.83 | 0.98±1.20 | 42.9±44.0 | 2,116.15±11.58 | 0.32±0.15 | 31.8±5.0 |
| LBRELAX | 449.01±11.53 | 1.38±1.51 | 57.0±51.2 | 2,109.17±11.17 | 0.65±0.20 | 46.9±6.5 |
| ILLNS | 444.00±9.73 | 0.29±0.23 | 19.2±10.2 | 2,118.38±11.77 | 0.22±0.17 | 19.4±5.8 |
| RLLNS | 445.45±9.99 | 0.61±0.34 | 29.6±11.5 | 2,118.44±11.36 | 0.22±0.14 | 17.2±5.2 |
| CLLNS | 443.48±9.56 | 0.17±0.09 | 8.7±6.7 | 2,119.78±12.14 | 0.15±0.15 | 12.8±5.4 |

| Method | CAS: PB | CAS: PG (%) | CAS: PI | SCS: PB | SCS: PG (%) | SCS: PI |
|---|---|---|---|---|---|---|
| BnB | 113,608±1,611 | 2.28±0.59 | 137.4±25.9 | 171.22±12.50 | 1.13±0.95 | 86.7±37.9 |
| LB | 111,342±1,732 | 4.23±0.75 | 272.1±26.9 | 171.39±12.81 | 1.22±0.97 | 113.7±35.2 |
| RANDOM | 109,397±1,684 | 5.90±1.02 | 235.6±34.9 | 173.95±12.98 | 2.67±1.29 | 124.3±45.4 |
| GRAPH | 108,422±1,775 | 6.74±1.03 | 277.7±36.5 | 185.57±14.17 | 8.74±2.13 | 337.8±76.4 |
| LBRELAX | 114,348±1,516 | 1.65±0.57 | 140.5±18.3 | 170.74±12.35 | 0.86±0.83 | 63.2±31.6 |
| ILLNS | 115,001±1,564 | 1.09±0.51 | 90.0±20.8 | 171.55±12.47 | 1.33±0.97 | 63.2±34.3 |
| RLLNS | 108,920±1,816 | 6.32±1.03 | 249.2±35.9 | 171.14±12.30 | 1.10±0.77 | 77.8±28.9 |
| CLLNS | 115,513±1,621 | 0.65±0.32 | 50.7±22.7 | 170.11±12.10 | 0.50±0.58 | 26.2±12.8 |

Table A.3: Test results on small instances: Primal bound (PB), primal gap (PG) (in percent), primal integral
(PI) at 60 minutes time cutoff, averaged over 100 instances and their standard deviations.
| Method | MVC: PB | MVC: PG (%) | MVC: PI | MIS: PB | MIS: PG (%) | MIS: PI |
|---|---|---|---|---|---|---|
| BnB | 919.96±12.38 | 4.06±0.38 | 73.4±6.8 | 3,888.39±20.62 | 8.24±0.31 | 150.5±5.6 |
| LB | 900.15±12.32 | 1.95±0.35 | 52.6±6.0 | 4,009.23±71.94 | 5.39±1.59 | 123.1±15.1 |
| RANDOM | 886.39±12.71 | 0.43±0.25 | 15.6±3.9 | 4,225.74±15.63 | 0.28±0.10 | 15.8±1.8 |
| GRAPH | 886.89±12.79 | 0.48±0.23 | 22.9±3.9 | 4,206.29±16.76 | 0.74±0.16 | 31.6±2.7 |
| LBRELAX | 887.64±12.21 | 0.57±0.23 | 39.4±4.4 | 4,177.14±18.22 | 1.42±0.16 | 48.5±3.0 |
| ILLNS | 885.58±12.65 | 0.33±0.26 | 15.9±4.0 | 4,216.32±17.30 | 0.50±0.17 | 20.4±3.0 |
| RLLNS | 888.89±12.64 | 0.71±0.30 | 25.8±4.8 | 4,224.37±15.79 | 0.31±0.13 | 15.1±2.2 |
| CLLNS | 883.07±12.61 | 0.05±0.04 | 8.1±2.1 | 4,226.65±15.56 | 0.26±0.13 | 9.7±2.6 |

| Method | CA: PB | CA: PG (%) | CA: PI | SC: PB | SC: PG (%) | SC: PI |
|---|---|---|---|---|---|---|
| BnB | 216,772±13,060 | 5.58±5.42 | 257.1±56.4 | 109.39±7.26 | 2.02±1.36 | 84.4±22.2 |
| LB | 206,526±3,750 | 10.03±1.39 | 245.1±19.2 | 116.43±8.97 | 7.84±2.88 | 162.6±39.2 |
| RANDOM | 216,326±2,603 | 5.76±0.74 | 129.4±12.1 | 111.71±7.65 | 4.02±1.86 | 100.6±32.0 |
| GRAPH | 213,142±2,713 | 7.14±0.78 | 177.6±13.2 | 112.74±7.64 | 4.91±1.80 | 141.7±31.1 |
| LBRELAX | 225,154±4,366 | 1.91±1.60 | 121.9±23.9 | 109.26±7.07 | 1.91±1.42 | 53.9±24.5 |
| ILLNS | 214,495±3,148 | 6.56±1.01 | 154.0±17.9 | 109.04±6.94 | 1.72±1.19 | 48.1±21.3 |
| RLLNS | 217,600±2,705 | 5.20±0.84 | 106.3±14.2 | 108.66±6.83 | 1.38±0.99 | 98.1±15.1 |
| CLLNS | 223,257±2,667 | 2.74±0.71 | 95.0±12.5 | 107.78±6.64 | 0.58±0.45 | 28.6±12.6 |

Table A.4: Generalization results on large instances: Primal bound (PB), primal gap (PG) (in percent),
primal integral (PI) at 30 minutes time cutoff, averaged over 100 instances and their standard deviations.
| Method | MVCL: PB | MVCL: PG (%) | MVCL: PI | MISL: PB | MISL: PG (%) | MISL: PI |
|---|---|---|---|---|---|---|
| BnB | 904.41±12.95 | 2.41±0.40 | 130.2±11.1 | 3,970.78±71.54 | 6.29±1.62 | 285.1±18.2 |
| LB | 893.56±12.62 | 1.22±0.30 | 77.8±10.1 | 4,079.76±43.09 | 3.72±0.87 | 200.7±32.5 |
| RANDOM | 886.00±12.74 | 0.38±0.24 | 22.7±8.0 | 4,232.68±15.42 | 0.11±0.08 | 19.0±3.1 |
| GRAPH | 886.34±12.67 | 0.42±0.23 | 30.9±7.6 | 4,220.89±16.42 | 0.39±0.15 | 41.1±5.1 |
| LBRELAX | 886.68±12.33 | 0.46±0.23 | 48.4±7.5 | 4,199.04±17.54 | 0.91±0.16 | 68.6±5.5 |
| ILLNS | 885.00±12.56 | 0.27±0.23 | 21.2±8.1 | 4,225.28±16.25 | 0.29±0.15 | 27.1±5.5 |
| RLLNS | 887.90±12.67 | 0.59±0.30 | 37.3±9.6 | 4,231.52±15.97 | 0.14±0.12 | 18.9±4.1 |
| CLLNS | 883.07±12.61 | 0.05±0.04 | 9.1±3.4 | 4,232.50±14.86 | 0.12±0.11 | 12.9±4.4 |

| Method | CAL: PB | CAL: PG (%) | CAL: PI | SCL: PB | SCL: PG (%) | SCL: PI |
|---|---|---|---|---|---|---|
| BnB | 223,225±5,106 | 2.74±1.87 | 320.9±83.1 | 108.87±7.35 | 1.54±1.33 | 115.0±42.5 |
| LB | 208,500±3,976 | 9.17±1.43 | 414.0±36.9 | 115.12±8.77 | 6.80±2.73 | 293.5±79.7 |
| RANDOM | 217,204±2,612 | 5.37±0.75 | 229.2±24.4 | 110.88±7.55 | 3.31±1.79 | 166.4±61.3 |
| GRAPH | 214,926±2,649 | 6.37±0.86 | 297.5±26.9 | 111.49±7.51 | 3.85±1.74 | 218.9±56.7 |
| LBRELAX | 225,848±4,201 | 1.61±1.50 | 153.0±50.3 | 109.26±7.07 | 1.91±1.42 | 88.3±48.9 |
| ILLNS | 219,074±3,278 | 4.56±0.98 | 254.2±33.4 | 109.04±6.94 | 1.72±1.19 | 79.1±42.4 |
| RLLNS | 218,273±2,725 | 4.91±0.81 | 197.0±28.5 | 107.87±6.74 | 0.66±0.72 | 116.2±27.1 |
| CLLNS | 229,331±2,800 | 0.09±0.10 | 116.1±18.0 | 107.78±6.64 | 0.58±0.45 | 39.2±23.2 |

Table A.5: Generalization results on large instances: Primal bound (PB), primal gap (PG) (in percent) and
primal integral (PI) at 60 minutes time cutoff, averaged over 100 instances and their standard deviations.
Figure A.6: The primal gap as a function of runtime and the primal integral at 1,000 seconds runtime cutoff.
Note that the curves of PaS and ConPaS highly overlap with each other.
and ReLU as the activation function to map them to $\mathbb{R}^{64}$. We then perform two rounds of message passing,
the first from variable nodes to constraint nodes and the second from constraint nodes to variable nodes,
using graph convolution layers [59] to obtain a final variable embedding. The final variable embedding
is then passed through a 2-layer MLP with 64 hidden units per layer and ReLU as the activation function
followed by a sigmoid layer to obtain the output $p_\theta(x \mid M)$.
| Problem | PaS | ConPaSInf | ConPaSLQ |
|---|---|---|---|
| MVC | (500, 100, 10) | (800, 200, 20) | (800, 200, 20) |
| MIS | (600, 600, 5) | (1200, 600, 10) | (1000, 600, 15) |
| CA | (2000, 0, 0) | (2000, 0, 0) | (2000, 0, 0) |
| IP | (400, 5, 3) | (400, 5, 5) | (400, 5, 2) |

Table A.6: Hyperparameters $(k_0, k_1, \Delta)$ used for PaS and ConPaS.
A.3.2 Hyperparameter Tuning
In this subsection, we discuss the hyperparameters used for SCIP, ND, PaS and ConPaS.
For SCIP, we fine-tune its restart, presolving and primal heuristic modes on the validation instances.
We observe that allowing both restarts and presolving with the aggressive mode turned on for primal
heuristics yields the best performance for SCIP. SCIP with the default mode delivers similar primal
performance on the CA problem but is worse than the fine-tuned version on the others. We also observe that
allowing restarts is especially helpful for the IP instances.
| Problem | SCIP | ND | PaS | ConPaSInf | ConPaSLQ |
|---|---|---|---|---|---|
| MVC | 44.5±2.7 | 10.7±1.2 | 13.9±6.3 | 3.1±0.9 | 2.8±0.6 |
| MIS | 46.3±2.9 | 22.9±14.9 | 34.5±5.8 | 5.5±1.3 | 5.4±1.3 |
| CA | 138.9±28.6 | 71.0±18.2 | 28.9±5.6 | 24.0±6.2 | 19.7±4.8 |
| IP | 349.3±87.1 | 244.0±76.4 | 236.8±80.6 | 221.8±73.0 | 192.0±67.8 |
| MVC (large) | 88.3±5.0 | 8.8±2.2 | 5.0±2.1 | 3.7±1.1 | 2.1±0.8 |
| CA (large) | 167.2±8.2 | 151.4±21.5 | 96.9±17.1 | 39.4±10.4 | 28.7±5.7 |

Table A.7: Tabular representation of the primal integral plots in Figures 3.10 and 3.11: The primal integral
and the standard deviation at 1,000 seconds runtime cutoff averaged over 100 instances.
| Problem | PaS | ConPaSLQ |
|---|---|---|
| MVC | (500, 100, 10) | (500, 100, 15) |
| MIS | (500, 500, 10) | (500, 500, 10) |
| CA | (1500, 0, 0) | (1500, 0, 0) |

Table A.8: Comparisons with Gurobi: Hyperparameters $(k_0, k_1, \Delta)$ used for PaS and ConPaSLQ.
For ND, following [148], we train a model separately for each coverage rate value. Due to limited
computing resources, we train models with coverage rates in {0.2, 0.3, 0.4}. The best coverage rates
we found for the MVC, MIS, CA and IP problems are 0.2, 0.2, 0.4 and 0.3, respectively.
For PaS and ConPaS, the values of $k_0$, $k_1$ and $\Delta$ are summarized in Table A.6. Note that the best
hyperparameters for both MVC and MIS are quite different for PaS and ConPaS. On MVC instances for
PaS, we observe that $(k_0, k_1, \Delta) = (600, 200, 20)$ has a smaller primal integral than (500, 100, 10) but
has a larger primal gap at 1,000 seconds runtime cutoff. We also test $(k_0, k_1, \Delta) = (500, 100, 10)$ for
ConPaSLQ: it converges to the same primal gaps (with < 0.002% differences) as (800, 200, 20) but has a
34.1% increase in primal integral. On MIS instances for PaS, we observe that increasing $k_0$ or $\Delta$ (or both)
leads to significantly worse performance. However, if we use $(k_0, k_1, \Delta) = (600, 600, 6)$ for ConPaSLQ,
it converges to the same primal gaps (with < 0.032% differences) as (1000, 600, 15) but has a 131.8%
increase in primal integral (still being better than any other baseline).
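To give a feel for what the triple of hyperparameters controls, the sketch below follows the general Predict-and-Search scheme as an assumption: the k0 variables with the lowest predicted scores are tentatively set to 0, the k1 variables with the highest scores are tentatively set to 1, and the search may deviate from these tentative values on at most Δ variables. The function names and the toy scores are hypothetical, and the exact selection rule used in the experiments is the one defined in Section 3.7, not this sketch.

```python
def select_partial_assignment(scores, k0, k1):
    """Pick indices to fix toward 0 (lowest scores) and toward 1 (highest scores)."""
    if k0 + k1 > len(scores):
        raise ValueError("k0 + k1 must not exceed the number of variables")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    zeros = order[:k0]
    ones = order[len(order) - k1:] if k1 > 0 else []
    return zeros, ones


def within_trust_region(x, zeros, ones, delta):
    """Check that x deviates from the tentative assignment on at most delta variables."""
    deviations = sum(x[i] for i in zeros) + sum(1 - x[i] for i in ones)
    return deviations <= delta


# Hypothetical predicted scores for 5 binary variables.
scores = [0.9, 0.1, 0.5, 0.05, 0.8]
zeros, ones = select_partial_assignment(scores, k0=2, k1=1)
```

With Δ = 0 the selected variables are fixed outright; a larger Δ lets the solver repair a limited number of wrong predictions, at the cost of a larger search space.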
(a) MVC. (b) MIS. (c) CA.
Figure A.7: Comparisons with Gurobi: The primal gap (the lower, the better) as a function of runtime
averaged over 100 test instances.
(a) MVC. (b) MIS. (c) CA.
Figure A.8: Comparisons with Gurobi: The primal integral (the lower, the better) at 1,000 seconds runtime
cutoff, averaged over 100 test instances. The error bars represent the standard deviation.
| Method | MVC: Accuracy | MVC: AUROC | CA: Accuracy | CA: AUROC |
|---|---|---|---|---|
| PaS | 81.2% | 0.88 | 88.3% | 0.87 |
| ConPaSLQ | 76.9% | 0.91 | 86.9% | 0.86 |

Table A.9: Prediction accuracy and AUROC on 100 validation instances.
A.3.3 Additional Experimental Results
Results on the Workload Appointment Problem. Figure A.6 presents the results on the WA instances.
Both PaS and ConPaSLQ outperform SCIP significantly in terms of the primal gap and the primal integral.
However, both approaches converge quickly to low primal gaps, with ConPaSLQ being very slightly better
than PaS.
Comparisons with Gurobi. We compare the performance of ConPaSLQ against PaS and Gurobi on the
MVC, MIS and CA instances. Note that in this experiment, we use Gurobi in the Predict-and-Search phase
for both PaS and ConPaSLQ to ensure a fair comparison. The hyperparameters $(k_0, k_1, \Delta)$ are reported
in Table A.8. Figure A.7 shows the primal gap as a function of runtime. Figure A.8 shows the primal
integral at 1,000 seconds runtime cutoff. The results show that both PaS and ConPaSLQ outperform
Gurobi significantly on MVC and MIS instances. Overall, ConPaSLQ is still the best when applied on
Gurobi.
Prediction Accuracy. To assess how accurate the predicted solutions by the neural networks are, we
report the classification accuracy over all binary variables (with the threshold set to 0.5) in Table A.9.
We report it for both PaS and ConPaSLQ on the MVC and CA problems on 100 validation instances. The
accuracy is the fraction of correctly classified variables averaged over 50 positive samples for each instance,
and we report the average accuracy over 100 validation instances. Since the classification accuracy is
sensitive to the threshold, we also report the AUROC. On the MVC instances, though ConPaS has a lower
accuracy (w.r.t. the threshold of 0.5), it has higher AUROC than PaS. On the CA instances, their accuracies
and AUROCs are similar. However, we would like to point out that a better accuracy/AUROC does not
necessarily indicate a better downstream search performance.
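The difference between threshold-based accuracy and AUROC can be seen on a small, made-up example. The sketch below is plain Python with hypothetical function names; it assumes scores at or above the threshold are classified as 1 and computes AUROC as the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of variables whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auroc(scores, labels):
    """Pairwise (rank-based) AUROC; needs at least one positive and one negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical predictions: positives are ranked above negatives,
# but every score lies below the 0.5 threshold.
scores = [0.45, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 0]
acc = accuracy(scores, labels)
auc = auroc(scores, labels)
```

In this example the ranking is perfect (AUROC of 1.0) while the accuracy at threshold 0.5 is only 50%, which illustrates how a model can have lower accuracy yet higher AUROC than another.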
Asset Metadata
Creator
Huang, Taoan
(author)
Core Title
Improving decisionmaking in search algorithms for combinatorial optimization with machine learning
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
202412
Publication Date
10/04/2024
Defense Date
08/01/2024
Publisher
Los Angeles, California
(original),
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
combinatorial optimization,machine learning,mixed integer linear program,multiagent path finding,OAIPMH Harvest
Format
theses
(aat)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Dilkina, Bistra (committee chair), Koenig, Sven (committee chair), Lindemann, Lars (committee member), Razaviyayn, Meisam (committee member), Stuckey, Peter (committee member)
Creator Email
taoanhua@usc.edu,taoanhuang@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/uscthesesoUC11399BMJQ
Unique identifier
UC11399BMJQ
Identifier
etdHuangTaoan13581.pdf (filename)
Legacy Identifier
etdHuangTaoan13581
Document Type
Dissertation
Format
theses (aat)
Rights
Huang, Taoan
Internet Media Type
application/pdf
Type
texts
Source
20241004uscthesesbatch1217
(batch),
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 900892810, USA
Repository Email
cisadmin@lib.usc.edu