Integer Optimization for Analytics
in High Stakes Domains
by
Sina Aghaei
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(INDUSTRIAL AND SYSTEMS ENGINEERING)
May 2024
Copyright 2024 Sina Aghaei
Dedication
To my beloved grandpa, 'Bapira,' whom I dearly love and miss. His beautiful
memories and kindness warm my heart.
Acknowledgements
First and foremost, my deepest gratitude goes to my advisor, Dr. Phebe Vayanos, whose
unwavering support and guidance have been pivotal throughout my journey. Her insights and
encouragement have been invaluable in shaping both this thesis and my academic growth.
This work is a culmination of collaborative efforts with amazing faculty, researchers, and
experts. I extend a heartfelt thank you to Dr. Andrés Gómez. Andrés, your exceptional
qualities as a person and a researcher have greatly enriched my experience. Collaborating
with you and learning from your expertise has been a highlight of my academic career. I also
would like to express my heartfelt gratitude to Dr. Çağıl Koçyiğit. Thank you, Çağıl, for
your brilliant mind and kind heart. Your amazing guidance, help, and support throughout
these years have been incredibly valuable and deeply appreciated.
I also owe a significant debt of gratitude to my thesis committee members, Dr. Bistra
Dilkina and Dr. John Carlsson. Their invaluable feedback and mentorship have been
instrumental in my development over these years.
My time at USC’s Center for Artificial Intelligence in Society (CAIS) has been marked
by interactions with many remarkable individuals. Special thanks to Mohammad Javad
Azizi, Kathryn Dullured, Aaron Ferber, Tye Hines, Nathanael Jo, Caroline Johnston, Nathan
Justin, Qing Jin, Yingxiao Ye, Han Yu, Aida Rahmattalabi, Qingshi Sun, Bill Tang, Omkar
Thakoor, and Hailey Winetrobe for their collaboration and camaraderie.
I am also grateful for the strong network of friends at USC, including Sina Baharlouei,
Matin Barekatain, Hamidreza Shaye, Amirmohammad Nazari, Kianoush Sadeghian Esfahani
and Ghazaleh Ostovar, whose companionship has been a source of joy and support.
Most importantly, my heartfelt appreciation goes to my family—my parents and sister.
Your unwavering encouragement and support have been the foundation of who I am today.
Your belief in me has been my greatest strength.
Lastly, a special thanks to my beloved partner, Negar, for her love, patience, and endless
support. You have been my rock and a constant source of inspiration.
Funding
I gratefully acknowledge support from the Hilton C. Foundation, the Homeless Policy
Research Institute, the Home for Good foundation under the “C.E.S. Triage Tool Research
& Refinement” grant, Schmidt Futures and the James H. Zumberge Faculty Research and
Innovation Fund at the University of Southern California.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Integer Optimization for Machine Learning in High-Stakes Domains . . . . . 2
1.2 Optimization and Causal Inference for Policy-Making . . . . . . . . . . . . . 3
I Integer Optimization for Machine Learning in High-Stakes
Domains 5
Chapter 2: Strong Optimal Classification Trees . . . . . . . . . . . . . . . . . . . . . 6
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Motivation & Related Work . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.3 Proposed Approach & Contributions . . . . . . . . . . . . . . . . . . 11
2.2 Learning Balanced Classification Trees . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Decision Tree and Associated Flow Graph . . . . . . . . . . . . . . . 12
2.2.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Benders’ Decomposition via Facet-defining Cuts . . . . . . . . . . . . . . . . 17
2.3.1 Main Problem, Subproblems, and Benders’ Decomposition . . . . . . 17
2.3.2 Generating Facet-Defining Cuts via a Tailored Min-Cut Procedure . . 20
2.4 Generalizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.1 Imbalanced Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.2 Imbalanced Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.3 Learning Fair Decision Trees . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.4 Solution Approach for Generalizations . . . . . . . . . . . . . . . . . 38
2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.5.1 Benchmark Approaches and Datasets . . . . . . . . . . . . . . . . . . 39
2.5.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.3 Results on Categorical Datasets . . . . . . . . . . . . . . . . . . . . . 42
2.5.4 Results on Mixed-Feature Datasets . . . . . . . . . . . . . . . . . . . 44
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Chapter 3: Learning Optimal and Fair Decision Trees for Non-Discriminative
Decision-Making . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.1.2 Proposed Approach and Contributions . . . . . . . . . . . . . . . . . 53
3.2 A Unifying Framework for Fairness in Classification and Regression . . . . . 54
3.2.1 Disparate Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.2.2 Disparate Treatment . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3 Mixed Integer Optimization Framework for Learning Fair Decision Trees . . 59
3.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3.2 General Classes of Decision-Trees . . . . . . . . . . . . . . . . . . . . 60
3.3.3 MILP Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.4 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
II Optimization and Causal Inference for Policy-Making 69
Chapter 4: Balancing Efficiency, Fairness, and Interpretability in Learning Housing
Allocation Policies for Individuals Experiencing Homelessness in Los Angeles 70
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.2.1 MIO Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.2.2 Treatment Assignment Strategy . . . . . . . . . . . . . . . . . . . . . 82
4.2.3 Resource Capacity Constraints . . . . . . . . . . . . . . . . . . . . . 83
4.2.4 Fairness Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.3 Decision Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.4 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.4.1 Data Description and Experiments Setup . . . . . . . . . . . . . . . . 86
4.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Chapter 5: Conclusion and Future work . . . . . . . . . . . . . . . . . . . . . . . . . 101
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
A Appendix to Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
A.1 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
A.2 OCT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
A.3 OCT’s Numerical Issues . . . . . . . . . . . . . . . . . . . . . . . . . 123
A.4 Comparison with OCT (Proof of Theorem 1) . . . . . . . . . . . . . . 124
A.4.1 Strengthening . . . . . . . . . . . . . . . . . . . . . . . . . . 125
A.4.2 Simplification . . . . . . . . . . . . . . . . . . . . . . . . . . 126
A.4.3 Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
A.4.4 Substitution . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
A.5 Benders’ Decomposition for Regularized Problems . . . . . . . . . . . 129
A.6 Detail of Experimental Results in Section 2.5 . . . . . . . . . . . . . . 133
A.7 Additional Experimental Results . . . . . . . . . . . . . . . . . . . . 141
A.7.1 BendersOCT’s Variants . . . . . . . . . . . . . . . . . . . . . 141
A.7.2 Worst-case Accuracy . . . . . . . . . . . . . . . . . . . . . . 142
A.7.3 LO Relaxation . . . . . . . . . . . . . . . . . . . . . . . . . 143
List of Tables
2.1 Benchmark datasets with only categorical features, along with their number
of rows (|I|), number of features (|F|), and number of classes (|K|). . . . . . 40
2.2 Benchmark datasets with mixed features, along with their number of rows
(|I|), number of features (|F|), and number of classes (|K|). . . . . . . . . . . 41
2.3 The summary of the out-of-sample performance of all methods on categorical
datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.4 The summary of the comparison of the in-sample results of LST vs BendersOCT
on categorical datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5 The summary of the out-of-sample performance of various approaches on
mixed-feature datasets given the calibrated λ. . . . . . . . . . . . . . . . . . 46
5.1 Companion table for the proof of Theorem 2: list of affinely independent
points that lie on the cut generated by inputting i ∈ I and (b̄, w̄, ḡ) in
Algorithm 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.2 In-sample results including the average and standard deviation of training
accuracy, optimality gap, and solving time across 5 samples for the case of
λ = 0 on categorical datasets. The best performance achieved in a given
dataset and depth is reported in bold. . . . . . . . . . . . . . . . . . . . . . 134
5.3 In-sample results including the average and standard deviation of optimality
gap and solving time across 45 instances (5 samples × 9 values of λ) for
the case of λ > 0 on categorical datasets. The best performance achieved in a
given dataset and depth is reported in bold. . . . . . . . . . . . . . . . . . . 135
5.4 Average out-of-sample accuracy and standard deviation of accuracy across 5
samples given the calibrated λ on categorical datasets. The highest accuracy
achieved in a given dataset and depth is reported in bold. . . . . . . . . . . 136
5.5 Average in-sample accuracy and standard deviation of accuracy across 5
samples for the case of λ = 0 on categorical datasets. The highest accuracy
achieved in a given dataset and depth is reported in bold. . . . . . . . . . . 137
5.6 Average in-sample accuracy and standard deviation of accuracy across 5
samples for the case of λ = 0 on mixed-feature datasets (part 1). Due to the
numerical issues of OCT (as discussed in Appendix A.3), we do not provide
solving time and optimality gap for this approach as it tackles a different
problem. Instead, we report the number of instances (out of five samples)
in a given dataset and depth, where we observe a discrepancy of at least
0.001 between the objective value of the optimization problem and the actual
in-sample accuracy. The best performance achieved in a given dataset and
depth is reported in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5.7 Average in-sample accuracy and standard deviation of accuracy across 5
samples for the case of λ = 0 on mixed-feature datasets (part 2). The best
performance achieved in a given dataset and depth is reported in bold. . . . 139
5.8 Average in-sample accuracy and standard deviation of accuracy across 5
samples for the case of λ = 0 on mixed-feature datasets (part 3). The best
performance achieved in a given dataset and depth is reported in bold. . . . 140
5.9 In-sample results including the average and standard deviation of optimality
gap and solving time across 45 instances (5 samples and 9 values of λ) for
the case of λ > 0 on mixed-feature datasets (part 1). The best performance
achieved in a given dataset and depth is reported in bold. . . . . . . . . . . 145
5.10 In-sample results including the average and standard deviation of optimality
gap and solving time across 45 instances (5 samples and 9 values of λ) for
the case of λ > 0 on mixed-feature datasets (part 2). The best performance
achieved in a given dataset and depth is reported in bold. . . . . . . . . . . 146
5.11 Average out-of-sample accuracy and standard deviation of accuracy across 5
samples on mixed-feature datasets (part 1). The highest accuracy achieved in
a given dataset and depth is reported in bold. . . . . . . . . . . . . . . . . . 147
5.12 Average out-of-sample accuracy and standard deviation of accuracy across 5
samples on mixed-feature datasets (part 2). The highest accuracy achieved in
a given dataset and depth is reported in bold. . . . . . . . . . . . . . . . . . 148
5.13 Number of branch-and-bound nodes explored by each approach during the
solving process of 115 instances (out of 240 instances comprising 12 categorical
datasets, 5 samples, and 4 different depths), which were successfully solved to
optimality by all approaches. . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
5.14 In-sample results of the LO relaxation including the average and standard
deviation of objective value, root improvement and solving time across 45
instances (5 samples and 9 values of λ) for the case of λ > 0 on categorical
datasets. The best performance achieved in a given dataset and depth is
reported in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.15 In-sample results of the LO relaxation including the average and standard
deviation of objective value, root improvement and solving time across 5
samples for the case of λ = 0 on categorical datasets. The best performance
achieved in a given dataset and depth is reported in bold. . . . . . . . . . . 151
List of Figures
2.1 A decision tree of depth 2 (left) and its associated flow graph (right).
Here, B = {1, 2, 3} and L = {4, 5, 6, 7}, while V = {s, 1, 2, . . . , 7, t} and
A = {(s, 1),(1, 2), . . . ,(7, t)}. . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Illustration of Algorithm 1 on two datapoints that are correctly classified
(datapoint 1, left) and incorrectly classified (datapoint 2, right). Unbroken
(green) arcs (n, n′) have capacity c^i_{n,n′}(b, w) = 1 (and others capacity 0).
In the case of datapoint 1, which is correctly classified since there exists a
path from source to sink, Algorithm 1 terminates on line 16 and returns −1. In
the case of datapoint 2, which is incorrectly classified, Algorithm 1 returns
set S = {s, 1, 3, 6} on line 14. The associated minimum cut consists of arcs
(1, 2), (6, t), and (3, 7) and is represented by the thick (red) dashed line. . . . 23
2.3 A decision tree of depth 2 (left) and its associated flow graph (right)
that can be used to train imbalanced decision trees of maximum depth 2.
Here, B = {1, 2, 3} and T = {4, 5, 6, 7}, while V = {s, 1, 2, . . . , 7, t} and
A = {(s, 1),(1, 2),(1, t), . . . ,(7, t)}. The additional arcs that connect the
branching nodes to the sink allow branching nodes n ∈ B to be converted to
leaves where a prediction is made. Correctly classified datapoints that reach a
leaf are directed to the sink. Incorrectly classified datapoints are not allowed
to flow in the graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 The left (resp. right) figure shows for balanced (resp. imbalanced) decision
trees the number of instances solved to optimality by each approach within a
given time on the time axis, and the number of instances with optimality gap
no larger than each given value at the time limit on the optimality gap axis. 43
2.5 The left (resp. right) figure shows for balanced (resp. imbalanced) decision
trees the number of instances solved to optimality by each approach on the
time axis, and the number of instances with optimality gap no larger than each
given value at the time limit on the optimality gap axis. OCT is not included
in this figure because of its numerical instabilities due to having “little-m”
constraints which caused discrepancy between the optimal objective value and
the actual in-sample accuracy. Refer to Appendix A.3 for further information. 46
3.1 Accuracy-discrimination trade-off of 4 families of approaches on 3 classification
datasets: (a) Default, (b) Adult, and (c) COMPAS. Each dot represents a
different sample from 5-fold cross-validation and each shaded area corresponds
to the convex hull of the results associated with each approach in accuracy-discrimination space. The same trade-off of 3 families of approaches on the
regression dataset Crime is shown in (d). . . . . . . . . . . . . . . . . . . . . 65
3.2 From left to right: (a) MIP objective value and (b) Accuracy and fairness
in dependence of tree depth; (c) Comparison of upper and lower bound
evolution while solving MILP problem; and (d) Empirical distribution of
γ(x) := P(y|xp, xp) − P(y|xp) (see Definition 11) when x is valued in the test
set in both CART (λ = 0) and MIP. . . . . . . . . . . . . . . . . . . . . . . 65
3.3 Accuracy of maximally non-discriminative models in each approach for (a)
classification and (b) regression. . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.1 A prescriptive tree with depth 2 (left) and its associated flow graph (right). . 80
4.2 Decision complexity of different methods. We show the complexity of different
optimal prescriptive trees with different depths, different treatment assignment
strategy, and different set of branching features. We also show the complexity
of the historical policy and dual-price queuing policy. . . . . . . . . . . . . . 87
4.3 The effect of increasing depth on the distribution of out-of-sample expected
outcome (left), optimality gap (middle), and solving time (right) across all
MIO instances solved. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.4 The effect of different treatment assignment strategies (randomized vs
deterministic) on the distribution of out-of-sample expected outcome (left),
optimality gap (middle) and solving time (right) across all MIO instances solved. 91
4.5 The effect of using different sets of branching features (the best 20 predictors
of outcome vs the predicted counterfactual outcomes) on the distribution of
out-of-sample expected outcome (left), optimality gap (middle) and solving
time (right) across all MIO instances solved. . . . . . . . . . . . . . . . . . . 92
4.6 The effect of including/excluding the protected attributes in the branching
features on the distribution of out-of-sample expected outcome (left),
optimality gap (middle) and solving time (right) across all MIO instances solved. 92
4.7 The out-of-sample expected outcome vs discrimination for all methods
enforcing statistical parity in outcomes for the case where protected attributes
are included in the branching features. . . . . . . . . . . . . . . . . . . . . . 94
4.8 The out-of-sample expected outcome vs discrimination for all methods
enforcing statistical parity in outcomes for the case where protected attributes
are not included in the branching features. . . . . . . . . . . . . . . . . . . . 94
4.9 Sample prescriptive trees with statistical parity in outcomes. The numbers
associated with each tree correspond to the highlighted points in Figure 4.7. 95
4.10 The out-of-sample expected outcome vs discrimination metrics for all methods
enforcing statistical parity in allocation for the case where protected attributes
are included in the branching features. . . . . . . . . . . . . . . . . . . . . . 96
4.11 The out-of-sample expected outcome vs discrimination metrics for all methods
enforcing statistical parity in allocation for the case where protected attributes
are not included in the branching features. . . . . . . . . . . . . . . . . . . . 97
4.12 Sample prescriptive trees with statistical parity in treatment allocation. The
numbers associated with each tree correspond to the highlighted points in
Figure 4.10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.13 The out-of-sample performance of tree number 1 (depicted in Figure 4.9) is
displayed. This tree, at a depth of 2, ensures statistical parity in outcomes
using a threshold of 0.01, utilizing predicted outcomes and protected
attributes as the branching features. On the left part, for both the highlighted
prescriptive tree and the historical policy we show the expected outcome
across race (top), the proportion of each race receiving PSH (middle), and
RRH (bottom). On the right part, for each level of vulnerability score, the
top (resp. bottom) shows the percentage of all PSH (resp. RRH) resources
allocated to individuals within that score bracket. . . . . . . . . . . . . . . 99
4.14 The out-of-sample performance of tree number 1 (depicted in Figure 4.12) is
displayed. This tree, at a depth of 2, ensures statistical parity in allocation
using a threshold of 0.03, utilizing predicted outcomes and protected
attributes as the branching features. On the left part, for both the highlighted
prescriptive tree and the historical policy we show the expected outcome
across race (top), the proportion of each race receiving PSH (middle), and
RRH (bottom). On the right part, for each level of vulnerability score, the
top (resp. bottom) shows the percentage of all PSH (resp. RRH) resources
allocated to individuals within that score bracket. . . . . . . . . . . . . . . . 100
5.1 A classification tree of depth 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.2 Example of an instance where OCT exhibits numerical issues. According to the
solution of the optimization problem, datapoint 175 should get routed to leaf
node 7 where it gets misclassified. However, in practice, we observe that this
datapoint is assigned to leaf node 6 and mistakenly reported as being correctly
classified. This causes a discrepancy between the optimization problem
objective and the actual accuracy of the tree returned by the optimization. . 124
5.3 Illustration of Algorithm 2 on four datapoints, two of which are correctly
classified (datapoints 1 and 3) and two of which are incorrectly classified
(datapoints 2 and 4). Unbroken (green) arcs (n, n′) have capacity
c^i_{n,n′}(b, w) = 1 (and others, capacity 0). In the case of datapoints 1 and 3,
which are correctly classified, since there exists a path from source to sink,
Algorithm 2 terminates on line 16 and returns −1. In the case of datapoints 2
and 4, which are incorrectly classified, Algorithm 2 returns set S = {s, 1, 3}
and set S = {s, 1, 3, 6}, respectively, on line 14. The associated minimum cut
for datapoint 2 consists of arcs (1, 2), (1, t), (3, 6), (3, t) and (3, 7) and is
represented by the thick (red) dashed line. Similarly, the associated minimum
cut for datapoint 4 consists of arcs (1, 2), (1, t), (6, t), (3, t) and (3, 7). . . . 133
5.4 Number of instances solved to optimality by each approach within the given
time on the time axis, and number of instances with optimality gap no larger
than each given value at the time limit on the optimality gap axis. . . . . . . 142
5.5 The left (resp. right) figure depicts the density of out-of-sample accuracy (resp.
worst accuracy), among all class labels, for each approach. . . . . . . . . . . 143
5.6 Number of instances solved to optimality by each approach within the given
time on the time axis, and number of instances with optimality gap no larger
than each given value at the time limit on the optimality gap axis. . . . . . . 143
Abstract
Data-driven approaches are increasingly being used to support decision-making in high
stakes domains, e.g., to predict the vulnerability of homeless individuals to prioritize them
for housing, or to identify those at risk of suicide. The deployment of data-driven predictive
or prescriptive tools in high-stakes domains where people’s lives and livelihoods are at stake
creates an urgent need for approaches that are fair, interpretable, and optimal. Crafting
predictive and prescriptive models with these vital attributes from data that may be biased
or observational leads to constrained optimization problems that are inherently combinatorial
and often hard to solve. To navigate these challenges, I
integrate techniques from integer optimization with machine learning, statistics, and causal
inference. Subsequently, I develop effective solution methodologies to address these complex
problems.
Chapter 1
Introduction
In recent years, data-driven approaches have increasingly been used to support decision-making
in high-stakes domains, where data is often biased, noisy, or missing and where the individuals
the system is intended to serve have been historically marginalized or discriminated against.
For example, machine learning (ML) is being used to predict the vulnerability of homeless
individuals to prioritize them for housing [1] and to help decide who to give access to credit
and benefits [2]. The deployment of data-driven predictive or prescriptive tools in high-stakes domains where people’s lives and livelihoods are at stake creates an urgent need for
approaches that are:
(a) fair, to avoid introducing or recreating biases;
(b) interpretable, to make them easy for practitioners to understand and implement; and
(c) optimal, to ensure the highest possible accuracy and decision quality.
Designing predictive and prescriptive models with these characteristics from biased, noisy,
or missing data results in constrained optimization problems that are combinatorial in nature
and often affected by uncertainty. To solve such problems, I integrate methods from integer
optimization with machine learning, statistics, and causal inference, to deal with noise,
estimation errors, and biases. I then devise efficient solution procedures to tackle the resulting
problem. My thesis thus lies at the interface of Operations Research (OR) and Artificial
Intelligence (AI).
In what follows, I briefly describe my research contributions to predictive and prescriptive
analytics and how they fit within the existing literature. I have divided my contributions
into two distinct streams as outlined below:
1.1 Integer Optimization for Machine Learning in High-Stakes Domains
In the first stream of my research, I focus on designing optimal and interpretable ML
models. I advocate for simpler, inherently interpretable ML models over complex black-box
models, a topic widely discussed in the literature, highlighted in [3, 4]. For this reason in
my research I have primarily focused on a class of ML models called decision trees, known
for their interpretability and popularity in high-stakes domains. Traditional algorithms
used for learning decision trees typically rely on heuristics that employ intuitive yet ad-hoc
rules for constructing these trees. For instance, CART utilizes the Gini Index for splitting
decisions [5], ID3 employs entropy [6], and C4.5 uses normalized information gain [7]. Despite
their high speed and reasonably good performance, these methods lack optimality guarantees
and significant modeling power. Consequently, integrating fairness constraints becomes a
challenge. Hence, there is a clear necessity for mathematical modeling-based approaches to
learn decision trees, addressing these limitations.
Thanks to its modeling power and optimality guarantees, mixed-integer optimization
(MIO) has gained popularity for addressing various problems such as sparse regression problems
[8–12], verification of neural networks [13–15], and sparse principal component analysis [16,
17], among other topics. More related to the topic of this thesis, MIO methods have been
proposed to learn optimal decision trees [18–21].
Additionally, in the domain of fair machine learning, “in-process” algorithms have emerged
to tackle bias in standard ML models by incorporating fairness notions during training. These
methods integrate fairness criteria like statistical parity [22] or equalized odds [23], aiming
to penalize or constrain discrimination. Several in-process approaches have been proposed
specifically for learning fair decision trees, primarily relying on heuristic methods [24–27].
In this part of my research I leveraged the modeling power of integer optimization and
polyhedral theory to design optimal and interpretable decision trees that can conveniently be
augmented with fairness constraints. In the following I discuss my contributions.
Learning Optimal Decision Trees. Motivated by the interpretability and popularity
of decision trees in high-stakes domains, in Chapter 2 I devise a novel MIO formulation
for learning optimal classification trees which is provably stronger than prior methods. I
also propose a tailored method for solving the MIO problem by exploiting its decomposable
structure. This approach achieves a 31-fold speed-up relative to the state of the art [18, 28]
and improves out-of-sample performance by up to 8%. Notably, the research findings of this
chapter have been recognized and published in [29].
Learning Optimal and Fair Decision Trees. In Chapter 3 I propose the first MIO-based
approach capable of learning fair classification and regression trees, in which disparate
impact and/or disparate treatment are mitigated via penalty functions incorporated into the
objective. Our approach improves out-of-sample (OOS) accuracy by 2 percentage points on
average and obtains a higher OOS accuracy in 100% of cases compared with existing (heuristic)
approaches for building fair trees. The findings of this chapter have been published in [30].
1.2 Optimization and Causal Inference for Policy-Making
In the second stream of my research, I focus on designing optimal and interpretable
prescriptive policies to efficiently allocate limited resources crucial for fulfilling basic needs.
This encompasses areas such as kidney transplantation [31, 32] and homeless services
[1, 33–35]. Among these, [35] is most closely aligned with my research: it develops an online
allocation policy based on linear optimization (LO) that optimizes the expected outcome while
adhering to resource
capacities and fairness constraints. However, despite its performance, this work lacks sufficient
interpretability. In the following, I delve into my contributions in this stream.
Interpretable and Fair Housing Allocation Policy for Individuals Experiencing
Homelessness in Los Angeles. In Chapter 4, we focus on the challenge of efficiently
allocating housing resources to individuals experiencing homelessness based on their observed
covariates. Using administrative data gathered by the Los Angeles Homeless Services
Authority (LAHSA), we create an interpretable policy aimed at better serving this vulnerable
population. Our approach crafts personalized policies in the form of shallow decision trees,
termed prescriptive trees, learned from observational data. This method is not only provably
asymptotically optimal but can also incorporate various constraints such as capacity limits,
fairness considerations, or other domain-specific requirements. Our proposed policy
significantly improves upon the historical policy and empowers authorities to enforce fairness
in resource allocation and outcomes, effectively curbing discrimination against minority racial
groups.
Part I
Integer Optimization for Machine Learning in
High-Stakes Domains
Chapter 2
Strong Optimal Classification Trees
Decision trees are among the most popular machine learning models and are used routinely
in applications ranging from revenue management and medicine to bioinformatics. In this
paper, we consider the problem of learning optimal binary classification trees with univariate
splits. Literature on the topic has burgeoned in recent years, motivated both by the empirical
suboptimality of heuristic approaches and the tremendous improvements in mixed-integer
optimization (MIO) technology. Yet, existing MIO-based approaches from the literature do
not leverage the power of MIO to its full extent: they rely on weak formulations, resulting
in slow convergence and large optimality gaps. To fill this gap in the literature, we propose
an intuitive flow-based MIO formulation for learning optimal binary classification trees.
Our formulation can accommodate side constraints to enable the design of interpretable
and fair decision trees. Moreover, we show that our formulation has a stronger linear
optimization relaxation than existing methods in the case of binary data. We exploit the
decomposable structure of our formulation and max-flow/min-cut duality to derive a Benders’
decomposition method to speed up computation. We propose a tailored procedure for solving
each decomposed subproblem that provably generates facets of the feasible set of the MIO as
constraints to add to the main problem. We conduct extensive computational experiments on
standard benchmark datasets on which we show that our proposed approaches are 29 times
faster than state-of-the-art MIO-based techniques and improve out-of-sample performance by
up to 8%.
2.1 Introduction
2.1.1 Motivation & Related Work
Since their inception over 30 years ago, see [5], decision trees have become among the most
popular techniques for interpretable machine learning (ML), see [3]. Typically, a decision
tree takes the form of a binary tree. In each branching node of the tree, a binary test is
performed on a specific feature. Two branches emanate from each branching node, with each
branch representing the outcome of the test. If a datapoint passes (resp. fails) the test, it
is directed to the left (resp. right) branch. A predicted label is assigned to all leaf nodes.
Thus, each path from root to leaf represents a classification rule that assigns a unique label
to all datapoints that reach that leaf. The goal in the design of optimal decision trees is to
select the tests to perform at each branching node and the labels to assign to each leaf to
maximize prediction accuracy (classification) or to minimize prediction error (regression). In
practice, shallow trees are preferred as they are easier to understand and interpret. Thus,
we focus on designing trees of bounded depth. Not only are decision trees popular in their
own right; they also form the backbone for more sophisticated machine learning models. For
example, they are the building blocks for ensemble methods (such as random forests) which
combine several decision trees and constitute some of the most popular and stable machine
learning techniques available, see e.g., [36, 37] and [38]. They have also proved useful to
provide explanations for the solutions to optimization problems, see e.g., [39].
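To make the routing mechanics concrete, the following minimal Python sketch (ours, with illustrative feature indices and labels) classifies a datapoint with a depth-2 tree of univariate binary tests, using the node numbering formalized in Definition 1 below:

# Each branching node tests one binary feature: left branch if x[f] == 0,
# right branch otherwise. Leaves store predicted labels. Nodes are numbered
# breadth-first: root is 1; the children of node n are 2n and 2n + 1.
branch_feature = {1: 0, 2: 2, 3: 1}            # node -> feature index (illustrative)
leaf_label = {4: "A", 5: "B", 6: "B", 7: "A"}  # leaf -> predicted class (illustrative)

def classify(x, depth=2):
    n = 1
    for _ in range(depth):
        f = branch_feature[n]
        n = 2 * n if x[f] == 0 else 2 * n + 1  # left if the test passes, else right
    return leaf_label[n]

print(classify([0, 1, 1]))  # x[0]=0 -> node 2; x[2]=1 -> node 5 -> "B"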
Decision trees are used routinely in applications ranging from revenue management
to chemical engineering, medicine, and bioinformatics. For example, they are used for
the management of substance-abusing psychiatric patients [40], to manage Parkinson’s
disease [41], to predict mortality on liver transplant waitlists [42], to learn chemical concepts
such as octane number and molecular substructures [43], and to evaluate housing systems for
homeless youth [44]. Moreover, tree ensembles can be used to reveal associations between
micro RNAs and human diseases [45] and to predict outcomes in antibody incompatible
kidney transplantation [46].
The problem of learning optimal decision trees is an NP-hard problem, see [47] and [5].
It can intuitively be viewed as a combinatorial optimization problem with an exponential
number of decision variables: at each branching node of the tree, one can select which feature
to branch on (and potentially the level of that feature), guiding each datapoint to the left or
right using logical constraints.
Traditional Methods. Motivated by these hardness results, traditional algorithms for
learning decision trees have relied on heuristics that employ very intuitive, yet ad-hoc, rules
for constructing the decision trees. For example, CART uses the Gini Index to decide on the
splitting, see [5]; ID3 employs entropy, see [6]; and C4.5 leverages normalized information
gain, see [7]. The high quality and speed of these algorithms combined with the availability
of software packages in many popular languages such as R or Python have facilitated their
popularization, see e.g., [48] and [49].
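For reference, such heuristic trees can be trained in a few lines; a minimal scikit-learn sketch on synthetic binary data (our illustration, not from the chapter):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 5))   # 100 datapoints, 5 binary features
y = (X[:, 0] ^ X[:, 2]).astype(int)     # synthetic labels

# CART greedily splits on the feature minimizing Gini impurity at each node;
# max_depth bounds the tree depth, as favored for interpretability.
tree = DecisionTreeClassifier(criterion="gini", max_depth=2).fit(X, y)
print(tree.score(X, y))                 # training accuracy (no optimality guarantee)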
Mathematical Optimization Techniques. Motivated by the heuristic nature of traditional approaches which provide no guarantees on the quality of the learned tree, several
researchers have proposed algorithms for learning provably optimal trees based on techniques
from mathematical optimization. Approaches for learning optimal decision trees rely on
enumeration coupled with rules to prune-out the search space. For example, [50] use itemset
mining algorithms and [51] use satisfiability (SAT) solvers. [52] propose a more elaborate
implementation combining several ideas from the literature, including branch-and-bound,
itemset mining techniques, and caching. [53] use analytical bounds (to aggressively prune-out
the search space) combined with a tailored bit-vector based implementation. [54] extend
the approach of [53] to produce optimal decision trees over a variety of objectives such as
F-score and area under the receiver operating characteristic curve (AUROC). [55] and [56]
use dynamic programming and [57] use caching branch-and-bound search to compute optimal
decision trees. [58] propose a continuous-based randomized approach for learning optimal
classification trees with oblique cuts. Incorporating constraints into the above approaches is
usually a challenging task, see e.g., [59].
The Special Case of MIO. As an alternative approach to conducting the search, [18]
recently proposed to use mixed-integer optimization (MIO) to learn optimal classification
trees. Following this work, using MIO to learn decision trees gained a lot of traction in the
literature with the works of [60], [30], and [19]. This is no coincidence. First, there exists a
suite of MIO off-the-shelf solvers and algorithms that can be leveraged to effectively prune-out
the search space. Indeed, solvers such as [61] and [62] have benefited from decades of research,
see [63], and have been very successful at solving broad classes of MIO problems. Second,
MIO comes with a highly expressive language that can be used to tailor the objective function
of the problem or to augment the learning problem with constraints of practical interest. For
example, [30] leverage the power of MIO to learn fair and interpretable classification and
regression trees by augmenting their model with additional constraints. They also show how
MIO technology can be exploited to learn decision trees with more sophisticated structure
(linear branching and leafing rules). Similarly, [60] use MIO to solve the problem of learning
classification trees by taking into account the special structure of categorical features and
allowing combinatorial decisions (based on subsets of values of features) at each node. MIO
formulations have also been leveraged to design decision trees for decision- and policy-making
problems, see e.g., [33] and [64], for optimizing decisions over tree ensembles, see [65] and [66],
and also for developing heuristic approaches for learning classification trees, see [67]. We
refer the interested reader to the paper of [68] for an in-depth review of the field.
2.1.2 Discussion
The works of [18], [30], and [19] have served to showcase the modeling power of using MIO to
learn decision trees and the potential suboptimality of traditional algorithms. Yet, we argue
that they have not leveraged the power of MIO to its full extent.
A critical component for efficiently solving MIOs is to pose good formulations, but
determining such formulations is no simple task. The standard approach for solving MIO
problems is the branch-and-bound method, which partitions the search space recursively
and solves Linear Optimization (LO) relaxations for each partition to produce bounds for
fathoming sections of the search space. Thus, since solving an MIO requires solving a large
sequence of LO problems, small and compact formulations are desirable as they enable the
LO relaxation to be solved faster. Moreover, formulations with tight LO relaxations, referred
to as strong formulations, are also desirable as they produce higher quality bounds which lead
to a faster pruning of the search space, ultimately reducing the number of LO problems to be
solved. Indeed, a recent research thrust focuses on devising strong formulations for inference
problems [10, 11, 69–74]. Unfortunately, these two goals are at odds with one another:
stronger formulations often require more variables and constraints than weak ones. For
example, in the context of decision trees, [19] propose an MIO formulation with significantly
fewer variables and constraints than the formulation of [18], but in the process weaken the
LO relaxation. As a consequence, neither method consistently outperforms the other.
We note that in the case of MIO problems with large numbers of decision variables
and constraints, classical decomposition techniques from the Operations Research literature
may be leveraged to break the problem up into multiple tractable subproblems of benign
complexity, see e.g., [75–80]. A notable example of a decomposition algorithm is Benders’
decomposition, see [81]. This decomposition approach exploits the structure of mathematical
optimization problems with so-called complicating variables which couple constraints with
one another and which, once fixed, result in an attractive decomposable structure that is
leveraged to speed up computation and alleviate memory consumption, allowing the solution
of large-scale MIO problems.
To the best of our knowledge, existing approaches from the literature have not explicitly
sought strong formulations, nor have they attempted to leverage the potentially
decomposable structure of the problem. This is precisely the gap we fill with the present
work.
2.1.3 Proposed Approach & Contributions
Our approach and main contributions in this paper are:
(a) We propose a flow-based MIO formulation for learning optimal classification trees
with binary features. In this model, correctly classified datapoints can be seen as
flowing from the root of the tree to a suitable leaf while incorrectly classified datapoints
are not allowed to flow through the tree. Our formulation can easily be augmented
with constraints (e.g., imposing fairness), regularization penalties and conveniently be
adjusted to cater for imbalanced datasets.
(b) We demonstrate that our proposed formulation has a stronger LO relaxation than
existing alternatives. Notably, it does not involve big-M constraints. It is also amenable
to Benders’ decomposition. In particular, binary tests are selected in the main problem
and each subproblem guides each datapoint through the tree via a max-flow subproblem.
We leverage the max-flow structure of the subproblems to solve them efficiently via a
tailored min-cut procedure. Moreover, we prove that all the constraints generated by
this Benders’ procedure are facet-defining, i.e., they are required to describe the (convex
hull of the) projection of the feasible region into the space of variables appearing in the
main problem.
(c) We conduct extensive computational studies on benchmark datasets from the literature,
showing that our formulations improve upon the state-of-the-art MIO algorithms, both
in terms of in-sample solution quality (and speed) and out-of-sample performance.
The proposed modeling and solution paradigm can act as a building block for the faster and
more accurate learning of more sophisticated trees and tree ensembles.
The rest of the paper is organized as follows. We introduce our flow-based formulation
and our Benders’ decomposition method in Section 2.2 and Section 2.3, respectively. Several
generalizations of our core formulation are discussed in Section 2.4. We report on computational experiments with popular benchmark datasets in Section 2.5. Most proofs and detailed
computational results are provided in the appendix.
2.2 Learning Balanced Classification Trees
In this section, we describe our MIO formulation for learning optimal balanced classification
trees of a given depth, i.e., trees wherein the distance between all nodes where a prediction
is made and the root node is equal to the tree depth. Our MIO formulation relies on the
observation that once the structure of the tree is fixed, determining whether a datapoint is
correctly classified or not reduces to checking whether the datapoint can, based on its features
and label, flow from the root of the tree to a leaf where the prediction made matches its label.
Thus, we begin this section by formally defining balanced trees and their associated flow graph
in Section 2.2.1, before introducing our proposed flow-based formulation in Section 2.2.2. We
discuss variants of this basic model that can be used to learn sparse, possibly imbalanced,
trees in Section 2.4. Throughout the paper, we represent vectors using bold fonts and sets
using capital calligraphic fonts.
2.2.1 Decision Tree and Associated Flow Graph
A key step towards our flow-based MIO formulation of the problem consists in converting the
decision tree of fixed depth that we wish to train to a directed acyclic graph where all arcs
are directed from the root of the tree to the leaves. We detail this conversion together with
the basic terminology that we use in our paper in what follows.
Definition 1 (Balanced Decision Tree) A balanced decision tree of depth d ∈ N is a
perfect binary tree, i.e., a binary tree in which all interior nodes have two children and all
leaves have the same depth. We number the nodes in the tree in the order they appear in a
breadth-first-search traversal, so that 1 is the root node and 2^{d+1} − 1 is the bottom-right
node. We define the first 2^d − 1 nodes B := {1, . . . , 2^d − 1} as the branching nodes and
the remaining 2^d nodes L := {2^d, . . . , 2^{d+1} − 1} as the leaf nodes of the decision tree.
12
An illustration of a balanced decision tree is provided in Figure 2.1 (left). The key
idea behind our model is to convert a balanced decision tree to a directed acyclic graph by
augmenting it with a single source node s that is connected to the root node (node 1) of the
tree and a single sink node t connected to all leaf nodes of the tree. We refer to this graph as
the flow graph of the decision tree. An illustration of these concepts on a decision tree of
depth d = 2 is provided in Figure 2.1.
Figure 2.1: A decision tree of depth 2 (left) and its associated flow graph (right). Here,
B = {1, 2, 3} and L = {4, 5, 6, 7}, while V = {s, 1, 2, . . . , 7, t} and A = {(s, 1), (1, 2), . . . , (7, t)}.
A formal definition for the flow graph associated with a balanced decision tree is as follows.
Definition 2 (Flow Graph of a Balanced Decision Tree) Given a balanced decision
tree of depth d, define the directed flow graph G = (V, A) associated with the tree as follows.
Let V := {s, t} ∪ B ∪ L be the vertices of the flow graph. Given n ∈ B, let ℓ(n) := 2n be the
left descendant of n, r(n) := 2n + 1 be the right descendant of n, and

A := {(n, ℓ(n)) : n ∈ B} ∪ {(n, r(n)) : n ∈ B} ∪ {(s, 1)} ∪ {(n, t) : n ∈ L}

be the arcs of the graph. Also, given n ∈ B ∪ L, let a(n) be the ancestor of n, defined through
a(1) := s and a(n) := ⌊n/2⌋ if n ≠ 1.
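As a quick sanity check of Definitions 1 and 2, the following sketch (our illustration) enumerates the node and arc sets of the flow graph for a given depth:

def flow_graph(d):
    """Build the flow graph of a balanced decision tree of depth d."""
    B = list(range(1, 2 ** d))             # branching nodes 1 .. 2^d - 1
    L = list(range(2 ** d, 2 ** (d + 1)))  # leaf nodes 2^d .. 2^(d+1) - 1
    V = ["s", "t"] + B + L
    A = [("s", 1)]                         # source feeds the root
    A += [(n, 2 * n) for n in B]           # left descendants l(n) = 2n
    A += [(n, 2 * n + 1) for n in B]       # right descendants r(n) = 2n + 1
    A += [(n, "t") for n in L]             # every leaf connects to the sink
    return V, A

V, A = flow_graph(2)
print(A)  # matches Figure 2.1: ('s',1), (1,2), (1,3), ..., (4,'t'), ..., (7,'t')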
2.2.2 Problem Formulation
Let D := {(x^i, y^i)}_{i∈I} be a training dataset consisting of datapoints indexed in the
set I ⊆ N. Each row i ∈ I consists of F binary features indexed in the set F ⊆ N, which
we collect in the vector x^i ∈ {0, 1}^F, and a label y^i drawn from the finite set K of
classes. In this section we formulate the problem of learning a multi-class classification
tree of fixed finite depth d ∈ N that minimizes the misclassification rate (or equivalently,
maximizes the number of correctly classified datapoints) based on MIO technology.
In our formulation, the classification tree is described through the branching variables
b and the prediction variables w. In particular, we use the variables b_{nf} ∈ {0, 1}, f ∈ F,
n ∈ B, to indicate if the tree branches on feature f at branching node n (i.e., b_{nf} equals 1
if and only if the binary test performed at n asks “is x^i_f = 0?”). Accordingly, we employ
the variables w^n_k ∈ {0, 1}, n ∈ L, k ∈ K, to indicate that at leaf node n the tree predicts
class k. We use the auxiliary routing/flow variables z to decide on the flow of data through
the flow graph associated with the decision tree. Specifically, for each node n ∈ B ∪ L and
for each datapoint i ∈ I, we introduce a decision variable z^i_{a(n),n} ∈ {0, 1} which equals 1
if and only if the ith datapoint is correctly classified and its flow traverses the arc (a(n), n)
on its way to the sink t. We let z^i_{n,t} be defined accordingly for each arc between node
n ∈ L and sink t. Datapoint i ∈ I is correctly classified if and only if its corresponding flow
passes through some leaf node n ∈ L such that w^n_{y^i} = 1, i.e., where the class predicted
coincides with the class of the datapoint. If the flow of a datapoint i arrives at such a leaf
node n and the datapoint is correctly classified, its corresponding flow is directed to the sink,
i.e., z^i_{n,t} = 1; otherwise, the corresponding flow is not initiated from the source at all.
With these variables, the flow-based formulation reads
maximize    Σ_{i∈I} Σ_{n∈L} z^i_{n,t}                                        (2.1a)
subject to  Σ_{f∈F} b_{nf} = 1                          ∀n ∈ B               (2.1b)
            z^i_{a(n),n} = z^i_{n,ℓ(n)} + z^i_{n,r(n)}   ∀n ∈ B, i ∈ I        (2.1c)
            z^i_{a(n),n} = z^i_{n,t}                     ∀n ∈ L, i ∈ I        (2.1d)
            z^i_{s,1} ≤ 1                                ∀i ∈ I               (2.1e)
            z^i_{n,ℓ(n)} ≤ Σ_{f∈F: x^i_f = 0} b_{nf}     ∀n ∈ B, i ∈ I        (2.1f)
            z^i_{n,r(n)} ≤ Σ_{f∈F: x^i_f = 1} b_{nf}     ∀n ∈ B, i ∈ I        (2.1g)
            z^i_{n,t} ≤ w^n_{y^i}                        ∀n ∈ L, i ∈ I        (2.1h)
            Σ_{k∈K} w^n_k = 1                            ∀n ∈ L               (2.1i)
            w^n_k ∈ {0, 1}                               ∀n ∈ L, k ∈ K        (2.1j)
            b_{nf} ∈ {0, 1}                              ∀n ∈ B, f ∈ F        (2.1k)
            z^i_{a(n),n} ∈ {0, 1}                        ∀n ∈ B ∪ L, i ∈ I    (2.1l)
            z^i_{n,t} ∈ {0, 1}                           ∀n ∈ L, i ∈ I.       (2.1m)
An interpretation of the constraints and objective is as follows. Constraints (2.1b) ensure
that at each branching node n ∈ B we branch on exactly one feature f ∈ F. Constraints (2.1c)
are flow conservation constraints for each datapoint i and node n ∈ B: they ensure that if a
datapoint arrives at a node, then it must also leave the node through one of its descendants.
Similarly, constraints (2.1d) enforce flow conservation for each node n ∈ L. The inequality
constraints (2.1e) imply that at most one unit of flow can enter the graph through the source
for each datapoint. Constraints (2.1f) (resp. (2.1g)) ensure that if the flow of a datapoint is
routed to the left (resp. right) at node n, then one of the features such that x^i_f = 0
(resp. x^i_f = 1) must have been selected for branching at the node. Constraints (2.1h) guarantee that
datapoints whose flow is routed to the sink node t are correctly classified. Constraints (2.1i)
make sure that each leaf node is assigned a predicted class k ∈ K. The objective (2.1a)
maximizes the total number of correctly classified datapoints.
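For concreteness, formulation (2.1) maps almost line-for-line onto an off-the-shelf MIO solver. The following is a minimal sketch of that transcription (ours, not the authors' implementation) using the gurobipy API; it assumes X is a list of 0/1 feature vectors and y a list of class labels.

import gurobipy as gp
from gurobipy import GRB

def build_flow_oct(X, y, d):
    I = range(len(X)); F = range(len(X[0])); K = sorted(set(y))
    B = range(1, 2 ** d); L = range(2 ** d, 2 ** (d + 1))
    m = gp.Model("FlowOCT")
    b = m.addVars(B, F, vtype=GRB.BINARY, name="b")   # branch on feature f at node n
    w = m.addVars(L, K, vtype=GRB.BINARY, name="w")   # predict class k at leaf n
    # z[i, n] is the flow of datapoint i on arc (a(n), n); z[i, 1] is the source arc
    z = m.addVars(I, range(1, 2 ** (d + 1)), vtype=GRB.BINARY, name="z")
    zt = m.addVars(I, L, vtype=GRB.BINARY, name="zt")  # flow on arc (n, t), n in L
    m.setObjective(zt.sum(), GRB.MAXIMIZE)                                     # (2.1a)
    m.addConstrs((b.sum(n, "*") == 1 for n in B), name="branch_once")          # (2.1b)
    m.addConstrs((z[i, n] == z[i, 2 * n] + z[i, 2 * n + 1]
                  for n in B for i in I), name="flow_B")                       # (2.1c)
    m.addConstrs((z[i, n] == zt[i, n] for n in L for i in I), name="flow_L")   # (2.1d)
    # (2.1e) holds automatically here since z[i, 1] is declared binary
    m.addConstrs((z[i, 2 * n] <= gp.quicksum(b[n, f] for f in F if X[i][f] == 0)
                  for n in B for i in I), name="left")                         # (2.1f)
    m.addConstrs((z[i, 2 * n + 1] <= gp.quicksum(b[n, f] for f in F if X[i][f] == 1)
                  for n in B for i in I), name="right")                        # (2.1g)
    m.addConstrs((zt[i, n] <= w[n, y[i]] for n in L for i in I), name="label") # (2.1h)
    m.addConstrs((w.sum(n, "*") == 1 for n in L), name="predict_once")         # (2.1i)
    return m, b, w

Calling m.optimize() and reading off the nonzero b and w entries recovers the tree; the Benders' scheme developed in Section 2.3 instead keeps only (b, w) in the main problem and prices out each datapoint's flow variables through cuts.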
Formulation (2.1) has several distinguishing features relative to existing MIO formulations
for training decision trees: First, it does not use big-M constraints. Second, it includes
flow variables indicating whether each datapoint is directed to the left or right at each
branching node, which resembles the well-known multi-commodity flow concept, see [82].
Third, incorrectly classified datapoints are associated with a flow of zero. In contrast, previous
formulations [18, 19] and more sophisticated formulations we propose later in Section 2.4
include binary decision variables that indicate the leaf each datapoint lands on. The advantage
of modeling misclassified points with a flow of zero is that the resulting formulation is smaller.
A similar idea of “projecting out” variables associated with misclassified datapoints was
proposed by [60].
The number of variables and constraints in Problem (2.1) is O(2^d (|I| + |F| + |K|)), where d
is the tree depth. Thus, its size is of the same order as the univariate splits formulation of [18]
(formulation (24) in their paper) while being of higher order than that of [19]. Nonetheless,
the LO relaxation of formulation (2.1) is tighter than those of [18] and [19], as demonstrated
in the following theorem which we prove in Appendix A.4, and therefore results in a more
aggressive pruning of the branch-and-bound tree.
Theorem 1 Problem (2.1) has a stronger LO relaxation than the formulations of [18]
and [19].
Remark 1 Formulation (2.1) assumes that all features f ∈ F are binary. However, this
formulation can also be applied to datasets involving categorical or bounded integer features
by first preprocessing the data as follows. For each categorical feature, we encode it as a
one-hot vector, i.e., for each level of the feature, we create a new binary column which has
value 1 if and only if the original column has the corresponding level. We follow a similar
approach for encoding integer features with a slight change. The new binary column has value
1 if and only if the main column has the corresponding value or any value smaller than it,
see e.g., [19] and [83]. This discretization of the features increases the size of the dataset
linearly with the number of possible values of each categorical/integer feature. We also note
that formulation (2.1) only considers univariate splits. It can however be generalized to allow
for multivariate splits. Indeed, in the case of binary features, having multivariate (or oblique)
splits is equivalent to “combinatorial” branching, which can be done by creating additional
artificial features, see [60] for additional discussion. We can also extend formulation (2.1) for
the case of non-binary trees, where each splitting node can have more than two children. This
can be achieved by introducing additional nodes and edges to the flow graph while adjusting
the formulation accordingly.
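The preprocessing in Remark 1 is straightforward to implement. Below is a minimal sketch (ours; the column contents are illustrative) of the two encodings: one-hot for categorical features and cumulative thresholds for bounded integer features.

def one_hot(column):
    """Categorical feature -> one binary column per level (1 iff x == v)."""
    levels = sorted(set(column))
    return {f"eq_{v}": [1 if x == v else 0 for x in column] for v in levels}

def cumulative(column):
    """Bounded integer feature -> one binary column per value v (1 iff x <= v)."""
    values = sorted(set(column))
    return {f"le_{v}": [1 if x <= v else 0 for x in column] for v in values}

print(one_hot(["red", "blue", "red"]))  # {'eq_blue': [0, 1, 0], 'eq_red': [1, 0, 1]}
print(cumulative([3, 1, 2]))            # {'le_1': [0, 1, 0], 'le_2': [0, 1, 1], 'le_3': [1, 1, 1]}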
2.3 Benders’ Decomposition via Facet-defining Cuts
In Section 2.2, we proposed a formulation for designing optimal classification trees that
is provably stronger than existing approaches from the literature. Our model presents an
attractive decomposable structure that we leverage in this section to speed up computation.
2.3.1 Main Problem, Subproblems, and Benders’ Decomposition
A classification tree is uniquely characterized by the branching decisions b and predictions w
made at the branching nodes and leaves, respectively. Given a choice of b and w, each
datapoint in problem (2.1) is allotted one unit of flow that can be guided from the source node
to the sink node through the flow graph associated with the decision tree. If the datapoint
cannot be correctly classified, the flow that will reach the sink (and by extension enter the
source) will be zero. In particular, once the branching variables b and prediction variables w have been fixed, optimization of the auxiliary flow variables z can be done separately for each datapoint. Accordingly, we can decompose problem (2.1) into a main problem involving the variables (b, w) and |I| subproblems indexed by i ∈ I, each involving the flow variables z^i associated with datapoint i. Additionally, each subproblem is a maximum flow
problem for which specialized polynomial-time methods exist. Due to these characteristics,
formulation (2.1) can be naturally tackled using Benders’ decomposition, see [81]. In what
follows, we describe the Benders’ decomposition approach.
Problem (2.1) can be written equivalently as:

maximize   Σ_{i∈I} g^i(b, w)                          (2.2a)
subject to Σ_{f∈F} b_{nf} = 1        ∀n ∈ B           (2.2b)
           Σ_{k∈K} w^n_k = 1         ∀n ∈ L           (2.2c)
           b_{nf} ∈ {0, 1}           ∀n ∈ B, f ∈ F    (2.2d)
           w^n_k ∈ {0, 1}            ∀n ∈ L, k ∈ K,   (2.2e)
where, for any fixed i ∈ I, b and w, the quantity g^i(b, w) is defined as the optimal objective value of the problem

g^i(b, w) = maximize   Σ_{n∈L} z^i_{n,t}                                         (2.3a)
            subject to z^i_{a(n),n} = z^i_{n,ℓ(n)} + z^i_{n,r(n)}   ∀n ∈ B       (2.3b)
                       z^i_{a(n),n} = z^i_{n,t}                     ∀n ∈ L       (2.3c)
                       z^i_{s,1} ≤ 1                                             (2.3d)
                       z^i_{n,ℓ(n)} ≤ Σ_{f∈F : x^i_f = 0} b_{nf}    ∀n ∈ B       (2.3e)
                       z^i_{n,r(n)} ≤ Σ_{f∈F : x^i_f = 1} b_{nf}    ∀n ∈ B       (2.3f)
                       z^i_{n,t} ≤ w^n_{y^i}                        ∀n ∈ L       (2.3g)
                       z^i_{a(n),n} ∈ {0, 1}                        ∀n ∈ B ∪ L   (2.3h)
                       z^i_{n,t} ∈ {0, 1}                           ∀n ∈ L.      (2.3i)
Problem (2.3) is a maximum flow problem on the flow graph G, see Definition 2, whose arc
capacities are determined by (b, w) and datapoint i ∈ I, as formalized next.
Definition 3 (Capacitated flow graph) Given the graph G = (V, A), vectors (b, w), and datapoint i ∈ I, define arc capacities c^i(b, w) as follows. Let c^i_{s,1}(b, w) := 1, c^i_{n,ℓ(n)}(b, w) := Σ_{f∈F : x^i_f = 0} b_{nf} and c^i_{n,r(n)}(b, w) := Σ_{f∈F : x^i_f = 1} b_{nf} for all n ∈ B, and c^i_{n,t}(b, w) := w^n_{y^i} for n ∈ L. Define the capacitated flow graph G^i(b, w) as the flow graph G augmented with capacities c^i(b, w).
Observe that the arc capacities in G^i(b, w) are affine functions of (b, w). An interpretation of the arc capacities and capacitated flow graph for b and w feasible in problem (2.2) is as follows. Constraints (2.2b) imply that the weights c^i(b, w), i ∈ I, in the above definition are binary. Indeed, the capacity of the arc incoming into node 1 is 1. Moreover, for each node n ∈ B and datapoint i ∈ I, exactly one of the outgoing arcs of node n in graph G^i(b, w) has capacity 1: the left arc if the datapoint passes the test (c^i_{n,ℓ(n)}(b, w) = 1) or the right arc if it fails it (c^i_{n,r(n)}(b, w) = 1). Finally, for each node n ∈ L in graph G^i(b, w), its outgoing arc has capacity one if and only if the datapoint has the same label y^i as that predicted at node n. Thus, for each datapoint, the set of arcs with capacity 1 in the graph G^i(b, w) forms a path from the source node to the leaf where this datapoint is assigned in the decision tree. This path is connected to the sink via an arc of capacity 1 if and only if the datapoint is correctly classified. We note that as all the arc capacities in the flow graph are binary, the integrality constraints (2.3h) and (2.3i) can be relaxed.
Problem (2.3) is equivalent to a maximum flow problem on the capacitated flow graph G^i(b, w). From the well-known max-flow/min-cut duality, see e.g., [84], it follows that g^i(b, w) is the capacity of a minimum (s, t) cut of graph G^i(b, w). In other words, it is the greatest value smaller than or equal to the value of every cut in graph G^i(b, w). Given a set S ⊆ V, we let C(S) := {(n1, n2) ∈ A : n1 ∈ S, n2 ∉ S} denote the cut-set corresponding to the source set S. With this notation, problem (2.2) can be reformulated as
maximize_{g,b,w}  Σ_{i∈I} g^i                                                       (2.4a)
subject to g^i ≤ Σ_{(n1,n2)∈C(S)} c^i_{n1,n2}(b, w)   ∀i ∈ I, S ⊆ V \ {t} : s ∈ S   (2.4b)
           Σ_{f∈F} b_{nf} = 1                          ∀n ∈ B                        (2.4c)
           Σ_{k∈K} w^n_k = 1                           ∀n ∈ L                        (2.4d)
           b_{nf} ∈ {0, 1}                             ∀n ∈ B, f ∈ F                 (2.4e)
           w^n_k ∈ {0, 1}                              ∀n ∈ L, k ∈ K                 (2.4f)
           g^i ≤ 1                                     ∀i ∈ I,                       (2.4g)
where, at an optimal solution, g^i represents the value of a minimum (s, t) cut of graph G^i(b, w). Indeed, the objective function (2.4a) maximizes the g^i, while constraints (2.4b) ensure that the value of g^i is no greater than the value of a minimum cut.
Formulation (2.4) contains an exponential number of inequalities (2.4b) and is implemented using row generation, whereby constraints (2.4b) are initially dropped and added on the fly during optimization as Benders’ cuts. Note that we added the redundant constraints (2.4g) to ensure problem (2.4) is bounded even if all constraints (2.4b) are dropped. Row generation can be implemented in modern MIO solvers via callbacks, by adding lazy constraints at relevant nodes of the branch-and-bound tree. Identifying which constraint (2.4b) to add amounts in general to solving a minimum cut problem, which could in principle be done via well-known algorithms such as those of [85] and [86].
The Benders’ method we propose is reminiscent of the approach proposed by [87] to tackle
two-stage problems in which the subproblem corresponds to a “tractable” 0-1 problem (i.e.,
a problem that admits an exact LO reformulation). In [87], this subproblem is a decision
diagram; in our case, it is a maximum flow problem. Thus, inequalities (2.4b) correspond to
usual Benders’ cuts, that replace the discrete subproblem with its (equivalent) continuous LO
reformulation. As also pointed out by [87], at each iteration of Benders’ method, there are
often multiple optimal dual solutions, each corresponding to a candidate inequality that can
be added. Cuts obtained from most of these solutions might be weak in general, and thus
they discuss how to strengthen them. In the next section, we propose a method that, among
all potential optimal dual solutions, finds one that is guaranteed to result in a facet-defining
cut for the convex hull of the feasible region given by inequalities (2.4b)-(2.4g), that is, a
cut that is already the best possible and admits no further strengthening. In addition, the method is faster than using a general-purpose method to find minimum cuts or solve LO problems, as we show in Appendix A.7.1.
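For concreteness, a minimal sketch of this row generation loop in gurobipy is given below. The helper routines build_main_problem, separate, cut_set, and capacity_expr are hypothetical names (the last two would implement C(S) and the affine capacities of Definition 3), and the decision variables are assumed to be stored on the model object.

import gurobipy as gp
from gurobipy import GRB

def benders_callback(model, where):
    # Invoked by Gurobi; we act only at integer-feasible solutions.
    if where != GRB.Callback.MIPSOL:
        return
    b = model.cbGetSolution(model._b)  # branching decisions
    w = model.cbGetSolution(model._w)  # prediction decisions
    g = model.cbGetSolution(model._g)  # per-datapoint objective values
    for i in model._datapoints:
        S = separate(b, w, g, i)       # returns -1 or the source set of a min-cut
        if S != -1:
            # Lazily add the violated cut-set inequality (2.4b).
            rhs = gp.quicksum(capacity_expr(model, i, n1, n2)
                              for (n1, n2) in cut_set(S))
            model.cbLazy(model._g[i] <= rhs)

main = build_main_problem()            # variables (b, w, g) and (2.4c)-(2.4g)
main.Params.LazyConstraints = 1        # required when adding lazy constraints
main.optimize(benders_callback)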
2.3.2 Generating Facet-Defining Cuts via a Tailored Min-Cut Procedure
Row generation methods for integer optimization problems such as Benders’ decomposition
may require a long time to converge to an optimal solution if each added inequality is
weak for the feasible region of interest. It is therefore of critical importance to add strong
non-dominated cuts at each iteration of the algorithm, e.g., see [88]. We now argue that not
all inequalities (2.4b) are facet-defining for the convex hull of the feasible set of problem (2.4).
To this end, let H= represent the (mixed-binary) feasible region defined by constraints
(2.4b)-(2.4g), and denote by conv(H=) its convex hull. Example 1 below shows that some of
inequalities (2.4b) are not facet-defining for conv(H=) –even if they correspond to a minimum
cut for a given value of (b, w)– and are in fact dominated.
Example 1 Consider an instance of problem (2.4) with a depth d = 1 decision tree (i.e., B = {1} and L = {2, 3}) and a dataset involving a single feature (F = {1}). Consider datapoint i such that x^i_1 = 0 and y^i = 1. Suppose that the solution to the main problem is such that we branch on (the unique) feature at node 1 and predict class 0 at nodes 2 and 3. Then, datapoint i is routed left at node 1 and is misclassified. A valid min-cut for the resulting graph includes all arcs incoming into the sink, i.e., S = {s, 1, 2, 3} and C(S) = {(2, t), (3, t)}. The associated Benders’ inequality (2.4b) reads

g^i ≤ w^2_1 + w^3_1.   (2.5)
Intuitively, (2.5) states that datapoint i can be correctly classified if its class label is assigned to at least one node, and it is certainly valid for conv(H=). However, since datapoint i cannot be routed to node 3, the stronger inequality

g^i ≤ w^2_1   (2.6)

is valid for conv(H=) and dominates (2.5).
Example 1 implies that an implementation of formulation (2.4) using general purpose
min-cut algorithms to identify constraints to add may perform poorly. This motivates us
to develop a tailored algorithm that exploits the structure of each capacitated flow graph
G
i
(b, w) to return inequalities that are never dominated, thus resulting in faster convergence
of the Benders’ decomposition approach.
Algorithm 1 shows the proposed procedure, which can be called at integer nodes of the
branch-and-bound tree, using for example callback procedures available in most commercial
and open-source solvers. Algorithm 1 traverses the flow graph associated with a datapoint
using depth-first search to investigate the existence of a path from the source to the sink.
If there is such a path, the datapoint is correctly classified and there is no violated inequality. Otherwise, it outputs the set of all nodes visited during the traversal as the source set of the min-cut. Figure 2.2 illustrates Algorithm 1. We now prove that
Algorithm 1 is indeed a valid separation algorithm.
Algorithm 1 Separation procedure
Input: (b, w, g) ∈ {0, 1}^{B×F} × {0, 1}^{L×K} × R^I satisfying (2.4c)-(2.4g);
       i ∈ I : datapoint used to generate the cut.
Output: −1 if all constraints (2.4b) corresponding to i are satisfied;
        source set S of min-cut otherwise.
1: if g^i = 0 then return −1
2: Initialize n ← 1                            ▷ Current node = root
3: Initialize S ← {s}                          ▷ s is in the source set of the cut
4: while n ∈ B do
5:    S ← S ∪ {n}
6:    if c^i_{n,ℓ(n)}(b, w) = 1 then
7:        n ← ℓ(n)                             ▷ Datapoint i is routed left
8:    else if c^i_{n,r(n)}(b, w) = 1 then
9:        n ← r(n)                             ▷ Datapoint i is routed right
10:   end if
11: end while                                  ▷ At this point, n ∈ L
12: S ← S ∪ {n}
13: if g^i > c^i_{n,t}(b, w) then              ▷ Minimum cut S with capacity 0 found
14:    return S
15: else              ▷ Minimum cut S has capacity 1, constraints (2.4b) are satisfied
16:    return −1
17: end if
Figure 2.2: Illustration of Algorithm 1 on two datapoints, one correctly classified (datapoint 1, left) and one incorrectly classified (datapoint 2, right). Unbroken (green) arcs (n, n′) have capacity c^i_{n,n′}(b, w) = 1 (and all others capacity 0). Datapoint 1 is correctly classified since there exists a path from source to sink, so Algorithm 1 terminates on line 16 and returns −1. Datapoint 2 is incorrectly classified, so Algorithm 1 returns set S = {s, 1, 3, 6} on line 14. The associated minimum cut consists of arcs (1, 2), (6, t), and (3, 7) and is represented by the thick (red) dashed line.

Proposition 1 Given i ∈ I and (b, w, g) satisfying (2.4c)-(2.4g) (in particular, b and w are integral), Algorithm 1 either finds a violated inequality (2.4b) or proves that all such inequalities are satisfied.

Proof. Note that the right-hand side of (2.4b), which corresponds to the capacity of a cut in the graph, is nonnegative. Therefore, if g^i = 0 (line 1), all inequalities are automatically satisfied. Since (b, w) is integer, all arc capacities are either 0 or 1. We assume that the arcs with zero capacity are removed from the flow graph. Moreover, since g^i ≤ 1, we find that either the value of a minimum cut is 0 and there exists a violated inequality, or the value of a minimum cut is at least 1 and there is no violated inequality. Finally, there exists a 0-capacity cut if and only if s and t belong to different connected components in the graph G^i(b, w).

The connected component that s belongs to can be found using depth-first search. For any fixed n ∈ B, constraints (2.4c) and the definition of c^i(b, w) imply that either arc (n, ℓ(n)) or arc (n, r(n)) has capacity 1 (but not both). If arc (n, ℓ(n)) has capacity 1 (line 6), then ℓ(n) can be added to the component connected to s (set S); the case where arc (n, r(n)) has capacity 1 (line 8) is handled analogously. This process continues until a leaf node is reached (line 12). If the capacity of the arc to the sink is 1 (line 15), then an s−t path is found and no cut with capacity 0 exists. Otherwise (line 13), S is the connected component of s and t ∉ S, thus S is the source of a minimum cut with capacity 0. □
Observe that Algorithm 1, which exploits the specific structure of the network for binary (b, w) feasible in (2.2), is much faster than general-purpose minimum-cut methods. Indeed, since at each iteration of the main loop (lines 4-11) the value of n is updated to a descendant of n, the algorithm terminates in at most O(d) iterations, where d is the depth of the tree. As |B ∪ L| is O(2^d), the complexity is logarithmic in the size of the tree. In addition to providing a very fast method for generating Benders’ inequalities at integer nodes of a branch-and-bound tree, Algorithm 1 is guaranteed to generate strong non-dominated inequalities.
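For illustration, a direct Python transcription of Algorithm 1 might read as follows (a sketch under the conventions above: b and w are 0-1 dictionaries keyed by (node, feature) and (node, class), x[i][f] and y[i] are the data, and ℓ(n) = 2n, r(n) = 2n + 1).

def separate(b, w, g, i, x, y, branch_nodes, features):
    """Algorithm 1: return -1 if no inequality (2.4b) for datapoint i is
    violated, and the source set S of a 0-capacity minimum cut otherwise."""
    if g[i] == 0:                    # line 1: all inequalities hold trivially
        return -1
    n, S = 1, {"s"}                  # lines 2-3: start at the root
    while n in branch_nodes:         # lines 4-11: follow the unit-capacity arcs
        S.add(n)
        f = next(f for f in features if b[n, f] > 0.5)  # feature branched on at n
        n = 2 * n if x[i][f] == 0 else 2 * n + 1        # route left / right
    S.add(n)                         # line 12: n is now a leaf
    # lines 13-17: the sink arc out of leaf n has capacity w[n, y[i]]
    return S if g[i] > w[n, y[i]] else -1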
Define

H≤ := { b ∈ {0, 1}^{B×F}, w ∈ {0, 1}^{L×K}, g ∈ R^I :
        Σ_{f∈F} b_{nf} ≤ 1                            ∀n ∈ B,
        Σ_{k∈K} w^n_k ≤ 1                             ∀n ∈ L,
        g^i ≤ Σ_{(n1,n2)∈C(S)} c^i_{n1,n2}(b, w)      ∀i ∈ I, S ⊆ V \ {t} : s ∈ S }.
Our main motivation for introducing H≤ is that conv(H≤) is full dimensional, whereas conv(H=) is not. Moreover, in formulation (2.4), apart from constraints (2.4c)-(2.4d), variables b and w only appear in the right-hand side of inequalities (2.4b) with non-negative coefficients. Therefore, replacing constraints (2.4c)-(2.4f) with (b, w) ∈ H≤ still results in a valid formulation of problem (2.4), since there exists an optimal solution at which the inequalities are tight. Theorem 2 below formally states that Algorithm 1 generates non-dominated inequalities.
Theorem 2 All violated inequalities generated by Algorithm 1 are facet-defining for conv(H≤).
Example 2 (Example 1 Continued) In the instance considered in Example 1, if b_{1f} = 1 and w^2_1 = 0, then the cut generated by Algorithm 1 has source set S = {s, 1, 2} and results in inequality

g^i ≤ Σ_{(n1,n2)∈C(S)} c^i_{n1,n2}(b, w) = c^i_{2,t}(b, w) + c^i_{1,3}(b, w) = w^2_1,

which is precisely the stronger inequality (2.6).
Remark 2 Algorithm 1 can only be invoked at integer nodes of the branch-and-bound tree. However, there is a slight advantage in including as many cut-set inequalities as possible at the root node of the branch-and-bound tree, by using Gurobi to solve the corresponding linear optimization problem for each subproblem. This approach allows us to enhance the LO relaxation at integer nodes. A detailed investigation of this matter is reported in Appendix A.7.1.
2.4 Generalizations
In Section 2.2, we proposed a flow-based MIO formulation for designing optimal balanced
decision trees, see problem (2.1). In this section, we generalize this core formulation to design
regularized (i.e., not necessarily balanced) classification trees wherein the distance from root
to leaf may vary across leaves. We also discuss a variant that tracks all datapoints, even
those that are not correctly classified, making it suitable to learn from imbalanced datasets,
and to design fair decision trees.
2.4.1 Imbalanced Decision Trees
Formulation (2.1) outputs a balanced decision tree as defined in Definition 1. Such trees may
result in over-fitting of the data and poor out-of-sample performance, in particular if d is large. To mitigate this, we propose a variant of formulation (2.1) which allows for the design of trees that are not necessarily balanced and that have the potential of performing better out-of-sample. To this end, we introduce the following terminology pertaining to imbalanced trees.
Definition 4 (Imbalanced Decision Trees) An imbalanced decision tree of (maximum) depth d ∈ N is a full binary tree, i.e., a tree in which every node has either 0 or 2 children and where the largest depth of a leaf is d. We let B := {1, . . . , 2^d − 1} denote the set of all candidate branching nodes and T := {2^d, . . . , 2^{d+1} − 1} represent the set of terminal nodes.
We will refer to a node n ∈ B ∪ T as a leaf if no branching occurs at the node.
Note that in a balanced decision tree, see Definition 1, branching occurs at all nodes in
B, and leaves of the decision tree correspond precisely to all nodes of maximum depth, i.e.,
L = T . In contrast, in imbalanced decision trees nodes in B can be leaf nodes and terminal
nodes in T need not be part of the decision tree.
Akin to Definition 2, we associate with an imbalanced decision tree a directed acyclic
graph by augmenting the tree with a single source node s that is connected to the root node of
the tree and a single sink node t that is connected to all nodes n ∈ B ∪ T of the tree, allowing
correctly classified datapoints to flow to the sink from any node where a prediction is made
(leaf of the learned tree). An illustration of these concepts on an imbalanced decision tree of
depth d = 2 is provided in Figure 2.3. A formal definition of the flow graph associated with an imbalanced decision tree of maximum depth d is as follows.
Figure 2.3: A decision tree of depth 2 (left) and its associated flow graph (right) that can be used to train imbalanced decision trees of maximum depth 2. Here, B = {1, 2, 3} and T = {4, 5, 6, 7}, while V = {s, 1, 2, . . . , 7, t} and A = {(s, 1), (1, 2), (1, t), . . . , (7, t)}. The additional arcs that connect the branching nodes to the sink allow branching nodes n ∈ B to be converted to leaves where a prediction is made. Correctly classified datapoints that reach a leaf are directed to the sink. Incorrectly classified datapoints are not allowed to flow in the graph.
Definition 5 (Flow Graph of an Imbalanced Tree) Given an imbalanced decision tree of depth d, we define its associated directed flow graph G = (V, A) as follows. Let V := {s, t} ∪ B ∪ T be the vertices of the flow graph. Given n ∈ B, let ℓ(n) := 2n be the left descendant of n, r(n) := 2n + 1 be the right descendant of n, and

A := {(n, ℓ(n)) : n ∈ B} ∪ {(n, r(n)) : n ∈ B} ∪ {(s, 1)} ∪ {(n, t) : n ∈ B ∪ T }

be the arcs of the graph. Also, given n ∈ B ∪ T , let a(n) be the parent of n, defined through a(1) := s and a(n) := ⌊n/2⌋ if n ≠ 1.
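The array-style node indexing of Definition 5 is convenient to implement directly; the following small helpers (a sketch) compute ℓ(n), r(n), a(n), and the ancestor set P(n) that appears in formulation (2.7) below.

def left(n):       # ℓ(n): left child of node n
    return 2 * n

def right(n):      # r(n): right child of node n
    return 2 * n + 1

def parent(n):     # a(n): the source s is the parent of the root
    return "s" if n == 1 else n // 2

def ancestors(n):  # P(n): all nodes on the path from the root to n, excluding n
    path = []
    while n > 1:
        n //= 2
        path.append(n)
    return path    # e.g., ancestors(6) == [3, 1]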
We are now ready to formulate the variant of problem (2.1) that allows for the design of imbalanced classification trees. In addition to the decision variables from formulation (2.1), we introduce, for every node n ∈ B ∪ T , the binary decision variable p_n, which has a value of one if and only if node n is a leaf node of the tree, i.e., if we make a prediction at node n. The auxiliary routing/flow variables z now account for all arcs in the flow graph introduced in Definition 5. The problem of learning optimal imbalanced classification trees is then expressible as
maximize (1 − λ) Σ_{i∈I} Σ_{n∈B∪T} z^i_{n,t} − λ Σ_{n∈B} Σ_{f∈F} b_{nf}               (2.7a)
subject to Σ_{f∈F} b_{nf} + p_n + Σ_{m∈P(n)} p_m = 1      ∀n ∈ B                      (2.7b)
           p_n + Σ_{m∈P(n)} p_m = 1                        ∀n ∈ T                      (2.7c)
           z^i_{a(n),n} = z^i_{n,ℓ(n)} + z^i_{n,r(n)} + z^i_{n,t}   ∀n ∈ B, i ∈ I      (2.7d)
           z^i_{a(n),n} = z^i_{n,t}                        ∀n ∈ T , i ∈ I              (2.7e)
           z^i_{s,1} ≤ 1                                   ∀i ∈ I                      (2.7f)
           z^i_{n,ℓ(n)} ≤ Σ_{f∈F : x^i_f = 0} b_{nf}       ∀n ∈ B, i ∈ I               (2.7g)
           z^i_{n,r(n)} ≤ Σ_{f∈F : x^i_f = 1} b_{nf}       ∀n ∈ B, i ∈ I               (2.7h)
           z^i_{n,t} ≤ w^n_{y^i}                           ∀n ∈ B ∪ T , i ∈ I          (2.7i)
           Σ_{k∈K} w^n_k = p_n                             ∀n ∈ B ∪ T                  (2.7j)
           w^n_k ∈ {0, 1}                                  ∀n ∈ B ∪ T , k ∈ K          (2.7k)
           b_{nf} ∈ {0, 1}                                 ∀n ∈ B, f ∈ F               (2.7l)
           p_n ∈ {0, 1}                                    ∀n ∈ B ∪ T                  (2.7m)
           z^i_{a(n),n}, z^i_{n,t} ∈ {0, 1}                ∀n ∈ B ∪ T , i ∈ I,         (2.7n)
where P(n) is the set of all ancestors of node n ∈ B ∪ T , i.e., the set of all nodes lying on the unique path from the root node to node n, and λ ∈ [0, 1] is a regularization parameter. An explanation of the new/modified problem constraints is as follows. Constraints (2.7b) imply that at any node n ∈ B we either branch on a feature f (if Σ_{f∈F} b_{nf} = 1), predict a label (if p_n = 1), or get pruned if a prediction is made at one of the node’s ancestors (i.e., if Σ_{m∈P(n)} p_m = 1). Similarly, constraints (2.7c) ensure that any node n ∈ T is either a leaf node of the tree or is pruned. Constraints (2.7d)-(2.7j) exactly mirror constraints (2.1c)-(2.1i) in problem (2.1). They are slight modifications of the original constraints accounting for the new arcs added to the flow graph and for the possibility of making predictions at branching nodes n ∈ B. In particular, constraints (2.7j) imply that if a node n gets pruned we do not predict any class at the node, i.e., w^n_k = 0 for all k ∈ K. A penalty term is added to (2.7a) to encourage sparser trees with fewer branching decisions. Note that while it is feasible to design a decision tree with the same branching decisions at multiple nodes on a single path from root to sink, such solutions are never optimal for (2.7) if λ > 0, as a simpler tree would result in the same misclassifications.
Problem (2.7) allows for the design of regularized decision trees by augmenting this
nominal formulation with additional regularization constraints. These either limit or penalize
the number of nodes that can be used for branching or place a lower bound on the number
of datapoints that land on each leaf. We detail these variants in the following. All of these
variants either explicitly or implicitly restrict the tree size, thereby mitigating the risk of
overfitting and resulting in more interpretable trees.
Sparsity. A commonly used approach to restrict tree size is to add sparsity constraints which restrict the number of branching nodes, see e.g., [5, 7, 18], [30], and [89]. Such constraints are usually met in traditional methods by performing a post-processing pruning step on the learned (unrestricted) tree. We enforce this restriction by adding the constraint

Σ_{n∈B} Σ_{f∈F} b_{nf} ≤ C   (2.8)

to problem (2.7), and possibly setting λ = 0. This constraint ensures that the learned tree has at most C branching nodes (thus at most C + 1 leaf nodes), where C is a hyper-parameter that can be tuned using cross-validation, see e.g., [90]. We note that due to the non-convexity introduced by the integer variables in problem (2.7), the constrained sparsity formulation provides more options than the penalized version, see e.g., [91]. In other words, for any choice of λ ∈ [0, 1] in the penalized version, there exists a choice of C ∈ {0, . . . , 2^d} in the constrained version that yields the same solution, but the converse is not necessarily true. The penalized version is more common in the machine learning literature and we will thus use it for benchmarking purposes in our numerical results, see Section 2.5.
Maximum Number of Features to Use. One can also constrain the total number C of features branched on in the decision tree, as in [30], by adding the constraints

Σ_{f∈F} b_f ≤ C   and   b_f ≥ b_{nf}   ∀n ∈ B, f ∈ F   (2.9)

to problem (2.7), where b_f ∈ {0, 1}, f ∈ F, are decision variables that indicate if feature f is used in the tree. Parameter C can be a user-specified input or tuned via cross-validation. Note that neither integrality nor bound constraints on the variables b_f need to be enforced in the formulation.
Minimum Number of Datapoints in each Leaf Node. Another popular regularization approach, see e.g., [18], consists in placing a lower bound on the number of datapoints that land in each leaf. To ensure that each leaf node contains at least N_min datapoints, we impose

Σ_{i∈I} z^i_{a(n),n} ≥ N_min p_n   ∀n ∈ B ∪ T   (2.10)

in problem (2.7).
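In a gurobipy implementation of (2.7), the three variants above amount to only a handful of extra constraints. The sketch below assumes the model m, the variable containers b, p, z (indexed as in the formulation), the index sets B, T, F, I, and the parent map a(·) have already been built; C_branch, C_feat, and N_min are the corresponding hyper-parameters.

import gurobipy as gp

def add_regularizers(m, b, p, z, B, T, F, I, a, C_branch, C_feat, N_min):
    # Sparsity (2.8): at most C_branch branching nodes in the learned tree.
    m.addConstr(gp.quicksum(b[n, f] for n in B for f in F) <= C_branch)

    # Maximum number of features (2.9): b_feat[f] = 1 if feature f is used.
    # Per the discussion above, integrality and bounds need not be enforced.
    b_feat = m.addVars(F, name="b_feat")
    m.addConstr(b_feat.sum() <= C_feat)
    m.addConstrs((b_feat[f] >= b[n, f] for n in B for f in F), name="link")

    # Minimum leaf size (2.10): every leaf receives at least N_min datapoints.
    m.addConstrs((gp.quicksum(z[i, a(n), n] for i in I) >= N_min * p[n]
                  for n in list(B) + list(T)), name="min_leaf")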
2.4.2 Imbalanced Datasets
A dataset is called imbalanced when the class distribution is not uniform, i.e., when the number of datapoints in each class varies significantly from class to class. In the case when a dataset consists of two classes, i.e., K := {k, k′}, we say that it is imbalanced if

|{i ∈ I : y^i = k}| ≫ |{i ∈ I : y^i = k′}|,

in which case k and k′ are referred to as the majority and minority classes, respectively. In imbalanced datasets, predicting the majority class for all datapoints results in high accuracy, and thus decision trees that maximize prediction accuracy without accounting for the imbalanced nature of the data perform poorly on the minority class. Imbalanced datasets occur in many important domains, e.g., to predict landslides, to detect fraud, or to predict if a patient has cancer, see e.g., [92, 93], and [94]. Naturally, being able to predict the minority class(es) accurately is crucial in such settings.
In the following we propose adjustments to the core formulation (2.7) to ensure meaningful
decision trees are learned even in the case of imbalanced datasets. These adjustments require
calculating metrics such as true positives, true negatives, false positives, and false negatives
or some function of them such as recall (fraction of all datapoints from the positive class that
are correctly identified) and precision (the portion of correctly classified datapoints from the
positive class out of all datapoints with positive predicted class). Thus, learning meaningful
decision trees requires us to track all datapoints, not only the correctly classified ones as done
in formulations (2.1) and (2.7).
To this end, we propose to replace the single sink in the flow graph associated with an imbalanced decision tree, see Definition 5, with |K| sink nodes denoted by t_k, one for each class k ∈ K. Each sink node t_k, k ∈ K, is connected to all nodes n ∈ B ∪ T and collects all datapoints (both correctly and incorrectly classified) with predicted class k.
A formal definition for the flow graph associated with a decision tree that can be used to
train imbalanced decision trees and that can track all datapoints is as follows.
Definition 6 (Complete Flow Graph of an Imbalanced Decision Tree) Given a decision tree of depth d, we define the directed complete flow graph G = (V, A) associated with the tree, which can be used to learn imbalanced decision trees of maximum depth d and tracks all datapoints through the graph, as follows. Let V := {s} ∪ {t_k : k ∈ K} ∪ B ∪ T be the vertices of the flow graph. Given n ∈ B, let ℓ(n), r(n), and a(n) be as in Definition 5 and let

A := {(n, ℓ(n)) : n ∈ B} ∪ {(n, r(n)) : n ∈ B} ∪ {(s, 1)} ∪ {(n, t_k) : n ∈ B ∪ T , k ∈ K}

be the arcs of the graph.
We are now ready to formulate the variant of problem (2.7) that tracks all datapoints through the flow graph with sink nodes t_k, k ∈ K. In a way that parallels formulation (2.7), we introduce auxiliary variables z^i_{n,t_k} ∈ {0, 1} for each node n ∈ B ∪ T and each class k ∈ K to track the flow of datapoints (both correctly classified and misclassified) to the sink node t_k. Our formulation reads
maximize   Σ_{i∈I} Σ_{n∈B∪T} z^i_{n,t_{y^i}}                                          (2.11a)
subject to Σ_{f∈F} b_{nf} + p_n + Σ_{m∈P(n)} p_m = 1       ∀n ∈ B                     (2.11b)
           p_n + Σ_{m∈P(n)} p_m = 1                         ∀n ∈ T                     (2.11c)
           z^i_{a(n),n} = z^i_{n,ℓ(n)} + z^i_{n,r(n)} + Σ_{k∈K} z^i_{n,t_k}   ∀n ∈ B, i ∈ I   (2.11d)
           z^i_{a(n),n} = Σ_{k∈K} z^i_{n,t_k}               ∀n ∈ T , i ∈ I             (2.11e)
           z^i_{s,1} = 1                                    ∀i ∈ I                     (2.11f)
           z^i_{n,ℓ(n)} ≤ Σ_{f∈F : x^i_f = 0} b_{nf}        ∀n ∈ B, i ∈ I              (2.11g)
           z^i_{n,r(n)} ≤ Σ_{f∈F : x^i_f = 1} b_{nf}        ∀n ∈ B, i ∈ I              (2.11h)
           z^i_{n,t_k} ≤ w^n_k                              ∀n ∈ B ∪ T , k ∈ K, i ∈ I  (2.11i)
           Σ_{k∈K} w^n_k = p_n                              ∀n ∈ B ∪ T                 (2.11j)
           w^n_k ∈ {0, 1}                                   ∀n ∈ B ∪ T , k ∈ K         (2.11k)
           b_{nf} ∈ {0, 1}                                  ∀n ∈ B, f ∈ F              (2.11l)
           p_n ∈ {0, 1}                                     ∀n ∈ B ∪ T                 (2.11m)
           z^i_{a(n),n}, z^i_{n,t_k} ∈ {0, 1}               ∀n ∈ B ∪ T , i ∈ I, k ∈ K. (2.11n)
An interpretation of the problem constraints is as follows. Constraints (2.11b)-(2.11j) exactly mirror constraints (2.7b)-(2.7j) in problem (2.7), with three main differences. Flow conservation constraints (2.11d) and (2.11e) now allow flow to be directed to any one of the sink nodes t_k from any one of the nodes n ∈ B ∪ T . Constraints (2.11f) ensure that the flow incoming into the source and associated with each datapoint equals 1. Constraints (2.11i) stipulate that a datapoint is only allowed to be directed to the sink corresponding to the class predicted at the leaf where the datapoint landed. Constraints (2.11d)-(2.11f) and (2.11i) together ensure that all datapoints get routed to the sink associated with their predicted class. As a result of these changes, sink node t_k collects all datapoints, whether correctly classified or not, with predicted class k ∈ K. The objective (2.11a) is updated to reflect that a datapoint is correctly classified if and only if it flows to sink t_{y^i}.
At a feasible solution to problem (2.11), the quantity Σ_{i∈I} z^i_{a(n),n} represents the total number of datapoints that land at node n. The number of datapoints of class k ∈ K that are correctly (resp. incorrectly) classified is expressible as Σ_{i∈I : y^i = k} Σ_{n∈B∪T} z^i_{n,t_k} (resp. |{i ∈ I : y^i = k}| − Σ_{i∈I : y^i = k} Σ_{n∈B∪T} z^i_{n,t_k}), while the total number of datapoints for which we predict class k can be written as Σ_{i∈I} Σ_{n∈B∪T} z^i_{n,t_k}.
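These quantities are easy to extract once an instance of (2.11) has been solved. As a small sketch (assuming z is a gurobipy tupledict indexed by (i, n, k) for the arcs into sink t_k, with solution values accessed via the .X attribute):

def class_counts(z, I, y, nodes, k):
    # Return (correct, predicted) counts for class k at a solution of (2.11).
    reach_k = lambda i: sum(z[i, n, k].X for n in nodes)  # 1 iff i is predicted k
    correct = sum(reach_k(i) for i in I if y[i] == k)     # true predictions of k
    predicted = sum(reach_k(i) for i in I)                # all predictions of k
    return correct, predicted

For instance, the recall of class k is correct / |{i ∈ I : y^i = k}| and its precision is correct / predicted, which is exactly how the metrics constrained in the remainder of this section are computed.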
Problem (2.11) allows us to learn meaningful decision trees for the case of imbalanced
data by modifying the objective function of the problem and/or by augmenting this nominal
formulation with constraints. We detail these variants in what follows.
Balanced Accuracy. A common approach to handle imbalanced datasets is to optimize the so-called balanced accuracy, which averages the accuracy across classes. Intuitively, in the case of two classes, it corresponds to the average of the true positive and true negative rates, see e.g., [95]. It can also be viewed as the average of sensitivity and specificity. Maximizing balanced accuracy can be achieved by replacing the objective function of problem (2.11) with

(1/|K|) Σ_{k∈K} (1/|{i ∈ I : y^i = k}|) Σ_{i∈I : y^i = k} Σ_{n∈B∪T} z^i_{n,t_k}.

Each term in the summation above represents the proportion of datapoints from class k that are correctly classified. Having high balanced accuracy is important when dealing with imbalanced data and correctly predicting all classes is (equally) important.
Worst-Case Accuracy. As a variant to optimizing the average accuracy across classes, we propose to maximize the worst-case (minimum) accuracy. This can be achieved by replacing the objective function of problem (2.11) with

min_{k∈K} (1/|{i ∈ I : y^i = k}|) Σ_{i∈I : y^i = k} Σ_{n∈B∪T} z^i_{n,t_k}.

To the best of our knowledge, this idea has not been proposed in the literature. Compared to maximizing the average accuracy, this objective attaches more importance to the class that is “worse-off.” Having high worst-case accuracy is important when dealing with imbalanced data and correctly predicting the class that is hardest to predict is critical. This is the case, for example, when diagnosing cancer, where positive cases are hard to identify and a large false negative rate (i.e., a low true positive rate) could cost a patient’s life. There are also related works on robust classification where the objective happens to be the maximization of worst-case accuracy over noisy features/labels, see [96], or over adversarial examples, see [97].
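We note in passing that this max-min objective can be brought into MIO form with a standard epigraph reformulation: introduce an auxiliary continuous variable η and solve

maximize η   subject to   η ≤ (1/|{i ∈ I : y^i = k}|) Σ_{i∈I : y^i = k} Σ_{n∈B∪T} z^i_{n,t_k}   ∀k ∈ K,

together with the constraints of problem (2.11).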
In the remainder of this section we focus on the case of binary classification where K = {0, 1}, and we refer to k = 1 (resp. k = 0) as the positive (resp. negative) class. In this setting we can discuss metrics such as recall, precision, sensitivity, and specificity more conveniently. Naturally, these definitions can be generalized to cases with more classes.
Constraining Recall. An important metric in classification when dealing with imbalanced datasets is recall (also referred to as sensitivity), which is the fraction of all datapoints from the positive class that are correctly identified. Guaranteeing a certain level C ∈ [0, 1] of recall can be achieved by augmenting problem (2.11) with the constraint

(1/|{i ∈ I : y^i = 1}|) Σ_{i∈I : y^i = 1} Σ_{n∈B∪T} z^i_{n,t_1} ≥ C.

Decreasing the number of false negatives (datapoints from the positive class that are incorrectly predicted) increases recall. Thus, in cases where false negatives can have dramatic consequences and even cost lives, such as in cancer diagnosis or when predicting landslides, guaranteeing a certain level of recall can help mitigate such risks.
Constraining Precision. Our method can conveniently be used to learn decision trees that have sufficiently high precision, defined as the portion of correctly classified datapoints from the positive class out of all datapoints with positive predicted class. This can be achieved by augmenting problem (2.11) with the constraint

Σ_{i∈I : y^i = 1} Σ_{n∈B∪T} z^i_{n,t_1} ≥ C Σ_{i∈I} Σ_{n∈B∪T} z^i_{n,t_1},   (2.12)

where C ∈ [0, 1] is a hyper-parameter representing the minimum acceptable precision that can be tuned via cross-validation. Decreasing the number of false positives (datapoints from the negative class that are incorrectly classified) increases precision. Thus, constraining precision is useful in settings where having a low number of false negatives is not as important as having a low number of false positives, such as when making product recommendations.
Balancing Sensitivity and Specificity. [60] address the issue of imbalanced data by maximizing sensitivity (true positive rate) in the objective while guaranteeing a certain level of specificity (true negative rate), or the converse, instead of optimizing the total accuracy. In the case of binary classification where K = {0, 1}, we can constrain specificity from below by augmenting problem (2.11) with the constraint

(1/|{i ∈ I : y^i = 0}|) Σ_{i∈I : y^i = 0} Σ_{n∈B∪T} z^i_{n,t_0} ≥ C,

and maximize sensitivity by replacing its objective function with

(1/|{i ∈ I : y^i = 1}|) Σ_{i∈I : y^i = 1} Σ_{n∈B∪T} z^i_{n,t_1}.

This method is useful in settings such as infectious disease testing, wherein we want to maximize the chance of correctly predicting someone to be infectious while making sure to identify non-infectious individuals with a high enough confidence.
2.4.3 Learning Fair Decision Trees
In recent years, machine learning algorithms have increasingly been used to assist decision-making in socially sensitive, high-stakes domains. For example, they are used to help decide who to give access to credit, benefits, and public services, see e.g., [2] and [33], to help guide policing, see e.g., [98], or to assist with screening decisions for jobs/college admissions, see e.g., [99]. Yet, decision-making systems based on standard machine learning algorithms may result in discriminatory decisions, as they may treat or impact individuals unequally based on certain characteristics, often referred to as protected or sensitive, including but not limited to age, disability, ethnicity, gender, marital status, national origin, race, religion, and sexual orientation, see e.g., [22], [100], and [101].
Over the last decade, so-called “in-process” fair machine learning algorithms have been
proposed as a way to mitigate bias in standard ML. These methods incorporate a fairness
notion, such as statistical parity [22] or equalized odds [23] in the training step, either
penalizing or constraining discrimination. Notably, several in-process approaches have been
proposed for learning fair decision trees. The vast majority of these methods are based on
heuristics. For example, [24] and [25] propose to augment the splitting criterion of CART and
of the Hoeffding Tree algorithm, respectively, with a regularizer to promote statistical parity
and accuracy. [26] propose a fair gradient boosting algorithm to design fair decision trees that
satisfy either equalized odds or statistical parity. [27] introduce a genetic algorithm for training
decision trees that maximize both accuracy and robustness to adversarial perturbations while accounting for individual fairness. In contrast with the aforementioned works, [30] propose an MIO-based formulation involving a fairness regularizer in the objective aimed at mitigating
disparate treatment and disparate impact, see [101]. While this approach has proved effective
relative to heuristics, it is based on a weak formulation and is therefore slow to converge, see
Section 2.1.
In this section, we discuss how our stronger formulation (2.11) can be augmented with
fairness constraints to learn optimal and fair decision trees satisfying some of the most
common fairness requirements in the literature. We refer the interested reader to the book
by [102] and to the survey papers of [103], [104], and [105] and to the references therein for
in-depth reviews of the literature on the topic of fair ML including analyses on the relative
merits of various fairness metrics.
Throughout this section, and for ease of exposition, we consider the case of binary classification where K = {0, 1}. We let k = 1 correspond to the positive outcome. For example, in a hiring problem where we want to decide whether to interview someone or not, being invited for an interview is regarded as the positive outcome. We let p^i ∈ Λ_p denote the value of the protected feature(s) of datapoint i, where Λ_p denotes the set of all possible levels of the protected feature(s). Depending on whether branching on the protected feature is allowed or not, p^i can either be included or excluded as an element of the feature vector x^i, see e.g., [106] and [107].
Statistical Parity. A classifier satisfies statistical parity if the probability of predicting the positive outcome is similar across all the protected groups, see [22]. For example, in the hiring problem mentioned above, it may be appropriate to impose that the probability of receiving an interview should be similar across genders if people from different genders are equally likely to be meritorious. Augmenting model (2.11) with the following constraints ensures that the decision tree learned by our MIO formulation satisfies statistical parity up to a bias δ:

(1/|{i ∈ I : p^i = p}|) Σ_{n∈B∪T} Σ_{i∈I : p^i = p} z^i_{n,t_1} − (1/|{i ∈ I : p^i = p′}|) Σ_{n∈B∪T} Σ_{i∈I : p^i = p′} z^i_{n,t_1} ≤ δ   ∀p, p′ ∈ Λ_p : p ≠ p′,

where the first and second terms correspond to empirical estimates of the conditional probabilities of predicting the positive outcome given p and p′, respectively.
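As a sketch of how such a group-coupling constraint enters a gurobipy model of (2.11) (with z indexed by (i, n, k) as above and prot a hypothetical array holding p^i; each group is assumed non-empty), note that imposing the inequality for every ordered pair (p, p′) captures both signs of the difference:

import gurobipy as gp
from itertools import permutations

def add_statistical_parity(m, z, I, nodes, prot, levels, delta):
    # Statistical parity up to bias delta for a model of (2.11).
    def pos_rate(p):  # empirical probability of a positive prediction in group p
        group = [i for i in I if prot[i] == p]
        return (1.0 / len(group)) * gp.quicksum(
            z[i, n, 1] for i in group for n in nodes)
    for p, q in permutations(levels, 2):  # both orderings, so |.| <= delta holds
        m.addConstr(pos_rate(p) - pos_rate(q) <= delta)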
Conditional Statistical Parity. A classifier satisfies conditional statistical parity across protected groups if the probability of predicting the positive outcome is similar between all groups conditional on some feature, e.g., a legitimate feature that can justify differences across protected groups, see [108]. We let l^i ∈ Λ_l denote the value of the feature(s) of datapoint i that can legitimize differences, where Λ_l denotes the set of all possible levels of the legitimate feature(s). For example, in the problem of matching people experiencing homelessness to scarce housing resources, it is natural to require that the probability of receiving a resource among all individuals with the same vulnerability (a risk score) should be similar across genders, races, or other protected attributes. By adding the following constraints to model (2.11), we can ensure our learned trees satisfy conditional statistical parity up to a bias δ given any value l ∈ Λ_l:

(1/|{i ∈ I : p^i = p ∧ l^i = l}|) Σ_{n∈B∪T} Σ_{i∈I} I(p^i = p ∧ l^i = l) z^i_{n,t_1} − (1/|{i ∈ I : p^i = p′ ∧ l^i = l}|) Σ_{n∈B∪T} Σ_{i∈I} I(p^i = p′ ∧ l^i = l) z^i_{n,t_1} ≤ δ   ∀p, p′ ∈ Λ_p : p ≠ p′, l ∈ Λ_l,

where the first and second terms correspond to empirical estimates of the conditional probabilities of predicting the positive outcome given p and p′, respectively, conditional on l.
Predictive Equality. A classifier satisfies predictive equality if it results in the same false positive rates across protected groups, see [109]. For example, this fairness notion may be useful when using machine learning to predict if a convicted person will recidivate so as to decide if it is appropriate to release them on bail, see [100]. Indeed, predictive equality in this context requires that, among defendants who would not have gone on to recidivate if released, detention rates should be similar across all races. Adding the following constraints to model (2.11) ensures that the learned decision trees satisfy predictive equality, up to a constant δ:

(1/|{i ∈ I : p^i = p ∧ y^i = 0}|) Σ_{i∈I} Σ_{n∈B∪T} I(p^i = p ∧ y^i = 0) z^i_{n,t_1} − (1/|{i ∈ I : p^i = p′ ∧ y^i = 0}|) Σ_{i∈I} Σ_{n∈B∪T} I(p^i = p′ ∧ y^i = 0) z^i_{n,t_1} ≤ δ   ∀p, p′ ∈ Λ_p : p ≠ p′,

where the first and second terms are the estimated false positive rates given p and p′, respectively; imposing the constraint for every ordered pair p ≠ p′ bounds the absolute difference by δ.
Equalized Odds. A classifier satisfies equalized odds if the predicted outcome and the protected feature are independent conditional on the outcome, see [23]. In other words, equalized odds requires the same true positive rates and the same false positive rates across protected groups. For example, in the college admissions process, equalized odds requires that no matter the applicant’s gender, if they are qualified (or unqualified), they should get admitted at equal rates. By adding the following constraints to model (2.11) we ensure the decision trees learned by our MIO formulation satisfy equalized odds up to a constant δ:

(1/|{i ∈ I : p^i = p ∧ y^i = k}|) Σ_{n∈B∪T} Σ_{i∈I} I(p^i = p ∧ y^i = k) z^i_{n,t_1} − (1/|{i ∈ I : p^i = p′ ∧ y^i = k}|) Σ_{n∈B∪T} Σ_{i∈I} I(p^i = p′ ∧ y^i = k) z^i_{n,t_1} ≤ δ   ∀k ∈ K, p, p′ ∈ Λ_p : p ≠ p′,

where the first and second terms correspond to empirical estimates of the conditional probabilities of predicting the positive outcome given (p, y = k) and (p′, y = k), respectively. If we relax the above constraints to only hold for k = 1, we achieve equal opportunity up to a constant δ, see [23].
All the introduced nonlinear constraints can be linearized using standard MIO techniques; we leave this task to the reader. Our approach can also impose fairness by means of regularization and can model more sophisticated fairness metrics such as the ones of [30]. We leave these extensions to the reader as well.
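For example, each fairness constraint above has the form |A − B| ≤ δ with A and B linear in z; this is equivalent to the pair of linear constraints A − B ≤ δ and B − A ≤ δ, which is precisely what imposing the inequality over all ordered pairs p ≠ p′ achieves. For the regularized variant, one may instead introduce a variable s with s ≥ A − B and s ≥ B − A and subtract μs from the objective, where μ ≥ 0 is a (hypothetical) penalty weight trading off accuracy and fairness.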
2.4.4 Solution Approach for Generalizations
We now briefly discuss how to solve formulation (2.7) (for learning imbalanced decision trees) and formulation (2.11) (which tracks all datapoints through the flow graph), as well as their variants introduced in Sections 2.4.1 through 2.4.3.
The nominal formulations (2.7) and (2.11) and all their variants augmented with constraints that do not couple datapoints with one another (e.g., sparsity constraints (2.8) or
interpretability constraints (2.9)) can be effectively solved using Benders’ decomposition. This
can be achieved by adapting Algorithm 1 from Section 2.3. On the other hand, when these
formulations are augmented with constraints that do couple datapoints with one another (like
the fairness constraints from Section 2.4.3), Benders’ decomposition is no longer applicable.
In such cases, these formulations need to be solved directly as monolithic MIO problems.
The Benders’ decomposition approach for solving formulation (2.7) and the corresponding
variants is provided in Appendix A.5. The derivation of the Benders’ decomposition for
formulation (2.11) and the corresponding variants is left for the reader.
2.5 Experiments
In the following section, we discuss the datasets we use in our experiments, the approaches
that we compare to, the experimental setup, and our findings. More extensive numerical
results are included in the appendix.
2.5.1 Benchmark Approaches and Datasets
Our numerical study comprises two sets of experiments: one on datasets with only categorical features and one on datasets with a mixture of categorical and real-valued features. We now describe both sets of data and the approaches we compare to in each case.
Datasets with only Categorical Features. In the first part of our experiments we use
all twelve publicly available datasets with only categorical features from the UCI data repository [110] as detailed in Table 2.1. We compare the flow-based formulation (FlowOCT) given
in problem (2.7) and its Benders’ decomposition (BendersOCT) described in Appendix A.5
to the univariate splits formulations proposed by [18] (OCT) and [19] (BinOCT). We also
compare to the heuristic algorithm based on local search for solving OCT proposed by [67]
(LST). As the code used for OCT is not publicly available, we implemented the corresponding
formulation (adapted to the case of binary data). The details of this implementation are
given in Appendix A.2. We also used the Python implementation of LST which is available
for academic use. In LST, we set the complexity parameter ‘cp’ to 0 which is analogous to
setting λ = 0. In order to make the optimality gap comparable across approaches, we made some adjustments to the objective function of FlowOCT and BendersOCT, subtracting the total number of datapoints from the objective so that, for all approaches, the objective value reflects the number of misclassified datapoints. The results for the worst-case accuracy
objective discussed in Section 2.4.2 can be found in Appendix A.7.2. We also numerically
analyze the strength of all formulations by looking at their LO relaxation, see Appendix A.7.3.
Analysis of some implementation variants of BendersOCT, see Remark 2, can be found in Appendix A.7.1.
Table 2.1: Benchmark datasets with only categorical features, along with their number of rows
(|I|), number of features (|F|), and number of classes (|K|).
Dataset |I| |F| |K|
soybean-small 47 45 4
monk3 122 15 2
monk1 124 15 2
hayes-roth 132 15 3
monk2 169 15 2
house-votes-84 232 16 2
spect 267 22 2
breast-cancer 277 38 2
balance-scale 625 20 3
tic-tac-toe 958 27 2
car-evaluation 1728 20 4
kr-vs-kp 3196 38 2
Datasets with Mixed Features. In the second part of our experiments, we use 28
publicly available datasets with both categorical and real-valued features from the UCI data
repository as detailed in Table 2.2. In this part, we compare BendersOCT to OCT. We did not
include BinOCT as it cannot handle real-valued features. We also did not include FlowOCT as it
is outperformed by BendersOCT. To run BendersOCT on these datasets, we first discretize the
real-valued features into 5 and 10 buckets (quantiles) and then one-hot encode the discretized
columns. We refer to the version of BendersOCT which we run on the discretized data with 5
(resp. 10) buckets as BendersOCT-5 (resp. BendersOCT-10). As OCT can handle real-valued
features without special encoding, we use the original format of the datasets for OCT. For
both OCT and BendersOCT the categorical features are one-hot encoded.
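The bucketing used to produce the inputs of BendersOCT-5 and BendersOCT-10 can be reproduced along the following lines (a sketch with pandas; per Remark 1 the encoding is cumulative, so that univariate splits act as thresholds):

import pandas as pd

def discretize(col: pd.Series, n_buckets: int) -> pd.DataFrame:
    # Quantile-bucket a real-valued column and cumulatively one-hot encode it.
    # duplicates="drop" guards against columns with few distinct values.
    codes = pd.qcut(col, q=n_buckets, labels=False, duplicates="drop")
    levels = sorted(pd.unique(codes))[:-1]   # the topmost bucket is redundant
    return pd.DataFrame(
        {f"{col.name}<=q{v}": (codes <= v).astype(int) for v in levels})

# e.g., binary = pd.concat([discretize(df[c], 5) for c in real_cols], axis=1)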
2.5.2 Experimental Setup
For each dataset, we create 5 random splits of the data each consisting of a training set
(50%), a calibration set (25%) used to calibrate the hyperparameters, and a test set (25%).
For each split, for each depth d ∈ {2, 3, 4, 5}, and for each choice of regularization parameter
λ ∈ {0, 0.1, 0.2, . . . 0.9}, we train a decision tree. For any given depth we calibrate λ on the
calibration set. Having the best λ for any given split and depth, we train a decision tree on
Table 2.2: Benchmark datasets with mixed features, along with their number of rows (|I|),
number of features (|F|), and number of classes (|K|).
Dataset   |I|   |F| for OCT   |F| for BendersOCT-5   |F| for BendersOCT-10   |K|
echocardiogram 61 8 32 61 2
hepatitis 80 19 43 68 2
fertility 100 20 28 28 2
iris 150 4 20 38 3
wine 178 13 65 130 3
planning-relax 182 12 60 120 2
breast-cancer-prognostic 194 33 164 321 2
parkinsons 195 22 110 218 2
connectionist-bench-sonar 208 60 300 600 2
seeds 210 7 35 70 3
cylinder-bands 277 257 314 370 2
heart-cleveland 297 22 44 68 5
ionosphere 351 33 157 298 2
thoracic-surgery 470 27 39 54 2
climate 540 18 90 180 2
breast-cancer-diagnostic 569 30 150 300 2
indian-liver-patient 579 10 45 88 2
credit-approval 653 42 64 89 2
blood-transfusion 748 4 20 36 2
diabetes 768 8 39 75 2
qsar-biodegradation 1055 41 139 234 2
banknote-authentication 1372 4 20 40 2
ozone-level-detection-one 1848 72 358 714 2
image-segmentation 2310 18 82 164 7
seismic-bumps 2584 19 48 73 2
thyroid-disease-ann-thyroid 3772 21 45 74 3
spambase 4601 57 108 190 2
wall-following-robot-2 5456 24 116 228 4
the union of the training and calibration sets and report the out-of-sample performance on
the test set.
All approaches are implemented in the Python programming language and solved using Gurobi
8.1, see [62]. All problems are solved on a single core of SL250s Xeon CPUs and 4GB of
memory with a 60-minute time limit. Our implementation of FlowOCT and BendersOCT
can be found online at https://github.com/D3M-Research-Group/StrongTree along with
instructions and is freely distributed for academic and non-profit use. In what follows,
we discuss the computational efficiency and statistical performance of the different MIO
approaches.
2.5.3 Results on Categorical Datasets
In-sample (Optimization) Performance. Figure 2.4 summarizes the in-sample performance of all methods. Detailed results are provided in Appendix A.6. From the time axis of
Figure 2.4 (left), we observe that for the case of balanced decision trees, BinOCT and OCT are
able to solve 122 instances (out of 240) within the time limit, but BendersOCT solves the same number of instances in only 125 seconds, resulting in a 3600/125 ≈ 29× speedup. Similarly, from
the time axis of Figure 2.4 (right), it can be seen that in the case of imbalanced decision trees,
OCT is able to solve 1087 instances (out of 2400) within the time limit, while BendersOCT
requires only 71 seconds to do so, resulting in a 51× speedup. BinOCT’s implementation
does not allow for imbalanced decision trees, so it is excluded from Figure 2.4 (right). From
the optimality gap axis of Figure 2.4 (left), we observe that BendersOCT and FlowOCT both
achieve better optimality gaps than either of BinOCT or OCT. Similarly, from the optimality
gap axis of Figure 2.4 (right), we observe the smaller optimality gap of BendersOCT and
FlowOCT compared to OCT, when we optimize over imbalanced decision trees.
Our experiments also demonstrate the limitations of our approaches. We observe that
BendersOCT successfully solves almost all the MIO instances up to the dataset “spect”
(consisting of 267 datapoints and 22 features) within the 1-hour time limit, regardless of
the chosen depth. However, when dealing with larger datasets like “kr-vs-kp” (consisting
of 3196 datapoints and 38 features), we are unable to find the optimal solution for depths
greater than 3. For instance, when considering a depth of 5 for the “kr-vs-kp” dataset, we
observe an average optimality gap of 93%. It is worth noting that our approach achieves an
out-of-sample accuracy of 89% in this instance, surpassing the performance of both BinOCT
(87%) and OCT (66%).
Figure 2.4: The left (resp. right) figure shows for balanced (resp. imbalanced) decision trees the number of instances solved to optimality by each approach within a given time on the time axis, and the number of instances with optimality gap no larger than each given value at the time limit on the optimality gap axis.
Out-of-sample Performance. Table 2.3 summarizes the out-of-sample performance of all
methods. Detailed results are reported in Appendix A.6. From the table we observe that
the better optimization performance translates to superior out-of-sample properties as well:
out of 48 instances (average accuracy across 5 samples for each dataset and depth given
the calibrated λ), OCT achieves the best out-of-sample accuracy in 7 instances (excluding
ties), BinOCT in 8, while the new formulation BendersOCT (resp. FlowOCT) achieves the
best accuracy in 9 (resp. 8) instances. BendersOCT (resp. FlowOCT) improves out-of-sample
accuracy with respect to BinOCT and OCT by up to 8% (resp. 7%) and 36% (resp. 21%),
respectively.
Table 2.3: Summary of the out-of-sample performance of all methods on categorical datasets.

Approach     Best accuracy (out of 48, excluding ties)   Avg. accuracy   Max accuracy improvement
OCT          7    0.76 ± 0.14   -
BinOCT       8    0.79 ± 0.13   -
FlowOCT      8    0.79 ± 0.13   7% (21%) w.r.t. OCT (BinOCT)
BendersOCT   9    0.80 ± 0.14   8% (36%) w.r.t. OCT (BinOCT)
Comparing with LST. We compare BendersOCT (with λ = 0) to the state-of-the-art
approach LST –which is based on local search– on 240 MIO instances (which consist of all 12
datasets, 5 splits per dataset, and 4 different depths). On the one hand, LST is much faster,
requiring only seconds to find a local optimum (without any guarantee of optimality). The average
solving time among all 240 instances for LST (resp. BendersOCT) is 0.52 (resp. 1539) seconds.
On the other hand, BendersOCT can solve 143 of the instances to provable optimality: out of
those, LST is able to find an optimal solution (without certificate) in 117, and produces a
suboptimal (by 1% in average in-sample accuracy) solution in the remaining 26. The average
solving time among the 117 instances that both approaches solve to optimality, for LST (resp.
BendersOCT) is 0.26 (resp. 80) seconds. Overall, out of the 240 instances (including those
not solved to optimality), BendersOCT produces a better solution in 67 instances (by 1% in
average in-sample accuracy) while LST outperforms BendersOCT in 37 instances (by 2% in
average in-sample accuracy), with the remaining 136 being tied between the two methods.
Thus, we conclude that LST is a method able to deliver high-quality solutions very fast;
however, if enough computational resources are available, the exact BendersOCT is able to
deliver better solutions overall. Table 2.4 reports a summary of these results. Detailed results
are provided in Appendix A.6.
2.5.4 Results on Mixed-Feature Datasets
In-sample Performance. First, we point out that OCT suffers from numerical issues in over half of the instances tested. In our experiments, we found that while the formulations described in [18] are correct if solved in exact (real) arithmetic, they may fail to produce

Table 2.4: Summary of the comparison of the in-sample results of LST vs. BendersOCT on categorical datasets.

Metric                                      LST                         BendersOCT
Avg. solving time                           0.52 s                      1539 s
Avg. in-sample accuracy                     89%                         89%
Instances solved to optimality              117 (without certificate)   143
Avg. solving time (117 optimal instances)   0.26 s                      80 s
Better solutions (excluding ties)           37                          67
optimal solutions with solvers working with numerical tolerances. In particular, in our experiments with Gurobi, we found that in many cases the solver terminates with a solution it claims is optimal (often in seconds, with a solution that allegedly correctly classifies all points), but upon further inspection the tree described by the decision variables (b, w)
misclassifies most of the points. A detailed discussion on this matter, along with an example,
can be found in Appendix A.3. Thus, we do not report computational times of OCT (since
several instances that are solved very fast are in fact considerably suboptimal), but we do
report out-of-sample performance corresponding to the solution found by the solver.
Figure 2.5 summarizes the in-sample performance of BendersOCT-5 and BendersOCT-10.
Detailed results can be found in Appendix A.6. From the figure, we observe that adding
more buckets to the discretization process increases the computational time. This outcome
is expected since adding more features leads to a linear growth in the size of the MIO
formulation. When comparing average in-sample accuracy, using 10 buckets does not offer an
advantage over 5 buckets; in fact, it slightly underperforms. For balanced (resp. imbalanced)
decision trees, utilizing 5 buckets improves in-sample accuracy from 0.8872 to 0.8914 (resp.
from 0.8594 to 0.8614), while reducing computational time by 7% (resp. 9%). Note that using
10 buckets in large instances may hamper solvers, and thus using just five buckets results in
better feasible solutions found within the time limit.
Out-of-sample Performance. Table 2.5 summarizes the out-of-sample results on the mixed-feature datasets. Detailed results are reported in Appendix A.6.
Figure 2.5: The left (resp. right) figure shows for balanced (resp. imbalanced) decision trees the number of instances solved to optimality by each approach on the time axis, and the number of instances with optimality gap no larger than each given value at the time limit on the optimality gap axis. OCT is not included in this figure because of its numerical instabilities due to having “little-m” constraints, which caused a discrepancy between the optimal objective value and the actual in-sample accuracy. Refer to Appendix A.3 for further information.
Table 2.5: Summary of the out-of-sample performance of various approaches on mixed-feature datasets given the calibrated λ.

Approach | Best accuracy (out of 112, excluding ties) | Avg. accuracy | Max accuracy improvement
OCT | 25 | 0.75 ± 0.18 | –
BendersOCT-5 | 37 | 0.81 ± 0.12 | 50% w.r.t. OCT
BendersOCT-10 | 34 | 0.81 ± 0.13 | 51% w.r.t. OCT
for each dataset and depth given the calibrated λ) OCT is the best method in 25 instances
(excluding ties), while BendersOCT-5 and BendersOCT-10 are better in 71. Furthermore,
BendersOCT-5 (BendersOCT-10) improves out-of-sample accuracy with respect to OCT by up
to 50% (51%). Thus, despite losing some information by discretizing the features, BendersOCT can still output higher-quality solutions than OCT. Moreover, the gains achieved by BendersOCT-10 over BendersOCT-5 are on average small, suggesting that a coarse discretization is sufficient in practice. One could justify these findings by interpreting the discretization as a form of regularization that helps avoid overfitting and thus yields better out-of-sample performance.
2.6 Conclusion
We proposed a new MIO formulation for classification trees with univariate splits that has a stronger LO relaxation than state-of-the-art MIO-based approaches. We also provided a tailored Benders' decomposition method to speed up the computations. Our experiments reveal better computational performance than state-of-the-art methods, including a 29-times (resp. 51-times) speedup when we optimize over balanced (resp. imbalanced) trees. These gains also translated into out-of-sample accuracy improvements of up to 36%. We showcased the modeling power of our framework for dealing with imbalanced datasets and for designing interpretable and fair decision trees. Our model can also act as a building block for more sophisticated predictive and prescriptive tasks. For example, variants of our method can be used to learn optimal prescriptive trees from observational data, see [111], or optimal robust classification trees, see [112].
Chapter 3
Learning Optimal and Fair Decision Trees for
Non-Discriminative Decision-Making
The increasing use of machine learning in high-stakes domains – where people’s livelihoods
are impacted – creates an urgent need for interpretable and fair algorithms. In these settings
it is also critical for such algorithms to be accurate. With these needs in mind, we propose
a mixed integer optimization (MIO) framework for learning optimal classification trees of
fixed depth that can be augmented with arbitrary domain-specific fairness constraints. We
benchmark our method against state-of-the-art approaches for building fair trees on popular
datasets; given a fixed discrimination threshold, our approach improves out-of-sample (OOS)
accuracy by 2 percentage points on average and obtains a higher OOS accuracy on 100% of
the experiments. We also incorporate various algorithmic fairness notions into our method,
showcasing its versatile modeling power that allows decision makers to fine-tune the trade-off
between accuracy and fairness.
3.1 Introduction
Discrimination refers to the unfair, unequal, or prejudicial treatment of an individual or
group based on certain characteristics, often referred to as protected or sensitive, including
age, disability, ethnicity, gender, marital status, national origin, race, religion, and sexual
orientation. Most philosophical, political, and legal discussions around discrimination assume
that discrimination is morally and ethically wrong and thus undesirable in our societies [113].
Broadly speaking, one can distinguish between two types of discrimination: disparate treatment (aka direct discrimination) and disparate impact (aka indirect discrimination). Disparate treatment consists of rules explicitly imposing different treatment on individuals that are similarly situated and differ only in their protected characteristics. Disparate impact, on the other hand, does not explicitly use sensitive attributes to decide treatment but implicitly results in systematically different handling of individuals from protected groups.
In recent years, machine learning (ML) techniques, in particular supervised learning
approaches such as classification and regression, routinely assist or even replace human
decision-making. For example, they have been used to make product recommendations [114]
and to guide the production of entertainment content [115]. More recently, such algorithms
are increasingly being used to also assist socially sensitive decision-making. For example,
they can help inform the decision to give access to credit, benefits, or public services [2],
they can help support criminal sentencing decisions [98], and assist screening decisions for
jobs/college admissions [99].
Yet, these automated data-driven tools may result in discriminative decision-making,
causing disparate treatment and/or disparate impact and violating moral and ethical standards.
First, this may happen when the training dataset is biased so that the “ground truth” is
not available. Consider for example the case of a dataset wherein individuals belonging to a
particular group have historically been discriminated upon (e.g., the dataset of a company
in which female employees are never promoted although they perform equally well to their
male counterparts who are, on the contrary, advancing their careers; in this case, the true
merit of female employees –the ground truth– is not observable). Then, the machine learning
algorithm will likely uncover this bias (effectively encoding endemic prejudices) and yield
discriminative decisions (e.g., recommend male hires). Second, machine learning algorithms
may yield discriminative decisions even when the training dataset is unbiased (i.e., even if the
“ground truth” is available). This is the case if the errors made by the system affect individuals
belonging to a category or minority differently. Consider for example a classification algorithm
for breast cancer detection that has far higher false negative rates for Blacks than for Whites
(i.e., it fails to detect breast cancer more often for Blacks than for Whites). If used for
decision-making, this algorithm would wrongfully recommend no treatment for more Blacks
than Whites, resulting in racial unfairness. In the literature, there have been numerous reports of algorithms resulting in unfair treatment, e.g., in racial profiling and redlining [116], mortgage discrimination [117], personnel selection [118], and employment [119]. Note that a “naive” approach that rids the dataset of sensitive attributes does not necessarily result in fairness, since unprotected attributes may be correlated with protected ones.
In this paper, we are motivated by the problem of using ML for decision- or policy-making in settings that are socially sensitive (e.g., education, employment, housing) given a labeled training dataset containing one (or more) protected attribute(s). The main desiderata for such a data-driven decision-support tool are:

(1) Maximize predictive accuracy: this ensures, e.g., that scarce resources (e.g., jobs, houses, loans) are allocated as efficiently as possible and that innocent (guilty) individuals are not wrongfully incarcerated (released).

(2) Ensure fairness: in socially sensitive settings, it is desirable for decision-support tools to abide by ethical and moral standards to guarantee the absence of disparate treatment and/or impact.

(3) Applicability to both classification and regression tasks: indeed, disparate treatment and disparate impact may occur whether the quantity used to drive decision-making is categorical and unordered or continuous/discrete and ordered.

(4) Applicability to both biased and unbiased datasets: since unfairness may get encoded in machine learning algorithms whether or not the ground truth is available, our tool must be able to enforce fairness in either setting.

(5) Customizable interpretability: in socially sensitive settings, decision-makers can often decide whether or not to comply with the recommendations of the automated decision-support tool; recommendations made by interpretable systems are more likely to be adhered to. Moreover, since interpretability is subjective, it is desirable that the decision-maker be able to customize the structure of the model.

Next, we summarize the state-of-the-art in related work and highlight the need for a unifying framework that addresses these desiderata.
3.1.1 Related Work
Fairness in Machine Learning. The first line of research in this domain focuses on
identifying discrimination in the data [120] or in the model [121]. The second stream of
research focuses on preventing discrimination and can be divided into three parts. First,
pre-processing approaches, which rely on modifying the data to eliminate or neutralize any
preexisting bias and subsequently apply standard ML techniques [122–124]. We emphasize
that preprocessing approaches cannot be employed to eliminate bias arising from the algorithm
itself. Second, post-processing approaches, which a-posteriori adjust the predictors learned
using standard ML techniques to improve their fairness properties [125, 126]. The third type of
approach, which most closely relates to our work, is an in-processing one. It consists of adding
a fairness regularizer to the loss function objective, which serves to penalize discrimination,
mitigating disparate treatment [22, 127, 128] or disparate impact [129, 130]. Our approach
most closely relates to the work in [130], where the authors propose a heuristic algorithm for
learning fair decision-trees for classification. They use the non-discrimination constraint to
design a new splitting criterion and pruning strategy. In our work, we propose in contrast an
exact approach for designing very general classes of fair decision-trees that is applicable to
both classification and regression tasks.
Mixed-Integer Optimization for Machine Learning. Our paper also relates to a nascent
stream of research that leverages mixed-integer programming (MIP) to address ML tasks for
which heuristics were traditionally employed [28, 131–134]. Our work most closely relates
to the work in [134] which designs optimal classification trees using MIP, yielding average
absolute improvements in out-of-sample accuracy over the state-of-the-art CART algorithm [5] in
the range 1–5%. It also closely relates to the work in [28] which introduces optimal decision
trees and showcases how discrimination aware decision trees can be designed using MIP.
Lastly, our framework relates to the approach in [135] where an MIP is proposed to design
dynamic decision-tree-based resource allocation policies. Our approach moves a significant
step ahead of [134], [28], and [135] in that we introduce a unifying framework for designing fair
decision trees and showcase how different fairness metrics (quantifying disparate treatment
and disparate impact) can be explicitly incorporated in an MIP model to support fair and
interpretable decision-making that relies on either categorical or continuous/ordered variables.
Our approach thus enables the generalization of these MIP-based models to general decision-making tasks in socially sensitive settings with diverse fairness requirements. Compared to
the regression trees introduced in [28], we consider more flexible decision tree models which
allow for linear scoring rules to be used at each branch and at each leaf – we term these
“linear branching” and “linear leafing” rules in the spirit of [135]. Compared to [28, 134]
which require one-hot encoding of categorical features, we treat branching on categorical features explicitly, yielding a more interpretable and flexible tree.
Interpretable Machine Learning. Finally, our work relates to interpretable ML, including
works on decision rules [136, 137], decision sets [138], and generalized additive models [139].
In this paper, we build on decision trees [5] which have been used to generate interpretable
models in many settings [140–142]. Compared to this literature, we introduce two new
model classes which generalize decision trees to allow more flexible branching structures
(linear branching rules) and the use of a linear scoring rule at each leaf of the tree (linear
leafing). An approximate algorithm for designing classification trees with linear leafing rules
was originally proposed in [143]. In contrast, we propose to use linear leafing for regression
trees. Our approach is thus capable of integrating linear branching and linear leafing rules
in the design of fair regression trees. It can also integrate linear branching in the design
of fair classification trees. Compared to the literature on interpretable ML, we use these
models to yield general interpretable and fair automated decision- or policy-making systems
rather than learning systems. By leveraging MIP technology, our approach can impose very
general interpretability requirements on the structure of the decision-tree and associated
decision-support system (e.g., limited number of times that a feature is branched on). This
flexibility makes it particularly well suited for socially sensitive settings.
3.1.2 Proposed Approach and Contributions
Our main contributions are:
• We formalize the two types of discrimination (disparate treatment and disparate impact)
mathematically for both classification and regression tasks. We define associated indices
that enable us to quantify disparate treatment and disparate impact in classification
and regression datasets.
• We propose a unifying MIP framework for designing optimal and fair decision-trees for
classification and regression. The trade-off between accuracy and fairness is conveniently
tuned by a single, user selected parameter.
• Our approach is the first in the literature capable of designing fair regression trees able
to mitigate both types of discrimination (disparate impact and/or disparate treatment)
thus making significant contributions to the literature on fair machine learning.
• Our approach also contributes to the literature on (general) machine learning since
it generalizes the decision-tree-based approaches for classification and regression (e.g.,
CART) to more general branching and leafing rules incorporating also interpretability
constraints.
• Our framework leverages MIP technology to allow the decision-maker to conveniently
tune the interpretability of the decision-tree by selecting: the structure of the tree (e.g.,
depth), the type of branching rule (e.g., score based branching or single feature), the
type of model at each leaf (e.g., linear or constant). This translates to customizable
and interpretable decision-support systems that are particularly attractive in socially
sensitive settings.
• We conduct extensive computational studies showing that our framework improves the
state-of-the-art to yield non-discriminating decisions at lower cost to overall accuracy.
3.2 A Unifying Framework for Fairness in Classification
and Regression
In supervised learning, the goal is to learn a mapping $f_\theta : \mathbb{R}^d \to \mathbb{R}$, parameterized by $\theta \in \Theta \subset \mathbb{R}^n$, that maps feature vectors $x \in \mathcal{X} \subseteq \mathbb{R}^d$ to labels $y \in \mathcal{Y} \subseteq \mathbb{R}$. We let $\mathbb{P}$ denote the joint distribution over $\mathcal{X} \times \mathcal{Y}$ and let $\mathbb{E}(\cdot)$ denote the expectation operator relative to $\mathbb{P}$. If labels are categorical and unordered and $|\mathcal{Y}| < \infty$, we refer to the task as a classification task. In two-class (binary) classification, for example, we have $\mathcal{Y} := \{-1, 1\}$. On the other hand, if labels are continuous or ordered discrete values (typically normalized so that $\mathcal{Y} \subseteq [-1, 1]$), then the task is a regression task. Learning tasks are typically achieved by utilizing a training set $\mathcal{T} := \{(x_i, y_i)\}_{i \in \mathcal{N}}$ consisting of historical realizations of $x$ and $y$. The parameters of the classifier are then estimated as those that minimize a certain loss function $\mathcal{L}$ over the training set $\mathcal{T}$, i.e., $\theta^\star \in \arg\min_{\theta \in \Theta} \mathcal{L}(\theta, \mathcal{T})$.
In supervised learning for decision-making, the learned mapping $f_{\theta^\star}$ is used to guide human decision-making, e.g., to help decide whether an individual with feature vector $x$ should be granted bail (the answer being “yes” if the model predicts he will not commit a crime). In socially sensitive supervised learning, it is assumed that some of the elements of the feature vector $x$ are sensitive. We denote the subvector of $x$ that collects all protected (resp. unprotected) attributes by $x_p$ with support $\mathcal{X}_p$ (resp. $x_{\bar p}$ with support $\mathcal{X}_{\bar p}$). In addition to the standard classification task, the goal here is for the resulting mapping to be non-discriminative in the sense that it should not result in disparate treatment and/or disparate impact relative to some (or all) of the protected features. In what follows, we formalize mathematically the notions of unfairness and propose associated indices that serve to measure and also prevent (see Section 3.3) discrimination.
3.2.1 Disparate Impact
Disparate impact does not explicitly use sensitive attributes to decide treatment but implicitly
results in systematically different handling of individuals from protected groups. Next, we
introduce the mathematical definition of disparate impact in classification, also discussed
in [101, 144].
Definition 7 (Disparate Impact in Classification) Consider a classifier that maps feature vectors $x \in \mathbb{R}^d$, with associated protected part $x_p \in \mathcal{X}_p$, to labels $y \in \mathcal{Y}$. We will say that the decision-making process does not suffer from disparate impact if the probability that it outputs a specific value $y$ does not change after observing the protected feature(s) $x_p$, i.e.,
$$\mathbb{P}(y \mid x_p) = \mathbb{P}(y) \quad \text{for all } y \in \mathcal{Y} \text{ and } x_p \in \mathcal{X}_p. \tag{3.1}$$
The following metric enables us to quantify disparate impact in a dataset with categorical or
unordered labels.
Definition 8 (DIDI in Classification) Given a classification dataset $\mathcal{D} := \{(x_i, y_i)\}_{i \in \mathcal{N}}$, we define its Disparate Impact Discrimination Index by
$$\mathrm{DIDI}_c(\mathcal{D}) = \sum_{y \in \mathcal{Y}} \sum_{x_p \in \mathcal{X}_p} \left| \frac{|\{i \in \mathcal{N} : y_i = y\}|}{|\mathcal{N}|} - \frac{|\{i \in \mathcal{N} : y_i = y \cap x_{p,i} = x_p\}|}{|\{i \in \mathcal{N} : x_{p,i} = x_p\}|} \right|.$$
The higher $\mathrm{DIDI}_c(\mathcal{D})$, the more the dataset suffers from disparate impact. If $\mathrm{DIDI}_c(\mathcal{D}) = 0$, we will say that the dataset does not suffer from disparate impact.
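To make the index concrete, it can be computed directly from a dataset. The following Python sketch is our illustration (no code appears in the original text); it sums, over every label and protected group, the absolute gap between the overall and group-conditional label frequencies, matching the reconstruction above.

```python
import numpy as np

def didi_classification(y, xp):
    """DIDI for categorical labels (Definition 8): sum over labels and
    protected groups of |P(y) - P(y | group)| estimated from the data."""
    y, xp = np.asarray(y), np.asarray(xp)
    total = 0.0
    for label in np.unique(y):
        p_overall = np.mean(y == label)
        for group in np.unique(xp):
            p_group = np.mean(y[xp == group] == label)
            total += abs(p_overall - p_group)
    return total

# Small example: label frequencies differ across the two protected groups
y  = [1, 1, 0, 0, 1, 0]
xp = [0, 0, 0, 1, 1, 1]
print(didi_classification(y, xp))  # 2/3: the dataset exhibits disparate impact
```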
The following proposition shows that if a dataset is unbiased, then it is sufficient for the
ML to be unbiased in its errors to yield an unbiased decision-support system.
Proposition 2 Consider an (unknown) class-based decision process (a classifier) that maps feature vectors $x$ to class labels $y \in \mathcal{Y}$ and suppose this classifier does not suffer from disparate impact, i.e., $\mathbb{P}(y \mid x_p) = \mathbb{P}(y)$ for all $y \in \mathcal{Y}$ and $x_p \in \mathcal{X}_p$. Consider learning (estimating) this classifier using a classifier whose output $\tilde y \in \mathcal{Y}$ is such that the probability of misclassifying a certain value $y$ as $\tilde y$ does not change after observing the protected feature(s), i.e.,
$$\mathbb{P}(\tilde y \mid y, x_p) = \mathbb{P}(\tilde y \mid y) \quad \text{for all } y, \tilde y \in \mathcal{Y} \text{ and } x_p \in \mathcal{X}_p. \tag{3.2}$$
Then, the learned classifier will not suffer from disparate impact, i.e., $\mathbb{P}(\tilde y \mid x_p) = \mathbb{P}(\tilde y)$ for all $\tilde y \in \mathcal{Y}$ and $x_p \in \mathcal{X}_p$.
Proof. Fix any $\tilde y \in \mathcal{Y}$ and $x_p \in \mathcal{X}_p$. We have
$$\mathbb{P}(\tilde y) = \sum_{y \in \mathcal{Y}} \mathbb{P}(\tilde y \mid y)\, \mathbb{P}(y) = \sum_{y \in \mathcal{Y}} \mathbb{P}(\tilde y \mid y, x_p)\, \mathbb{P}(y \mid x_p) = \sum_{y \in \mathcal{Y}} \mathbb{P}(\tilde y \cap y \mid x_p) = \mathbb{P}(\tilde y \mid x_p).$$
Since the choice of $\tilde y \in \mathcal{Y}$ and $x_p \in \mathcal{X}_p$ was arbitrary, the claim follows.
Remark 3 Proposition 2 implies that if we have a (large i.i.d.) classification dataset $\{(x_i, y_i)\}_{i \in \mathcal{N}}$ that does not suffer from disparate impact (see Definition 8) and we use it to learn a mapping from $x$ to $y$ with the property that the probability of misclassifying a certain value $y$ as $\tilde y$ does not change after observing the protected feature(s) $x_p$, then the resulting classifier will not suffer from disparate impact. Classifiers with Property (3.2) are sometimes said to not suffer from disparate mistreatment, see e.g., [145]. We emphasize that only imposing (3.2) on a classifier may result in a decision-support system that is plagued by disparate impact if the dataset is discriminative.
Next, we propose a mathematical definition of disparate impact in regression.
Definition 9 (Disparate Impact in Regression) Consider a predictor that maps feature vectors $x \in \mathbb{R}^d$, with associated protected part $x_p \in \mathcal{X}_p$, to values $y \in \mathcal{Y}$. We will say that the predictor does not suffer from disparate impact if the expected value of $y$ does not change after observing the protected feature(s) $x_p$, i.e.,
$$\mathbb{E}(y \mid x_p) = \mathbb{E}(y) \quad \text{for all } x_p \in \mathcal{X}_p. \tag{3.3}$$
Remark 4 Strictly speaking, Definition 9 should exactly parallel Definition 7, i.e., the
entire distributions should be equal rather than merely their expectations. However, requiring
continuous distributions to be equal would yield computationally intractable models, which
motivates us to require fairness in the first moment of the distribution only.
Proposition 3 Consider an (unknown) decision process that maps feature vectors $x$ to values $y \in \mathcal{Y}$ and suppose this process does not suffer from disparate impact, i.e., $\mathbb{E}(y \mid x_p) = \mathbb{E}(y)$ for all $x_p \in \mathcal{X}_p$. Consider learning (estimating) this process using a learner whose output $\tilde y \in \mathcal{Y}$ is such that
$$\mathbb{E}(\tilde y - y \mid x_p) = \mathbb{E}(\tilde y - y) \quad \text{for all } x_p \in \mathcal{X}_p.$$
Then, the learned model will not suffer from disparate impact, i.e., $\mathbb{E}(\tilde y \mid x_p) = \mathbb{E}(\tilde y)$ for all $x_p \in \mathcal{X}_p$.

Proof. $\mathbb{E}(\tilde y \mid x_p) = \mathbb{E}(\tilde y - y \mid x_p) + \mathbb{E}(y \mid x_p) = \mathbb{E}(\tilde y - y) + \mathbb{E}(y) = \mathbb{E}(\tilde y)$.
The following metric enables us to quantify disparate impact in a dataset with continuous
or ordered discrete labels.
Definition 10 (DIDI in Regression) Given a regression dataset $\mathcal{D} := \{(x_i, y_i)\}_{i \in \mathcal{N}}$, we define its Disparate Impact Discrimination Index by
$$\mathrm{DIDI}_r(\mathcal{D}) = \sum_{x_p \in \mathcal{X}_p} \left| \frac{\sum_{i \in \mathcal{N}} y_i\, \mathbb{I}(x_{p,i} = x_p)}{\sum_{i \in \mathcal{N}} \mathbb{I}(x_{p,i} = x_p)} - \frac{1}{|\mathcal{N}|} \sum_{i \in \mathcal{N}} y_i \right|,$$
where $\mathbb{I}(\cdot)$ evaluates to 1 (0) if its argument is true (false). The higher $\mathrm{DIDI}_r(\mathcal{D})$, the more the dataset suffers from disparate impact. If $\mathrm{DIDI}_r(\mathcal{D}) = 0$, we will say that the dataset does not suffer from disparate impact.
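Analogously to the classification case, the regression index compares group means to the overall mean. A minimal sketch (ours, matching the reconstruction above):

```python
import numpy as np

def didi_regression(y, xp):
    """DIDI for continuous labels (Definition 10): sum over protected
    groups of |group mean of y - overall mean of y|."""
    y, xp = np.asarray(y, dtype=float), np.asarray(xp)
    overall = y.mean()
    return float(sum(abs(y[xp == g].mean() - overall) for g in np.unique(xp)))
```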
3.2.2 Disparate Treatment
As mentioned in Section 3.1, disparate treatment arises when a decision-making system
provides different outputs for groups of people with the same (or similar) values of the
non-sensitive features but different values of sensitive features. We formalize this notion
mathematically.
Definition 11 (Disparate Treatment in Classification) Consider a class-based decision-making process that maps feature vectors $x \in \mathbb{R}^d$, with associated protected (unprotected) parts $x_p \in \mathcal{X}_p$ ($x_{\bar p}$), to labels $y \in \mathcal{Y}$. We will say that the decision-making process does not suffer from disparate treatment if the probability that it outputs a specific value $y$ given $x_{\bar p}$ does not change after observing the protected feature(s) $x_p$, i.e.,
$$\mathbb{P}(y \mid x_{\bar p}, x_p) = \mathbb{P}(y \mid x_{\bar p}) \quad \text{for all } y \in \mathcal{Y} \text{ and } x \in \mathcal{X}.$$
The following metric enables us to quantify disparate treatment in a dataset with categorical
or unordered labels.
Definition 12 (DTDI in Classification) Given a classification dataset $\mathcal{D} := \{(x_i, y_i)\}_{i \in \mathcal{N}}$, we define its Disparate Treatment Discrimination Index by
$$\mathrm{DTDI}_c(\mathcal{D}) = \sum_{y \in \mathcal{Y},\, x_p \in \mathcal{X}_p,\, j \in \mathcal{N}} \left| \frac{\sum_{i \in \mathcal{N}} d(x_{\bar p,i}, x_{\bar p,j})\, \mathbb{I}(y_i = y)}{\sum_{i \in \mathcal{N}} d(x_{\bar p,i}, x_{\bar p,j})} - \frac{\sum_{i \in \mathcal{N}} d(x_{\bar p,i}, x_{\bar p,j})\, \mathbb{I}(y_i = y \cap x_{p,i} = x_p)}{\sum_{i \in \mathcal{N}} d(x_{\bar p,i}, x_{\bar p,j})\, \mathbb{I}(x_{p,i} = x_p)} \right|, \tag{3.4}$$
where $d(x_{\bar p,i}, x_{\bar p,j})$ is any non-increasing function of the distance between $x_{\bar p,i}$ and $x_{\bar p,j}$, so that more weight is put on pairs that are close to one another. The idea of using a locally weighted average to estimate a conditional expectation is a well-known technique in statistics referred to as kernel regression, see e.g., [146]. The higher $\mathrm{DTDI}_c(\mathcal{D})$, the more the dataset suffers from disparate treatment. If $\mathrm{DTDI}_c(\mathcal{D}) = 0$, the dataset does not suffer from disparate treatment.
Example 3 (kNN) A natural choice for the weight function in (3.4) is
$$d(x_{\bar p,i}, x_{\bar p,j}) = \begin{cases} 1 & \text{if } x_{\bar p,i} \text{ is a } k\text{-nearest neighbor of } x_{\bar p,j}, \\ 0 & \text{otherwise.} \end{cases}$$
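As an illustration, Definition 12 with the kNN weights of Example 3 can be computed as follows. The sketch is ours: the Euclidean distance on the unprotected features, the choice of k, and the skipping of empty groups are implementation choices, not prescribed by the definition.

```python
import numpy as np

def dtdi_classification(y, xp, x_unprot, k=5):
    """DTDI (Definition 12) with kNN weights (Example 3): for each anchor
    point j, compare the locally weighted label frequency with the one
    restricted to each protected group, and sum the absolute gaps."""
    y, xp = np.asarray(y), np.asarray(xp)
    x_unprot = np.asarray(x_unprot, dtype=float)
    n = len(y)
    dist = np.linalg.norm(x_unprot[:, None, :] - x_unprot[None, :, :], axis=-1)
    weight = np.zeros((n, n))
    for j in range(n):
        nearest = np.argsort(dist[:, j])[1:k + 1]  # k nearest, excluding j
        weight[nearest, j] = 1.0
    total = 0.0
    for label in np.unique(y):
        for group in np.unique(xp):
            for j in range(n):
                w, wg = weight[:, j], weight[:, j] * (xp == group)
                if wg.sum() == 0:
                    continue  # no neighbor of j belongs to this group
                local = (w * (y == label)).sum() / w.sum()
                local_group = (wg * (y == label)).sum() / wg.sum()
                total += abs(local - local_group)
    return total
```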
Next, we propose a mathematical definition of disparate treatment in regression.
Definition 13 (Disparate Treatment in Regression) Consider a decision-making process that maps feature vectors $x \in \mathbb{R}^d$, with associated protected (unprotected) parts $x_p \in \mathcal{X}_p$ ($x_{\bar p}$), to values $y \in \mathcal{Y}$. We will say that the decision-making process does not suffer from disparate treatment if
$$\mathbb{E}(y \mid x_{\bar p}, x_p) = \mathbb{E}(y \mid x_{\bar p}) \quad \text{for all } x \in \mathcal{X}.$$
The following metric enables us to quantify disparate treatment in a dataset with continuous or ordered discrete labels.
Definition 14 (DTDI in Regression) Given a regression dataset $\mathcal{D} := \{(x_i, y_i)\}_{i \in \mathcal{N}}$, we define its Disparate Treatment Discrimination Index by
$$\mathrm{DTDI}_r(\mathcal{D}) = \sum_{x_p \in \mathcal{X}_p,\, j \in \mathcal{N}} \left| \frac{\sum_{i \in \mathcal{N}} d(x_{\bar p,i}, x_{\bar p,j})\, y_i}{\sum_{i \in \mathcal{N}} d(x_{\bar p,i}, x_{\bar p,j})} - \frac{\sum_{i \in \mathcal{N}} d(x_{\bar p,i}, x_{\bar p,j})\, \mathbb{I}(x_{p,i} = x_p)\, y_i}{\sum_{i \in \mathcal{N}} d(x_{\bar p,i}, x_{\bar p,j})\, \mathbb{I}(x_{p,i} = x_p)} \right|, \tag{3.5}$$
where $d(x_{\bar p,i}, x_{\bar p,j})$ is as in Definition 12. If $\mathrm{DTDI}_r(\mathcal{D}) = 0$, we say that the dataset does not suffer from disparate treatment.
3.3 Mixed Integer Optimization Framework for Learning
Fair Decision Trees
We propose a mixed-integer linear program (MILP)-based regularization approach for
trading off prediction quality and fairness in decision trees.
3.3.1 Overview
Given a training dataset $\mathcal{T} := \{(x_i, y_i)\}_{i \in \mathcal{N}}$, we let $\hat y_i$ denote the prediction associated with datapoint $i \in \mathcal{N}$ and define $\hat y := \{\hat y_i\}_{i \in \mathcal{N}}$. We propose to design classification (resp. regression) trees that minimize a loss function $\ell_c(\mathcal{T}, \hat y)$ (resp. $\ell_r(\mathcal{T}, \hat y)$) augmented with a discrimination regularizer $\ell^d_c(\mathcal{T}, \hat y)$ (resp. $\ell^d_r(\mathcal{T}, \hat y)$). Thus, given a regularization weight $\lambda \geq 0$ that allows tuning of the fairness-accuracy trade-off, we seek to design decision trees that minimize
$$\ell_{c/r}(\mathcal{T}, \hat y) + \lambda\, \ell^d_{c/r}(\mathcal{T}, \hat y), \tag{3.6}$$
where the $c$ ($r$) subscript refers to classification (regression).

A typical choice of loss function for classification tasks is the misclassification rate, defined as the proportion of incorrect predictions, i.e., $\ell_c(\mathcal{T}, \hat y) := \frac{1}{|\mathcal{N}|} \sum_{i \in \mathcal{N}} \mathbb{I}(y_i \neq \hat y_i)$. For regression tasks, a loss function often employed is the mean absolute error, defined as $\ell_r(\mathcal{T}, \hat y) := \frac{1}{|\mathcal{N}|} \sum_{i \in \mathcal{N}} |\hat y_i - y_i|$. Both of these loss functions are attractive as they give rise to linear models, see Section 3.3.3. Accordingly, discrimination of the learned model is measured using a discrimination loss function taken to be any of the discrimination indices introduced in Section 3.2. For example, in the case of classification/regression tasks, we propose to either penalize disparate impact by defining the discrimination loss function as $\ell^d_{c/r}(\mathcal{T}, \hat y) := \mathrm{DIDI}_{c/r}(\{(x_i, \hat y_i)\}_{i \in \mathcal{N}})$ or to penalize disparate treatment by defining it as $\ell^d_{c/r}(\mathcal{T}, \hat y) := \mathrm{DTDI}_{c/r}(\{(x_i, \hat y_i)\}_{i \in \mathcal{N}})$. As will become clear later on, discrimination loss functions combining disparate treatment and disparate impact are also acceptable. All of these give rise to linear models.
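For intuition, evaluating objective (3.6) for a fixed vector of predictions is straightforward; the sketch below (ours, reusing the didi_classification and didi_regression functions sketched in Section 3.2) computes the quantity that the MIP of Section 3.3.3 minimizes over all trees.

```python
import numpy as np

def regularized_objective(y_true, y_pred, xp, lam, task="classification"):
    """Objective (3.6): prediction loss plus lam times a discrimination
    index computed on the predictions (DIDI here; DTDI works the same way)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    if task == "classification":
        loss = np.mean(y_true != y_pred)         # misclassification rate
        disc = didi_classification(y_pred, xp)   # sketch from Section 3.2
    else:
        loss = np.mean(np.abs(y_pred - y_true))  # mean absolute error
        disc = didi_regression(y_pred, xp)       # sketch from Section 3.2
    return float(loss + lam * disc)
```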
3.3.2 General Classes of Decision-Trees
A decision-tree [5] takes the form of a tree-like structure consisting of nodes, branches, and leaves. At each internal node of the tree, a “test” is performed. Each branch represents an outcome of the test. Each leaf collects all points that gave the same answers to all tests. Thus, each path from root to leaf represents a classification rule that assigns each data point to a leaf. At each leaf, a prediction from the set $\mathcal{Y}$ is made for each data point; in traditional decision trees, the same prediction is given to all data points that fall in the same leaf.
In this work, we propose to use integer optimization to design general classes of fair
decision-trees. Thus, we introduce decision variables that decide on the branching structure
of the tree and on the predictions at each leaf. We then seek optimal values of these variables
to minimize the loss function (3.6), see Section 3.3.3.
Next, we introduce various classes of decision trees that can be handled by our framework
and which generalize the decision tree structures from the literature. We assume that the
decision-maker has selected the depth $K$ of the tree. This assumption is in line with the literature on fair decision-trees, see [130]. We let $\mathcal{V}$ and $\mathcal{L}$ denote the sets of all branching nodes and leaf nodes in the tree, respectively. Denote by $\mathcal{F}_c$ and $\mathcal{F}_q$ the sets of all indices of categorical and quantitative features, respectively. Also, let $\mathcal{F} := \mathcal{F}_c \cup \mathcal{F}_q$ (so that $|\mathcal{F}| = d$). We introduce the decision variables $p_{\nu j}$, which are zero if and only if the $j$th feature, $j \in \mathcal{F}$, is not involved in the branching rule at node $\nu \in \mathcal{V}$. We also let the binary decision variables $z_{il} \in \{0, 1\}$ indicate whether data point $i \in \mathcal{N}$ belongs to leaf $l \in \mathcal{L}$. Finally, we let $\hat y_i \in \mathcal{Y}$ decide on the prediction for data point $i \in \mathcal{N}$. We denote by $\mathcal{P}$ and $\hat{\mathcal{Y}}(z)$ the sets of all feasible values for $p$ and $\hat y$, respectively.
Example 4 (Classical Decision-Trees) In classical decision-trees, the test performed at each internal node involves a single feature (e.g., whether the age of an individual is less than 18). Thus,
$$\mathcal{P} = \left\{ p \in \{0, 1\}^{|\mathcal{V}| \times |\mathcal{F}|} : \textstyle\sum_{j \in \mathcal{F}} p_{\nu j} = 1 \ \forall \nu \in \mathcal{V} \right\}$$
and $p_{\nu j} = 1$ if and only if we branch on feature $j$ at node $\nu$. Additionally, all data points that reach the same leaf are assigned the same prediction. Thus,
$$\hat{\mathcal{Y}}(z) = \left\{ \hat y \in \mathbb{R}^{|\mathcal{N}|} : \exists u \in \mathcal{Y}^{|\mathcal{L}|} \text{ with } \hat y_i = \textstyle\sum_{l \in \mathcal{L}} z_{il} u_l \ \forall i \in \mathcal{N} \right\}.$$
The auxiliary decision variables $u_l$ denote the prediction for leaf $l \in \mathcal{L}$.
Example 5 (Decision-Trees enhanced with Linear Branching) A generalization of the decision-trees from Example 4 can be obtained by allowing the “test” to involve a linear function of several features. In this setting, we view all features as being quantitative (i.e., continuous or discrete and ordered), so that $\mathcal{F}_c = \emptyset$, and let
$$\mathcal{P} = \left\{ p \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{F}|} : \textstyle\sum_{j \in \mathcal{F}} p_{\nu j} = 1 \ \forall \nu \in \mathcal{V} \right\}.$$
As before, all data points that reach the same leaf are assigned the same prediction, so that $\hat{\mathcal{Y}}(z)$ is defined as in Example 4.
Example 6 (Decision-Trees enhanced with Linear Leafing) Another variant of the decision-trees from Example 4 is one where, rather than having a common prediction for all data points that reach a leaf, a linear scoring rule is employed at each leaf. Thus,
$$\hat{\mathcal{Y}}(z) = \left\{ \hat y \in \mathcal{Y}^{|\mathcal{N}|} : \exists u_l \in \mathbb{R}^d,\ l \in \mathcal{L}, \text{ with } \hat y_i = \textstyle\sum_{l \in \mathcal{L}} z_{il} u_l^\top x_i \ \forall i \in \mathcal{N} \right\}.$$
The auxiliary decision variables $u_l$ collect the coefficients of the linear rule at each leaf $l \in \mathcal{L}$. In addition to the examples above, one may naturally also consider decision-trees enhanced with both linear branching and linear leafing.
We note that all sets above are MILP-representable. Indeed, they involve products of binary and real-valued decision variables, which can be easily linearized using standard techniques. The classes of decision trees above were originally proposed in [135] in the context of policy design for resource allocation problems. Our work generalizes them to generic decision- and policy-making tasks.
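For instance, the bilinear terms $z_{il} u_l$ appearing in $\hat{\mathcal{Y}}(z)$ admit the standard big-$M$ linearization sketched below; the auxiliary variable $\theta_{il}$ and the bound $M \geq |u_l|$ are our notation, not part of the original formulation:
$$-M z_{il} \leq \theta_{il} \leq M z_{il}, \qquad u_l - M(1 - z_{il}) \leq \theta_{il} \leq u_l + M(1 - z_{il}).$$
These constraints force $\theta_{il} = u_l$ when $z_{il} = 1$ and $\theta_{il} = 0$ when $z_{il} = 0$, so that $\hat y_i = \sum_{l \in \mathcal{L}} \theta_{il}$ involves linear constraints only.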
3.3.3 MILP Formulation
For $\nu \in \mathcal{V}$, let $\mathcal{L}^r(\nu)$ (resp. $\mathcal{L}^l(\nu)$) denote all the leaf nodes that lie to the right (resp. left) of node $\nu$. Denote by $x_{i,j}$ the value attained by the $j$th feature of the $i$th data point and, for $j \in \mathcal{F}_c$, let $\mathcal{X}_j$ collect the possible levels attainable by feature $j$. Consider the following MIO:

minimize $\ell_{c/r}(\mathcal{T}, \hat y) + \lambda\, \ell^d_{c/r}(\mathcal{T}, \hat y)$ (3.7a)
subject to $p \in \mathcal{P},\ \hat y \in \hat{\mathcal{Y}}(z)$ (3.7b)
$q_\nu - \sum_{j \in \mathcal{F}_q} p_{\nu j} x_{i,j} = g^+_{i\nu} - g^-_{i\nu} \quad \forall \nu, i$ (3.7c)
$g^+_{i\nu} \leq M w^q_{i\nu} \quad \forall \nu, i$ (3.7d)
$g^-_{i\nu} \leq M (1 - w^q_{i\nu}) \quad \forall \nu, i$ (3.7e)
$g^+_{i\nu} + g^-_{i\nu} \geq \epsilon (1 - w^q_{i\nu}) \quad \forall \nu, i$ (3.7f)
$z_{il} \leq 1 - w^q_{i\nu} + (1 - \sum_{j \in \mathcal{F}_q} p_{\nu j}) \quad \forall \nu, i, l \in \mathcal{L}^r(\nu)$ (3.7g)
$z_{il} \leq w^q_{i\nu} + (1 - \sum_{j \in \mathcal{F}_q} p_{\nu j}) \quad \forall \nu, i, l \in \mathcal{L}^l(\nu)$ (3.7h)
$s_{\nu jk} \leq p_{\nu j} \quad \forall \nu, j \in \mathcal{F}_c, k \in \mathcal{X}_j$ (3.7i)
$w^c_{i\nu} = \sum_{j \in \mathcal{F}_c} \sum_{k \in \mathcal{X}_j} s_{\nu jk}\, \mathbb{I}(x_{i,j} = k) \quad \forall \nu, i$ (3.7j)
$z_{il} \leq w^c_{i\nu} + (1 - \sum_{j \in \mathcal{F}_c} p_{\nu j}) \quad \forall \nu, i, l \in \mathcal{L}^l(\nu)$ (3.7k)
$z_{il} \leq 1 - w^c_{i\nu} + (1 - \sum_{j \in \mathcal{F}_c} p_{\nu j}) \quad \forall \nu, i, l \in \mathcal{L}^r(\nu)$ (3.7l)
$\sum_{l \in \mathcal{L}} z_{il} = 1 \quad \forall i$ (3.7m)

with variables $p$ and $\hat y$; $q_\nu, g^+_{i\nu}, g^-_{i\nu} \in \mathbb{R}$; and $z_{il}, w^q_{i\nu}, w^c_{i\nu}, s_{\nu jk} \in \{0, 1\}$ for all $i \in \mathcal{N}$, $l \in \mathcal{L}$, $\nu \in \mathcal{V}$, $j \in \mathcal{F}$, and $k \in \mathcal{X}_j$.
An interpretation of the variables other than $z$, $p$, and $\hat y$ (which we introduced in Section 3.3.2) is as follows. The variables $q_\nu$, $g^+_{i\nu}$, $g^-_{i\nu}$, and $w^q_{i\nu}$ are used to bound $z_{il}$ based on the branching decisions at each node $\nu$, whenever branching is performed on a quantitative feature at that node. The variable $q_\nu$ corresponds to the cut-off value at node $\nu$. The variables $g^+_{i\nu}$ and $g^-_{i\nu}$ represent the positive and negative parts of $q_\nu - \sum_{j \in \mathcal{F}_q} p_{\nu j} x_{i,j}$, respectively. Whenever branching occurs on a quantitative (i.e., continuous or discrete and ordered) feature, the variable $w^q_{i\nu}$ will equal 1 if and only if $q_\nu \geq \sum_{j \in \mathcal{F}_q} p_{\nu j} x_{i,j}$, in which case the $i$th data point must go left at the branch. The variables $w^c_{i\nu}$ and $s_{\nu jk}$ are used to bound $z_{il}$ based on the branching decisions at each node $\nu$, whenever branching is performed on a categorical feature at that node. Whenever we branch on categorical feature $j \in \mathcal{F}_c$ at node $\nu$, the variable $s_{\nu jk}$ equals 1 if and only if the points such that $x_{i,j} = k$ must go left at the branch. If we do not branch on feature $j$, then $s_{\nu jk}$ will equal zero. Finally, the variable $w^c_{i\nu}$ equals 1 if and only if we branch on a categorical feature at node $\nu$ and data point $i$ must go left at the node.

An interpretation of the constraints is as follows. Constraints (3.7b) impose the adequate structure on the decision tree, see Examples 4-6. Constraints (3.7c)-(3.7h) are used to bound $z_{il}$ based on the branching decisions at each node $\nu$, whenever branching is performed on a quantitative attribute at that node. Constraints (3.7c)-(3.7f) define $w^q_{i\nu}$ to equal 1 if and only if $q_\nu \geq \sum_{j \in \mathcal{F}_q} p_{\nu j} x_{i,j}$. Constraint (3.7g) stipulates that if we branch on a quantitative attribute at node $\nu$ and the $i$th record goes left at the node (i.e., $w^q_{i\nu} = 1$), then that record cannot reach any leaf node that lies to the right of the node. Constraint (3.7h) is symmetric to (3.7g) for the case when the data point goes right at the node. Constraints (3.7i)-(3.7l) are used to bound $z_{il}$ based on the branching decisions at each node $\nu$, whenever branching is performed on a categorical attribute at that node. Constraint (3.7i) stipulates that if we do not branch on attribute $j$ at node $\nu$, then $s_{\nu jk} = 0$. Constraint (3.7j) defines $w^c_{i\nu}$ to equal 1 if and only if we branch on a particular attribute $j$, the value attained by that attribute in the $i$th record is $k$, and data points with attribute value $k$ are assigned to the left branch of the node. Constraints (3.7k) and (3.7l) mirror constraints (3.7g) and (3.7h) for the case of categorical attributes.
With the loss function taken as the misclassification rate or the mean absolute error and the discrimination loss function taken as one of the indices from Section 3.2, Problem (3.7) is an MIP involving a convex piecewise-linear objective and linear constraints. It can be linearized using standard techniques and written equivalently as an MILP. The number of decision variables (resp. constraints) in the problem is $O(|\mathcal{V}||\mathcal{F}| \max_j |\mathcal{X}_j| + |\mathcal{N}||\mathcal{V}|)$ (resp. $O(|\mathcal{V}|^2 |\mathcal{N}| + |\mathcal{V}||\mathcal{F}| \max_j |\mathcal{X}_j|)$), i.e., polynomial in the size of the dataset.
Remark 5 Our approach of penalizing unfairness using a regularizer can be applied to existing
MIP models for learning optimal trees such as the ones in [28, 134]. Contrary to these papers
which require one-hot encoding of categorical features, our approach yields more interpretable
and flexible trees.
Customizing Interpretability. An appealing feature of our framework is that it can cater to interpretability requirements. First, we can limit the value of $K$. Second, we can augment our formulation with additional linear interpretability constraints. For example, we can conveniently limit the number of times that a particular feature $j$ is employed in a test by imposing an upper bound on $\sum_{\nu \in \mathcal{V}} p_{\nu j}$. We can also easily limit the number of features employed in branching rules.
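As an illustration, such requirements are one linear constraint each on top of model (3.7). The sketch below uses gurobipy; the variable dictionary p mirrors our branching variables, while max_uses, max_features, and the auxiliary indicators r are our stand-ins.

```python
import gurobipy as gp

def add_interpretability_constraints(model, p, V, F, max_uses=1, max_features=3):
    """Limit (i) how often each feature j is branched on across the tree
    and (ii) how many distinct features the tree uses in total."""
    for j in F:
        model.addConstr(gp.quicksum(p[v, j] for v in V) <= max_uses)
    r = model.addVars(F, vtype=gp.GRB.BINARY, name="feature_used")
    for j in F:
        for v in V:
            model.addConstr(p[v, j] <= r[j])  # branching anywhere flags r[j]
    model.addConstr(r.sum() <= max_features)
```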
Remark 6 Preference elicitation techniques can be used to make a suitable choice for λ and
to learn the relative priorities of decision-makers in terms of the three conflicting objectives
of predictive power, fairness, and interpretability.
3.4 Numerical Results
Classification. We evaluate our approach on 3 datasets: (A) the Default dataset of Taiwanese credit card users [147, 148] with $|\mathcal{N}| = 30{,}000$ and $d = 23$ features, where we predict whether individuals will default and the protected attribute is gender; (B) the Adult dataset [147, 149] with $|\mathcal{N}| = 45{,}000$ and $d = 13$, where we predict whether an individual earns more than $50k per year and the protected attribute is race; and (C) the COMPAS dataset [100, 150] with $|\mathcal{N}| = 10{,}500$ data points and $d = 16$, where we predict whether a convicted individual will commit a violent crime and the protected attribute is race. These datasets are standard in the literature on fair ML and are thus useful for benchmarking. We compare our approach (MIP-DT) to 3 other families of approaches: i) the MIP approach to classification where $\lambda = 0$ (CART); ii)
Figure 3.1: Accuracy-discrimination trade-off of 4 families of approaches on 3 classification
datasets: (a) Default, (b) Adult, and (c) COMPAS. Each dot represents a different sample
from 5-fold cross-validation and each shaded area corresponds to the convex hull of the results
associated with each approach in accuracy-discrimination space. The same trade-off for 3 families of approaches on the regression dataset Crime is shown in (d).
Figure 3.2: From left to right: (a) MIP objective value and (b) accuracy and fairness as a function of tree depth; (c) comparison of the evolution of upper and lower bounds while solving the MILP problem; and (d) empirical distribution of $\gamma(x) := \mathbb{P}(y \mid x_{\bar p}, x_p) - \mathbb{P}(y \mid x_{\bar p})$ (see Definition 11) when $x$ is valued in the test set, for both CART ($\lambda = 0$) and MIP.
the discrimination-aware decision tree approach (DADT) of [130] with information gain w.r.t.
the protected attribute (IGC+IGS) and with relabeling algorithm (IGC+IGS Relab); iii) The
fair logistic regression methods of [128] (log, log-ind, and log-grp for regular logistic
regression, logistic regression with individual fairness, and group fairness penalty functions,
respectively). Finally, we also discuss the performance of an Approximate variant of our
approach (MIP-DT-A) in which we assume that individuals that have similar outcomes are
similar and replace the distance between features in (3.4) by the distance between outcomes,
as is always done in the literature [128]. As we will see, this approximation results in loss in
performance. In all approaches, we conduct a pre-processing step in which we eliminate the
protected features from the learning phase. We do not compare to uninterpretable fairness
in-processing approaches since we could not find any such approach.
Figure 3.3: Accuracy of maximally non-discriminative models in each approach for (a)
classification and (b) regression.
Regression. We evaluate our approach on the Crime dataset [147, 151] with |N | = 1993
and d = 128. We add a binary column called “race” which is labeled 1 iff the majority of
a community is Black and 0 otherwise, and we predict violent crime rates using race as the
protected attribute. We use the “repeatedcv” method in R to select the 11 most important
features. We compare our approach (MIP-DT and MIP-DT-A, where A stands for Approximate
distance function) to 2 other families: i) The MIP regression tree approach where λ = 0
(CART); ii) The linear regression methods in [128] (marked as reg, LR-ind, and LR-grp for
regular linear regression, linear regression with individual fairness, and group fairness penalty
functions).
Fairness and Accuracy. In all our experiments, we use $\mathrm{DTDI}_{c/r}$ as the discrimination index. First, we investigate the fairness/accuracy trade-off of all methods by evaluating the performance of the most accurate models with low discrimination. We perform $k$-fold cross-validation, where $k$ is 5 for classification and 4 for regression. For each (fold, approach) pair, we select the optimal $\lambda$ (call it $\lambda^\star$) in the objective (3.6) as follows: for each $\lambda \in \{0, 0.1, 0.2, \ldots\}$, we compute the tree on the fold using the given approach and determine the associated discrimination level on the fold; we stop when the discrimination level is below 0.01% and return the corresponding $\lambda$ as $\lambda^\star$; we then evaluate the accuracy (misclassification rate/MAE) and discrimination of the classification/regression tree associated with $\lambda^\star$ on the test set and add this as a point in the corresponding graph in Figure 3.1. For classification (regression), each fold contains 1000 to 5000 (400) samples. Figures 3.1(a)-(c) (resp. (d)) show the fairness-accuracy results for the classification (resp. regression) datasets. On average, our approach yields results with discrimination closer to zero but also higher accuracy. Accuracy results for the most accurate models with zero discrimination (when available) are shown in Figure 3.3. From Figure 3.3(a), it can be seen that our approach is more accurate than the fair log approach and has slightly higher accuracy compared to DADT. These improved results come at a computational cost: the average solver times for our approach on the 3 classification datasets are¹ 18421.43s, 15944.94s, and 18161.64s, respectively. The log (resp. IGC+IGS) approach takes 18.43s, 16.04s, and 7.59s (65.68s, 23.39s, and 4.78s). Figure 3.3(b) shows the MAE of each approach at zero discrimination. MIP-DT has far lower error than LR-ind/grp. The average solve time for MIP-DT (resp. LR-ind/grp) was 36007 (0.38/0.33) seconds.
Fairness and Interpretability. Figures 3.2(a)-(b) show how the MIP objective and the accuracy and fairness values change as a function of tree depth (a proxy for interpretability) on a fold from the Adult dataset. Such graphs can help non-technical decision-makers understand the trade-offs between fairness, accuracy, and interpretability. Figure 3.2(d) shows that the likelihood for individuals (that only differ in their protected characteristics, being otherwise similar) to be treated in the same way is twice as high under MIP as under CART on the same dataset. This is in line with our metric: in this experiment, DTDI was 0.32% (0.7%) for MIP (CART).
Solution Times Discussion. As seen, our approaches exhibit better performance but higher training computational cost. We emphasize that training decision-support systems for socially sensitive tasks is usually not time sensitive. At the same time, predicting the outcome of a new (unseen) sample with our approach, which is time-sensitive, is extremely fast (on the order of milliseconds). In addition, as seen in Figure 3.2(c), a near-optimal solution is typically found very rapidly (these are results from a fold of the Adult dataset).

¹We modeled the MIP using JuMP in Julia [152] and solved it using Gurobi 7.5.2 on a compute node with 20 CPUs and 64 GB of RAM. We imposed a 5 (10) hour solve time limit for classification (regression).
Part II
Optimization and Causal Inference for Policy-Making
Chapter 4
Balancing Efficiency, Fairness, and Interpretability in
Learning Housing Allocation Policies for Individuals
Experiencing Homelessness in Los Angeles
We consider the problem of learning an optimal, highly interpretable, and fair policy for
allocating scarce housing resources to people experiencing homelessness in Los Angeles. Using
administrative data from the Homeless Management Information System, we propose a way
to create simple and actionable policies in the form of prescriptive trees that can satisfy
arbitrary fairness constraints or other domain specific requirements that stakeholders care
about. In the case of enforcing fairness in outcomes, our policies improve outcomes by up
to 2.3% and enhance fairness by up to 36.57%. When enforcing fairness in allocation, we
achieve up to a 2.3% better outcome and increase fairness by up to 100%, i.e., eliminating all
the discrimination.
4.1 Introduction
We study the problem of allocating housing resources to individuals experiencing homelessness based on their observed covariates. We are particularly motivated by the problem of
homelessness in Los Angeles County. According to the latest data released by LAHSA in
June 2023, the number of people experiencing homelessness is over 75,000 individuals [153].
However, the important issue is that the available housing resources fall significantly short
of meeting the staggering demand. The Housing Inventory Count conducted by LAHSA in
March 2023 revealed that approximately all of the 28,000 permanent housing units available
were already occupied [154]. This leaves a significant portion of the population residing
either on the streets or in shelters. Another area of significant concern within this realm
revolves around the fairness in allocation and outcomes among various racial groups. As
per a report by LAHSA’s Ad Hoc Committee, striking disparities were identified: despite
Black individuals comprising only 9% of LA County’s total population, they constituted
a disproportionate 40% of the homeless population [155]. Furthermore, although housing
resources were distributed evenly among racial groups, Black individuals faced elevated rates
of returning to homelessness compared to other groups. These issues surrounding equity in
outcomes have emerged as a central focus for policymakers and community stakeholders alike.
Currently, in the United States, numerous local communities collaborate within a centralized planning structure referred to as Continuum of Care (CoC). This system coordinates
funding for housing and services aimed at assisting individuals experiencing homelessness.
Coordinated Entry Systems (CES) operate within CoCs, comprising a network of providers
and funders of homeless services. CES manages housing and support services through a
standardized process, linking individuals with suitable interventions. Those seeking housing
within the LA CoC have multiple entry points to the system, such as access centers, street
outreach, or emergency shelters. Upon entry, individuals may undergo a self-reported survey
to determine eligibility and vulnerability. This assessment tool, known as the Vulnerability
Index–Service Prioritization Decision Assistance Tool (VI-SPDAT), involves a set of questions
gauging vulnerability factors like housing history and disabilities. The tool assigns weights to
responses, generating a score ranging from 0 to 17, with higher scores indicating increased
“vulnerability” [156]. This score assists in identifying and prioritizing the most vulnerable
individuals for additional supportive resources. Individuals who receive scores ranging from 8
to 17 are categorized as “high risk” and are given priority for Permanent Supportive Housing
(PSH), which is the most comprehensive and intensive type of housing assistance. PSH is
split into two types: PSH (Tenant-Based) and PSH (Site-Based), which are different in terms
of on-site support and the process of getting housing. Those scoring between 4 and 7 are
recommended for Rapid-Rehousing (RRH), involving short-term rental subsidies. Individuals
with scores below 4 are recommended for services, referred to as Service Only (SO). SO covers
many services that are not long-term housing, such as outreach, emergency shelters, childcare,
and more. We refer to SO as No Treatment. Case managers, social service professionals
aiding those experiencing homelessness, use these scores derived from VI-SPDAT assessments
to determine the appropriate treatment for each individual.
The current prioritization policy, based solely on a vulnerability score threshold, is
apparently comprehensible but lacks transparency in practice. This scoring system primarily
offers recommendations, leading to inconsistent treatments for individuals in similar situations.
For instance, a person with a high score might receive SO due to the unavailability of
PSH/RRH at that time, while a similar individual might receive a PSH at another time.
The current policy also lacks a data-driven foundation as it does not utilize the historical
data and is not tied to individuals’ outcome. This presents significant concerns: Firstly, it is
unclear whether individuals with high vulnerability scores are truly the primary beneficiaries
of PSH or RRH resources. Secondly, the current policy overlooks constraints on available
resources. And lastly, the current system fails to address stakeholder concerns regarding racial
disparities in housing outcomes. Notably, policies that prioritize solely based on vulnerability
score will usually cause disparities in outcomes even if the score is well calibrated, see [157].
Consequently, we want to design a personalized data-driven allocation policy, as opposed to
a uni-dimensional prioritization tool, that is interpretable and also mitigates racial disparities
within the system. In the following, we further discuss how we tackle these essential
characteristics within our proposed policy design.
Driven by Data. In this study, we utilize the historical observational data available
to LAHSA and other communities. This data was collected via administered VI-SPDAT
assessments, containing individual covariates, and the LA County Homeless Management
Information System (HMIS) database, which tracks an individual’s interactions within the
system. These interactions encompass various events, including engagements with street
outreach workers, access to emergency shelters, and any housing received. Utilizing these
trajectories, [35] devises a definition for “return to homelessness”, which serves as the primary
outcome of interest, a definition that we also adopt in our study. Our objective is to
leverage treatment outcomes for crafting personalized allocation policies tailored to the
diverse characteristics of individuals. However, treatment outcomes pose a challenge as they
represent counterfactual quantities; we lack knowledge about what would have occurred had
an individual received a different treatment than the one they actually received. To address
this issue we use techniques from causal inference to estimate these counterfactual values.
Interpretability. In this work, we advocate for simple, inherently interpretable policy
models over complex black-box models whose decisions need further explanation—a topic
extensively covered in the growing literature, as highlighted in [4]. Admittedly, there are
trade-offs between crafting interpretable models for optimal performance and retroactively
justifying decisions made by a black-box model. For an in-depth exploration of this trade-off,
see [3]. Given these considerations and the popularity of decision-trees in high-stakes domains,
our focus centers on devising “optimal prescriptive trees”. Particularly, we adopt the mixed
integer optimization (MIO) based formulation of [111] for learning optimal prescriptive trees.
These trees take the form of binary trees, where each node conducts a binary test on a specific
feature. Consequently, two branches stem from each node. If a data point passes or fails the
test, it traverses to the left or right branch, respectively. Every leaf node corresponds to a
treatment assignment. Thus, every path from the tree’s root to a leaf defines a treatment
rule applied to all data points following that path. Our goal is to choose branching decisions
and treatment assignments that maximize the expected value of treatment outcomes for the
population. The formulation of [111] is proven to be asymptotically optimal out-of-sample: the authors show mathematically that no other tree produces superior expected outcomes over the entire population.
Fairness. In response to the concerns of policymakers and community stakeholders regarding
fairness in outcomes, particularly among minority racial groups, the modeling flexibility
of our MIO formulation allows for the integration of prevalent fairness notions from the
domain of resource allocation. For an extensive exploration of these notions and an analysis
of their potential incompatibility, refer to [157]. Within this context, our specific focus lies in
achieving statistical parity in both resource allocation and subsequent outcomes.
4.1.1 Problem Statement
The objective is to create a personalized policy $\pi : \mathcal{X} \to \mathcal{K}$ in the form of a decision tree that maps an individual's characteristics $x \in \mathcal{X} \subseteq \mathbb{Z}^F$ to a treatment chosen from a finite set of options indexed in $\mathcal{K}$; in our application, the treatment options are No Treatment, RRH, PSH (Tenant-Based), and PSH (Site-Based). Each person is characterized by their attributes $X \in \mathbb{Z}^F$ and their potential outcomes $Y(k) \in \mathcal{Y} \subseteq \mathbb{R}$ under each treatment $k \in \mathcal{K}$. We assume that the covariates are taken from a finite set of integers, and let $\Theta(f)$ capture all levels of each feature $f \in \mathcal{F}$. To handle categorical features, one can one-hot encode the variables. To handle continuous variables, one can pass in finite discretizations that are decided a priori. The joint distribution of $X$ and $\{Y(k)\}_{k \in \mathcal{K}}$ is unknown. We have a dataset comprising $I$ independent and identically distributed (i.i.d.) historical observations denoted as $\mathcal{D} := \{(X_i, K_i, Y_i)\}_{i \in \mathcal{I}}$ from the index set $\mathcal{I} := \{1, \ldots, I\}$. Due to the nature of observational experiments, we lack control over historical treatment assignments, and the outcomes $Y_i(k)$ for $k \neq K_i$ remain unobserved. Essentially, our goal is to infer, from the observational data $\mathcal{D}$, a policy $\pi \in \Pi_d$ that maximizes the quantity
$$Q(\pi) := \mathbb{E}\left[Y(\pi(X))\right],$$
where $\Pi_d$ represents the set of prescriptive trees with a maximum depth of $d$, and $\mathbb{E}(\cdot)$ represents the expectation operator with respect to the joint distribution $\mathbb{P}$ of $X, K, Y(1), \ldots, Y(K)$.

A significant challenge in determining the best policy $\pi$ lies in the inability to observe the counterfactual outcomes $Y_i(k)$, $k \in \mathcal{K}$, $k \neq K_i$, under the treatments that were not experienced by data point $i$. This missing data complicates the identification of the most suitable treatment for each data point.
In working with observational data, we encounter two scenarios. In the first case, the historical policy is a randomized policy where treatments are assigned randomly, and all individuals have an equal chance of receiving any given treatment, irrespective of their characteristics $X$. In this setting, treatment groups are considered exchangeable, implying that the specific group receiving a particular treatment is inconsequential. Exchangeability asserts that, for all $k, k' \in \mathcal{K}$, the sub-population which received treatment $k$ in the data would have the same outcome distribution under $k'$ as the sub-population which received treatment $k'$. In the second scenario, we have a conditionally randomized policy. Here, treatment assignment probabilities rely on the characteristics $X$ of each individual. In this case, exchangeability no longer holds; instead, we have conditional exchangeability, where treatment groups are exchangeable conditionally on $X$, as no information other than $X$ is utilized to determine the treatment.
Throughout the paper, we assume that treatments in the data have been assigned according to a (potentially unknown) logging policy $\mu$, where $\mu(k, x) := \mathbb{P}(K = k \mid X = x)$ and $\mu(k, x) > 0$ for all $k \in \mathcal{K}$ and $x \in \mathcal{X}$. In these settings, we assume the historical policy is randomized conditioned on the covariate vector $X$. To ensure the presence of a conditionally randomized policy, three common assumptions need consideration:
a) Well-defined Treatments. This assumption hinges on treatments being well-defined
and distinctly recorded as treatment values within the data. In the context of LAHSA,
the treatment set comprises four different treatments: No Treatment, RRH, PSH
(Tenant-Based), and PSH (Site-Based). As previously explained, “No Treatment” refers
to a collection of services that do not involve long-term housing. Also not all the
instances of PSH or RRH are identical. However, we can cautiously assume that all
the instances of each treatment have a similar outcome, so it is safe to assume that all
the treatments are well-defined. We need to collect further data in the future to better
differentiate between each of these treatments.
b) Conditional Exchangeability. This assumption dictates that the conditional probability of receiving a treatment relies solely on the covariates X. This means that no
other covariates U should influence either the treatment assignment process or the
resulting outcome. This assumption is challenging to verify without further insight.
c) Positivity. This assumption necessitates that the probability of receiving any treatment,
given $X$, is positive, i.e., $\mathbb{P}(K = k \mid X = x) > 0$ almost surely for all $k$. In our
investigation, we observed that in the case of LAHSA individuals without disabilities
never receive a PSH treatment. Consequently, to ensure positivity in our policy design,
we restrict non-disabled individuals from receiving PSH treatment.
We now explore various methods from the causal inference literature used to evaluate the
effectiveness of a counterfactual treatment assignment policy π in observational studies.
Inverse Propensity Weighting (IPW). The IPW estimator relies on reweighting the
outcome of each individual i in the dataset by the inverse of their propensity score, defined
as µ(Ki, Xi). Such reweighting has the effect of creating a pseudo-population in which all
individuals in the data are hypothetically given all treatments [158]. IPW evaluates policy π
as follows:

$$Q^{\text{IPW}}_{I}(\pi) \;:=\; \frac{1}{|I|} \sum_{i \in I} \frac{\mathbb{1}(\pi(X_i) = K_i)}{\hat{\mu}(K_i, X_i)}\, Y_i, \qquad (4.1)$$

where µ̂ is an estimator of µ obtained by fitting a machine learning model to {(Xi, Ki)}i∈I.
If µ is known, or if µ̂ converges almost surely to µ, then Q^IPW_I(π) is statistically
consistent. However, it may suffer from high variance, particularly if some of the propensity
scores µ(K, X) are small; see [159].
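For concreteness, the IPW estimate can be computed in a few lines. The following is a minimal NumPy sketch, where pi, X, K, Y, and mu_hat are hypothetical names for the policy, the data arrays, and a fitted propensity estimator; it illustrates the estimator, not the implementation used in our experiments.

```python
import numpy as np

def ipw_estimate(pi, X, K, Y, mu_hat):
    """IPW estimate of the expected outcome of policy pi, cf. (4.1).

    pi:     callable mapping a covariate vector to a treatment
    X:      (n, p) array of covariates
    K:      length-n array of observed treatments
    Y:      length-n array of observed outcomes
    mu_hat: callable (k, x) -> estimated propensity P(K = k | X = x)
    """
    n = len(Y)
    # Keep only datapoints whose observed treatment matches the policy's
    # prescription, reweighted by the inverse of their propensity score.
    terms = [
        (Y[i] / mu_hat(K[i], X[i])) if pi(X[i]) == K[i] else 0.0
        for i in range(n)
    ]
    return float(np.mean(terms))
```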
Direct Method (DM). This method learns a model ν̂k(X) of E(Y | K = k, X) using the
subpopulation that was assigned treatment k, for each k ∈ K. DM then evaluates the
expected outcome of policy π as follows:

$$Q^{\text{DM}}_{I}(\pi) \;:=\; \frac{1}{|I|} \sum_{i \in I} \hat{\nu}_{\pi(X_i)}(X_i). \qquad (4.2)$$

DM performs poorly if the ν̂k(X) are biased estimators or generate predictions that deviate
significantly from the true expected outcomes. Conversely, if ν̂k is statistically consistent
for all k ∈ K, then Q^DM_I(π) is also consistent.
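The DM estimate is even simpler; a minimal sketch under the same hypothetical interface, with nu_hat a callable returning the fitted estimate of E(Y | K = k, X = x):

```python
import numpy as np

def dm_estimate(pi, X, nu_hat):
    """Direct-method estimate of the expected outcome of policy pi, cf. (4.2)."""
    # Evaluate the fitted outcome model at the treatment the policy prescribes.
    return float(np.mean([nu_hat(pi(x), x) for x in X]))
```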
Doubly Robust (DR) Approach. Acknowledging the shortcomings of both the IPW and DM
methods, combining the two estimators into a DR approach offers an effective strategy. A
key advantage of DR is that it remains accurate as long as at least one of the IPW or DM
components performs well, thereby mitigating the errors inherent in each estimator. Introduced
in [159], the DR method evaluates policy π as:

$$Q^{\text{DR}}_{I}(\pi) \;:=\; \frac{1}{|I|} \sum_{i \in I} \left( \hat{\nu}_{\pi(X_i)}(X_i) + \big(Y_i - \hat{\nu}_{K_i}(X_i)\big)\, \frac{\mathbb{1}(\pi(X_i) = K_i)}{\hat{\mu}(K_i, X_i)} \right). \qquad (4.3)$$

Provided that at least one of µ̂ or ν̂ converges almost surely to µ or ν, respectively, the DR
estimator of Q(π) is asymptotically consistent. Furthermore, in practice, Q^DR_I usually
has a smaller variance than Q^IPW_I but a higher variance than Q^DM_I; see [159].
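Combining the two pieces gives the DR estimate; a minimal sketch with the same hypothetical callables as above:

```python
import numpy as np

def dr_estimate(pi, X, K, Y, mu_hat, nu_hat):
    """Doubly robust estimate of the expected outcome of policy pi, cf. (4.3)."""
    terms = []
    for i in range(len(Y)):
        value = nu_hat(pi(X[i]), X[i])  # direct-method term
        if pi(X[i]) == K[i]:
            # IPW-style correction on datapoints whose observed treatment
            # agrees with the policy's prescription.
            value += (Y[i] - nu_hat(K[i], X[i])) / mu_hat(K[i], X[i])
        terms.append(value)
    return float(np.mean(terms))
```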
4.1.2 Related Work
In this section, we position our work within the literature on operations research, causal
inference and data-driven and fair resource allocation.
Mixed Integer Optimization. In recent years, MIO has gained popularity to address
various problems such as sparse regression problems [8–12], verification of neural networks
[13–15], and sparse principal component analysis [16, 17], among other topics. More related
to the topic of this chapter, MIO methods have been proposed to learn optimal decision trees
[18–21]. Our proposed formulations for learning optimal decision trees in Chapters 2 and 3
also belong to this stream of research.
Learning Prescriptive Trees. In this chapter, we expand upon the MIO formulation outlined in [111] for learning prescriptive trees, an extension derived from the strong classification
tree formulation introduced in Chapter 2. This formulation has been mathematically proven
to be asymptotically optimal out-of-sample, signifying that there exists no other prescriptive
tree that can yield a better expected outcome across the entire population. Several other
endeavors in the literature have aimed to design prescriptive trees, including [160], which
introduces an MIO formulation for learning optimal prescriptive trees. Later, [161] builds
upon [160] by enhancing the objective function with a term intended to improve the accuracy
of outcome predictions. However, both of these approaches rely on a critical assumption: the
trees must be sufficiently deep, ensuring that within the leaf nodes, the historical treatment
assignment remains independent of the characteristics of the data points reaching that leaf
node. Nevertheless, this assumption holds true primarily in randomized control experiments,
which do not align with the context of our paper. In contrast, the approach in [111] does not
hinge on this restrictive assumption. Instead, it requires the three assumptions of well-defined
treatments, positivity, and conditional exchangeability mentioned in Section 4.1.1. Furthermore,
neither [160] nor [161] can incorporate fairness constraints such as ours.
Data-Driven Allocation of Public Resources. Another related line of research involves
data-driven allocation policies for managing limited public resources, including studies like
kidney allocation [31, 32] and homeless services [1, 33–35]. Among these studies, [35] stands
out as the most closely related to this study. This specific research designs an online
allocation policy by solving a linear optimization (LO) problem that optimizes the expected
outcome subject to resource capacity and fairness constraints. Their policy, mathematically
proven optimal out-of-sample, assigns a treatment that maximizes the difference between
the estimated mean treatment response and the dual variables associated with budget and
fairness constraints within the LO framework. This approach stands in contrast to our work,
which focuses solely on the family of prescriptive trees: they consider a broader array
encompassing all classes of policies, which negatively affects the interpretability of their
design.
Fair Resource Allocation. This chapter is also related to the research on equitable
decision-making in critical domains. Typically, these studies center on implementing linear
constraints to mitigate disparities among protected groups [1, 31, 33, 35, 108] or exploring
limitations of various fairness concepts [157]. Similarly, our work introduces linear fairness
constraints, such as ensuring equal resource allocation rates across various racial groups,
aligning with the criteria used by policymakers in LA homeless services.
4.1.3 Contributions
In this chapter, we expand upon the MIO formulation proposed by [111] for learning prescriptive trees. We utilize its flexible modeling capabilities by customizing it to accommodate
fairness, budget, and other context-specific considerations pertinent to housing allocation
for individuals experiencing homelessness in LA. Our approach involves devising a range of
prescriptive trees characterized by varying levels of interpretability, measured using a decision
complexity metric proposed in [162]. We establish a framework enabling decision-makers
to select their preferred policy by assessing trade-offs between interpretability, efficiency,
and fairness. Moreover, we benchmark our proposed solutions against the current allocation
policy under deployment and a dual-price queuing policy introduced in [35] that is proven
to be asymptotically optimal. Our findings demonstrate improvements in exit rates from
homelessness of up to 1.9% compared to the historical policy. These improvements come with
the added benefits of enhanced interpretability, adherence to fairness criteria, and satisfaction
of budget constraints. We also remain competitive with the dual-price queuing policy while
offering a much more interpretable solution.
The structure of this chapter is as follows: Section 4.2 presents the MIO formulation
for learning optimal prescriptive trees, emphasizing fairness, budget, and interpretability
constraints relevant to our housing allocation problem. In Section 4.3, we introduce a decision
complexity metric for a systematic comparison of different policies in terms of interpretability.
Section 4.4 delves into our computational experiments, analyzing real-world data on LA
homelessness sourced from the HMIS database.
4.2 Problem Formulation
In this section, we present the MIO formulation to learn optimal prescriptive trees proposed
in [111]. We use this formulation as a building block to which we add various fairness, budget,
and context-specific constraints for the specific case of housing allocation.
4.2.1 MIO Formulation
The core block of our formulation is a complete binary tree of depth d whose nodes are
numbered 1 through 2^{d+1} − 1 from top to bottom, left to right. Let B := {1, . . . , 2^d − 1}
denote the set of branching nodes and T := {2^d, . . . , 2^{d+1} − 1} denote the set of terminal
nodes (see Figure 4.1, left). Given the complete binary tree, we build a flow graph by
connecting a source node s to the root node, and connecting all nodes other than s to |K|
sink nodes, denoted by t_k, k ∈ K (see Figure 4.1, right). All edges in the flow graph have a
flow capacity of 1 and are directed from the source to the sinks.
Figure 4.1: A prescriptive tree with depth 2 (left) and its associated flow graph (right).
Having the flow graph, we formulate an MIO problem to maximize the expected outcome
estimated by the DR objective (4.3) over all trees of maximum depth d. We later show how
to optimize for IPW (resp. DM) objective (4.1) (resp. (4.2)) as well. For every branching
node n ∈ B, feature f ∈ F, and threshold θ ∈ Θ(f), let the binary variable bnfθ denote the
selection of feature f with threshold θ for branching at node n. For every node n ∈ T ∪ B,
the binary variable pn takes the value 1 if and only if (iff) node n is a treatment node. In such
cases no further branching is permitted. For k ∈ K, let wnk ∈ {0, 1} equal 1 iff treatment k
is assigned at node n. The MIO formulation that finds the best prescriptive tree according
to the doubly robust estimator reads as follows:
$$
\begin{aligned}
\text{maximize} \quad & \sum_{x \in \mathcal{X}} \sum_{k \in \mathcal{K}} \sum_{n \in \mathcal{B} \cup \mathcal{T}} z_x^{n,t_k} \sum_{i : X_i = x} \left[ \hat{\nu}_k(X_i) + \frac{\mathbb{1}[k = K_i]\,\big(Y_i - \hat{\nu}_{K_i}(X_i)\big)}{\hat{\mu}(K_i, X_i)} \right] && (4.4\text{a}) \\
\text{subject to} \quad & \sum_{f \in \mathcal{F}} \sum_{\theta \in \Theta(f)} b_{nf\theta} + p_n + \sum_{m \in \mathcal{A}(n)} p_m = 1 \qquad \forall n \in \mathcal{B} && (4.4\text{b}) \\
& p_n + \sum_{m \in \mathcal{A}(n)} p_m = 1 \qquad \forall n \in \mathcal{T} && (4.4\text{c}) \\
& z_x^{a(n),n} = z_x^{n,\ell(n)} + z_x^{n,r(n)} + \sum_{k \in \mathcal{K}} z_x^{n,t_k} \qquad \forall n \in \mathcal{B},\ x \in \mathcal{X} && (4.4\text{d}) \\
& z_x^{a(n),n} = \sum_{k \in \mathcal{K}} z_x^{n,t_k} \qquad \forall n \in \mathcal{T},\ x \in \mathcal{X} && (4.4\text{e}) \\
& z_x^{s,1} = 1 \qquad \forall x \in \mathcal{X} && (4.4\text{f}) \\
& z_x^{n,\ell(n)} \le \sum_{f \in \mathcal{F}} \; \sum_{\theta \in \Theta(f) : x_f \le \theta} b_{nf\theta} \qquad \forall n \in \mathcal{B},\ x \in \mathcal{X} && (4.4\text{g}) \\
& z_x^{n,r(n)} \le \sum_{f \in \mathcal{F}} \; \sum_{\theta \in \Theta(f) : x_f > \theta} b_{nf\theta} \qquad \forall n \in \mathcal{B},\ x \in \mathcal{X} && (4.4\text{h}) \\
& z_x^{n,t_k} \le w_{nk} \qquad \forall n \in \mathcal{B} \cup \mathcal{T},\ x \in \mathcal{X},\ k \in \mathcal{K} && (4.4\text{i}) \\
& \sum_{k \in \mathcal{K}} w_{nk} = p_n \qquad \forall n \in \mathcal{B} \cup \mathcal{T} && (4.4\text{j}) \\
& w_{nk} \in \{0,1\} \qquad \forall n \in \mathcal{B} \cup \mathcal{T},\ k \in \mathcal{K} && (4.4\text{k}) \\
& b_{nf\theta} \in \{0,1\} \qquad \forall n \in \mathcal{B},\ f \in \mathcal{F},\ \theta \in \Theta(f) && (4.4\text{l}) \\
& p_n \in \{0,1\} \qquad \forall n \in \mathcal{B} \cup \mathcal{T} && (4.4\text{m}) \\
& z_x^{a(n),n},\ z_x^{n,t_k} \in \{0,1\} \qquad \forall n \in \mathcal{B} \cup \mathcal{T},\ x \in \mathcal{X},\ k \in \mathcal{K}, && (4.4\text{n})
\end{aligned}
$$
where µ̂ and ν̂k are as defined in Section 4.1.1, A(n) is the set of all ancestors of node
n ∈ B ∪ T, and r(n) (resp. ℓ(n)) is the right (resp. left) descendant of n. The objective
(4.4a) maximizes the expected outcome estimated by the doubly robust estimator (4.3).
Constraints (4.4b) guarantee that at any branching node, either a branch is created or a
treatment is prescribed at that node or one of its ancestors. Similarly, constraints (4.4c)
ensure that a treatment is assigned at every terminal node, unless a treatment has already
been assigned to one of its ancestor nodes. Constraints (4.4d) and (4.4e) act as flow
conservation measures, ensuring that any data point entering node n must either proceed to
its right or left descendant, or transition directly to one of the sink nodes. Constraints (4.4f)
ensure that exactly one unit of flow enters the root node from the source for each datapoint.
Constraints (4.4g) and (4.4h) ensure that the flow of data points aligns with the branching
decisions as determined by the variables b. Constraints (4.4i) enforce that datapoints can
only flow to the sink associated with their assigned treatment. Constraints (4.4j), combined
with the integrality of the w_{nk} variables, ensure that exactly one of the available
treatments is assigned at each treatment node.
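To make the formulation concrete, the following is a minimal gurobipy sketch of problem (4.4) under simplifying assumptions: flows are indexed by datapoint rather than by distinct covariate value x, the threshold set is shared across features, and nu_hat and mu_hat are fitted estimators passed in as callables. All names are ours; this is a sketch of the technique, not the exact implementation used in our experiments.

```python
import gurobipy as gp
from gurobipy import GRB

def build_flowopt_dr(d, data, features, thresholds, K, nu_hat, mu_hat):
    """Minimal sketch of formulation (4.4); data is a list of (x, k_obs, y)
    tuples, with x a dict mapping feature names to values."""
    B = list(range(1, 2 ** d))                  # branching nodes
    T = list(range(2 ** d, 2 ** (d + 1)))      # terminal nodes
    nodes = B + T
    ancestors = {n: [] for n in nodes}
    for n in nodes:
        a = n // 2
        while a >= 1:
            ancestors[n].append(a)
            a //= 2

    m = gp.Model("FlowOPT-DR")
    I = list(range(len(data)))
    b = m.addVars(B, features, thresholds, vtype=GRB.BINARY)  # split on (f, theta)
    p = m.addVars(nodes, vtype=GRB.BINARY)                    # n is a treatment node
    w = m.addVars(nodes, K, vtype=GRB.BINARY)                 # treatment k at node n
    z_in = m.addVars(nodes, I, vtype=GRB.BINARY)              # flow a(n) -> n
    z_sink = m.addVars(nodes, K, I, vtype=GRB.BINARY)         # flow n -> sink t_k

    for n in B:   # (4.4b): branch, or a treatment here or at an ancestor
        m.addConstr(b.sum(n, "*", "*") + p[n]
                    + gp.quicksum(p[a] for a in ancestors[n]) == 1)
    for n in T:   # (4.4c)
        m.addConstr(p[n] + gp.quicksum(p[a] for a in ancestors[n]) == 1)

    for i in I:
        x = data[i][0]
        m.addConstr(z_in[1, i] == 1)            # (4.4f): unit flow source -> root
        for n in B:                             # (4.4d): flow conservation
            m.addConstr(z_in[n, i] == z_in[2 * n, i] + z_in[2 * n + 1, i]
                        + z_sink.sum(n, "*", i))
            # (4.4g)-(4.4h): routing must agree with the chosen split
            m.addConstr(z_in[2 * n, i] <= gp.quicksum(
                b[n, f, t] for f in features for t in thresholds if x[f] <= t))
            m.addConstr(z_in[2 * n + 1, i] <= gp.quicksum(
                b[n, f, t] for f in features for t in thresholds if x[f] > t))
        for n in T:                             # (4.4e)
            m.addConstr(z_in[n, i] == z_sink.sum(n, "*", i))

    for n in nodes:
        m.addConstr(w.sum(n, "*") == p[n])      # (4.4j)
        for k in K:
            for i in I:
                m.addConstr(z_sink[n, k, i] <= w[n, k])   # (4.4i)

    def dr_term(i, k):                          # objective coefficient in (4.4a)
        x, k_obs, y = data[i]
        val = nu_hat(k, x)
        if k == k_obs:
            val += (y - nu_hat(k_obs, x)) / mu_hat(k_obs, x)
        return val

    m.setObjective(gp.quicksum(dr_term(i, k) * z_sink[n, k, i]
                               for n in nodes for k in K for i in I),
                   GRB.MAXIMIZE)
    return m
```

The capacity and fairness constraints of Sections 4.2.3 and 4.2.4 can be appended to the same model object, as sketched in those sections.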
To optimize the expected outcome estimated by the IPW estimator (4.1), we can replace (4.4a) to get

$$
\begin{aligned}
\text{maximize} \quad & \sum_{x \in \mathcal{X}} \sum_{k \in \mathcal{K}} \sum_{n \in \mathcal{B} \cup \mathcal{T}} z_x^{n,t_k} \sum_{i : X_i = x} \frac{\mathbb{1}[K_i = k]\, Y_i}{\hat{\mu}(K_i, X_i)} \\
\text{subject to} \quad & (4.4\text{b})\text{–}(4.4\text{n}).
\end{aligned} \qquad (4.5)
$$
Formulation (4.4) can similarly be adapted to optimize the DM objective (4.2) by simply
dropping the second term in the objective (4.4a), i.e.,

$$
\begin{aligned}
\text{maximize} \quad & \sum_{x \in \mathcal{X}} \sum_{k \in \mathcal{K}} \sum_{n \in \mathcal{B} \cup \mathcal{T}} z_x^{n,t_k} \sum_{i : X_i = x} \hat{\nu}_k(X_i) \\
\text{subject to} \quad & (4.4\text{b})\text{–}(4.4\text{n}).
\end{aligned} \qquad (4.6)
$$
4.2.2 Treatment Assignment Strategy
Formulations (4.4)–(4.6) assign a deterministic treatment, denoted as w_{nk} ∈ {0, 1}, at each
leaf node. Throughout this paper, we refer to this setup as “deterministic” treatment
assignment. However, we can modify formulations (4.4)–(4.6) to assign a probability to each
treatment at prediction nodes by relaxing the prescription variables w_{nk} and the routing
decisions z to be continuous, i.e., w_{nk}, z ∈ [0, 1]. In this scenario, termed “probabilistic”
treatment assignment, instead of assigning a single treatment to all data points arriving at a
prediction node, we allocate a probability to each treatment. Consequently, for each data
point, we assign a treatment by sampling from this distribution. In Section 4.4, we demonstrate
that the probabilistic strategy, while less interpretable, exhibits superior performance. Our
numerical analysis indicates that relaxing the binary decision variables expedites the solving
of the MIO formulation. This acceleration enables the discovery of improved prescriptive
trees, particularly in time-constrained problem-solving scenarios.
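Under the probabilistic strategy, deploying the tree involves one extra sampling step per data point; a minimal sketch, where leaf_probs is a hypothetical dictionary holding the relaxed values of w_{nk} at the prediction node reached by the data point:

```python
import numpy as np

def sample_treatment(leaf_probs, rng=None):
    """Sample a treatment at a prediction node under probabilistic assignment.

    leaf_probs: dict mapping each treatment k to the relaxed value of w_nk,
                which constraints (4.4j) force to sum to one at the node.
    """
    rng = rng or np.random.default_rng()
    treatments = list(leaf_probs)
    return rng.choice(treatments, p=[leaf_probs[k] for k in treatments])
```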
4.2.3 Resource Capacity Constraints
In our housing allocation problem, the availability of more comprehensive housing assistance
types, like RRH and PSH, is notably limited. Therefore, it is crucial to account for these
resource constraints during our training process. We can learn tree-based policies that
adhere to these capacity constraints by enhancing formulations (4.4)-(4.6) with additional
constraints:
$$\sum_{n \in \mathcal{B} \cup \mathcal{T}} \; \sum_{x \in \mathcal{X}} \; \sum_{i \in I : X_i = x} z_x^{n,t_k} \;\le\; |I|\, C_k \qquad \forall k \in \mathcal{K},$$

where C_k denotes the percentage of instances that can be assigned treatment k.
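In terms of the gurobipy sketch of Section 4.2.1, these constraints can be appended to the model as follows (a sketch; m, z_sink, nodes, K, and I follow the hypothetical names used there, and cap maps each treatment k to its capacity C_k):

```python
import gurobipy as gp

def add_capacity_constraints(m, z_sink, nodes, K, I, cap):
    """Limit each treatment k to at most |I| * C_k assignments."""
    for k in K:
        m.addConstr(
            gp.quicksum(z_sink[n, k, i] for n in nodes for i in I)
            <= len(I) * cap[k]
        )
```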
4.2.4 Fairness Constraints
As outlined in Section 4.1, policymakers and community members hold deep concerns about
fairness, equity, or the lack thereof. There are significant worries regarding disparities in
outcomes among minority groups, including Black individuals, women, and young adults. In this
section, we discuss the fairness notion of statistical parity for both treatment assignment and
outcomes. We illustrate how we can incorporate these fairness notions into our formulations.
Statistical Parity in Treatment Assignment. We say that a policy satisfies treatment
assignment parity if the probability of assigning a specific treatment is similar across all
protected groups. This definition is derived from the notion of statistical parity in classification
in the fair ML literature; see, e.g., [23, 162–167]. Let P denote all protected covariates (e.g.,
race, gender or sexual orientation). Accordingly, we let Px ∈ P represent the value of the
protected feature(s) of covariate x. The features in P are often not included in covariate
vector X to prevent making decisions based on protected features. To make sure that the
policy satisfies treatment assignment parity up to a threshold δ, we augment (4.4), (4.5), and
(4.6) with the constraint
$$\left| \; \frac{\displaystyle \sum_{n \in \mathcal{B} \cup \mathcal{T}} \; \sum_{x \in \mathcal{X} : P_x = p} \; \sum_{i \in I : X_i = x} z_x^{n,t_k}}{|\{i \in I : P_i = p\}|} \; - \; \frac{\displaystyle \sum_{n \in \mathcal{B} \cup \mathcal{T}} \; \sum_{x \in \mathcal{X} : P_x = p'} \; \sum_{i \in I : X_i = x} z_x^{n,t_k}}{|\{i \in I : P_i = p'\}|} \; \right| \;\le\; \delta \qquad \forall p, p' \in \mathcal{P} : p \ne p',\ k \in \mathcal{K}.$$
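A sketch of how these constraints might be appended in gurobipy, again reusing the hypothetical names of the earlier sketch; group_of maps each datapoint to its protected group, and imposing the inequality for both orderings of (p, p′) captures the absolute value:

```python
import gurobipy as gp

def add_assignment_parity(m, z_sink, nodes, K, I, group_of, groups, delta):
    """Bound pairwise gaps in treatment-assignment rates across groups by delta."""
    members = {p: [i for i in I if group_of[i] == p] for p in groups}
    for k in K:
        rate = {
            p: (1.0 / len(members[p]))
            * gp.quicksum(z_sink[n, k, i] for n in nodes for i in members[p])
            for p in groups
        }
        for p in groups:
            for q in groups:
                if p != q:  # both orderings, hence the absolute difference
                    m.addConstr(rate[p] - rate[q] <= delta)
```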
Statistical Parity in Treatment Outcomes. Our formulations can also be augmented
with fairness constraints that ensure statistical parity in expected outcomes among protected
groups. This requires the absolute difference between the average expected outcomes of any
two protected groups p, p′ ∈ P to be bounded by some threshold δ ∈ [0, 1]:

$$\left| \; \frac{1}{|\{i \in I : P_i = p\}|} \sum_{n \in \mathcal{B} \cup \mathcal{T}} \; \sum_{x \in \mathcal{X} : P_x = p} \; \sum_{k \in \mathcal{K}} \; \sum_{i : X_i = x} z_x^{n,t_k} \left[ \hat{\nu}_k(X_i) + \frac{\mathbb{1}[k = K_i]\big(Y_i - \hat{\nu}_{K_i}(X_i)\big)}{\hat{\mu}(K_i, X_i)} \right] - \frac{1}{|\{i \in I : P_i = p'\}|} \sum_{n \in \mathcal{B} \cup \mathcal{T}} \; \sum_{x \in \mathcal{X} : P_x = p'} \; \sum_{k \in \mathcal{K}} \; \sum_{i : X_i = x} z_x^{n,t_k} \left[ \hat{\nu}_k(X_i) + \frac{\mathbb{1}[k = K_i]\big(Y_i - \hat{\nu}_{K_i}(X_i)\big)}{\hat{\mu}(K_i, X_i)} \right] \; \right| \le \delta \qquad \forall p, p' \in \mathcal{P} : p \ne p'.$$
In the above constraint we evaluate the outcome using the DR objective. One could have
the same constraint for DM and IPW objectives as well. Also, instead of restricting the
absolute value of the difference between protected groups, we could enforce that the minority
group should be better off by a margin of δ, see Section 4.4.
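An analogous sketch for outcome parity, assuming the group-normalized averages written above and a helper dr_term(i, k) returning the DR outcome term of datapoint i under treatment k (both hypothetical):

```python
import gurobipy as gp

def add_outcome_parity(m, z_sink, nodes, K, I, group_of, groups, delta, dr_term):
    """Bound pairwise gaps in average DR-estimated outcomes across groups."""
    members = {p: [i for i in I if group_of[i] == p] for p in groups}
    avg = {
        p: (1.0 / len(members[p]))
        * gp.quicksum(dr_term(i, k) * z_sink[n, k, i]
                      for n in nodes for k in K for i in members[p])
        for p in groups
    }
    for p in groups:
        for q in groups:
            if p != q:  # both orderings, hence the absolute difference
                m.addConstr(avg[p] - avg[q] <= delta)
```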
4.3 Decision Complexity
In this chapter, we adapt the decision complexity metric proposed in [162] to the prescriptive
setting. This metric offers a quantifiable measure of interpretability that applies to both
predictive and prescriptive models, within the same or different model classes, which facilitates
better comparisons.
Definition 15 (Decision Complexity) Given a trained prescriptive model, decision complexity captures the minimum number of parameters needed for the model to make a prescription on a new datapoint.
In the following section, we discuss the decision complexity for various allocation policies.
Prescriptive Tree with Deterministic Treatment Assignment. The decision complexity of a
binary prescriptive tree with deterministic treatment assignment is measured by the number
of nodes in the tree (branching nodes plus leaves), which corresponds to the number of
decisions used to route a datapoint through the tree and prescribe its treatment.
Prescriptive Tree with Randomized Treatment Assignment. This tree carries a
complexity similar to the deterministic tree, differing in that each leaf node contributes to
the complexity based on the number of treatment probabilities it holds.
Prescriptive Trees with Estimated Counterfactual Outcomes as the Branching
Features. In accordance with the insights presented in [157], achieving statistical parity
in outcomes necessitates considering solely the treatment outcome effects and protected
attributes within the treatment assignment policy. Consequently, for our experiments, we
investigated a specific class of prescriptive trees that branch exclusively based on the estimated
treatment outcomes, denoted as νˆ in Section 4.1.1. Estimating these counterfactual values
entails using model-generated estimates. Therefore, if a prescriptive tree contains a branching
node reliant on the estimated outcome of treatment k, we need to incorporate the decision
complexity of the associated predictive model into the contribution of this particular node to
the overall complexity of the prescriptive tree.
Historical Policy. In this policy, we initially calculate the vulnerability score for each
individual, requiring responses to N questions. Subsequently, we compare this score against
two threshold values to categorize individuals into low, medium, or high vulnerability groups,
based on which treatments are allocated. Hence, the overall complexity of this process
amounts to N + 2.
Dual-price Queuing Policy. This policy, proposed in [35], involves solving a linear optimization
(LO) problem that optimizes the expected outcome subject to the resource capacity
and fairness constraints. The policy assigns the treatment that maximizes the difference
between the estimated mean treatment response and the dual variables associated with
resource and fairness constraints within the LO framework. Technically, for each individual,
we must compute the estimated counterfactual outcome for each treatment. If we use a
predictive model with complexity of N for this task and if we have K available treatments,
the complexity would be K ×N. Additionally, the consideration of dual variables for resource
and fairness constraints adds M more parameters, one for each constraint in the LO problem.
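These accounting rules can be tallied mechanically; the following small sketch encodes them (function and parameter names are ours):

```python
def tree_complexity(n_branching, n_leaves, probabilistic=False,
                    n_treatments=1, branch_model_cost=0):
    """Decision complexity of a prescriptive tree, per Definition 15.

    Each branching node counts once; under probabilistic assignment each
    leaf contributes one parameter per treatment probability; branching on
    estimated counterfactual outcomes adds the complexity of the predictive
    model used at those nodes (branch_model_cost).
    """
    leaf_cost = n_leaves * (n_treatments if probabilistic else 1)
    return n_branching + leaf_cost + branch_model_cost

def historical_complexity(n_questions):
    """Score from N survey questions plus two thresholds: N + 2."""
    return n_questions + 2

def dual_price_complexity(n_model_params, n_treatments, n_constraints):
    """K predictive evaluations of complexity N plus M dual variables."""
    return n_treatments * n_model_params + n_constraints
```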
In Figure 4.2, we illustrate various methods across a spectrum of decision complexity.
Our framework demonstrates versatility by accommodating a range of prescriptive trees,
offering flexibility for policymakers to select a policy that aligns with their requirements. As
discussed earlier, the historical policy displays a relatively interpretable nature. Conversely,
the dual-price queuing policy stands at the far end of the complexity spectrum.
4.4 Numerical Results
In this section, we assess the effectiveness of our prescriptive trees using real-world data to
formulate strategies for allocating limited resources among people experiencing homelessness
in LA. We compare these prescriptive trees against the current prioritization policy in use,
termed as the “historical policy”, and the “dual-price queuing” policy presented in [35].
4.4.1 Data Description and Experiments Setup
Data Description. The data we are using is obtained from the HMIS dataset and includes
information about people experiencing homelessness and how they interacted with the system
from around January 2015 to December 2019 which includes 63,764 cases. Each interaction,
called an enrollment, shows when a person got a specific treatment among the four possible
options: No Treatment, RRH, PSH (Tenant-Based), and PSH (Site-Based). The capacities
of each one of the treatments, as a proportion of the population, are 100%, 14.2%, 4.1%, and
4.6%, respectively.

Figure 4.2: Decision complexity of different methods. We show the complexity of optimal
prescriptive trees of different depths, with different treatment assignment strategies and
different sets of branching features. We also show the complexity of the historical policy and
the dual-price queuing policy.

For each individual within the dataset, their responses to the administered VI-SPDAT
surveys, used to assess individuals’ vulnerability, are recorded. These surveys,
consisting of 27 questions, serve as valuable tools for collecting information on individual
covariates, encompassing details such as disabilities and prior housing history at the time of
assessment. We focus on fairness, especially concerning race, as mentioned earlier. The
groups we consider are Black or African American, Hispanic, White, and Other, which
includes all other races. In this context, people who are not White are considered minority
groups, and we aim to ensure that everyone receives fair treatments and outcomes, especially
in light of past discrimination.
Outcome of Interest. We adopt the same outcome definition as outlined in [35]. We consider
an individual to have returned to homelessness if they become enrolled in “emergency shelter”,
“safe haven”, or “street outreach” subsequent to their first intervention. Accordingly, we
define a positive outcome (Y = 1) as not observing a return to homelessness within two
years of the intervention, and Y = 0 otherwise.
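A sketch of how this labeling rule might be applied with pandas; the column names person_id, enroll_date, and project_type describe a hypothetical schema (not the actual HMIS field names), and the earliest enrollment is treated as the first intervention as a simplification:

```python
import pandas as pd

RETURN_TYPES = {"emergency shelter", "safe haven", "street outreach"}

def label_outcomes(enrollments):
    """Y = 1 if no return to homelessness within two years of the first
    intervention, Y = 0 otherwise."""
    enrollments = enrollments.sort_values("enroll_date")
    labels = {}
    for pid, grp in enrollments.groupby("person_id"):
        first = grp.iloc[0]
        later = grp.iloc[1:]
        returned = (
            later["project_type"].isin(RETURN_TYPES)
            & (later["enroll_date"]
               <= first["enroll_date"] + pd.DateOffset(years=2))
        ).any()
        labels[pid] = 0 if returned else 1
    return pd.Series(labels, name="Y")
```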
Experiments Setup. We partitioned our data into training and testing sets. The model is
trained using data from 2015 to 2017, while its performance evaluation utilized data from
2018 to 2019.
Given that our MIO formulations operate specifically with binary features, we applied a
process similar to that outlined in Chapter 2 to convert categorical and continuous features into
binary ones. The resulting feature count stood at 180, which posed computational challenges
for solving the MIO formulations. Hence, we experimented with two sets of branching
features:
• The set of top 20 predictors of the outcome. These features were identified using the
Python-based sklearn feature selection method (see the sketch after this list). Experimenting
with varying feature counts (20, 30, 40, 60), we found that using 20 features provided
acceptable prediction accuracy while keeping the dataset reasonably compact. It is important
to note that, among the 180 features initially considered, only 40 significantly predicted the
outcome. From these 40, we selected the best 20 features to expedite the computations
further.
• The set of estimated counterfactual outcomes. Following the results discussed in
[157], attaining statistical parity in outcome requires focusing only on the effects of
treatment outcomes and protected characteristics in the treatment assignment strategy.
Therefore, in our experiments, we explored a class of prescriptive trees that make
branching decisions based solely on the estimated treatment outcomes, referred to as νˆ
in Section 4.1.1.
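The binarization and top-20 selection steps might look as follows with scikit-learn; the use of get_dummies and mutual-information scoring is illustrative (continuous features would additionally be thresholded as in Chapter 2), and the column names are assumptions:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def binarize_and_select(df, outcome_col="Y", k=20):
    """One-hot encode covariates into binary indicators, then keep the
    top-k predictors of the outcome."""
    y = df[outcome_col]
    X = pd.get_dummies(df.drop(columns=[outcome_col]))  # binary indicators
    selector = SelectKBest(mutual_info_classif, k=k).fit(X, y)
    return X.loc[:, selector.get_support()]
```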
For each set of features mentioned above, we explore two scenarios: one in which we include
the protected features among the branching features and one in which we exclude them.
Additionally, we conducted experiments with different tree depths (2, 3, 4) and investigated
two treatment assignment strategies: randomized and deterministic. For our analysis we
only adopted the direct method objective (problem (4.6)), as the dual-price queuing method
from [35] also uses the direct method to estimate the counterfactual outcomes. Our analysis
considers statistical parity in treatment allocation and statistical parity in outcomes separately;
we did not consider both fairness notions together due to the findings in [157], which
demonstrate their incompatibility. The Race attribute is the protected variable of interest,
and we varied the fairness threshold δ = 0.01, . . . , 0.09. In our experiments,
instead of restricting the absolute value of the difference between protected groups, we impose
that the minority group should be better off by a margin δ. We set a time limit of 3 hours
for solving the MIO formulations.
Counterfactual Outcome Estimations. As we do not observe what would have happened had
individuals received treatments other than the one they actually received, we take a
semi-synthetic approach and use model-generated counterfactuals. We used the same
estimated values as employed in [35]; for a detailed description of how these values are
derived, we refer readers to that paper.
4.4.2 Results
In the following, we discuss the out-of-sample (OOS) performance of our prescriptive trees.
Depth Analysis. Figure 4.3 reveals that an increase in depth generally leads to higher
out-of-sample expected outcomes. However, this improvement diminishes when transitioning
from a depth of 3 to 4. This plateauing effect can be attributed to the larger MIO problem
associated with increased depth, often surpassing our time limit for finding optimal solutions.
The average solving time rises from 849 seconds to 9322 seconds as the depth progresses
from 2 to 4, accompanied by a shift in the optimality gap from 0.00% to 1.28%. While
increasing depth improves out-of-sample expected outcomes on average, the time limit that
we enforce impedes the ability to obtain superior solutions with deeper trees in some
instances. Furthermore, deeper trees inherently tend to sacrifice interpretability and risk
overfitting the data, potentially reducing out-of-sample performance.
Figure 4.3: The effect of increasing depth on the distribution of out-of-sample expected
outcome (left), optimality gap (middle), and solving time (right) across all MIO instances
solved.
Treatment Assignment Strategy Analysis. As shown in Figure 4.4, the randomized
treatment assignment noticeably aids in solving the MIO instances. Notably, within the time
limit, the randomized strategy allows us to find feasible solutions in 45% more instances
compared to the deterministic approach. Additionally, employing the randomized approach
results in a 2.21% increase in expected outcomes. Nevertheless, a trade-off exists between
the performance, as indicated by the expected outcome, and the interpretability of the derived
prescriptive trees. Although the randomized treatment assignment shows better performance,
it is less interpretable, as suggested by the decision complexity metric introduced in
Section 4.3. Hence, the extent to which interpretability can be sacrificed for performance
enhancement rests upon the discretion of policymakers and stakeholders.
Sets of Branching Features Analysis. As depicted in Figure 4.5, utilizing the predicted
counterfactual outcomes as our branching features leads to reduced solving times and smaller
optimality gaps. This approach also enables us to discover feasible solutions in 21% more
Figure 4.4: The effect of different treatment assignment strategies (randomized vs deterministic)
on the distribution of out-of-sample expected outcome (left), optimality gap (middle) and solving
time (right) across all MIO instances solved.
instances compared to using the top 20 predictors of the outcome. Furthermore, employing
the predicted outcomes leads to a 1% increase in expected outcomes. However, at depth 2,
using the best 20 features results in better performance, which also explains the larger
standard deviation of the expected outcome when we use the predicted outcomes. Ultimately,
there is a trade-off between the performance, as indicated by the expected outcome, and the
interpretability of the derived prescriptive trees. Although utilizing the predicted outcomes
demonstrates superior performance on average, it is less interpretable. Therefore, the degree
to which interpretability can be compromised for performance improvement depends on the
preferences of policymakers and stakeholders. For trees of depth 2, however, it is advisable
to adhere to the top 20 features, as this approach proves to be both more efficient and more
interpretable.
Including/Excluding Protected Attributes in the Branching Features Analysis.
While using protected attributes like race or gender in the branching nodes of prescriptive
trees might be considered unethical or even illegal, our analysis reveals an interesting trend.
Specifically, when enforcing fairness constraints, especially with larger fairness margins δ
(as detailed in Section 3.2), the problem can become infeasible and we struggle to identify a
viable solution. When we incorporate these protected attributes, we are able to find feasible
solutions in 200% more instances. Moreover, our analysis suggests that these protected
Figure 4.5: The effect of using different sets of branching features (the best 20 predictors of
outcome vs the predicted counterfactual outcomes) on the distribution of out-of-sample expected
outcome (left), optimality gap (middle) and solving time (right) across all MIO instances solved.
attributes serve as predictors of outcomes, contributing to a 1% increase in expected outcomes
when included.
Figure 4.6: The effect of including/excluding the protected attributes in the branching
features on the distribution of out-of-sample expected outcome (left), optimality gap (middle),
and solving time (right) across all MIO instances solved.
The trade-off between performance and fairness across all methods. In the following,
we compare all variations of our prescriptive trees among themselves, evaluating them alongside
the historical policy and the dual-price queuing policy. Specifically, we analyze these policies
separately concerning fairness in treatment allocation and fairness in outcome.
Figure 4.7 displays all policies enforcing fairness in outcome with protected attributes being
included in the branching features. For improved readability of the figures, we have broken
them down by depth. Each graph showcases the relationship between expected outcome and
discrimination in outcome. We define discrimination in outcome as the maximum difference in
expected outcome for each race compared to the White race. A negative discrimination value
is preferable, indicating that minority groups achieve a higher expected outcome compared to
the majority group, White. We define discrimination in PSH/RRH allocation using a similar
methodology. In the analysis of this figure, we observe that, as discrimination increases,
the expected outcome tends to rise. Less interpretable policies exhibit better performance,
showcasing higher expected outcomes for a given discrimination level (positions skewed
toward the upper left represent more desirable results). However, the variation (best_20,
randomized) appears to be quite competitive compared to the more complex variation
(predicted_outcomes, deterministic). This observation aligns with what we noticed in
Figure 4.4, which highlighted the advantage of randomized treatment assignment over
deterministic assignment. Among all methods, the dual-price queuing policy attains the
highest expected outcome, which is expected as this method searches over all classes of
policies, not only prescriptive trees. However, the prescriptive trees demonstrate a
diverse range of discrimination levels, enabling less discriminatory practices. To highlight
some derived prescriptive trees, we have spotlighted one tree per depth in Figure 4.7, and the
corresponding trees are shown in Figure 4.9. All these trees exhibit diverse interpretability
and consistently surpass the performance of the historical policy, securing higher expected
outcomes and showcasing lower discrimination (positioned in the upper left relative to the
historical policy). Figure 4.8 is analogous to Figure 4.7, except for the scenario where
protected attributes are excluded from the branching features. As previously discussed, when
the protected attribute is omitted, we solve fewer instances, which is also evident in the
figure.
Similarly, Figure 4.10 (resp. Figure 4.11) showcases all policies emphasizing fairness in
allocation, with protected attributes included in (resp. excluded from) the branching features.
As before, we highlight one tree per depth in Figure 4.10, and the corresponding trees
are depicted in Figure 4.12.
Figure 4.7: The out-of-sample expected outcome vs discrimination for all methods enforcing
statistical parity in outcomes for the case where protected attributes are included in the
branching features.
Figure 4.8: The out-of-sample expected outcome vs discrimination for all methods enforcing
statistical parity in outcomes for the case where protected attributes are not included in the
branching features.
Trade-offs: Fairness in Outcome vs. Fairness in Allocation. As highlighted in
[157], which demonstrates the inherent conflict between fairness in allocation and achieving
equitable outcomes, it becomes evident that attaining both simultaneously is an intricate
Figure 4.9: Sample prescriptive trees with statistical parity in outcomes. The numbers
associated with each tree correspond to the highlighted points in Figure 4.7.
challenge. Our own findings align with this theoretical framework. In Figure 4.13, we present
the performance analysis of a depth-2 tree (tree (1) in Figure 4.9). This particular tree
implements a randomized treatment assignment. It utilizes predicted outcomes alongside
protected attributes as branching features to enforce fairness in outcomes, employing a
threshold value of 0.01. Notably, this tree surpasses the historical policy in both performance
metrics and fairness. The observed results demonstrate an anticipated enhancement in the
expected outcome for all racial groups compared to the privileged white group. Nevertheless,
the pursuit of fairness in outcomes inadvertently leads to an imbalanced allocation across
racial groups, notably in the case of RRH.
Figure 4.10: The out-of-sample expected outcome vs discrimination metrics for all methods
enforcing statistical parity in allocation for the case where protected attributes are included in
the branching features.
Conversely, in Figure 4.14, the performance evaluation of another tree (tree (1) in
Figure 4.12) is depicted. This tree, also of depth 2 with randomized treatment assignment,
utilizes predicted outcomes and protected attributes as branching features to enforce fairness
in allocation with a threshold value of 0.03. Here, we achieve our goals in allocation fairness,
yet the outcomes across racial groups show slight discrepancies.
As is evident, there exists a trade-off between fairness in allocation and fairness in outcome,
necessitating that decision-makers carefully tailor their approach according to their specific
needs.
Prescriptive Trees Performance with Respect to Vulnerability Score. An intriguing
observation in both Figures 4.13 and 4.14 is that, unlike the historical data, the policies
do not necessarily allocate PSH resources to individuals with high vulnerability scores. As
mentioned in Section 4.1, the historical policy lacks data-driven or personalized criteria, and
its thresholds are not linked to individual outcomes. Consequently, individuals with high
vulnerability scores may not truly benefit from PSH resources, and our prescriptive trees
shed light on this significant issue.
Figure 4.11: The out-of-sample expected outcome vs discrimination metrics for all methods
enforcing statistical parity in allocation for the case where protected attributes are not included
in the branching features.
Figure 4.12: Sample prescriptive trees with statistical parity in treatment allocation. The
numbers associated with each tree correspond to the highlighted points in Figure 4.10.
Figure 4.13: The out-of-sample performance of tree number 1 (depicted in Figure 4.9) is
displayed. This tree, at a depth of 2, ensures statistical parity in outcomes using a threshold
of 0.01, utilizing predicted outcomes and protected attributes as the branching features. On
the left part, for both the highlighted prescriptive tree and the historical policy we show the
expected outcome across race (top), the proportion of each race receiving PSH (middle), and
RRH (bottom). On the right part, for each level of vulnerability score, the top (resp. bottom)
shows the percentage of all PSH (resp. RRH) resources allocated to individuals within that
score bracket.
Figure 4.14: The out-of-sample performance of tree number 1 (depicted in Figure 4.12) is
displayed. This tree, at a depth of 2, ensures statistical parity in allocation using a threshold
of 0.03, utilizing predicted outcomes and protected attributes as the branching features. On
the left part, for both the highlighted prescriptive tree and the historical policy we show the
expected outcome across race (top), the proportion of each race receiving PSH (middle), and
RRH (bottom). On the right part, for each level of vulnerability score, the top (resp. bottom)
shows the percentage of all PSH (resp. RRH) resources allocated to individuals within that
score bracket.
Chapter 5
Conclusion and Future work
This thesis is situated at the crossroads of operations research and artificial intelligence,
offering contributions that diverge into two main streams. The first stream, titled “Integer
Optimization for Machine Learning in High-Stakes Domains,” delves into the identification
and resolution of several challenges associated with designing predictive models intended for
deployment in high-stakes domains. Its contributions primarily lean towards the theoretical realm, presenting a framework geared towards learning optimal decision trees capable
of both classification and regression tasks. Notably, these trees possess adaptability for
accommodating fairness and interpretability constraints.
In the second stream, titled “Optimization and Causal Inference for Policy-Making,”
this work introduces a data-driven policy derived from observational data. Specifically, it
formulates a decision tree paradigm known as prescriptive trees to address housing allocation
challenges for individuals experiencing homelessness in Los Angeles. Remarkably, these
prescriptive trees outperform the currently deployed policy while emphasizing enhanced
interpretability and fairness.
Despite these advancements, numerous unaddressed challenges persist. Consequently,
the following sections aim to explore research questions within both streams of this thesis,
shedding light on areas that require further investigation.
Integer Optimization for Machine Learning in High-Stakes Domains. In this
stream, I aim to address the following research questions:
• Can we learn strong optimal regression trees? In many high-stakes domains, the label of
interest is real-valued. For example, in the context of policing, we may want to estimate
the crime rate in various neighborhoods to design a policy for allocating police forces.
To address this problem, one could build upon the strong formulation in Chapter 2 and
adjust it such that it can track the error associated with each datapoint throughout
the tree to minimize the total estimation loss.
• Can we learn optimal and sparse risk scores satisfying arbitrary fairness constraints?
Together with decision trees, risk scores are some of the most interpretable ML models:
they correspond to linear functions of the covariates that have few non-zero integer
coefficients. They are often used as an interpretable way to predict risk, e.g., of
recidivism [168] or of returning to homelessness [156]. To address this problem, one
could formulate the problem of learning optimal and fair risk scores as an integer
exponential cone program and solve it via cutting plane approaches.
Optimization and Causal Inference for Policy-Making. In analyses of observational data,
a common assumption is that of unconfoundedness, which requires that all confounders
that influence treatment assignment and outcomes be observed in the data.
However, this assumption often fails to hold in practice, e.g., if those assigning treatments
have access to side information not recorded in the data. In this stream, being able to
design optimal prescriptive and predictive tools that are robust to unobserved confounders is
extremely important. More specifically, we can ask the following questions:
• Can we learn optimal sparse risk scores that are robust against potential unobserved
confounders? For example, the problem of prioritizing people experiencing homelessness for
scarce housing resources based on their vulnerability under no housing can be cast in
this form. Indeed, historical housing assignments are typically made not only based on
observed covariates but also based on private information the stakeholders have shared
with the matchers that assign housing.
• How can we learn optimal prescriptive trees that are robust against potential unobserved
confounders? Being able to design robust personalized policies is very important, as in
the presence of unobserved confounders we may prescribe suboptimal treatments.
To address these problems, one could capture the effect of the unobserved confounders
in an uncertainty set that is defined over the propensity scores and based on methods from
sensitivity analysis in causal inference.
References
1. Rahmattalabi, A., Vayanos, P., Dullerud, K. & Rice, E. Learning resource allocation
policies from observational data with an application to homeless services delivery in
Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency
(2022), 1240–1256.
2. Byrnes, N. Artificial intolerance. MIT Tech. Review (2016).
3. Rudin, C. Stop explaining black box machine learning models for high stakes decisions
and use interpretable models instead. Nature Machine Intelligence 1, 206–215 (2019).
4. Linardatos, P., Papastefanopoulos, V. & Kotsiantis, S. Explainable ai: A review of
machine learning interpretability methods. Entropy 23, 18 (2020).
5. Breiman, L. Classification and regression trees tech. rep. (1984).
6. Quinlan, J. R. Induction of decision trees. Machine learning 1, 81–106 (1986).
7. Quinlan, J. R. C4. 5: programs for machine learning (Elsevier, 2014).
8. Wilson, Z. T. & Sahinidis, N. V. The ALAMO approach to machine learning. Computers
& Chemical Engineering 106, 785–795 (2017).
9. Atamtürk, A. & Gómez, A. Rank-one convexification for sparse regression arXiv:1901.10334.
2019.
10. Bertsimas, D., Van Parys, B., et al. Sparse high-dimensional regression: Exact scalable
algorithms and phase transitions. The Annals of Statistics 48, 300–323 (2020).
11. Hazimeh, H., Mazumder, R. & Saab, A. Sparse regression at scale: Branch-and-bound
rooted in first-order optimization arXiv:2004.06152. 2020.
12. Gómez, A. & Prokopyev, O. A. A mixed-integer fractional optimization approach to
best subset selection. INFORMS Journal on Computing 176, 957–969 (2021).
13. Fischetti, M. & Jo, J. Deep neural networks and mixed integer linear optimization.
Constraints 23, 296–309 (2018).
14. Khalil, E. B., Gupta, A. & Dilkina, B. Combinatorial attacks on binarized neural
networks arXiv:1810.03538. 2018.
15. Tjandraatmadja, C., Anderson, R., Huchette, J., Ma, W., Patel, K. & Vielma, J. P.
The convex relaxation barrier, revisited: Tightened single-neuron relaxations for neural
network verification arXiv:2006.14076. 2020.
16. Dey, S. S., Mazumder, R. & Wang, G. A convex integer programming approach for
optimal sparse PCA arXiv:1810.09062. 2018.
17. Bertsimas, D., Cory-Wright, R. & Pauphilet, J. Solving Large-Scale Sparse PCA to
Certifiable (Near) Optimality arXiv:2005.05195. 2020.
18. Bertsimas, D. & Dunn, J. Optimal classification trees. Machine Learning 106, 1039–
1082 (2017).
19. Verwer, S. & Zhang, Y. Learning optimal classification trees using a binary linear
program formulation in 33rd AAAI Conference on Artificial Intelligence (2019).
20. Elmachtoub, A. N., Liang, J. C. N. & McNellis, R. Decision Trees for Decision-Making
under the Predict-then-Optimize Framework. arXiv preprint arXiv:2003.00360 (2020).
21. Mišić, V. V. Optimization of tree ensembles. Operations Research 68, 1605–1624 (2020).
22. Dwork, C., Hardt, M., Pitassi, T., Reingold, O. & Zemel, R. Fairness Through Awareness
in Proceedings of the 3rd Innovations in Theoretical Computer Science Conference
(ACM, Cambridge, Massachusetts, 2012), 214–226. doi:10.1145/2090236.2090255.
23. Hardt, M., Price, E. & Srebro, N. Equality of opportunity in supervised learning in
Advances in neural information processing systems (2016), 3315–3323.
24. Kamiran, F., Calders, T. & Pechenizkiy, M. Discrimination aware decision tree learning
in 2010 IEEE International Conference on Data Mining (2010), 869–874.
25. Zhang, W. & Ntoutsi, E. Faht: an adaptive fairness-aware decision tree classifier. arXiv
preprint arXiv:1907.07237 (2019).
26. Grari, V., Ruf, B., Lamprier, S. & Detyniecki, M. Fair Adversarial Gradient Tree
Boosting in 2019 IEEE International Conference on Data Mining (ICDM) (2019),
1060–1065.
27. Ranzato, F., Urban, C. & Zanella, M. Fair Training of Decision Tree Classifiers. arXiv
preprint arXiv:2101.00909 (2021).
28. Verwer, S. & Zhang, Y. Learning Decision Trees with Flexible Constraints and Objectives
Using Integer Optimization in 14th International CPAIOR Conference (2017).
29. Aghaei, S., Gómez, A. & Vayanos, P. Strong Optimal Classification Trees. Accepted
for publication at Operations Research. URL: http://www.optimization-online.org/DB_FILE/2021/01/8220.pdf (2023).
30. Aghaei, S., Azizi, M. J. & Vayanos, P. Learning optimal and fair decision trees for
non-discriminative decision-making in Proceedings of the AAAI Conference on Artificial
Intelligence 33 (2019), 1418–1426.
31. Bertsimas, D., Farias, V. F. & Trichakis, N. Fairness, efficiency, and flexibility in organ
allocation for kidney transplantation. Operations Research 61, 73–87 (2013).
32. Dickerson, J. & Sandholm, T. FutureMatch: Combining human value judgments and
machine learning to match in dynamic environments in Proceedings of the AAAI
Conference on Artificial Intelligence 29 (2015).
33. Azizi, M. J., Vayanos, P., Wilder, B., Rice, E. & Tambe, M. Designing fair, efficient,
and interpretable policies for prioritizing homeless youth for housing resources in
International Conference on the Integration of Constraint Programming, Artificial
Intelligence, and Operations Research (2018), 35–51.
34. Kube, A., Das, S. & Fowler, P. J. Allocating interventions based on predicted outcomes:
A case study on homelessness services in Proceedings of the AAAI Conference on
Artificial Intelligence 33 (2019), 622–629.
35. Tang, B., Koçyiğit, Ç., Rice, E. & Vayanos, P. Learning Optimal and Fair Policies for
Online Allocation of Scarce Societal Resources from Data Collected in Deployment.
arXiv preprint arXiv:2311.13765 (2023).
36. Breiman, L. Bagging predictors. Machine learning 24, 123–140 (1996).
37. Breiman, L. Random forests. Machine learning 45, 5–32 (2001).
38. Liaw, A. & Wiener, M. Classification and regression by randomForest. R news 2, 18–22
(2002).
39. Bertsimas, D. & Stellato, B. The Voice of Optimization. arXiv preprint arXiv:1812.09991
(2018).
40. Intrator, J., Allan, E. & Palmer, M. Decision tree for the management of substance-abusing psychiatric patients. Journal of substance abuse treatment 9, 215–220 (1992).
41. Olanow, C. W., Watts, R. L. & Koller, W. C. An algorithm (decision tree) for the
management of Parkinson’s disease (2001):: Treatment Guidelines. Neurology 56, S1–
S88 (2001).
42. Bertsimas, D., Kung, J., Trichakis, N., Wang, Y., Hirose, R. & Vagefi, P. A. Development
and validation of an optimized prediction of mortality for candidates awaiting liver
transplantation. American Journal of Transplantation 19, 1109–1118 (2019).
43. Blurock, E. S. Automatic learning of chemical concepts: Research octane number and
molecular substructures. Computers & chemistry 19, 91–99 (1995).
44. Chan, H., Rice, E., Vayanos, P., Tambe, M. & Morton, M. From Empirical Analysis
to Public Policy: Evaluating Housing Systems for Homeless Youth in Joint European
Conference on Machine Learning and Knowledge Discovery in Databases (2018), 69–85.
45. Chen, X., Zhu, C.-C. & Yin, J. Ensemble of decision tree reveals potential miRNA-disease associations. PLoS computational biology 15, e1007209 (2019).
46. Shaikhina, T., Lowe, D., Daga, S., Briggs, D., Higgins, R. & Khovanova, N. Decision
tree and random forest models for outcome prediction in antibody incompatible kidney
transplantation. Biomedical Signal Processing and Control 52, 456–462 (2019).
47. Hyafil, L. & Rivest, R. L. Constructing Optimal Binary Search Trees is NP Complete.
Information Processing Letters (1976).
48. Therneau, T., Atkinson, B., Ripley, B. & Ripley, M. B. Package ‘rpart’ (2015).
49. Kuhn, M., Weston, S., Culp, M., Coulter, N. & Quinlan, R. Package ‘C50’ 2018.
50. Nijssen, S. & Fromont, E. Optimal constraint-based decision tree induction from itemset
lattices. Data Mining and Knowledge Discovery 21, 9–51 (2010).
51. Narodytska, N., Ignatiev, A., Pereira, F. & Marques-Silva, J. Learning Optimal Decision
Trees with SAT. in IJCAI (2018), 1362–1368.
52. Verhaeghe, H., Nijssen, S., Pesant, G., Quimper, C.-G. & Schaus, P. Learning optimal
decision trees using constraint programming in The 25th International Conference on
Principles and Practice of Constraint Programming (CP2019) (2019).
53. Hu, X., Rudin, C. & Seltzer, M. Optimal sparse decision trees. arXiv preprint arXiv:1904.12847
(2019).
54. Lin, J., Zhong, C., Hu, D., Rudin, C. & Seltzer, M. Generalized and scalable optimal
sparse decision trees in International Conference on Machine Learning (2020), 6150–
6160.
55. Demirović, E., Lukina, A., Hebrard, E., Chan, J., Bailey, J., Leckie, C., et al. MurTree:
Optimal Classification Trees via Dynamic Programming and Search. arXiv preprint
arXiv:2007.12652 (2020).
56. Demirović, E. & Stuckey, P. J. Optimal Decision Trees for Nonlinear Metrics. arXiv
preprint arXiv:2009.06921 (2020).
57. Nijssen, S., Schaus, P., et al. Learning Optimal Decision Trees Using Caching Branch-and-Bound Search in Thirty-Fourth AAAI Conference on Artificial Intelligence (2020).
58. Blanquero, R., Carrizosa, E., Molero-Río, C. & Romero Morales, D. Optimal randomized
classification trees. Computers & Operations Research 132, 105281 (2021).
59. Detassis, F., Lombardi, M. & Milano, M. Teaching the Old Dog New Tricks: Supervised
Learning with Constraints. arXiv preprint arXiv:2002.10766 (2020).
60. Günlük, O., Kalagnanam, J., Menickelly, M. & Scheinberg, K. Optimal decision trees
for categorical data via integer programming. arXiv preprint arXiv:1612.03225 (2018).
61. CPLEX, I. I. V12. 1: User’s Manual for CPLEX. International Business Machines
Corporation 46, 157 (2009).
62. Gurobi, I. Gurobi optimizer reference manual. URL http://www. gurobi. com (2015).
63. Bixby, R. E. A brief history of linear and mixed-integer programming computation.
Documenta Mathematica, 107–121 (2012).
64. Ciocan, D. & Miši´c, V. Interpretable optimal stopping (2018).
65. Mišić, V. V. Optimization of tree ensembles. arXiv preprint arXiv:1705.10883 (2017).
66. Biggs, M. & Hariss, R. Optimizing objective functions determined from random forests.
Available at SSRN 2986630 (2018).
67. Bertsimas, D. & Dunn, J. Machine learning under a modern optimization lens (Dynamic
Ideas LLC, 2019).
68. Carrizosa, E., Molero-Río, C. & Romero Morales, D. Mathematical optimization in
classification and regression trees. TOP 29, 5–33 (2021).
69. Dong, H., Chen, K. & Linderoth, J. Regularization vs. relaxation: A conic optimization
perspective of statistical variable selection. arXiv preprint arXiv:1510.06083 (2015).
70. Atamtürk, A., Gómez, A. & Han, S. Sparse and smooth signal estimation: Convexification of ℓ0 formulations. arXiv preprint arXiv:1811.02655 (2018).
71. Bienstock, D., Muñoz, G. & Pokutta, S. Principled deep neural network training
through linear programming. arXiv preprint arXiv:1810.03218 (2018).
72. Xie, W. & Deng, X. Scalable Algorithms for the Sparse Ridge Regression. arXiv
preprint arXiv:1806.03756 (2018).
73. Gómez, A. Outlier detection in time series via mixed-integer conic quadratic optimization. http://www.optimization-online.org/DB_HTML/2019/11/7488.html (2019).
74. Anderson, R., Huchette, J., Ma, W., Tjandraatmadja, C. & Vielma, J. P. Strong
mixed-integer programming formulations for trained neural networks. Mathematical
Programming, 1–37 (2020).
75. Gade, D., Küçükyavuz, S. & Sen, S. Decomposition algorithms with parametric Gomory
cuts for two-stage stochastic integer programs. Mathematical Programming 144, 39–64
(2014).
76. Liu, X., Küçükyavuz, S. & Luedtke, J. Decomposition algorithms for two-stage chance-constrained programs. Mathematical Programming 157, 219–243 (2016).
77. Guo, C., Bodur, M., Aleman, D. M. & Urbach, D. R. Logic-based Benders Decomposition and Binary Decision Diagram Based Approaches for Stochastic Distributed
Operating Room Scheduling. arXiv preprint arXiv:1907.13265 (2019).
78. MacNeil, M. & Bodur, M. Integer Programming, Constraint Programming, and Hybrid Decomposition Approaches to Discretizable Distance Geometry Problems. arXiv
preprint arXiv:1907.12468 (2019).
79. Gangammanavar, H., Liu, Y. & Sen, S. Stochastic decomposition for two-stage stochastic linear programs with random cost coefficients. INFORMS Journal on Computing
(2020).
80. Liu, J. & Sen, S. Asymptotic results of stochastic decomposition for two-stage stochastic
quadratic programming. SIAM Journal on Optimization 30, 823–852 (2020).
81. Benders, J. F. Partitioning procedures for solving mixed-variables programming problems. Numerische mathematik 4, 238–252 (1962).
82. Hu, T. C. Multi-commodity network flows. Operations research 11, 344–360 (1963).
83. Okada, S., Ohzeki, M. & Taguchi, S. Efficient partition of integer optimization problems
with one-hot encoding. Scientific reports 9, 1–12 (2019).
84. Vazirani, V. V. Approximation algorithms (Springer Science & Business Media, 2013).
85. Goldberg, A. V. & Tarjan, R. E. A new approach to the maximum-flow problem.
Journal of the ACM (JACM) 35, 921–940 (1988).
86. Hochbaum, D. S. The pseudoflow algorithm: A new algorithm for the maximum-flow
problem. Operations research 56, 992–1009 (2008).
87. Lozano, L. & Smith, J. C. A binary decision diagram based algorithm for solving a class
of binary two-stage stochastic programs. Mathematical Programming 191, 381–404
(2022).
88. Magnanti, T. L. & Wong, R. T. Accelerating Benders decomposition: Algorithmic
enhancement and model selection criteria. Operations research 29, 464–484 (1981).
89. Blanquero, R., Carrizosa, E., Molero-Río, C. & Romero Morales, D. Sparsity in optimal
randomized classification trees. European Journal of Operational Research 284, 255–272
(2020).
90. Bishop, C. M. Pattern recognition and machine learning (Springer, 2006).
91. Lombardi, M., Baldo, F., Borghesi, A. & Milano, M. An Analysis of Regularized
Approaches for Constrained Machine Learning. arXiv preprint arXiv:2005.10674 (2020).
92. Kirschbaum, D., Adler, R., Hong, Y. & Lerner-Lam, A. Evaluation of a preliminary
satellite-based landslide hazard algorithm using global landslide inventories. Natural
Hazards & Earth System Sciences 9 (2009).
93. Wei, W., Li, J., Cao, L., Ou, Y. & Chen, J. Effective detection of sophisticated online
banking fraud on extremely imbalanced data. World Wide Web 16, 449–475 (2013).
94. Khalilia, M., Chakraborty, S. & Popescu, M. Predicting disease risks from highly
imbalanced data using random forest. BMC medical informatics and decision making
11, 51 (2011).
95. Mower, J. P. PREP-Mt: predictive RNA editor for plant mitochondrial genes. BMC
bioinformatics 6, 96 (2005).
96. Bertsimas, D., Dunn, J., Pawlowski, C. & Zhuo, Y. D. Robust classification. INFORMS
Journal on Optimization 1, 2–34 (2019).
97. Vos, D. & Verwer, S. Robust Optimal Classification Trees Against Adversarial Examples
arXiv preprint arXiv:2109.03857. 2021.
98. Rudin, C. Predictive policing: using machine learning to detect patterns of crime.
Wired Magazine (2013).
99. Miller, C. Can an algorithm hire better than a human? New York Times (2015).
100. Angwin, J., Larson, J., Mattu, S. & Kirchne, L. Machine Bias. ProPublica (2016).
101. Barocas, S. & Selbst, A. D. Big data’s disparate impact. Cal. L. Rev. 104, 671 (2016).
102. Barocas, S., Hardt, M. & Narayanan, A. Fairness and Machine Learning http://www.
fairmlbook.org (fairmlbook.org, 2019).
103. Corbett-Davies, S. & Goel, S. The measure and mismeasure of fairness: A critical
review of fair machine learning. arXiv preprint arXiv:1808.00023 (2018).
104. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K. & Galstyan, A. A survey on bias
and fairness in machine learning. arXiv preprint arXiv:1908.09635 (2019).
105. Caton, S. & Haas, C. Fairness in Machine Learning: A Survey. arXiv preprint arXiv:2010.04053
(2020).
106. Gajane, P. & Pechenizkiy, M. On formalizing fairness in prediction with machine
learning. arXiv preprint arXiv:1710.03184 (2017).
107. Chen, J., Kallus, N., Mao, X., Svacha, G. & Udell, M. Fairness under unawareness:
Assessing disparity when protected class is unobserved in Proceedings of the conference
on fairness, accountability, and transparency (2019), 339–348.
108. Corbett-Davies, S., Pierson, E., Feller, A., Goel, S. & Huq, A. Algorithmic decision
making and the cost of fairness in Proceedings of the 23rd acm sigkdd international
conference on knowledge discovery and data mining (2017), 797–806.
109. Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism
prediction instruments. Big data 5, 153–163 (2017).
110. Dua, D. & Graff, C. UCI Machine Learning Repository 2017.
111. Jo, N., Aghaei, S., Gómez, A. & Vayanos, P. Learning Optimal Prescriptive Trees from
Observational Data. Under second round of review at Management Science (after major
revision); short version appeared at AAAI Workshop on AI and Behavior Change.
URL: https://arxiv.org/pdf/2108.13628.pdf (2023).
112. Justin, N., Aghaei, S., Gómez, A. & Vayanos, P. Optimal Robust Classification Trees.
Under review at Operations Research; short version appeared at AAAI Workshop on
Adversarial Machine Learning and Beyond. URL: https://openreview.net/pdf?id=
HbasA9ysA3 (2021).
113. Altman, A. in The Stanford Encyclopedia of Philosophy (ed Zalta, E. N.) Winter 2016
(Metaphysics Research Lab, Stanford University, 2016).
114. Finley, K. Amazon’s giving away the AI behind its product recommendations. Wired
Magazine (2016).
115. Kumar, R., Misra, V., Walraven, J., Sharan, L., Azarnoush, B., Chen, B., et al. Data
Science and the Art of Producing Entertainment at Netflix. Medium (2018).
116. Squires, G. D. Racial Profiling, Insurance Style: Insurance Redlining and the Uneven Development of Metropolitan Areas. Journal of Urban Affairs 25, 391–410. doi:10.1111/1467-9906.t01-1-00168 (2003).
117. LaCour-Little, M. Discrimination in Mortgage Lending: A Critical Review of the
Literature. Journal of Real Estate Literature 7, 15–50. doi:10.1023/A:1008616203852
(1999).
118. Stoll, M. A., Raphael, S. & Holzer, H. J. Black Job Applicants and the Hiring Officer’s
Race. ILR Review 57, 267–287 (2004).
119. Kuhn, P. Sex Discrimination in Labor Markets: The Role of Statistical Evidence: Reply.
American Economic Review 80, 290–97 (1990).
120. Pedreshi, D., Ruggieri, S. & Turini, F. Discrimination-aware Data Mining in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, Las Vegas, Nevada, USA, 2008), 560–568. doi:10.1145/1401890.1401959.
121. Adler, P., Falk, C., Friedler, S. A., Nix, T., Rybeck, G., Scheidegger, C., et al. Auditing Black-box Models for Indirect Influence. Knowl. Inf. Syst. 54, 95–122. doi:10.1007/s10115-017-1116-3 (2018).
122. Kamiran, F. & Calders, T. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems 33, 1–33. doi:10.1007/s10115-011-0463-8 (2012).
123. Kamiran, F. & Calders, T. Classifying without discriminating in 2009 2nd International Conference on Computer, Control and Communication (2009), 1–6. doi:10.1109/IC4.2009.4909197.
124. Luong, B. T., Ruggieri, S. & Turini, F. k-NN As an Implementation of Situation Testing
for Discrimination Discovery and Prevention in Proceedings of the 17th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining (ACM, San Diego,
California, USA, 2011), 502–510. doi:10.1145/2020408.2020488.
125. Hardt, M., Price, E. & Srebro, N. Equality of Opportunity in Supervised Learning.
ArXiv e-prints (2016).
126. Fish, B., Kun, J. & Lelkes, Á. D. A Confidence-Based Approach for Balancing Fairness
and Accuracy. ArXiv (2016).
127. Zemel, R., Wu, Y., Swersky, K., Pitassi, T. & Dwork, C. Learning Fair Representations
in Proceedings of the 30th International Conference on Machine Learning (eds Dasgupta,
S. & McAllester, D.) 28 (PMLR, Atlanta, Georgia, USA, 2013), 325–333.
128. Berk, R., Heidari, H., Jabbari, S., Joseph, M., Kearns, M., Morgenstern, J., et al. A
Convex Framework for Fair Regression. ArXiv e-prints (2017).
129. Calders, T. & Verwer, S. Three naive Bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery 21, 277–292. doi:10.1007/s10618-010-0190-x (2010).
130. Kamiran, F., Calders, T. & Pechenizkiy, M. Discrimination Aware Decision Tree
Learning in 2010 IEEE International Conference on Data Mining (2010), 869–874.
doi:10.1109/ICDM.2010.50.
131. Bertsimas, D., King, A. & Mazumder, R. Best Subset Selection via a Modern Optimization Lens. ArXiv (2015).
132. Lou, Y., Caruana, R., Gehrke, J. & Hooker, G. Accurate Intelligible Models with Pairwise
Interactions in Proceedings of the 19th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (ACM, Chicago, Illinois, USA, 2013), 623–631.
doi:10.1145/2487575.2487579.
133. Mazumder, R. & Radchenko, P. The Discrete Dantzig Selector: Estimating Sparse
Linear Models via Mixed Integer Linear Optimization. ArXiv e-prints (2015).
134. Bertsimas, D. & Dunn, J. Optimal classification trees. Machine Learning 106, 1039–
1082. doi:10.1007/s10994-017-5633-9 (2017).
135. Azizi, M., Vayanos, P., Wilder, B., Rice, E. & Tambe, M. Designing Fair, Efficient,
and Interpretable Policies for Prioritizing Homeless Youth for Housing Resources in
15th International CPAIOR Conference (2018).
136. Wang, T., Rudin, C., Doshi-Velez, F., Liu, Y., Klampfl, E. & MacNeille, P. A Bayesian
Framework for Learning Rule Sets for Interpretable Classification. Journal of Machine
Learning Research 18, 1–37 (2017).
137. Letham, B., Rudin, C., McCormick, T. H. & Madigan, D. Interpretable classifiers using
rules and Bayesian analysis: Building a better stroke prediction model. The Annals of
Applied Statistics 9, 1350–1371 (2015).
138. Lakkaraju, H., Bach, S. H. & Jure, L. Interpretable decision sets: A joint framework
for description and prediction in Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (2016), 1675–1684.
139. Lou, Y., Caruana, R., Gehrke, J. & Hooker, G. Accurate intelligible models with pairwise
interactions in Proceedings of the 19th ACM SIGKDD international conference on
Knowledge discovery and data mining (2013), 623–631.
140. Valdes, G., Luna, J. M., Eaton, E. & Simone, C. B. MediBoost: a Patient Stratification
Tool for Interpretable Decision Making in the Era of Precision Medicine. Scientific
reports 6 (2016).
141. Huang, L.-T., Gromiha, M. M. & Ho, S.-Y. iPTREE-STAB: interpretable decision tree
based method for predicting protein stability changes upon mutations. Bioinformatics
23, 1292–1293 (2007).
142. Che, Z., Purushotham, S., Khemani, R. & Liu, Y. Interpretable deep models for icu
outcome prediction in AMIA Annual Symposium Proceedings 2016 (2016), 371.
143. Frank, E., Wang, Y., Inglis, S., Holmes, G. & Witten, I. H. Using Model Trees for
Classification. Machine Learning 32, 63–76 (1998).
144. Zafar, M. B., Valera, I., Rodriguez, M. G. & Gummadi, K. P. Fairness Beyond Disparate
Treatment & Disparate Impact: Learning Classification without Disparate Mistreatment.
arXiv (2017).
145. Bilal Zafar, M., Valera, I., Gomez Rodriguez, M. & Gummadi, K. P. Fairness Beyond Disparate Treatment & Disparate Impact: Learning Classification without Disparate Mistreatment. ArXiv e-prints (2016).
146. Nadaraya, E. A. On estimating regression. Theory of Probability & Its Applications 9,
141–142 (1964).
147. Dheeru, D. & Karra Taniskidou, E. UCI Machine Learning Repository 2017.
148. Yeh, I.-C. & Lien, C.-h. The Comparisons of Data Mining Techniques for the Predictive
Accuracy of Probability of Default of Credit Card Clients. Expert Syst. Appl. 36,
2473–2480. doi:10.1016/j.eswa.2007.12.020 (2009).
149. Kohavi, R. Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-tree Hybrid
in Proceedings of the Second International Conference on Knowledge Discovery and
Data Mining (AAAI Press, Portland, Oregon, 1996), 202–207.
150. Corbett-Davies, S., Pierson, E., Feller, A., Goel, S. & Huq, A. Algorithmic decision
making and the cost of fairness. ArXiv e-prints (2017).
151. Redmond, M. & Baveja, A. A data-driven software tool for enabling cooperative
information sharing among police departments. European Journal of Operational
Research 141, 660–678 (2002).
152. Dunning, I., Huchette, J. & Lubin, M. JuMP: A Modeling Language for Mathematical
Optimization. SIAM Review 59, 295–320. doi:10.1137/15M1020575 (2017).
153. Authority, L. A. H. S. 2023 Greater Los Angeles Homeless Count URL: https://www.lahsa.org/news?article=927-lahsa-releases-results-of-2023-greater-los-angeles-homeless-count. 2023.
154. Authority, L. A. H. S. 2023 Housing Inventory Count URL: https://www.lahsa.org/
documents?id=7698-2023-housing-inventory-count. 2023.
155. Authority, L. A. H. S. Report and Recommendations of the Ad Hoc Committee on Black People Experiencing Homelessness URL: https://www.lahsa.org/documents?id=2823-report-and-recommendations-of-the-ad-hoc-committee-on-black-people-experiencing-homelessness.pdf. 2018.
156. Vulnerability Index-Service Prioritization Decision Assistance Tool (VI-SPDAT) URL: https://pehgc.org/wp-content/uploads/2016/09/VI-SPDAT-v2.01-Single-US-Fillable.pdf.
157. Jo, N., Tang, B., Dullerud, K., Aghaei, S., Rice, E. & Vayanos, P. Fairness in contextual
resource allocation systems: Metrics and incompatibility results in Proceedings of the
AAAI Conference on Artificial Intelligence 37 (2023), 11837–11846.
158. Horvitz, D. G. & Thompson, D. J. A generalization of sampling without replacement
from a finite universe. Journal of the American statistical Association 47, 663–685
(1952).
159. Dudík, M., Langford, J. & Li, H. Doubly robust policy evaluation and learning in
Proceedings of the 28th International Conference on Machine Learning, ICML 2011
(2011).
160. Kallus, N. Recursive partitioning for personalization using observational data in 34th
International Conference on Machine Learning, ICML 2017 4 (2017).
161. Bertsimas, D., Dunn, J. & Mundru, N. Optimal Prescriptive Trees. INFORMS Journal
on Optimization 1. doi:10.1287/ijoo.2018.0005 (2019).
162. Jo, N., Aghaei, S., Benson, J., Gómez, A. & Vayanos, P. Learning Optimal Fair Decision
Trees: Trade-offs Between Interpretability, Fairness, and Accuracy. In proceedings of
the 6th AAAI/ACM Conference on AI, Ethics, and Society (AIES). URL: https:
//arxiv.org/pdf/2201.09932.pdf (2023).
163. Dwork, C., Hardt, M., Pitassi, T., Reingold, O. & Zemel, R. Fairness through awareness
in Proceedings of the 3rd innovations in theoretical computer science conference (2012),
214–226.
164. Zemel, R., Wu, Y., Swersky, K., Pitassi, T. & Dwork, C. Learning fair representations
in International conference on machine learning (2013), 325–333.
165. Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V. & Kalai, A. T. Man is
to computer programmer as woman is to homemaker? debiasing word embeddings.
Advances in neural information processing systems 29 (2016).
166. Zafar, M. B., Valera, I., Gomez Rodriguez, M. & Gummadi, K. P. Fairness beyond
disparate treatment & disparate impact: Learning classification without disparate mistreatment in Proceedings of the 26th international conference on world wide web (2017),
1171–1180.
167. Olfat, M. & Aswani, A. Spectral algorithms for computing fair support vector machines
in International Conference on Artificial Intelligence and Statistics (2018), 1933–1942.
168. Latessa, E., Smith, P., Lemke, R., Makarios, M. & Lowenkamp, C. Creation and
validation of the Ohio risk assessment system: Final report. Center for Criminal
Justice Research, School of Criminal Justice, University of Cincinnati, Cincinnati, OH.
Retrieved from http://www.ocjs.ohio.gov/ORAS_FinalReport.pdf (2009).
Appendices
A Appendix to Chapter 2
A.1 Proof of Theorem 2
The proof proceeds in three steps. We fix i ∈ I. We derive the specific structure of the cuts
associated with datapoint i generated by our procedure. We then provide |B×F|+|L×K|+|I|
affinely independent points that lie in conv(H≤) and at each of which the cut generated holds
with equality. Since the choice of i ∈ I is arbitrary and since the cuts generated by our
procedure are valid (by construction), this will conclude the proof.
Given a set M and a point m ∈ M, we use M \ m as a shorthand for M \ {m}. Finally,
with a slight abuse of notation, we let eij be a vector (whose dimensions will be clear from
the context) with a 1 in coordinate (i, j) and 0 elsewhere.
Fix i ∈ I and let (b̄, w̄, ḡ) ∈ H≤ be integral. Given j ∈ I (possibly different from i), let l(j) ∈ L be the leaf node of the tree defined by (b̄, w̄) that datapoint j is assigned to. Given n ∈ B, let f(n) ∈ F be the feature selected for branching at node n under (b̄, w̄), i.e., b̄_{n f(n)} = 1.
We now derive the structure of the cuts (2.4b) generated by Algorithm 1 when (b̄, w̄, ḡ) is input. A minimum cut is returned by Algorithm 1 if and only if s and t belong to different connected components in the graph G^i(b̄, w̄). Under this assumption, the connected component S constructed in Algorithm 1 forms a path from s to n_d = l(i) ∈ L, i.e., S = {s, n_1, n_2, . . . , n_d}. The cut-set C(S) then corresponds to the arcs adjacent to nodes in S that do not belong to the path formed by S. Therefore, the cut (2.4b) returned by Algorithm 1 reads
g^i ≤ w^{l(i)}_{y^i} + Σ_{n∈S} Σ_{f∈F: x^i_f ≠ x^i_{f(n)}} b_{nf}.    (A.1)
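To make the structure of (A.1) concrete, the following minimal sketch (Python; the data layout and names are hypothetical, not taken from the thesis code base) generates the cut for a datapoint under an integral branching solution b̄: it walks the path S from the root to l(i) and collects the index pairs (n, f) with n ∈ S and x^i_f ≠ x^i_{f(n)} whose variables b_{nf} appear on the right-hand side.

    # Minimal sketch (hypothetical layout): children of node n are 2n
    # (left, taken when the branching feature equals 0) and 2n + 1 (right).
    def generate_cut(x_i, b_bar, depth):
        """b_bar[n] is the feature f(n) with b_bar_{n,f(n)} = 1.  Returns
        the leaf l(i) reached by x_i and the index pairs (n, f) of the
        b-variables on the right-hand side of cut (A.1)."""
        n, cut_vars = 1, []
        for _ in range(depth):                    # nodes of the path S
            f_n = b_bar[n]
            cut_vars += [(n, f) for f, v in enumerate(x_i) if v != x_i[f_n]]
            n = 2 * n if x_i[f_n] == 0 else 2 * n + 1
        return n, cut_vars                        # n is now the leaf l(i)

    # Cut (A.1): g^i <= w^{leaf}_{y^i} + sum of b_{nf} over (n, f) in cut_vars.
    leaf, cut_vars = generate_cut([1, 0, 1], {1: 0, 2: 1, 3: 2}, depth=2)
    # -> leaf 7, cut_vars [(1, 1), (3, 1)]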
Next, we give |B × F| + |L × K| + |I| affinely independent points in H≤ for which (A.1) holds with equality. Given a vector b ∈ {0, 1}^{|B|·|F|}, we let b_S (resp. b_{B\S}) collect those elements of b whose first index is n ∈ S (resp. n ∉ S).
We now describe the points, which are also summarized in Table 5.1, and argue that each point belongs to H≤ and that inequality (A.1) is active at it. All the |B × F| + |L × K| + |I| points are affinely independent, since each differs from all the previously introduced points in at least one (new) coordinate.
Table 5.1: Companion table for the proof of Theorem 2: list of affinely independent points that lie on the cut generated by inputting i ∈ I and (b̄, w̄, ḡ) in Algorithm 1. The four blocks of coordinates have dimensions |S|·|F|, |B\S|·|F|, |L|·|K| and |I|, respectively.

 #  | condition                                    | b_S                       | b_{B\S}  | w                          | g
----+----------------------------------------------+---------------------------+----------+----------------------------+---------
 1  | "baseline" point                             | b̄_S                       | 0        | 0                          | 0
 2  | n ∈ L, k ∈ K \ y^i                           | b̄_S                       | 0        | e_{nk}                     | 0
 3  | n ∈ L \ l(i)                                 | b̄_S                       | 0        | e_{n y^i}                  | 0
 4  | n = l(i)                                     | b̄_S                       | 0        | e_{l(i) y^i}               | e_i
 5  | n ∈ B \ S, f ∈ F                             | b̄_S                       | e_{nf}   | 0                          | 0
 6  | n ∈ S                                        | b̄_S − e_{n f(n)}          | 0        | 0                          | 0
 7  | n ∈ S, f ∈ F: f ≠ f(n), x^i_f = x^i_{f(n)}   | b̄_S − e_{n f(n)} + e_{nf} | 0        | 0                          | 0
 8  | n ∈ S, f ∈ F: f ≠ f(n), x^i_f ≠ x^i_{f(n)}   | b̄_S − e_{n f(n)} + e_{nf} | 0        | Σ_{m∈L: m≠l(i)} e_{m y^i}  | e_i
 9  | j ∈ I \ i: y^j ≠ y^i                         | b̄_S                       | b̄_{B\S}  | e_{l(j) y^j}               | e_j
 10 | j ∈ I \ i: y^j = y^i, l(j) ≠ l(i)            | b̄_S                       | b̄_{B\S}  | e_{l(j) y^j}               | e_j
 11 | j ∈ I \ i: y^j = y^i, l(j) = l(i)            | b̄_S                       | b̄_{B\S}  | e_{l(i) y^i}               | e_i + e_j
1 One point that is a "baseline" point; all other points are variants of it. It is given by b_S = b̄_S, b_{B\S} = 0, w = 0 and g = 0 and corresponds to selecting the features to branch on according to b̄ for nodes in S and setting all remaining variables to 0. The baseline point belongs to H≤ and constraint (A.1) is active at this point.
2-4 |L| × |K| points obtained from the baseline point by varying the w coordinates and adjusting g as necessary to ensure (A.1) remains active. 2: |L| × (|K| − 1) points, each associated with a leaf n ∈ L and class k ∈ K : k ≠ y^i, where the label of leaf n is changed to k. 3: |L| − 1 points, each associated with a leaf n ∈ L : n ≠ l(i), where the class label of n is changed to y^i. 4: One point where the class label of leaf l(i) is set to y^i, allowing for correct classification of datapoint i; in this case, the value of the right-hand side (rhs) of (A.1) is 1, and we set g^i = 1 to ensure the cut (A.1) remains active.
5 |B \ S| × |F| points obtained from the baseline point by varying the b_{B\S} coordinates, that is, the branching decisions made at nodes outside of the path S. Each point is associated with a node n ∈ B \ S and feature f ∈ F and is obtained by setting to 1 the decision to branch on feature f at node n. As these branching decisions do not impact the routing of datapoint i, the value of the rhs of inequality (A.1) remains unchanged and the inequality stays active.
6-8 |S| × |F| points obtained from the baseline point by varying the b_S coordinates (that is, the branching decisions in the path S used by datapoint i) and adjusting w and g as necessary to guarantee feasibility of the resulting point and to ensure that (A.1) stays active. 6: |S| points, each associated with a node n ∈ S, obtained by not branching on feature f(n) at node n (nor on any other feature), resulting in a "dead-end" node. The value of the rhs of (A.1) is unchanged in this case and the inequality remains active. 7-8: |S| × (|F| − 1) points, each associated with a node n ∈ S and feature f ≠ f(n). 7: If the branching decision f(n) at node n is replaced with a branching decision that results in the same path for datapoint i, i.e., if x^i_f = x^i_{f(n)}, it is possible to swap those decisions without affecting the value of the rhs in inequality (A.1). 8: If a feature that causes i to change paths is chosen for branching, i.e., if x^i_f ≠ x^i_{f(n)}, then the value of the rhs of (A.1) is increased by 1, and we set g^i = 1 to ensure the inequality remains active; to guarantee feasibility of the resulting point, we label each leaf node except for l(i) with the class y^i, which does not affect inequality (A.1).
9-11 |I| − 1 points, one for each j ∈ I \ {i}, where point j is correctly classified. We let b = b̄ (that is, all branching decisions coincide with b̄, both for nodes in path S and elsewhere) and adjust w and g as necessary. 9: If datapoint j has a different class than datapoint i (y^j ≠ y^i), we label the leaf node that j is routed to with the class of j, i.e., w^{l(j)}_{y^j} = 1. The value of the rhs of (A.1) is unaffected and the inequality remains active. 10: If datapoint j has the same class as datapoint i but is routed to a different leaf than i, an argument paralleling that in 9 can be made. 11: If datapoint j has the same class as datapoint i and is routed to the same leaf l(i), we label l(i) with the class y^i = y^j and set g^j = 1; the value of the rhs of (A.1) increases by 1. Thus, we also set g^i = 1 (correctly classifying datapoint i as well) to ensure that (A.1) is active.
This concludes the proof.
A.2 OCT
In this section, we provide a simplified version of the formulation of [18] (formulation (24) in
their paper) specialized to the case of binary data.
Figure 5.1: A classification tree of depth 2. (Tree diagram omitted from this transcript.)
We start by introducing the notation used in the formulation. Let B and T denote the sets of all branching and terminal nodes in the tree structure. For each node n ∈ B ∪ T \ {1}, a(n) refers to the parent of node n. We let P_L(n) (resp. P_R(n)) denote the set of ancestors of n whose left (resp. right) branch has been followed on the path from the root node to n. In particular, P(n) = P_L(n) ∪ P_R(n).
Let b_{nf} be a binary decision variable where b_{nf} = 1 if and only if feature f is used for branching at node n. For each datapoint i at node n ∈ B, a test Σ_{f∈F} b_{nf} x^i_f < v_n is performed, where v_n ∈ ℝ is a decision variable representing the cut-off value of the test. If datapoint i passes the test it follows the left branch; otherwise, it follows the right one. Let p_n = 1 if and only if node n applies a split, that is, if it is not a leaf. To track each datapoint i through the tree, the decision variable ζ^i_{a(n),n} is introduced, where ζ^i_{a(n),n} = 1 if and only if datapoint i is routed to node n.
Let Q_{nk} be the number of datapoints of class k assigned to leaf node n and Q_n be the total number of datapoints in leaf node n ∈ T. We denote by w^n_k the prediction at leaf node n, where w^n_k = 1 if and only if the predicted label at node n is k ∈ K. Finally, we let L_n denote the number of misclassified datapoints at node n. With this notation, the formulation of [18] is expressible as
maximize (1 − λ) (|I| − Σ_{n∈T} L_n) − λ Σ_{n∈B} p_n    (A.2a)
subject to
L_n ≥ Q_n − Q_{nk} − |I| (1 − w^n_k)    ∀k ∈ K, n ∈ T    (A.2b)
L_n ≤ Q_n − Q_{nk} + |I| w^n_k    ∀k ∈ K, n ∈ T    (A.2c)
Q_{nk} = Σ_{i∈I: y^i=k} ζ^i_{a(n),n}    ∀k ∈ K, n ∈ T    (A.2d)
Q_n = Σ_{i∈I} ζ^i_{a(n),n}    ∀n ∈ T    (A.2e)
l_n = Σ_{k∈K} w^n_k    ∀n ∈ T    (A.2f)
ζ^i_{a(n),n} ≤ l_n    ∀i ∈ I, n ∈ T    (A.2g)
Σ_{n∈T} ζ^i_{a(n),n} = 1    ∀i ∈ I    (A.2h)
Σ_{f∈F} b_{mf} x^i_f ≥ v_m + ζ^i_{a(n),n} − 1    ∀i ∈ I, n ∈ T, m ∈ P_R(n)    (A.2i)
Σ_{f∈F} b_{mf} x^i_f ≤ v_m − 2 ζ^i_{a(n),n} + 1    ∀i ∈ I, n ∈ T, m ∈ P_L(n)    (A.2j)
Σ_{f∈F} b_{nf} = p_n    ∀n ∈ B    (A.2k)
0 ≤ v_n ≤ p_n    ∀n ∈ B    (A.2l)
p_n ≤ p_{a(n)}    ∀n ∈ B \ {1}    (A.2m)
ζ^i_{a(n),n}, l_n ∈ {0, 1}    ∀i ∈ I, n ∈ T    (A.2n)
b_{nf}, p_n ∈ {0, 1}    ∀f ∈ F, n ∈ B,    (A.2o)
where λ ∈ [0, 1] is a regularization term. The objective (A.2a) maximizes the total number of correctly classified datapoints, |I| − Σ_{n∈T} L_n, while minimizing the number of splits, Σ_{n∈B} p_n. Constraints (A.2b) and (A.2c) define the number of misclassified datapoints at each node n. Constraints (A.2d) and (A.2e) give the definitions of Q_{nk} and Q_n, respectively. Constraints (A.2f)-(A.2g) enforce that if a terminal node n does not have an assigned class label, no datapoint should land in that node. Constraint (A.2h) makes sure that each datapoint i is assigned to exactly one of the terminal nodes. Constraint (A.2i) implies that if datapoint i is assigned to node n, it should take the right branch for all ancestors of n belonging to P_R(n). Similarly, constraint (A.2j) implies that if datapoint i is assigned to node n, it should take the left branch for all ancestors of n belonging to P_L(n). Constraint (A.2k) enforces that if the tree branches at node n, it should branch on exactly one of the features f ∈ F. Constraint (A.2l) implies that if the tree does not branch at a node, all datapoints going through this node take the right branch. Finally, constraint (A.2m) makes sure that if the tree does not branch at node n, it cannot branch at any of the descendants of the node. We note that (A.2) is slightly different from the original formulation of [18]. Indeed, the objective function (A.2a) maximizes the number of correctly classified datapoints instead of minimizing the number of misclassified datapoints (the two are clearly equivalent, since the latter can be obtained by subtracting the former from the number of datapoints |I|). Moreover, we have omitted a constraint similar to (2.10), as we do not use it in our computations in Section 2.5. Unlike the original formulation in [18], the "big M" and "little m" constants in constraints (A.2j) are not directly visible, since we have assumed all the features to be binary.
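For intuition on how constraints (A.2h)-(A.2j) route datapoints in the binary case, the following minimal sketch (Python; hypothetical names, not the thesis code) traverses a balanced tree in which exactly one b_{nf} equals 1 per branching node and v_n = p_n = 1, so the test Σ_{f∈F} b_{nf} x^i_f < v_n reduces to checking whether the selected feature equals 0.

    # Minimal sketch (hypothetical layout): route a binary datapoint
    # through a balanced tree encoded as in (A.2) with v_n = p_n = 1.
    def route(x_i, branch_feature, depth):
        """branch_feature[n] is the feature f with b_{nf} = 1 at node n.
        Returns the terminal node that datapoint x_i reaches; children of
        node n are 2n (left, test passed) and 2n + 1 (right)."""
        n = 1
        for _ in range(depth):
            f = branch_feature[n]
            # test sum_f b_{nf} x^i_f < v_n, i.e. x^i_f = 0 -> go left
            n = 2 * n if x_i[f] == 0 else 2 * n + 1
        return n

    print(route([1, 0, 1], {1: 0, 2: 1, 3: 2}, depth=2))   # -> terminal node 7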
A.3 OCT’s Numerical Issues
In this section we show, by means of an example, that, for the case of real-valued features,
the “little-m” constraints in the formulation of [18] can cause numerical instabilities.
For the case of datasets with real-valued features, constraints (A.2i) and (A.2j) read as follows:

Σ_{f∈F} b_{mf} x^i_f ≥ v_m − (1 − ζ^i_{a(n),n})    ∀i ∈ I, n ∈ T, m ∈ P_R(n)    (A.3a)
Σ_{f∈F} b_{mf} x^i_f + ϵ_min ≤ v_m + (1 + ϵ_max)(1 − ζ^i_{a(n),n})    ∀i ∈ I, n ∈ T, m ∈ P_L(n),    (A.3b)

where ϵ_min (resp. ϵ_max) is defined as min_{f∈F} ϵ_f (resp. max_{f∈F} ϵ_f), with

ϵ_f := min { x^{(i+1)}_f − x^{(i)}_f : x^{(i+1)}_f ≠ x^{(i)}_f, i = 1, . . . , n − 1 }    ∀f ∈ F,

where x^{(i)}_f denotes the ith smallest value taken by feature f in the data; that is, ϵ_f is the smallest gap between consecutive distinct values of feature f. Note that ϵ_min (resp. ϵ_max) represents the largest (resp. smallest) possible value that does not impact the feasibility of any valid solution to the problem.
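As a sanity check, ϵ_min and ϵ_max can be computed directly from the training data; the snippet below is a minimal sketch under that reading of the definition (NumPy assumed; not the thesis code base):

    import numpy as np

    def epsilon_stats(X):
        """eps_f = smallest gap between consecutive distinct values of
        feature f; returns (eps_min, eps_max) over all non-constant features."""
        eps = []
        for f in range(X.shape[1]):
            vals = np.unique(X[:, f])            # sorted distinct values
            if vals.size > 1:                    # skip constant features
                eps.append(np.diff(vals).min())
        return min(eps), max(eps)

    # A 0/1 feature has eps_f = 1, which is consistent with eps_max = 1 on
    # datasets mixing continuous and binary columns (e.g. ionosphere).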
Consider the "ionosphere" dataset (see Table 5.7). For this dataset, ϵ_min = 4.99e−6 and ϵ_max = 1. OCT outputs the decision tree shown in Figure 5.2. In this instance, for node m = 3 and datapoint i = 175, constraints (A.3a) and (A.3b) read:

1 ≥ 1 − (1 − ζ^i_{3,7})
1 + ϵ_min ≤ 1 + 2 (1 − ζ^i_{3,6}).
On paper, ζ^i_{3,6} = 1 is infeasible. In Gurobi, on the other hand, ζ^i_{3,6} = 1 behaves as feasible, taking on the value 0.9999950003800046. And since in this case ζ^i_{3,6} = 1 results in a correct classification for datapoint i = 175, Gurobi chooses this assignment. However, in reality, datapoint i = 175 should get routed to leaf node 7, i.e., ζ^{175}_{3,7} = 1, where it is misclassified. This numerical issue, which is caused by the small value of ϵ_min, creates a situation wherein there is a discrepancy between the optimization problem and the actual training accuracy. Since ϵ_min is already the largest possible value, this issue cannot be resolved. In our numerical experiments, we obtained an optimal objective value of 12, which corresponds to the number of misclassified datapoints, resulting in an in-sample accuracy of 93%. However, upon evaluating the output tree, we found that the actual in-sample accuracy is 92%. In the case of balanced decision trees, out of the 560 MIO instances (28 datasets × 5 samples × 4 depths) that we solved, we encountered numerical issues with OCT in 54% of the instances. These issues led to a discrepancy in the in-sample accuracy of up to 92%; see Tables 5.6-5.8 for detailed results.
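The arithmetic behind this behavior is easy to verify. The snippet below (a standalone illustration, not thesis code) plugs the value reported by Gurobi into constraint (A.3b) for this instance and compares the deviation from 1 against Gurobi's default integrality tolerance (IntFeasTol = 1e-5):

    eps_min = 4.99e-6
    zeta = 0.9999950003800046     # value Gurobi reports for zeta^175_{3,6}

    lhs = 1 + eps_min             # left-hand side of (A.3b) at node m = 3
    rhs = 1 + 2 * (1 - zeta)      # right-hand side under the relaxed zeta

    print(lhs <= rhs)             # True: (A.3b) is numerically satisfied
    print(1 - zeta <= 1e-5)       # True: zeta counts as integral under the
                                  # default tolerance, so it "behaves as 1"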
Figure 5.2: Example of an instance where OCT exhibits numerical issues. (The omitted diagram shows a depth-2 tree that branches on feature V5 with threshold 0.6 at the root and on feature V27 with threshold 1 at node 3.) According to the solution of the optimization problem, datapoint 175 should get routed to leaf node 7, where it gets misclassified. However, in practice, we observe that this datapoint is assigned to leaf node 6 and mistakenly reported as being correctly classified. This causes a discrepancy between the optimization-problem objective and the actual accuracy of the tree returned by the optimization.
A.4 Comparison with OCT (Proof of Theorem 1)
In this section, we demonstrate that formulation (2.1) has a stronger LO relaxation than formulation (A.2). In formulation (A.2), v_n can be fixed to p_n for all nodes (in the case of binary data). Moreover, the regularization variables p_n can be fixed to 1 for balanced trees. Using the identity Σ_{f∈F: x^i_f=1} b_{mf} = 1 − Σ_{f∈F: x^i_f=0} b_{mf} and noting that l_n can be fixed to 1 in the formulation, we obtain the simplified OCT formulation
maximize |I| − Σ_{n∈T} L_n    (A.4a)
subject to
L_n ≥ Q_n − Q_{nk} − |I| (1 − w^n_k)    ∀k ∈ K, n ∈ T    (A.4b)
L_n ≤ Q_n − Q_{nk} + |I| w^n_k    ∀k ∈ K, n ∈ T    (A.4c)
Q_{nk} = Σ_{i∈I: y^i=k} ζ^i_{a(n),n}    ∀k ∈ K, n ∈ T    (A.4d)
Q_n = Σ_{i∈I} ζ^i_{a(n),n}    ∀n ∈ T    (A.4e)
Σ_{k∈K} w^n_k = 1    ∀n ∈ T    (A.4f)
Σ_{n∈T} ζ^i_{a(n),n} = 1    ∀i ∈ I    (A.4g)
Σ_{f∈F: x^i_f=1} b_{mf} ≥ ζ^i_{a(n),n}    ∀i ∈ I, n ∈ T, m ∈ P_R(n)    (A.4h)
Σ_{f∈F: x^i_f=0} b_{mf} ≥ 2 ζ^i_{a(n),n} − 1    ∀i ∈ I, n ∈ T, m ∈ P_L(n)    (A.4i)
Σ_{f∈F} b_{nf} = 1    ∀n ∈ B    (A.4j)
ζ^i_{a(n),n} ∈ {0, 1}    ∀i ∈ I, n ∈ T    (A.4k)
b_{nf} ∈ {0, 1}    ∀f ∈ F, n ∈ B.    (A.4l)
A.4.1 Strengthening
We now show how formulation (A.4) can be strengthened, resulting in formulation (2.1). We note that the validity of the steps below is guaranteed by the correctness of formulation (2.1); thus, we do not explicitly discuss validity in the arguments.
Bound tightening for (A.4i). Adding the quantity 1 − ζ^i_{a(n),n} ≥ 0 to the right-hand side of (A.4i), we obtain the stronger constraints

Σ_{f∈F: x^i_f=0} b_{mf} ≥ ζ^i_{a(n),n}    ∀i ∈ I, n ∈ T, m ∈ P_L(n).    (A.5)

Improved branching constraints. Constraints (A.4h) can be strengthened to

Σ_{f∈F: x^i_f=1} b_{mf} ≥ Σ_{n∈T: m∈P_R(n)} ζ^i_{a(n),n}    ∀i ∈ I, m ∈ B.    (A.6)

Observe that constraints (A.6), in addition to being stronger than (A.4h), also reduce the number of constraints required to represent the LO relaxation. Similarly, constraint (A.5) can be further improved to

Σ_{f∈F: x^i_f=0} b_{mf} ≥ Σ_{n∈T: m∈P_L(n)} ζ^i_{a(n),n}    ∀i ∈ I, m ∈ B.    (A.7)
Improved misclassification formulation. For all i ∈ I and n ∈ T, define additional variables z^i_{a(n),n} ∈ {0, 1} such that z^i_{a(n),n} ≤ ζ^i_{a(n),n} w^n_{y^i}. Note that z^i_{a(n),n} = 1 implies that datapoint i is routed to terminal node n (ζ^i_{a(n),n} = 1) and the class of i is assigned to n (w^n_{y^i} = 1). Hence z^i_{a(n),n} = 1 only if datapoint i is correctly classified at terminal node n. Upper bounds on z^i_{a(n),n} can be imposed via the linear constraints

z^i_{a(n),n} ≤ ζ^i_{a(n),n},    z^i_{a(n),n} ≤ w^n_{y^i}    ∀n ∈ T, i ∈ I.    (A.8)

In addition, since L_n corresponds to the number of misclassified points at terminal node n ∈ T and the total number of misclassified points is Σ_{n∈T} L_n, we find that the constraints

L_n ≥ Σ_{i∈I} (ζ^i_{a(n),n} − z^i_{a(n),n})    (A.9)

are valid. Note that constraints (A.9) and (A.4g) imply that

Σ_{n∈T} L_n ≥ |I| − Σ_{i∈I} Σ_{n∈T} z^i_{a(n),n}.    (A.10)
A.4.2 Simplification
As discussed in the preceding sections, the linear optimization relaxation of the formulation obtained in Section A.4.1, given by constraints (A.4b)-(A.4g), (A.4j)-(A.4l), (A.6), (A.7), (A.8) and (A.9), is stronger than the relaxation of OCT, as constraints were either tightened or added. We now show how the resulting formulation can be simplified, without loss of relaxation quality, to obtain problem (2.1).
Upper bound on misclassification. Since variable L_n has a negative objective coefficient and only appears in constraints (A.4b), (A.4c), and (A.9), it will always be set to a lower bound. Therefore, constraint (A.4c), which imposes an upper bound on L_n, is redundant and can be eliminated without affecting the relaxation of the problem.
Lower bound on misclassification. Substituting variables according to (A.4d) and (A.4e), we find that for a given k ∈ K and n ∈ T, (A.4b) is equivalent to

L_n ≥ Σ_{i∈I} ζ^i_{a(n),n} − Σ_{i∈I: y^i=k} ζ^i_{a(n),n} − |I| (1 − w^n_k)
⇔ L_n ≥ Σ_{i∈I: y^i=k} (w^n_k − 1) + Σ_{i∈I: y^i≠k} (ζ^i_{a(n),n} − 1 + w^n_k).    (A.11)

Observe that w^n_k − 1 ≤ 0 ≤ ζ^i_{a(n),n} − z^i_{a(n),n}. Moreover, we also have that for any i ∈ I and k ∈ K \ {y^i},

z^i_{a(n),n} ≤ w^n_{y^i} ≤ 1 − w^n_k,    (A.12)

where the first inequality follows from (A.8) and the second inequality follows from (A.4f). Therefore, from (A.12) we conclude that ζ^i_{a(n),n} − 1 + w^n_k ≤ ζ^i_{a(n),n} − z^i_{a(n),n}, and inequalities (A.9) dominate inequalities (A.11) and thus (A.4b). Since inequalities (A.4d) and (A.4e) only appear in inequalities (A.4b) and (A.4c), which were shown to be redundant, they can be dropped as well. Finally, as inequalities (A.9) define the unique lower bounds of L_n in the simplified formulation, inequality (A.10) can be converted to an equality and the objective (A.4a) can be updated accordingly. After all the changes outlined so far, formulation (A.4) reduces to
maximize Σ_{i∈I} Σ_{n∈T} z^i_{a(n),n}    (A.13a)
subject to
Σ_{k∈K} w^n_k = 1    ∀n ∈ T    (A.13b)
Σ_{n∈T} ζ^i_{a(n),n} = 1    ∀i ∈ I    (A.13c)
Σ_{f∈F: x^i_f=1} b_{mf} ≥ Σ_{n∈T: m∈P_R(n)} ζ^i_{a(n),n}    ∀i ∈ I, m ∈ B    (A.13d)
Σ_{f∈F: x^i_f=0} b_{mf} ≥ Σ_{n∈T: m∈P_L(n)} ζ^i_{a(n),n}    ∀i ∈ I, m ∈ B    (A.13e)
Σ_{f∈F} b_{nf} = 1    ∀n ∈ B    (A.13f)
z^i_{a(n),n} ≤ ζ^i_{a(n),n}    ∀n ∈ T, i ∈ I    (A.13g)
z^i_{a(n),n} ≤ w^n_{y^i}    ∀n ∈ T, i ∈ I    (A.13h)
ζ^i_{a(n),n} ∈ {0, 1}    ∀i ∈ I, n ∈ T    (A.13i)
b_{nf} ∈ {0, 1}    ∀f ∈ F, n ∈ B.    (A.13j)
A.4.3 Projection
We now project out the ζ variables, obtaining a more compact formulation with the same LO relaxation. Specifically, consider the formulation

maximize Σ_{i∈I} Σ_{n∈T} z^i_{a(n),n}    (A.14a)
subject to
Σ_{k∈K} w^n_k = 1    ∀n ∈ T    (A.14b)
Σ_{f∈F: x^i_f=1} b_{mf} ≥ Σ_{n∈T: m∈P_R(n)} z^i_{a(n),n}    ∀i ∈ I, m ∈ B    (A.14c)
Σ_{f∈F: x^i_f=0} b_{mf} ≥ Σ_{n∈T: m∈P_L(n)} z^i_{a(n),n}    ∀i ∈ I, m ∈ B    (A.14d)
Σ_{f∈F} b_{nf} = 1    ∀n ∈ B    (A.14e)
z^i_{a(n),n} ≤ w^n_{y^i}    ∀n ∈ T, i ∈ I    (A.14f)
z^i_{a(n),n} ∈ {0, 1}    ∀i ∈ I, n ∈ T    (A.14g)
b_{nf} ∈ {0, 1}    ∀f ∈ F, n ∈ B.    (A.14h)

Proposition 4 Formulations (A.13) and (A.14) are equivalent, i.e., their LO relaxations have the same optimal objective value.
Proof. Let ν_1 and ν_2 be the optimal objective values of the LO relaxations of (A.13) and (A.14), respectively. Note that (A.14) is a relaxation of (A.13), obtained by dropping constraint (A.13c) and replacing ζ with a lower bound in constraints (A.13d) and (A.13e). Therefore, it follows that ν_2 ≥ ν_1. We now show that ν_2 ≤ ν_1.
Let (b*, w*, z*) be an optimal solution of (A.14) and let i ∈ I. We show how to construct a feasible solution of (A.13) with the same objective value, thus implying that ν_2 ≤ ν_1. For any given i ∈ I, by summing constraints (A.14c) and (A.14d) for the root node m = 1, we find that

1 = Σ_{f∈F: x^i_f=1} b*_{1f} + Σ_{f∈F: x^i_f=0} b*_{1f} ≥ Σ_{n∈T} (z^i_{a(n),n})*.    (A.15)

Now let ζ = z*. If the inequality in (A.15) is active, then (b*, w*, z*, ζ) satisfies all constraints in (A.13) and the proof is complete. Otherwise, it follows that either (A.14c) or (A.14d) is not active at node m = 1, and without loss of generality assume (A.14c) is not active. Summing up inequalities (A.14c) and (A.14d) for node m = r(1), we find that

1 = Σ_{f∈F} b*_{r(1)f} > Σ_{n∈T: r(1)∈P_R(n)∪P_L(n)} (z^i_{a(n),n})*,    (A.16)

where the strict inequality holds since the right-hand side of (A.16) is no greater than the right-hand side of (A.15). By applying this process recursively, we obtain a path from node 1 to a terminal node h ∈ T such that all inequalities (A.14c) and (A.14d) associated with nodes in this path are inactive. The value ζ^i_{a(h),h} can then be increased by the minimum slack in the constraints, and the overall process can be repeated until inequality (A.13c) is tight.
A.4.4 Substitution
Finally, to recover formulation (2.1), substitute, for all m ∈ B, the variables

z^i_{m,r(m)} := Σ_{n∈T: m∈P_R(n)} z^i_{a(n),n},    and    z^i_{m,ℓ(m)} := Σ_{n∈T: m∈P_L(n)} z^i_{a(n),n}.

Similarly, for all n ∈ T, introduce variables z^i_{n,t} := z^i_{a(n),n}. Constraints (A.14c) and (A.14d) reduce to Σ_{f∈F: x^i_f=1} b_{mf} ≥ z^i_{m,r(m)} and Σ_{f∈F: x^i_f=0} b_{mf} ≥ z^i_{m,ℓ(m)}. Finally, since

z^i_{a(m),m} = Σ_{n∈T: m∈P_R(n)∪P_L(n)} z^i_{a(n),n} = Σ_{n∈T: m∈P_R(n)} z^i_{a(n),n} + Σ_{n∈T: m∈P_L(n)} z^i_{a(n),n} = z^i_{m,r(m)} + z^i_{m,ℓ(m)},

we recover the flow conservation constraints. In formulation (2.1), we do not use the notion of terminal nodes T and use L instead. However, in formulation (2.1), the set of leaf nodes L coincides with the set of terminal nodes T. Therefore, we correctly recover formulation (2.1).
A.5 Benders' Decomposition for Regularized Problems
In this section, we describe our proposed Benders' decomposition approach adapted to formulation (2.7), which can be written equivalently as:

maximize (1 − λ) Σ_{i∈I} g^i(b, w, p) − λ Σ_{n∈B} Σ_{f∈F} b_{nf}    (A.17a)
subject to
Σ_{f∈F} b_{nf} + p_n + Σ_{m∈P(n)} p_m = 1    ∀n ∈ B    (A.17b)
p_n + Σ_{m∈P(n)} p_m = 1    ∀n ∈ T    (A.17c)
Σ_{k∈K} w^n_k = p_n    ∀n ∈ B ∪ T    (A.17d)
b_{nf} ∈ {0, 1}    ∀n ∈ B, f ∈ F    (A.17e)
p_n ∈ {0, 1}    ∀n ∈ B ∪ T    (A.17f)
w^n_k ∈ {0, 1}    ∀n ∈ B ∪ T, k ∈ K,    (A.17g)

where, for any fixed i ∈ I and fixed (b, w, p), g^i(b, w, p) is defined as the optimal objective value of the problem

maximize Σ_{n∈B∪T} z^i_{n,t}    (A.18a)
subject to
z^i_{a(n),n} = z^i_{n,ℓ(n)} + z^i_{n,r(n)} + z^i_{n,t}    ∀n ∈ B    (A.18b)
z^i_{a(n),n} = z^i_{n,t}    ∀n ∈ T    (A.18c)
z^i_{s,1} ≤ 1    (A.18d)
z^i_{n,ℓ(n)} ≤ Σ_{f∈F: x^i_f=0} b_{nf}    ∀n ∈ B    (A.18e)
z^i_{n,r(n)} ≤ Σ_{f∈F: x^i_f=1} b_{nf}    ∀n ∈ B    (A.18f)
z^i_{n,t} ≤ w^n_{y^i}    ∀n ∈ B ∪ T    (A.18g)
z^i_{a(n),n}, z^i_{n,t} ∈ {0, 1}    ∀n ∈ B ∪ T.    (A.18h)
Problem (A.18) is a maximum flow problem on the flow graph G (see Definition 5), whose arc capacities are determined by (b, w) and datapoint i ∈ I, as formalized next.
Definition 16 (Capacitated flow graph of imbalanced trees) Given the graph G = (V, A), vectors (b, w), and datapoint i ∈ I, define arc capacities c^i(b, w) as follows. Let c^i_{s,1}(b, w) := 1, c^i_{n,ℓ(n)}(b, w) := Σ_{f∈F: x^i_f=0} b_{nf} and c^i_{n,r(n)}(b, w) := Σ_{f∈F: x^i_f=1} b_{nf} for all n ∈ B, and c^i_{n,t}(b, w) := w^n_{y^i} for all n ∈ B ∪ T. Define the capacitated flow graph G^i(b, w) as the flow graph G augmented with capacities c^i(b, w).
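For concreteness, the capacities of Definition 16 can be written down directly; the following minimal sketch (Python; hypothetical data layout, not the thesis code) builds c^i(b, w) for an integral (b, w):

    def capacities(x_i, y_i, b, w, branching_nodes, all_nodes):
        """c[(u, v)] for the arcs of G^i(b, w); b[n][f] and w[n][k] are 0/1.
        Children of node n are 2n (left) and 2n + 1 (right)."""
        c = {("s", 1): 1}                        # source arc into the root
        for n in branching_nodes:                # arcs to the two children
            c[(n, 2 * n)] = sum(b[n][f] for f in range(len(x_i)) if x_i[f] == 0)
            c[(n, 2 * n + 1)] = sum(b[n][f] for f in range(len(x_i)) if x_i[f] == 1)
        for n in all_nodes:                      # arcs to the sink t
            c[(n, "t")] = w[n].get(y_i, 0)       # equals w^n_{y^i}
        return c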
Similar to the derivation of problem (2.4), we can reformulate problem (A.17) as

maximize_{g,b,w,p} (1 − λ) Σ_{i∈I} g^i − λ Σ_{n∈B} Σ_{f∈F} b_{nf}    (A.19a)
subject to
g^i ≤ Σ_{(n₁,n₂)∈C(S)} c^i_{n₁,n₂}(b, w)    ∀i ∈ I, S ⊆ V \ {t} : s ∈ S    (A.19b)
Σ_{f∈F} b_{nf} + p_n + Σ_{m∈P(n)} p_m = 1    ∀n ∈ B    (A.19c)
p_n + Σ_{m∈P(n)} p_m = 1    ∀n ∈ T    (A.19d)
Σ_{k∈K} w^n_k = p_n    ∀n ∈ B ∪ T    (A.19e)
b_{nf} ∈ {0, 1}    ∀n ∈ B, f ∈ F    (A.19f)
p_n ∈ {0, 1}    ∀n ∈ B ∪ T    (A.19g)
w^n_k ∈ {0, 1}    ∀n ∈ B ∪ T, k ∈ K    (A.19h)
g^i ≤ 1    ∀i ∈ I.    (A.19i)
Algorithm 2 is a modified version of Algorithm 1 tailored to the flow graph introduced in Definition 16. Figure 5.3 illustrates Algorithm 2. The difference between Algorithms 2 and 1 is highlighted with underlining in the original listing.

Algorithm 2 Separation procedure for constraints (A.19b)
Input: (b, w, p, g) ∈ {0,1}^{|B|·|F|} × {0,1}^{|B∪T|·|K|} × {0,1}^{|B∪T|} × ℝ^{|I|} satisfying (A.19c)-(A.19i); i ∈ I: datapoint used to generate the cut.
Output: −1 if all constraints (A.19b) corresponding to i are satisfied; source set S of a minimum cut otherwise.
1: if g^i = 0 then return −1
2: Initialize n ← 1    ▷ Current node = root
3: Initialize S ← {s}    ▷ s is in the source set of the cut
4: while p_n = 0 do
5:    S ← S ∪ {n}
6:    if c^i_{n,ℓ(n)}(b, w) = 1 then
7:        n ← ℓ(n)    ▷ Datapoint i is routed left
8:    else if c^i_{n,r(n)}(b, w) = 1 then
9:        n ← r(n)    ▷ Datapoint i is routed right
10:   end if
11: end while    ▷ At this point, n is a leaf node of the tree
12: S ← S ∪ {n}
13: if g^i > c^i_{n,t}(b, w) then    ▷ Minimum cut S with capacity 0 found
14:    return S
15: else    ▷ Minimum cut S has capacity 1; constraints (A.19b) are satisfied
16:    return −1
17: end if

Proposition 5 Given i ∈ I and (b, w, p, g) satisfying (A.19c)-(A.19i), Algorithm 2 either finds a violated inequality (A.19b) or proves that all such inequalities are satisfied.
Proof. Note that the right-hand side of (A.19b), which corresponds to the capacity of a cut in the graph, is nonnegative. Therefore, if g^i = 0 (line 1), all inequalities are automatically satisfied. Since (b, w) is integer, all arc capacities in formulation (A.18) are either 0 or 1. Moreover, since g^i ≤ 1, we find that either the value of a minimum cut is 0 and there exists a violated inequality, or the value of a minimum cut is at least 1 and there is no violated inequality. Finally, there exists a 0-capacity cut if and only if s and t belong to different connected components in the graph G^i(b, w).
The connected component that s belongs to can be found using depth-first search. For any fixed n ∈ B, constraints (A.18b)-(A.18c) and the definition of c^i(b, w) imply that only one of the arcs (n, ℓ(n)), (n, r(n)) and (n, t) has capacity 1. If arc (n, ℓ(n)) has capacity 1 (line 6), then ℓ(n) can be added to the component connected to s (set S); the case where arc (n, r(n)) has capacity 1 (line 8) is handled analogously. This process continues until a leaf node is reached, i.e., p_n = 1 (line 12). If the capacity of the arc to the sink is 1 (line 15), then an (s, t)-path is found and no cut with capacity 0 exists. Otherwise (line 13), S is the connected component of s and t ∉ S; thus S is the source set of a minimum cut with capacity 0. □
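The following minimal sketch (Python; hypothetical data layout, not the thesis implementation) mirrors Algorithm 2 for an integral candidate solution: it follows the unique capacity-1 arc out of each branching node until it reaches a node with p_n = 1 and then compares g^i with the capacity of the arc to the sink.

    def separate(x_i, y_i, branch_feature, label, p, g_i):
        """Sketch of Algorithm 2.  branch_feature[n] is the feature with
        b_{nf} = 1 when p[n] = 0; label[n] is the class predicted at n when
        p[n] = 1.  Returns -1 if all cuts (A.19b) for datapoint i hold,
        otherwise the source set S of a minimum cut with capacity 0."""
        if g_i == 0:
            return -1
        n, S = 1, ["s"]
        while p[n] == 0:                          # n applies a split
            S.append(n)
            f = branch_feature[n]
            n = 2 * n if x_i[f] == 0 else 2 * n + 1   # follow capacity-1 arc
        S.append(n)                               # n makes a prediction
        sink_cap = 1 if label[n] == y_i else 0    # c^i_{n,t}(b, w) = w^n_{y^i}
        return S if g_i > sink_cap else -1

    # Example: node 1 splits on feature 0; nodes 2 and 3 predict classes 0, 1.
    print(separate([1, 0], 1, {1: 0}, {2: 0, 3: 1}, {1: 0, 2: 1, 3: 1}, g_i=1))
    # -> -1: the datapoint reaches node 3, which carries its class.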
Figure 5.3: Illustration of Algorithm 2 on four datapoints, two of which are correctly classified (datapoints 1 and 3) and two of which are incorrectly classified (datapoints 2 and 4). (Flow-graph diagrams for the four datapoints omitted from this transcript.) Unbroken (green) arcs (n, n′) have capacity c^i_{n,n′}(b, w) = 1 (and all others, capacity 0). In the case of datapoints 1 and 3, which are correctly classified, there exists a path from source to sink, so Algorithm 2 terminates on line 16 and returns −1. In the case of datapoints 2 and 4, which are incorrectly classified, Algorithm 2 returns set S = {s, 1, 3} and set S = {s, 1, 3, 6}, respectively, on line 14. The associated minimum cut for datapoint 2 consists of arcs (1, 2), (1, t), (3, 6), (3, t) and (3, 7) and is represented by the thick (red) dashed line. Similarly, the associated minimum cut for datapoint 4 consists of arcs (1, 2), (1, t), (6, t), (3, t) and (3, 7).
A.6 Details of Experimental Results in Section 2.5
Categorical Datasets. The details of the experiments presented in Section 2.5.3 are provided in Tables 5.2 and 5.3 (in-sample results), Table 5.4 (out-of-sample results) and Table 5.5 (comparison with LST).
Mixed-Feature Datasets. The details of the experiments presented in Section 2.5.4 are provided in Tables 5.6-5.10 (in-sample results) and Tables 5.11 and 5.12 (out-of-sample results).
Table 5.2: In-sample results including the average and standard deviation of training accuracy, optimality gap, and solving time
across 5 samples for the case of λ = 0 on categorical datasets. The best performance achieved in a given dataset and depth is reported
in bold.
Dataset Depth OCT BinOCT FlowOCT BendersOCT
Train-acc Gap Time Train-acc Gap Time Train-acc Gap Time Train-acc Gap Time
soybean-small 2 1.00±0.00 0.00±0.00 2± 1 1.00±0.00 0.00±0.00 0± 0 1.00±0.00 0.00±0.00 0± 0 1.00±0.00 0.00±0.00 1± 0
soybean-small 3 1.00±0.00 0.00±0.00 2± 0 1.00±0.00 0.00±0.00 0± 0 1.00±0.00 0.00±0.00 1± 0 1.00±0.00 0.00±0.00 1± 0
soybean-small 4 1.00±0.00 0.00±0.00 7± 1 1.00±0.00 0.00±0.00 1± 0 1.00±0.00 0.00±0.00 1± 0 1.00±0.00 0.00±0.00 2± 0
soybean-small 5 1.00±0.00 0.00±0.00 16± 2 1.00±0.00 0.00±0.00 1± 0 1.00±0.00 0.00±0.00 2± 0 1.00±0.00 0.00±0.00 3± 2
monk3 2 0.94±0.01 0.00±0.00 2± 0 0.94±0.01 0.00±0.00 0± 0 0.94±0.01 0.00±0.00 1± 0 0.94±0.01 0.00±0.00 1± 0
monk3 3 0.98±0.01 0.00±0.00 409± 359 0.98±0.01 0.00±0.00 32± 25 0.98±0.01 0.00±0.00 39± 18 0.98±0.01 0.00±0.00 15± 12
monk3 4 1.00±0.00 0.00±0.00 177± 123 1.00±0.00 0.00±0.00 652± 876 1.00±0.00 0.00±0.00 18± 14 1.00±0.00 0.00±0.00 58± 121
monk3 5 1.00±0.00 0.00±0.00 108± 71 1.00±0.00 0.00±0.00 156± 157 1.00±0.00 0.00±0.00 15± 13 1.00±0.00 0.00±0.00 11± 2
monk1 2 0.86±0.03 0.00±0.00 3± 1 0.86±0.03 0.00±0.00 1± 1 0.86±0.03 0.00±0.00 1± 0 0.86±0.03 0.00±0.00 1± 0
monk1 3 0.95±0.02 0.00±0.00 687± 783 0.95±0.02 0.00±0.00 45± 29 0.95±0.02 0.00±0.00 29± 7 0.95±0.02 0.00±0.00 5± 2
monk1 4 1.00±0.00 0.00±0.00 89± 91 1.00±0.00 0.00±0.00 62± 127 1.00±0.00 0.00±0.00 8± 10 1.00±0.00 0.00±0.00 4± 2
monk1 5 1.00±0.00 0.00±0.00 91± 44 1.00±0.00 0.00±0.00 5± 2 1.00±0.00 0.00±0.00 13± 5 1.00±0.00 0.00±0.00 7± 2
hayes-roth 2 0.66±0.02 0.00±0.00 11± 2 0.66±0.02 0.00±0.00 5± 1 0.66±0.02 0.00±0.00 1± 0 0.66±0.02 0.00±0.00 1± 0
hayes-roth 3 0.81±0.04 0.13±0.29 1426±1326 0.81±0.04 0.75±0.42 3496± 233 0.81±0.04 0.00±0.00 28± 7 0.81±0.04 0.00±0.00 13± 5
hayes-roth 4 0.87±0.03 1.00±0.00 3600± 0 0.85±0.03 1.00±0.00 3600± 0 0.89±0.02 0.07±0.10 3104± 751 0.89±0.02 0.17±0.11 3434± 372
hayes-roth 5 0.89±0.02 1.00±0.00 3600± 0 0.90±0.03 1.00±0.00 3600± 0 0.92±0.02 1.00±0.00 3600± 0 0.92±0.02 1.00±0.00 3600± 0
monk2 2 0.71±0.02 0.00±0.00 28± 10 0.71±0.02 0.00±0.00 6± 0 0.71±0.02 0.00±0.00 5± 2 0.71±0.02 0.00±0.00 11± 18
monk2 3 0.80±0.02 0.82±0.11 3600± 0 0.80±0.03 1.00±0.01 3600± 0 0.81±0.02 0.01±0.02 1533±1288 0.81±0.02 0.00±0.00 591± 456
monk2 4 0.87±0.03 1.00±0.00 3600± 0 0.88±0.03 1.00±0.00 3600± 0 0.90±0.03 0.75±0.24 3600± 0 0.90±0.02 0.66±0.19 3600± 0
monk2 5 0.91±0.02 1.00±0.00 3600± 0 0.93±0.01 1.00±0.00 3600± 0 0.95±0.03 1.00±0.00 3600± 0 0.94±0.01 1.00±0.00 3600± 0
house-votes-84 2 0.97±0.00 0.00±0.00 6± 2 0.97±0.00 0.00±0.00 1± 0 0.97±0.00 0.00±0.00 2± 0 0.97±0.00 0.00±0.00 2± 0
house-votes-84 3 0.99±0.00 0.00±0.00 341± 229 0.99±0.00 0.00±0.00 181± 182 0.99±0.00 0.00±0.00 184± 120 0.99±0.00 0.00±0.00 29± 18
house-votes-84 4 1.00±0.00 0.00±0.00 52± 26 1.00±0.00 0.20±0.45 1107±1585 1.00±0.00 0.00±0.00 204± 301 1.00±0.00 0.00±0.00 17± 15
house-votes-84 5 1.00±0.00 0.00±0.00 156± 168 1.00±0.00 0.00±0.00 454± 603 1.00±0.00 0.00±0.00 34± 34 1.00±0.00 0.00±0.00 29± 31
spect 2 0.81±0.01 0.00±0.00 35± 17 0.81±0.01 0.00±0.00 8± 4 0.81±0.01 0.00±0.00 8± 3 0.81±0.01 0.00±0.00 6± 1
spect 3 0.86±0.02 0.00±0.00 1495±1045 0.86±0.02 0.51±0.47 2985± 874 0.86±0.02 0.02±0.05 1027±1453 0.86±0.02 0.00±0.00 335± 613
spect 4 0.88±0.07 0.78±0.10 3600± 0 0.90±0.01 0.96±0.04 3600± 0 0.91±0.01 0.28±0.10 3600± 0 0.92±0.01 0.15±0.10 3600± 0
spect 5 0.91±0.02 0.84±0.17 3600± 0 0.93±0.01 0.88±0.19 3600± 0 0.95±0.01 0.40±0.21 3600± 0 0.95±0.01 0.30±0.22 3238± 810
breast-cancer 2 0.80±0.02 0.00±0.00 108± 62 0.80±0.02 0.00±0.00 12± 2 0.80±0.02 0.00±0.00 21± 3 0.80±0.02 0.00±0.00 21± 27
breast-cancer 3 0.85±0.02 1.00±0.00 3600± 0 0.84±0.02 1.00±0.00 3600± 0 0.85±0.02 0.72±0.10 3600± 0 0.85±0.02 0.38±0.23 3376± 501
breast-cancer 4 0.87±0.02 1.00±0.00 3600± 0 0.87±0.02 1.00±0.00 3600± 0 0.89±0.02 1.00±0.00 3600± 0 0.89±0.02 1.00±0.00 3600± 0
breast-cancer 5 0.89±0.04 1.00±0.00 3600± 0 0.91±0.02 1.00±0.00 3600± 0 0.90±0.03 1.00±0.00 3600± 0 0.92±0.03 1.00±0.00 3600± 0
balance-scale 2 0.70±0.01 0.00±0.00 203± 101 0.70±0.01 0.00±0.00 7± 1 0.70±0.01 0.00±0.00 11± 2 0.70±0.01 0.00±0.00 8± 3
balance-scale 3 0.75±0.02 1.00±0.00 3600± 0 0.76±0.01 0.95±0.05 3600± 0 0.77±0.01 0.00±0.00 1229± 204 0.77±0.01 0.00±0.00 439± 230
balance-scale 4 0.74±0.03 1.00±0.00 3600± 0 0.79±0.02 1.00±0.00 3600± 0 0.79±0.01 1.00±0.01 3600± 0 0.80±0.01 0.41±0.08 3600± 0
balance-scale 5 0.76±0.05 1.00±0.00 3600± 0 0.82±0.01 1.00±0.00 3600± 0 0.80±0.01 1.00±0.00 3600± 0 0.83±0.01 1.00±0.00 3600± 0
tic-tac-toe 2 0.73±0.01 0.10±0.17 2340±1154 0.73±0.01 0.00±0.00 114± 49 0.73±0.01 0.00±0.00 312± 24 0.73±0.01 0.00±0.00 255± 207
tic-tac-toe 3 0.75±0.02 1.00±0.00 3600± 0 0.78±0.01 1.00±0.00 3600± 0 0.79±0.01 1.00±0.00 3600± 0 0.79±0.01 0.72±0.13 3600± 0
tic-tac-toe 4 0.76±0.05 1.00±0.00 3600± 0 0.84±0.01 1.00±0.00 3600± 0 0.83±0.01 1.00±0.00 3600± 0 0.85±0.02 1.00±0.00 3600± 0
tic-tac-toe 5 0.70±0.01 1.00±0.00 3600± 0 0.89±0.02 1.00±0.00 3600± 0 0.82±0.01 1.00±0.00 3600± 0 0.88±0.02 1.00±0.00 3600± 0
car-evaluation 2 0.78±0.01 0.00±0.00 1168± 317 0.78±0.01 0.00±0.00 28± 6 0.78±0.01 0.00±0.00 59± 27 0.78±0.01 0.00±0.00 63± 22
car-evaluation 3 0.77±0.04 1.00±0.00 3600± 0 0.82±0.01 1.00±0.00 3600± 0 0.82±0.01 0.87±0.06 3600± 0 0.82±0.01 0.23±0.10 3600± 0
car-evaluation 4 0.78±0.02 1.00±0.00 3600± 0 0.83±0.01 1.00±0.00 3600± 0 0.81±0.02 1.00±0.00 3600± 0 0.84±0.01 1.00±0.00 3600± 0
car-evaluation 5 0.73±0.03 1.00±0.00 3600± 0 0.86±0.01 1.00±0.00 3600± 0 0.81±0.03 1.00±0.00 3600± 0 0.86±0.02 1.00±0.00 3600± 0
kr-vs-kp 2 0.83±0.05 0.98±0.02 3600± 0 0.87±0.01 0.00±0.00 102± 41 0.87±0.01 0.00±0.00 1399± 430 0.87±0.01 0.00±0.00 710± 369
kr-vs-kp 3 0.72±0.04 1.00±0.00 3600± 0 0.92±0.01 1.00±0.00 3600± 0 0.77±0.07 1.00±0.00 3600± 0 0.92±0.03 0.93±0.15 3600± 0
kr-vs-kp 4 0.70±0.06 1.00±0.00 3600± 0 0.94±0.02 1.00±0.00 3600± 0 0.76±0.07 1.00±0.00 3600± 0 0.91±0.05 1.00±0.00 3600± 0
kr-vs-kp 5 0.65±0.06 1.00±0.00 3600± 0 0.94±0.01 1.00±0.00 3600± 0 0.74±0.06 1.00±0.00 3600± 0 0.93±0.03 1.00±0.00 3600± 0
Table 5.3: In-sample results including the average and standard deviation of optimality gap and solving time across 45 instances (5 samples × 9 values of λ) for the case of λ > 0 on categorical datasets. The best performance achieved in a given dataset and depth is reported in bold.
Dataset Depth OCT FlowOCT BendersOCT
Gap Time Gap Time Gap Time
soybean-small 2 0.00±0.00 3± 1 0.00±0.00 0± 0 0.00±0.00 1± 0
soybean-small 3 0.00±0.00 7± 6 0.00±0.00 1± 0 0.00±0.00 3± 1
soybean-small 4 0.00±0.00 22± 20 0.00±0.00 2± 1 0.00±0.00 6± 4
soybean-small 5 0.00±0.00 33± 22 0.00±0.00 6± 3 0.00±0.00 14± 10
monk3 2 0.00±0.00 3± 1 0.00±0.00 1± 0 0.00±0.00 2± 0
monk3 3 0.00±0.00 395± 760 0.00±0.00 10± 12 0.00±0.00 11± 8
monk3 4 0.07±0.14 1161±1456 0.00±0.00 104± 186 0.00±0.00 41± 53
monk3 5 0.14±0.22 1633±1534 0.00±0.00 153± 261 0.00±0.00 46± 44
monk1 2 0.00±0.00 4± 2 0.00±0.00 1± 0 0.00±0.00 2± 0
monk1 3 0.01±0.04 1048±1015 0.00±0.00 13± 7 0.00±0.00 9± 4
monk1 4 0.01±0.04 1097± 819 0.00±0.00 22± 11 0.00±0.00 18± 28
monk1 5 0.11±0.15 2600±1303 0.00±0.00 32± 16 0.00±0.00 24± 12
hayes-roth 2 0.00±0.00 9± 3 0.00±0.00 1± 0 0.00±0.00 2± 1
hayes-roth 3 0.17±0.28 2326±1257 0.00±0.00 26± 14 0.00±0.00 19± 8
hayes-roth 4 0.67±0.30 3298± 886 0.05±0.09 1850±1470 0.00±0.00 954± 799
hayes-roth 5 0.73±0.29 3263± 966 0.39±0.32 2812±1383 0.29±0.27 2612±1485
monk2 2 0.00±0.00 15± 15 0.00±0.00 4± 2 0.00±0.00 5± 2
monk2 3 0.55±0.33 3056±1228 0.00±0.02 1141±1127 0.00±0.00 566± 550
monk2 4 0.71±0.30 3205±1131 0.41±0.27 2892±1356 0.36±0.25 2828±1463
monk2 5 0.76±0.30 3205±1129 0.55±0.33 2973±1292 0.51±0.32 2850±1427
house-votes-84 2 0.00±0.00 4± 2 0.00±0.00 2± 1 0.00±0.00 2± 1
house-votes-84 3 0.00±0.00 348± 482 0.00±0.00 42± 66 0.00±0.00 19± 26
house-votes-84 4 0.02±0.08 808±1109 0.00±0.00 118± 309 0.00±0.00 37± 85
house-votes-84 5 0.10±0.21 1187±1418 0.00±0.00 105± 242 0.00±0.00 42± 58
spect 2 0.00±0.00 20± 16 0.00±0.00 5± 3 0.00±0.00 5± 2
spect 3 0.02±0.07 1231±1065 0.01±0.02 409± 952 0.00±0.00 209± 597
spect 4 0.55±0.29 3194±1136 0.07±0.09 1814±1661 0.02±0.06 1040±1374
spect 5 0.70±0.29 3206±1126 0.16±0.16 2528±1554 0.09±0.13 2035±1653
breast-cancer 2 0.00±0.00 124± 92 0.00±0.00 20± 7 0.00±0.00 11± 3
breast-cancer 3 0.75±0.27 3403± 663 0.44±0.25 3082±1185 0.35±0.21 2949±1274
breast-cancer 4 0.81±0.25 3584± 106 0.67±0.31 3212±1110 0.64±0.31 3206±1128
breast-cancer 5 0.84±0.23 3600± 0 0.70±0.30 3215±1100 0.67±0.30 3209±1120
balance-scale 2 0.00±0.00 152± 67 0.00±0.00 10± 1 0.00±0.00 7± 4
balance-scale 3 0.93±0.08 3600± 0 0.00±0.00 981± 372 0.00±0.00 418± 235
balance-scale 4 0.95±0.06 3600± 0 0.81±0.17 3600± 0 0.48±0.21 3342± 741
balance-scale 5 0.97±0.04 3600± 0 0.88±0.13 3600± 0 0.83±0.19 3600± 0
tic-tac-toe 2 0.03±0.12 2300± 859 0.00±0.00 352± 104 0.00±0.00 159± 25
tic-tac-toe 3 0.96±0.05 3600± 0 0.91±0.11 3600± 0 0.79±0.14 3600± 0
tic-tac-toe 4 0.97±0.03 3600± 0 0.93±0.08 3600± 0 0.91±0.10 3600± 0
tic-tac-toe 5 0.98±0.02 3600± 0 0.93±0.07 3600± 0 0.91±0.09 3600± 0
car-evaluation 2 0.02±0.11 1451± 730 0.00±0.00 67± 26 0.00±0.00 67± 17
car-evaluation 3 0.99±0.02 3600± 0 0.73±0.10 3600± 0 0.24±0.12 3449± 464
car-evaluation 4 0.99±0.01 3600± 0 0.95±0.06 3600± 0 0.91±0.10 3600± 0
car-evaluation 5 0.99±0.01 3600± 0 0.97±0.04 3600± 0 0.93±0.08 3600± 0
kr-vs-kp 2 0.98±0.03 3600± 0 0.00±0.00 921± 291 0.00±0.00 293± 151
kr-vs-kp 3 0.99±0.01 3600± 0 0.96±0.05 3600± 0 0.91±0.11 3600± 0
kr-vs-kp 4 1.00±0.01 3600± 0 0.97±0.03 3600± 0 0.92±0.09 3600± 0
kr-vs-kp 5 1.00±0.00 3600± 0 0.97±0.04 3600± 0 0.93±0.07 3600± 0
Table 5.4: Average out-of-sample accuracy and standard deviation of accuracy across 5 samples
given the calibrated λ on categorical datasets. The highest accuracy achieved in a given dataset
and depth is reported in bold.
dataset depth OCT BinOCT FlowOCT BendersOCT
soybean-small 2 1.00±0.00 0.98±0.04 1.00±0.00 1.00±0.00
soybean-small 3 0.98±0.04 0.98±0.04 1.00±0.00 0.98±0.04
soybean-small 4 0.98±0.04 0.98±0.04 1.00±0.00 0.98±0.04
soybean-small 5 0.98±0.04 0.98±0.04 0.98±0.04 0.98±0.04
monk3 2 0.92±0.02 0.92±0.02 0.92±0.02 0.92±0.02
monk3 3 0.92±0.02 0.91±0.01 0.91±0.03 0.91±0.03
monk3 4 0.92±0.02 0.84±0.08 0.92±0.03 0.92±0.02
monk3 5 0.92±0.03 0.87±0.04 0.92±0.03 0.92±0.03
monk1 2 0.71±0.08 0.72±0.10 0.71±0.08 0.71±0.08
monk1 3 0.83±0.14 0.83±0.07 0.81±0.13 0.83±0.13
monk1 4 1.00±0.00 0.99±0.01 1.00±0.00 1.00±0.00
monk1 5 0.88±0.19 0.97±0.07 1.00±0.00 1.00±0.00
hayes-roth 2 0.39±0.09 0.45±0.06 0.44±0.08 0.41±0.10
hayes-roth 3 0.53±0.07 0.56±0.07 0.55±0.09 0.55±0.07
hayes-roth 4 0.72±0.09 0.71±0.09 0.72±0.05 0.72±0.06
hayes-roth 5 0.64±0.12 0.76±0.06 0.79±0.05 0.81±0.02
monk2 2 0.57±0.06 0.50±0.05 0.57±0.06 0.57±0.06
monk2 3 0.58±0.08 0.59±0.09 0.66±0.06 0.63±0.05
monk2 4 0.63±0.08 0.60±0.04 0.62±0.06 0.60±0.06
monk2 5 0.64±0.07 0.57±0.08 0.65±0.05 0.60±0.07
house-votes-84 2 0.78±0.25 0.96±0.04 0.97±0.02 0.97±0.02
house-votes-84 3 0.97±0.02 0.94±0.02 0.97±0.02 0.97±0.02
house-votes-84 4 0.98±0.02 0.95±0.03 0.96±0.01 0.96±0.01
house-votes-84 5 0.97±0.02 0.93±0.05 0.97±0.02 0.97±0.02
spect 2 0.76±0.05 0.74±0.05 0.76±0.05 0.76±0.05
spect 3 0.76±0.05 0.73±0.04 0.76±0.05 0.76±0.05
spect 4 0.76±0.05 0.75±0.05 0.76±0.05 0.76±0.05
spect 5 0.74±0.02 0.74±0.08 0.75±0.04 0.76±0.05
breast-cancer 2 0.72±0.04 0.71±0.03 0.73±0.05 0.72±0.04
breast-cancer 3 0.74±0.04 0.71±0.05 0.75±0.02 0.72±0.04
breast-cancer 4 0.72±0.05 0.66±0.08 0.73±0.04 0.71±0.04
breast-cancer 5 0.74±0.01 0.71±0.03 0.73±0.05 0.72±0.02
balance-scale 2 0.69±0.02 0.68±0.03 0.69±0.02 0.69±0.02
balance-scale 3 0.70±0.02 0.72±0.03 0.70±0.03 0.71±0.03
balance-scale 4 0.67±0.04 0.74±0.03 0.73±0.03 0.72±0.03
balance-scale 5 0.61±0.05 0.76±0.02 0.73±0.02 0.75±0.03
tic-tac-toe 2 0.67±0.02 0.66±0.02 0.67±0.02 0.67±0.02
tic-tac-toe 3 0.69±0.03 0.72±0.02 0.72±0.01 0.72±0.03
tic-tac-toe 4 0.70±0.04 0.78±0.03 0.77±0.02 0.78±0.03
tic-tac-toe 5 0.68±0.03 0.80±0.04 0.80±0.02 0.78±0.04
car-evaluation 2 0.77±0.01 0.77±0.01 0.77±0.01 0.77±0.01
car-evaluation 3 0.72±0.02 0.78±0.01 0.79±0.01 0.79±0.01
car-evaluation 4 0.72±0.02 0.81±0.01 0.79±0.01 0.81±0.01
car-evaluation 5 0.76±0.03 0.82±0.01 0.76±0.04 0.84±0.01
kr-vs-kp 2 0.68±0.06 0.87±0.01 0.87±0.01 0.87±0.01
kr-vs-kp 3 0.68±0.08 0.86±0.06 0.74±0.05 0.92±0.02
kr-vs-kp 4 0.58±0.07 0.90±0.03 0.80±0.04 0.94±0.00
kr-vs-kp 5 0.66±0.11 0.87±0.08 0.84±0.06 0.89±0.08
Table 5.5: Average in-sample accuracy and standard deviation of accuracy across 5 samples
for the case of λ = 0 on categorical datasets. The highest accuracy achieved in a given dataset
and depth is reported in bold.
dataset depth OCT LST BendersOCT
soybean-small 2 1.00±0.00 1.00±0.00 1.00±0.00
soybean-small 3 1.00±0.00 1.00±0.00 1.00±0.00
soybean-small 4 1.00±0.00 1.00±0.00 1.00±0.00
soybean-small 5 1.00±0.00 1.00±0.00 1.00±0.00
monk3 2 0.94±0.01 0.94±0.01 0.94±0.01
monk3 3 0.98±0.01 0.96±0.01 0.98±0.01
monk3 4 1.00±0.00 0.99±0.01 1.00±0.00
monk3 5 1.00±0.00 1.00±0.00 1.00±0.00
monk1 2 0.86±0.03 0.86±0.03 0.86±0.03
monk1 3 0.95±0.02 0.95±0.02 0.95±0.02
monk1 4 1.00±0.00 1.00±0.00 1.00±0.00
monk1 5 1.00±0.00 1.00±0.00 1.00±0.00
hayes-roth 2 0.66±0.02 0.66±0.02 0.66±0.02
hayes-roth 3 0.81±0.04 0.80±0.04 0.81±0.04
hayes-roth 4 0.87±0.03 0.87±0.03 0.89±0.02
hayes-roth 5 0.89±0.02 0.91±0.02 0.92±0.02
monk2 2 0.71±0.02 0.71±0.02 0.71±0.02
monk2 3 0.80±0.02 0.81±0.02 0.81±0.02
monk2 4 0.87±0.03 0.88±0.02 0.90±0.02
monk2 5 0.91±0.02 0.93±0.02 0.94±0.01
house-votes-84 2 0.97±0.00 0.97±0.00 0.97±0.00
house-votes-84 3 0.99±0.00 0.98±0.01 0.99±0.00
house-votes-84 4 1.00±0.00 1.00±0.00 1.00±0.00
house-votes-84 5 1.00±0.00 1.00±0.00 1.00±0.00
spect 2 0.81±0.01 0.81±0.01 0.81±0.01
spect 3 0.86±0.02 0.86±0.02 0.86±0.02
spect 4 0.88±0.07 0.91±0.01 0.92±0.01
spect 5 0.91±0.02 0.93±0.01 0.95±0.01
breast-cancer 2 0.80±0.02 0.80±0.02 0.80±0.02
breast-cancer 3 0.85±0.02 0.85±0.02 0.85±0.02
breast-cancer 4 0.87±0.02 0.89±0.02 0.89±0.02
breast-cancer 5 0.89±0.04 0.92±0.02 0.92±0.03
balance-scale 2 0.70±0.01 0.70±0.01 0.70±0.01
balance-scale 3 0.75±0.02 0.77±0.01 0.77±0.01
balance-scale 4 0.74±0.03 0.80±0.01 0.80±0.01
balance-scale 5 0.76±0.05 0.84±0.01 0.83±0.01
tic-tac-toe 2 0.73±0.01 0.73±0.01 0.73±0.01
tic-tac-toe 3 0.75±0.02 0.79±0.01 0.79±0.01
tic-tac-toe 4 0.76±0.05 0.86±0.00 0.85±0.02
tic-tac-toe 5 0.70±0.01 0.93±0.01 0.88±0.02
car-evaluation 2 0.78±0.01 0.78±0.01 0.78±0.01
car-evaluation 3 0.77±0.04 0.82±0.01 0.82±0.01
car-evaluation 4 0.78±0.02 0.84±0.00 0.84±0.01
car-evaluation 5 0.73±0.03 0.88±0.00 0.86±0.02
kr-vs-kp 2 0.83±0.05 0.87±0.01 0.87±0.01
kr-vs-kp 3 0.72±0.04 0.94±0.01 0.92±0.03
kr-vs-kp 4 0.70±0.06 0.95±0.00 0.91±0.05
kr-vs-kp 5 0.65±0.06 0.97±0.01 0.93±0.03
Table 5.6: Average in-sample accuracy and standard deviation of accuracy across 5 samples for the case of λ = 0 on mixed-feature
datasets (part 1). Due to the numerical issues of OCT (as discussed in Appendix A.3), we do not provide solving time and optimality
gap for this approach as it tackles a different problem. Instead, we report the number of instances (out of five samples) in a given
dataset and depth, where we observe a discrepancy of at least 0.001 between the objective value of the optimization problem and the
actual in-sample accuracy. The best performance achieved in a given dataset and depth is reported in bold.
Dataset Depth OCT BendersOCT-5 BendersOCT-10
Train-acc Numerical Issues Train-acc Gap Time Train-acc Gap Time
echocardiogram 2 1.00±0.00 0 1.00±0.00 0.00±0.00 1± 0 1.00±0.00 0.00±0.00 1± 0
echocardiogram 3 1.00±0.00 0 1.00±0.00 0.00±0.00 1± 0 1.00±0.00 0.00±0.00 1± 0
echocardiogram 4 0.99±0.01 1 1.00±0.00 0.00±0.00 2± 0 1.00±0.00 0.00±0.00 2± 2
echocardiogram 5 1.00±0.00 0 1.00±0.00 0.00±0.00 0± 0 1.00±0.00 0.00±0.00 3± 1
hepatitis 2 0.98±0.01 1 0.96±0.03 0.00±0.00 3± 2 0.96±0.03 0.00±0.00 9± 7
hepatitis 3 1.00±0.00 0 1.00±0.01 0.20±0.45 722±1609 1.00±0.00 0.00±0.00 224± 495
hepatitis 4 1.00±0.01 1 1.00±0.00 0.00±0.00 8± 7 1.00±0.00 0.00±0.00 6± 3
hepatitis 5 0.99±0.01 2 1.00±0.00 0.00±0.00 5± 2 1.00±0.00 0.00±0.00 8± 3
fertility 2 0.91±0.01 0 0.90±0.01 0.00±0.00 4± 1 0.90±0.01 0.00±0.00 4± 1
fertility 3 0.97±0.02 0 0.96±0.01 0.23±0.32 2504±1193 0.96±0.01 0.23±0.32 2692±1062
fertility 4 0.98±0.02 1 0.98±0.02 0.60±0.55 2220±1893 0.98±0.02 0.60±0.55 2213±1901
fertility 5 0.97±0.05 1 0.99±0.01 0.60±0.55 2171±1956 0.99±0.01 0.60±0.55 2172±1956
iris 2 0.96±0.04 2 0.95±0.03 0.00±0.00 1± 0 0.96±0.02 0.00±0.00 2± 0
iris 3 0.99±0.01 2 0.98±0.02 0.00±0.00 31± 31 0.99±0.01 0.00±0.00 267± 427
iris 4 0.99±0.01 2 0.99±0.01 0.40±0.55 1443±1969 1.00±0.01 0.20±0.45 1216±1706
iris 5 0.98±0.02 3 0.99±0.01 0.40±0.55 1444±1969 1.00±0.00 0.00±0.00 25± 31
wine 2 0.96±0.01 3 0.95±0.02 0.00±0.00 8± 3 0.95±0.01 0.00±0.00 31± 9
wine 3 0.99±0.01 2 1.00±0.01 0.10±0.22 736±1601 1.00±0.00 0.00±0.00 783± 857
wine 4 0.99±0.01 3 1.00±0.00 0.00±0.00 175± 186 1.00±0.00 0.00±0.00 42± 13
wine 5 0.99±0.01 3 1.00±0.00 0.00±0.00 41± 11 1.00±0.00 0.00±0.00 91± 65
planning-relax 2 0.68±0.27 2 0.79±0.03 0.00±0.00 83± 27 0.81±0.03 0.00±0.00 861± 373
planning-relax 3 0.72±0.19 4 0.86±0.03 0.99±0.02 3600± 0 0.86±0.03 1.00±0.00 3600± 0
planning-relax 4 0.67±0.20 5 0.91±0.03 1.00±0.00 3600± 0 0.91±0.03 1.00±0.00 3600± 0
planning-relax 5 0.86±0.13 3 0.97±0.03 0.60±0.55 3075±1016 0.93±0.05 1.00±0.00 3600± 0
breast-cancer-prognostic 2 0.87±0.02 3 0.85±0.02 0.00±0.00 865± 283 0.86±0.03 0.83±0.08 3600± 0
breast-cancer-prognostic 3 0.93±0.02 1 0.91±0.02 1.00±0.00 3600± 0 0.92±0.02 1.00±0.00 3600± 0
breast-cancer-prognostic 4 0.97±0.04 2 0.94±0.02 1.00±0.00 3600± 0 0.94±0.02 1.00±0.00 3600± 0
breast-cancer-prognostic 5 0.99±0.01 2 0.95±0.04 1.00±0.00 3600± 0 0.93±0.03 1.00±0.00 3600± 0
parkinsons 2 0.95±0.01 0 0.91±0.01 0.00±0.00 36± 8 0.93±0.01 0.00±0.00 291± 102
parkinsons 3 0.99±0.01 2 0.97±0.01 1.00±0.00 3600± 0 0.98±0.01 1.00±0.00 3600± 0
parkinsons 4 0.99±0.01 3 0.99±0.00 0.80±0.45 3255± 772 0.99±0.01 0.60±0.55 2817±1123
parkinsons 5 0.99±0.01 3 1.00±0.00 0.00±0.00 426± 609 0.99±0.01 0.60±0.55 2251±1849
connectionist-bench-sonar 2 0.84±0.01 2 0.83±0.01 0.54±0.42 3600± 0 0.83±0.02 1.00±0.00 3600± 0
connectionist-bench-sonar 3 0.88±0.03 3 0.89±0.02 1.00±0.00 3600± 0 0.89±0.02 1.00±0.00 3600± 0
connectionist-bench-sonar 4 0.91±0.06 3 0.95±0.03 1.00±0.00 3600± 0 0.94±0.02 1.00±0.00 3600± 0
connectionist-bench-sonar 5 0.96±0.04 3 0.98±0.02 0.80±0.45 2995±1354 0.89±0.03 1.00±0.00 3600± 0
Table 5.7: Average in-sample accuracy and standard deviation of accuracy across 5 samples for the case of λ = 0 on mixed-feature
datasets (part 2). The best performance achieved in a given dataset and depth is reported in bold.
Dataset Depth OCT BendersOCT-5 BendersOCT-10
Train-acc Numerical Issues Train-acc Gap Time Train-acc Gap Time
seeds 2 0.96±0.01 1 0.90±0.02 0.00±0.00 3± 1 0.93±0.01 0.00±0.00 9± 1
seeds 3 0.98±0.01 4 0.94±0.02 0.04±0.06 1654±1799 0.97±0.02 0.42±0.39 2977± 856
seeds 4 0.99±0.01 3 0.97±0.01 1.00±0.00 3600± 0 0.99±0.01 0.80±0.45 3586± 30
seeds 5 0.99±0.01 3 0.98±0.01 1.00±0.00 3600± 0 0.99±0.01 0.40±0.55 1938±1537
cylinder-bands 2 0.74±0.02 1 0.76±0.01 0.26±0.27 3515± 191 0.75±0.02 0.98±0.03 3600± 0
cylinder-bands 3 0.81±0.02 0 0.81±0.01 1.00±0.00 3600± 0 0.82±0.02 1.00±0.00 3600± 0
cylinder-bands 4 0.85±0.02 1 0.79±0.05 1.00±0.00 3600± 0 0.81±0.03 1.00±0.00 3600± 0
cylinder-bands 5 0.89±0.01 3 0.82±0.07 1.00±0.00 3600± 0 0.82±0.05 1.00±0.00 3600± 0
heart-cleveland 2 0.63±0.02 2 0.63±0.02 0.00±0.00 118± 87 0.63±0.02 0.00±0.00 237± 69
heart-cleveland 3 0.68±0.02 1 0.69±0.02 0.97±0.02 3600± 0 0.71±0.02 1.00±0.00 3600± 0
heart-cleveland 4 0.71±0.02 3 0.76±0.03 1.00±0.00 3600± 0 0.75±0.04 1.00±0.00 3600± 0
heart-cleveland 5 0.73±0.03 4 0.80±0.05 1.00±0.00 3600± 0 0.76±0.07 1.00±0.00 3600± 0
ionosphere 2 0.92±0.01 4 0.89±0.01 0.00±0.00 522± 466 0.90±0.01 0.32±0.45 2999± 782
ionosphere 3 0.94±0.01 4 0.94±0.01 1.00±0.00 3600± 0 0.95±0.01 1.00±0.00 3600± 0
ionosphere 4 0.95±0.01 3 0.96±0.01 1.00±0.00 3600± 0 0.96±0.01 1.00±0.00 3600± 0
ionosphere 5 0.96±0.01 4 0.98±0.01 1.00±0.00 3600± 0 0.97±0.02 1.00±0.00 3600± 0
thoracic-surgery 2 0.88±0.01 1 0.87±0.01 0.00±0.00 41± 12 0.88±0.01 0.00±0.00 188± 49
thoracic-surgery 3 0.89±0.02 1 0.89±0.01 0.99±0.02 3600± 0 0.89±0.02 1.00±0.00 3600± 0
thoracic-surgery 4 0.89±0.01 4 0.90±0.02 1.00±0.00 3600± 0 0.90±0.02 1.00±0.00 3600± 0
thoracic-surgery 5 0.91±0.02 3 0.92±0.01 1.00±0.00 3600± 0 0.92±0.01 1.00±0.00 3600± 0
climate 2 0.87±0.16 2 0.93±0.01 0.00±0.00 654± 224 0.94±0.01 0.70±0.18 3600± 0
climate 3 0.90±0.12 3 0.95±0.01 1.00±0.00 3600± 0 0.95±0.01 1.00±0.00 3600± 0
climate 4 0.95±0.03 4 0.97±0.01 1.00±0.00 3600± 0 0.96±0.01 1.00±0.00 3600± 0
climate 5 0.94±0.06 3 0.97±0.02 1.00±0.00 3600± 0 0.97±0.01 1.00±0.00 3600± 0
breast-cancer-diagnostic 2 0.97±0.01 2 0.95±0.01 0.00±0.00 344± 162 0.96±0.01 0.10±0.22 1948±1102
breast-cancer-diagnostic 3 0.98±0.00 3 0.97±0.01 1.00±0.00 3600± 0 0.98±0.01 1.00±0.00 3600± 0
breast-cancer-diagnostic 4 0.99±0.01 3 0.98±0.01 1.00±0.00 3600± 0 0.99±0.01 1.00±0.00 3600± 0
breast-cancer-diagnostic 5 0.99±0.01 3 0.99±0.01 0.60±0.55 3056± 747 0.99±0.00 1.00±0.00 3600± 0
indian-liver-patient 2 0.76±0.02 0 0.74±0.02 0.00±0.00 248± 99 0.75±0.02 0.00±0.00 1326± 174
indian-liver-patient 3 0.78±0.01 4 0.77±0.02 0.99±0.02 3600± 0 0.78±0.01 1.00±0.00 3600± 0
indian-liver-patient 4 0.80±0.02 4 0.80±0.02 1.00±0.00 3600± 0 0.80±0.02 1.00±0.00 3600± 0
indian-liver-patient 5 0.83±0.02 5 0.82±0.02 1.00±0.00 3600± 0 0.84±0.01 1.00±0.00 3600± 0
credit-approval 2 0.87±0.01 5 0.88±0.01 0.00±0.00 274± 162 0.88±0.01 0.00±0.00 1070± 266
credit-approval 3 0.85±0.09 5 0.89±0.01 1.00±0.00 3600± 0 0.89±0.01 1.00±0.00 3600± 0
credit-approval 4 0.84±0.05 5 0.91±0.01 1.00±0.00 3600± 0 0.90±0.01 1.00±0.00 3600± 0
credit-approval 5 0.83±0.04 5 0.91±0.01 1.00±0.00 3600± 0 0.92±0.02 1.00±0.00 3600± 0
Table 5.8: Average in-sample accuracy and standard deviation of accuracy across 5 samples for the case of λ = 0 on mixed-feature
datasets (part 3). The best performance achieved in a given dataset and depth is reported in bold.
Dataset Depth OCT BendersOCT-5 BendersOCT-10
Train-acc Numerical Issues Train-acc Gap Time Train-acc Gap Time
blood-transfusion 2 0.79±0.01 1 0.77±0.01 0.00±0.00 17± 4 0.78±0.01 0.00±0.00 68± 24
blood-transfusion 3 0.81±0.01 4 0.79±0.01 0.68±0.11 3600± 0 0.81±0.01 0.81±0.03 3600± 0
blood-transfusion 4 0.81±0.03 5 0.80±0.01 0.96±0.02 3600± 0 0.82±0.01 0.93±0.02 3600± 0
blood-transfusion 5 0.81±0.03 5 0.81±0.01 0.96±0.02 3600± 0 0.82±0.01 0.95±0.02 3600± 0
diabetes 2 0.78±0.01 2 0.77±0.00 0.00±0.00 216± 173 0.77±0.00 0.00±0.00 1059± 290
diabetes 3 0.80±0.01 3 0.79±0.01 1.00±0.00 3600± 0 0.79±0.01 1.00±0.00 3600± 0
diabetes 4 0.81±0.02 4 0.80±0.01 1.00±0.00 3600± 0 0.79±0.02 1.00±0.00 3600± 0
diabetes 5 0.83±0.02 4 0.82±0.01 1.00±0.00 3600± 0 0.81±0.02 1.00±0.00 3600± 0
qsar-biodegradation 2 0.80±0.01 2 0.80±0.02 0.04±0.08 2686± 587 0.80±0.02 0.95±0.06 3600± 0
qsar-biodegradation 3 0.83±0.01 3 0.83±0.02 1.00±0.00 3600± 0 0.83±0.02 1.00±0.00 3600± 0
qsar-biodegradation 4 0.85±0.02 4 0.86±0.02 1.00±0.00 3600± 0 0.83±0.03 1.00±0.00 3600± 0
qsar-biodegradation 5 0.87±0.03 4 0.86±0.02 1.00±0.00 3600± 0 0.84±0.03 1.00±0.00 3600± 0
banknote-authentication 2 0.55±0.17 5 0.90±0.01 0.00±0.00 19± 3 0.92±0.01 0.00±0.00 153± 110
banknote-authentication 3 0.66±0.29 3 0.95±0.00 0.00±0.00 1655± 352 0.98±0.01 1.00±0.00 3600± 0
banknote-authentication 4 0.65±0.29 5 0.97±0.00 1.00±0.00 3600± 0 0.99±0.00 1.00±0.00 3600± 0
banknote-authentication 5 0.63±0.21 5 0.98±0.01 1.00±0.00 3600± 0 0.99±0.00 1.00±0.00 3600± 0
ozone-level-detection-one 2 0.97±0.01 0 0.97±0.01 0.98±0.03 3600± 0 0.97±0.01 1.00±0.00 3600± 0
ozone-level-detection-one 3 0.98±0.01 2 0.97±0.01 1.00±0.00 3600± 0 0.97±0.01 1.00±0.00 3600± 0
ozone-level-detection-one 4 0.97±0.01 2 0.97±0.01 1.00±0.00 3600± 0 0.97±0.01 1.00±0.00 3600± 0
ozone-level-detection-one 5 0.97±0.01 2 0.97±0.01 1.00±0.00 3600± 0 0.97±0.00 1.00±0.00 3600± 0
image-segmentation 2 0.15±0.00 5 0.57±0.01 0.08±0.18 1645±1172 0.58±0.01 0.26±0.15 3292± 689
image-segmentation 3 0.14±0.01 5 0.76±0.05 1.00±0.00 3600± 0 0.52±0.05 1.00±0.00 3600± 0
image-segmentation 4 0.14±0.01 5 0.82±0.02 1.00±0.00 3600± 0 0.57±0.04 1.00±0.00 3600± 1
image-segmentation 5 0.15±0.00 5 0.78±0.03 1.00±0.00 3600± 0 0.76±0.13 1.00±0.00 3600± 0
seismic-bumps 2 0.73±0.37 5 0.94±0.01 0.00±0.00 789± 441 0.94±0.01 0.19±0.12 3331± 603
seismic-bumps 3 0.80±0.20 5 0.94±0.01 1.00±0.00 3600± 0 0.94±0.01 1.00±0.00 3600± 0
seismic-bumps 4 0.92±0.04 5 0.94±0.01 1.00±0.00 3600± 0 0.94±0.01 1.00±0.00 3600± 0
seismic-bumps 5 0.79±0.28 5 0.94±0.01 1.00±0.00 3600± 0 0.94±0.01 1.00±0.00 3600± 0
thyroid-disease-ann-thyroid 2 0.98±0.01 0 0.92±0.00 0.00±0.00 616± 250 0.97±0.00 0.00±0.00 349± 116
thyroid-disease-ann-thyroid 3 0.99±0.00 0 0.93±0.00 0.97±0.04 3600± 0 0.98±0.00 1.00±0.00 3600± 0
thyroid-disease-ann-thyroid 4 0.98±0.01 1 0.94±0.00 1.00±0.00 3600± 0 0.99±0.00 1.00±0.00 3600± 0
thyroid-disease-ann-thyroid 5 0.98±0.01 1 0.94±0.01 1.00±0.00 3600± 0 0.99±0.00 1.00±0.00 3600± 0
spambase 2 0.52±0.12 5 0.86±0.01 0.12±0.26 3072± 633 0.83±0.01 0.89±0.02 3600± 0
spambase 3 0.48±0.12 5 0.86±0.02 0.98±0.01 3600± 0 0.81±0.02 0.99±0.00 3600± 0
spambase 4 0.44±0.09 5 0.87±0.01 0.99±0.01 3600± 0 0.83±0.03 1.00±0.00 3600± 0
spambase 5 0.57±0.09 5 0.86±0.03 0.99±0.01 3600± 1 0.83±0.03 1.00±0.00 3601± 0
wall-following-robot-2 2 0.64±0.10 0 0.63±0.06 0.98±0.02 3600± 0 0.60±0.09 1.00±0.00 3601± 1
wall-following-robot-2 3 0.68±0.05 0 0.66±0.04 1.00±0.00 3600± 0 0.58±0.11 1.00±0.00 3600± 0
wall-following-robot-2 4 0.70±0.09 2 0.63±0.12 1.00±0.00 3600± 0 0.56±0.03 1.00±0.00 3600± 0
wall-following-robot-2 5 0.63±0.11 0 0.62±0.06 1.00±0.00 3600± 0 0.64±0.11 1.00±0.00 3602± 3
A.7 Additional Experimental Results
In this section, we report and discuss additional numerical experiments conducted on the
categorical datasets.
A.7.1 BendersOCT’s Variants
In this section, we evaluate three implementation variants of BendersOCT that differ in the strategy used to add the cut-set inequalities. Recall from Remark 2 that, in the implementation of BendersOCT, we first add as many cut-set inequalities as possible at the root node of the branch-and-bound tree by using Gurobi to solve each subproblem; after that, we separate the cut-set inequalities only at integral solutions by solving the min-cut subproblems with Algorithm 2. The first variant, BendersOCT-MIPSol-Alg2, is identical to BendersOCT except that we no longer add the cut-set inequalities at the root node of the branch-and-bound tree. In the second variant, BendersOCT-MIPSol-LP, we again separate the cut-set inequalities only at integral solutions, but instead of using Algorithm 2 to solve the min-cut subproblems, we solve them with Gurobi. In the third variant, BendersOCT-AllSol-LP, we separate the cut-set inequalities at all solutions, both fractional and integral, by solving each subproblem with Gurobi. Figure 5.4 summarizes the in-sample performance of all four methods.
From the left part (time axis) of Figure 5.4, we observe that BendersOCT-AllSol-LP can solve 1159 instances (out of 2400) to optimality within the time limit. BendersOCT-MIPSol-LP can solve the same number of instances in 110 seconds, a speedup of roughly 3600/110 ≈ 33×. From this observation, we conclude that it is better to separate the cut-set inequalities only at integral solutions. BendersOCT-MIPSol-Alg2 solves the same number of instances in 53 seconds, which means that using the separation procedure described in Algorithm 2, instead of solving the corresponding LO, yields a ⌊110/53⌋ = 2× speedup. However, BendersOCT slightly outperforms BendersOCT-MIPSol-Alg2, as it solves more instances (1536 vs. 1522) within the time limit by adding some extra cuts at the root node of the branch-and-bound search tree.
Figure 5.4: Number of instances solved to optimality by each approach within the given time on the time axis, and number of instances with optimality gap no larger than each given value at the time limit on the optimality gap axis. (Axes: Time (100 sec) | Optimality gap (%) versus Number of Instances; approaches shown: BendersOCT, BendersOCT-MIPSol-Alg2, BendersOCT-MIPSol-LP, BendersOCT-AllSol-LP.)
The right part (gap axis) of Figure 5.4 summarizes the optimality gap of each approach at the time limit. As we can see, BendersOCT, BendersOCT-MIPSol-Alg2, and BendersOCT-MIPSol-LP perform similarly and all outperform BendersOCT-AllSol-LP.
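To make the cut-management mechanics concrete, below is a minimal gurobipy sketch of the lazy-cut pattern underlying BendersOCT and its MIPSol variants: cut-set inequalities are separated only at integral solutions, inside a MIPSOL callback. The master problem, the data a and b, and the enumeration-based subproblem are toy stand-ins of our own, not the actual BendersOCT formulation; the real separation is the min-cut procedure of Algorithm 2.

import gurobipy as gp
from gurobipy import GRB

# Toy data: the subproblem value f(y) = min_s (b[s] + sum_j a[s][j] * y[j])
# is concave in y, so eta <= b[s] + a[s].y is a valid cut for every s.
a = [[3, -1, 2], [-2, 4, 1], [1, 1, -3]]
b = [5, 4, 6]
n = 3

m = gp.Model("benders_sketch")
m.Params.LazyConstraints = 1  # must be set before using cbLazy
y = m.addVars(n, vtype=GRB.BINARY, name="y")
eta = m.addVar(ub=100.0, name="eta")  # over-estimator of f(y)
m.addConstr(gp.quicksum(y[j] for j in range(n)) <= 2)
m.setObjective(eta, GRB.MAXIMIZE)

def callback(model, where):
    # Separate cuts only at integral solutions (MIPSOL), mirroring
    # BendersOCT and the BendersOCT-MIPSol-* variants; separating at
    # fractional nodes as well would correspond to BendersOCT-AllSol-LP.
    if where == GRB.Callback.MIPSOL:
        y_val = model.cbGetSolution([y[j] for j in range(n)])
        eta_val = model.cbGetSolution(eta)
        # Toy subproblem solved by enumeration (stand-in for Algorithm 2).
        vals = [b[s] + sum(a[s][j] * y_val[j] for j in range(n))
                for s in range(len(b))]
        s_star = min(range(len(b)), key=lambda s: vals[s])
        if eta_val > vals[s_star] + 1e-6:  # cut violated: add it lazily
            model.cbLazy(eta <= b[s_star] +
                         gp.quicksum(a[s_star][j] * y[j] for j in range(n)))

m.optimize(callback)
print("optimal value:", m.ObjVal)

Restricting separation to integral solutions keeps the number of subproblem solves small, which is consistent with the speedups reported above.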
A.7.2 Worst-case Accuracy
In this section, we study a variant of FlowOCT, called FlowOCT-worst, in which we maximize the worst-case accuracy objective defined in Section 2.4.2. For this purpose, we implement formulation (2.11). We compare FlowOCT and FlowOCT-worst on three imbalanced datasets (car-evaluation, spect, and breast-cancer). Figures 5.5 and 5.6 summarize the numerical results. In Figure 5.5 (right), we show the density of the worst out-of-sample accuracy, among all class labels, for each approach. As we can see, FlowOCT-worst achieves better worst-case accuracy. However, as shown in Figure 5.5 (left), FlowOCT-worst has worse overall out-of-sample classification accuracy, so there is a trade-off between total accuracy and worst-case accuracy. On one hand, from the first part (time axis) of Figure 5.6, we observe that FlowOCT-worst can solve 11 instances to optimality within the time limit, whereas FlowOCT can solve the same number of instances in 1150 seconds, resulting in a ⌊3600/1150⌋ = 3× speedup. On the other hand, from the second part (gap axis) of Figure 5.6, we see that FlowOCT-worst tends to have a lower optimality gap than FlowOCT.
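For intuition, the worst-case accuracy objective can be linearized with an auxiliary variable; the display below is a schematic sketch in our own notation and abstracts away the tree-construction constraints of formulation (2.11). Writing z_i ∈ {0, 1} for the event that datapoint i is classified correctly and I_k for the index set of datapoints with class label k ∈ K:

\[
\max_{t,\,z}\; t
\quad \text{subject to} \quad
t \;\le\; \frac{1}{|\mathcal{I}_k|} \sum_{i \in \mathcal{I}_k} z_i
\quad \forall k \in \mathcal{K}.
\]

At an optimum, t equals the accuracy of the worst-off class, which explains the trade-off observed in Figure 5.5: raising the accuracy of a rare class may require sacrificing correctly classified points in the majority class, lowering total accuracy.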
Figure 5.5: The left (resp. right) panel depicts the density of out-of-sample accuracy (resp. worst out-of-sample accuracy among all class labels) for each approach. (Approaches shown: FlowOCT, FlowOCT-worst.)
Figure 5.6: Number of instances solved to optimality by each approach within the given time on the time axis, and number of instances with optimality gap no larger than each given value at the time limit on the optimality gap axis. (Approaches shown: FlowOCT, FlowOCT-worst.)
A.7.3 LO Relaxation
In this section, we compare the strength, i.e., the optimal value of the LO relaxation, of the various formulations. For this purpose, for all 2400 instances involving the categorical datasets, we solve the LO relaxation of FlowOCT, OCT, and BinOCT. For all approaches, the objective value reflects the number of misclassified datapoints, so a trivial value for the objective of the relaxed problem is zero, i.e., correctly classifying all datapoints. Neither OCT nor BinOCT returns a non-trivial objective value. In contrast, in 44% of the instances, FlowOCT returns a non-trivial objective value, i.e., an objective value less than or equal to −1, and yields a root improvement by a factor of 8, where the root improvement is defined as the ratio of the MIO objective value to the LO relaxation objective value. Tables 5.14 and 5.15 report the detailed results.
Furthermore, we examine the number of branch-and-bound nodes explored by each approach. For this purpose, we only consider the 115 (out of 240) instances that all approaches solve to optimality. Table 5.13 summarizes the distribution of the number of explored branch-and-bound nodes for each approach. As shown in Table 5.13, FlowOCT and BendersOCT find the optimal solution by exploring far fewer branch-and-bound nodes than OCT and BinOCT. On average, FlowOCT (resp. BendersOCT) explores 60% and 98% (resp. 43% and 97%) fewer nodes than OCT and BinOCT, respectively.
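For concreteness, the snippet below is a minimal gurobipy sketch of how these quantities can be computed: it solves the LO relaxation obtained by dropping integrality (via Model.relax()), then the MIO itself, and reports the root improvement as the ratio of the two objective values. The toy model at the bottom is a hypothetical stand-in, not an actual FlowOCT instance.

import gurobipy as gp
from gurobipy import GRB

def root_improvement(model):
    # LO relaxation: same constraints, integrality dropped.
    relaxed = model.relax()
    relaxed.optimize()
    lp_obj = relaxed.ObjVal
    # The MIO itself.
    model.optimize()
    mip_obj = model.ObjVal
    # Root improvement = MIO objective / LO relaxation objective;
    # undefined when the relaxation bound is trivial (zero).
    return mip_obj / lp_obj if abs(lp_obj) > 1e-9 else float("inf")

# Hypothetical toy MIP, used only to exercise the helper.
m = gp.Model("toy")
x = m.addVars(3, vtype=GRB.BINARY, name="x")
m.addConstr(2 * x[0] + 3 * x[1] + 4 * x[2] <= 5)
m.setObjective(-(x[0] + x[1] + x[2]), GRB.MINIMIZE)
print("root improvement:", root_improvement(m))

For this toy instance the relaxation happens to be tight (ratio 1); a ratio above 1, as in Tables 5.14 and 5.15, indicates how much branch-and-bound must improve on the root bound.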
Table 5.9: In-sample results including the average and standard deviation of optimality gap
and solving time across 45 instances (5 samples and 9 values of λ) for the case of λ > 0 on
mixed-feature datasets (part 1). The best performance achieved in a given dataset and depth is
reported in bold.
Dataset Depth BendersOCT-5 BendersOCT-10
Gap Time Gap Time
echocardiogram 2 0.00±0.00 1± 1 0.00±0.00 1± 1
echocardiogram 3 0.00±0.00 2± 2 0.00±0.00 1± 1
echocardiogram 4 0.00±0.00 3± 2 0.00±0.00 4± 4
echocardiogram 5 0.00±0.00 7± 5 0.00±0.00 5± 5
hepatitis 2 0.00±0.00 6± 10 0.00±0.00 14± 21
hepatitis 3 0.00±0.03 188± 623 0.00±0.01 244± 697
hepatitis 4 0.00±0.00 223± 597 0.00±0.00 150± 364
hepatitis 5 0.00±0.02 238± 701 0.00±0.00 148± 271
fertility 2 0.00±0.00 4± 2 0.00±0.00 4± 3
fertility 3 0.04±0.14 1179±1416 0.04±0.13 1168±1385
fertility 4 0.14±0.19 1756±1741 0.15±0.20 1768±1737
fertility 5 0.17±0.22 1851±1755 0.18±0.22 1859±1744
iris 2 0.00±0.00 2± 0 0.00±0.00 3± 1
iris 3 0.00±0.00 16± 22 0.00±0.00 107± 222
iris 4 0.02±0.08 407± 960 0.04±0.12 705±1321
iris 5 0.07±0.16 787±1435 0.07±0.14 825±1456
wine 2 0.00±0.00 12± 4 0.00±0.00 49± 23
wine 3 0.01±0.04 565± 797 0.08±0.17 997±1433
wine 4 0.05±0.13 1075±1189 0.03±0.13 1223±1414
wine 5 0.05±0.12 1401±1381 0.05±0.10 1176±1394
planning-relax 2 0.00±0.00 80± 43 0.00±0.00 1426± 850
planning-relax 3 0.60±0.36 3003±1301 0.67±0.33 3155±1158
planning-relax 4 0.62±0.34 3063±1267 0.67±0.32 3189±1136
planning-relax 5 0.62±0.33 3066±1258 0.68±0.32 3197±1103
breast-cancer-prognostic 2 0.00±0.00 1188± 717 0.53±0.35 3112±1147
breast-cancer-prognostic 3 0.62±0.32 3136±1196 0.69±0.30 3220±1088
breast-cancer-prognostic 4 0.61±0.31 3167±1126 0.68±0.29 3242±1024
breast-cancer-prognostic 5 0.63±0.31 3186±1074 0.69±0.30 3284± 912
parkinsons 2 0.00±0.00 51± 26 0.00±0.00 442± 295
parkinsons 3 0.34±0.29 2738±1472 0.46±0.28 3079±1216
parkinsons 4 0.38±0.26 2837±1416 0.52±0.27 3076±1239
parkinsons 5 0.38±0.25 2876±1360 0.52±0.26 3116±1146
connectionist-bench-sonar 2 0.62±0.30 3391± 626 0.79±0.22 3600± 0
connectionist-bench-sonar 3 0.75±0.25 3600± 0 0.80±0.17 3600± 0
connectionist-bench-sonar 4 0.75±0.22 3600± 0 0.80±0.17 3600± 0
connectionist-bench-sonar 5 0.75±0.24 3600± 0 0.81±0.17 3600± 0
seeds 2 0.00±0.00 5± 1 0.00±0.00 12± 4
seeds 3 0.01±0.05 1010±1271 0.15±0.21 2109±1589
seeds 4 0.35±0.30 2610±1533 0.37±0.27 2838±1445
seeds 5 0.41±0.31 2731±1474 0.38±0.28 2823±1454
cylinder-bands 2 0.75±0.23 3548± 352 0.86±0.15 3604± 27
cylinder-bands 3 0.87±0.14 3600± 0 0.89±0.11 3600± 0
cylinder-bands 4 0.86±0.14 3600± 0 0.87±0.14 3600± 1
cylinder-bands 5 0.86±0.15 3600± 0 0.88±0.12 3670± 445
heart-cleveland 2 0.00±0.00 109± 33 0.00±0.00 437± 141
heart-cleveland 3 0.77±0.24 3600± 0 0.82±0.20 3600± 0
heart-cleveland 4 0.84±0.19 3600± 0 0.85±0.18 3600± 0
heart-cleveland 5 0.85±0.18 3600± 0 0.86±0.17 3600± 0
ionosphere 2 0.00±0.00 488± 305 0.32±0.37 3031± 821
ionosphere 3 0.69±0.30 3276± 931 0.73±0.24 3600± 0
ionosphere 4 0.71±0.29 3354± 742 0.73±0.22 3600± 0
ionosphere 5 0.71±0.29 3430± 607 0.73±0.22 3600± 0
thoracic-surgery 2 0.00±0.00 69± 40 0.00±0.00 167± 78
thoracic-surgery 3 0.70±0.31 3213±1107 0.74±0.30 3221±1083
thoracic-surgery 4 0.74±0.30 3218±1092 0.75±0.30 3243±1032
thoracic-surgery 5 0.75±0.30 3227±1070 0.78±0.30 3242±1030
Table 5.10: In-sample results including the average and standard deviation of optimality gap
and solving time across 45 instances (5 samples and 9 values of λ) for the case of λ > 0 on
mixed-feature datasets (part 2). The best performance achieved in a given dataset and depth is
reported in bold.
Dataset Depth BendersOCT-5 BendersOCT-10
Gap Time Gap Time
climate 2 0.00±0.00 563± 323 0.63±0.33 3163±1083
climate 3 0.66±0.33 3162±1145 0.72±0.30 3225±1074
climate 4 0.69±0.32 3236±1044 0.71±0.30 3258± 988
climate 5 0.69±0.31 3208±1120 0.72±0.30 3226±1072
breast-cancer-diagnostic 2 0.00±0.00 405± 208 0.09±0.22 2602± 927
breast-cancer-diagnostic 3 0.67±0.30 3293± 884 0.70±0.26 3530± 470
breast-cancer-diagnostic 4 0.70±0.28 3310± 841 0.73±0.26 3526± 496
breast-cancer-diagnostic 5 0.71±0.28 3379± 702 0.72±0.27 3528± 485
indian-liver-patient 2 0.00±0.00 111± 32 0.00±0.00 1940± 635
indian-liver-patient 3 0.82±0.19 3600± 0 0.89±0.13 3600± 0
indian-liver-patient 4 0.86±0.17 3600± 0 0.91±0.11 3600± 0
indian-liver-patient 5 0.88±0.15 3600± 0 0.91±0.11 3600± 0
credit-approval 2 0.00±0.00 296± 101 0.00±0.00 1298± 587
credit-approval 3 0.83±0.21 3600± 0 0.86±0.17 3600± 0
credit-approval 4 0.86±0.16 3600± 0 0.87±0.15 3600± 0
credit-approval 5 0.87±0.15 3600± 0 0.88±0.13 3600± 0
blood-transfusion 2 0.00±0.00 22± 5 0.00±0.00 96± 44
blood-transfusion 3 0.52±0.22 3342± 743 0.67±0.16 3600± 0
blood-transfusion 4 0.75±0.18 3600± 0 0.78±0.16 3600± 0
blood-transfusion 5 0.80±0.17 3600± 0 0.82±0.15 3600± 0
diabetes 2 0.00±0.00 123± 56 0.00±0.00 1305± 310
diabetes 3 0.86±0.14 3600± 0 0.91±0.10 3600± 0
diabetes 4 0.90±0.12 3600± 0 0.92±0.09 3600± 0
diabetes 5 0.91±0.11 3600± 0 0.93±0.09 3600± 0
qsar-biodegradation 2 0.17±0.21 3345± 392 0.92±0.08 3600± 0
qsar-biodegradation 3 0.93±0.07 3600± 0 0.95±0.06 3600± 0
qsar-biodegradation 4 0.94±0.07 3600± 0 0.95±0.06 3600± 0
qsar-biodegradation 5 0.95±0.06 3600± 0 0.95±0.06 3600± 0
banknote-authentication 2 0.00±0.00 21± 4 0.00±0.00 163± 55
banknote-authentication 3 0.11±0.23 2430±1020 0.71±0.23 3600± 0
banknote-authentication 4 0.75±0.20 3600± 0 0.77±0.16 3600± 0
banknote-authentication 5 0.76±0.18 3600± 0 0.80±0.15 3600± 0
ozone-level-detection-one 2 0.78±0.28 3372± 752 0.83±0.25 3553± 317
ozone-level-detection-one 3 0.82±0.25 3527± 493 0.86±0.19 3583± 115
ozone-level-detection-one 4 0.85±0.21 3568± 218 0.85±0.21 3569± 210
ozone-level-detection-one 5 0.84±0.23 3538± 417 0.87±0.16 3611± 37
image-segmentation 2 0.20±0.16 3543± 171 0.52±0.17 3572± 187
image-segmentation 3 0.97±0.03 3600± 0 0.99±0.01 3600± 0
image-segmentation 4 0.98±0.02 3600± 0 0.99±0.01 3600± 0
image-segmentation 5 0.98±0.02 3600± 0 0.99±0.02 3600± 0
seismic-bumps 2 0.00±0.00 1040± 414 0.16±0.17 3209± 700
seismic-bumps 3 0.88±0.14 3600± 0 0.90±0.12 3600± 0
seismic-bumps 4 0.90±0.13 3600± 0 0.91±0.11 3600± 0
seismic-bumps 5 0.91±0.12 3600± 0 0.92±0.10 3600± 0
thyroid-disease-ann-thyroid 2 0.00±0.00 320± 98 0.00±0.00 617± 322
thyroid-disease-ann-thyroid 3 0.88±0.13 3600± 0 0.85±0.16 3600± 0
thyroid-disease-ann-thyroid 4 0.93±0.08 3600± 0 0.87±0.13 3600± 0
thyroid-disease-ann-thyroid 5 0.94±0.07 3600± 0 0.89±0.11 3600± 0
spambase 2 0.23±0.25 3177± 563 0.89±0.04 3600± 0
spambase 3 0.96±0.03 3600± 0 0.98±0.02 3600± 0
spambase 4 0.96±0.03 3600± 0 0.98±0.01 3600± 0
spambase 5 0.97±0.02 3600± 1 0.98±0.02 3601± 1
wall-following-robot-2 2 0.95±0.05 3600± 0 0.99±0.01 3600± 0
wall-following-robot-2 3 0.99±0.01 3600± 0 1.00±0.00 3601± 2
wall-following-robot-2 4 0.99±0.01 3600± 1 1.00±0.00 3601± 1
wall-following-robot-2 5 1.00±0.01 3600± 1 1.00±0.00 3600± 1
Table 5.11: Average out-of-sample accuracy and standard deviation of accuracy across 5
samples on mixed-feature datasets (part 1). The highest accuracy achieved in a given dataset
and depth is reported in bold.
dataset depth OCT BendersOCT-5 BendersOCT-10
echocardiogram 2 0.92±0.07 0.95±0.08 0.91±0.06
echocardiogram 3 0.92±0.07 0.95±0.08 0.91±0.06
echocardiogram 4 0.92±0.07 0.95±0.08 0.94±0.00
echocardiogram 5 0.96±0.03 0.95±0.08 0.91±0.07
hepatitis 2 0.75±0.06 0.77±0.10 0.76±0.10
hepatitis 3 0.75±0.09 0.75±0.10 0.77±0.09
hepatitis 4 0.75±0.06 0.77±0.08 0.81±0.11
hepatitis 5 0.72±0.10 0.76±0.11 0.78±0.12
fertility 2 0.90±0.06 0.90±0.06 0.90±0.06
fertility 3 0.90±0.06 0.90±0.06 0.90±0.06
fertility 4 0.90±0.06 0.90±0.06 0.90±0.06
fertility 5 0.90±0.06 0.90±0.06 0.90±0.06
iris 2 0.94±0.03 0.91±0.05 0.90±0.05
iris 3 0.94±0.03 0.89±0.05 0.88±0.05
iris 4 0.95±0.03 0.88±0.04 0.91±0.05
iris 5 0.94±0.03 0.88±0.04 0.91±0.05
wine 2 0.94±0.03 0.90±0.07 0.86±0.05
wine 3 0.95±0.04 0.93±0.02 0.91±0.07
wine 4 0.90±0.07 0.91±0.07 0.92±0.05
wine 5 0.92±0.07 0.91±0.08 0.90±0.06
planning-relax 2 0.64±0.07 0.67±0.07 0.64±0.08
planning-relax 3 0.43±0.17 0.67±0.07 0.64±0.08
planning-relax 4 0.63±0.13 0.63±0.13 0.64±0.07
planning-relax 5 0.54±0.19 0.67±0.07 0.62±0.10
breast-cancer-prognostic 2 0.73±0.05 0.75±0.07 0.75±0.07
breast-cancer-prognostic 3 0.73±0.05 0.72±0.04 0.74±0.06
breast-cancer-prognostic 4 0.74±0.06 0.73±0.07 0.75±0.06
breast-cancer-prognostic 5 0.70±0.06 0.76±0.06 0.72±0.06
parkinsons 2 0.82±0.05 0.87±0.02 0.83±0.03
parkinsons 3 0.84±0.08 0.84±0.02 0.84±0.05
parkinsons 4 0.87±0.04 0.84±0.02 0.89±0.05
parkinsons 5 0.79±0.04 0.87±0.03 0.86±0.04
connectionist-bench-sonar 2 0.75±0.10 0.76±0.06 0.70±0.06
connectionist-bench-sonar 3 0.67±0.07 0.67±0.05 0.72±0.06
connectionist-bench-sonar 4 0.76±0.03 0.74±0.10 0.75±0.06
connectionist-bench-sonar 5 0.74±0.02 0.72±0.06 0.73±0.06
seeds 2 0.88±0.02 0.89±0.04 0.88±0.03
seeds 3 0.88±0.02 0.89±0.04 0.88±0.03
seeds 4 0.89±0.03 0.89±0.04 0.89±0.03
seeds 5 0.91±0.04 0.88±0.02 0.90±0.02
cylinder-bands 2 0.65±0.06 0.72±0.04 0.64±0.06
cylinder-bands 3 0.68±0.04 0.67±0.04 0.69±0.03
cylinder-bands 4 0.65±0.03 0.74±0.05 0.70±0.03
cylinder-bands 5 0.61±0.14 0.68±0.08 0.70±0.04
heart-cleveland 2 0.53±0.03 0.54±0.01 0.54±0.01
heart-cleveland 3 0.55±0.04 0.52±0.04 0.52±0.03
heart-cleveland 4 0.56±0.04 0.59±0.04 0.55±0.07
heart-cleveland 5 0.53±0.02 0.55±0.04 0.53±0.05
ionosphere 2 0.88±0.09 0.88±0.06 0.87±0.03
ionosphere 3 0.84±0.06 0.90±0.02 0.90±0.03
ionosphere 4 0.86±0.11 0.90±0.02 0.90±0.02
ionosphere 5 0.70±0.25 0.89±0.03 0.87±0.05
thoracic-surgery 2 0.84±0.02 0.84±0.02 0.84±0.02
thoracic-surgery 3 0.84±0.02 0.84±0.02 0.84±0.02
thoracic-surgery 4 0.84±0.02 0.84±0.02 0.84±0.02
thoracic-surgery 5 0.84±0.02 0.84±0.03 0.84±0.02
Table 5.12: Average out-of-sample accuracy and standard deviation of accuracy across 5
samples on mixed-feature datasets (part 2). The highest accuracy achieved in a given dataset
and depth is reported in bold.
dataset depth OCT BendersOCT-5 BendersOCT-10
climate 2 0.71±0.36 0.92±0.03 0.92±0.03
climate 3 0.57±0.46 0.93±0.02 0.92±0.02
climate 4 0.72±0.38 0.92±0.02 0.91±0.02
climate 5 0.77±0.36 0.92±0.02 0.93±0.03
breast-cancer-diagnostic 2 0.93±0.03 0.93±0.03 0.95±0.02
breast-cancer-diagnostic 3 0.95±0.03 0.93±0.02 0.94±0.02
breast-cancer-diagnostic 4 0.93±0.01 0.93±0.03 0.94±0.02
breast-cancer-diagnostic 5 0.91±0.03 0.93±0.02 0.95±0.02
indian-liver-patient 2 0.73±0.02 0.73±0.03 0.73±0.03
indian-liver-patient 3 0.72±0.03 0.72±0.05 0.71±0.03
indian-liver-patient 4 0.72±0.05 0.69±0.05 0.71±0.04
indian-liver-patient 5 0.73±0.02 0.71±0.04 0.70±0.05
credit-approval 2 0.86±0.03 0.86±0.03 0.86±0.03
credit-approval 3 0.81±0.12 0.86±0.04 0.86±0.03
credit-approval 4 0.79±0.09 0.86±0.03 0.86±0.03
credit-approval 5 0.69±0.18 0.86±0.03 0.86±0.03
blood-transfusion 2 0.77±0.01 0.76±0.03 0.77±0.01
blood-transfusion 3 0.79±0.02 0.77±0.02 0.78±0.01
blood-transfusion 4 0.77±0.01 0.78±0.02 0.78±0.02
blood-transfusion 5 0.78±0.02 0.78±0.02 0.76±0.03
diabetes 2 0.75±0.03 0.74±0.02 0.74±0.02
diabetes 3 0.76±0.02 0.73±0.02 0.75±0.02
diabetes 4 0.75±0.02 0.74±0.03 0.73±0.02
diabetes 5 0.75±0.03 0.74±0.03 0.73±0.02
qsar-biodegradation 2 0.76±0.02 0.78±0.02 0.78±0.02
qsar-biodegradation 3 0.77±0.04 0.81±0.02 0.77±0.03
qsar-biodegradation 4 0.74±0.04 0.76±0.02 0.77±0.02
qsar-biodegradation 5 0.72±0.07 0.76±0.03 0.75±0.05
banknote-authentication 2 0.50±0.07 0.89±0.02 0.91±0.01
banknote-authentication 3 0.50±0.07 0.93±0.00 0.97±0.01
banknote-authentication 4 0.54±0.05 0.97±0.01 0.97±0.01
banknote-authentication 5 0.48±0.07 0.97±0.01 0.97±0.01
ozone-level-detection-one 2 0.97±0.00 0.97±0.00 0.97±0.00
ozone-level-detection-one 3 0.97±0.00 0.97±0.00 0.97±0.00
ozone-level-detection-one 4 0.97±0.00 0.97±0.00 0.97±0.00
ozone-level-detection-one 5 0.97±0.00 0.97±0.01 0.97±0.00
image-segmentation 2 0.14±0.00 0.55±0.02 0.43±0.06
image-segmentation 3 0.14±0.01 0.52±0.03 0.51±0.03
image-segmentation 4 0.14±0.01 0.64±0.10 0.49±0.09
image-segmentation 5 0.14±0.01 0.65±0.13 0.66±0.08
seismic-bumps 2 0.76±0.39 0.93±0.01 0.93±0.01
seismic-bumps 3 0.81±0.26 0.93±0.01 0.93±0.01
seismic-bumps 4 0.93±0.01 0.93±0.01 0.93±0.01
seismic-bumps 5 0.77±0.37 0.93±0.01 0.93±0.01
thyroid-disease-ann-thyroid 2 0.96±0.01 0.93±0.00 0.96±0.00
thyroid-disease-ann-thyroid 3 0.96±0.02 0.94±0.00 0.97±0.01
thyroid-disease-ann-thyroid 4 0.96±0.02 0.94±0.01 0.98±0.01
thyroid-disease-ann-thyroid 5 0.96±0.01 0.94±0.01 0.97±0.01
spambase 2 0.47±0.11 0.85±0.01 0.81±0.02
spambase 3 0.47±0.11 0.83±0.03 0.82±0.02
spambase 4 0.51±0.11 0.82±0.02 0.81±0.03
spambase 5 0.60±0.01 0.80±0.06 0.75±0.08
wall-following-robot-2 2 0.55±0.07 0.64±0.06 0.58±0.12
wall-following-robot-2 3 0.61±0.06 0.61±0.11 0.59±0.10
wall-following-robot-2 4 0.63±0.05 0.57±0.09 0.54±0.10
wall-following-robot-2 5 0.51±0.08 0.54±0.05 0.49±0.09
Table 5.13: Number of branch-and-bound nodes explored by each approach during the solving
process of 115 instances (out of 240 instances comprising 12 categorical datasets, 5 samples,
and 4 different depths), which were successfully solved to optimality by all approaches.
Approach Min 1st Quartile Median Mean 3rd Quartile Max
FlowOCT 0.0 129.5 681.0 4480.0 4556.0 79175.0
BendersOCT 1.0 251.0 1029.0 6566.3 4060.0 180663.0
OCT 1.0 706.0 3234.0 11449.0 9785.0 270265.0
BinOCT 60.0 1508.0 6808.0 243229.0 45659.0 10755902.0
Table 5.14: In-sample results of the LO relaxation including the average and standard deviation
of objective value, root improvement and solving time across 45 instances (5 samples and 9
values of λ) for the case of λ > 0 on categorical datasets. The best performance achieved in a
given dataset and depth is reported in bold.
Dataset Depth OCT FlowOCT
Obj Value Root Improvement Time Obj Value Root Improvement Time
soybean-small 2 0.00±0.00 0.00± 0.00 0± 0 -1.32±0.62 0.67± 0.48 0± 0
soybean-small 3 0.00±0.00 0.00± 0.00 0± 0 -1.32±0.62 0.67± 0.48 0± 0
soybean-small 4 0.00±0.00 0.00± 0.00 0± 0 -1.32±0.62 0.67± 0.48 0± 0
soybean-small 5 0.00±0.00 0.00± 0.00 0± 0 -1.32±0.62 0.67± 0.48 2± 1
monk3 2 0.00±0.00 0.00± 0.00 0± 0 -0.75±0.39 0.65± 0.95 0± 0
monk3 3 0.00±0.00 0.00± 0.00 0± 0 -0.75±0.39 0.65± 0.95 0± 0
monk3 4 0.00±0.00 0.00± 0.00 0± 0 -0.75±0.39 0.65± 0.95 0± 0
monk3 5 0.00±0.00 0.00± 0.00 0± 0 -0.75±0.39 0.65± 0.95 0± 0
monk1 2 0.00±0.00 0.00± 0.00 0± 0 -0.75±0.39 1.10± 1.70 0± 0
monk1 3 0.00±0.00 0.00± 0.00 0± 0 -0.75±0.39 1.03± 1.56 0± 0
monk1 4 0.00±0.00 0.00± 0.00 0± 0 -0.75±0.39 1.01± 1.52 0± 0
monk1 5 0.00±0.00 0.00± 0.00 0± 0 -0.75±0.39 1.01± 1.52 0± 0
hayes-roth 2 0.00±0.00 0.00± 0.00 0± 0 -1.00±0.52 3.71± 4.34 0± 0
hayes-roth 3 0.00±0.00 0.00± 0.00 0± 0 -1.00±0.52 3.18± 3.50 0± 0
hayes-roth 4 0.00±0.00 0.00± 0.00 0± 0 -1.00±0.52 2.96± 3.17 0± 0
hayes-roth 5 0.00±0.00 0.00± 0.00 0± 0 -1.00±0.52 2.84± 2.98 0± 0
monk2 2 0.00±0.00 0.00± 0.00 0± 0 -0.75±0.39 1.83± 3.02 0± 0
monk2 3 0.00±0.00 0.00± 0.00 0± 0 -0.75±0.39 1.78± 2.90 0± 0
monk2 4 0.00±0.00 0.00± 0.00 0± 0 -0.75±0.39 1.78± 2.90 0± 0
monk2 5 0.00±0.00 0.00± 0.00 0± 0 -0.75±0.39 1.79± 2.92 0± 0
house-votes-84 2 0.00±0.00 0.00± 0.00 0± 0 -0.81±0.39 0.58± 0.73 0± 0
house-votes-84 3 0.00±0.00 0.00± 0.00 0± 0 -0.81±0.39 0.58± 0.73 0± 0
house-votes-84 4 0.00±0.00 0.00± 0.00 0± 0 -0.81±0.39 0.58± 0.73 0± 0
house-votes-84 5 0.00±0.00 0.00± 0.00 0± 0 -0.81±0.39 0.58± 0.73 0± 0
spect 2 0.00±0.00 0.00± 0.00 0± 0 -6.23±2.24 1.96± 0.51 0± 0
spect 3 0.00±0.00 0.00± 0.00 0± 0 -4.98±1.35 2.08± 0.64 0± 0
spect 4 0.00±0.00 0.00± 0.00 0± 0 -4.41±0.98 1.91± 0.51 1± 0
spect 5 0.00±0.00 0.00± 0.00 0± 0 -4.02±0.90 1.89± 0.58 1± 0
breast-cancer 2 0.00±0.00 0.00± 0.00 0± 0 -0.65±0.31 0.49± 1.66 0± 0
breast-cancer 3 0.00±0.00 0.00± 0.00 0± 0 -0.65±0.31 0.64± 1.89 0± 0
breast-cancer 4 0.00±0.00 0.00± 0.00 0± 0 -0.65±0.31 0.49± 1.66 2± 1
breast-cancer 5 0.00±0.00 0.00± 0.00 0± 0 -0.65±0.31 0.49± 1.66 3± 1
balance-scale 2 0.00±0.00 0.00± 0.00 0± 0 -19.80±9.32 2.37± 0.18 0± 0
balance-scale 3 0.00±0.00 0.00± 0.00 0± 0 -3.92±0.54 9.66± 3.54 2± 0
balance-scale 4 0.00±0.00 0.00± 0.00 0± 0 -1.67±0.87 13.38±14.04 17± 5
balance-scale 5 0.00±0.00 0.00± 0.00 1± 0 -1.67±0.87 10.67±11.32 21± 15
tic-tac-toe 2 0.00±0.00 0.00± 0.00 0± 0 -0.71±0.37 4.01± 8.19 1± 0
tic-tac-toe 3 0.00±0.00 0.00± 0.00 0± 0 -0.71±0.37 3.65± 7.36 5± 1
tic-tac-toe 4 0.00±0.00 0.00± 0.00 1± 0 -0.71±0.37 3.62± 7.23 18± 13
tic-tac-toe 5 0.00±0.00 0.00± 0.00 2± 0 -0.71±0.37 3.68± 7.39 14± 13
car-evaluation 2 0.00±0.00 0.00± 0.00 0± 0 -9.09±3.26 9.68± 2.30 2± 0
car-evaluation 3 0.00±0.00 0.00± 0.00 0± 0 -1.50±0.78 24.05±26.93 9± 2
car-evaluation 4 0.00±0.00 0.00± 0.00 1± 0 -1.50±0.78 24.45±27.69 104± 27
car-evaluation 5 0.00±0.00 0.00± 0.00 3± 0 -1.50±0.78 27.34±31.74 1219± 230
kr-vs-kp 2 0.00±0.00 0.00± 0.00 1± 0 -0.51±0.26 0.00± 0.00 18± 5
kr-vs-kp 3 0.00±0.00 0.00± 0.00 2± 0 -0.51±0.26 0.00± 0.00 186± 97
kr-vs-kp 4 0.00±0.00 0.00± 0.00 6± 2 -0.51±0.26 0.00± 0.00 1520±1076
kr-vs-kp 5 0.00±0.00 0.00± 0.00 16± 2 -Inf 0.00± 0.00 1557±1516
Table 5.15: In-sample results of the LO relaxation including the average and standard deviation of objective value, root improvement
and solving time across 5 samples for the case of λ = 0 on categorical datasets. The best performance achieved in a given dataset and
depth is reported in bold.
Dataset Depth OCT BinOCT FlowOCT
Obj Value Root Improvement Time Obj Value Root Improvement Time Obj Value Root Improvement Time
soybean-small 2 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
soybean-small 3 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
soybean-small 4 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
soybean-small 5 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
monk3 2 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
monk3 3 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
monk3 4 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
monk3 5 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
monk1 2 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
monk1 3 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
monk1 4 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
monk1 5 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
hayes-roth 2 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
hayes-roth 3 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
hayes-roth 4 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
hayes-roth 5 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
monk2 2 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
monk2 3 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
monk2 4 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
monk2 5 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
house-votes-84 2 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
house-votes-84 3 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
house-votes-84 4 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 1± 0
house-votes-84 5 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 1± 0
spect 2 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 -10.26±1.04 2.51±0.28 0± 0
spect 3 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 -7.00±1.20 2.72±0.55 0± 0
spect 4 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 -5.17±1.26 2.41±0.66 1± 0
spect 5 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 -3.76±1.40 2.22±1.07 3± 1
breast-cancer 2 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0
breast-cancer 3 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 1± 0
breast-cancer 4 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 4± 1
breast-cancer 5 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 4± 4
balance-scale 2 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 -37.60±1.39 2.48±0.15 0± 0
balance-scale 3 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 -4.84±0.50 15.20±0.97 2± 1
balance-scale 4 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 41± 4
balance-scale 5 0.00±0.00 0.00±0.00 1± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 20± 3
tic-tac-toe 2 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 1± 0
tic-tac-toe 3 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 10± 2
tic-tac-toe 4 0.00±0.00 0.00±0.00 1± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 29± 42
tic-tac-toe 5 0.00±0.00 0.00±0.00 2± 0 0.00±0.00 0.00±0.00 1± 0 0.00±0.00 0.00±0.00 20± 12
car-evaluation 2 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 -15.20±1.15 12.34±0.71 2± 0
car-evaluation 3 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 5± 1
car-evaluation 4 0.00±0.00 0.00±0.00 1± 0 0.00±0.00 0.00±0.00 1± 0 0.00±0.00 0.00±0.00 16± 5
car-evaluation 5 0.00±0.00 0.00±0.00 2± 0 0.00±0.00 0.00±0.00 1± 0 0.00±0.00 0.00±0.00 233± 71
kr-vs-kp 2 0.00±0.00 0.00±0.00 1± 0 0.00±0.00 0.00±0.00 0± 0 0.00±0.00 0.00±0.00 4± 1
kr-vs-kp 3 0.00±0.00 0.00±0.00 2± 0 0.00±0.00 0.00±0.00 1± 0 0.00±0.00 0.00±0.00 188±148
kr-vs-kp 4 0.00±0.00 0.00±0.00 6± 0 0.00±0.00 0.00±0.00 2± 1 0.00±0.00 0.00±0.00 115±113
kr-vs-kp 5 0.00±0.00 0.00±0.00 16± 2 0.00±0.00 0.00±0.00 6± 2 0.00±0.00 0.00±0.00 210±133
Abstract
Data-driven approaches are increasingly being used to support decision-making in high-stakes domains, e.g., to predict the vulnerability of homeless individuals in order to prioritize them for housing, or to identify those at risk of suicide. The deployment of data-driven predictive or prescriptive tools in high-stakes domains, where people’s lives and livelihoods are at stake, creates an urgent need for approaches that are fair, interpretable, and optimal. Crafting predictive and prescriptive models possessing these vital attributes, derived from data that might be biased or observational, requires grappling with constrained optimization problems that are inherently combinatorial and often hard to solve. To navigate these challenges, I integrate techniques from integer optimization with machine learning, statistics, and causal inference. Subsequently, I develop effective solution methodologies to address these complex problems.