Mixed-integer nonlinear programming with binary variables

by

Shaoning Han

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(INDUSTRIAL AND SYSTEMS ENGINEERING)

August 2022

Copyright 2022 Shaoning Han

Dedication

I dedicate this thesis to my parents.

Acknowledgements

First and foremost, I wish to express my sincere gratitude to my advisor, Dr. Andres Gomez, for providing all the support I needed over the last five years. Dr. Gomez helped me build my research from the ground up, and throughout this process he was always reachable to offer professional guidance. Moreover, he has been open-minded and has given me a great deal of freedom to pursue my own academic career. I am also grateful to him for giving me various opportunities to attend conferences and collaborate with other people. His diligence, integrity and working style will have a lifelong impact on me.

I would like to thank Dr. Jong-Shi Pang and Dr. Alper Atamturk. I was fortunate to meet these two experienced and well-respected researchers at an early stage of my academic career, and I benefited greatly from collaborating with them. I would also like to thank Dr. Meisam Razaviyayn and Dr. Bistra Dilkina for being part of my dissertation committee and taking the time to examine me in the qualifying exam. Many thanks to my collaborators, including Dr. Oleg Prokopyev, Dr. Ying Cui, Dr. Leonardo Lozano and my colleague Ziyu He; I learned a lot from them. I wish to give special thanks to Dr. Bo Zeng, who brought me to the US.

Finally, I would like to thank my family and my friends for their tremendous moral support and practical help. In particular, I would like to thank Yuanyuan Lei, Shaomian Qi, Xueyu Shi, Chaosheng Dong, Yang Wang, Shanmeng Peng and Xiaonan Zhu. I could not have completed this dissertation without the trust and encouragement of my parents. I cannot thank them enough.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction and general theory
   1.1 Mixed-integer optimization
   1.2 Preliminaries of convex analysis
   1.3 Conic programming
   1.4 Disjunctive programming
   1.5 Submodular optimization
Chapter 2: Sparse and smooth signal estimation: convexification of $\ell_0$ formulations
   2.1 Introduction
   2.2 Background
      2.2.1 L1-norm approximations
      2.2.2 Mixed-integer optimization
      2.2.3 The perspective reformulation
      2.2.4 Strong formulations for pairwise quadratic terms
   2.3 Strong convex formulations for signal estimation
      2.3.1 Convexification of the parametric pairwise terms
      2.3.2 Convex relaxations for general M-matrices
   2.4 SOCP representation and Lagrangian decomposition
      2.4.1 Extended formulations
      2.4.2 Implementation via conic quadratic optimization
         2.4.2.1 Conic quadratic reformulation of functions f and g
         2.4.2.2 Improved cutting surface method
      2.4.3 Lagrangian methods for estimation with regularized objective
   2.5 Computations
      2.5.1 Relaxation quality
         2.5.1.1 Data
         2.5.1.2 Methods
         2.5.1.3 Results
      2.5.2 Statistical performance – modeling with priors
         2.5.2.1 Data
         2.5.2.2 Methods
         2.5.2.3 Computational setting
         2.5.2.4 Results with respect to the error criterion
         2.5.2.5 Results with respect to the sparsity pattern criterion
      2.5.3 Computational experiments – Lagrangian methods
   2.6 Conclusions
Chapter 3: Convexification for convex quadratic optimization with indicator variables
   3.1 Introduction
   3.2 Optimal perspective formulation vs. Shor's SDP
   3.3 Convex hull results for X
      3.3.1 Optimization
      3.3.2 Rank-one results
      3.3.3 Full-rank results
   3.4 Convex hull description of Z+
      3.4.1 Conic quadratic-representable extended formulation
      3.4.2 Description in the original space of variables x, y, t
      3.4.3 Rank-one approximations of Z+
   3.5 An SDP relaxation for (QI)
   3.6 Comparison of convex relaxations
   3.7 Computations
      3.7.1 Synthetic instances – the diagonally dominant case
         3.7.1.1 Instance generation
         3.7.1.2 Formulations
         3.7.1.3 Results
      3.7.2 Real instances – the general case
         3.7.2.1 Instance generation
         3.7.2.2 Convex relaxations
         3.7.2.3 Exact mixed-integer optimization approaches
         3.7.2.4 Results
   3.8 Conclusions
Chapter 4: Compact extended formulations for low-rank functions with indicator variables
   4.1 Introduction
   4.2 A convex analysis perspective on convexification
      4.2.1 Notations and preliminaries
      4.2.2 Convex hull characterization
   4.3 Rank-one convexification
      4.3.1 Extended formulation of clconv(Q)
      4.3.2 Explicit form of clconv(Q) with unconstrained continuous variables
      4.3.3 Explicit form of clconv(Q) with nonnegative continuous variables
      4.3.4 Implementation
   4.4 NP-hardness with upper bound constraints on continuous variables
   4.5 Extensions
   4.6 Conclusions
Chapter 5: Fractional 0-1 programming and submodularity
   5.1 Introduction
   5.2 Preliminaries
      5.2.1 Notation
      5.2.2 Submodularity and approximation algorithms
      5.2.3 Submodularity and cutting-plane methods
   5.3 Submodularity of a single ratio and its implications
      5.3.1 A necessary and sufficient condition
      5.3.2 Monotonicity implies submodularity
      5.3.3 On non-monotone submodular functions
      5.3.4 On homogeneous fractional functions
      5.3.5 Submodularity testing
   5.4 Applications
      5.4.1 Assortment optimization problem
         5.4.1.1 Cannibalization and submodularity
         5.4.1.2 Revenue spread, no-purchase probability and submodularity
         5.4.1.3 On the greedy algorithm and revenue-ordered assortments
      5.4.2 p-choice facility location problem
      5.4.3 On minimization problems
   5.5 Conclusion
Chapter 6: Towards a general theory for convexification of MINLPs with indicators
   6.1 General techniques for convexifying mixed-integer sets
      6.1.1 Disjunctive programming and extended formulation
      6.1.2 Outer construction and valid inequalities
      6.1.3 Convexification by part
   6.2 Convexification of structured sets
      6.2.1 Permutation-invariant sets
      6.2.2 Chain-function with indicator variables
      6.2.3 Triangular function with hierarchical indicator variables
      6.2.4 Convexification and decomposition for conic quadratic programs with indicator variables
   6.3 Convexification and optimization
Chapter 7: Conclusion
Bibliography
Appendices
   A Supplementary results for Chapter 2
   B Supplementary results for Chapter 3
      B.1 Proof of Proposition 25
   C Supplementary results for Chapter 4
      C.1 On computational experiments with branch-and-bound
      C.2 Proof of Theorem 5

List of Tables

2.1 Comparison with exact branch-and-bound method.
2.2 Signal-to-Noise Ratio for different values of the noise.
2.3 Performance of the Lagrangian method for signals with n = 100,000.
3.1 Comparison of convex relaxations of (QI).
3.2 Experiments with varying diagonal dominance, r = 0.3.
3.3 Experiments with varying positive off-diagonal entries, d = 0.1.
3.4 Results with stock return data since 2010.
3.5 Results with stock return data since 2015.
7.1 Performance of CPLEX solver in an instance with n = 500 and k = 10.

List of Figures

2.1 Comparison of estimators for signal denoising.
2.2 Underlying signals and noisy observations.
2.3 Optimality gaps of the $\ell_1$ relaxation and decomp.
2.4 Distribution of CPU times for each method.
2.5 Average out-of-sample error as a function of SNR (in log-scale).
2.6 Average out-of-sample number of false positives and false negatives as a function of SNR.
2.7 Distribution of the out-of-sample errors for different SNRs when the true values of the signal used in training are available.
2.8 CPU time in seconds as a function of SNR (in log-scale). The error bars correspond to ±1 stdev.
2.9 Average out-of-sample error as a function of SNR (in log-scale).
2.10 Average out-of-sample number of false positives and false negatives as a function of SNR.
2.11 Distribution of the out-of-sample errors.
3.1 Relationship between the convex relaxations for (QI) discussed in Chapter 3.
3.2 Distribution of gaps for OptPersp and OptPairs.
4.1 Number of instances solved as a function of time.
6.1 2D example of $A_0$.

Abstract

Operational problems involving discrete structures naturally arise in a vast number of applications, which provides the impetus for the development of mixed-integer programming (MIP). In this dissertation, we are mainly concerned with the theory of mixed-integer nonlinear programming (MINLP) with indicator variables, as well as its applications in other fields including machine learning, statistics, finance, and revenue management.

In parametric learning problems, a regularizer is often added to the loss function to control the overfitting of the model. If the regularizer is convex, then the resulting optimization problem is often easy to solve but may result in estimators with inferior statistical properties. On the other hand, if the regularizer is chosen to be nonconvex, then the problem is hard to solve to global optimality, and the performance of the resulting local minimizer or stationary point relies heavily on the initialization of the algorithm. A convex relaxation of MINLP formulations provides another way to regularize the model: it can be perceived as an adaptive nonconvex regularizer that nevertheless keeps the overall problem convex and thus computationally tractable. In Chapter 2, we study a regression problem with sparsity priors which can be naturally modeled as quadratic optimization with $\ell_0$-"norm" constraints. Since such problems are hard to solve, the standard approach is to tackle their convex surrogates with the $\ell_1$-norm regularizer. Alternatively, we propose new iterative (convex) conic quadratic relaxations that exploit not only the $\ell_0$-"norm" terms, but also the fitness and smoothness functions. These stronger relaxations lead to significantly better estimators than $\ell_1$-norm approaches and also allow one to utilize affine sparsity priors. In addition, the parameters of the model and the resulting estimators are easily interpretable. Experiments with a tailored Lagrangian decomposition method indicate that the proposed iterative convex relaxations yield solutions within 1% of the exact $\ell_0$ approach, and can tackle instances with up to 100,000 variables in under one minute.

In Chapter 3, we study the convex quadratic optimization problem with indicator variables (CQI). We first prove the equivalence between the optimal perspective reformulation and Shor's SDP formulation for the CQI; these two formulations have been studied extensively in the literature. Moreover, for the 2×2 case, we describe the convex hull of the epigraph in the original space of variables, and also give a conic quadratic extended formulation.
Then, using the convex hull description for the 2×2 case as a building block, we derive an extended SDP relaxation for the general case. This new formulation is stronger than other SDP relaxations proposed in the literature for the problem, including Shor's SDP relaxation, the optimal perspective relaxation, and the optimal rank-one relaxation. Computational experiments indicate that the proposed formulations are quite effective in reducing the integrality gap of the optimization problems.

In Chapter 4, we study the mixed-integer epigraph of a low-rank convex function with nonconvex indicator constraints, which are often used to impose logical constraints on the support of the solutions. Extended formulations describing the convex hull of such sets can easily be constructed via disjunctive programming, although a direct application of this method often yields prohibitively large formulations, whose size is exponential in the number of variables. In that chapter, we propose a new disjunctive representation of the sets under study, which leads to compact formulations whose size is exponential in the rank of the function but polynomial in the number of variables. Moreover, we show how to project out the additional variables for the case of rank-one functions, recovering or generalizing known results for the convex hulls of such sets (in the original space of variables).

In Chapter 5, we study multiple-ratio fractional 0-1 programs, a broad class of NP-hard combinatorial optimization problems. In particular, under some relatively mild assumptions we provide a complete characterization of the conditions that ensure that a single-ratio function is submodular. We then illustrate our theoretical results with the assortment optimization and facility location problems, and discuss practical situations that guarantee submodularity in the considered application settings. In such cases, near-optimal solutions for multiple-ratio fractional 0-1 programs can be found via simple greedy algorithms.

In Chapter 6, we summarize techniques for convexifying mixed-integer sets. In Chapter 7, we conclude the dissertation and discuss further promising research directions.

Chapter 1: Introduction and general theory

The last three decades have witnessed enormous advances in mixed-integer linear programming (MILP) technology. A difficult MILP problem that would have taken 124 years to solve in 1988 can now be solved in one second (see http://www.focapo-cpc.org/pdf/Linderoth.pdf) using state-of-the-art commercial solvers such as CPLEX, XPRESS, and GUROBI. This order-of-magnitude speedup can be attributed to the development of many solution techniques, including preprocessing, branching, heuristic methods for finding feasible solutions and, most importantly, cutting-plane methods for improving relaxation bounds. Although polyhedral approaches to MILP have proven successful in practice, when nonlinearity arises these conventional LP-based algorithms are insufficient and incapable of solving problems effectively. This motivates the study of the theory of mixed-integer nonlinear programming (MINLP), which is currently far from well understood. In contrast to MILP problems, the central difficulty in handling MINLP problems is that the convex hull of the mixed-integer sets under study is no longer polyhedral.
Such nonlinearity has to be taken into consideration when developing efficient solution methods for MINLP problems, and this requires a comprehensive study of the structure of the non-polyhedral sets, especially when integer variables come into play. In this dissertation, we focus on a special class of MINLP problems where the integer variables are restricted to be binary.

Binary variables are ubiquitous in operations research models. They can be used to carry logical relationships between continuous variables such as disjunction, conjunction, negation and implication. For example, in facility location problems, binary variables are introduced to account for fixed costs. In bilevel optimization, binary variables can be used to characterize the optimality conditions of the second-stage problems. In revenue management, a discrete choice model is needed to describe customers' choice behavior when they are provided with finitely many different alternatives; the relative preferences of a customer for the products can be expressed as a fractional function of binary variables. A high-dimensional statistical learning model often assumes that the parameters to be estimated are sparse, and the sparsity restriction imposed on the model can be encoded by introducing binary variables that indicate whether a covariate is zero. In Chapter 2, we study signal estimation problems with sparsity priors. In order to recover a sparse signal from its noisy observations, a well-known approach in the statistics and machine learning communities called LASSO is often adopted. However, such a method may yield a dense estimator with excessive shrinkage. To address this issue, we introduce additional binary variables to encode the model sparsity, and thus the problem is cast as an MINLP. It turns out that the resulting formulation is able to yield a signal estimator with substantially improved inference properties.

Most successful algorithms in the literature for solving MINLPs consist of two key ingredients: convexification and enumeration. Typically, at each iteration, a convex program is solved to obtain a tight bound on the optimal value of the original MINLP. Based on the new information generated, an enumeration procedure such as branching is then used to search for an optimal solution. During this process, the strength of the convex relaxations is at the heart of the algorithm: a strong convex relaxation is often indispensable for keeping the search time within practical limits. Moreover, in many cases, approximation algorithms or even heuristics built upon strong convex relaxations are able to deliver a good feasible solution to the MINLP. One famous example is the algorithm of Goemans and Williamson [107], which rounds the optimum of Shor's semidefinite relaxation to obtain a solution to max-cut problems.

As one would expect, the efficiency of solving the convex optimization subproblems plays a fundamental role in the overall solution time of MINLPs. Second-order cone programs (SOCP) and semidefinite programs (SDP) are two classes of well-studied convex optimization problems in the literature. Compared with linear programming, one advantage of SOCP/SDP is their remarkable ability to represent or approximate nonlinear sets. A non-polyhedral set that can be well approximated with a few SOCP/SDP-representable inequalities often requires many more (even infinitely many) hyperplanes to achieve the same effect.
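As a standard back-of-the-envelope illustration of this gap, consider the unit disk $D=\{x\in\mathbb{R}^2:\|x\|_2\le 1\}$. It is described exactly by a single second-order cone constraint, whereas the circumscribed regular $m$-gon obtained from $m$ tangent halfspaces,
\[
P_m=\Bigl\{x\in\mathbb{R}^2 : \cos\bigl(\tfrac{2\pi k}{m}\bigr)x_1+\sin\bigl(\tfrac{2\pi k}{m}\bigr)x_2\le 1,\ k=0,\dots,m-1\Bigr\},
\qquad
\max_{x\in P_m}\operatorname{dist}(x,D)=\frac{1}{\cos(\pi/m)}-1\approx\frac{\pi^2}{2m^2},
\]
only approximates it: achieving accuracy $\varepsilon$ requires on the order of $1/\sqrt{\varepsilon}$ facets, and no finite number of halfspaces describes $D$ exactly.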
The most popular solution methods for conic optimization problems are interior point methods which enjoy not only good theoretical convergence properties but also excellent computational performance in practice. More importantly, efficient implemen- tations of interior point methods have been performed by many modern optimization solvers such as MOSEK. For the above reasons, mixed-integer conic optimization problems have attracted a variety of research recently. In Chapter 3, we compare the strength of several SOCP and SDP relaxations that are commonly used for convex quadratic optimization problems with indicator variables. Beyond that, a new formulation is proposed based on the convex hull of the mixed- integer epigraph of 2 2 quadratic forms. Moreover, we prove that the new formulation is strictly stronger than the existing ones in literature, which is also supported by numerical results. Ideally, any mixed-binary set can be expressed as a disjunction of a finite number of pieces, with each piece defined by an assignment of the binary vector. As a result, the convex hull of the mixed-binary set can be readily constructed using disjunctive programming (DP) techniques. Yet, as the problem dimension grows, the DP formulation usually demands an exponential number of auxiliary variables. Therefore, unless there are few binary variables, such a naive extended formu- lation is prohibitive to implement in practice. In Chapter 4, we consider a class of mixed-binary programs where the objective function is the composition of a low-dimensional convex function and an affine mapping. To exploit the low-rank structure of the objective, we propose a new DP formula to express the mixed-integer set under study. Consequently, it leads to a compact convex hull description for which the number of additional variables required is significantly reduced and in fact polynomial in the problem dimension. Besides general purpose convexification techniques, one should also explore and take advan- tage of the special structures of MINLP problems whenever possible. For example, if the objective 3 function is submodular, then polymatroid inequalities can be used to strengthen existing formula- tions. Nevertheless, it is often unclear when a set function is submodular. In Chapter 5, we study multiple-ratio fractional 0-1 programs. In particular, we completely characterize the conditions under which a single linear ratio is submodular. It can be shown that a single ratio is submodular if and only if it is almost monotonic. Based on such conditions, we propose an efficient algorithm to test submodularity of a single linear ratio. The remainder of this chapter serves as the introduction to abstract mixed-integer optimization from the perspective of convex analysis. Necessary preliminaries are also given for future use. 1.1 Mixed-integer optimization A general mixed-integer program is an optimization problem of the form min f(x;z)+ha;xi+hc;zi s.t. (x;z)2F; where f(x;z) :R m+n !R is a finite function, a2R m ;c2R n ,h;i represents the inner product of two vectors, andFR m Z n is a mixed-integer feasible region. By introducing an additional variable t, the mixed-integer program can be rewritten as min t+ha;xi+hc;zi s.t. (t;x;z)2 ¯ F def =f(t;x;z) : t f(x;z);(x;z)2Fg: (1.1) For this reason, one can assume a linear objective function without loss of generality (WLOG). A setS is convex if for any two points x;y2S , their convex combination lx+(1l)y2 S8l2[0;1]. A convex relaxation ofS is any convex set that containsS . 
The convex hull ofS is the inclusionwise minimal convex set containingS and is denoted as convS . IfS = f(x;z)2R m Z n : Ax+ Bz bg, where A;B are two rational matrices and b is a rational vector, 4 then the fundamental theorem of integer linear programming states that convS is polyhedral and thus closed. In general,S is not necessarily closed. We denote clS as the closure of setS . Proposition 1 (Lemma 1.3 [76]). LetSR n and c2R n . Then maxfhc;xi : x2Sg= maxfhc;xi : x2 convSg= maxfhc;xi : x2 clconvSg. Furthermore, the maximum ofhc;xi is attained over S if and only if it is attained over convS . Conceptually, by Proposition 1, the mixed-integer program (1.1) can be transformed to a con- vex program, which is well understood and can be solved efficiently by means of existing algo- rithms such as interior point methods in many cases. However, in general, (1.1) isNP-hard, and thus it is unrealistic to obtain a reasonable representation of clconv ¯ F . In such cases, convexifica- tion plays a central role in solving MINLPs. In this dissertation, our primary goal is to construct strong convex relaxations of ¯ F to obtain a tight lower bound of (1.1). Branch-and-bound may be the most prominent algorithmic scheme in mixed-integer program- ming. The algorithm solves the MIP problem (1.1) by gradually growing an enumeration tree with each node in the tree representing a subproblem. During this process, the solver globally main- tains a lower bound and an upper bound of the optimal value to (1.1), where the lower bound is obtained by solving a relaxation of subproblems and the upper bound is given by the best feasi- ble solution found so far. If the upper bound and lower bound does not match, one leaf node is selected and a relaxation of the associated subproblem is solved. The node is pruned from the tree if the relaxation is infeasible or the resulting objective is greater than the current upper bound. Otherwise, the global upper bound and lower bound are updated accordingly. Furthermore, in the case where the resulting solution to the relaxation has a fractional entry ¯ z i , the subproblem would be partitioned into two child subproblems with smaller feasible regions by adding one integrality constraint z i b¯ z i c or z i d¯ z i e. Consequently, two new nodes are added to the enumeration tree, which is the branching step of the algorithm. The above process is repeated until optimality or infeasibility of (1.1) is certified. Cutting surface/plane is another important iterative approach to solving MIP problems. It is briefly described as follows. An inequality h(x) 0 defined by a convex function (see Section 1.2 5 for the definition) is called a valid inequality forS if it holds true at all points inS . In each iteration, the algorithm constructs a convex relaxationF c of ¯ F and solve the resulting convex program to optimality. If the solution ( ¯ t; ¯ x; ¯ z) to the relaxed problem belongs to ¯ F , it must be optimal to the original problem (1.1) and the algorithm terminates. Otherwise, one tries to find an inequality h(t;x;z) 0 such that it is valid for ¯ F but violated by( ¯ t; ¯ x; ¯ z). The inequality h(t;x;z) 0 found in this case is called a cutting surface separating( ¯ t; ¯ x; ¯ z) from ¯ F . Next, the valid inequality is added to strengthen the current relaxation by settingF new =F c \f(t;x;z) : h(t;x;z) 0g, which is used to trigger the next iteration. The above process is repeated until no violated inequality can be found within tolerance. 
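As a concrete illustration of the cutting-surface loop just described, the sketch below repeatedly solves a linear relaxation and, whenever the relaxed optimum is infeasible, adds a valid inequality that separates it and re-solves. For simplicity the separation is performed over a plain convex region (the unit disk) rather than a mixed-integer set, and the sketch assumes numpy and scipy are available; it is purely illustrative.

import numpy as np
from scipy.optimize import linprog

# Maximize x1 + x2 over the unit disk ||x||_2 <= 1 by cutting surfaces:
# solve an LP relaxation; if the relaxed optimum xbar lies outside the disk,
# add the valid inequality xbar^T x <= ||xbar||_2 (it contains the disk by
# Cauchy-Schwarz and cuts off xbar) and re-solve.
c = np.array([-1.0, -1.0])            # linprog minimizes, so negate the objective
cuts_A, cuts_b = [], []               # accumulated cutting planes
bounds = [(-1.0, 1.0), (-1.0, 1.0)]   # initial box relaxation

for iteration in range(100):
    res = linprog(c,
                  A_ub=np.array(cuts_A) if cuts_A else None,
                  b_ub=np.array(cuts_b) if cuts_b else None,
                  bounds=bounds)
    xbar = res.x
    violation = np.linalg.norm(xbar) - 1.0
    if violation <= 1e-6:             # relaxed optimum is feasible: stop
        break
    cuts_A.append(xbar)               # separating inequality xbar^T x <= ||xbar||_2
    cuts_b.append(np.linalg.norm(xbar))

print(f"cuts added: {len(cuts_A)}, solution: {np.round(xbar, 4)}")
# Converges to x ~ (0.7071, 0.7071), the maximizer of x1 + x2 over the disk.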
In practice, cutting surface methods are usually employed together with branch-and-bound; see [76] for more discussion on the two algorithms. 1.2 Preliminaries of convex analysis In this section, we introduce more concepts and preliminary results in convex analysis for future use. Throughout this chapter, we assume f(x) is an extended value function fromR n toR[f+¥g. We denote the effective domain of f by dom f def =fx2R n : f(x)<+¥g. Function f is called proper if dom f6= / 0. Function f is called finite if dom f =R n . We denote the epigraph of f by epi f def = (t;x)2R n+1 : t f(x) . If epi f is a closed set, then f is called closed or lower semicontinuous (lsc). Throughout the dissertation, we assume all functions are proper and closed unless otherwise specified. A function f is positively homogeneous if f(lx)=l f(x)8l > 0 and x2R n . A function f :R n !R[f+¥g is called convex if epi f is convex. For any proper function f , its convex conjugate is defined as f (a) def = max x a > x f(x); 6 where a > represents the transpose of a2R n . Note that a convex conjugate is always a closed convex function. The convex envelope of a generic function f is denoted as clconv f and defined as the largest closed convex function that underestimates f , which can be characterized as the lower boundary of clconvepi f as shown in the following proposition. Proposition 2. The following holds 1. The value function infft :(t;x)2Sg is convex provided thatS is convex. 2. Function clconv f(x)= infft :(t;x)2 clconvepi fg, i.e. epiclconv f = clconvepi f . Proof. The first conclusion is due to Theorem 5.3 of [193]. The second conclusion directly follows from the definition of the convex envelope; see also Page 51 [193]. The convex envelope can also be characterized from the dual perspective. Proposition 3. Assume f :R n !R[f+¥g is a proper function. Then clconv f = f def =( f ) . In particular, if f itself is closed and convex, then f = f . Proof. The conclusion follows from Corollary 12.1.1 and Theorem 12.2 of [193]. For a closed convex function f with 02 dom f , its perspective function is denoted by f p (x;l) : R n+1 !R[f+¥g and defined as f p (x;l) def = 8 > > < > > : l f x l ifl 0 +¥ o.w., where 0 f(x=0) def = lim l!0 + l f(x=l) is called the recession function of f at x. Note that by this def- inition f p () is always a sublinear function, i.e. a closed, convex and positively homogeneous function; see Section 8 in [193]. We denote the indicator function ofS byd(x;S), which is defined asd(x;S)= 0 if x2S andd(x;S)=+¥ otherwise. Consequently, the convex conjugate ofd(x;S) is d (a;S)= max x a > xd(x;S)= max x2S a > x; 7 which is known as the support function ofS . With such notations, the constrained optimization problem min (x;z)2F f(x;z)+ha;xi+hc;zi can be expressed as min x;z f(x;z)+ha;xi+hc;zi+d(x;z;F). Furthermore, solving (1.1) can be viewed as computing the value ofd (; ¯ F). The relative interior of a nonempty convex setS is defined as riS def =fx2S :8y2S;9l > 0 s.t. x+l(x y)2Sg. This section end up with the following useful minimax theorem for saddle functions. Proposition 4 (Sion’s minimax theorem [204]). ConsiderX R n ,Y R m and f(x;y) :X Y !R, whereX is compact,Y is closed, f(x;) is closed and convex onY for all x2X , and f(;y) is closed and convex onX for all y2Y . Then min x2X sup y2Y f(x;y)= sup y2Y min x2X f(x;y). 1.3 Conic programming A coneC is a set such that x2C implies lx2C for all l > 0. A cone is called pointed if for any x, x2C andx2C imply x = 0. 
The dual cone of a coneC is defined asC def = fy :hx;yi 08x2Cg. ConeC is called self-dual ifC =C . A conic optimization problem is a problem that can be written in the form min x n c > x : Ax b2C o , whereC is a closed convex cone. WhenC is a nonnegative orthant, the conic optimization problem reduces to a linear program. In fact, all convex optimization problems can be equivalently represented as a conic optimization problem by conifying feasible regions [164]. A second order cone or Lorentz cone is defined asC = (t;x 1 ;;x n )2R n+1 : tkxk 2 . A rotated second order cone is a variant of the Lorentz cone and defined asC = (s;t;x 1 ;:::;x n )2 R n+2 : 2stkxk 2 2 ; s 0; t 0 . A semidefinite cone is denoted asS n + and defined as the set of symmetric positive semidefinite matrices in R nn . We also rewrite M2S n + as M 0 when the dimension n is not emphasized or can be inferred from the context. Second order cones and semidefinite cones are self-dual. Besides linear programming, second order cone programming (SOCP) and semidefinite programming (SDP) are two classes of convex optimization problems that can be efficiently solved on a relatively large scale. Conic programming enjoys the following strong duality result under mild conditions. 8 Proposition 5 (Conic Duality Theorem; Theorem 2.4.1 [41]). AssumeC is a pointed closed con- vex cone. If there exists x such that Ax b is an interior ofC , then min x n c > x : Ax b2C o = maxfb > y : A > y= c; y2C g. 1.4 Disjunctive programming For any scalar l 0 and two generic nonempty setsS andS 0 in a proper Euclidean space, we definelS def =flx : x2Sg and the Minkowski sum of setsS +S 0 def =fx+ x 0 : x2S;x 0 2S 0 g. To be consistent with the definition of 0 f(x=0), we interpret 0S as the recession cone ofS , which is defined as 0S def =fy : x+ly2S8l 0;8x2Sg. Disjunctive programming is optimization over disjunctive sets. IfS can be represented as a disjunction of several subsets, i.e.S = [ i2[m] S i , then its convex hull can be represented in an extended space by introducing additional auxiliary variables. Proposition 6. AssumeS 1 ;;S m are nonempty closed convex sets and have the same recession cone. ConsiderS = [ i2[m] S i . The following holds. 1. Set clconvS = [ l2D m å i2[m] l i clconvS i ; whereD m def = ( l2R m : å i2[m] l i = 1;l 0 ) is a m- dimensional simplex. 2. In particular, assumeS i has a finite representationS i =fx : f i j (x) b i j 8 j2[m i ]g, where each f i j is a closed convex function. Then clconvS = ( x : x= å i2[m] x i ; f p i j (x i ;l i )l i b i j ;i2[m]; j2[m i ]; å i2[m] l i = 1;l 0 ) : Proof. The proposition follows from Theorem 8.7, Theorem 9.8 and Corollary 9.8.1 of [193]. The set given by the extended formulation in the second part of Proposition 6 is always closed in the lifted space, but in general without the condition that allS i have the same recession cone, its projection onto the space of original variables is inclusively between convS and clconvS . 9 1.5 Submodular optimization A set function f : 2 N !R from the subsets of N=[n] to the real numbers is submodular overF if it exhibits diminishing returns, i.e., f(S[fig) f(S) f(T[fig) f(T) for all S T Nnfig such that T[fig2F . Equivalently, function f is submodular overF if f(S[fi; jg) f(S[f jg) f(S[fig) f(S) (1.2) for all S N and i; j62 S such that S[fi; jg2F . Set function f is called monotone nondecreasing if f(S) f(T) for all feasible S T . 
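When the ground set is small, the diminishing-returns condition (1.2) can be verified directly by enumeration. The brute-force sketch below (in Python, assuming the set function is supplied as a callable on frozensets and that every subset is feasible) is exponential in n and purely illustrative; it is unrelated to the efficient submodularity test developed in Chapter 5.

from itertools import combinations

def is_submodular(f, n, tol=1e-9):
    """Brute-force check of condition (1.2):
    f(S + {i,j}) - f(S + {j}) <= f(S + {i}) - f(S) for all S and i, j not in S.
    Exponential in n, so only usable for small ground sets."""
    ground = range(n)
    for size in range(n + 1):
        for subset in combinations(ground, size):
            S = frozenset(subset)
            for i in ground:
                for j in ground:
                    if i == j or i in S or j in S:
                        continue
                    lhs = f(S | {i, j}) - f(S | {j})
                    rhs = f(S | {i}) - f(S)
                    if lhs > rhs + tol:
                        return False
    return True

# A concave function of the cardinality, f(S) = sqrt(|S|), is submodular,
# while f(S) = |S|^2 violates diminishing returns.
print(is_submodular(lambda S: len(S) ** 0.5, 5))   # True
print(is_submodular(lambda S: len(S) ** 2, 5))     # False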
Since there is a one-to-one correspondence between z2 f0;1g n and S=fi2[n] : z i = 1g N, for a set function f , we also rewrite f(S) as f(z) and regard it as a binary function fromf0;1g n !R. Proposition 7 (Proposition 3.7 [28]). Assume f is a submodular function with f(0)= 0. Then clconv f(z)= f L (z)+d(z;[0;1] n ), where f L (z) is the Lov´ asz extension of f defined as f L (z) def = n å k=1 z p(k) ( f(S k ) f(S k1 )), where p : [n]! [n] is a bijection such that z p(1) z p(n) and S k =fp(1);;p(k)g. Ana-approximation algorithm for a maximization problem max x2S f(x) is a polynomial-time algorithm that for all instances of the problem produces a solution ¯ x such that f( ¯ x)a f(x ), where x is the optimal solution. If f : N!R is a submodular monotone nondecreasing set function with f(0)= 0, then the naive greedy algorithm (Algorithm 1) is a (1 1=e)-approximation algorithm for max S2F f(S) [96], whereF =fS N :jSj kg with k2N + . Algorithm 1 Greedy Algorithm for Submodular Function Maximization Step 1. Set S := / 0. Step 2. Set A :=f`2 Nn S : S[f`g2Fg. Step 3. If A6= / 0, set` 2 argmax `2A f(S[f`g) and S := S[f` g. Go to Step 2. Step 4. Return S. 10 2 Sparse and smooth signal estimation: convexification of` 0 formulations 2.1 Introduction Given nonnegative data y2R n + corresponding to a noisy realization of an underlying signal, we consider the problem of removing the noise and recovering the original, uncorrupted signal y . A successful recovery of the signal requires exploiting prior knowledge on the structure and charac- teristics of the signal effectively. A common prior knowledge on the underlying signal is smoothness. Smoothing considerations can be incorporated in denoising problems through quadratic penalties for deviations in successive estimates [186]. In particular, denoising of a smooth signal can be done by solving an optimization problem of the form min x2R n + ky xk 2 2 +lkPxk 2 2 ; (2.1) where x corresponds to the estimation for y , l > 0 is a smoothing regularization parameter, P2 R mn is a linear operator, the estimation error termky xk 2 2 measures the fitness to data, and the quadratic penalty termkPxk 2 2 models the smoothness considerations. In its simplest form kPxk 2 2 = å fi; jg2A (x i x j ) 2 ; (2.2) 11 where A encodes the notion of adjacency, e.g., consecutive observations in a time series or adjacent pixels in an image. If P is given according to (2.2), then problem (2.1) is a convex Markov Random Fields problem [133] or metric labeling problem [149], commonly used in the image segmentation context [62, 150] and in clustering [135], for which efficient combinatorial algorithms exist. Even in its general form, (2.1) is a convex quadratic optimization, for which a plethora of efficient algorithms exist. Another naturally occurring signal characteristic is sparsity, i.e., the underlying signal differs from a base value in only a small proportion of the indexes. Sparsity arises in diverse application domains including medical imaging [165], genomic studies [138], face recognition [231], and is at the core of compressed sensing methods [90]. In fact, the “bet on sparsity” principle [123] calls for systematically assuming sparsity in high-dimensional statistical inference problems. 
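For intuition on problem (2.1), note that with the chain adjacency in (2.2) (consecutive observations of a time series) and with the nonnegativity constraint ignored, the problem reduces to a linear system: setting the gradient of $\|y-x\|_2^2+\lambda\|Px\|_2^2$ to zero gives $(I+\lambda P^{\top}P)x=y$, where $P$ is the first-difference operator. The sketch below is a minimal illustration under those simplifying assumptions, not the estimator developed in this chapter.

import numpy as np

def smooth_denoise(y, lam):
    """Solve min_x ||y - x||_2^2 + lam * sum_i (x_{i+1} - x_i)^2, i.e. problem
    (2.1) with the chain adjacency (2.2) and without the x >= 0 constraint,
    via the normal equations (I + lam * P^T P) x = y."""
    n = len(y)
    P = np.diff(np.eye(n), axis=0)            # (n-1) x n first-difference matrix
    return np.linalg.solve(np.eye(n) + lam * P.T @ P, y)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
true_signal = np.where((t > 0.3) & (t < 0.6), 1.0, 0.0)    # piecewise-constant signal
y = true_signal + 0.2 * rng.standard_normal(t.size)        # noisy observations
x_hat = smooth_denoise(y, lam=5.0)
print(f"mean squared error: {np.mean((x_hat - true_signal) ** 2):.4f}")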
Sparsity constraints can be modeled using the` 0 -“norm” 1 , leading to estimation problems of the form min x2R n + ky xk 2 2 +l å fi; jg2A (x i x j ) 2 subject tokxk 0 k; (2.3) where k2Z + is a target sparsity andkxk 0 =å n i=1 1 x i 6=0 , where 1 () is the indicator function equal to 1 if () is true and equal to 0 otherwise. In addition, the indicators can also be used to model affine sparsity constraints [86, 87], enforcing more sophisticated priors than simple sparsity; see Section 2.5.2 for an illustration. Unlike (2.1), problem (2.3) is non-convex and hard-to-solve exactly. The regularized version of (2.3), given by min x2R n + ky xk 2 2 +l å fi; jg2A (x i x j ) 2 +mkxk 0 (2.4) withm 0, has received (slightly) more attention. Problem (2.4) corresponds to a Markov Random Fields problem with non-convex deviation functions [see 2, 134], for which a pseudo-polynomial combinatorial algorithm of complexity O jAjn e 2 log n 2 ejAj exists, wheree is a precision parameter andjAj is the cardinality of set A; to the best of our knowledge, this algorithm has not been 1 The so-called` 0 -“norm” is not a proper norm as it violates homogeneity. 12 implemented to date. More recently, in the context of signal denoising, Bach [29] proposed another pseudo-polynomial algorithm of complexity O n e 3 log n e , and demonstrated its performance for instances with n= 50. The aforementioned algorithms rely on a discretization of the x variables, and their performance depends on how precise the discretization (given by the parameter e) is. Finally, a recent result of Atamt¨ urk and G´ omez [16] on quadratic optimization with M-matrices and indicators imply that (2.4) is equivalent to a submodular minimization problem, which leads to a strongly polynomial-time algorithm of complexity O(n 7 ). The high complexity by a blackbox submodular minimization algorithm precludes its use except for small instances. No polynomial- time algorithm is known for the constrained problem (2.3). In fact, problems (2.3) and (2.4) are rarely tackled directly. One of the most popular techniques used to tackle signal estimation problems with sparsity consists of replacing the non-convex term kxk 0 with the convex` 1 -norm,kxk 1 =å n i=1 jx i j, see Section 2.2.1 for details. The resulting opti- mization problems with the` 1 -norm can be solved very efficiently, even for large instances; how- ever, the` 1 problems are often weak relaxations of the exact` 0 problem (2.3), and the estimators obtained may be poor, as a consequence. Alternatively, there is a increasing effort for solving the mixed-integer optimization (MIO) (2.3) exactly using enumerative techniques, see Section 2.2.2. While the recovered signals are indeed high quality, exact MIO approaches to-date require at least a few days to solve instances with n 1;000, and are inadequate to tackle many realistic instances as a consequence. Contributions and outline In this chapter, we discuss how to bridge the gap between the easy-to-solve` 1 approximations and the often intractable` 0 problems in a convex optimization framework. Specifically, we construct a set of iterative convex relaxations for problems (2.3) and (2.4) with increasing strength. These convex relaxations are considerably stronger than the` 1 relaxation, and also significantly improve and generalize other existing convex relations in the literature, including the perspective relaxation (see Section 2.2.3) and recent convex relaxations obtained from simple pairwise quadratic terms 13 (see Section 2.2.4). 
The strong convex relaxations can be used to obtain high-quality, if not optimal, solutions for (2.3)–(2.4), resulting in better performance than the existing methods; in our computations, solutions to instances with n = 1,000 are obtained with off-the-shelf convex solvers within seconds. For additional scalability, we give an easy-to-parallelize tailored Lagrangian decomposition method that solves instances with n = 100,000 in under one minute. Finally, the proposed formulations are amenable to conic quadratic optimization techniques and can thus be tackled using off-the-shelf solvers, resulting in several advantages: (i) the methods described here will benefit from the continuous improvements of conic quadratic optimization solvers; (ii) the proposed approach is flexible, as it can be used to tackle either (2.3) or (2.4), as well as general affine sparsity constraints, by simply changing the objective or adding constraints.

Figure 2.1 illustrates the performance of the $\ell_1$-norm estimator and the proposed strong convex estimators for an instance with n = 1,000. The new convex estimator, depicted in Figure 2.1(c), requires only one second to solve; the convex estimator enhanced with additional priors in Figure 2.1(d) is solved in under five seconds.

The rest of the chapter is organized as follows. In Section 2.2 we review the relevant background for the chapter. In Section 2.3 we introduce the strong iterative convex formulations for (2.3)–(2.4). In Section 2.4 we give a conic quadratic extended reformulation of the model and describe a scalable Lagrangian decomposition method to solve it. In Section 2.5 we test the performance of the methods from a computational and statistical perspective, and in Section 2.6 we conclude the chapter with a few final remarks.

Notation

Throughout the chapter, we adopt the following convention for division by 0: given $a \geq 0$, $a/0 = \infty$ if $a > 0$ and $a/0 = 0$ if $a = 0$. For a set $X \subseteq \mathbb{R}^n$, let $\text{conv}(X)$ denote the convex hull of $X$ and $\overline{\text{conv}}(X)$ the closure of $\text{conv}(X)$. Given two matrices $Q$ and $R$ of the same dimensions, we denote by $\langle Q, R \rangle$ the inner product of $Q$ and $R$.

[Figure 2.1 — Estimators from the $\ell_1$-approximation and the new strong convex formulations (decomp) for signal denoising: (a) true signal and noisy observations; (b) the $\ell_1$-approximation results in dense and shrunk estimators with many "false positives"; (c) the new strong convex formulation yields better sparse estimators with few "false positives"; (d) incorporating additional priors further improves the estimators, matching the sparsity pattern of the signal.]

2.2 Background

In this section, we review formulations relevant to our discussion. First we review the usual $\ell_1$-norm approximation (Section 2.2.1), next we discuss MIO formulations (Section 2.2.2), then we review the perspective reformulation, a standard technique in the MIO literature (Section 2.2.3), and finally pairwise convex relaxations that were recently proposed (Section 2.2.4).
15 2.2.1 L1-norm approximations A standard technique for signal estimation problems with sparsity is to replace the` 0 -norm with the` 1 -norm in (2.3), leading to the convex optimization problem (` 1 -approx) min x2R n + ky xk 2 2 +l å fi; jg2A (x i x j ) 2 subject tokxk 1 k: (2.5) The` 1 -norm approximation was proposed by Tibshirani [213] in the context of sparse linear re- gression, and is often referred to as lasso. The main motivation for the ` 1 -approximation is that the` 1 -norm is the convex p-norm closest to the` 0 -norm. In fact, for L=fx2[0;1] n :kxk 0 1g, it is easy to show that conv(L)=fx2[0;1] n :kxk 1 1g; therefore, the` 1 -norm approximation is considered to be the best possible convex relaxation of the` 0 -norm. The` 1 -approximation is currently the most commonly used approach for sparsity [125]. It has been applied to a variety of signal estimation problems including signal decomposition and spike detection [e.g., 75, 103, 222, 157], and pervasive in the compressed sensing literature [67, 68, 91]. A common variant is the fused lasso [215], which involves a sparsity-inducing term of the form å n1 i=1 jx i+1 x i j; the fused lasso was further studied in the context of signal estimation [192], and is often used for digital imaging processing under the name of total variation denoising [196, 221, 182]. Several other generalizations of the` 1 -approximation exist [214], including the elastic net [238, 180], the adaptive lasso [237], the group lasso [31, 189] and the smooth lasso [128]; related ` 1 -norm techniques have also been proposed for signal estimation, see [147, 167, 217]. The generalized lasso [216] utilizes the regularization termkAxk 1 and is also studied in the context of signal approximation. Despite its widespread adoption, the ` 1 -approximation has several drawbacks. First, the ` 1 - norm term may result in excessive shrinkage of the estimated signal, which is undesirable in many contexts [233]. Additionally, the` 1 -approximation may struggle to achieve sparse estimators — in 16 fact, solutions to (2.5) are often dense, and achieving a target sparsity of k requires using a parame- ter ˆ k<< k, inducing additional bias on the estimators. As a consequence, desirable theoretical per- formance of the` 1 -approximation can only be established under stringent conditions [192, 201], which may not be satisfied in practice. Indeed, ` 1 -approximations have been shown to perform rather poorly in a variety of contexts, e.g., see [142, 174]. To overcome the aforementioned draw- backs, several non-convex approximations have been proposed [102, 126, 168, 234, 236]; more recently, there is also an increasing effort devoted to enforcing sparsity directly with` 0 regulariza- tion using enumerative MIO approaches. 2.2.2 Mixed-integer optimization Signal estimation problems with sparsity can be naturally modeled as a mixed-integer quadratic optimization (MIQO) problem. Using indicator variables z2f0;1g n such that z i = 1 x i 6=0 for all i= 1;:::;n, problem (2.3) can be formulated as min n å i=1 (y i x i ) 2 +l å fi; jg2A (x i x j ) 2 (2.6a) s.t. x i (1 z i )= 0 (2.6b) z2 Cf0;1g n (2.6c) x2R n + : (2.6d) If C is defined by a k-sparsity constraint, i.e., C=fz2f0;1g n :kzk 1 kg, then problem (2.6) is the ` 0 analog of (2.5). More generally, C may be defined by other logical (affine sparsity) constraints, which allow the inclusion of additional priors in the inference problem. 
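Before turning to how the complementarity constraints (2.6b) are handled computationally, it is instructive to contrast the exact formulation (2.6) with its convex surrogate: the $\ell_1$-approximation (2.5) from Section 2.2.1 can be prototyped in a few lines with an off-the-shelf modeling tool. The sketch below assumes cvxpy is available and uses the chain adjacency; it is illustrative only and not the code used in the experiments of this chapter.

import numpy as np
import cvxpy as cp

def l1_denoise(y, lam, k):
    """Prototype of the l1-approximation (2.5):
    min ||y - x||_2^2 + lam * sum_i (x_{i+1} - x_i)^2  s.t.  ||x||_1 <= k, x >= 0."""
    n = len(y)
    x = cp.Variable(n, nonneg=True)
    objective = cp.sum_squares(y - x) + lam * cp.sum_squares(x[1:] - x[:-1])
    # since x >= 0, the constraint ||x||_1 <= k is simply sum(x) <= k
    problem = cp.Problem(cp.Minimize(objective), [cp.sum(x) <= k])
    problem.solve()
    return x.value

rng = np.random.default_rng(1)
true_signal = np.concatenate([np.zeros(40), np.ones(20), np.zeros(40)])
y = np.clip(true_signal + 0.3 * rng.standard_normal(true_signal.size), 0.0, None)
x_hat = l1_denoise(y, lam=1.0, k=10.0)
print(f"estimated nonzeros (> 1e-3): {(x_hat > 1e-3).sum()} out of {x_hat.size}")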
In this formulation, the non-convexity of the ` 0 regularizer is captured by the complementary constraints (2.6b) and the binary constraints encoded by set C. Constraints (2.6b) can be alternatively formulated with the so-called “big-M” constraints with a sufficiently large positive number u, x i (1 z i )= 0 and z i 2f0;1g, x i uz i and z i 2f0;1g: (2.7) 17 For the signal estimation problem (2.6), u=kyk ¥ is a valid upper bound for x i , i= 1;:::;n. Prob- lem (2.6) is a convex MIQO problem, which can be tackled using off-the-shelf MIO solvers. Es- timation problems with a few hundred of variables can be comfortably solved to optimality using such solvers, e.g., see [49, 77, 112, 228]. For high Signal-to-Noise Ratios (SNR), the estimators obtained from solving the exact` 0 problems indeed result in superior statistical performance when compared with the ` 1 approximations [50]. For low SNR, however, the lack of shrinkage may hamper the estimators obtained from optimal solutions of the ` 0 problems [124]; nonetheless, if necessary, shrinkage can be easily added to (2.6) via conic quadratic regularizations terms [169], resulting again in superior statistical performance over corresponding` 1 -approximations. Unfor- tunately, current MIO solvers are unable to solve larger problems with thousands of variables. A recent research thrust aims to use MIO formulations and heuristics for sparse learning prob- lems such as (2.6) [126, 230], which scale to larger instances than MIO solvers but may not pro- vide dual bounds on the quality of the solutions found. Another research direction aims to design tailored exact methods for specialized regression problems [19, 52, 127, 148], which perform sub- stantially better than general-purpose MIO solvers for the problems their are designed to tackle, but may not generalize well to other learning problems; in particular, general constraints such as (2.6c) are challenging to handle via tailored algorithms. Finally, we point out the relationship between the` 1 -approximation (2.5) and the MIO formu- lation (2.6). It can be verified easily that, if C is defined by a k-sparsity constraint, then there exists an optimal solution z to the simple convex relaxation with big-M constraint, where z i = x i u for all i= 1;:::;n. Therefore, the constraint (2.6c) reduces tokxk 1 ku, and we find that (2.5) is in fact the natural convex relaxation of (2.6) (for a suitable sparsity parameter). This relaxation is often weak and can be improved substantially. 2.2.3 The perspective reformulation A simple strengthening technique to improve the convex relaxation of (2.6) is the perspective reformulation [97], which will be referred to as persp in the remainder of the chapter for 18 brevity. This reformulation technique can be applied to the estimation error terms in (3.10) as follows: (y i x i ) 2 t, y 2 i 2y i x i + x 2 i t ! y 2 i 2y i x i + x 2 i z i t: (2.8) The term x 2 i =z i is the closure of the perspective function of the quadratic function x 2 i , and is there- fore convex, see p. 160 of [131]. Reformulation (3.26) is in fact the best possible for separable quadratic functions with indicator variables. The perspective terms x 2 i z i can be replaced with an auxiliary variable s i along with rotated cone constraints x 2 i s i z i [3, 116]. Therefore, persp re- laxations can be easily solved with conic quadratic solvers and is by now a standard technique for mixed-integer quadratic optimization [58, 129, 166, 229]. 
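Since each strengthened term $x_i^2/z_i$ is conic-quadratic representable through the rotated cone constraint $x_i^2 \le s_i z_i$ mentioned above, a continuous perspective relaxation of the signal estimation problem can be prototyped directly with a conic modeling tool. The sketch below is a minimal illustration assuming cvxpy is available, with the indicators relaxed to $[0,1]$, a $k$-sparsity constraint, the big-M bound $u=\|y\|_\infty$, and the chain smoothness term kept in its original convex form; it is not the implementation evaluated later in the chapter.

import numpy as np
import cvxpy as cp

def persp_relaxation(y, lam, k):
    """Continuous perspective relaxation of (2.6) with a k-sparsity constraint.
    Each term x_i^2 / z_i is modeled with cvxpy's quad_over_lin atom, which the
    solver handles as a rotated second-order cone constraint."""
    n = len(y)
    x = cp.Variable(n, nonneg=True)
    z = cp.Variable(n)                           # relaxed indicators, 0 <= z <= 1
    u = np.max(np.abs(y))                        # big-M bound u = ||y||_inf
    fit = sum(y[i] ** 2 - 2.0 * y[i] * x[i] + cp.quad_over_lin(x[i], z[i])
              for i in range(n))                 # perspective-strengthened error terms
    smooth = lam * cp.sum_squares(x[1:] - x[:-1])
    constraints = [z >= 0, z <= 1, cp.sum(z) <= k, x <= u * z]
    cp.Problem(cp.Minimize(fit + smooth), constraints).solve()
    return x.value, z.value

y = np.array([0.0, 0.1, 0.9, 1.1, 1.0, 0.2, 0.0, 0.0])
x_hat, z_hat = persp_relaxation(y, lam=0.5, k=3)
print(np.round(x_hat, 2), np.round(z_hat, 2))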
Additionally, relationships between the persp and the sparsity-inducing non-convex penalty functions minimax concave penalty [233] and reverse Huber penalty [184] have recently been established [88]. In the context of the signal estimation problem (2.3), thepersp yields the convex relaxation n å i=1 y 2 i + min n å i=1 (2y i x i + x 2 i z i )+l å fi; jg2A (x i x j ) 2 (persp.) s.t. x i kyk ¥ z i i= 1;:::;n z2 ¯ C; x2R n + ; where ¯ C is a valid convex relaxation of C, e.g., ¯ C = conv(C). The ` 1 -approximation model, as discussed in Section 2.2.1, is the best convex relaxation that considers only the indicators for the ` 0 terms. The persp approximation is the best convex relaxation that exploits the ` 0 indicator variables as well as the separable quadratic estimation error terms; thus, it is stronger than the` 1 - approximation. However, persp cannot be applied to non-separable quadratic smoothness terms (x i x j ) 2 , as the function x 2 i =z i 2x i x j + x 2 j =z j is non-convex due to the bilinear term. 19 2.2.4 Strong formulations for pairwise quadratic terms Recently, Jeon et al. [141] gave strong relaxations for the mixed-integer epigraphs of non-separable convex quadratic functions with two variables and indicator variables. Atamt¨ urk and G´ omez [16] further strengthened the relaxations for quadratic functions of the form(x i x j ) 2 corresponding to the smoothness terms in (2.6). Specifically, let X 2 = (z;x;s)2f0;1g 2 R 3 + :(x 1 x 2 ) 2 s; x i (1 z i )= 0;i= 1;2 and define the function f :[0;1] 2 R 2 + !R + as f(z;x)= 8 > > < > > : (x 1 x 2 ) 2 z 1 if x 1 x 2 (x 1 x 2 ) 2 z 2 if x 1 x 2 : Proposition 8 (Atamt¨ urk and G´ omez [16]). The function f is convex and conv(X 2 )= (z;x;s)2[0;1] 2 R 3 + : f(z;x) s : Using persp and Proposition 8, one obtains the stronger pairwise convex relaxation of (2.6) as n å i=1 y 2 i + min n å i=1 2y i x i + x 2 i z i +l å fi; jg2A f(z i ;z j ;x i ;x j ) (2.9a) (pairwise) s.t. x i kyk ¥ z i ; i= 1;:::;n (2.9b) z2 ¯ C; x2R n + : (2.9c) Note that f is not differentiable everywhere and it is defined by pieces. Therefore, it cannot be used directly with most convex optimization solvers. Atamt¨ urk and G´ omez [16] implement (2.9) using linear outer approximations of function f : the resulting method performs adequately for instances with n 400, but was ineffective in instances with n 1;000 as strong linear outer 20 approximations require the addition of a large number of constraints. Moreover, as Example 1 below shows, formulation (2.9) can be further improved even for n= 2. Example 1. Consider the signal estimation problem (2.44) with n= 2 min(0:4 x 1 ) 2 +(1 x 2 ) 2 + 0:5(x 1 x 2 ) 2 + 0:5(z 1 + z 2 ) (2.10a) s.t. x i z i ; i= 1;2 (2.10b) z2f0;1g 2 ;x2R 2 + : (2.10c) The optimal solution of (2.10) is(z 1 ;z 2 ;x 1 ;x 2 )=(0:00;1:00;0:00;0:67). On the other hand, opti- mal solutions of the convex relaxations of (2.10) are: ` 1 -approx Obtained by replacing z2f0;1g 2 with z2[0;1] 2 . The corresponding optimal solution is(z ` ;x ` )=(0:30;0:60;0:30;0:60), and we find thatk(z ;x )(z ` ;x ` )k 2 = 0:59. persp The optimal solution is(z p ;x p )=(0:00;0:82;0:00;0:59), and (z ;x )(z p ;x p ) 2 = 0:19. pairwise The optimal solution is (z q ;x q )=(0:11;1:00;0:08;0:69), and (z ;x )(z q ;x q ) 2 = 0:14. Although persp and pairwise substantially improve upon the ` 1 -relaxation, the resulting solu- tions are still not integral in z. We will give the convex hull of (2.10) in the next section. 
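As a quick sanity check on Proposition 8, the function $f$ can be evaluated directly: at points satisfying the complementarity constraints with binary $z$ it coincides with $(x_1-x_2)^2$, while at fractional values of the relevant indicator it can strictly exceed it, which is precisely what strengthens the pairwise relaxation (2.9). The snippet below is an illustrative evaluation only, using the $0/0=0$ convention from the notation.

def f_pairwise(z1, z2, x1, x2):
    """Function f(z, x) from Proposition 8, used to describe conv(X_2)."""
    num = (x1 - x2) ** 2
    den = z1 if x1 >= x2 else z2
    if num == 0.0:
        return 0.0                                # convention 0/0 = 0
    return num / den if den > 0.0 else float("inf")

# On feasible mixed-integer points (z binary and x_i = 0 whenever z_i = 0),
# f reduces to the original quadratic (x1 - x2)^2:
print(f_pairwise(1, 1, 0.3, 0.8))      # 0.25 = (0.3 - 0.8)^2
print(f_pairwise(1, 0, 0.7, 0.0))      # 0.49 = (0.7 - 0.0)^2
print(f_pairwise(0, 0, 0.0, 0.0))      # 0.0
# At a fractional value of the relevant indicator, f exceeds (x1 - x2)^2:
print(f_pairwise(1.0, 0.5, 0.3, 0.8))  # 0.5, since x1 < x2 and we divide by z2 = 0.5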
In this chapter, we show how to further improve thepairwise formulation to obtain a stronger relaxation of (2.6). Additionally, we show how to implement the relaxations derived in the chapter in a conic quadratic optimization framework. Therefore, the proposed convex relaxations benefit from a growing literature on conic quadratic optimization, e.g., see [6, 14, 24, 161, 179], can be implemented with off-the-shelf solvers, and scale to large instances. 21 2.3 Strong convex formulations for signal estimation In thepairwise formulation each single- and two-variable quadratic term is strengthened indepen- dently and, consequently, the formulation fails to fully exploit the relationships between different pairs of variables. Observe that problem (2.6) can be stated as kyk 2 2 + min 2y 0 x+ x 0 Qx (2.11a) s.t. x i (1 z i )= 0;i= 1:::;n; (2.11b) z2 C;x2R n + (2.11c) where, for i6= j, Q i j =l iffi; jg2 A and Q i j = 0 otherwise, and Q ii = 1+ljA i j where A i = f j :fi; jg2 Ag. In particular, Q is a symmetric M-matrix, i.e., Q i j 0 for i6= j and Q 0. In this section we derive convex relaxations of (2.6) that better exploit the M-matrix structure. We briefly review properties of M-matrices and refer the reader to [45, 106, 185, 219] and the references therein for an in-depth discussion on M-matrices. Proposition 9 (Plemmons [185], characterization 37). An M-matrix is generalized diagonally dom- inant, i.e., there exists a positive diagonal matrix D such that DQ is (weakly) diagonally dominant. Generalized diagonally dominant matrices are also called scaled diagonally dominant matrices in the literature. Proposition 10 (Boman et al. [56]). A matrix Q is generalized diagonally dominant iff it has factor width at most two, i.e., there exists a real matrix V nm such that Q= VV > and each column of V contains at most two non-zeros. Proposition 10 implies that if Q is an M-matrix, then the quadratic function x 0 Qx can be written as a sum of quadratic functions of at most two variables each, i.e., x 0 Qx =å m j=1 å n i=1 V i j x i 2 where for any j at most two entries V i j are non-zero. Therefore, to derive stronger formulations for (2.11), we first study the mixed-integer epigraphs of parametric pairwise quadratic functions with indicators. 22 2.3.1 Convexification of the parametric pairwise terms Consider the mixed-integer epigraph of a parametric pairwise quadratic term (with parameters d 1 ;d 2 ) Z 2 = n (z;x;s)2f0;1g 2 R 3 + : d 1 x 2 1 2x 1 x 2 + d 2 x 2 2 s; x i (1 z i )= 0; i= 1;2 o ; where d 1 d 2 1 and d 1 ;d 2 > 0, which is the necessary and sufficient condition for convexity of the function d 1 x 2 1 2x 1 x 2 + d 2 x 2 2 . One may, without loss of generality, assume the cross-product coefficient equals2, as otherwise the continuous variables and coefficients can be scaled. Clearly, if d 1 = d 2 = 1, then Z 2 reduces to X 2 . Consider the two decompositions of the two-variable quadratic function in the definition of Z 2 given by d 1 x 2 1 2x 1 x 2 + d 2 x 2 2 = d 1 x 1 x 2 d 1 2 + x 2 2 d 2 1 d 1 = d 2 x 1 d 2 x 2 2 + x 2 1 d 1 1 d 2 : Intuitively, the decompositions above are obtained by extracting a term d i x 2 i from the quadratic function such thatd i is as large as possible and the remainder quadratic term is still convex. 
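The two decompositions above are simple algebraic identities; the following quick numeric check, with assumed values of d_1, d_2 satisfying d_1 d_2 ≥ 1, is included only as a sanity test and is not part of the thesis code.

# Numeric sanity check (illustrative values) of the two decompositions of
# d1*x1^2 - 2*x1*x2 + d2*x2^2 stated above.
import numpy as np

d1, d2 = 2.0, 0.8                              # d1 * d2 = 1.6 >= 1, both positive
for x1, x2 in np.random.rand(5, 2):
    q = d1 * x1 ** 2 - 2 * x1 * x2 + d2 * x2 ** 2
    first = d1 * (x1 - x2 / d1) ** 2 + x2 ** 2 * (d2 - 1 / d1)
    second = d2 * (x1 / d2 - x2) ** 2 + x1 ** 2 * (d1 - 1 / d2)
    assert np.isclose(q, first) and np.isclose(q, second)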
Then, applyingpersp and Proposition 8 to the separable and pairwise quadratic terms, respectively, one obtains two valid inequalities for Z 2 : d 1 f(z 1 ;z 2 ;x 1 ; x 2 d 1 )+ x 2 2 z 2 d 2 1 d 1 s (2.12) d 2 f(z 1 ;z 2 ; x 1 d 2 ;x 2 )+ x 2 1 z 1 d 1 1 d 2 s: (2.13) 23 Clearly, there are infinitely many such decompositions depending on the values ofd i , i= 1;2. Sur- prisingly, Theorem 1 below shows that inequalities (2.12)–(2.13) along with the bound constraints are sufficient to describe conv(Z 2 ). Theorem 1. conv(Z 2 )= (z;x;s)2[0;1] 2 R 3 + : (2.12) (2.13) : Proof. Consider the mixed-integer optimization problem min (z;x;s)2Z 2 a 1 z 1 + a 2 z 2 + b 1 x 1 + b 2 x 2 +ls (2.14) and the corresponding convex optimization min a 1 z 1 + a 2 z 2 + b 1 x 1 + b 2 x 2 +ls (2.15a) s.t. d 1 f(z 1 ;z 2 ;x 1 ; x 2 d 1 )+ x 2 2 z 2 d 2 1 d 1 s (2.15b) d 2 f(z 1 ;z 2 ; x 1 d 2 ;x 2 )+ x 2 1 z 1 d 1 1 d 2 s (2.15c) z2[0;1] 2 ; x2R 2 + ; s2R + : (2.15d) To prove the result it suffices to show that, for any value of (a;b;l), either (2.14) and (2.15) are both unbounded, or that (2.15) has an optimal solution that is also optimal for (2.14). We assume, without loss of generality, that d 1 d 2 > 1 (if d 1 d 2 = 1, the result follows from Proposition 8 by scaling), l > 0 (if l < 0, both problems are unbounded by letting s!¥, and if l = 0, problem (2.15) reduces to linear optimization over a integral polytope and optimal solutions are integral in z), and l = 1 (by scaling). Moreover, since d 1 d 2 > 1, there exists an optimal solution for both (2.14) and (2.15). Let (z ;x ;s ) be an optimal solution of (2.15); we show how to construct from (z ;x ;s ) a feasible solution for (2.14) with same objective value, thus optimal for both problems. Observe that for g 0, f(gz 1 ;gz 2 ;gx 1 ;gx 2 )=g f(z 1 ;z 2 ;x 1 ;x 2 ). Thus, if z 1 ;z 2 < 1, then (gz ;gx ;gs ) is also feasible for (2.15) with objective valueg(a 1 z 1 + a 2 z 2 + b 1 x 1 + b 2 x 2 + s ). In particular, either there exists an (integral) optimal solution with z = x = 0 by setting g = 0, or there exists an 24 optimal solution with one of the z variables equal to one by increasing g. Thus, assume without loss of generality that z 1 = 1. Now consider the optimization problem min a 2 z 2 + b 1 x 1 + b 2 x 2 + d 1 f(1;z 2 ;x 1 ; x 2 d 1 )+ x 2 2 z 2 d 2 1 d 1 (2.16a) z 2 2[0;1]; x2R 2 + ; (2.16b) obtained from (2.15) by fixing z 1 = 1, dropping constraint (2.15c), and eliminating variable s since (2.15b) holds at equality in optimal solutions. An integer optimal solution for (2.16) is also optimal for (2.14) and (2.15). Let(ˆ z; ˆ x) be an optimal solution for (2.16), and consider the two cases: Case 1: ˆ x 1 ˆ x 2 =d 1 : If 0< ˆ z 2 < 1, then the point (g ˆ z 2 ;g ˆ x 1 ;g ˆ x 2 ) with 0g ˆ z 2 1 is feasible for (2.16) with objective value g a 2 ˆ z 2 + b 1 ˆ x 1 + b 2 ˆ x 2 + d 1 f(1; ˆ z 2 ; ˆ x 1 ; ˆ x 2 d 1 )+ ˆ x 2 2 ˆ z 2 d 2 1 d 1 : Therefore, there exists an optimal solution where ˆ z 2 2f0;1g. Case 2: ˆ x 1 > ˆ x 2 =d 1 : In this case,(ˆ z 2 ; ˆ x 1 ; ˆ x 2 ) is an optimal solution of min a 2 z 2 + b 1 x 1 + b 2 x 2 + d 1 x 1 x 2 d 1 2 + x 2 2 z 2 d 2 1 d 1 (2.17a) z 2 2[0;1]; x2R 2 + : (2.17b) The condition ˆ x 1 > ˆ x 2 =d 1 implies that ˆ x 1 > 0, thus the optimal value of x 1 can be found by taking derivatives and setting to 0. 
We find ˆ x 1 = b 1 2d 1 + x 2 d 1 25 Replacing x 1 with his optimal value in (2.17) and removing constant terms, we find that (2.17) is equivalent to min a 2 z 2 + b 1 d 1 + b 2 x 2 + x 2 2 z 2 d 2 1 d 1 (2.18a) z 2 2[0;1]; x 2 2R + : (2.18b) If 0< ˆ z 2 < 1, then the point(g ˆ z 2 ;g ˆ x 2 ) with 0g ˆ z 2 1 is feasible for (2.18) with objective value g a 2 ˆ z 2 + b 1 d 1 + b 2 ˆ x 2 + ˆ x 2 2 ˆ z 2 d 2 1 d 1 : Therefore, there exists an optimal solution where ˆ z 2 2f0;1g. In both cases we find an optimal solution with z 2 2f0;1g. Thus, problem (2.15) has an optimal solution integral in both z 1 and z 2 , which is also optimal for (2.14). Example 1 (continued). The relaxation of (2.10) with only inequality (2.13): 1:16+ min 0:8x 1 2x 2 + 0:5(z 1 + z 2 )+ 0:5s s.t. 3 f(z 1 ;z 2 ; x 1 3 ;x 2 )+ x 2 1 z 1 3 1 3 s z2[0;1] 2 ;x2R 2 + ; is sufficient to obtain the integral optimal solution. Note that the big-M constraints x i z i are not needed. Given d 1 ;d 2 2R + , define the function g :[0;1] 2 R 2 + !R + as g(z 1 ;z 2 ;x 1 ;x 2 ;d 1 ;d 2 )= max ( d 1 f(z 1 ;z 2 ;x 1 ; x 2 d 1 )+ x 2 2 z 2 d 2 1 d 1 ; d 2 f(z 1 ;z 2 ; x 1 d 2 ;x 2 )+ x 2 1 z 1 d 1 1 d 2 ) (2.19) 26 For any d 1 ;d 2 > 0 with d 1 d 2 1, function g is the point-wise maximum of two convex functions and is therefore convex. Using the convex function g, Theorem 1 can be restated as conv(Z 2 )= (z;x;s)2[0;1] 2 R 3 + : g(z 1 ;z 2 ;x 1 ;x 2 ;d 1 ;d 2 ) s : Finally, it is easy to verify that if z 1 z 2 , then the maximum in (2.19) corresponds to the first term; if z 1 z 2 , the maximum corresponds to the second term. Thus, an explicit expression of g is g(z;x;d)= 8 > > > > > > > > > > < > > > > > > > > > > : d 1 x 2 1 2x 1 x 2 +x 2 2 =d 1 z 1 + x 2 2 z 2 d 2 1 d 1 if z 1 z 2 and d 1 x 1 x 2 d 1 x 2 1 2x 1 x 2 +d 2 x 2 2 z 2 if z 1 z 2 and d 1 x 1 x 2 d 1 x 2 1 2x 1 x 2 +d 2 x 2 2 z 1 if z 1 z 2 and x 1 d 2 x 2 x 2 1 =d 2 2x 1 x 2 +d 2 x 2 2 z 2 + x 2 1 z 1 d 1 1 d 2 if z 1 z 2 and x 1 d 2 x 2 : 2.3.2 Convex relaxations for general M-matrices Consider the set Z n = (z;x;t)2f0;1g n R n+1 + : x 0 Qx t; x i (1 z i )= 0;i= 1;:::;n ; where Q is an M-matrix. In this section, we will show how the convex hull descriptions for Z 2 can be used to construct strong convex relaxations for Z n . We start with the following motivating example. 27 Example 2. Consider the signal estimation in regularized form with n= 3,(y 1 ;y 2 ;y 3 )=(0:3;0:7;1:0), l = 1 andm = 0:5, z = 1:58+ min 0:6x 1 1:4x 2 2:0x 3 +t+ 0:5(z 1 + z 2 + z 3 ) (2.20a) s.t. x 2 1 + x 2 2 + x 2 3 +(x 1 x 2 ) 2 +(x 2 x 3 ) 2 t (2.20b) x i z i ; i= 1;2;3 (2.20c) z2f0;1g 3 ; x2R 3 + : (2.20d) The optimal solution of (2.20) is(z ;x )=(0:00;1:00;1:00;0:00;0:48;0:74) with objective value z = 1:504. The optimal solutions and the corresponding objective values of the convex relaxations of (2.20) are as follows: ` 1 -approx The opt. solution is(z ` ;x ` )=(0:24;0:43;0:59;0:24;0:43;0:59) with valuez ` 1 -approx = 0:936, andk(z ;x )(z ` ;x ` )k 2 = 0:80. persp The opt. solution is (z p ;x p ) = (0:00;0:40;0:82;0:00;0:29;0:58) with value z persp = 1:413, and (z ;x )(z p ;x p ) 2 = 0:67. pairwise The opt. solution(z q ;x q )=(0:18;0:74;1:00;0:13;0:43;0:71) with valuez pairwise = 1:488, and (z ;x )(z q ;x q ) 2 = 0:35. decomp.1 The quadratic constraint (2.20b) can be decomposed and strengthened as follows: 2x 2 1 2x 1 x 2 + x 2 2 + 2x 2 2 2x 2 x 3 + 2x 2 3 t ! 
g(z 1 ;z 2 ;x 1 ;x 2 ;2;1)+ g(z 2 ;z 3 ;x 2 ;x 3 ;2;2) t; leading to solution is (z d ;x d ) = (0:17;1:00;0:93;0:12;0:53;0:73) with value z decomp.1 = 1:495, andk(z ;x )(z d ;x d )k 2 = 0:23. 28 decomp.2 Alternatively, constraint (2.20b) can also be formulated as g(z 1 ;z 2 ;x 1 ;x 2 ;2;2)+g(z 2 ;z 3 ; x 2 ;x 3 ;1;2) t, and the resulting convex relaxation has solution(z ;x )=(0:00;1:00;1:00; 0:00;0:48;0:74), corresponding to the optimal solution of (2.20). As Example 2 shows, strong convex relaxations of Z n can be obtained by decomposing x 0 Qx into sums of two-variable quadratic terms (as Q is an M-matrix) and convexifying each term. However, such a decomposition is not unique and the strength of the relaxation depends on the decomposition chosen. We now discuss how to optimally decompose the matrix Q to derive the strongest lower bound possible for a fixed value of(z;x;t). Then, we show how this decomposition procedure can be embedded in a cutting surface algorithm to obtain a strong convex relaxation of (2.11). Consider the separation problem: given a point(z;x;t)2[0;1] n R n+1 + , find a decomposition of Q such that, after strengthening each two-variable term, results in a most violated inequality, which is formulated as follows: q(z;x)= max d n å i=1 n å j=i+1 jQ i j jg(z i ;z j ;x i ;x j ;d i i j ;d j i j ) (2.21a) s.t. å ji jQ i j jd i i j = Q ii 8i= 1;:::;n (2.21b) d i i j d j i j 1; d i i j 0; d j i j 0 8i< j: (2.21c) Observe that the variables of the separation problem (2.21) are the parameters d, and the variables of the estimation problem(z;x) are fixed in the separation problem. In formulation (2.21) for each (negative) entry Q i j , i< j, there is a two-variable quadratic term of the formjQ i j j d i i j x 2 i 2x i x j + d j i j x 2 j ; after convexifying each such term, one obtains the objective (2.21a). Constraints (2.21b) ensure that the decomposition indeed corresponds to the original matrix Q by ensuring that the diago- nal elements coincide, and constraints (2.21c) ensure that each quadratic term is convex. From Proposition 10, problem (2.21) is feasible for any M-matrix Q. 29 For any feasible value of d, the objective (2.21a) is convex in (z;x); thus the function q : [0;1] n R n + !R + defined in (2.21) is a supremum of convex functions and is convex itself. More- over, the constraints (2.21b) and (2.21c) are linear or rotated cone constraints, thus, are convex in d. As we now show, the objective function (2.21a) is concave in d, thus (2.21) is a convex optimization. Index the variables such that z 1 z 2 ::: z n . Then, each term in the objective (2.21a) reduces to g(z i ;z j ;x i ;x j ;d i i j ;d j i j )= 8 > > < > > : d i i j x 2 i 2x i x j +x 2 j =d i i j z i + x 2 j z j d j i j 1 d i i j if d i i j x i x j d i i j x 2 i 2x i x j +d j i j x 2 j z j if d i i j x i x j = d i i j x 2 i z i +d j i j x 2 j z j + 8 > > < > > : 2x i x j z i x 2 j d i i j 1 z j 1 z i if d i i j x i x j 2x i x j z j + d i i j x 2 i 1 z j 1 z i if d i i j x i x j : Thus, g(z;x;d) is separable in d i i j and d j i j , is linear in d j i j ; and, it is linear in d i i j for d i i j x j =x i , and concave for d i i j x j =x i . Moreover, it is easily shown that it is continuous and differentiable (i.e., the derivatives of both pieces of g with respect to d i i j coincide if d i i j x i = x j ). Therefore, the separation problem (2.21) can be solved in polynomial time by first sorting the variables z i and then by solving a convex optimization problem. 
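As an illustration of the objects involved in (2.21), the sketch below (not the implementation used later in the chapter) evaluates the convexified pairwise function g of (2.19) and runs a coarse grid search for the parameters of a single pairwise term; the full separation problem additionally couples the pairs through constraints (2.21b), a role played here only crudely by the bound dmax.

# Illustrative sketch (not the thesis implementation) of the pieces behind the
# separation problem (2.21): the function g of (2.19) and a coarse grid search
# over the parameters (d1, d2) of one pairwise term.
import numpy as np

def persp(u, z):                      # closure of the perspective u^2 / z
    if z > 0:
        return u * u / z
    return 0.0 if u == 0 else np.inf

def f(z1, z2, x1, x2):                # Proposition 8
    return persp(x1 - x2, z1) if x1 >= x2 else persp(x1 - x2, z2)

def g(z1, z2, x1, x2, d1, d2):        # equation (2.19): max of inequalities (2.12)-(2.13)
    first = d1 * f(z1, z2, x1, x2 / d1) + persp(x2, z2) * (d2 - 1 / d1)
    second = d2 * f(z1, z2, x1 / d2, x2) + persp(x1, z1) * (d1 - 1 / d2)
    return max(first, second)

def best_pairwise_params(z1, z2, x1, x2, dmax=3.0, grid=200):
    """Coarse search for (d1, d2) with d1*d2 >= 1; dmax stands in for the budget (2.21b)."""
    best, arg = -np.inf, None
    for d1 in np.linspace(0.05, dmax, grid):
        for d2 in np.linspace(0.05, dmax, grid):
            if d1 * d2 < 1:
                continue
            val = g(z1, z2, x1, x2, d1, d2)
            if val > best:
                best, arg = val, (d1, d2)
    return best, arg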
The separation procedure can be embedded in an algorithm that iteratively constructs stronger relaxations of problem (2.11). Simple cutting surface algorithm: 1. Solve a valid convex relaxation. 2. Solve separation problem (2.21) using a convex optimization method. 3. Add the inequality obtained from solving the separation problem to the formulation, strength- ening the relaxation, and go to step 1. Below, we illustrate thesimple cutting surface algorithm. 30 Example 2 (Continued). Consider thepersp relaxation z 1 = 1:58+ min 0:6x 1 1:4x 2 2:0x 3 +t+ 0:5(z 1 + z 2 + z 3 ) (2.22a) s.t. x 2 1 z 1 + x 2 2 z 2 + x 2 3 z 3 +(x 1 x 2 ) 2 +(x 2 x 3 ) 2 t (2.22b) x i z i ; i= 1;2;3 (2.22c) z2[0;1] 3 ; x2R 3 + : (2.22d) with optimal solution(z;x) 1 =(0:00;0:40;0:82;0:00;0:29;0:58) with z 1 = 1:413 and (z ;x ) (z;x) 1 2 = 0:67. This relaxation can be improved by solving the separation problem (2.21) at (z;x) 1 to obtain the optimal parameters d 1 12 = 2:00, d 2 12 = 0:51, d 2 23 = 2:49 and d 3 23 = 2:00, leading to the decomposition and the constraint g(z 1 ;z 2 ;x 1 ;x 2 ;2:00;0:51)+ g(z 2 ;z 3 ;x 2 ;x 3 ;2:49;2:00) t: Adding this constraint to (2.22) and resolving gives the improved solution(z;x) 2 =(0:15;0:70;1:00; 0:12;0:43;0:71). This process can be repeated iteratively, resulting in the sequence of solutions iter.2 (z;x) 2 = (0:15;0:70;1:00;0:12;0:43;0:71) with z 2 = 1:452 andk(z ;x )(z;x) 2 k 2 = 0:36. The corresponding separation problem has solution(d 1 12 ;d 2 12 ;d 2 23 ;d 3 23 )=(2;1:06;1:94;2). iter.3 (z;x) 3 = (0:14;1:00;1:00;0:10;0:52;0:75) with z 3 = 1:499 andk(z ;x )(z;x) 3 k 2 = 0:18. The corresponding separation problem has solution(d 1 12 ;d 2 12 ;d 2 23 ;d 3 23 )=(2;2:5;0:5;2). iter.4 (z;x) 4 =(0:00;1:00;1:00;0:00;0:48;0:74) withz 3 = 1:504. The solution is integral and optimal for (2.20). The iterative separation procedure outlined above ensures that (z;x;t) satisfies the convex re- laxation Q= (z;x;t)2[0;1] n R n+1 + :q(z;x) t 31 of Z n that dominates the ` 1 -approx, persp, and pairwise and gives the strong relaxation of problem (2.11), based on the optimaldecomposition of matrix Q, given by (decomp) kyk 2 2 + min (z;x)2[0;1] n R n + 2y 0 x+q(z;x): z2 ¯ C; x i kyk ¥ z i ; i= 1:::;n: In Section 2.4 we discuss the efficient implementation ofdecomp in a conic quadratic optimization framework. 2.4 SOCP representation and Lagrangian decomposition Relaxation decomp simultaneously exploits sparsity, fitness and smoothness terms in (2.3) and, therefore, dominates all of the relaxations discussed in Section 2.2. However, the convex functions f and g can be pathological, as they are defined by pieces and are not differentiable everywhere. Handling functionq is challenging as it is non-differentiable, but also it is not given in closed form and requires solving optimization problem (2.21) to evaluate. In this section, we first show how to tackle decomp effectively by formulating it as a second- order cone (SOCP) in an extended space. We then give a tailored Lagrangian decomposition method, which is amenable to parallel computing and highly scalable. 
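Before moving to the extended formulation, the loop just illustrated in Example 2 can be summarized schematically. In the sketch below all callables are placeholders (for instance, solve_relaxation could solve the persp relaxation plus the accumulated cuts, and solve_separation could solve (2.21)); none of the names are taken from the thesis code.

# Schematic skeleton of the simple cutting surface algorithm; solve_relaxation and
# solve_separation are placeholder callables supplied by the user, not thesis code.
def cutting_surface(Q, y, solve_relaxation, solve_separation, max_rounds=50, tol=1e-4):
    cuts = []                                            # decompositions d found so far
    z, x, t = solve_relaxation(Q, y, cuts)               # step 1: e.g., start from persp
    for _ in range(max_rounds):
        d, violation = solve_separation(Q, z, x, t)      # step 2: most violated decomposition
        if violation <= tol:                             # current point satisfies theta(z, x) <= t
            break
        cuts.append(d)                                   # step 3: add sum_ij |Q_ij| g(.; d_ij) <= t
        z, x, t = solve_relaxation(Q, y, cuts)           # re-solve the strengthened relaxation
    return z, x, t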
2.4.1 Extended formulations The simple cutting surface algorithm to solve decomp, illustrated in Example 2, is com- putationally cumbersome since: (i) the separation problem (step 2) requires solving a constrained convex optimization problem; (ii) each cut added (step 3) is dense (and thus problematic for opti- mization software); (iii) a single cut is generated at each iteration; consequently, the method may require many iterations to converge. In this section, we show how to address these shortcomings with a conic quadratic extended formulation with auxiliary variables. In particular: 32 (i) The separation problem can be solved in closed form (Proposition 16)– eliminating the need to solve auxiliary optimization problems. (ii) Cuts (2.33) can be added as inequalities with at most four variables. Most conic quadratic optimization solvers are designed to exploit sparsity to improve performance and numerical stability. (iii) The method may add up toO(n 2 ) cuts per round, decreasing the total number of rounds and re-optimizations. More importantly, simple cuts (e.g., sparse or linear cuts) in this ex- tended space may translate into highly nonlinear cuts when projected into the original space of variables, often resulting in additional strength. Indeed, MIO often relies on extended formulations as such formulations lead to substaintial improvements compared to working in the original space of variables [12, 26, 212, 220]. The proposed extended formulation leads to a method at least two orders-of-magnitude faster than thesimple cutting surface algorithm. Define additional variablesG2R nn such thatG i j =G ji ; intuitively, variableG i j represents the product x i x j . Given an M-matrix Q, consider the convex optimization problem min (z;x;G) kyk 2 2 2y 0 x+hG;Qi (2.23a) s.t.G ii z i x 2 i 8i= 1;:::;n (2.23b) 0 max d i j >0 d i j f(z i ;z j ;x i ; x j d i j ) d i j G ii 2G i j + 1 d i j G j j 8i< j (2.23c) 0 x i kyk ¥ z i i= 1;:::;n (2.23d) z2 ¯ C; x2R n + ; G2R nn : (2.23e) 33 We will show in this section that problem (2.23) is equivalent todecomp under mild conditions, and can be implemented efficiently via conic quadratic optimization. In order to prove this result, we introduce the auxiliary formulation: min (z;x;G) kyk 2 2 2y 0 x+hG;Qi (2.24a) s.t. 0 max d i i j d j i j 1 d i i j ;d j i j 0 g(z i ;z j ;x i ;x j ;d i i j ;d j i j ) d i i j G ii 2G i j + d j i j G j j 8i< j (2.24b) 0 x i kyk ¥ z i i= 1;:::;n (2.24c) z2 ¯ C; x2R n + ; G2R nn : (2.24d) We first prove that (2.24) is equivalent to decomp (Proposition 12), and then show that (2.23) and (2.24) are equivalent (Proposition 13). Before doing so, let us verify that (2.23)–(2.24) are indeed relaxations of (2.11). Proposition 11. Problems (2.23)–(2.24) are valid convex relaxations of (2.11). Proof. We only prove this result for (2.24); the proof for (2.23) follows from identical arguments and is omitted for brevity. First we argue convexity of (2.24). Clearly, the objective (2.24a) is linear and constraints (2.24d) are convex. Moreover, the right hand sides of constraints (2.24b) are supremum of convex functions, thus convex. Now we argue that (2.24) is indeed a relaxation of (2.11). Suppose that constraintsG i j = x i x j and z2 C are added to (2.24): thenhG;Qi= x 0 Qx and the objective functions of (2.11) and (2.24) coincide. 
Moreover, for any nonnegative d i i j ;d j i j such that d i i j d j i j 1, we see that g(z i ;z j ;x i ;x j ;d i i j ;d j i j ) d i i j x 2 i 2x i x j + d j i j x 2 j (Theorem 1 – validity) = d i i j G ii 2G i j + d j i j G j j ; (G i j = x i x j ) 34 thus inequalities (2.24b) are satisfied. So, if constraintsG i j = x i x j and z2 C are added, (2.24) is equivalent to (2.11). Hence, (2.24) is a relaxation of (2.11). Proposition 12. If Q is a positive definite M-matrix, then problems decomp and (2.24) are equiv- alent. Proof. Consider the variableG i j in (2.24) for some pair i< j: observe that it only appears in the objective with coefficient Q i j 0, and a single constraint (2.24b). It follows that in an optimal solution of (2.24), variableG i j is as large as possible and the corresponding constraint (2.24b) is binding: 2G i j = max d i i j d j i j 1 d i i j ;d j i j 0 g(z i ;z j ;x i ;x j ;d i i j ;d j i j ) d i i j G ii + d j i j G j j : Therefore, we find that problem (2.24) is equivalent to min z2 ¯ C;x2R n + ;G2R n max d kyk 2 2 2y 0 x+ n å i=1 Q ii G ii + n å i=1 n å j=i+1 jQ i j j g(z i ;z j ;x i ;x j ;d i i j ;d j i j ) d i i j G ii d j i j G j j (2.25a) s.t. d i i j d j i j 1;d i i j 0; d j i j 0 8i< j: (2.25b) Rearranging terms, we see that the objective of the inner maximization problem (2.25a) is equal to n å i=1 Q ii å ji jQ i j jd i i j ! G ii + n å i=1 n å j=i+1 jQ i j jg(z i ;z j ;x i ;x j ;d i i j ;d j i j ); where we ignored the constant (in d) termkyk 2 2 2y 0 x. In particular, the inner maximization problem is precisely the Lagrangian relaxation of (2.21), whereG ii are the dual variables associated with constraints (2.21b). Therefore, if strong duality holds for problem (2.21), then problems decomp and (2.24) are equivalent. 35 Finally, we verify that Slater’s condition and, thus, strong duality for (2.21) hold for positive definite Q. Since Q is positive definite, we have that Q= ¯ Q+rI for an M-matrix ¯ Q (with same off- diagonals) and somer> 0 (e.g., letr be the minimum eigenvalue of Q). Since ¯ Q is an M-matrix, there exists a vectord satisfying å ji jQ i j jd i i j = Q ii r< Q ii 8i= 1;:::;n d i i j d j i j 1; d i i j 0; d j i j 0 8i< j: It follows that letting d i i j =d i i j +e and d j i j =d j i j +e for all i< j ande> 0 small enough, we find a vector d such that d i i j d j i j > 1 and å ji jQ i j jd i i j Q ii 8i= 1;:::;n: (2.26) After increasing additional entries of d until all inequalities (2.26) are tight, we find an interior point of (2.21). For the signal estimation problem, Q is positive-definite. Nonetheless, if strong duality does not hold, formulation (2.24) is still a convex relaxation of (2.11) that is at least as strong asdecomp. Proposition 13. Problems (2.23) and (2.24) are equivalent. Proof. For any i< j, we see from Theorem 1 that constraint (2.24b) is equivalent to the pair of constraints: 0 max d i i j d j i j 1 d i i j ;d j i j 0 d i i j f(z i ;z j ;x i ; x j d i i j )+ x 2 j z j d j i j 1 d i i j ! d i i j G ii + 2G i j d j i j G j j (2.27) 0 max d i i j d j i j 1 d i i j ;d j i j 0 d j i j f(z i ;z j ; x i d j i j ;x j )+ x 2 i z i d i i j 1 d j i j ! d i i j G ii + 2G i j d j i j G j j : (2.28) 36 Observe that d i i j f(z i ;z j ;x i ; x j d i i j ) 0 for any d i i j > 0. Therefore, if x 2 j z j >G j j , constraint (2.27) is not satisfied since the right hand side can be made arbitrarily large by letting d j i j !¥ and d i i j = 1=d j i j . Therefore, constraint (2.27) implies thatG j j z j x 2 j . 
Similarly, (2.28) implies thatG ii z i x 2 i . Now assume thatG j j z j x 2 j hold for all j= 1;:::;n. In this case, for any optimal solution of the maximization problem (2.27) we find that d j i j is as small as possible; that is, d j i j = 1=d i i j . Thus, ifG j j z j x 2 j holds, then constraint (2.27) reduces to 0 max d i i j >0 d i i j f(z i ;z j ;x i ; x j d i i j ) d i i j G ii + 2G i j 1 d i i j G j j ; (2.29) which is precisely constraint (2.23c). Moreover, ifG ii z i x 2 i holds, then constraint (2.28) reduces to 0 max d j i j >0 d j i j f(z i ;z j ; x i d j i j ;x j ) 1 d j i j G ii + 2G i j d j i j G j j : (2.30) After a change of variable d i i j = 1=d j i j and noting that(1=d i i j ) f(z i ;z j ;d i i j x i ;x j )= d i i j f(z i ;z j ;x i ;x j =d i i j ), we conclude that (2.30) is equivalent to (2.29). Remark 1. Note that constraints (2.23c)–(2.24b) are necessary only if Q i j 6= 0. For the signal estimation problem (2.3), Q i j = 0 forfi; jg62 A. Thus, the methods developed here are particularly efficient when Q is sparse. 2.4.2 Implementation via conic quadratic optimization The objectives (2.23a) and (2.24a) are linear, and constraints (2.23b) are rotated cone constraints, and thus can be handled directly by conic quadratic optimization solvers. In Section 2.4.2.1, we show how constraints (2.23c) and (2.24b) can be reformulated as a conic constraints for a fixed value of d i j . Then we describe, in Section 2.4.2.2, a cutting plane method for implementing (2.23). 37 2.4.2.1 Conic quadratic reformulation of functions f and g We now show how to formulate convex models involving functions f and g as conic quadratic optimization problems. Specifically, we show how to model the epigraph of functions f and g in Propositions 14 and 15, respectively. Proposition 14 (Extended formulation of conv(X 2 )). A point(z;x;s) 2 conv(X 2 ) if and only if (z;x;s)2 [0;1] 2 R 3 + and there exists v;w2R such that the set of inequalities v x 1 x 2 ; v 2 sz 1 ; w x 2 x 1 ; w 2 sz 2 (2.31) are satisfied. Proof. Suppose, without loss of generality, that x 1 x 2 and that (z;x) satisfies the bound con- straints. If (z;x;s)2 conv(X 2 ) then (x 1 x 2 ) 2 z 1 s; setting v= x 1 x 2 and w= 0, we find a fea- sible solution for (2.31). Conversely, if (2.31) is feasible, then (x 1 x 2 ) 2 z 1 v 2 z 1 s and (z;x;s)2 conv(X 2 ). Proposition 15 (Extended formulation of conv(Z 2 )). A point(z;x;s) 2 conv(Z 2 ) if and only if(z;x;s)2[0;1] 2 R 3 + and there exists s 1 ;s 2 ;q 1 ;q 2 2R + and v 1 ;v 2 ;w 1 ;w 2 2 R + such that the set of inequalities x 2 1 s 1 z 1 ; x 2 2 s 2 z 2 (persp) d 1 v 1 d 1 x 1 x 2 ; v 2 1 q 1 z 1 (z 1 z 2 and d 1 x 1 x 2 ) d 1 v 2 d 1 x 1 + x 2 ; v 2 2 q 1 z 2 (z 1 z 2 and d 1 x 1 x 2 ) d 1 q 1 + s 2 d 2 1 d 1 s (z 1 z 2 ) d 2 w 1 x 1 d 2 x 2 ; w 2 1 q 2 z 1 (z 1 z 2 and x 1 d 2 x 2 ) d 2 w 2 x 1 + d 2 x 2 ; w 2 2 q 2 z 2 (z 1 z 2 and x 1 d 2 x 2 ) d 2 q 2 + s 1 d 1 1 d 2 s (z 1 z 2 ) 38 are satisfied. Proof. Follows from using the system (2.31) with inequalities (2.12)–(2.13). 2.4.2.2 Improved cutting surface method Our implementation of the strong relaxation decomp is based on formulation (2.23), implemented in a cutting surface method. Consider the relaxation of (2.23) given by min (z;x;G) kyk 2 2 2y 0 x+hG;Qi (2.32a) s.t.G ii z i x 2 i 8i= 1;:::;n (2.32b) 0 d f(z i ;z j ;x i ; x j d ) dG ii 2G i j + 1 d G j j 8i< j;8d2D i j (2.32c) 0 x i kyk ¥ z i i= 1;:::;n (2.32d) z2 ¯ C; x2R n + ; G2R nn ; (2.32e) where eachD i j is a finite subset ofR. 
From Proposition 14, each constraint (2.32c) can be formu- lated by introducing new variables s;v;w 0 as the system 0 ds dG ii 2G i j + 1 d G j j ; (2.33a) v x 1 x 2 d ; v 2 sz 1 ; w x 2 d x 1 ; w 2 sz 2 : (2.33b) Therefore, relaxation (2.32) can be solved using a conic quadratic solver. In the proposed cutting surface method, formulation (2.32) is iteratively refined by adding additional elements to setsD i j , as outlined in Algorithm 2. First, all setsD i j are initialized to the singletonf1g (line 1). At each iteration of the algorithm, a relaxation of the form (2.32) is solved to optimality (line 3). Then, for each pair of indexes i< j where the relaxation induced by (2.32c) is weak, the setD i j is enlarged to improve the relaxation (line 7); Remark 2 and Proposition 16 below show to efficiently check whether the relaxation needs to be refined and how to do so, respectively. 39 Algorithm 2 Algorithm to solve formulationdecomp Output: ( ˆ x; ˆ z; ˆ G) optimal fordecomp 1: D i j f1g for all i< j 2: while Stopping criterion not met do 3: ( ˆ x; ˆ z; ˆ G) Solve (2.32) 4: for all i< j do 5: if Constraint (2.23c) is not satisfied then 6: Compute optimal d i j for maximization (2.23c) . See Proposition 16 7: D i j D i j [fd i j g 8: end if 9: end for 10: end while 11: return( ˆ x; ˆ z; ˆ G) Proposition 16. For any i< j, the optimal solution of the inner maximization problem (2.23c) is obtained as follows: 1. If x 2 i =G ii x 2 j =G j j then: (a) IfG ii x 2 i =z i = 0, then d i j !¥ is optimal. (b) Otherwise, d i j = v u u t G j j x 2 j z i G ii x 2 i z i is optimal. 2. If x 2 i =G ii x 2 j =G j j then: (a) IfG ii x 2 i =z j = 0, then d i j !¥ is optimal. (b) Otherwise, d i j = v u u u t G j j x 2 j z j G ii x 2 i z j is optimal. Proof. Suppose x 2 i =G ii x 2 j =G j j . Note thatG ii x 2 i =z i holds from constraints (2.23b). Moreover, we find that G j j x 2 j z i G ii x 2 j x 2 i x 2 j z i = x 2 j G ii x 2 i 1 z i 0: 40 We now show that there exists a stationary point of (2.23c) satisfying d i j x i x j . In this case, optimization problem (2.23c) reduces to 0 max d i j >0 d i j x 2 i 2x i x j + x 2 j =d i j z i d i j G ii 2G i j + 1 d i j G j j , 0 2 G i j x i x j z i + max d i j >0 ( d i j G ii x 2 i z i 1 d i j G j j x 2 j z i !) : (2.34) IfG ii x 2 i =z i = 0, then d i j !¥ is an optimal solution to (2.34). IfG j j x 2 j =z i = 0, then d i j = 0 is optimal. Moreover, if bothG ii x 2 i =z i > 0 andG j j x 2 j =z i > 0, then taking derivatives with respect to d i j we find that d i j = v u u u t G j j x 2 j z i G ii x 2 i z i (2.35) Finally, we verify that the condition d i j x i x j holds. Indeed, this condition reduces to 0 @ G j j x 2 j z i G ii x 2 i z i 1 A x 2 i x 2 j , G j j x 2 j z i ! x 2 i G ii x 2 i z i x 2 j , x 2 i G ii x 2 j G j j ; which is satisfied. The proof for the case x 2 i =G ii x 2 j =G j j is analogous. Remark 2. By replacing d i j with its optimal value in (2.23c), we find that this constraint can be written explicitly as the piecewise constraint 0 8 > > > > < > > > > : G i j x i x j z i s G ii x 2 i z i G j j x 2 j z i if x 2 i G ii x 2 j G j j G i j x i x j z j s G ii x 2 i z j G j j x 2 j z j if x 2 i G ii x 2 j G j j : (2.36) However, constraint (2.36) is not conic quadratic. Remark 3. In our computations, we use the following stopping criterion in line 3. Let z old and z new be the optimal objective value of the relaxation (line 3). The algorithm is terminated when the relative improvement of the relaxation(z new z old )=z new 5 10 5 . 41 Remark 4. 
Using Proposition 15, one can extend the ideas discussed in this section to tackle (2.24) in a conic quadratic optimization framework as well. However, we prefer formulation (2.23) since the conic quadratic representation of function f is simpler and more compact. 2.4.3 Lagrangian methods for estimation with regularized objective The cutting surface method introduced in Section 2.4.2 requires solving a sequence of progres- sively larger conic quadratic optimization problems. Based on our computations, this method can handle a variety of constraints (encoded by set ¯ C), and solve the instances with n 10;000 within seconds. For better scalability, in this section, we develop a Lagrangian relaxation-based method for the estimation problem with regularization objective: min x;z ky xk 2 2 +l n1 å i=1 (x i+1 x i ) 2 +m n å i=1 z i (2.37a) s.t. 0 x i kyk ¥ z i i= 1;:::;n (2.37b) x2R n + ; z2f0;1g n (2.37c) where m 0 is a regularization parameter controlling the sparsity of the target signal. Let L= f` 1 ;:::;` m ;` m+1 gf1;:::;ng be any subset of the indexes such that 1=` 1 <:::<` m <` m+1 = n+ 1. With the introduction of additional variables w j = x ` j x ` j 1 , problem (2.37) can be equiv- alently written as min x;z m å j=1 ` j+1 1 å i=` j (y i x i ) 2 +l ` j+1 2 å i=` j (x i+1 x i ) 2 +m ` j+1 1 å i=` j z i ! + m å j=2 w 2 j (2.38a) s.t. w j = x ` j x ` j 1 j= 2;:::;m (2.38b) 0 x i kyk ¥ z i i= 1;:::;n (2.38c) x2R n + ; z2f0;1g n ; w2R m1 : (2.38d) 42 Without the coupling constraints (2.38b), problem (2.38) decomposes into m independent prob- lems, each with variables indexed in [` j ;` j+1 1] for j = 1;:::;m. Letting g j be the Lagrange multiplier for constraint w j = x ` j x ` j 1 , we obtain the Lagrangian dual problem max g2R m1 min x;z;w m å j=1 ` j+1 1 å i=` j (y i x i ) 2 +l ` j+1 2 å i=` j (x i+1 x i ) 2 +m ` j+1 1 å i=` j z i + m å j=2 w 2 j +g j (w j x ` j + x ` j 1 ) (2.39a) s.t. 0 x i kyk ¥ z i i= 1;:::;n (2.39b) x2R n + ; z2f0;1g n ; w2R m1 : (2.39c) Observe that w j =g j =2 holds for an optimal solution of the inner minimization problem. More- over, to obtain a strong convex relaxation, we can reformulate each independent inner minimization problem using the formulations discussed in Section 2.3.2, yielding the convex relaxation max g2R m1 min x;z m å j=1 ` j+1 1 å i=` j (y 2 i 2y i x i )+q j (z;x)+m ` j+1 1 å i=` j z i ! m å j=2 g 2 j 4 +g j (x ` j 1 x ` j ) (2.40a) s.t. 0 x i kyk ¥ z i i= 1;:::;n (2.40b) x2R n + ; z2[0;1] n ; (2.40c) whereq j (z;x) is the convexification of the epigraph of the term ` j+1 1 å i=` j x 2 i +l ` j+1 2 å i=` j (x i+1 x i ) 2 : Implementation Problem (2.40) can be solved via a primal-dual method: for any fixed g, the inner minimization problem can be solved by solving m independent sub-problems (in parallel), and each sub-problem is solved using Algorithm 2. We now describe our implementation of the “main” outer maximization problem. Set L Given a target number of subproblems m2Z + , we let` j = 1+( j1)bn=mc for j= 1;:::;m. 43 Subgradient method Given g2R m1 , a subgradient of the objective (2.40a) at g is given by x(g) j = g j 2 +(x ` j 1 x ` j ), where x is an optimal solution of the inner minimization prob- lem at g. Thus, letting g h be the value of g at iteration h2 Z + , we use the update rule g h+1 =g h +(1=h)x(g h ). Initial point We start the algorithm with the initial pointg 0 = 0. Note that if x ` j 1 = x ` j = 0 (which is the case for largem), theng j = 0 is optimal. 
Stopping criterion We terminate the algorithm whenkx(g h )k ¥ <e (e= 10 3 in our computations) or when the number of iterations reaches h max (h max = 100 in our computations). Additional considerations In the first iteration, we need to solve m subproblems. However, the subsequent iterations often require solving fewer subproblems: if x(g h ) j =x(g h+1 ) j and x(g h ) j+1 =x(g h+1 ) j+1 , then at iteration h+ 1 solution of subproblem j does not change from the previous iteration. For problem instances with largem, the number of subproblems solved in subsequent iterations reduces considerably. On Lagrangian methods for general adjacency graphs We point out that the tailored La- grangian relaxation-based decomposition can be applied to higher-dimensional adjacency graphs. For example, when the adjacency graph is a two-dimensional mesh, one can divide the mesh into rectangular regions and apply a similar decomposition approach. Theoretically, it is possible to use a similar method for arbitrary adjacency graphs. However, key to efficiency of the Lagrangian method, described in “additional considerations” above, is the observation that if the dual variables corresponding to the border of a subregion do not change, then subproblems need not be resolved. With higher-dimensional graphs, where borders are no longer defined by two points but by rect- angles or other objects, more subproblems may need to be resolved unless regions are defined carefully, depending on the data y. Developing such methods to partition the adjacency graph is beyond the scope of this chapter. 44 2.5 Computations In this section we present experiments with utilizing the strong convex relaxations based on the pairwise convexification methods proposed in the chapter. In Section 2.5.1, we perform ex- periments to evaluate whether the convex model decomp provides a good approximation to the non-convex problem (2.11). In Section 2.5.2 we test the merits of formulation decomp (with a variety of constraints ¯ C) compared to the usual ` 1 -approximation from an inference perspec- tive. Finally, in Section 2.5.3 we test the Lagrangian relaxation-based method proposed in Sec- tion 2.4.3. We use Mosek 8.1.0 (with default settings) to solve the conic quadratic optimiza- tion problems. All computations are performed on a laptop with eight Intel(R) Core(TM) i7- 8550 CPUs and 16GB RAM. All data and code used in the computations are available at https: //sites.google.com/usc.edu/gomez/data. 2.5.1 Relaxation quality This section is devoted to testing how well the proposed convex relaxations are able to approximate the` 0 optimization problems using real data. 2.5.1.1 Data Consider the accelerometer data depicted in Figure 2.2 (A), used in [69, 70] and downloaded from the UCI Machine Learning Repository [85]. The time series corresponds to the “x acceleration” of participant 2 of the “Activity Recognition from Single Chest-Mounted Accelerometer Dataset”. This participant was “working at computer” until time stamp 44,149; “standing up, walking and go- ing upstairs” until time stamp 47,349; “standing” from time stamp 47,350 to 58,544, from 80,720 to 90,439, and from time 90,441 to 97,199; “walking” from 58,545 to 80,719; “going up or down stairs” from 90,440 to 94,349; “walking and talking with someone” from 97,200 to 104,300; and “talking while standing” from 104,569 to 138,000 (status between 104,301 and 104,568 is un- known). 
Fig. 2.2: Underlying signals and noisy observations. (a) Original data. (b) Transformed data.

Several machine learning methods have been proposed to use accelerometer data to discriminate between activities, e.g., see [37] and the references therein. Variations of the acceleration can help to discriminate between activities [69]. Moreover, as pointed out in [227], behaviors can be identified (at a simplistic level) from frequencies and amplitudes of wave patterns in a single axis of the accelerometer. Therefore, we consider a rudimentary approach to identify activities from the accelerometer data: we partition the dataset into windows of 10 samples each, and for each window we compute the mean absolute value of the successive differences, obtaining the dataset plotted in Figure 2.2 (B) (see Footnote 2). Finally, we scale the data so that ‖y‖_∞ = 1. Given an optimal solution x* of the estimation problem (2.6) or a suitable relaxation of it, periods with little or no physical activity can be naturally associated with time stamps i where x*_i = 0, and values x*_i > 0 can be used as a proxy for the energy expenditure due to physical activity [202].

Footnote 2: One of the key features identified in [69] for activity recognition are the minmax sums of 52-sample windows, computed as the sums of successive differences of consecutive "peaks". The time series we obtain follows a similar intuition, but is larger and noisier due to smaller windows.

2.5.1.2 Methods

We compare the following two relaxations of the ℓ0-problem

min_{x ∈ R^n_+}  Σ_{i=1}^n (y_i − x_i)^2 + λ Σ_{i=1}^{n−1} (x_{i+1} − x_i)^2    (2.41a)
s.t.  Σ_{i=1}^n z_i ≤ k    (2.41b)
      x ≤ z,    (2.41c)
      z ∈ {0,1}^n.    (2.41d)

L1 The natural convex relaxation of (2.41), obtained by relaxing the integrality constraints to z ∈ [0,1]^n.

Decomp The convex model decomp, equivalently (2.23), implemented using Algorithm 2.

The convex formulations used are relaxations of the ℓ0 problem; so, their optimal objective values ζ_LB provide lower bounds on the optimal objective value ζ* of (2.41). We use a simple thresholding heuristic to construct a feasible solution for (2.41): for a given solution x̂ to a convex relaxation, let x̂_(k) denote the k-th largest value, and let x̄ be the solution given by

x̄_i = x̂_i if x̂_i ≥ x̂_(k), and x̄_i = 0 otherwise.

By construction x̄ is feasible for (2.41), and its objective value ζ_UB provides an upper bound on ζ*. Thus, the optimality gap of the heuristic is

gap = 100 · (ζ_UB − ζ_LB) / ζ_UB.    (2.42)

2.5.1.3 Results

We test the convex formulations with the accelerometer data using λ = 0.1t and k = 500t for t = 1,…,10, for all 100 combinations. Figure 2.3(A) presents the optimality gaps obtained by each method for each value of λ (averaging over all values of k), and Figure 2.3(B) presents the optimality gaps for each value of k (averaging over all values of λ). We see that decomp substantially improves upon the natural ℓ1 relaxation.
Indeed, the gaps from the ℓ1 relaxation are 66.7% on average, and can be very close to 100%; in contrast, the strong relaxations derived in this chapter yield optimality gaps of 0.4% on average.

Fig. 2.3: Optimality gaps of the ℓ1 relaxation (red) and the proposed convexification decomp (blue). (a) Gap as a function of λ. (b) Gap as a function of k.

Figure 2.4 presents the distribution of the time required for each method to solve the respective convex model. We see that the improvement in relaxation quality of the new relaxations comes at the cost of computational efficiency: while the ℓ1 relaxation is solved in approximately one second, the proposed convexification requires on average 54 seconds. Although the vast majority of the instances are solved under 100 seconds using Algorithm 2, a couple of instances require close to 10 minutes. Nonetheless, an average time of under a minute to solve the instances to near-optimality (less than 1% optimality gap) is adequate for most practical settings. Moreover, as shown in Section 2.5.3, the computation times can be improved substantially using the decomposition method proposed in Section 2.4.3.

Fig. 2.4: Distribution of CPU times for each method.

Finally, we present a brief comparison with mixed-integer optimization methods. Specifically, we use the perspective reformulation of (2.41), i.e.,

min_{x ∈ R^n_+}  Σ_{i=1}^n ( y_i^2 − 2 y_i x_i + x_i^2 / z_i ) + λ Σ_{i=1}^{n−1} (x_{i+1} − x_i)^2
s.t.  Σ_{i=1}^n z_i ≤ k,   x ≤ z,   z ∈ {0,1}^n,

and solve the problems using Gurobi 8.0 with a one hour time limit. Table 2.1 presents the results: it shows, for k ∈ {2000, 4000} and λ ∈ {0.1, 0.2}, the gaps and solution times for ℓ1-approx, decomp, and the branch-and-bound solver (b&b), as well as the number of branch-and-bound nodes explored. The mixed-integer optimizer is unable to solve to optimality instances of this size within the time limit of one hour. While Gurobi is able to produce good optimality gaps, improving substantially upon the simple ℓ1-relaxation (in part due to the use of the perspective reformulation), decomp produces optimality gaps that are an order of magnitude better and requires only a small fraction of the computational effort.

Tab. 2.1: Comparison with exact branch-and-bound method.

k     λ    | ℓ1-approx       | decomp          | b&b
           | gap    time(s)  | gap    time(s)  | gap    time(s)   nodes
2000  0.1  | 91.2   1        | 0.3    60       | 2.7    3,600     4,047
2000  0.2  | 87.0   1        | 0.6    42       | 7.4    3,600     1,445
4000  0.1  | 68.0   1        | 0.0    7        | 3.4    3,600     1,350
4000  0.2  | 56.7   1        | 0.1    19       | 2.8    3,600     4,268

2.5.2 Statistical performance – modeling with priors

In Section 2.5.1 we established that the convex model derived in this chapter indeed provides a much closer approximation of the ℓ0 signal estimation problem than the usual ℓ1 relaxation. In this section we demonstrate that using the proposed convexification leads to better statistical performance than relying on the ℓ1-approximation alone. We also show how additional priors other than sparsity can be seamlessly integrated into the new convex models, and the benefits of doing so.

2.5.2.1 Data

We now describe how we generate test instances. First, the "true" sparse signal ŷ is generated as follows. Let n be the number of time epochs, let s be a parameter controlling the number of "spikes" of the signal and let h be a parameter controlling the length of each spike. Initially, the true signal is fully sparse, ŷ = 0.
Then we iteratively repeat the following process to generate s spikes of non-zero values: 1. We select an index` uniformly between 1 and n+1h, corresponding to the start of a given spike. 2. We sample an h-dimensional vector v for a multivariate Gaussian distribution with mean 0 and covariance matrix B, where B i j = i(h+1 j) h+1 for i j. Thus v is a realization of a Brownian bridge process. 3. We update ˆ y `+i ˆ y `+i +jv i j. 50 Note that two different spikes may overlap, in which case the true signal ˆ y would have a single spike with larger intensity. Also note that the true signal ˆ y generated in this way has at most hs non-zeros and at most s spikes, but may have fewer if overlaps occur. Then, given a noise parameter s, we generate the noisy observations y i = ˆ y i +e i , wheree i follows a truncated normal distribution with mean 0, variances 2 i and lower bound ˆ y i . Finally, we scale the data so thatkyk ¥ = 1. 2.5.2.2 Methods We compare the following methods: L1 Corresponds to solving the` 1 -approx problem min x2R n + ky xk 2 2 +l n1 å i=1 (x i+1 x i ) 2 +mkxk 1 : Decomp-sparse Enforces the prior that the signal has a most hs non-zeros, by solving the convex optimization problem min x2R n + ;z2[0;1] n kyk 2 2 2 n å i=1 y i x i +q(z;x)+mkxk 1 s.t. n å i=1 z i hs 0 xkyk ¥ z using Algorithm 2. 51 Decomp-prior In addition to the sparsity prior as before, it incorporates the information that the underlying signal has a most s spikes and that each spike has at least h non-zeros. These two priors can be enforced by solving the optimization problem min x2R n + ;z2[0;1] n kyk 2 2 2 n å i=1 y i x i +q(z;x)+mkxk 1 (2.43a) s.t. n å i=1 z i hs (2.43b) n1 å i=1 jz i+1 z i j 2s (2.43c) minfn;`+hg å i=maxf1;`hg z i hz ` `= 1;:::;n (2.43d) 0 xkyk ¥ z: (2.43e) Constraint (2.43c) states that the process can transition from a zero value to a non-zero value at most 2s times, thus can have at most s spikes. Each constraint (2.43d) states that, if z ` = 1, then there must be at least h 1 neighboring non-zero points, thus non-zero indexes occur in patches of at least h elements. Observe that, following the results in [169], we keep an` 1 -regularization for shrinkage to improve performance in low signal-noise-ratio regimes. 2.5.2.3 Computational setting For the computations in this section, we generate instances with n= 1;000, s= 10, and h= 10; so, each signal is zero in approximately 90% of the time. Moreover, we test noise levelss = 0:1t, t= 1;:::;n, and for eachs we generate 10 different instances as follows: 1. For each parameter combination, two signals are randomly generated: one signal for training, the other for testing. 52 2. For all methods, we solve the corresponding optimization problem for the training signal with 10 values of the smoothness parameterl and 10 values of the shrinkage parameter m, a total of 100 combinations. We consider two criteria for choosing a pair(l;m): Error The pair that best fits the true signal with the respect to the estimation error, i.e., com- bination minimizingk ˆ yx k 2 2 , where x is the solution for corresponding optimization. Sparsity The pair that best matches the sparsity pattern of the true signal 3 , i.e., combination minimizingå n i=1 j ˆ y i j 0 jx i j 0 . This setting is of practical interest in cases where the training data is partially labeled: the location of the spikes is known but the actual value of the signal is not. 3. 
We solve the optimization problem for the testing signal with parameters (λ, μ) chosen in (2), and report the results (averaged over the 10 instances).

For the instances considered, Table 2.2 shows the average Signal-to-Noise Ratio (SNR) as a function of σ, computed as SNR = ‖ŷ‖_2^2 / ‖ŷ − y‖_2^2.

Tab. 2.2: Signal-to-Noise Ratio for different values of the noise.
σ     0.1    0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
SNR   2,200  138   27    8.6   3.5   1.7   0.9   0.5   0.3   0.2

2.5.2.4 Results with respect to the error criterion

We now present the results when the true values of ŷ are known in training. Figure 2.5 depicts the out-of-sample error of each method for each SNR, computed as error = ‖ŷ_test − x*‖_2^2 / ‖ŷ_test‖_2^2, where ŷ_test is the true testing signal and x* is the estimator. Figure 2.6 depicts how accurately the estimator obtained in testing matches the sparsity pattern of the true signal. We observe that the standard ℓ1-norm approach results in dense signals with a substantial number of false positives, and is outperformed by the approaches that enforce priors in terms of error as well. The inclusion of the sparsity prior (Footnote 3) results in a notable improvement in terms of the error across all SNRs, reducing it by half or more for SNR ≥ 3. This prior also yields an order-of-magnitude improvement in terms of matching the sparsity pattern for SNR ≥ 3, although for low SNRs the improvement in matching the sparsity pattern is less pronounced (and is worse for SNR=0.5). The inclusion of additional priors for the number and length of each spike yields further improvements (especially for low SNRs), and yields a good match for the sparsity pattern in all cases.

Footnote 3: A point x_i is considered non-zero if |x_i| > 10^{-3}.

Fig. 2.5: Average out-of-sample error as a function of SNR (in log-scale).

Fig. 2.6: Average out-of-sample number of false positives (red/dark blue/dark green) and false negatives (orange/light blue/light green) as a function of SNR. The number of false positives for L1 with SNR=2,200 is 746.

Figure 2.7 provides detailed information about the distribution of the out-of-sample errors for three different SNRs. We see that in high SNR regimes, the inclusion of the sparsity prior consistently outperforms the ℓ1-norm method, and the inclusion of additional priors consistently outperforms using only the sparsity prior. In contrast, in low SNR regimes, while the inclusion of additional priors yields better results on average, the improvement is not as consistent.

Fig. 2.7: Distribution of the out-of-sample errors for different SNRs when the true values of the signal used in training are available. Panels: (a) SNR=0.2, (b) SNR=0.5, (c) SNR=3.5.

Finally, Figure 2.8 depicts the average time required to solve the optimization problems as a function of the SNR. As expected, the ℓ1-norm approximation is the fastest method. Optimization problems with the sparsity prior are solved under two seconds, and optimization problems with all priors are solved under 10 seconds. We see that the time required to solve the problems based on the stronger relaxations increases as the SNR decreases.

Fig. 2.8: CPU time in seconds as a function of SNR (in log-scale). The error bars correspond to 1 stdev.

2.5.2.5 Results with respect to the sparsity pattern criterion

We now present the results when, for the training data, the true values of ŷ are unknown, but its sparsity pattern is known.
Figure 2.9 depicts the out-of-sample error of each method for each SNR, and Figure 2.10 depicts how accurately the estimator obtained in validation matches the sparsity pattern of the true signal. Naturally, as the true values of the training signal are unknown, all methods perform worse in terms of the out-of-sample error. The ℓ1-norm method in particular performs very poorly in low SNR regimes: the estimator is x* ≈ 0, resulting in a large error close to one and several false negatives (with no false positives, since few or no indexes are non-zero). In contrast, the methods that enforce priors result in significantly reduced error across all SNRs while simultaneously improving the detection of the sparsity in low SNR regimes, correctly detecting several spikes. In this setting, we did not observe a substantial difference between methods Decomp-sparse and Decomp-prior. From Figure 2.11, which depicts the distributions of the errors, we see that the new convexification-based methods consistently outperform the ℓ1-method.

Fig. 2.9: Average out-of-sample error as a function of SNR (in log-scale) when only the sparsity pattern of the training signal is known.

Fig. 2.10: Average out-of-sample number of false positives (red/dark blue/dark green) and false negatives (orange/light blue/light green) as a function of SNR when only the sparsity pattern of the training signal is known.

Fig. 2.11: Distribution of the out-of-sample errors for different SNRs when only the sparsity pattern of the training signal is known. Panels: (a) SNR=0.2, (b) SNR=0.5, (c) SNR=3.5.

2.5.3 Computational experiments – Lagrangian methods

We now report on the performance of the Lagrangian method given in Section 2.4.3 for larger signals with n = 100,000, σ = 0.5, s = 10 and h = 100 (so approximately 1% of the signal values are non-zero). We denoise the signal by solving the optimization problem

min_{x ∈ R^n_+, z ∈ [0,1]^n}  ‖y‖_2^2 − 2 Σ_{i=1}^n y_i x_i + θ(z, x) + μ‖x‖_1 + κ‖z‖_1    (2.44a)
s.t.  0 ≤ x ≤ ‖y‖_∞ z.    (2.44b)

In these experiments we use synthetic instances generated as in Section 2.5.2 with λ = 0.3 and μ = 0 (Footnote 4), and varying κ ∈ {0.0005, 0.001, 0.002, 0.005, 0.01, 0.02}. We solve (2.44) using the Lagrangian method with m ∈ {1, 10, 100, 1000} subproblems (m = 1 corresponds to no decomposition). The independent subproblems are solved in parallel on the same laptop computer. Table 2.3 presents the results, both in terms of statistical and computational performance. For each value of κ and m, it shows the error between the true signal and the estimated signal (Footnote 5), the number of non-zero values ‖x*‖_0 of the resulting estimator, the time required to solve the problem, the number of subgradient iterations used, and the actual number of subproblems solved.

Footnote 4: In the experiments reported in Section 2.5.2.4 with σ = 0.5 and method Decomp-sparse, the combination (λ, μ) = (0.32, 0) was chosen in 4/10 instances and was the combination most often selected in training.

Footnote 5: Since we do not perform cross-validation, we report the in-sample error.
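Before turning to the results in Table 2.3, the block splitting and dual update used in these experiments can be outlined as follows. This is an illustrative sketch only: solve_block is a placeholder for the per-block solver (for example, Algorithm 2 applied to the strengthened subproblem with the linear boundary terms added) and is not a function from the thesis code.

# Illustrative outline of the Lagrangian scheme of Section 2.4.3; solve_block is a
# placeholder for the per-block solver and is not a function from the thesis code.
import numpy as np

def lagrangian_denoise(y, m, solve_block, max_iter=100, tol=1e-3):
    n = len(y)
    starts = [1 + j * (n // m) for j in range(m)] + [n + 1]  # 1-based starts l_1,...,l_{m+1}
    gamma = np.zeros(m - 1)                                  # one multiplier per coupling w_j
    x = np.zeros(n)
    for h in range(1, max_iter + 1):
        for j in range(m):                                   # m independent subproblems (parallelizable)
            lo, hi = starts[j] - 1, starts[j + 1] - 1        # 0-based slice of block j
            g_left = gamma[j - 1] if j > 0 else 0.0
            g_right = gamma[j] if j < m - 1 else 0.0
            x[lo:hi] = solve_block(y[lo:hi], g_left, g_right)
        # subgradient xi_j = gamma_j / 2 + (x_{l_j - 1} - x_{l_j}) of the dual objective (2.40a)
        xi = gamma / 2.0 + np.array([x[starts[j + 1] - 2] - x[starts[j + 1] - 1]
                                     for j in range(m - 1)])
        if xi.size == 0 or np.max(np.abs(xi)) < tol:
            break
        gamma = gamma + xi / h                               # diminishing step size 1/h
    return x, gamma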
We observe that for the smallest value of κ = 0.0005 (corresponding to a sparsity of ‖x*‖_0 ≈ 30,000), the method without decomposition (m = 1) is the fastest and is able to solve the problems in approximately three minutes. However, as the value of the ℓ0 regularization parameter κ increases, the Lagrangian methods solve the problems increasingly faster. In particular, for values of κ ≥ 0.005 (sparsity of ‖x*‖_0 ≤ 900), the Lagrangian method with m = 1,000 solves the problems in under one minute, whereas a direct implementation via Algorithm 2 may require an hour or more. Indeed, we see that as the ℓ0 regularization parameter increases, the number of iterations and the number of subproblems solved decrease considerably. In fact, if κ ≥ 0.01, the Lagrangian method with m = 10 solves the problems to optimality without performing any subgradient iterations. Finally, we point out that in terms of the estimation error, all methods return comparable errors (except for κ = 0.0005, where the maximum number of 100 iterations is reached and the Lagrangian methods do not solve the problems to optimality).

Therefore, we conclude that the proposed Lagrangian method is able to efficiently tackle large-scale problems when the target sparsity is small compared to the dimension of the problem, and can solve the problems two orders of magnitude faster than the default method. The drawback is that the decomposition method is unable to incorporate additional priors using constraints.

Tab. 2.3: Performance of the Lagrangian method for signals with n = 100,000. Bold entries correspond to the best estimation error and the fastest solution time. (Error and ‖x*‖_0 describe signal quality; time, # iter and # sub describe computational performance.)

κ        m      | error   ‖x*‖_0  | time(s)  # iter  # sub
0.0005   1      | 0.172   29,690  | 172      1       1
         10     | 0.188   30,650  | 1,825    89      10+362
         100    | 0.187   30,648  | 656      100     100+3,252
         1,000  | 0.187   30,670  | 631      100     1,000+31,932
0.001    1      | 0.091   10,790  | 506      1       1
         10     | 0.091   10,789  | 3,391    62      10+257
         100    | 0.091   10,781  | 613      100     100+2,455
         1,000  | 0.090   10,760  | 475      100     1,000+23,174
0.002    1      | 0.027   2,523   | 1,390    1       1
         10     | 0.027   2,511   | 1,703    22      10+94
         100    | 0.027   2,510   | 280      100     100+848
         1,000  | 0.027   2,502   | 173      100     1,000+7,861
0.005    1      | 0.008   878     | 5,579    1       1
         10     | 0.008   877     | 309      12      10+20
         100    | 0.008   877     | 51       17      100+54
         1,000  | 0.008   878     | 31       61      1,000+480
0.01     1      | 0.013   758     | 2,141    1       1
         10     | 0.013   758     | 174      1       10+0
         100    | 0.013   759     | 81       18      100+38
         1,000  | 0.013   761     | 49       71      1,000+347
0.02     1      | 0.028   648     | 2,184    1       1
         10     | 0.030   637     | 185      1       10+0
         100    | 0.028   646     | 89       14      100+27
         1,000  | 0.028   649     | 44       62      1,000+275

2.6 Conclusions

In this chapter we derived strong iterative convex relaxations for quadratic optimization problems with M-matrices and indicators, of which signal estimation with smoothness and sparsity is a special case. The relaxations are based on convexification of quadratic functions on two variables, and optimal decompositions of an M-matrix into pairwise terms. We also gave extended conic quadratic formulations of the convex relaxations, allowing the use of off-the-shelf conic solvers. The approach is general enough to permit the addition of multiple priors in the form of additional constraints. The proposed iterative convexification approach substantially closes the gap between the ℓ0-"norm" and its ℓ1 surrogate and results in significantly better estimators than the standard approaches using ℓ1 approximations. In fact, near-optimal solutions of the ℓ0-problems are obtained in seconds for instances with over 10,000 variables, and the method scales to instances with 100,000 variables using tailored algorithms.
In addition to better inference properties, the proposed models and resulting estimators are easily interpretable. On the one hand, unlike` 1 -approximations and related estimators, the sparsity of the proposed estimators is close to the target sparsity parameter k. Thus, a prior on the sparsity of the signal can be naturally fed to the inference problems. On the other hand, the proposed strong convex relaxations compare favorably to ` 1 -approximations in classification or spike inference purposes: the 0-1 variables can be easily used to assign a category to each observation via simple rounding heuristics, and resulting in high-quality solutions. 61 3 Convexification for convex quadratic optimization with indicator variables 3.1 Introduction We consider the convex quadratic optimization with indicators: (QI) min a 0 x+ b 0 y+ y 0 Qy :(x;y)2I n ; (3.1) where the indicator set is defined as I n = (x;y)2f0;1g n R n + : y i (1 x i )= 0;8i2[n] ; where a and b are n-dimensional vectors, Q2R nn is a positive semidefinite (PSD) matrix and [n] :=f1;2;:::;ng. For each i2[n], the complementarity constraint y i (1 x i )= 0, along with the indicator variable x i 2f0;1g, is used to state that y i = 0 whenever x i = 0. Numerous applications, including portfolio optimization [53], optimal control [105], image segmentation [133], signal denoising [30] are either formulated as (QI) or can be relaxed to (QI). Building strong convex relaxations of (QI) is instrumental in solving it effectively. A number of approaches for developing linear and nonlinear valid inequalities for (QI) are considered in lit- erature. Dong and Linderoth [89] describe lifted linear inequalities from its continuous quadratic 62 optimization counterpart with bounded variables. Bienstock and Michalka [54] derive valid lin- ear inequalities for optimization of a convex objective function over a non-convex set based on gradients of the objective function. Valid linear inequalities for (QI) can also be obtained using the epigraph of bilinear terms in the objective [e.g. 55, 84, 117, 162]. In addition, several spe- cialized results concerning optimization problems with indicator variables exist in the literature [19, 39, 47, 58, 81, 109, 110, 156, 166]. There is a substantial body of research on the perspective formulation of convex univariate functions with indicators [3, 88, 89, 97, 116, 129, 229]. When Q is diagonal, y 0 Qy is separable and the perspective formulation provides the convex hull of the epigraph of y 0 Qy with indicator variables by strengthening each term Q ii y 2 i with its perspective counterpart Q ii y 2 i =x i , individually. For the general case, however, convex relaxations based on the perspective reformulation may not be strong. The computational experiments in [101] demonstrate that as Q deviates from a diagonal matrix, the performance of the perspective formulation deteriorates. A powerful approach for nonconvex quadratic optimization problems is semidefinite program- ming (SDP) reformulation, first proposed by Shor [203]. Specifically, a convex relaxation is con- structed by introducing a rank-one matrix Z representing zz 0 , where z is the decision vector, and then forming the semidefinite relaxation Z zz 0 . Such SDP relaxations have been widely utilized in numerous applications, including max-cut problems [107], hidden partition problems of finding clusters in large network datasets [140], matrix completion problems [4, 66], power systems [94], robust optimization problems [40]. 
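To make the lifting concrete, the following small sketch (an illustration added here, not part of the dissertation's experiments; it assumes the CVXPY modeling package with an SDP-capable solver and uses made-up data) builds this basic relaxation for a toy binary quadratic program min{ z'Qz + c'z : z ∈ {0,1}^n } by introducing one PSD matrix variable that plays the role of [1, z'; z, zz'].

import cvxpy as cp
import numpy as np

# Made-up data for min z'Qz + c'z over z in {0,1}^n (Q indefinite, so the QP is nonconvex).
Q = np.array([[2.0, -1.5],
              [-1.5, 1.0]])
c = np.array([-1.0, -0.5])
n = len(c)

# M plays the role of [1, z'; z, Z], where the block Z is a proxy for zz'.
M = cp.Variable((n + 1, n + 1), symmetric=True)
z = M[0, 1:]              # proxy for the decision vector z
Z = M[1:, 1:]             # proxy for the rank-one matrix zz'

constraints = [M >> 0, M[0, 0] == 1]                  # encodes Z >= zz'
constraints += [Z[i, i] == z[i] for i in range(n)]    # z_i^2 = z_i for binary z_i
constraints += [z >= 0, z <= 1]

shor = cp.Problem(cp.Minimize(cp.trace(Q @ Z) + c @ z), constraints)
shor.solve()
print("Shor lower bound:", shor.value)

The same idea, with (x, y) stacked into the lifted vector and the complementarity constraints y_i(1 − x_i) = 0 linearized through entries of the matrix variable, is what underlies the formulation Shor given in (3.6) below.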
Sufficient conditions for exactness of SDP relaxations [e.g. 64, 132, 143, 223, 224] and stronger rank-one conic formulations [17, 21] are also given in the literature. Beyond the perspective reformulation, which is based on the convex hull of the epigraph of a univariate convex quadratic function with one indicator variable, the convexification for the 2 2 case has received attention recently. Convex hulls of univariate and 2 2 cases can be used as building blocks to strengthen (QI) by decomposing y 0 Qy into a sequence of low-dimensional terms. Castro et al. [71] study convexification of a special class of two-term quadratic function controlled 63 by a single indicator variable. Jeon et al. [141] give conic quadratic valid inequalities for the 2 2 case. Frangioni et al. [101] combine perspective reformulation and disjunctive programming and apply them to the 2 2 case. Atamt¨ urk and G´ omez [16] study the convex hull of the mixed-integer epigraph of(y 1 y 2 ) 2 with indicators. Atamt¨ urk et al. [23] give the convex hull of the more general set Z := (x;y;t)2I 2 R + : t d 1 y 2 1 2y 1 y 2 + d 2 y 2 2 ; with coefficients d2D :=fd2R 2 : d 1 0;d 2 0;d 1 d 2 1g. The conditions on the coefficients d 1 ;d 2 imply convexity of the quadratic. Atamturk and Gomez [18] study the case where the con- tinuous variables are free and the rank of the coefficient matrix is one in the context of sparse linear regression. Burer and Anstreicher [63] give an extended SDP formulation for the convex hull of the 2 2 bounded set (y;yy 0 ;xx 0 ) : 0 y x2f0;1g 2 . Their formulation does not assume con- vexity of the quadratic function and contain PSD matrix variables X and Y as proxies for xx 0 and yy 0 as additional variables. Anstreicher and Burer [9] study computable representations of convex hulls of low dimensional quadratic forms without indicator variables. More general convexifica- tions for low-rank quadratic functions [22, 119] or quadratic functions with tridiagonal matrices [159] have also been proposed. To design convex relaxations for (QI) based on convexifications for simpler substructures, a standard approach is to decompose the matrix Q as Q= R+å j2J Q j , for some index set J, where R;Q j 0, j2 J. After writing problem (3.1) as min a 0 x+ b 0 y+ y 0 Ry+ å j2J t j (3.2a) s.t.t j y 0 Q j y 8 j2 J (3.2b) (x;y)2I n ; t2R J + ; (3.2c) 64 formulation (3.2) can then be strengthened based on convexifications of the simpler structures induced by constraints (3.2b) (e.g., matrices Q j are diagonal or 2 2). There are two main ap- proaches to implement convexifications based on (3.2). On the one hand, one may choose fixed R; Q j ; j2 J, a priori and treat them as parameters, as done in [22, 98, 101, 159, 235], resulting in simpler formulations (e.g., conic quadratic representable) that may be amenable to use with off- the-shelf solvers for mixed-integer optimization. On the other hand, one may treat matrices R;Q j , j2 J, as decision variables that are chosen with the goal of obtaining the optimal relaxation bound after strengthening, as done in [23, 18, 88]. The resulting formulations with the second approach are stronger but more complex (e.g., SDP representable). In general, neither approach is preferable to the other. Contributions There are three main contributions in this chapter. 1. We show the equivalence between the optimal perspective reformulation and Shor’s SDP formulation for (QI) These two formulations have been studied extensively in literature. 
While it is known that Shor’s SDP formulation is at least as strong as the perspective formulation [88], the other direction has not been explored. We show in this chapter that these two formulations are in fact equivalent. 2. 2 2 case: We describe the convex hull of the epigraph of a convex bivariate quadratic with a positive cross product and indicators. Consider Z + := (x;y;t)2I 2 R + : t d 1 y 2 1 + 2y 1 y 2 + d 2 y 2 2 ; where d2D. Observe that any bivariate convex quadratic with positive off-diagonals can be written as d 1 y 2 1 + 2y 1 y 2 + d 2 y 2 2 , by scaling appropriately. Therefore,Z + is the complementary set 65 toZ and, together,Z + andZ model epigraphs of all bivariate convex quadratics with indicators and nonnegative continuous variables. In this chapter, we propose conic quadratic extended formulations to describe clconv(Z ) and clconv(Z + ). These extended formulations are more compact than alternatives previously proposed in the literature. More importantly, a distinguishing contribution of this chapter is that we also give the explicit description of clconv(Z + ) in the original space of the variables. The corresponding convex envelope of the bivariate function is a four-piece function. While convexifications in the original space of variables are more difficult to implement using current off-the-shelf mixed-integer optimization solvers, they offer deeper insights on the structure of the convex hulls. Whereas the ideal formulations ofZ can be conveniently described with two simpler valid “extremal” inequalities [23], a similar result does not hold forZ + (see Example 3 in §3.4). The derivation of ideal formulations for the more involved setZ + differs significantly from the methods in [23]. The complementary results of this chapter and [23] forZ complete the convex hull descriptions of bivariate convex functions with indicators and nonnegative continuous variables. 3. General case: We develop an optimal SDP relaxation based on 2 2 convexifications for (QI) In order to construct a strong convex formulation for (QI), we extract a sequence of 2 2 PSD matrices from Q such that the residual term is a PSD matrix as well, and convexify each bivariate quadratic term utilizing the descriptions of clconv(Z + ) and clconv(Z ). This approach works very well when Q is 2 2 PSD decomposable, i.e., when Q is scaled-diagonally dominant [56]. Otherwise, a natural question is how to optimally decompose y 0 Qy into bivariable convex quadrat- ics and a residual convex quadratic term so as to achieve the best strengthening. We address this question by deriving an optimal convex formulation using SDP duality. The new SDP formulation dominates any formulation obtained through a 22-decomposition scheme. This formulation is also stronger than other SDP formulations in the literature (see Figure 3.1), 66 including the optimal perspective formulation [88], Shor’s SDP formulation [203], and the opti- mal rank-one convexification [18]. In addition, the proposed formulation is solved many orders of magnitude faster than the 2 2-decomposition approaches based on disjunctive programming [101], and delivers higher quality bounds than standard mixed-integer optimization approaches in difficult portfolio index tracking problems. Natural QP relaxation Optimal perspective OptPersp Shor SDP Optimal rank-one OptRankOne Optimal pairs OptPairs Fig. 3.1: Relationship between the convex relaxations for (QI) discussed in this chapter. 
Rect- angular frames and circle frames indicate formulations in the literature and the new formulation in this chapter, respectively. The arrow direction A!B indicates that formulation B is stronger than formulationA. Solid and dashed lines indicate existing relations in the literature and relations shown in this chapter, respectively. Outline The rest of the chapter is organized as follows. In §3.2 we review the optimal perspective formu- lation and Shor’s SDP formulation for (QI) and show that these two formulations are equivalent. In §3.3 we review the convex hull results onZ and illustrate the structural difference between Z + andZ . In §3.4 we provide a conic quadratic formulation of clconv(Z + ) and clconv(Z ) in an extended space and derive the explicit form of clconv(Z + ) in the original space. In §3.5, employing the results in §3.4, we give a strong convex relaxation for (QI) using SDP techniques. In §3.6, we compare the strength of the proposed SDP relaxation with others in literature. In §3.7, we present computational results demonstrating the effectiveness of the proposed convex relaxations. Finally, in §3.8, we conclude with a few final remarks. 67 Notation To simplify the notation throughout, we adopt the following convention for division by 0: given x 0, x 2 =0=¥ if x6= 0 and x 2 =0= 0 if x= 0. Thus, x 2 =z, the closure of the perspective of x 2 , is a closed convex function [see 194, pages 67-68]. For a setXR n , clconv(X) denotes the closure of the convex hull ofX . For a vector v, diag(v) denotes the diagonal matrix V with V ii = v i for each i. Finally,S n + refers to the cone of n n real symmetric PSD matrices. 3.2 Optimal perspective formulation vs. Shor’s SDP In this section we analyze two well-known convex formulations: the optimal perspective formula- tion and Shor’s SDP. We first introduce the two formulations and then show that they are equivalent for (QI). Splitting Q into some diagonal PSD matrix D= diag(d) and a PSD residual, i.e., QD 0, one can apply the perspective reformulation to each diagonal term, by replacing D ii y 2 i with D ii y 2 i =x i , to get a valid convex relaxation of (QI)–after relaxing integrality constraints in x and dropping the complementarity constraints y i (1 x i )= 0: min (x;y)2R 2n a 0 x+ b 0 y+ y 0 (Q diag(d))y+ å i2[n] d i y 2 i x i s.t. 0 x 1; y 0 (x;y)2XR n R n : (3.3) In certain applications, such as the sensor placement [99], the single-period unit commitment [104] and the` 2 -penalized least square regression [184], vector d is immediate from the context. Thus, in such cases, (3.3) directly delivers a strong relaxation of (3.1). For cases where a decomposition of Q is not immediate, several approaches are proposed in literature [e.g. 98, 235] to obtain a desirable d. Because different decompositions usually yield different relaxations, the relaxation 68 quality of (3.3) relies on the choice of vector d. Introducing a symmetric matrix variable Y , Dong et al. [88] describe an optimal perspective relaxation for (QI): min a 0 x+ b 0 y+hQ;Yi (3.4a) OptPersp s.t. Y yy 0 0 (3.4b) y 2 i Y ii x i 8i2[n] (3.4c) 0 x 1; y 0 (3.4d) (x;y)2XR n R n : (3.4e) Note that by adding integrality constraints on x in OptPersp, one obtains a mixed-integer SDP problem, which can be solved by a branch-and-bound algorithm. What is more, the resulting mixed-integer program is equivalent to the original model (QI). Indeed, if x is integral, then (3.26c) either reduces to y i = 0 or is implied (3.26b). 
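As an aside, the continuous relaxation OptPersp is straightforward to prototype. The sketch below is an illustration only; it assumes the CVXPY package, an SDP-capable solver, and small made-up data, and it takes X to be the whole space so that only (3.4b)-(3.4d) are imposed.

import cvxpy as cp
import numpy as np

# Made-up (QI) data with Q positive semidefinite.
Q = np.array([[5.0, 2.0],
              [2.0, 1.0]])
a = np.array([1.0, 1.0])
b = np.array([-8.0, -5.0])
n = len(a)

x = cp.Variable(n)
# One lifted symmetric matrix holds [Y, y; y', 1], so Y - yy' >= 0 becomes a single PSD constraint.
M = cp.Variable((n + 1, n + 1), symmetric=True)
Y = M[:n, :n]
y = M[:n, n]

constraints = [M >> 0, M[n, n] == 1]                                         # (3.4b)
constraints += [cp.quad_over_lin(y[i], x[i]) <= Y[i, i] for i in range(n)]   # (3.4c): y_i^2 <= Y_ii x_i
constraints += [x >= 0, x <= 1, y >= 0]                                      # (3.4d)

persp = cp.Problem(cp.Minimize(a @ x + b @ y + cp.trace(Q @ Y)), constraints)
persp.solve()
print("OptPersp bound:", persp.value)

Theorem 2 below shows that this bound coincides with the one produced by Shor's formulation (3.6), so the smaller model is the one worth solving in practice.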
In any case, Y and y are linked only through (3.26b), which must hold as an equality at the optimal solution because Q 0. The authors [88] show that OptPersp is optimal in the sense that every perspective relaxation of the form (3.3) is dominated byOptPersp, and moreover, there indeed exists a decomposition of Q such that the resulting perspective formulation (3.3) is equivalent toOptPersp. Proposition 17 (Theorem 3 in [88]). OptPersp is equivalent to the following max-min optimization problem: max d2R n min (x;y)2R 2n a 0 x+ b 0 y+ y 0 (Q diag(d))y+ å i2[n] d i y 2 i x i s.t. d 0; Q diag(d) 0 0 x 1; y 0 (x;y)2XR n R n : 69 Next, we consider Shor’s SDP relaxation for problem (QI): min a 0 x+ b 0 y+ n å i=1 n å j=1 Q i j Z i j (3.6a) s.t. y i Z i;i+n = 0 8i2[n] (3.6b) Shor x i Z i+n;i+n = 0 8i2[n] (3.6c) Z 0 B @ y x 1 C A y 0 x 0 0 (3.6d) 0 x 1; y 0 (3.6e) (x;y)2XR n R n ; (3.6f) where Z2R 2n2n such that Z ii is a proxy for y 2 i , Z i+n;i+n is a proxy for x 2 i , Z i;i+n is a proxy for x i y i , i2[n], and Z i j is a proxy for y i y j for 1 i; j n. It is known that Shor is at least as strong as OptPersp [89], as constraints (3.26c) are implied by the positive definiteness of some 2 2 principal minors of (3.6d). We show below that the two formulations are, in fact, equivalent. As OptPersp is a much smaller formulation than Shor, the equivalence makes it the favorable choice. Theorem 2. OptPersp is equivalent toShor. Proof. First we verify Shor is at least as strong as OptPersp by checking that constraints (3.26b)– (3.26c) are implied by (3.6d). Let Y i j = Z i j for any i; j2[n]. By Schur Complement Lemma, Z 0 B @ y x 1 C A y 0 x 0 0() 0 B B B B @ 1 y 0 x 0 y x Z 1 C C C C A 0: 70 Since Y is a principle submatrix of Z, we have 0 B @ 1 y 0 y Y 1 C A 0, Y yy 0 0: Moreover, constraint (3.6d) also implies that for any i2[n], 0 B @ Z ii Z i;i+n Z i;i+n Z i+n;i+n 1 C A 0: (3.7) After substituting Y ii = Z ii ; x i = Z i+n;i+n ; and y i = Z i;i+n , we find that (3.7) implies Y ii x i y 2 i in OptPersp, concluding the argument. We next prove thatOptPersp is at least as strong asShor, by showing that for any given feasible solution (x;y;Y) of OptPersp, it is possible to construct a feasible solution of Shor with same objective value. First, we rewrite Z in the form Z = 0 B @ Y U U 0 V 1 C A : For a fixed feasible solution (x;y;Y) ofOptPersp consider the optimization problem l := min l;U;V l (3.8a) s.t. 0 B B B B @ 1 y 0 x 0 y Y U x U 0 V 1 C C C C A +lI 0 (3.8b) U ii = y i ; 8i2[n] (3.8c) V ii = x i ; 8i2[n]; (3.8d) where I is the identity matrix. Observe that if l 0, then an optimal solution of (3.8) satisfies (3.6d) and thus induces a feasible solution ofShor. We show next that this is, indeed, the case. 71 Let ˜ Y = 0 B @ 1 y 0 y Y 1 C A and consider the SDP dual of (3.8): l = max R;s;t;z ˜ Y;R å i (2x i z i +t i x i + 2s i y i ) (3.9a) s.t. 0 B B B B @ R z 0 diag(s) z diag(s) diag(t) 1 C C C C A 0 (3.9b) Tr(R)+ å i t i = 1; (3.9c) where R;z;diag(s);diag(t) are the dual variable associated with ˜ Y+lI;x;U; and V+lI, respec- tively. Note that we abuse the symbol I to represent the identity matrices of different dimensions. One can verify that the strong duality holds for (3.8) since l can be an arbitrary positive number to ensure that the matrix inequality holds strictly. Because the off-diagonal elements of U and V do not appear in the primal objective function and constraints other than (3.8b), the corresponding dual variables are zero. 
Note that to showl 0, it is sufficient to consider a relaxation of (3.9a). Therefore, dropping (3.9c), it is sufficient to show that ˜ Y;R + å i (2x i z i +t i x i + 2s i y i ) 0 (3.10) for all t 0, s, z; and R satisfying (3.9b). Observe that if t i = 0, then s i = z i = 0 in any solution satisfying (3.9b). In this case, all such terms indexed by i vanish in (3.10). Therefore, it suffices to prove (3.10) holds for all t> 0. For t> 0, by Schur Complement Lemma, (3.9b) is equivalent to R 0 B @ z 0 diag(s) 1 C A 1 diag(t) z diag(s) : (3.11) 72 Moreover, since ˜ Y 0, we find that ˜ Y;R * ˜ Y; 0 B @ z 0 diag(s) 1 C A 1 diag(t) z diag(s) + whenever R satisfies (3.11). Substituting the term ˜ Y;R in (3.10) by its lower bound, it suffices to show that * ˜ Y; 0 B @ z 0 diag(s) 1 C A 1 diag(t) z diag(s) + + å i (2x i z i +t i x i + 2s i y i ) 0; (3.12) holds for all t> 0;s;z and R satisfying (3.11). A direct computation shows that 0 B @ z 0 diag(s) 1 C A 1 diag(t) z diag(s) = 0 B B B B B B B @ å i z 2 i =t i z 1 s 1 =t 1 z n s n =t n z 1 s 1 =t 1 s 2 1 =t 1 . . . . . . z n s n =t n s 2 n =t n 1 C C C C C C C A with all off-diagonal elements equal to 0, except for the first row/column. Thus, (3.12) reduces to the separable expression å i z 2 i t i + 2z i s i y i t i + s 2 i t i Y ii + 2x i z i + x i t i + 2y i s i 0: For each term, we have z 2 i t i + 2z i s i y i t i + s 2 i t i Y ii + 2x i z i + x i t i + 2y i s i =(z 2 i + 2z i s i y i + s 2 i Y ii )=t i + x i t i + 2x i z i + 2y i s i 2 q x i (z 2 i + 2z i s i y i + s 2 i Y ii )+ 2x i z i + 2y i s i 0; 73 where the first inequality follows from the inequality between the arithmetic and geometric mean a+ b 2 p ab for a;b 0. The last inequality holds trivially if 2x i z i + 2y i s i 0; otherwise, we have q x i (z 2 i + 2z i s i y i + s 2 i Y ii )(x i z i + y i s i ) () x i (z 2 i + 2z i s i y i + s 2 i Y ii )(x i z i + y i s i ) 2 () x i z 2 i (1 x i )+ s 2 i (x i Y ii y 2 i ) 0: (as 0 x i 1 and x i Y ii y 2 i ) In conclusion,l 0 and this completes the proof. 3.3 Convex hull results forX In this section, we review the existing results on convex hulls of setsZ ,Z + , and their relaxation Z f with free continuous variables: Z f := n (x;y;t)2f0;1g 2 R 3 : t d 1 y 2 1 2y 1 y 2 +d 2 y 2 2 ; y i (1x i )= 0;i2[2] o Note that when the continuous variables are free, the sign associated with the cross term 2y 1 y 2 is irrelevant, since one can state it equivalently with the opposite sign by substituting ¯ y i =y i . In contrast, if y 0, such a substitution is not possible; hence, the need for separate analyses for sets Z + andZ . We first point out that all three sets can be naturally seen as disjunctions of four convex sets corresponding to the four possible values for x2f0;1g 2 . Thus, a direct application of disjunc- tive programming yields similar (conic quadratic) representations of the three sets [101] but such representations require several additional variables. While the disjunctive approach might suggest thatZ f ,Z + ,Z may be similar, we now argue that the sign of the cross terms materially affect the complexity of the optimization problems as well as the structure of the convex hulls. 74 3.3.1 Optimization The sign of the off-diagonals of matrix Q critically affect the complexity of the optimization prob- lem (QI). We first state a result concerning optimization with Stieltjes matrices Q, first proven in [16]. Proposition 18 (Atamt¨ urk and G´ omez [16]). Problem (3.1) can be solved in polynomial time if Q 0 and Q i j 0 for all i6= j and b 0. 
In contrast, an analogous result does not hold if the off-diagonal terms of matrix Q are nonneg- ative. Proposition 19. Problem (3.1) isNP-hard if Q 0 and Q i j 0 for all i6= j and b 0. Proof. We show that (QI) includes theNPhard subset sum problem as a special case under the assumptions of the proposition: given w2Z n + ;K2Z + , solve the equation w 0 x= K; x2f0;1g n : (3.13) Set Q=(I+ qq 0 )=2 0 where q2R n ++ is a parameter to be specified later. Let p i = q 2 i ; i2[n], b=q and a=g p for some g > 0 to be specified later as well. For a vector z2R n and matrix 75 M2S n + , let z S and M S denote the subvector and principle submatrix defined by S[n], respectively. Then (QI) reduces to min (x;y)2I n 1 2 y 0 (I+ qq 0 )y q 0 y+g p 0 x = min S[n];y S 0 1 2 y 0 S (I S + q S q 0 S )y S q 0 S y S +g å i2S p i (S :=fi : x i = 1g) = min S[n] 1 2 q 0 S I S + q S q 0 S 1 q S +g å i2S p i (3.14a) = min S[n] 1 2 q 0 S I S q S q 0 S 1+kq S k 2 2 ! q S +g å i2S p i (Woodbury matrix identity) = min S[n] 1 2 kq S k 2 2 1+kq S k 2 2 +gkq S k 2 2 (p i = q 2 i ) = min S[n] " 1 2(1+kq S k 2 2 ) +g(1+kq S k 2 2 ) # g 1 2 (3.14b) Note that the nonnegativity constraints are dropped in (3.14a) because they are trivially satisfied by the optimal solution as y S = I S q S q 0 S 1+kq S k 2 2 ! q S = 1 1+kq S k 2 2 q S 0: Now, let q i = p w i , i2[n] andg = 1 2(1+K) 2 . Then (3.14b) simplifies to (after dropping the constant termg 1=2 and multiplying by 2) =min S 1 1+ w(S) + 1+ w(S) (1+ K) 2 2 1+ K ; where the lower bound is attained if and only if w(S)= K. Hence, the subset sum problem (3.13) has a solution if and only if the optimal value of (QI) as constructed above equals 2=(1+ K): Propositions 18 and 19 suggest that convex hulls of sets with negative cross terms are substan- tially simpler than those with positive terms. 76 3.3.2 Rank-one results It is convenient to formulate convex hulls of sets via conic quadratic constraints as they are readily supported by modern mixed-integer optimization software. While such representations are easy to obtain via disjunctive programming, the resulting formulations generally have a prohibitive number of variables and constraints, which hamper the performance of solvers. Therefore, it is of interest to find the most compact conic quadratic formulations. In this regard, as well,Z + is significantly more complex thanZ f andZ . Consider the existing results for the simpler sets in the rank-one case, i.e., d 1 = d 2 = 1: Proposition 20. Atamturk and Gomez [18] If d 1 = d 2 = 1, then clconv(Z f )= n (x;y;t)2[0;1] 2 R 3 : t(y 1 y 2 ) 2 ; t(z 1 +z 2 )(y 1 y 2 ) 2 o In particular, for the rank-one case with free continuous variables, the clconv(Z f ) is conic quadratic representable in the original space of variables, without the need for additional variables. Proposition 21. Atamt¨ urk and G´ omez [16] If d 1 = d 2 = 1, then clconv(Z )= n (x;y;t)2[0;1] 2 R+ 3 : tf(x 1 ;x 2 ;y 1 ;y 2 ) o ; where f(x 1 ;x 2 ;y 1 ;y 2 )= 8 > > < > > : (y 1 y 2 ) 2 =x 1 if y 1 y 2 (y 1 y 2 ) 2 =x 2 if y 1 y 2 : Since constraints t (y 1 y 2 ) 2 =x i , i2 [2], are not valid forZ , clconv(Z ) is not conic quadratic representable in the original space of variables. A conic quadratic representation with two additional variables is given in [23]. In Section 3.4, Corollary 2, we describe clconv(Z + ) in the original space for the rank-one case. This description is more complex thanZ as it requires four pieces instead of two and it 77 is not conic-quadratic representable. 
We also provide a compact extended formulation with three additional variables. 3.3.3 Full-rank results A description of clconv(Z ) in the original space of variables is given in [23]. Interestingly, it can be expressed as two valid inequalities involving functionf introduced in Proposition 21. Proposition 22 (Atamt¨ urk et al. [23]). Set clconv(Z ) is described by bound constraints y 0, 0 x 1, and the two valid inequalities t d 1 f(x 1 ;x 2 ;y 1 ;y 2 =d 1 )+ y 2 2 x 2 d 2 1 d 1 ; t d 2 f(x 1 ;x 2 ;y 1 =d 2 ;y 2 )+ y 2 1 x 1 d 1 1 d 2 : Proposition 22 reveals that clconv(Z ) requires only homogeneous functions that are sums of rank-one and perspective convexifications. In Section 3.4, Proposition 24, we give clconv(Z + ) in the original space of variables, and show that the resulting function does not have either of these properties. The discrepancy between the results highlights that clconv(Z + ) is fundamen- tally different from clconv(Z ), and helps explain why optimization with positive matrices Q (Proposition 19) is substantially more difficult than optimization with Stieltjes matrices (Proposi- tion 18). 3.4 Convex hull description ofZ + In this section, we give ideal convex formulations for Z + = (x;y;t)2I 2 R + : t d 1 y 2 1 + 2y 1 y 2 + d 2 y 2 2 : 78 When d 1 = d 2 = 1,Z + reduces to the simpler rank-one set X + = (x;y;t)2I 2 R + : t(y 1 + y 2 ) 2 : SetX + is of special interest as it arises naturally in (QI) when Q is a diagonally dominant matrix, see computations in §3.7.1 for details. As we shall see, the convex hulls ofZ + andX + are significantly more complicated than their complementary setsZ andX studied earlier. In §3.4.1, we develop an SOCP-representable extended formulation of clconv(Z + ). Then, in §3.4.2, we derive the explicit form of clconv(Z + ) in the original space of variables. 3.4.1 Conic quadratic-representable extended formulation We start by writingZ + as the disjunction of four convex sets defined by all values of the indicator variables; that is, Z + =Z 1 + [Z 2 + [Z 3 + [Z 4 + ; whereZ i + ;i= 1;2;3;4 are convex sets defined as: Z 1 + =f(1;0;u;0;t 1 ) : t 1 d 1 u 2 ;u 0g; Z 2 + =f(0;1;0;v;t 2 ) : t 2 d 2 v 2 ;v 0g; Z 3 + =f(1;1;w 1 ;w 2 ;t 3 ) : t 3 d 1 w 2 1 + 2w 1 w 2 + d 2 w 2 2 ;w 1 0;w 2 0g; Z 4 + =f(0;0;0;0;t 4 ) : t 4 0g: 79 By the definition, a point(x 1 ;x 2 ;y 1 ;y 2 ;t)2 conv(Z + ) if and only if it can be written as a convex combination of four points belonging inZ i + ;i= 1;2;3;4. Usingl =(l 1 ;l 2 ;l 3 ;l 4 ) as the corre- sponding weights,(x 1 ;x 2 ;y 1 ;y 2 ;t)2 conv(Z + ) if and only if the following inequality system has a feasible solution l 1 +l 2 +l 3 +l 4 = 1 (3.15a) x 1 =l 1 +l 3 ; x 2 =l 2 +l 3 (3.15b) y 1 =l 1 u+l 3 w 1 ; y 2 =l 2 v+l 3 w 2 (3.15c) t=l 1 t 1 +l 2 t 2 +l 3 t 3 +l 4 t 4 (3.15d) t 1 d 1 u 2 ; t 2 d 2 v 2 ; t 3 d 1 w 2 1 + 2w 1 w 2 + d 2 w 2 2 ; t 4 0 (3.15e) u;v;w 1 ;w 2 ;l 1 ;l 2 ;l 3 ;l 4 0: (3.15f) We will now simplify (3.15). First, by Fourier–Motzkin elimination, one can substitute t 1 ;t 2 ;t 3 ;t 4 with their lower bounds in (3.15e) and reduce (3.15d) to tl 1 d 1 u 2 +l 2 d 2 v 2 +l 3 (d 1 w 2 1 +2w 1 w 2 + d 2 w 2 2 ). Similarly, sincel 4 0, one can eliminatel 4 and reduce (3.15a) toå 3 i=1 l i 1. Next, using (3.15c), one can substitute u=(y 1 l 3 w 1 )=l 1 and v=(y 2 l 3 w 2 )=l 2 . 
Finally, using (3.15b), one can substitutel 1 = x 1 l 3 andl 2 = x 2 l 3 to arrive at maxf0;x 1 + x 2 1gl 3 minfx 1 ;x 2 g (3.16a) l 3 w i y i ; i= 1;2 (3.16b) w i 0; i= 1;2 (3.16c) t d 1 (y 1 l 3 w 1 ) 2 x 1 l 3 + d 2 (y 2 l 3 w 2 ) 2 x 2 l 3 +l 3 (d 1 w 2 1 + 2w 1 w 2 + d 2 w 2 2 ); (3.16d) where (3.16a) results from the nonnegativity ofl 1 ;l 2 ;l 3 ;l 4 , (3.16b) from the nonnegativity of u and v. Finally, observe that (3.16b) is redundant for (3.16): indeed, if there is a solution (l;w;t) satisfying (3.16a), (3.16c) and (3.16d) but violating (3.16b), one can decrease w 1 and w 2 such that (3.16b) is satisfied without violating (3.16d). 80 Redefining variables in (3.16), we arrive at the following conic quadratic-representable ex- tended formulation for clconv(Z + ) and its rank-one special case clconv(X + ). Proposition 23. The set clconv(Z + ) can be represented as clconv(Z + )= (x;y;t)2[0;1] 2 R 3 + :9l2R + ;z2R 2 + s.t. x 1 + x 2 1l minfx 1 ;x 2 g; t d 1 (y 1 z 1 ) 2 x 1 l + d 2 (y 2 z 2 ) 2 x 2 l + d 1 z 2 1 +2z 1 z 2 +d 2 z 2 2 l Corollary 1. The set clconv(X + ) can be represented as clconv(X + )= (x;y;t)2[0;1] 2 R 3 + :9l2R + ;z2R 2 + s.t. x 1 + x 2 1l minfx 1 ;x 2 g; t (y 1 z 1 ) 2 x 1 l + (y 2 z 2 ) 2 x 2 l + (z 1 + z 2 ) 2 l Remark 5. One can apply similar arguments to the complementary setZ to derive an SOCP representable formulation of its convex hull as clconv(Z )= (x;y;t)2[0;1] 2 R 3 + :9l2R + ;z2R 2 s.t. x 1 + x 2 1l minfx 1 ;x 2 g; z 1 y 1 ; z 2 y 2 ; t d 1 (y 1 z 1 ) 2 x 1 l + d 2 (y 2 z 2 ) 2 x 2 l + d 1 z 2 1 2z 1 z 2 +d 2 z 2 2 l This extended formulation is smaller than the one given Atamt¨ urk et al. [23] for clconv(Z ). 3.4.2 Description in the original space of variables x;y;t The purpose of this section is to express clconv(Z + ) and clconv(X + ) in the original space. 81 LetL x :=fl2R : maxf0;x 1 + x 2 1gl minfx 1 ;x 2 gg, i.e., the set of feasiblel implied by constraint (3.16a). Define G(l;w) := d 1 (y 1 lw 1 ) 2 x 1 l + d 2 (y 2 lw 2 ) 2 x 2 l +l(d 1 w 2 1 + 2w 1 w 2 + d 2 w 2 2 ) and g(l) :L x !R as g(l) := min w2R 2 + G(l;w): Note that as G is SOCP-representable, it is convex. We first prove an auxiliary lemma that will be used in the derivation. Lemma 1. Function g(l) is non-decreasing overL x . Proof. Note that for any fixed w andl < minfx 1 ;x 2 g, we have ¶G(l;w) ¶l = d 1 [2(lw 1 y 1 )w 1 (x 1 l)+(y 1 lw 1 ) 2 ] (x 1 l) 2 + d 2 [2(lw 2 y 2 )w 2 (x 2 l)+(y 2 lw 2 ) 2 ] (x 2 l) 2 +(d 1 w 2 1 + 2w 1 w 2 + d 2 w 2 2 ) = d 1 [w 2 1 (x 1 l) 2 + 2(lw 1 y 1 )w 1 (x 1 l)+(y 1 lw 1 ) 2 ] (x 1 l) 2 + d 2 [w 2 2 (x 2 l 2 ) 2 + 2(lw 2 y 2 )w 2 (x 2 l)+(y 2 lw 2 ) 2 ] (x 2 l) 2 + 2w 1 w 2 = d 1 (w 1 x 1 y 1 ) 2 (x 1 l) 2 + d 2 (w 2 x 2 y 2 ) 2 (x 2 l) 2 + 2w 1 w 2 0: Therefore, for fixed w, G(;w) is nondecreasing. Now for ˜ l ˆ l, let ˜ w and ˆ w be optimal solutions defining g( ˜ l) and g( ˆ l). Then, g( ˜ l)= G( ˜ l; ˜ w) G( ˜ l; ˆ w) G( ˆ l; ˆ w)= g( ˆ l); 82 proving the claim. We now state and prove the main result in this subsection. Proposition 24. Define f(x;y;l;d) := (d 1 d 2 1)(d 1 x 2 y 2 1 + d 2 x 1 y 2 2 )+ 2ld 1 d 2 y 1 y 2 +l(d 1 y 2 1 + d 2 y 2 2 ) (d 1 d 2 1)x 1 x 2 l 2 +l(x 1 + x 2 ) ; and f + (x;y;d) := 8 > > > > > > > > > > < > > > > > > > > > > : d 1 y 2 1 x 1 + d 2 y 2 2 x 2 if x 1 + x 2 1 d 1 y 2 1 1x 2 + d 2 y 2 2 x 2 if 0 x 1 + x 2 1(x 1 y 2 d 1 x 2 y 1 )=y 2 d 1 y 2 1 x 1 + d 2 y 2 2 1x 1 if 0 x 1 + x 2 1(x 2 y 1 d 2 x 1 y 2 )=y 1 f(x;y;x 1 + x 2 1) o.w. 
Then, the set clconv(Z + ) can be expressed as clconv(Z + )=f(x;y;t)2[0;1] 2 R 3 + : t f + (x 1 ;x 2 ;y 1 ;y 2 ;d 1 ;d 2 )g: Proof. First, observe that we may assume x 1 ;x 2 > 0, as otherwise x 1 + x 2 1 and f + reduces to the perspective function for the univariate case. To find the representation in the original space of variables, we first project out variables z in Proposition 23. Specifically, notice that g(l) can be rewritten in the following form by letting z i =lw i ;i= 1;2: g(l)= min d 1 (y 1 z 1 ) 2 x 1 l + d 2 (y 2 z 2 ) 2 x 2 l + d 1 z 2 1 + 2z 1 z 2 + d 2 z 2 2 l (3.17) s.t. z i 0; i= 1;2: (s i ) By Proposition 23, a point(x;y;t)2[0;1] 2 R 3 + belongs to clconv(Z + ) if and only if t min l2L x g(l). We first assume x 1 + x 2 1> 0, which implies l > 0;8l2L x . For given l2L x , optimization 83 problem (3.17) is convex with affine constraints, thus Slater condition holds. Hence, the following KKT conditions are necessary and sufficient for the minimizer: 2d 1 x 1 l (z 1 y 1 )+ 2(d 1 z 1 + z 2 ) l s 1 = 0 (3.18a) 2d 2 x 2 l (z 2 y 2 )+ 2(d 2 z 2 + z 1 ) l s 2 = 0 (3.18b) z 1 s 1 = 0 (3.18c) z 2 s 2 = 0 (3.18d) s i ;z i 0; i= 1;2: (3.18e) Let us analyze the KKT system considering the positiveness of s 1 and s 2 . Case s 1 > 0. By (3.18c), z 1 = 0 and by (3.18a), z 2 > 0, which implies s 2 = 0 from (3.18d). Hence, (3.18a) and (3.18b) reduce to 2z 2 l = 2d 1 x 1 l y 1 + s 1 2d 2 x 2 l (z 2 y 2 )+ 2d 2 z 2 l = 0: Solving these two linear equations, we get z 2 = y 2 x 2 l and s 1 = 2( y 2 x 2 d 1 y 1 x 1 l ). This also indi- cates s 1 0 iffl(x 1 y 2 d 1 x 2 y 1 )=y 2 . By replacing the variables with their optimal values in the objective function (3.17), we find that g(l)= d 1 y 2 1 x 1 l + d 2 x 2 l y 2 y 2 x 2 l 2 + d 2 l y 2 x 2 l 2 (3.19a) = d 1 y 2 1 x 1 l + d 2 y 2 2 x 2 (3.19b) whenl2[0;(x 1 y 2 d 1 x 2 y 1 )=y 2 ]\L x . 84 Case s 2 > 0. Similarly, we find that g(l)= d 1 y 2 1 x 1 + d 2 y 2 2 x 2 l (3.20) whenl2[0;(x 2 y 1 d 2 x 1 y 2 )=y 1 ]\L x . Case s 1 = s 2 = 0. In this case, (3.18a) and (3.18b) reduce to 0 B @ d 1 x 1 x 1 l x 2 l d 2 x 2 1 C A 0 B @ z 1 z 2 1 C A =l 0 B @ d 1 y 1 d 2 y 2 1 C A : Ifl > 0, the determinant of the matrix is(d 1 d 2 1)x 1 x 2 +l(x 1 +x 2 l)> 0 and the system has a unique solution. It follows that 0 B @ z 1 z 2 1 C A =l 0 B @ d 1 x 1 x 1 l x 2 l d 2 x 2 1 C A 1 0 B @ d 1 y 1 d 2 y 2 1 C A ; i.e., z 1 = l(d 1 d 2 x 2 y 1 +(l x 1 )d 2 y 2 ) (d 1 d 2 1)x 1 x 2 l 2 +l(x 1 + x 2 ) ; z 2 = l(d 1 d 2 x 1 y 2 +(l x 2 )d 1 y 1 ) (d 1 d 2 1)x 1 x 2 l 2 +l(x 1 + x 2 ) : Therefore, the bounds z 1 ;z 2 0 imply lower bounds l(x 1 y 2 d 1 x 2 y 1 )=y 2 ; l(x 2 y 1 d 2 x 1 y 2 )=y 1 onl. Moreover, from (3.18a) and (3.18b), we have d 1 (y 1 z 1 ) x 1 l = d 1 z 1 + z 2 l and d 2 (y 2 z 2 ) x 2 l = d 2 z 2 + z 1 l 85 By substituting the two equalities in (3.17), we find that g(l)= d 1 y 1 z 1 + y 1 z 2 + d 2 y 2 z 2 + y 2 z 1 =l = (d 1 d 2 1)(d 1 x 2 y 2 1 + d 2 x 1 y 2 2 )+ 2ld 1 d 2 y 1 y 2 +l(d 1 y 2 1 + d 2 y 2 2 ) (d 1 d 2 1)x 1 x 2 l 2 +l(x 1 + x 2 ) : Therefore, g(l)= f(x;y;l;d) (3.21) whenl2[maxf(x 1 y 2 d 1 x 2 y 1 )=y 2 ;(x 2 y 1 d 2 x 1 y 2 )=y 1 g;+¥)\L x : To see that the three pieces of g(l) considered above are, indeed, mutually exclusive, observe that when l(x 1 y 2 d 1 x 2 y 1 )=y 2 , this is, y 2 (x 1 l) x 2 y 1 d 1 , we have d 2 y 2 y 1 x 1 l x 2 d 1 d 2 1. Since x 1 l x 2 x 2 l x 1 x 1 x 2 x 2 x 1 = 1, it holds d 2 y 2 y 1 x 1 l x 2 x 1 l x 2 x 2 l x 1 ; that is,l(x 2 y 1 d 2 x 1 y 2 )=y 1 . 
Finally, notice when x 1 + x 2 1 0,l may take the value 0. In this case, (3.17) reduces to g(0)= d 1 y 2 1 x 1 + d 2 y 2 2 x 2 By Lemma 1, min l2L x g(l)= g(maxf0;x 1 + x 2 1g). Combining this fact with the above discus- sion, Proposition 24 holds. Remark 6. For further intuition, we now comment on the validity of each piece of t f + (x;y;d) over [0;1] 2 R 3 + forZ + . Because the first piece can be obtained by dropping the nonnegative cross product term y 1 y 2 and then strengthening t y 2 1 + y 2 2 using perspective reformulation, it is valid everywhere. When x 1 +x 2 < 1 and y 1 ;y 2 > 0, t y 2 i =x i +y 2 j =(1x i )> f + (x;y;1;1) for i6= j. Therefore, the second and the third pieces are not valid on the domain[0;1] 2 R 3 + . If d 1 d 2 > 1, the last piece t f(x;y;x 1 + x 2 1;d) is not valid for clconv(Z + ) everywhere, as seen by exhibiting a point(x;y;t)2 clconv(Z + ) violating t f(x;y;x 1 + x 2 1;d). To do so, let (x 1 ;x 2 ;y 1 ;y 2 ;t)=(0:5; 1 d 1 d 2 + 1 +e; 1 p d 1 ;2 p d 1 ; f + (x;y)); 86 where e > 0 is small enough so that x 1 + x 2 < 1, i.e., x 2 < 0:5. With this choice, f + (x;y) = d 1 y 2 1 =x 1 + d 2 y 2 2 =x 2 : Let ˜ l = x 1 + x 2 1, then ˜ l(x 1 + x 2 ) ˜ l 2 = ˜ l. Hence, for point (x;y;t), we have f(x;y; ˜ l;d)= (d 1 d 2 1)(d 1 x 2 y 2 1 + d 2 x 1 y 2 2 )+ 2 ˜ ld 1 d 2 y 1 y 2 + ˜ l(d 1 y 2 1 + d 2 y 2 2 ) (d 1 d 2 1)x 1 x 2 + ˜ l = (d 1 d 2 1)x 1 x 2 (d 1 y 2 1 =x 1 + d 2 y 2 2 =x 2 )+ ˜ l(d 1 y 2 1 + 2d 1 d 2 y 1 y 2 + d 2 y 2 2 ) (d 1 d 2 1)x 1 x 2 + ˜ l =(1a) f + (x;y)+a(d 1 y 2 1 + 2d 1 d 2 y 1 y 2 + d 2 y 2 2 ); wherea = ˜ l=((d 1 d 2 1)x 1 x 2 + ˜ l). Since ˜ l < 0,a< 0 if and only if (d 1 d 2 1)x 1 x 2 + x 1 + x 2 1> 0 () d 1 d 2 > (1 x 1 )(1 x 2 ) x 1 x 2 = 1 x 2 1 (by x 1 = 0:5) () x 2 > 1 d 1 d 2 + 1 ; which is true by the choice of x 2 . Moreover, f + (x;y)= d 1 y 2 1 =x 1 + d 2 y 2 2 =x 2 = 2+ 8d 1 d 2 >d 1 y 2 1 + 2d 1 d 2 y 1 y 2 + d 2 y 2 2 = 1+ 8d 1 d 2 : This indicates f(x;y; ˜ l;d)>(1a) f + (x;y)+a f + (x;y)= f + (x;y)= t; that is, t f(x;y;x 1 +x 2 1;d) is violated. Observe that if d 1 d 2 = 1, then f(x;y;x 1 + x 2 1;d) reduces to the original quadratic d 1 y 2 1 + 2y 1 y 2 + d 2 y 2 . Otherwise, although t f(x;y;x 1 + x 2 1;d) appears complicated, the next propo- sition implies that it is convex over its restricted domain and can, in fact, be stated as an SDP constraint. This results strongly indicates that SOCP-representable relaxations of (QI) may be in- adequate to describe the convex hull of the relevant mixed-integer sets, unless a large number of additional variables are added. The proof of Proposition 25 can be found in the Appendix. 87 Proposition 25. If d 1 d 2 > 1 and x 1 + x 2 1> 0, then t f(x;y;x 1 + x 2 1;d) can be rewritten as the SDP constraint 0 B B B B @ t=(d 1 d 2 1) y 1 y 2 y 1 d 2 x 1 + x 2 =d 1 1=d 1 x 1 x 2 + 1 y 2 x 1 x 2 + 1 x 1 =d 2 + d 1 x 2 1=d 2 1 C C C C A 0: From Proposition 24, we get the convex hull of rank-one caseX + by setting d 1 = d 2 = 1. Corollary 2. clconv(X + )= (x;y;t)2[0;1] 2 R 3 + : t f 1+ (x;y) ; where f 1+ (x;y)= 8 > > > > > > > > > > < > > > > > > > > > > : y 2 1 x 1 + y 2 2 x 2 if x 1 + x 2 1 y 2 2 x 2 + y 2 1 1x 2 if 0 x 1 + x 2 1(x 1 y 2 x 2 y 1 )=y 2 y 2 1 x 1 + y 2 2 1x 1 if 0 x 1 + x 2 1(x 2 y 1 x 1 y 2 )=y 1 (y 1 + y 2 ) 2 o.w. 
3.4.3 Rank-one approximations ofZ + We now consider valid inequalities analogous to the ones given in Proposition 22 forZ : Consider the two decompositions of the bivariate quadratic function given by d 1 y 2 1 + 2y 1 y 2 + d 2 y 2 2 = d 1 (y 1 + y 2 d 1 ) 2 +(d 2 1 d 1 )y 2 2 = d 2 ( y 1 d 2 + y 2 ) 2 +(d 1 1 d 2 )y 2 2 : 88 Applying perspective reformulation and Corollary 2 to the separable and pairwise quadratic terms, respectively, one can obtain two simple valid inequalities forZ + : t d 1 f 1+ (x 1 ;x 2 ;y 1 ; y 2 d 1 )+(d 2 1 d 1 ) y 2 2 x 2 (3.22a) t d 2 f 1+ (x 1 ;x 2 ; y 1 d 2 ;y 2 )+(d 1 1 d 2 ) y 2 1 x 1 : (3.22b) The following example shows that the inequalities above do not describe clconv(Z + ), highlighting the more complicated structure of clconv(Z + ) compared to its complementary set clconv(Z ). Example 3. ConsiderZ + with d 1 = d 2 = d= 2, and let x 1 = x 2 = x= 2=3, y 1 = y 2 = y> 0 and t= f + (x;y). Then(x;y;t)2 clconv(Z + ). On the one hand, x 1 + x 2 > 1 implies t= f + (x;y)= f(2=3;2=3;y;y;1=3)= 133 11 y 2 : On the other hand, f 1+ (x;x;y;y=d)=(y+ y=d) 2 = 9=2y 2 indicates that (3.22) reduces to t 27 4 y 2 : Since 133 11 y 2 > 27 4 y 2 , (3.22) holds strictly at this point. 3.5 An SDP relaxation for (QI) In this section, we will give an extended SDP relaxation for (QI) utilizing the convex hull results obtained in the previous section. Introducing a symmetric matrix variable Y , let us write (QI) as min a 0 x+ b 0 y+hQ;Yi : Y yy 0 ;(x;y)2I n : (3.23) 89 Suppose for a class of PSD matricesPS n + we have an underestimator f P (x;y) for y 0 Py for any P2P. Then, sincehP;Yi y 0 Py, we obtain a valid inequality f P (x;y)hP;Yi 0; P2P (3.24) for (3.23). For example, if P is the set of diagonal PSD matrices and f P (x;y)=å i P ii y 2 i =x i , for P2P, then inequality (3.24) is the perspective inequality. Furthermore, since (3.24) holds for any P2P, one can take the supremum over all P2P to get an optimal valid inequality of the type (3.24) sup P2P f P (x;y)hP;Yi 0: (3.25) In the example of perspective reformulation, inequality (3.25) becomes sup P0 diagonal ( å i P ii y 2 i =x i Y ii ) 0; which can be further reduced to the closed form y 2 i Y ii x i ;8i2[n]. This leads to the the optimal perspective formulation [88] min a 0 x+ b 0 y+hQ;Yi (3.26a) (OptPersp) s.t. Y yy 0 0 (3.26b) y 2 i Y ii x i 8i2[n] (3.26c) 0 x 1; y 0: (3.26d) Han et al. [118] show that OptPersp is equivalent to the Shor’s SDP relaxation [203] for problem (3.1). 90 LettingP be the class of 2 2 PSD matrices and f P () as the function describing the convex hull of the mixed-integer epigraph of y 0 Py, one can derive new valid inequalities for (QI). Specif- ically, using the extended formulations for f + (x;y;d) and f (x;y;d) describing clconv(Z + ) and clconv(Z ), we have f + (x;y;d)= min z;l d 1 (y 1 z 1 ) 2 x 1 l + d 2 (y 2 z 2 ) 2 x 2 l + d 1 z 2 1 + 2z 1 z 2 + d 2 z 2 2 l (3.27a) s.t. z 1 0;z 2 0 (3.27b) maxf0;x 1 + x 2 1gl minfx 1 ;x 2 g; (3.27c) and f (x;y;d)= min z;l d 1 (y 1 z 1 ) 2 x 1 l + d 2 (y 2 z 2 ) 2 x 2 l + d 1 z 2 1 2z 1 z 2 + d 2 z 2 2 l (3.28a) s.t. z 1 y 1 ;z 2 y 2 (3.28b) maxf0;x 1 + x 2 1gl minfx 1 ;x 2 g: (3.28c) Since any 2 2 symmetric PSD matrix P can be rewritten in the form of P= p d 1 1 1 d 2 or P= p d 1 1 1 d 2 ; we can take f P (x;y)= p f + (x;y;d) or f P (x;y)= p f (x;y;d), correspondingly. Since we have the explicit form of f + () and f (), for any fixed d, (3.24) gives a nonlinear valid in- equality which can be added to (3.23). 
Alternatively, (3.27) and (3.28) can be used to reformulate these inequalities as conic quadratic inequalities in an extended space. Moreover, maximizing the inequalities gives the optimal valid inequalities among the class of of 2 2 PSD matrices stated below. Recall thatD :=fd2R 2 : d 1 0;d 2 0;d 1 d 2 1g. Proposition 26. For any pair of indices i< j, the following inequalities are valid for (QI): max d2D f + (x i ;x j ;y i ;y j ;d 1 ;d 2 ) d 1 Y ii d 2 Y j j 2Y i j 0; (3.29a) max d2D f (x i ;x j ;y i ;y j ;d 1 ;d 2 ) d 1 Y ii d 2 Y j j + 2Y i j 0: (3.29b) 91 Optimal inequalities (3.29) may be employed effectively if they can be expressed explicitly. We will now show how to write inequalities (3.29) explicitly using an auxiliary 3 3 matrix variable W. Lemma 2. A point(x 1 ;x 2 ;y 1 ;y 2 ;Y 11 ;Y 12 ;Y 22 ) satisfies inequality (3.29a) if and only if there exists W + 2S 3 + such that the inequality system W + 12 Y 12 (3.30a) (Y 11 W + 11 )(x 1 W + 33 )(y 1 W + 31 ) 2 ;W + 11 Y 11 ;W + 33 x 1 (3.30b) (Y 22 W + 22 )(x 2 W + 33 )(y 2 W + 32 ) 2 ;W + 22 Y 22 ;W + 33 x 2 (3.30c) W + 31 0;W + 32 0 (3.30d) W + 33 x 1 + x 2 1 (3.30e) is feasible. Lemma 3. A point(x 1 ;x 2 ;y 1 ;y 2 ;Y 11 ;Y 12 ;Y 22 ) satisfies inequality (3.29b) if and only if there exists W 2S 3 + such that the inequality system Y 12 W 12 (3.31a) (Y 11 W 11 )(x 1 W 33 )(y 1 W 31 ) 2 ;W 11 Y 11 ;W 33 x 1 (3.31b) (Y 22 W 22 )(x 2 W 33 )(y 2 W 32 ) 2 ;W 22 Y 22 ;W 33 x 2 (3.31c) W 31 y 1 ;W 32 y 2 (3.31d) W 33 x 1 + x 2 1 (3.31e) is feasible. 92 Proof of Lemma 2. The Lemma is proved by means of conic duality. For brevity, dual variables associated with each constraint are introduced in the formulation below. Writing f + as a conic quadratic minimization problem as in (3.27), we first express inequality (3.29a) as 0 max d2D min t;l;z d 1 t 1 + d 2 t 2 +t 3 d 1 Y 11 d 2 Y 22 2Y 12 s.t. t 1 (x 1 l)(y 1 z 1 ) 2 ;t 1 0;x 1 l 0 (d 1 ;s 1 ;h 1 ) t 2 (x 2 l)(y 2 z 2 ) 2 ;t 2 0;x 2 l 0 (d 2 ;s 2 ;h 2 ) lt 3 kB + zk 2 2 ;l 0;t 3 0 (1;s 3 ;g) l x 1 + x 2 1 (a) z 1 ;z 2 0; (r 1 ;r 2 ) where B 2 + = d 1 1 1 d 2 : Taking the dual of the inner minimization, the inequality can be written as 0 max d2D max a;h;g;s;r å i=1;2 (x i s i + 2y i h i )+(x 1 + x 2 1)a d 1 Y 11 d 2 Y 22 2Y 12 s.t. d i s i h 2 i ; i= 1;2 s 3 kgk 2 2 r 1 ;r 2 ;a 0 a+ s 3 = s 1 + s 2 (l) 0 B @ r 1 2h 1 r 2 2h 2 1 C A = 2B + 0 B @ g 1 g 2 ; 1 C A ; (z) Note that one can obtain a strictly dual feasible solution by taking s i ;i2[3] sufficiently large. Due to Slater condition, we deduce that strong duality holds. We first assume d 1 d 2 > 1. Then, the last 93 equation impliesg = B 1 + (r=2h). Substituting outg and s 3 , and letting u i =h i r i =2;i= 1;2, the maximization problem is further reduced to 0 max d2D max a;h;s;u å i=1;2 (x i s i + 2y i h i )+(x 1 + x 2 1)a d 1 Y 11 d 2 Y 22 2Y 12 s.t. d i s i h 2 i ; i= 1;2 h i u i ; i= 1;2 a 0 s 1 + s 2 a u 0 h d 1 1 1 d 2 i 1 u: Applying Schur Complement Lemma to the last inequality, we reach 2Y 12 max h;s;u;r;d å i=1;2 (x i s i + 2y i h i )+(x 1 + x 2 1)a d 1 Y 11 d 2 Y 22 s.t. d i s i h 2 i ; i= 1;2 (p i ;q i ;w i ) h i u i ; i= 1;2 (v i ) a 0 (b) 0 B B B B @ s 1 + s 2 a u 1 u 2 u 1 d 1 1 u 2 1 d 2 1 C C C C A 0: (W + ) Note the SDP constraint implies d2 D. If d 1 d 2 = 1, then B + is singular. 
In this case, one can apply the same argument to the Moore-Penrose pseudo inverse of B + (see p108, Ch12 and Corol- lary 15.3.2 in [194]) and use the generalized Schur Complement Lemma (see 7.3.P8 in [137]) to 94 deduce the last SDP constraint. Finally, taking the SDP dual of the maximization problem we arrive at 2Y 12 min p;q;w;v;W + 2W + 12 s.t. p i q i w 2 i ; p i ;q i 0; i= 1;2 v i 0; i= 1;2 q i +W + 33 = x i ; i= 1;2 (s i ) p i +W + ii = Y ii ; i= 1;2 (d i ) 2w i + v i = 2y i ; i= 1;2 (h i ) 2W + 3i = v i ; i= 1;2 (u i ) bW + 33 = 1 x 1 x 2 (a) b 0; W + 0: One can obtain a strictly primal feasible solution by taking d i ;s i ;i= 1;2 sufficiently large, which implies strong SDP duality holds due to Slater condtion. Substituting out p;q;w;v;b, we arrive at (3.30). The proof of Lemma 3 is similar and is omitted for brevity. Since both (3.29a) and (3.29b) are valid, using (3.30) and (3.31) together, one can obtain an SDP relaxation of (QI). While inequalities in (3.30) and (3.31) are quite similar, in general, W + and W do not have to coincide. However, we show below that choosing W + = W , the resulting SDP formulation is still valid and it is at least as strong as the strengthening obtained by valid inequalities (3.29). 95 LetW be the set of points (x 1 ;x 2 ;y 1 ;y 2 ;Y 11 ;Y 12 ;Y 22 ) such that there exists a 3 3 matrix W satisfying W 12 = Y 12 (3.35a) (Y 11 W 11 )(x 1 W 33 )(y 1 W 31 ) 2 ;W 11 Y 11 ;W 33 x 1 (3.35b) (Y 22 W 22 )(x 2 W 33 )(y 2 W 32 ) 2 ;W 22 Y 22 ;W 33 x 2 (3.35c) 0 W 31 y 1 ; 0 W 32 y 2 (3.35d) W 33 x 1 + x 2 1 (3.35e) W 0 (3.35f) Then, usingW for every pair of indices, we can define the strengthened SDP formulation min a 0 x+ b 0 y+hQ;Yi (3.36a) (OptPairs) s.t. Y yy 0 0 (3.36b) (x i ;x j ;y i ;y j ;Y ii ;Y i j ;Y j j )2W 8i< j (3.36c) 0 x 1; y 0: (3.36d) Proposition 27. OptPairs is a valid convex relaxation of (QI) and every feasible solution to it satisfies all valid inequalities (3.29). Proof. To see that OptPairs is a valid relaxation, consider a feasible solution(x;y) of (QI) and let Y = yy 0 . For i< j, if x i = x j = 1, constraint (3.36c) is satisfied with W = Y ii Y i j y i Y i j Y j j y j y i y j 1 : Otherwise, without loss of generality, one may assume x i = 0. It follows that Y ii = y 2 i = Y i j = y i y j = 0. Then, constraint (3.36c) is satisfied with W = 0. Moreover, if W satisfies (3.36c), then W satisfies (3.30) and (3.31) simultaneously. 96 3.6 Comparison of convex relaxations In this section, we compare the strength of OptPairs with other convex relaxations of (QI). The perspective relaxation and the optimal perspective relaxationOptPersp for (QI) are well-known. Proposition 28. OptPairs is at least as strong asOptPersp. Proof. Note that (3.36c) includes constraints 0 B @ Y ii y i y i x i 1 C A 0 B @ W 11 W 31 W 31 W 33 1 C A 0; corresponding to (3.35b)-(3.35c). Thus, the perspective constraints Y ii x i y 2 i are implied. In the context of linear regression, Atamturk and Gomez [18] study the convex hull of the epigraph of rank-one quadratic with indicators X f = ( (x;y;t)2f0;1g n R n+1 : t n å i=1 y i 2 ; y i (1 x i )= 0;i2[n] ) ; where the continuous variables are unrestricted in sign. Their extended SDP formulation based on clconv(X f ), leads to the following relaxation for (QI) min a 0 x+ b 0 y+hQ;Yi (3.37a) s.t. 
Y yy 0 0 (3.37b) y 2 i Y ii x i 8i (3.37c) (OptRankOne) 0 B B B B @ x i + x j y i y j y i Y ii Y i j y j Y i j Y j j 1 C C C C A 0; 8i< j (3.37d) y 0; 0 x 1: (3.37e) 97 With the additional constraints (3.37d), it is immediate that OptRankOne is stronger than OptPersp. The following proposition comparesOptRankOne andOptPairs. Proposition 29. OptPairs is at least as strong asOptRankOne. Proof. It suffices to show that for each pair i< j, constraint (3.36c) ofOptPairs implies (3.37d) of OptRankOne. Rewriting (3.35b)–(3.35c), we get W 11 Y 11 (y 1 W 31 ) 2 x 1 W 33 ; W 22 Y 22 (y 2 W 32 ) 2 x 2 W 33 Combining the above and (3.35a) to substitute out W 11 ;W 22 and W 12 in W 0, we arrive at 0 B B B B @ Y 11 (y 1 W 31 ) 2 x 1 W 33 Y 12 W 31 Y 12 Y 22 (y 2 W 32 ) 2 x 2 W 33 W 32 W 31 W 32 W 33 1 C C C C A 0; W 33 x 1 ; W 33 x 2 ; which is equivalent to the following matrix inequality by Shur Complement Lemma 0 B B B B B B B B B B @ Y 11 Y 12 W 31 y 1 W 31 0 Y 12 Y 22 W 32 0 y 2 W 32 W 31 W 32 W 33 0 0 y 1 W 31 0 0 x 1 W 33 0 0 y 2 W 32 0 0 x 2 W 33 1 C C C C C C C C C C A 0: 98 By adding the third row/column to the forth row/column and then adding the forth row/column to the fifth row/column, the large matrix inequality can be rewritten as 0 B B B B B B B B B B @ Y 11 Y 12 W 31 y 1 y 1 Y 12 Y 22 W 32 W 32 y 2 W 31 W 32 W 33 W 33 W 33 y 1 W 32 W 33 x 1 x 1 y 1 y 2 W 33 x 1 x 1 + x 2 W 33 1 C C C C C C C C C C A 0: Because W 33 0, it follows that 0 B B B B @ Y 11 Y 12 y 1 Y 12 Y 22 y 2 y 1 y 2 x 1 + x 2 1 C C C C A 0 B B B B @ Y 11 Y 12 y 1 Y 12 Y 22 y 2 y 1 y 2 x 1 + x 2 W 33 1 C C C C A 0: Therefore, constraints (3.37d) are implied by (3.36c), proving the claim. The example below illustrates that OptPairs is indeed strictly stronger than OptPersp and OptRankOne. Example 4. For n= 2,OptPairs is the ideal (convex) formulation of (QI). For the instance of (QI) with a= 0 B @ 1 5 1 C A ;b= 0 B @ 8 5 1 C A ;Q= 0 B @ 5 2 2 1 1 C A each of the other convex relaxations has a fractional optimal solution as demonstrated in Table 3.1. Tab. 3.1: Comparison of convex relaxations of (QI). obj val x 1 x 2 y 1 y 2 OptPersp -2.866 0.049 0.268 0.208 1.369 OptRankOne -2.222 0.551 0.449 0.0 2.007 OptPairs -2.200 1.0 0.0 0.800 0.0 99 Notably, the fractional x values for OptPersp and OptRankOne are far from their optimal in- teger values. A common approach to quickly obtain feasible solutions to NP-hard problems is to round a solution obtained from a suitable convex relaxation. This example indicates that feasible solutions obtained in this way from formulation OptPairs may be of higher quality than those ob- tained from weaker relaxations – our computations in §3.7.2 further corroborates this intuition. An alternative way of constructing strong relaxations for (QI) is to decompose the quadratic function y 0 Qy into a sum of univariate and bivariate convex quadratic functions and utilize the convex hull results of 2 2 quadratics a i j q i j (y i ;y j )=b i j x 2 i 2y i y j +g i j y 2 j ; wherea i j > 0, in Section 3.4 for each term, see [101] for such an approach. Specifically, let y 0 Qy= y 0 Dy+ å (i; j)2P a i j q i j (y i ;y j )+ å (i; j)2N a i j q i j (y i ;y j )+ y 0 Ry where D is a diagonal PSD matrix,P=N is the set of quadratics q i j () with positive/negative off- diagonals and R is PSD remainder matrix. 
Applying the convex hull description for each univariate and bivariate term we obtain the following convex relaxation for (QI): min a 0 x+ b 0 y+ n å i=1 D ii y 2 i =x i + å (i; j)2P a i j f + (x i ;x j ;y i ;y j ;b i j ;g i j ) (Decomp) + å (i; j)2N a i j f (x i ;x j ;y i ;y j ;b i j ;g i j )+ y 0 Ry s.t. 0 x 1; y 0: The next proposition shows thatOptPairs dominatesDecomp. Similar duality arguments were used in [88, 101, 235]. Proposition 30. OptPairs is at least as strong asDecomp. Moreover, there exists a decomposition for whichDecomp is equivalent toOptPairs. 100 Proof. We prove the result via the minimax theory of concave-convex programs and show that Decomp can be viewed as a dual formulation of OptPairs. To make the dual relationship more transparent, we define z i j i = W i j 31 , z i j j = W i j 32 ,l i j = W i j 33 and L= (x;y;z;l)2[0;1] n R n + R n(n1) R n(n1)=2 : 0 z i j i y i ; 0 z i j j y j ; minf0;x i + x j 1gl i j maxfx i ;x j g;8i< j : Then,OptPairs can be rewritten as min x;y;z;l min Y;W a 0 x+ b 0 y+hQ;Yi (3.38a) s.t. Y yy 0 (R) Y ii y i z i j i 2 x i l i j W i j 11 z i j i 2 l i j 8i< j (` i j i ;u i j i ) Y j j y j z i j j 2 x j l i j W i j 22 z i j j 2 l i j 8i< j (` i j j ;u i j j ) 0 B @ W i j 11 Y i j Y i j W i j 22 1 C A 1 l i j 0 B @ z i j i z i j j 1 C A z i j i z i j j 8i< j (Q i j ) (x;y;z;l)2L: (3.38b) 101 Taking the SDP dual with respect to the inner minimization problem, one arrives at min x;y;z;l max R;Q i j ;`;u a 0 x+ b 0 y+ å i< j 2 6 4 ` i j i z i j i 2 l i j +` i j j z i j j 2 l i j + u i j i y i z i j i 2 x i l i j + u i j j y j z i j j 2 x j l i j + 1 l i j z i j 0 Q i j z i j 3 7 5 + R;yy 0 (3.39a) s.t. Q ii = R ii + å i< j u i j i + å i> j u ji i 8i (Y ii ) Q i j = R i j + Q i j 12 8i< j (Y i j ) 0=` i j i u i j i + Q i j 11 8i< j (W i j 11 ) 0=` i j j u i j j + Q i j 22 8i< j (W i j 22 ) R 0;Q i j 0;` i j 0;u i j 0 8i< j (3.39b) (x;y;z;l)2L: (3.39c) Since one can take the diagonal elements of Y and W i j large enough, there exists a strictly feasible solution to the inner minimization of (3.38), which implies strong duality holds and, thus, (3.38) is equivalent to (3.39). Next, substituting out u i j in (3.39a), one gets ` i j i z i j i 2 l i j +` i j j z i j j 2 l i j + u i j i y i z i j i 2 x i l i j + u i j j y j z i j j 2 x j l i j + 1 l i j z i j 0 Q i j z i j = ˜ Q i j 11 y i z i j i 2 x i l i j + ˜ Q i j 22 y j z i j j 2 x j l i j + z i j 0 ˜ Q i j z i j l i j ; 102 where ˜ Q i j = 2 6 4 ` i j i + Q i j 11 Q i j 12 Q i j 12 ` i j j + Q i j 22 3 7 5 . By changing variables Q i j ˜ Q i j , one arrives at min (x;y;z;l)2L max R;Q i j å i< j 2 6 4 Q i j 11 y i z i j i 2 x i l i j + Q i j 22 y j z i j j 2 x j l i j + z i j 0 Q i j z i j l i j 3 7 5 + R;yy 0 + a 0 x+ b 0 y (3.40a) s.t. Q ii R ii + å i< j Q i j 11 + å i> j Q ji 22 8i (3.40b) Q i j = R i j + Q i j 12 8i< j (3.40c) R 0;Q i j 0; 8i< j (3.40d) which is equivalent to (3.39). Notice that (3.40b) is, in fact, tight. Thus, (3.40b), (3.40c),and (3.40d) define a valid decomposition of Q. Moreover, Q i j 2 ;kRk 2 Trace(Q) by (3.40b), which implies the feasible region of the inner maximization problem is compact. Therefore, according to V on Neumann’s Minimax Theorem [204], one can interchange max and min without loss of equivalence and arrive at max R;Q i j :(3.40b)-(3.40d) min z i j ;l i j å i< j 2 6 4 Q i j 11 y i z i j i 2 x i l i j + Q i j 22 y j z i j j 2 x j l i j + z i j 0 Q i j z i j l i j 3 7 5 + R;yy 0 + a 0 x+ b 0 y s.t. 
(x;y;z;l)2L where the inner minimization problem is in the formDecomp from Proposition 23. 103 3.7 Computations In this section, we report on computational experiments performed to test the effectiveness the formulations derived in the chapter. Section 3.7.1 is devoted to synthetic portfolio optimization instances, where matrix Q is diagonally dominant and the conic quadratic-representable extended formulations developed in Section 3.4 can be readily used in a branch-and-bound algorithm without the need for an SDP constraint. The instances here are generated similarly to [16], and serve to check the incremental value of convexifications based onZ + compared to those based on only Z . In Section 3.7.2, we use real instances derived from stock market returns and test the SDP relaxationOptPairs derived in Section 3.5, as well as mixed-integer optimization approaches based on decompositions of the quadratic matrices. 3.7.1 Synthetic instances – the diagonally dominant case We consider a standard cardinality-constrained mean-variance portfolio optimization problem of the form min x;y 8 > < > : y 0 Qy : b 0 y r; 1 0 x k 0 y x; x2f0;1g n 9 > = > ; (3.41) where Q is the covariance matrix of returns, b2R n is the vector of the expected returns, r is the target return and k is the maximum number of securities in the portfolio. All experiments are conducted using Mosek 9.1 solver on a laptop with a 2.30GHz Intel ® Core TM i9-9880H CPU and 64 GB main memory. The time limit is set to one hour and all other settings are default by Mosek. 3.7.1.1 Instance generation We adopt the method used in [16] to generate the instances. The instances are designed to control the integrality gap of the instances and the effectiveness of the perspective formulation. Letr 0 be a parameter controlling the ratio of the magnitude positive off-diagonal entries of Q to the 104 magnitude of the negative off-diagonal entries of Q. Lower values of r lead to higher integrality gaps. Let d 0 be the parameter controlling the diagonal dominance of Q. The perspective formulation is more effective in closing the integrality gap for higher values of d. The following steps are followed to generate the instances: Construct an auxiliary matrix ¯ Q by drawing a factor covariance matrix G 2020 uniformly from [1;1], and generating an exposure matrix H n20 such that H i j = 0 with probability 0.75, and H i j drawn uniformly from[0;1], otherwise. Let ¯ Q= HGG 0 H 0 . Construct off-diagonal entries of Q: For i6= j, set Q i j = ¯ Q i j , if ¯ Q i j < 0 and set Q i j =r ¯ Q i j otherwise. Positive off-diagonal elements of ¯ Q are scaled by a factor ofr. Construct diagonal entries of Q: Pickm i uniformly from[0;d ¯ s], where ¯ s = 1 n å i6= j jQ i j j. Let Q ii =å i6= j jQ i j j+m i . Note that ifd =m i = 0, then matrix Q is already diagonally dominant. Construct b;r;k: b i is drawn uniformly from[0:5Q ii ;1:5Q ii ], r= 0:25å n i=1 b i , and k=bn=5c. Matrices Q generated in this way have only 20.1% of the off-diagonal entries negative on average. 3.7.1.2 Formulations With above setting, the portfolio optimization problem can be rewritten as min å i2[n] m i z i + å Q i j <0 jQ i j jt i j + å Q i j >0 jQ i j jt i j s.t. (x i ;y i ;z i )2X 0 ;8i2 N; (x i ;x j ;y i ;y j ;t i j )2Z ;8i> j : Q i j < 0; (x i ;x j ;y i ;y j ;t i j )2Z + ;8i> j : Q i j > 0; b 0 y r; 1 0 x k; (3.42) whereZ + andZ are defined as before with d 1 = d 2 = 1. 
Four strong formulations are tested by replacing the mixed-integer sets with their convex hulls: (1) ConicQuadPersp, obtained by replacing X_0 with clconv(X_0) using the perspective reformulation; (2) ConicQuadN, obtained by replacing X_0 and Z^- with clconv(X_0) and clconv(Z^-) using the corresponding extended formulations; (3) ConicQuadP, obtained by replacing X_0 and Z^+ with clconv(X_0) and clconv(Z^+), respectively; and (4) ConicQuadP+N, obtained by replacing X_0, Z^- and Z^+ with clconv(X_0), clconv(Z^-) and clconv(Z^+), correspondingly.

3.7.1.3 Results

Table 3.2 shows the results for matrices with varying diagonal dominance $\delta$ for $\rho = 0.3$. Each row in the table represents the average over five instances generated with the same parameters. Table 3.2 displays the dimension of the problem n, the initial gap (igap), the root gap improvement (rimp), the number of branch-and-bound nodes (nodes), the elapsed time in seconds (time), and the end gap reported by the solver at termination (egap). In addition, in brackets, we report the number of instances solved to optimality within the time limit. The initial gap is computed as
\[
\text{igap} \;=\; \frac{\text{obj}_{\text{best}} - \text{obj}_{\text{cont}}}{|\text{obj}_{\text{best}}|} \times 100,
\]
where obj_best is the objective value of the best feasible solution found and obj_cont is the objective value of the natural continuous relaxation of the problem, i.e., the relaxation obtained by dropping the integrality constraints; rimp is computed as
\[
\text{rimp} \;=\; \frac{\text{obj}_{\text{relax}} - \text{obj}_{\text{cont}}}{\text{obj}_{\text{best}} - \text{obj}_{\text{cont}}} \times 100,
\]
where obj_relax is the objective value of the continuous relaxation of the corresponding formulation.

In Table 3.2, as expected, ConicQuadPersp has the worst performance in terms of root gap, end gap and solution time. It can only solve the instances with dimension n = 40, and some instances with dimension n = 60, to optimality. The rimp of ConicQuadPersp is less than 10% when the diagonal dominance is small, reflecting the fact that ConicQuadPersp provides strengthening only for the diagonal terms. ConicQuadN performs better than ConicQuadPersp, with rimp of about 10%-25%, and it solves all low-dimensional instances and most instances of dimension n = 60; however, it is still unable to solve the high-dimensional instances effectively. ConicQuadP performs much better than ConicQuadN on the instances considered, with significantly stronger root improvements (between 70% and 80% on average). Moreover, ConicQuadP solves almost all instances with n = 80 to near-optimality; for the instances it is unable to solve to optimality, the average end gap is less than 5%. By strengthening both the negative and the positive off-diagonal terms, ConicQuadP+N provides the best performance, with rimp above 90%. ConicQuadP+N solves all instances, most of them within 10 minutes. Finally, observe that as the diagonal dominance increases, the performance of all formulations improves: larger diagonal dominance results in more instances solved to optimality, smaller egap and shorter solution times for all formulations. For these instances, on average, the gap improvement is raised from 50.69% to 92.90% by incorporating the strengthening from off-diagonal coefficients.

Table 3.3 displays the computational results for different values of $\rho$ with fixed $\delta = 0.1$. The relative comparison of the formulations is similar to the one discussed above, with ConicQuadP+N resulting in the best performance. As $\rho$ increases, the performance of ConicQuadN deteriorates in terms of rimp while the performance of ConicQuadP improves, as expected.
The performance of Conic- QuadP+N also improves for high values ofr, and always results in significant improvement com- pared to other formulations for all instances. For these instances, on average, the gap improvement is raised from 9.77% to 85.38% by incorporating strengthening from off-diagonal coefficients. In summary, we conclude that utilizing convexification forZ + complement those previously obtained forZ , and together result in significantly higher root gap improvement over the simpler perspective relaxation. For the experiments in this section, we use the results of Section 3.4 to convexify pairwise quadratic terms, but do not utilize the more sophisticated SDP formulations in Section 3.5. For the instances in this section, the optimal perspective formulation [88, 235] achieves close to 100% root improvement, and all the mixed-integer optimization problems are solved in a few seconds. Moreover, the new convex formulation OptPairs produces integer (thus optimal) solutions in all instances. In the next section, we consider these stronger conic relaxations for the more realistic and challenging instances. 107 Tab. 3.2: Experiments with varying diagonal dominance,r = 0:3. n d igap ConicQuadPersp ConicQuadN ConicQuadP ConicQuadP+N Rimp Nodes Time Egap Rimp Nodes Time Egap Rimp Nodes Time Egap Rimp Nodes Time Egap 40 0.1 53.37 9.74 9,537 46 0.00[5] 23.44 3,439 44 0.00[5] 66.07 526 18 0.00[5] 86.93 65 13 0.00[5] 0.5 51.10 33.17 3,896 26 0.00[5] 47.86 1,335 18 0.00[5] 79.48 198 9 0.00[5] 95.01 24 9 0.00[5] 1.0 52.73 60.86 1,463 9 0.00[5] 74.62 375 7 0.00[5] 86.83 146 7 0.00[5] 97.46 23 8 0.00[5] Avg 52.40 34.59 4,965 27 0.00[15] 48.64 1,717 23 0.00[15] 77.46 290 11 0.00[15] 93.13 37 10 0.00[15] 60 0.1 46.90 9.05 316,000 3,363 5.53[1] 19.07 135,052 3,261 3.83[2] 76.78 4,898 498 0.00[5] 89.72 445 140 0.00[5] 0.5 50.97 38.46 134,542 1,888 2.65[3] 49.94 55,434 1,321 0.98[4] 82.72 1,652 267 0.00[5] 95.13 203 75 0.00[5] 1.0 47.22 60.04 21,440 317 0.00[5] 66.52 8,579 209 0.00[5] 94.69 86 35 0.00[5] 98.69 17 22 0.00[5] Avg 48.36 35.85 157,328 1,856 2.73[9] 45.18 66,355 1,597 1.60[11] 84.73 2,212 267 0.00[15] 94.51 222 79 0.00[15] 80 0.1 49.91 4.76 155,000 3,600 20.25[0] 21.96 69,609 3,600 14.38[0] 65.11 8,017 2,742 4.69[2] 83.33 2,142 1,416 0.00[5] 0.5 50.53 37.33 136,638 3,600 12.06[0] 49.16 63,897 3,600 7.49[0] 81.57 6,525 2,473 1.70[2] 94.21 341 261 0.00[5] 1.0 53.78 56.96 152,704 3,600 7.41[0] 69.41 45,388 3,068 2.95[2] 84.42 5,870 2,116 1.27[3] 95.67 365 275 0.00[5] Avg 51.41 33.02 148,114 3,600 13.24[0] 46.84 59,632 3,423 8.27[2] 77.03 6,804 2,443 2.55[7] 91.07 950 651 0.00[15] 108 Tab. 3.3: Experiments with varying positive off-diagonal entries,d = 0:1. 
n r igap ConicQuadPersp ConicQuadN ConicQuadP ConicQuadP+N Rimp Nodes Time Egap Rimp Nodes Time Egap Rimp Nodes Time Egap Rimp Nodes Time Egap 40 0.1 60.67 9.88 6,928 36 0.00[5] 23.11 1,869 22 0.00[5] 50.68 1,134 34 0.00[5] 72.23 158 19 0.00[5] 0.5 47.67 8.7 8,572 46 0.00[5] 21.67 3,181 41 0.00[5] 75.82 272 12 0.00[5] 91.78 53 11 0.00[5] 1.0 43.23 10.05 8529 44 0.00[5] 18.08 4,903 60 0.00[5] 82.33 149 8 0.00[5] 92.56 51 11 0.00[5] Avg 50.52 9.54 8,010 42 0.00[15] 20.95 3,317 41 0.00[15] 69.61 519 18 0.00[15] 85.53 87 14 0.00[15] 60 0.1 60.26 10.7 256,480 2,585 4.03[2] 27.85 34,563 847 0.00[5] 53.37 16,190 1,983 3.10[3] 78.5 1,016 264 0.00[5] 0.5 45.98 9.38 319,534 3,230 5.57[1] 19.21 103,869 3,043 4.24[2] 78.22 2,715 315 0.00[5] 91.09 259 107 0.00[5] 1.0 40.87 10.09 197,140 3,258 4.52[2] 15.93 98,289 2,982 4.34[2] 85.23 564 100 0.00[5] 91.66 135 72 0.00[5] Avg 49.03 10.06 257,718 3,024 4.71[5] 21 78,907 2,291 2.86[9] 72.27 6,490 799 1.03[13] 87.08 470 148 0.00[15] 80 0.1 64.85 9.88 142,299 3,600 24.78[0] 26.42 60,081 3,600 14.81[0] 46.63 11,367 3,172 17.40[1] 69.6 4,948 2,920 6.22[1] 0.5 47.97 9.27 148,252 3,600 18.46[0] 20.75 48,887 3,600 15.70[0] 73.3 7,245 3,019 2.72[2] 89.11 1,131 827 0.00[5] 1.0 41.69 10 149,563 3,600 14.79[0] 16.61 52,485 3,600 14.34[0] 84.7 3,769 1,444 0.88[4] 91.93 1,068 716 0.00[5] Avg 51.51 9.72 146,705 3,600 19.34[0] 21.26 53,818 3,600 14.95[0] 68.21 7,460 2,545 7.00[7] 83.54 2,382 1,487 2.07[11] 109 3.7.2 Real instances – the general case Now using real stock market data, we consider portfolio index tracking problem of the form min(y y B ) 0 Q(y y B ) (IT) s.t. 1 0 y= 1; 1 0 x k 0 y x; x2f0;1g n ; where y B 2R n is a benchmark index portfolio, Q is the covariance matrix of security returns and k is the maximum number of securities in the portfolio. 3.7.2.1 Instance generation We use the daily stock return data provided by Boris Marjanovic in Kaggle 1 to compute the co- variance matrix Q. Specifically, given a desired start date (either 1/1/2010 or 1/1/2015 in our computations), we compute the sample covariance matrix based on the stocks with available data in at least 99% of the days since the start (returns for missing data are set to 0). The resulting covariance matrices are available at https://sites.google.com/usc.edu/gomez/data. We then generate instances as follows: we randomly sample an n n covariance matrix Q corresponding to n stocks, and we draw each element of y B from uniform [0,1], and then scale y B so that 1 0 y B = 1. 1 https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs 110 3.7.2.2 Convex relaxations The natural convex relaxation of IT always yields a trivial lower bound of 0, as it is possible to set z= y= y B . Thus, we do not report results concerning the natural relaxation. Instead, we consider the optimal perspective relaxationOptPersp of [88]: min x;y;Y y 0 B Qy B 2y 0 B Qy+hQ;Yi (3.44a) s.t. Y yy 0 0 (3.44b) (OptPersp) y 2 i Y ii x i 8i2[n] (3.44c) 0 x 1; y 0 (3.44d) 1 0 y= 1; 1 0 x k (3.44e) and the proposedOptPairs exploiting off-diagonal elements of Q: min x;y;Y;W y 0 B Qy B 2y 0 B Qy+hQ;Yi s.t. 
Y yy 0 0 W i j 0 8i< j (Y ii W i j 11 )(x i W i j 33 )(y i W i j 31 ) 2 ; W i j 11 Y ii 8i< j (OptPairs) (Y j j W i j 22 )(x j W i j 33 )(y j W i j 32 ) 2 ; W i j 22 Y j j 8i< j W i j 33 x i + x j 1; W i j 33 x i ; W i j 33 x j 8i< j 0 W i j 31 y i ; 0 W i j 32 y j ; W i j 12 = Y i j 8i< j 0 x 1; y 0 1 0 y= 1; 1 0 x k; For each relaxation, we consider a simple rounding heuristic to obtain feasible solutions to (IT): given an optimal solution ( ¯ x; ¯ y) to the continuous relaxation, we fix x i = 1 for the k-largest values of ¯ x and the remaining x i = 0, and resolve the continuous relaxation to compute y. 111 3.7.2.3 Exact mixed-integer optimization approaches We also consider three mixed-integer optimization approaches, each associated with a different convex relaxation. The first one is the Natural relaxation corresponding to the mixed-integer quadratic formulation (IT). The second one is the correspondingOptPersp formulation min y 0 B Qy B 2y 0 B Qy+ y 0 Ry+ n å i=1 D ii t i (3.45a) s.t. t i x i y 2 i ; i2[n] (3.45b) 1 0 y= 1; 1 0 x k (3.45c) 0 y x; x2f0;1g n ; (3.45d) where D+ R= Q and R are the dual variables associated with constraint (3.44b). The third one is theOptPairs formulation based on the decomposition min y 0 B Qy B 2y 0 B Qy+ y 0 Ry+ å i< j t i j (3.46a) s.t. t i j Q i j ii y 2 i + Q i j j j y 2 j + 2Q i j i j y i y j 8i< j (3.46b) 1 0 y= 1; 1 0 x k (3.46c) 0 y x; x2f0;1g n ; (3.46d) where matrix R is the dual variable associated with constraint Y yy 0 0, and matrices Q i j are the dual variables associated with constraints W i j 0. The formulation is then obtained from the SOCP-representable convexification of constraints (3.46b) using Proposition 23 (if Q i j i j 0) or Remark 5 (if Q i j i j < 0). 112 3.7.2.4 Results In these experiments, the solution time limit is set to 20 minutes, which includes the time required to solve the SDP relaxations to find suitable decompositions. Tables 3.4 and 3.5 present the results using historical data since 2010 and 2015, respectively. They show, for different values of n and k, and for each conic relaxation: the time required to solve the convex relaxations in seconds, the lower bound (LB) corresponding to the optimal objective value of the continuous relaxation, the upper bound (UB) corresponding to the objective value of the heuristic, the gap between these two values, computed as Gap= UBLB UB ; they also show the best objective found at termination, and the associated gap, number of nodes explored, time spent in branch-and-bound in seconds, and number of instances that could be solved to optimality within the time limit (#). The lower bounds, upper bounds from the convex relaxations, and objective from branch-and-bound, are scaled so that the best upper bound found for a given instance is 100. Each row represents an average of five instances generated with the same parameters. We first summarize our conclusions, then discuss in depth the relative performance of the mixed-integer optimization formulations, and finally discuss the performance of the conic formu- lations (which, we argue, perform best for this class of problems). Summary The perspective reformulation (3.45) remains the best approach to solve the prob- lems to optimality with the current off-the-shelf MISOCP solvers, as MIP solvers struggle with more the sophisticated formulations. 
However, the stronger formulations are very effective in pro- ducing comparable or better solutions (especially in challenging instances with poor natural convex relaxations) via rounding the convex relaxation solutions in a fraction of the computational time. Comparison of mixed-integer optimization approaches For instances with n= 50, we see that, among the mixed-integer optimization approaches, the one based on OptPersp is arguably the best, solving to optimality 22/30 instances (compared with Natural: 15/22, and OptPairs: 8/22). The Natural mixed-integer optimization formulation is able to explore more nodes, but 113 Tab. 3.4: Results with stock return data since 2010. n k Method Convex Branch & Bound LB UB Gap Time(s) Obj Gap Nodes Time(s) # 50 5 Natural - - - - 100.0 0.0% 1,248,360 140.3 5 OptPersp 94.2 106.0 10.9% 0.9 100.0 0.0% 147,146 142.8 5 OptPairs 99.3 100.7 1.4% 2.9 100.0 1.4% 1,844 843.6 3 7 Natural - - - - 100.0 23.1% 10,188,792 1,155.2 1 OptPersp 93.3 106.1 12.0% 0.8 100.0 7.1% 984,569 900.7 2 OptPairs 98.9 101.5 2.6% 2.6 100.0 5.3% 2,186 1197.5 0 10 Natural - - - - 100.0 47.8% 11,397,070 1,200.0 0 OptPersp 92.7 106.8 13.2% 0.8 101.2 20.5% 1,252,352 1,112.7 1 OptPairs 98.0 105.5 6.8% 2.5 101.7 9.6% 1,678 1,197.5 0 100 10 Natural - - - - 100.0 87.2% 4,121,436 1,200.0 0 OptPersp 91.4 107.9 14.6% 25.3 107.6 40.5% 415,954 1,174.7 0 OptPairs 99.3 100.9 1.6% 55.9 176.7 228.3% 239 1,144.1 0 15 Natural - - - - 100.0 91.0% 4,681,628 1,200.0 0 OptPersp 91.3 108.7 15.4% 21.7 107.2 46.2% 514,894 1,178.4 0 OptPairs 99.8 100.5 0.8% 43.9 193.7 311.0% 259 1,156.1 0 20 Natural - - - - 100.0 92.6% 4,852,043 1,200.0 0 OptPersp 90.6 107.4 15.5% 21.5 109.7 52.7% 566,917 1,178.5 0 OptPairs 99.3 100.4 1.2% 43.4 202.3 269.7% 189 1,156.7 0 the relaxations are weaker, ultimately leading to inferior performance. In contrast, the stronger mixed-integer formulation based on OptPairs needs more time to process each node (by orders- of-magnitude) due to the increased complexity of the relaxations, resulting in poor performance overall. Nonetheless, for instances where it can prove optimality (e.g., n= 50, k= 5), it does so with substantially fewer nodes, illustrating the power of the stronger relaxations. Interestingly, in the more challenging instances with data from 2010, n= 50 and k= 10, OptPairs is able to prove the best optimality gap of 9.6% (compared withOptPersp: 20.5%, andNatural: 4.8%). For larger instances with n= 100, all mixed-integer optimization formulations struggle. For- mulations based onOptPairs result in gaps well-above 100%, that is, the best lower bound achieved by branch-and-bound is negative; for instances with data since 2015 and k= 20, the root node re- laxations cannot be fully processed in 20 minutes, and the branch-and-bound solver terminates 114 Tab. 3.5: Results with stock return data since 2015. 
n k Method Convex Branch & Bound LB UB Gap Time(s) Obj Gap Nodes Time(s) # 50 5 Natural - - - - 100.0 0.0% 404,98 45.9 5 OptPersp 90.6 116.3 20.9% 0.8 100.0 0.0% 49,455 55.6 5 OptPairs 99.3 100.0 0.7% 2.9 100.0 0.8% 2,436 652.9 4 7 Natural - - - - 100.0 4.5% 3,396,633 461.8 4 OptPersp 91.4 116.9 20.9% 0.9 100.1 0.0% 230,606 365.9 5 OptPairs 99.5 100.0 0.5% 3.2 100.8 8.1% 2,387 1,129.7 1 10 Natural - - - - 100.0 25.8% 7,582,274 1,200.0 0 OptPersp 90.7 114.5 20.2% 1.1 100.3 4.6% 594,498 815.2 4 OptPairs 99.3 100.4 1.1% 3.7 105.6 17.5% 1,477 1,196.3 0 100 10 Natural - - - - 100.0 77.1% 4,345,144 1,200.0 0 OptPersp 80.3 246.4 56.5% 20.9 102.1 41.8% 329,855 1,179.1 0 OptPairs 96.6 102.7 6.0% 47.2 348.4 126.5% 195 1,152.9 0 15 Natural - - - - 100.0 84.8% 4,388,528 1,200.0 0 OptPersp 80.1 278.9 66.2% 20.8 109.8 49.2% 378,832 1,179.3 0 OptPairs 96.3 105.4 8.5% 45.7 223.6 143.1% 189 1,154.4 0 20 Natural - - - - 100.0 87.6% 4,326,808 1,200 0 OptPersp 80.2 204.3 56.5% 22.0 100.9 53.1% 409,343 1,178.3 0 OptPairs 96.4 102.6 6.0% 51.3 † † † † 0 †: Unable to fully process the root node in the time limit without an incumbent solution. Indeed, MISOCP solvers based on outer approximations strug- gle to solve highly nonlinear instances with a large number of variables and exhibit pathological behavior, e.g., see [16, 22, 119] for similar documented results. Formulations based on Natural produce the best incumbent solutions, due to the large number of nodes explored, but terminate with optimality gaps close to 100% in all cases. Formulations based onOptPersp achieve a middle ground of producing reasonably good solutions with moderate gaps, although the optimality gaps of 50% are still quite high. Discussion of conic formulations First, note that the continuous conic formulation OptPairs produces better lower bounds and upper bounds (via the rounding heuristic) than the continuous OptPersp: in particular, gaps are on average reduced by 66%, see Figure 3.2 for a summary of the gaps across all instances. The better performance comes at the expense of increased computational 115 times by a factor of three, which does not depend on the dimension of the problem. For the instances considered, the additional computation time is at most 30 seconds, which is negligible compared with the cost of solving the mixed-integer optimization problem. (a) Data since 2010. (b) Data since 2015. Fig. 3.2: Distribution of gaps forOptPersp andOptPairs. We now compare rounding OptPairs solution with the mixed-integer optimization based on OptPersp, henceforth referred to as MIO, which produced the best results among branch-and- bound approaches. For instances MIO solves to optimality (typically requiring between one and ten minutes),OptPairs produces optimality gaps under 2% in less than four seconds, indicating the effectiveness of rounding the strong OptPairs solutions. More importantly, in all other instances, OptPairs invariably produces much better gaps than MIO in a fraction of the time. For example, in Table 3.4 with n= 100, OptPairs provides optimality gaps under 2% in one minute, whereas MIO terminates with gaps above 40% after 20 minutes of branch-and-bound. While the improved gaps are mostly caused by considerably better lower bounds, in many cases the rounding heuristic based onOptPairs delivers better primal bounds than MIO: for example, in Table 3.4, n= 100 and k= 20, OptPairs produces feasible solutions with an average objective value of 100.4, whereas MIO results in incumbents with average value of 109.7. 
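The primal bounds above come from the simple rounding heuristic described in Section 3.7.2.2. A minimal sketch of that heuristic is given below, assuming numpy and cvxpy are available; the experiments themselves were run with Mosek, so the modeling layer and the name round_and_resolve are purely illustrative.

import numpy as np
import cvxpy as cp

def round_and_resolve(x_bar, Q, y_B, k):
    # Fix to one the k indicators with the largest relaxation values; fix the rest to zero.
    n = len(x_bar)
    support = np.argsort(x_bar)[-k:]
    fixed_to_zero = np.setdiff1d(np.arange(n), support)
    # Re-solve the continuous problem (IT) restricted to the chosen support.
    y = cp.Variable(n, nonneg=True)
    w, V = np.linalg.eigh(Q)                        # factor Q = R.T @ R for a DCP objective
    R = np.diag(np.sqrt(np.maximum(w, 0.0))) @ V.T
    prob = cp.Problem(cp.Minimize(cp.sum_squares(R @ (y - y_B))),
                      [cp.sum(y) == 1, y[fixed_to_zero] == 0])
    prob.solve()
    return support, y.value, prob.value             # feasible solution and upper bound on (IT)

Given an optimal solution of the OptPersp or OptPairs relaxation, the value returned here corresponds (after scaling) to the upper bound UB reported in Tables 3.4 and 3.5.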
3.8 Conclusions

In this chapter, we describe the convex hull of the mixed-integer epigraph of bivariate convex quadratic functions with nonnegative variables and off-diagonal terms, both through an SOCP-representable extended formulation and in the original space of variables. Furthermore, we develop a new technique for constructing an optimal convex relaxation from elementary valid inequalities. Using this technique, we develop a new strong SDP relaxation for (QI), based on the convex hull descriptions of the bivariate cases as building blocks. Moreover, the computational results with synthetic and real portfolio optimization instances indicate that the proposed formulations provide substantial improvements over existing alternatives in the literature.

Chapter 4: Compact extended formulations for low-rank functions with indicator variables

4.1 Introduction

In this chapter, we consider a general mixed-integer convex optimization problem with indicator variables
\[
\min_{x,z}\ \bigl\{\, F(x) \;:\; (x,z)\in \mathcal{F}\subseteq \mathbb{R}^n\times\{0,1\}^n,\;\; x_i(1-z_i)=0\ \ \forall i\in[n] \,\bigr\}, \tag{4.1}
\]
where $F:\mathbb{R}^n\to\mathbb{R}$ is a convex function, $\mathcal{F}$ is the feasible region, and $[n]\stackrel{\text{def}}{=}\{1,2,\dots,n\}$. Each binary variable $z_i$ indicates whether the continuous variable $x_i$ is zero or not: $z_i=0$ implies $x_i=0$, while $z_i=1$ allows $x_i$ to take any value. Optimization problem (4.1) arises in a variety of settings, including best subset selection problems in statistics [50] and portfolio optimization problems in finance [53]. In practice, the objective function often has the form $F(x)=\sum_{i\in[m]} f_i(x)$, where each $f_i(x)$ is the composition of a relatively simple convex function and a low-rank linear map. This observation motivates a comprehensive study of the mixed-integer set
\[
\mathcal{Q} \;\stackrel{\text{def}}{=}\; \bigl\{\, (t,x,z)\in\mathbb{R}^{n+1}\times\{0,1\}^n \;:\; t\ge f(x),\; x_i\ge 0\ \forall i\in\mathcal{I}^+,\; x_i(1-z_i)=0\ \forall i\in[n] \,\bigr\}, \tag{4.2}
\]
where $f(x)=g(Ax)+c^\top x$, $g:\mathbb{R}^k\to\mathbb{R}$ is a proper closed convex function, $A$ is a $k\times n$ matrix, $c\in\mathbb{R}^n$, and $\mathcal{I}^+\subseteq[n]$ is the subset of variables restricted to be nonnegative.

Disjunctive programming is a powerful modeling tool for representing a nonconvex optimization problem in the form of disjunctions, especially when binary variables are introduced to encode logical conditions such as sudden changes, either/or decisions, implications, etc. [35]. The theory of linear disjunctive programming was pioneered by Egon Balas in the 1970s [34, 32, 33, 36], and later extended to the nonlinear case [72, 205, 154, 114, 46, 57, 144, 145, 176, 73, 232]. Once a mixed-integer set is modeled as a collection of disjunctive sets, its convex hull can be described as a Minkowski sum of scaled convex sets, each obtained by creating a copy of the original variables. Such extended formulations are the strongest possible for the mixed-integer set under study. However, a potential downside of such formulations is that the number of additional variables required in the description of the convex hull is often exponential in the number of binary variables. Thus, a direct application of disjunctive programming can be unfavorable in practice.

A possible approach for implementing the extended formulation induced by disjunctive programming efficiently is to reduce the number of additional variables introduced in the model without diminishing the relaxation strength. In principle, this goal can be achieved by means of Fourier-Motzkin elimination [79].
This method is practical if the set under study can be naturally expressed using few disjunctions, e.g., to describe piecewise linear functions [8, 120], involv- ing few binary variables [16, 100], or separable [116]. However, projecting out variables in a moderate-size model can be very challenging [10], not to mention in a high-dimensional setting. Regarding the nonlinear set with indicator variablesQ, in the simplest case where the MINLP is separable, it is known that the convex hull can be described by the perspective function of the objective function supplemented by the constraints defining the feasible region [3, 88, 89, 97, 116, 129, 229, 48]. Two papers [226] and [22] are closely related to our work. The mixed-integer sets studied in both can be viewed as a special realization ofQ in (4.2). Wei et al. [226] characterize the closure 119 of the convex hull ifQ when the defining function g is a univariate function and the continuous variables are free, i.e.I + = / 0. Atamt¨ urk and G´ omez [22] study the setting with sign-constrained continuous variables, and univariate quadratic functions g: in such cases, the sign-restrictions resulted in more involved structures for the closure of the convex hull ofQ. Contributions and outline In this chapter we show how to construct extended formulations of setQ, requiring at mostO(n k ) copies of the variables. In particular, if the rank k is small, then the resulting formulations are indeed much smaller than those resulting from a natural application of disjunctive programming. Moreover, for the special case of k= 1, we show how we are able to recover and improve existing results in the literature, either by providing smaller extended formulations (linear in n), or by providing convexifications in the original space of variables for a more general class of functions. The rest of the chapter is organized as follows. In Section 4.2, we provide relevant background and the main result of the chapter: a compact extended formulation ofQ. In Section 4.3, using the results in Section 4.2, we derive the explicit form of the convex hull of the rank-one setQ in the original state of variables. In Section 4.4, we present the complexity results showing that tractable convexifications ofQ are unlikely if additional constraints are imposed on the continuous variables. In Section 4.5, we extend the results in Section 4.2 to the case where the binary variables are constrained. Finally, in Section 4.6, we conclude the chapter. 4.2 A convex analysis perspective on convexification In this section, we first introduce necessary preliminaries in convex analysis and notations adopted in this chapter. After that we present our main results and their connections with previous works in literature. 120 4.2.1 Notations and preliminaries Throughout this chapter, we assume f(x) a proper closed convex function fromR n toR[f+¥g. We adopt the same notations as those in Chapter 1. Moreover, we borrow the term rank which is normally defined for an affine mapping and extend its definition to a general nonlinear convex function. Definition 1 (Rank of convex functions). Given a proper closed convex function f(x), the rank 1 rank( f) is defined as the smallest integer k such that f(x) can be expressed in the form f(x)= g(Ax)+ c > x for some closed convex function g(), a k n matrix A and a vector c2R n . For example, the rank of an affine function f(x)= c > x is simply 0. 
The rank of a convex quadratic function f(x)= x > Ax coincides with rank(A), where A 0 is a positive semidefinite (PSD) matrix. For convenience, we repeat the set of interest: Q def = 8 > < > : (t;x;z)2R n+1 f0;1g n : t f(x);x i 08i2I + ; x i (1 z i )= 08i2[n] 9 > = > ; ; where f(x)= g(Ax)+ c > x is a proper closed convex function. Note that if the complementary constraints x i (1z i )= 0 are removed fromQ, then x and z are decoupled and clconv(Q) reduces simply toX[0;1] n , where X def = (t;x)2R n+1 : t f(x);x i 08i2I + : For the purpose of decomposing clconv(Q), for anyI[n], define the following sets: X(I) def =X\f(t;x) : x i = 08i = 2Ig; Z(I) def =fz2f0;1g n : z i = 18i2Ig; V(I) def =X(I)Z(I): 1 The definition of rank( f) is different from the one adopted in classical convex analysis (see Section 8 in [193]). If dom( f) is full-dimensional, the two definitions coincide. 121 Notice that for any(t;x;z)2V(I), either x i = 0 or z i = 18i2[n]. Therefore, x i (1z i )= 08i2[n] and thusV(I)Q. Furthermore,Q can be expressed as a disjunction Q= [ I[n] V(I): (4.3) Note that usual disjunctive programming techniques are based on (4.3), creating copies of variables for every I[n], resulting in an exponential number of variables. Finally, define R def =f(t;x;z) : t 0; Ax= 0;x i 0;8i2I + ; z i = 1;8i2[n]g; which is a closed convex set. For any y2R n , we denote the support of y by supp(y) def =fi : y i 6= 0g. We let e be the vector of all ones (whose dimension can be inferred from the context). 4.2.2 Convex hull characterization In this section, we characterize clconv(Q). We first show that if f(x) is homogeneous, under mild conditions, clconv(Q) is simply the natural relaxation ofQ. Proposition 31. If f(x) is a homogeneous function, then clconv(Q)=X[0;1] n . Proof. Since f is homogeneous, each setX(I) is a closed convex cone. Moreover,I 1 I 2 implies thatX(I 1 )X(I 2 ), which further implies that X(I 2 )X(I 1 )+X(I 2 )X(I 2 )+X(I 2 )=X(I 2 ); (4.4) 122 where the last equality results from thatX(I 2 ) is conic. Hence, equality holds throughout (4.4). By (4.3), clconv(Q)= clconv 0 @ [ I[n] V(I) 1 A =clconv 0 @ [ I[n] X(I) conv(Z(I)) 1 A (X(I) is convex;X;Z are decoupled) = [ l2R 2 n + :e > l=1 å I[n] l I (X(I) conv(Z(I))) = [ l2R 2 n + :e > l=1 X([n]) å I[n] l I conv(Z(I)) (X(I) is conic and (4.4)) =X [ l2R 2 n + :e > l=1 å I[n] l I conv(Z(I)) (X =X([n])) =X[0;1] n : Proposition 31 generalizes Proposition 1 of [111] (where f is the` 2 norm). Next, we present the main result of the chapter, characterizing clconv(Q) without the assumption of homogeneity. In particular, we show that clconv(Q) can be constructed from substantially less disjunctions than those given in (4.3). The proof relies on Proposition 47 which reveals one-to-one correspondence between closed convex sets and support functions. Theorem 3. Assume rank( f) k and f(0)= 0. Then clconv(Q)= clconv 0 @ 0 @ [ I :jIjk V(I) 1 A [R 1 A : Moreover, iffx2R n : Ax= 0; x i 08i2I + g=f0g, thenR can be removed from the disjunc- tion. 123 Informally, S I :jIjk V(I) andR correspond to the “extreme points” and “extreme rays” of clconv(Q), respectively. From (4.3),Q is a disjunction of exponentially many pieces ofV(I). However, given a low-rank function f , Theorem 3 states that clconv(Q) can be generated (using disjunctive programming) from a much smaller number of setsV(I). We also remark that condi- tion f(0)= 0 plays a minor role in the derivation of Theorem 3. If 02 dom( f) but f(0)6= 0, one can study f(x) f(0). Proof of Theorem 3. 
DenoteS = [ I :jIjk V(I)[R. SinceV(I)Q for allI , andRQ, we find thatSQ. Thus,d (;clconv(S))=d (;S)d (;Q): Due to Proposition 47, it remains to prove the opposite direction, namely thatd (;S)d (;Q). Since rank( f) k, there exists g(), A2R kn and c such that f(x)= g(Ax)+ c > x. Taking any a=(a t ;a x ;a z )2RR n R n , ifa t > 0,d (;S)=d (;Q)=+¥ because t is unbounded from above. We now assumea t 0, and define `(x;z) def =a t g(Ax)+(a x +a t c) > x+a > z z: Thend (a;S)= maxf`(x;z) :(t;x;z)2Sg andd (a;Q)= maxf`(x;z) :(t;x;z)2Qg. Given any fixed( ¯ t; ¯ x; ¯ z)2Q, consider the linear program v def = max x2R n (a x +a t c) > x s.t. Ax= A ¯ x ¯ x i x i 0 8i2 supp( ¯ x) supp(¯ z) x i = 0 8i2[n]nsupp( ¯ x): (4.5) Note that linear program (4.5) is always feasible setting x= ¯ x. Moreover, every feasible solution (or direction) x of (4.5) satisfies x i (1 ¯ z i )= 0. 124 If v =+¥, then there exists a feasible direction d such that Ad= 0 and(a x +a t c) > d> 0. It implies that for any r 0, f(rd)= g(rAd)+ rc > d= g(0)+ rc > d= rc > d. Hence,(rc > d;rd;e)2 R Q. Furthermore, `(rd;e)=(a x +a t c) > dr+a > z e!+¥; (as r!+¥) which impliesd (a;S)=d (a;Q)=+¥. If v is finite, there exists an optimal solution x to (4.5). Moreover, x can be taken as an extreme point of the feasible region of (4.5) which is a pointed polytope. It implies that n linearly independent constraints must be active at x . Since rank(A) k, at least n k constraints of the form ¯ x i x i 0 hold at equality. Namely, x satisfiesjsupp(x )j k. Since A ¯ x= Ax , we can define t def = ¯ t+ c > x ¯ x f( ¯ x)+ c > x ¯ x= g(A ¯ x)+ c > ¯ x+ c > x ¯ x= g(Ax )+ c > x = f(x ): SettingI = supp(x ), one can deduce that(t ;x )2X(I), and ¯ z2Z(I) because supp(x ) supp(¯ z). Thus, (t ;x ; ¯ z)2V(I)SQ, and (a x +a t c) > x (a x +a t c) > ¯ x indicates that `(x ; ¯ z)`( ¯ x; ¯ z). Namely, for an arbitrary point( ¯ t; ¯ x; ¯ z)2Q, there always exists a point inS with a superior objective value of`(). Therefore, d (a;S)d (a;Q), completing the proof of the main conclusion. The last statement of the theorem follows since iffx2R n : Ax= 0; x i 08i2I + g=f0g, then the feasible region of (4.5) is bounded and thus v is always finite. Using the disjunctive representation of Theorem 3 and usual disjunctive programming tech- niques [72], one can immediately obtain extended formulations requiring at mostO(n k ) copies of the variables. Moreover, it is often easy to project out some of the additional variables, resulting in formulations with significantly less variables or, in some cases, formulations in the original space of variables. We illustrate these concepts in the next section with k= 1. 125 4.3 Rank-one convexification In this section, we show how to use Theorem 3 and disjunctive programming to derive convex- ifications for rank-one functions. In particular, throughout this section, we make the following assumptions: Assumption 1. Function f is given by f = g(å n i=1 a i x i )+c > x, where g is a finite one-dimensional function, f(0)= g(0)= 0, and a i 6= 08i2 n. For simplicity, we also assume c= 0. First in Section 4.3.1 we derive an extended formulation with a polynomial number of addi- tional variables. Then in Section 4.3.2 we project out the additional variables for the case with free continuous variables, and recover the results of [226]. 
Similarly, in Section 4.3.3, we provide the description of the convex hull of cases with non-negative continuous variables in the original space of variables, generalizing the results of [22] and [16] to general (not necessarily quadratic) functions g. We also show that the extended formulation proposed in this chapter is more amenable to implementation that the inequalities proposed in [22]. 4.3.1 Extended formulation of clconv(Q) We first discuss the compact extended formulation of clconv(Q) that can be obtained directly for the disjunctive representation given in Theorem 3, using 2n additional variables. Proposition 32. Under Assumption 1, (t;x;z)2 clconv(Q) if and only if there exists l;t2R n such that the inequality system t n å i=1 g p (a i (x i t i );l i ); a > t = 0; 0t i x i 8i2I + ; l i z i 18i2[n]; l 0; n å i=1 l i 1 (4.6) 126 is satisfied. Proof. By Theorem 3, clconv(Q)= clconv 0 @ V(/ 0) [ i2[n] V(fig)[R 1 A = [ l2R n+2 + :e > l=1 l 0 clconv(V(/ 0))+ n å i=1 l i clconv(V(fig))+l [n] R ! ; where l 0 clconv(V(/ 0))=f(t 0 ;x 0 ;z 0 ) : t 0 0; x 0 = 0; 0 z 0 l 0 g; l i clconv(V(fig))= 8 > < > : (t i ;x i ;z i ) : t i g p (a i x i i ;l i ); z i i =l i ; 0 z i j l i 8 j6= i; x i i 0 if i2I + ; x i j = 08 j6= i 9 > = > ; ; l [n] R= n t [n] ;t;z [n] : t [n] 0; a > t = 0; t i 08i2I + ; z [n] =l [n] e o : It follows that(t;x;z)2 clconv(Q) if and only if the following inequality system has a solution: t n å i=1 g p (a i x i i ;l i ); a > t = 0; x i = x i i +t i 8i2[n]; (4.7a) x i i 0; t i 0 8i2I + ; z j = n å i=1 z i j +l [n] + z 0 j 8 j2[n]; z i i =l i ; 0 z 0 i l 0 8i2[n]; (4.7b) 0 z i j l i 8 j6= i2[n]; l 0; l 0 + n å i=1 l i +l [n] = 1: 127 We now show how to simply the above inequality system step by step. First, we can substitute out x i i and z i i using (4.7a) and (4.7b), obtaining the system t n å i=1 g p (a i (x i t i );l i ); a > t = 0; 0t i x i 8i2I + ; z j l j l [n] = å i2[n]:i6= j z i j + z 0 j 8 j2[n]; (4.8a) 0 z i j l i 8[n]3 i6= j2[n][f0g; (4.8b) l 0; l 0 + n å i=1 l i +l [n] = 1: (4.8c) Next, we can substitute out å i2[n]:i6= j z i j + z 0 j in (4.8a) using the bounds (4.8b). Doing so, (4.8a) reduces to 0 z j l j l [n] å i2[n]:i6= j l i +l 0 = 1l j l [n] ; ,l [n] z j l j ; z j 1; where the equality results from (4.8c). We deduce that the system of inequalities reduces to t n å i=1 g p (a i (x i t i );l i ); a > t = 0; 0t i x i 8i2I + ; z j 1; l [n] z j l j 8 j2[n]; l 0; l 0 + n å i=1 l i +l [n] = 1: Formulation (4.6) follows from using Fourier-Motzkin elimination to project out l 0 and l [n] , re- placing them with 0 and changing the last equality to an inequality. 128 In addition, ifI + =[n] and a i > 08i2[n], then a > t = 0 and t 0 imply t = 0 in (4.6). Therefore, we deduce the following corollary for this special case. Corollary 3. Under Assumption 1,I + =[n] and a i > 08i2[n],(t;x;z)2 clconv(Q) if and only if there existsl2R n such that the inequality system t n å i=1 g p (a i x i ;l i ); l i z i 18i2[n]; l 0; n å i=1 l i 1 (4.10) is satisfied. Naturally, the extended formulations obtained from Proposition 32 and Corollary 3, requiring O(n) additional variables, are substantially more compact than extended formulations obtained from the disjunction (4.3), which requireO(n 2 n ) variables. 
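Formulation (4.10) is straightforward to state with an off-the-shelf conic modeling layer when g is quadratic, since each perspective term g_p(a_i x_i, lambda_i) = (a_i x_i)^2 / lambda_i is a rotated second-order cone. The cvxpy sketch below is only an illustration under these assumptions (g(s) = s^2, a > 0, I+ = [n], z relaxed to [0,1]); the function name hull_constraints_corollary3 and the toy data are ours, not part of the chapter's computations.

import numpy as np
import cvxpy as cp

def hull_constraints_corollary3(a, t, x, z):
    # System (4.10) for g(s) = s^2: t >= sum_i g_p(a_i x_i, lam_i), lam <= z, sum(lam) <= 1.
    n = a.shape[0]
    lam = cp.Variable(n, nonneg=True)
    persp = cp.sum([cp.quad_over_lin(a[i] * x[i], lam[i]) for i in range(n)])
    return [t >= persp, lam <= z, cp.sum(lam) <= 1]

# Minimal usage: optimize a toy objective over the relaxation.
n = 4
a = np.array([1.0, 2.0, 0.5, 1.5])
x, z, t = cp.Variable(n, nonneg=True), cp.Variable(n), cp.Variable()
cons = hull_constraints_corollary3(a, t, x, z) + [z >= 0, z <= 1, x <= 1]
cp.Problem(cp.Minimize(t - cp.sum(x) + 0.3 * cp.sum(z)), cons).solve()

By Corollary 3, the points (t, x, z) feasible for these constraints (together with x >= 0 and 0 <= z <= 1) are exactly the points of clconv(Q); adding z in {0,1}^n on top of the same constraints therefore gives a mixed-integer model whose continuous relaxation is the convex hull.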
4.3.2 Explicit form of clconv(Q) with unconstrained continuous variables When x is unconstrained, i.e.,I + = / 0, the explicit form of clconv(Q) in the original space of variables was first established in [18] for quadratic functions, and later generalized in [226] to general rank-one functions. In Proposition 33 below, we present a short proof on how to recover the aforementioned results, starting from Proposition 32. First, we need the following property on the monotonicity of the perspective function. Lemma 4. Assume g(v) is a convex function overR with g(0)= 0. Then g p (v;l) is a nonincreas- ing function onl > 0 for fixed v. Proof. Since g is convex, for8v 1 ;v 2 , g(v 1 )g(v 2 ) v 1 v 2 is nondecreasing with respect to v 1 . Taking v 1 = v=l and v 2 = 0, since g(0)= 0, one can deduce that g p (v;l)=lg v l = v g v 1 l 0 v 1 l 0 129 is nondecreasing with respect to 1=l, i.e. nonincreasing with respect tol. Proposition 33 (Wei et al. [226]). Under Assumption 1, if additionallyI + = / 0, then clconv(Q)= n (t;x;z) : t g p a > x; min n 1; e > z o ; 0 z e o : Proof. Without loss of generality, we assume a= e; otherwise, we can scale x i by a i . We first eliminatet from (4.6). From the convexity of g(), we find that å i2[n] g p (x i t i ;l i )= å i2[n] l i g x i t i l i = å j2[n] l j ! å i2[n] l i å j2[n] l j g x i t i l i ! å j2[n] l j ! g å i2[n] x i å i2[n] t i å j2[n] l j ! = å j2[n] l j ! g å i2[n] x i å j2[n] l j ! = g p å i2[n] x i ; å i2[n] l i ! ; (e > t = 0) where the inequality holds at equality if there exists some common ratio r such that x i t i =l i r. Moreover, this ratio does exist by setting r= e > x=e > l and t = x rl for all i2[n] –it can be verified directly that e > t = 0. Thus, the above lower bound can be attained for all x. Hence, (4.6) reduces to t g p (e > x;e > l) 0l i z i 18i2[n] (4.11a) å i2[n] l i 1: (4.11b) Since for fixed v, g p (v;s) is non-increasing with respect to s2R + , projecting out l amounts to computing the maximum of e > l, that is, solving the linear program max l fe > l : (4.11a) and (4.11b)g: Summing up (4.11a) over all i2[n] and combining it with (4.11b), we deduce that e > l minfe > z; 1g. 130 It remains to show this upper bound is tight. Ifå i2[n] z i 1, one can setl i = z i 8i2[n]. Now as- sumeå i2[n] z i > 1. Let m be the index such thatå m1 i=1 z i 1 andå m i=1 z i > 1. Set l i = 8 > > > > > > < > > > > > > : z i if i< m 1å m i=1 z i if i= m 0 if i> m: It can be verified directly that this solution is feasible and e > l = 1. The conclusion follows. 4.3.3 Explicit form of clconv(Q) with nonnegative continuous variables In this section, we aim to derive the explicit form of clconv(Q) whenI + =[n]. A description of this set is known for the quadratic case [22] only. We now derive it for the general case. When specialized to bivariate rank-one functions, it also generalizes the main result of [16] where the research objective is a non-separable bivariate quadratic. Observe that function f can be written as f(x)= g å i2N + a i x i å i2N a i x i ! ; where a i > 08i2[n], andN + [N is a partition of [n]. We first state the main theorem of this section, where the min=max over an empty set is taken to be+¥=¥ respectively. Theorem 4. 
Under Assumption 1 andI + =[n], for all (t;x;z) such that x 0;0 z e, the following statements hold: 131 Ifå i2N + a i x i >å i2N a i x i and there exists a partitionL[M[U ofN + \supp(x)\supp(z) such that 1 å i2M[U z i > 0; max i2L a i x i z i < å i2L a i x i 1å i2M[U z i min i2M a i x i z i max i2M a i x i z i å i2N + a i x i å i2L[M[N a i x i å i2U z i < min i2U a i x i z i ; then(t;x;z)2 clconv(Q) if and only if t g p å i2L a i x i ;1 å i2M[U z i ! + å i2M g p (a i x i ;z i )+ g p å i2N + a i x i å i2L[M[N a i x i ; å i2U z i ! : (4.12) Otherwise,(t;x;z)2 clconv(Q) if and only if t f(x): Ifå i2N + a i x i å i2N a i x i and there exists an partitionL[M[U ofN \supp(x)supp(z) such that 1 å i2M[U z i > 0; max i2L a i x i z i < å i2L a i x i 1å i2M[U z i min i2M a i x i z i max i2M a i x i z i å i2N a i x i å i2L[M[N + a i x i å i2U z i < min i2U a i x i z i ; then(t;x;z)2 clconv(Q) if and only if t g p å i2L a i x i ;1 å i2M[U z i ! + å i2M g p (a i x i ;z i )+ g p å i2N a i x i + å i2L[M[N + a i x i ; å i2U z i ! : Otherwise,(t;x;z)2 clconv(Q) if and only if t f(x): Note that f(x) can be rewritten as f(x)= ˜ g å i2N a i x i å i2N + a i x i , where ˜ g(s)= g(s). Due to this symmetry, it suffices to prove the first assertion of Theorem 4. Without loss of gener- ality, we also assume a i = 18i2[n]; otherwise we can consider ˜ x i = a i x i . 132 For simplicity, we first establish the convex hull results under the following assumption which is stronger than Assumption 1. Later, we will extend the conclusion of Theorem 4 to a general finite convex function. Assumption 2. In addition to Assumption 1, function g is assumed to be a strongly convex and differentiable function with g(0) = 0 = min s2R g(s). Also, we assume x > 0, 0 < z e and å i2N + a i x i >å i2N a i x i . Observe that the positivity assumption x> 0 and z> 0 implies thatN + \ supp(x)\ supp(z) is simplyN + . For any vector y and index setS[n], we denote y(S)=å i2S y i for convenience. The workhorse of the derivation of Theorem 4 is the following minimization problem induced by the extended formulation (4.6) of clconv(Q): for a given (x;z) such that x> 0;0< z e and å i2N + x i >å i2N x i , min l;t å i2N + l i g x i t i l i + å i2N l i g x i t i l i s.t. t(N + )t(N )= 0; 0t x e > l 1; 0l z: (4.13) Lemma 5. Under Assumption 2, one can sett i = x i ; l i = 08i2N andl i > 0; t i < x i 8i2N + in an optimal solution to (4.13). Proof. For contradiction, we assume there exists some i 2N such thatt i < x i in the optimal solution to (4.13). Since å i2N + t i = å i2N t i < å i2N x i < å i2N + x i ; there must be some i + 2N + such thatt i + < x i + . Since g is convex and attains its minimum at 0, g is increasing over[0;+¥) and decreasing over(¥;0]. It follows that increasingt i andt i + by the same sufficiently small amount ofe > 0 would improve the objective function, which contradicts with the optimality. Hence,t i = x i 8i2N . It follows that one can safely takel i = 08i2N . 133 We now prove the second part of the statement, namely, that there exists an optimal solution withl > 0. Ifl i = 0;i2N + , since g is strongly convex, one must havet i = x i , otherwise g p (x i t i ;0)=+¥. Moreover, if t i = x i , one can safely take l i = 0. Finally, assume l i = 0;t i = x i for some index i. In this case, constraints t(N + )t(N )= 0, the previously proven property that t(N )= x(N ) and the assumption x(N + )> x(N ) imply that there exists an index j where t j < x j andl j > 0. 
Setting( ˜ l i ;x i ˜ t i )=e(l j ;x j t j ) and( ˜ l j ;x j ˜ t j )=(1e)(l j ;x j t j ) for some small enough e > 0, we find that the new feasible solution is still optimal since g p (;) is a homogeneous function. The conclusion follows. By changing variablet xt, it follows from Lemma 5 that (4.13) can be simplified to min l;t å i2N + l i g t i l i (4.14a) s.t. t(N + )= C (4.14b) 0<t x (4.14c) l(N + ) 1 (4.14d) 0<l z; (4.14e) where C def = x(N + )x(N ). Denote the derivative of g(t) by g 0 (t). Define G(t) def = tg 0 (t)g(t)> 0 and ˆ g(t) def = tg(1=t). Next lemma will be used to simplify (4.14) and express the optimal solution to (4.14) in terms of G(t) and g 0 (t). Lemma 6. Under Assumption 2, g 0 (t) and G(t) are increasing over (0;+¥) and thus invertible. Moreover, ˆ g(t) is strictly convex over(0;+¥). 134 Proof. Strict monotonicity of g 0 (t) over [0;+¥) follows directly from the strict convexity of g. Since g is strongly convex, it can be written as g(t)= g 1 (t)+g 2 (t), where g 1 (t)=at 2 is quadratic with a certaina> 0 and g 2 (t) is convex. For any t> 0, G(t+e) G(t)=eg 0 (t+e)(g(t+e) g(t))+t g 0 (t+e) g 0 (t) =e g 0 (t+e) g 0 (t) +t g 0 (t+e) g 0 (t) +O(e) (Taylor expansion) =e g 0 (t+e) g 0 (t) +t(g 0 2 (t+e) g 0 2 (t))+ 2ate+O(e) 2ate+O(e)> 0; (Monotonicity of the derivative of a convex function) as e is small enough. Namely, G(t) is an increasing function over [0;+¥). Moreover, G(0)= 0 implies that G(t)> 08t2(0;+¥). To prove the last conclusion, ˆ g 0 (t)= g(1=t) g 0 (1=t)=t = G(1=t) is increasing over(0;+¥), which implies the strict convexity of ˆ g. Assume(l;t) is an optimal solution to (4.14). Let r i def = t i l i 8i2N + . The next lemma reveals the structure of the optimal solution to (4.14). Namely, unless r i ’s are identical, either l i or t i attains the upper bound. Lemma 7. Under Assumption 2, if r i > r j , thenl i = z i andt j = x j in the optimal solution to (4.14). Proof. For contradiction we assumel i < z i . Takee > 0 small enough and lete i =e=t i ande j = e=t j . Since ˆ g is strictly convex and 1=r i < 1=r j , one can deduce that ˆ g(1=r i +e i ) ˆ g(1=r i ) e i < ˆ g(1=r j ) ˆ g(1=r j e j ) e j , (l i +e)=t i g(t i =(l i +e))l i =t i g(t i =l i ) e=t i < l j =t j g(t j =l j )(l j e)=t j g(t j =(l j e)) e=t j , (l i +e)g t i l i +e +(l j e)g t j l j e l i g t i l i l j g t j l j < 0: It implies that we can improve the objective value of (4.14) by increasingl i and decreasingl j by e, which contradicts with the optimality. Hence, one can deducel i = z i . 135 Similarly, the second conclusion follows by l i g t i e l i +l j g t j +e l j l i g t i l i l j g t j l j =(g 0 (r j ) g 0 (r i ))e+O(e)< 0; since g 0 () is strictly increasing. Finally, we are now ready to prove Theorem 4 under the additional Assumption 2. Proposition 34. Under Assumption 2, the conclusion in Theorem 4 holds true. Proof. We discuss two cases defined by z(N + ) separately. Case 1: z(N + )> 1. Since the objective function of (4.14) is decreasing with respect to l by Lemma 4, one must havel(N + )= 1. Thus, program (4.14) can be reduced to min l;t å i2N + l i g t i l i (4.15a) s.t. t(N + )= C (a) 0<t x (b) l(N + )= 1 (d) 0<l z: (g) Assume(t;l) is the optimal solution to (4.15) and define r i =t i =l i . There are two possibilities – either all r i ’s are identical or there are at least two distinct values of r i ’s. In the former case, denote r= r i 8i2N + , that is, t i = rl i 8i2N + . Then t(N + )=l(N + )= rl(N + )=l(N + )= r, and in particular r= C. 
In this case, (4.15a) reduces tol(N + )g(r)= g(C). Now we assume there are at least two distinct values of r i ’s. By Lemma 7, for all i2N + , either x i =t i orl i = z i . It follows thatN + =L[M[U , where L =fi :t i = x i ;l i < z i g;M =fi :t i = x i ;l i = z i g;U =ft i < x i ;l i = z i g: (4.16) 136 Since all constraints of (4.15) are linear, KKT conditions are necessary and sufficient for optimality of(l;t). Let(a;b;g;d) be the dual variables associated with each constraint of (4.15). It follows thatg i = 08i2L andb i = 08i2U . The KKT conditions of (4.15) can be stated as follows (left column: statement of the KKT condition; right column: equivalent simplification) g 0 x i l i a+b i = 0 b i =a g 0 x i l i 8i2L g 0 x i z i a+b i = 0 b i =a g 0 x i z i 8i2M g 0 t i z i a = 0 t i = z i (g 0 ) 1 (a) 8i2U g x i l i x i l i g 0 x i l i +d = 0 l i = x i G 1 (d) 8i2L g x i z i x i z i g 0 x i z i +d+g i = 0 g i = G x i z i d 8i2M g 0 t i z i t i z i g 0 t i z i +d+g i = 0 g i = G t i z i d 8i2U t(U)+ x(M)+ x(L)= C t(U)= C x(M) x(L) l(L)+ z(M)+ z(U)= 1 l(L)= 1 z(M) z(U) 0<t i < x i 8i2U; 0<l i < z i 8i2L; b i 08i2L[M; g i 08i2M[U . Denote by ¯ C= C x(M) x(L) and ¯ z= 1 z(M) z(U). Thent(U)= z(U)(g 0 ) 1 (a)= ¯ C implies a = g 0 ¯ C=z(U) , and l(L)= x(L)=G 1 (d)= ¯ z implies d = G(x(L)=¯ z): Thus, one can first substitute out a and d to get b i 8i2M;t i 8i2U;l i 8i2L;g i 8i2M . Then one can plug inl i 8i2L andt i 8i2U to work outb i 8i2L andg i 8i2U . Hence, we deduce that the KKT system is equivalent to b i = g 0 ( ¯ C=z(U)) g 0 (x(L)=¯ z) 0 8i2L (4.17a) b i = g 0 ( ¯ C=z(U)) g 0 (x i =z i ) 0 8i2M (4.17b) t i = ¯ Cz i =z(U)2(0;x i ) 8i2U (4.17c) l i = ¯ zx i =x(L)2(0;z i ) 8i2L (4.17d) g i = G(x i =z i ) G(x(L)=¯ z) 0 8i2M (4.17e) g i = G( ¯ C=z(U)) G(x(L)=¯ z) 0 8i2U: (4.17f) 137 Because g 0 and G are increasing from Lemma 6, the KKT system has a solution if and only if min i2U x i z i (4.17c) > ¯ C z(U) (4.17b) max i2M x i z i min i2M x i z i (4.17e) x(L) ¯ z (4.17d) > max i2L x i z i ; which implies ¯ C z(U) x(L) ¯ z ,(4.17a) and (4.17f). Moreover, by using the solution to the KKT system, the optimal value of (4.15) is g x(L) ¯ z + å i2M z i g x i z i + z(U)g ¯ C z(U) : Case 2: z(N + ) 1. Since the objective function of (4.14) is decreasing with respect to l, one must havel i = z i 8i2N + . Thus, problem (4.14) reduces to min t å i2N + z i g t i z i (4.18a) s.t. t(N + )= C (a) 0<t x: (b) Assumet is the optimal solution to (4.18). Similarly, define L = / 0; M =fi :t i = x i g; U =fi :t i < x i g: Then the KKT conditions can be written as g 0 x i z i a+b i = 0 b i =g 0 x i z i +a 8i2M g 0 t i z i a t i =(g 0 ) 1 (a)z i 8i2U x(M)+t(U)= C t(U)= C x(M)= ¯ C=(g 0 ) 1 (a)z(U) 0<t i < x i 8i2U; b i 08i2M: 138 It follows thata = g 0 ( ¯ C=z(U)). Plugginga in, one arrives at b i = g 0 ( ¯ C=z(U)) g 0 (x i =z i ) 0 8i2M t i = z i ¯ C=z(U)2(0;x i ) 8i2U: Therefore, the KKT system has a solution if and only if min i2U x i z i > ¯ C z(U) max i2M x i z i : The proof is finished. From Proposition 34, we see that Theorem 4 holds if additional assumptions are imposed – namely g is finite, strongly convex and differentiable, 0= g(0)= min t2R g(t) and x> 0;z> 0. To complete the proof of the theorem, we now show how to remove the assumptions, one by one. Proof of Theorem 4. Due to the symmetry mentioned above, we only prove the first conclusion in the theorem under Assumption 1 and a= e. 
A key observation is that the optimal primal solution to (4.14), given by (4.17c) and (4.17d), does not involve function g and only relies on x and z (while the values of the dual variables does depend on g). Denote this optimal solution by(t ;l ) and the objective function of (4.14) by h(t;l;g). Then h(t ;l ;g) h(t;l;g) for all feasible solutions (t;l) to (4.14) and all functions g satisfying Assumption 2. If g is not a strongly convex function, one can consider g e (s) = g(s)+es 2 , where e > 0. Since g e is strongly convex, the conclusion is applicable to g e . One can deduce that for any feasible solution(t;l) of (4.14) h(t ;l ;g e ) h(t;l;g e ); which implies h(t ;l ;g) h(t;l;g) by letting e! 0. Thus, the conclusion holds if g is a differentiable function with 0= g(0)= min t2R g(t). If g is not a differentiable function, one can consider its Moreau-Yosida regularization e e g(s) def = min w g(w)+ 1 2e (s w) 2 g(s); where e > 0. It follows that e e g(s) is a differentiable convex function; see Corollary 4.5.5, [130]. Moreover, e e g(s)! g(s) ase! 0; see Theorem 1.25, [195]. 139 Because h(t ;l ;e e g) h(t;l;e e g); lettinge! 0, one can deduce that the conclusion holds for any finite convex function g with 0= g(0)= min t2R g(t). For a general finite convex function g with g(0)= 0, ˜ g(s)= g(s)cs is a convex function with 0= ˜ g(0)= min s2R ˜ g(s), where c2¶g(0). The conclusion follows by applying the theorem to ˜ g. Finally, if there exists i2N + such that x i = 0, then t i = 0 in (4.13) which implies that one can safely setl i = 0 in (4.13). Hence, we can exclude the variables associated with index i from consideration and reduce the problem to a lower-dimensional case. For this reason, without loss of generality, we assume x i > 08i2[n]. DefineN 0 def =fi2N + : z i = 0g, ˜ f(x;z) as the RHS of (4.12), and f (x;z) as the optimal value of (4.13). Let r> 0 be a sufficiently large number and consider z r defined as z r i = x i =r if i2N 0 and z r i = z i otherwise. Note that lim r!+¥ z r = z. If there exists a partitionL[M[U ofN + \ supp(z) stated in the theorem, thenL[M[ ˜ U is the partition of N + associated with(x;z r ) where ˜ U =U[N 0 . Since z r > 0, the conclusion holds for(x;z r ), i.e. f (x;z r )= ˜ f(x;z r ). Because f and ˜ f are closed convex functions, f (x;z)= lim r!+¥ f (x;z r )= lim r!+¥ ˜ f(x;z r )= ˜ f(x;z): On the other hand, if for(x;z r ) such a partitionL[M[U ofN + does not exist, then neither does for (x;z) because otherwise, (LnN 0 ;MnN 0 ;UnN 0 ) would be a proper partition ofN + \supp(z). In this case, the conclusion follows from the closedness of f and f . This completes the proof. Remark 7. SetsL ,M andU in Theorem 4 can be found inO n 2 time. Indeed, without loss of generality, we assume thatå i2N + a i x i >å i2N a i x i . First, sort and index x i =z i in a nondecreasing order. It follows from the conditions in Theorem 4 that if suchL ,M andU exist, then there must be some k 1 and k 2 such thatL =fi2N + : i< k 1 g,M =fi2N + : k 1 i k 2 g and U =fi2N + : i> k 2 g. Consequently, one can verify the conditions in Theorem 4 by enumerating all possible combinationsfk 1 ;k 2 gN + . Now we turn to the special case whereN = / 0, that is, every entry of a is positive. 140 Corollary 4. 
Under Assumption 1 and a> 0, point (t;x;z)2 clconv(Q) if and only if (x;z)2 [0;1] n R n + and there exists a partitionL[M = supp(x)\ supp(z) such that 1 å i2M z i 0; max i2L a i x i z i < å i2L a i x i 1å i2M z i min i2M a i x i z i ; (4.19) and the following inequality holds t g p å i2L a i x i ;1 å i2M z i ! + å i2M g p (a i x i ;z i )+ g p å i2N + a i x i å i2L[M a i x i ;0 ! : (4.20) Proof. In this particular setting, it is easy to see thatN = / 0 andt i = 08i2[n] in (4.13). Thus, the partition defined in (4.16) always exists withU = / 0. The conclusion follows from Theorem 4. Finally, we close this section by generalizing the main result of [16] to non-quadratic functions. Specifically, in [16], the authors studied the set Q 2 def = 8 > < > : (t;x;z)2R 3 f0;2g n : t g(a 1 x 1 a 2 x 2 );x i 0; i= 1;2; x i (1 z i )= 0; i= 1;2 9 > = > ; ; where g is quadratic, and provided the description of clconv(Q 2 ) in the original space of variable. A similar result holds for general convex functions. Corollary 5. Given a convex function g() with dom(g)=R and f(x 1 ;x 2 )= g(a 1 x 1 a 2 x 2 ), where a> 0, point(t;x;z)2 clconv(Q) if and only if(x;z)2[0;1] 2 R 2 + and t 8 > > < > > : g p (a 1 x 1 a 2 x 2 ;z 1 ) if a 1 x 1 a 2 x 2 g p (a 2 x 2 a 1 x 1 ;z 2 ) if a 2 x 2 a 1 x 1 : Proof. In this case,N + is a singleton in (4.14). 141 4.3.4 Implementation In this section, we discuss the implementation of the results given in Theorem 4 (for the quadratic case) with conic quadratic solvers. A key difficulty towards using the convexification is that inequalities (4.12) are not valid: while they describe clconv(Q) in their corresponding region, determined by partitionL[M[U , they may cut off points of clconv(Q) elsewhere. To circumvent this issue, Atamt¨ urk and G´ omez [22] propose valid inequalities, each requiringO(n) additional variables and corresponding exactly with (4.12) in the corresponding region, and valid elsewhere. The inequalities are then implemented as cutting surfaces, added on the fly as needed. It is worth nothing that since the optimization problems considered are nonlinear, and convex relaxations are solved via interior point solvers, adding a cut requires resolving again the convex relaxation (without the warm-starting capabilities of the simplex method for linear optimization). In contrast, we can use Proposition 32 directly to implement the inequalities. When specialized to quadratic functions g, and with the introduction of auxiliary variables u to model conic quadratic cones, we can restate Proposition 32 as: (x;y;t)2 clconv(Q) if and only if there exists(l;t;u)2 R 3n such that t n å i=1 a 2 i u i ; (4.21a) l i u i (x i t i ) 2 ; u i 0;8i2[n]; (4.21b) a > t = 0; 0t i x i 8i2I + ; (4.21c) l i z i 1;8i2[n]; (4.21d) l 0; n å i=1 l i 1: (4.21e) is satisfied. Inequalities (4.21b) are (convex) rotated cone constraints, which can be handled by most off-the-shelf conic quadratic solvers, and every other constraint is linear. Note that using (4.21) requires addingO(n) variables once –instead of adding a similar number of variables per 142 inequality added, with exponentially many inequalities required to describe clconv(Q)–, and thus is a substantially more compact formulation than the one presented in [22]. To illustrate the benefits resulting from a more compact formulation, we compare the two formulations in instances used by [22], available online at https://sites.google.com/usc. edu/gomez/data. 
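For reference, the extended formulation (4.21) can be stated directly with a conic modeling layer, since each constraint (4.21b) is a rotated second-order cone. The sketch below is an illustration only (the experiments in this section use CPLEX); it assumes all continuous variables are nonnegative, as in the test instances, and the name rank_one_hull_constraints is ours.

import cvxpy as cp

def rank_one_hull_constraints(a, t, x, z):
    # System (4.21) for one rank-one quadratic term t >= (a'x)^2 with indicators z,
    # assuming I+ = [n] (x >= 0 enforced elsewhere) and z relaxed to [0, 1].
    n = a.shape[0]
    lam = cp.Variable(n, nonneg=True)   # lambda in (4.21)
    tau = cp.Variable(n)                # tau in (4.21)
    u = cp.Variable(n, nonneg=True)
    cons = [t >= (a ** 2) @ u,                        # (4.21a)
            a @ tau == 0, tau >= 0, tau <= x,         # (4.21c)
            lam <= z, cp.sum(lam) <= 1]               # (4.21d)-(4.21e)
    cons += [cp.quad_over_lin(x[i] - tau[i], lam[i]) <= u[i]   # (4.21b)
             for i in range(n)]
    return cons

In the relaxations used below, one block of these constraints is added for each rank-one quadratic term, so the O(n) auxiliary variables are introduced once per term rather than once per added cutting surface.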
The instances correspond to portfolio optimization problems of the form min K å k=1 t k + n å i=1 (d i x i ) 2 (4.22a) s.t. t k (a > k x) 2 (4.22b) e > x= 1 8k2[K] (4.22c) c > x h > z b (4.22d) 0 x z (4.22e) x2R n ; z2f0;1g n ; (4.22f) where a k 2R n for all k2 [K], c;d;h2R n + and b2R + . Strong relaxations can be obtained by relaxing the integrality constraints z2f0;1g n ! z2 [0;1] n , using the perspective reformu- lation (d i x i ) 2 !(d i x i ) 2 =z i in the objective, and adding inequalities (4.21) (or using cutting sur- faces) corresponding to each rank-one constraint (4.22b). Figure 4.1 summarizes the computa- tional times required to solve the convex relaxations across 120 instances with n2f200;500g and K2f5;10;20g, using the CPLEX solver in a laptop with Intel Core i7-8550U CPU and 16GB memory. In short, the extended formulation (4.21) is on average twice as fast as the cutting surface method proposed in [22], and up to five times faster in the more difficult instances (in addition to arguably being easier to implement). We also tested the effect of the extended formulation using CPLEX branch-and-bound solver. While we did not encounter numerical issues resulting in incorrect behavior by the solver (the cutting surface method does result in numerical issues, see [22]), the performance of the branch- and-bound method is substantially impaired when using the extended formulation. We discuss in 143 0 20 40 60 80 100 120 0 2 4 6 8 10 # of instances solved Time (s) extended formulation cutting surface Fig. 4.1: Number of instances solved as a function of time. The cutting surface requires on average 1.79 seconds to solve an instance, and can solve all 120 instances in 13.3 seconds or less. In contrast, the extended formulation (4.21) requires on average 0.78 seconds, and solves all instances in less than 2.6 seconds. more detail the issues of using the extended formulation with CPLEX branch-and-bound method in the appendix. 4.4 NP-hardness with upper bound constraints on continuous variables The setQ studied so far assumes that the continuous variables are either unbounded, or non- negative/non-positive. Either way, setQ admits a similar disjunctive form given in Theorem 3 and, if it is rank-one, results in similar compact extended formulations given in Proposition 32. A natural question is whether the addition of bounds on the continuous variables results in similar convexifications, or if the resulting set is structurally different. In this section, we show that it is impossible to describe the convex hull with bounded variables in a compact way, unlessP = NP. 144 Consider the set Q B def = (t;x;z)2R n+1 f0;1g n : t f(x); 0 x z : We show that describing clconv(Q B ) isNP-hard even when rank( f)= 1. Two examples are given to illustrate this point – single node flow sets and rank-one quadratic forms. Single-node fixed-charge flow set The single-node fixed-charge flow set is the mixed integer linear set defined as T def = ( (x;z)2R n f0;1g n : n å i=1 a i x i b; 0 x z ) ; where 0< a i b8i2[n]. Note that one face of the single-node flow set conv(T\f(x;z) : x i = z i 8i2[n]g) is isomorphic to the knapsack set conv(fz2f0;1g n :å n i=1 a i z i bg). If we define f(x)= g a > x , where g(t)=d(t;ft2R : t bg), it is clear thatQ B =R + T , which meansT is isomorphic to one facet of conv(Q B ). Thus, it is impossible to describe conv(Q B ) in a compact way unlessP =NP. 
Rank-one quadratic program with box-constrained continuous variables Consider the following mixed-integer quadratic program min x;z a > x 2 + b > x+ c > z s.t. 0 x z; z2f0;1g n : (4.23) 145 We aim to show (4.23) isNP-hard in general. To achieve this goal, we show that (4.23) includes the following well known 0-1 knapsack problem (4.24) as a special case. min z2f0;1g n v > z s.t. w > z W; (4.24) where(v;w;W)2Z 2n+1 + are nonnegative integers such that w i Wå j w j 8i2[n]. Proposition 35. The knapsack problem (4.24) is equivalent to the optimization problem min x;z M 1 Wx 0 + n å i=1 w i x i W ! 2 M 2 n å i=1 x i ! + n å i=1 (M 2 v i )z i s.t. 0 x i z i ; i= 0;1;:::;n z i 2f0;1g; i= 0;1;:::;n; (4.25) where M 1 =å n i=1 v i + 1 and M 2 = 2nW 2 M 1 + 1 are polynomial in the input size. Proof. Denote the objective function byq(x;z). First, since the objective function does not involve z 0 , z 0 can be safely taken as 1. Second, we now show that M 2 is large enough to force x i = z i for all i2[n]. Specifically, for any i2[n], since x 0 ;x i 1 and w i W, it holds that ¶q(x;z) ¶x i = 2M 1 w i Wx 0 + n å i=1 w i x i W ! M 2 2nW 2 M 1 M 2 < 0: That is,q(x;z) is decreasing with respect to x i ;i2[n], which implies x i = z i in any optimal solution. It follows that (4.25) can be simplified to min x 0 ;z M 1 Wx 0 + n å i=1 w i z i W ! 2 n å i=1 v i z i s.t. 0 x 0 1; z i 2f0;1g; i= 1;:::;n; (4.26) 146 Next, we claim that M 1 is large enough to ensure that the optimal solution to (4.26) satisfies Wx 0 +å n i=1 w i z i W = 0; that is, x 0 = 1 w > z =W. To prove it rigorously, observe that the min- imum value of (4.26) must be non-positive since x 0 = 1;z= 0 is a feasible solution with objective value equal to 0. Moreover, since 1 w > z =W 1, if x 0 6= 1 w > z =W at the optimal solution to (4.26), then 1 w > z =W < 0 and the optimal x 0 must attain its lower bound 0. Furthermore, 1 w > z =W < 0 implies that w > z W+ 1 since w; W and z are nonnegative integers. In this case, setting x 0 = 0, the minimum objective value of (4.26) can be written as M 1 n å i=1 w i z i W ! 2 n å i=1 v i z i > M 1 å i v i = 1> 0; which contradicts the non-positivity of the optimal objective value. Therefore, we can substitute out x 0 = 1 w > z =W and (4.25) further reduces to min z n å i=1 v i z i s.t. 0 1 w > z W 1; z i 2f0;1g; i= 1;:::;n; which is equivalent to (4.24), because w > z =W 0 holds trivially. Due to the equivalence between optimization problems and separation problems [115], Propo- sition 35 indicates that it is impossible to extend the analysis in Section 4.3.1 and Section 4.3 to the case with bounded continuous variables. 147 4.5 Extensions In this section, we allow some constraints to be imposed on binary variables z i 8i2[n] and consider Q def = 8 > < > : (t;x;z)2R n+1 f0;1g n : t f(x);x i 08i2I + ; z2Z;x i (1 z i )= 08i2[n] 9 > = > ; ; where f(x)= g(Ax)+ c > x is a proper closed convex function, [n] def =f1;2;:::;ng,I + [n] and Zf0;1g n . It turns out that the analysis developed in Section 4.2 can be refined and applied to the constrained case. Similarly, one can redefine the following sets for eachI[n]: X(I) def =X\f(t;x) : x i = 08i = 2Ig; Z(I) def =Z\fz : z i = 18i2Ig; V(I) def =X(I)Z(I): For the same reason,Q can be expressed as a disjunction Q= [ I[n] V(I): (4.27) Vector ¯ z2Z is called a maximal element ofZ if supp(¯ z) is not a proper subset of supp(z) for all z2Z . Denote the set of maximal elements ofZ by ¯ Z . 
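As a small illustration (ours, not part of the original text), the maximal elements of a finite family Z given explicitly as 0/1 tuples can be extracted by a direct support comparison; the helper name is hypothetical.

```python
def maximal_elements(Z):
    """Return the maximal elements of a finite family Z of 0/1 tuples:
    z is maximal if supp(z) is not a proper subset of supp(z') for any z' in Z."""
    supp = lambda z: frozenset(i for i, zi in enumerate(z) if zi == 1)
    return [z for z in Z if not any(supp(z) < supp(z2) for z2 in Z)]

# For instance, if Z encodes a cardinality upper bound (sum_i z_i <= m), the maximal
# elements are exactly the vectors with sum_i z_i = m, consistent with the
# cardinality-constrained case discussed below.
```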
For anyI[n], define R(I) def =Q\f(t;x;z) : Ax= 0; supp(z)=Ig: Notice thatR(I) is a closed convex set inasmuch as z is uniquely determined by supp(z). The counterparts of Proposition 31 and Theorem 3 can also be established as follows. Proposition 36. If f(x) is a proper, closed, convex and homogeneous function and e2Z , then clconv(Q)=X clconv(Z). 148 The proof of Proposition 36 is exactly the same as that of Proposition 31 and thus be omitted. The proof of the following Theorem 5 is put in the appendix. Theorem 5. Assume f(x) is a proper closed convex function with rank( f) k and f(0)= 0. Then clconv(Q)= clconv 0 @ [ I :jIjk V(I) [ R 1 A ; whereR is any set satisfying S z2 ¯ Z R(supp(z))R Q. What is more, iffx2R n : Ax= 0; x i 08i2I + g=f0g, then one can takeR= / 0 instead. WhenjIj is a small number, clconv(V(I)) =X(I) conv(Z(I)) is usually easy to obtain. In this case, constructing clconv(Q) amounts to constructing clconv(R). Next, we discuss how to construct clconv(R) in some common settings: Group sparsity:Z = z2f0;1g n : z i = z j 8i; j2P p ; p2[m] , wherefP p g p2[m] is a par- tition of[n]. Strong hierarchy structure: Z = z2f0;1g n : z i z j 8(i; j)2E , whereE is the set of links of variables. Upper bounds on cardinality:Z =fz2f0;1g n :å n i=1 z i mg, where m2Z + . Lower bounds on cardinality:Z =fz2f0;1g n :å n i=1 z i mg, where m2Z + . In the first three cases, e2Z which implies the set of maximal elements is a singleton ¯ Z =f[n]g. If we chooseR =R([n]), one can deduce that clconv(R)=f(t;x;z) : t c > x;Ax= 0gfeg. The last case is discussed in Proposition 37. Proposition 37. Under the same assumptions of Theorem 3, ifZ =fz2f0;1g n :å i z i mg or Z =fz2f0;1g n :å i z i = mg and m k+1, one can takeR=Q\f(t;x;z) : Ax= 0;å n i=1 z i = mg in Theorem 3. What is more, clconv(R)= X\ (t;x) : Ax= 0 z2[0;1] n : n å i=1 z i = m : Proof. Since the set of maximal elements is ¯ Z =fz2f0;1g n :å i2[n] z i = mg in both cases, it can be verified directly thatR satisfies the requirement in Theorem 3. It remains to prove the 149 convex hull result aboutR. Denote the convex relaxation ofR in the proposition asS . Note that clconv(R) andS are polyhedral. Hence, it is sufficient to proveS is integral in z. For any a =(a t ;a x ;a z ) witha t 0, consider max (t;x;z)2S a t t+a > x x+a > z z which is equivalent to v = max t;x;z (a x +a t c) > x+a > z z s.t. Ax= 0; x i 08i2I + ; å i2[n] z i = m; z2[0;1] n : If v finite, an optimal solution(x ;z ) exists. Because(rx ;z ) is feasible for any r 0, we must have(a x +a t c) > x = 0. Hence x can be taken as 0. What is more, it can be seen easily that z is integral. This implies(0;0;z )2R. If v =+¥, there exists an extreme ray d in the feasible region of the linear program such that at most k+1 entries of d are nonzero. Due to k+1 m, there exists a vector z such thatå i2[n] z i = m and supp(d) supp(z ). It implies that (rc > d;rd;z )2R for all r 0. The conclusion follows. 4.6 Conclusions In this chapter, we propose a new disjunctive programming representation of the convex enve- lope of a low-rank convex function with indicator variables and complementary constraints. The ensuing formulations are substantially more compact than alternative disjunctive programming formulations. As a result, it is substantially easy to project out the additional variables to recover formulations in the original space of variables, and to implement the formulations using off-the- shelf solvers. 
150 5 Fractional 0-1 programming and submodularity 5.1 Introduction We consider a multiple-ratio fractional 0–1 program given by: max x2F å k2M å i2N a ki x i b k0 +å i2N b ki x i ; (5.1) where M=f1;:::;mg, N =f1;:::;ng andF :=fx2f0;1g n : Dx dg for given D2R qn and d2R q . Problem (5.1) is often referred to as a multiple-ratio hyperbolic 0–1 program. Problems of the form (5.1) can also be viewed as a class of set-function optimization problems that seek a subset S of N with its indicator variableI fSg 2R n , where the i-th element ofI fSg is 1 if and only if i2 S. Throughout the note, we make the following assumptions: A1: b k0 +å i2N b ki x i > 0 for all k2 M and all x2Fnf0g. A2: a ki 0, b k0 0 and b ki > 0 for all k2 M and i2 N. A3:F is downward closed, i.e., if S2F then T2F for all T S. Assumption A1 is standard in fractional optimization [60, 61, 172]. In particular, if b k0 > 0, then it ensures that the denominator is strictly positive for each ratio and the objective function in (5.1) is well defined. Our main result (see Theorem 6 in Section 5.3) requires b k0 > 0 to hold. We provide additional discussions on (5.1) with b k0 = 0 in Sections 5.3.4 and 5.4.2. 151 Assumption A2 is not too restrictive as it naturally holds in many application settings, see ex- amples in [61], including those considered in this note, see Section 5.4. For our results developed in this note we also require an additional relatively mild assumption A3 on the structure of the fea- sible region,F , in (5.1). We note that many types of feasible regions considered in the literature, such asF = 2 N (unconstrained problem),F =fS N :jSj pg for some positive integer p (car- dinality constraint) andF =fS N : å i2S w i cg for some weights w 0 and c 0 (capacity constraint) all satisfy assumption A3. Applications of single- and multiple-ratio fractional 0–1 programs as in (5.1) appear in many diverse areas. For example, M´ endez-D´ ıaz et al. [173] discuss an assortment optimization prob- lem under mixed multinomial logit choice models (MMNL). Tawarmalani et al. [209] consider a facility location problem, where a fixed number of facilities need to be located to service cus- tomers locations with the objective of maximizing a market share. Arora et al. [11] study a class of set covering problems in the context of airline crew scheduling that aim at covering all flights operated by an airline company. Furthermore, many combinatorial optimization problems can be formulated in the form (5.1) including the minimum fractional spanning tree problem [74, 218], the maximum mean-cut problem [139, 190] and the maximum clique ratio problem [200]. More application examples can be found in the studies by [51, 93, 158], the recent survey by Borrero et al. [61] and the references therein. While in general problem (5.1) isNP-hard even in the case of a single ratio [122, 187], single-ratio problems can be solved in polynomial time under A1 and A2 and additional assump- tions, e.g., unconstrained problems [122], or problems where the convex hull ofF is known [171, 198]. Furthermore, Rusmevichientong et al. [199] show that for the unconstrained multi- ratio problem, there is no approximation algorithm with polynomial running time that has an ap- proximation factor better thanO 1=m 1d for anyd > 0. Other related theoretical computational results are discussed in [187, 188]. 
152 Exact solution methods for (5.1) encompass mixed-integer programming reformulations [60, 93, 172], branch and bound algorithms [209], and other enumerative methods [61, 113, 121]. How- ever, due toNP-hardness of (5.1), these methods do not scale well when the size of the problem increases. Motivated by these computational complexity considerations, a number of studies rely on approximation schemes and heuristics for solving (5.1). Rusmevichientong et al. [197], Mittal and Schulz [175] and D´ esir et al. [83] all propose approximation algorithms for assortment op- timization under the MMNL model when the number of customer segments, m, is fixed. Amiri et al. [7] develop a heuristic algorithm based on Lagrangian relaxation in the context of stochastic service systems. Prokopyev et al. [188] present a GRASP-based (Greedy Randomized Adaptive Search) heuristic for solving the cardinality constrained problems. Finally, simple greedy algo- rithms are also used in the literature [95, 152]. However, it is often not well understood when such algorithms perform well. Contributions and outline. The remainder of the note is organized as follows. In Section 5.2, we overview some necessary preliminaries and formulate our model (5.1) in terms of set functions. In Section 5.3, we provide the main result of the note that characterizes the submodular- ity of a single ratio. Submodularity is often a key property for devising approximation algo- rithms [96, 178]. If the objective function can be identified as a submodular function, then simple greedy algorithms are capable of delivering high-quality solutions. In fact, it is possible to obtain (1 e 1 )-approximations under a variety of feasible regions – independently of the number of the ratios, m, involved–, thus improving over existing approximation methods for (5.1). We also discuss the connections between submodularity and monotonicity in the context of fractional 0–1 optimization. In Section 5.4, we consider our theoretical results in the context of two applications – the as- sortment optimization and the p-choice facility location problems. For the assortment optimization problem, our results suggest that submodularity is linked to a phenomenon known as cannibaliza- tion [177], and naturally arises in several important scenarios. The results can also be applied in the case when there is a fixed cost associated with offering a product in the assortment [15, 153], 153 which arises, for example, in online advertisement with costs-per-impression. For the p-choice facility location problem [209], we show how to reformulate the original problem in a desirable form that can be then exploited to benefit from the submodularity property. Finally, we conclude the note in Section 5.5. 5.2 Preliminaries Next, we provide relevant background on submodular optimization, and discuss how our theoreti- cal results can be used in practice; see also our brief discussion on applications in Section 5.4. 5.2.1 Notation Let a k =(a ki ) i2N and b k =(b ki ) i2N[f0g for all k2 M, and for given a k 2R n and b k 2R n+1 , define h(x;a k ;b k ) := å i2N a ki x i b k0 +å i2N b ki x i : Then equation (5.1) can be rewritten as max x2F å k2M h(x;a k ;b k ): (5.2) This form appears in many applications such as the retail assortment and the p-choice facility location problems. Note that for each x2f0;1g n , there is a unique set S=fi2 N : x i = 1g N, and conversely, each S N corresponds to an indicator vectorI fSg 2f0;1g n . 
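For concreteness, a single ratio h(x; a_k, b_k) and the objective of (5.2) can be evaluated directly from these definitions; the sketch below is illustrative only, and the function names are ours.

```python
def h_ratio(x, a_k, b_k, b_k0):
    """One ratio h(x; a_k, b_k) = (sum_i a_ki x_i) / (b_k0 + sum_i b_ki x_i)."""
    num = sum(a * xi for a, xi in zip(a_k, x))
    den = b_k0 + sum(b * xi for b, xi in zip(b_k, x))
    return num / den   # well defined for feasible x by assumption A1

def multi_ratio_objective(x, a, b, b0):
    """Objective of (5.2): the sum over segments k of h(x; a_k, b_k)."""
    return sum(h_ratio(x, a[k], b[k], b0[k]) for k in range(len(a)))
```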
Thus, we can rewrite h(x;a k ;b k ) as a set function h(S;a k ;b k ) := h(I fSg ;a k ;b k ); and regardF as the domain of sets, i.e.,F 2 N . Thereafter, we may use the vector form and the set form of (5.2) interchangeably for convenience. The main result of this note is a necessary and sufficient condition for the submodularity of each function h(;a k ;b k ). 154 5.2.2 Submodularity and approximation algorithms. A set function f : 2 N !R from the subsets of N to the real numbers is submodular overF if it exhibits diminishing returns, i.e., f(S[fig) f(S) f(T[fig) f(T) for all S T Nnfig such that T[fig2F . Equivalently, function f is submodular overF if f(S[fi; jg) f(S[f jg) f(S[fig) f(S) (5.3) for all S N and i; j62 S such that S[fi; jg2F . The greedy algorithm, see its pseudo-code in Algorithm 1, is a popular choice for tackling monotone submodular maximization problems because it is easy to implement and gives a constant- factor approximation in many cases. When the feasible region is a matroid, the greedy algorithm produces a solution with 1/2 approximation factor; see [96]. When the feasible region is given by a cardinality constraint, the approximation ratio can be improved to (1 e 1 ); see [178]. Other (1 e 1 )-approximation algorithms or near-optimal algorithms have also been provided for other classes of feasible regions over the years [65, 151, 206], for example, whenF is defined with a single or multiple capacity constraints. We now discuss three settings in which the results of this note can be directly used to obtain approximation algorithms for fractional problems. Constrained single-ratio problems. Consider a single-ratio instance of (5.1), i.e.,jMj= 1, but the convex hull ofF is not known, for example, whenF is defined by multiple knapsack constraints,F =fS N :å i2S w `i c ` ; `= 1;:::;mg. Clearly, this class of problems is NP-hard (recall our discussion in Section 5.1). Nonetheless, if function h(;a k ;b k ) is monotone and submod- ular (see Proposition 38 in Section 5.3.2), then existing approximation algorithms can be used; see, e.g., [151]. 155 Single ratio plus a linear function. Consider the special case of a two-ratio problem (5.1) where one of the ratios is a linear function, i.e., max x2F å i2N c i x i + å i2N a i x i b 0 +å i2N b i x i : (5.4) Problem (5.4) is NP-hard and arises, for example, in assortment optimization problems with fixed costs of including an item into the assortment; see, e.g., [15, 153]. It can be easily checked [15] that if all a i =b i = a j =b j for all i6= j, in which case å i2N a i x i b 0 +å i2N b i x i =l å i2N b i x i b 0 +å i2N b i x i for somel> 0, then prob- lem (5.4) is a submodular optimization problem. In this note we provide more general conditions for verifying whether (5.4) is submodular; see, for example, Proposition 41 in Section 5.4.1. Multiple ratios. The objective of the multiple-ratio problem (5.1) is submodular if all functions h(;a k ;b k ) are submodular. While such conditions may appear to be quite restrictive, we show that several existing results concerning the tractability of assortment optimization problems correspond to submodularity of each ratio; for example, see Propositions 42 and 43 in Section 5.4.1. 5.2.3 Submodularity and cutting-plane methods. The submodularity results presented in this note can be systematically exploited to efficiently solve problems via mixed-integer programming solvers, even if the ratios in (5.1) are not submodular. 
Indeed, the convex envelope of a submodular function is described by its Lovász extension [163], which can be incorporated into branch-and-bound solvers via facet-defining valid inequalities; see [20, 25]. The concave envelope of a submodular function cannot, in general, be described efficiently, but valid inequalities approximating it have been proposed nonetheless [1, 178]. Specifically, in the context of fractional optimization, given an arbitrary (and possibly non-submodular) ratio, Atamtürk and Narayanan [27] use the characterization of submodularity presented in this note to express it as a difference of submodular functions of the form
\[
\frac{\sum_{i\in N} a_i x_i}{b_0+\sum_{i\in N} b_i x_i} \;=\; \frac{\sum_{i\in N}(a_i+c_i)x_i}{b_0+\sum_{i\in N} b_i x_i} \;-\; \frac{\sum_{i\in N} c_i x_i}{b_0+\sum_{i\in N} b_i x_i}, \tag{5.5}
\]
where both ratios are submodular. They then use the aforementioned results to generate cutting planes corresponding to the convex and concave envelopes of each ratio, thus strengthening the mixed-integer programming formulations. We refer the reader to [27] for further details on the implementation of their method and the corresponding computational results.

5.3 Submodularity of a single ratio and its implications

5.3.1 A necessary and sufficient condition

In this section, we give a necessary and sufficient condition for the submodularity of the function h(·); see Theorem 6. As a direct consequence, if h(·; a^k, b^k) satisfies the condition of Theorem 6 for every k ∈ M, then the fractional 0–1 program (5.2) admits a constant-factor approximation algorithm. For convenience, we drop the superscript k in a^k and b^k and use the notation h(·; a, b) throughout this section. We first consider the case where b_0 > 0. The key result of this note is as follows.

Theorem 6. If b_0 > 0, then the function h(·; a, b) is submodular over F if and only if
\[
h(S\cup\{i\};a,b) + h(S\cup\{j\};a,b) \;\le\; \frac{a_i}{b_i} + \frac{a_j}{b_j} \tag{5.6}
\]
for all S ⊆ N and all i, j ∉ S with i ≠ j such that S ∪ {i} ∪ {j} ∈ F.

Proof. Recall that assumption A2 holds; thus, the right-hand side of (5.6) is well defined. Let S ⊆ N, let i, j ∉ S with i ≠ j be such that S ∪ {i} ∪ {j} ∈ F, and define A_S = Σ_{j∈S} a_j and B_S = b_0 + Σ_{j∈S} b_j. Observe that h(S; a, b) = A_S / B_S.
From (5.3) we find that h(;a;b) is submodular if and only if A S + a i + a j B S + b i + b j A S + a j B S + b j A S + a i B S + b i A S B S : 157 Multiplying both sides by B S (B S + b i + b j ), we get the equivalent condition B S A S + a i + a j B S 1+ b i B S + b j (A S + a j ) B S 1+ b j B S + b i (A S + a i )(B S + b i + b j )A S , a i B S b i B S (A S + a j ) B S + b j a i B S b i A S b j A S + b j B S + b i B S (A S + a i ) , a i B S b i B S (A S + a j ) B S + b j a i B S b i A S + b j B S + b i B S A S + B S a i (B S + b i )A S , a i B S b i B S (A S + a j ) B S + b j a i B S b i A S + b j B S + b i (a i B S b i A S ): Adding b i A S + b i B S A S +a j B S +b j a i B S to both sides, we find b i A S b i B S (A S + a j ) B S + b j + b j B S + b i (a i B S b i A S ) , b i A S (B S + b i ) B S + b j b i B S (A S + a j )(B S + b i )+ b j (B S + b j )(a i B S b i A S ) , b i A S B 2 S + b i A S B S (b i + b j )+ b 2 i b j A S b i A S B 2 S + b i a j B 2 S + B S A S b 2 i + a j b 2 i B S + a i b j B 2 S + a i b 2 j B S b i b j A S B S b i b 2 j A S : After rearranging and canceling out some terms in the above expression, we obtain: 2b i b j A S B S + b 2 i b j A S + b i b 2 j A S B 2 S (a i b j + a j b i )+ a j b 2 i B S + a i b 2 j B S , b i b j A S (b i + b j + 2B S ) b i b j a j =b j B 2 S + a j =b j b i B S + a i =b i B 2 S + a i =b i b j B S , A S (b i + b j + 2B S ) a j =b j B 2 S + a j =b j b i B S + a i =b i B 2 S + a i =b i b j B S ,(B S + b i )(A S a j =b j B S )+(B S + b j )(A S a i =b i B S ) 0: 158 Finally, dividing by(B S + b i )(B S + b j ) and then adding a i =b i + a j =b j on both sides, we get , A S a j =b j B S B S + b j + a j =b j + A S a i =b i B S B S + b i + a i =b i a i =b i + a j =b j , A S + a j B S + b j + A S + a i B S + b i a i =b i + a j =b j ; which is precisely inequality (5.6). As we discuss next, submodularity is closely linked to monotonicity. 5.3.2 Monotonicity implies submodularity The function h(;a;b) is monotone nondecreasing if h(S;a;b) h(S[f jg;a;b) (5.7) for every set S and j62 S such that S[f jg2F . Monotonicity is often a prerequisite for greedy algorithms, see, e.g., [178], to guarantee a constant approximation factor. Also, it arises naturally in many applications; see Section 5.4.1.1 for details. As we show next, monotonicity is a sufficient condition for submodularity. Proposition 38. If function h(;a;b) is monotone nondecreasing, then h(;a;b) is submodular. Proof. Condition (5.7) is equivalent to å i2S a i b 0 +å i2S b i å i2S a i + a j b 0 +å i2S b i + b j , 1+ b j b 0 +å i2S b i å i2S a i å i2S a i + a j , å i2S a i b 0 +å i2S b i a j b j , h(S;a;b) a j b j (5.8) 159 for all S and j62 S. Therefore, if i; j62 S, then h(S[fig;a;b) a i =b i and h(S[f jg;a;b) a j =b j , and inequality (5.6) follows. Inequality (5.8) needs to hold for every combination of set S and element i for the function to be monotone. Note that h(S[f jg;a;b) is the weighted average of h(S;a;b) and a j =b j given by h(S[f jg;a;b)= b 0 +å i2S b i b 0 +å i2S b i + b j å i2S a i b 0 +å i2S b i + b j b 0 +å i2S b i + b j a j b j ; and hence, (5.8) is equivalent to h(S[f jg;a;b) a j =b j . Therefore, checking monotonicity can be done by verifying that a j b j max S[f jg2F h(S[f jg;a;b) 8 j2 N; (5.9) and the optimization problem on the right-hand side of (5.9) can be solved using existing algo- rithms for single-ratio fractional optimization; see [171, 190]. In fact, in some cases monotonicity can be verified without solving an optimization problem. Corollary 6. 
Function h(;a;b) is monotone nondecreasing (and submodular) over 2 N if and only if min i2N a i b i h(N;a;b): Proof. The forward direction follows directly from (5.9). For the backward direction, let a =b = min i2N fa i =b i g, and then we find that h(N;a;b) a b , å i2N a i b 0 +å i2N b i a b , å i2N a i b i a b b i a b b 0 : Since a i =b i a =b , we find thatå i2S (a i =b i a =b )b i (a =b )b 0 for any S N, i.e., h(S;a;b) a =b for any S N. 160 5.3.3 On non-monotone submodular functions From Proposition 38, we know that monotonicity implies submodularity. In general, as Example 5 below shows, the converse does not hold. Example 5. Assume we have three variables, i.e., N =f1;2;3g, with the setting (a 1 ;a 2 ;a 3 )= (3;2;1) and (b 0 ;b 1 ;b 2 ;b 3 ) = (2;1;1;1). Then from Theorem 6 we can verify that h(;a;b) is submodular over 2 N : since a i =b i + a j =b j 3 for any i6= j and, for any S N, h(S;a;b) h(f1;2g;a;b)= 5=4 3=2, we find that inequality (5.6) holds. However, h(f3g;a;b)= 1=3< h(f1;2;3g;a;b)= 6=5< h(f1;2g;a;b)= 5=4, and monotonicity does not hold. Nonetheless, if h(;a;b) is submodular, then it is in fact very close to a nondecreasing function as shown in Proposition 39 below. In particular, if the decision variable with the smallest value a i =b i is fixed, then the resulting function is monotone. Assume for the remainder of this section, without loss of generality, that a 1 =b 1 a 2 =b 2 a n =b n . DefineF 1 :=fS2F : n2 Sg andF 2 :=fS2F : n = 2 Sg. Proposition 39. If h(;a;b) is submodular overF , then the following holds: (i) function h(;a;b) is monotone nondecreasing overF 1 ; (ii) for any S2F 2 and any j6= n such that S[f jg2F and S[fng2F , we have h(S[f jg;a;b) h(S;a;b). Proof. We first prove h(;a;b) is monotone nondecreasing overF 1 by contradiction. Assume there exists S and j6= n such that n= 2 S and h(S[f j;ng;a;b)< h(S[fng;a;b). Because h(S[f j;ng;a;b) is a convex combination of h(S[fng;a;b) and a j =b j , we have a j =b j < h(S[fng;a;b). Since a n =b n a j =b j , we find that a n =b n < h(S[fng;a;b). Note that h(S[fng;a;b)= b 0 +å i2S b i b 0 + b n +å i2S b i h(S;a;b)+ b n b 0 + b n +å i2S b i a n b n (5.10) is a convex combination of h(S;a;b) and a n =b n , and since a n =b n < h(S[fng;a;b), it follows that h(S[fng;a;b)< h(S;a;b). By submodularity, h(S[f j;ng;a;b) h(S[f jg;a;b) h(S[ 161 fng;a;b) h(S;a;b)< 0, which indicates that h(S[f j;ng;a;b)< h(S[f jg;a;b). Thus, a n =b n < h(S[f jg;a;b). However, this implies h(S[f jg;a;b)+ h(S[fng;a;b)> a j =b j + a n =b n , which is a contradiction based on Theorem 6. Thus, (i) holds. Next, we prove (ii) by contradiction. Assume there exists S and j6= n such that n= 2 S and h(S[ f jg;a;b)< h(S;a;b). Because h(S[f jg;a;b) is the weighted average of a j =b j and h(S;a;b), we have that a j =b j < h(S[f jg;a;b)< h(S;a;b). Recall that a n =b n a j =b j . Hence, a n =b n < h(S;a;b), which implies a n =b n < h(S[fng;a;b) – using similar arguments as in the proof of (i). Hence, h(S[ f jg;a;b)+ h(S[fng;a;b)> a j =b j + a n =b n , which contradicts the submodularity of h(;a;b). Corollary 7. If eitherF = 2 N orF =fS N :jSj pg for any p2f1;:::;n1g, then submod- ularity of h(;a;b) overF implies that h(;a;b) is monotone nondecreasing overF 1 andF 2 . Example 1 (Continued). Observe that h(;a;b) is indeed monotone overF 1 , since h(f3g;a;b)= 1=3, h(f1;3g;a;b)= 1, h(f2;3g;a;b)= 3=4 and h(f1;2;3g;a;b)= 6=5. Similarly, we can verify that h(;a;b) is monotone overF 2 since h(/ 0;a;b)= 0, h(f1g;a;b)= 1, h(f2g;a;b)= 2=3 and h(f1;2g;a;b)= 5=4. 
5.3.4 On homogeneous fractional functions In this section, we show that the assumption b 0 > 0 is indeed necessary in Theorem 6, as otherwise submodularity does not hold in most practical situations. Proposition 40 below formalizes this statement. Proposition 40. Assume b 0 = 0. If there exists a feasible set S such that there are at least three distinct values for a i =b i ; i2 S, then h(;a;b) is not submodular. Proof. Assume without loss of generality that a 1 =b 1 < a 2 =b 2 < a 3 =b 3 . Then the following in- equality b 1 b 1 + b 3 a 3 b 3 a 1 b 1 + b 2 b 2 + b 3 a 3 b 3 a 2 b 2 b 1 b 1 + b 2 + b 3 a 3 b 3 a 1 b 1 + b 2 b 1 + b 2 + b 3 a 3 b 3 a 2 b 2 : 162 holds since denominators are greater on the right-hand side. Subtracting 2(a 3 =b 3 ) on both sides, we find that a 1 + a 3 b 1 + b 3 a 2 + a 3 b 2 + b 3 a 1 + a 2 + a 3 b 1 + b 2 + b 3 a 3 b 3 ; which is equivalent to h(f1;3g;a;b)+h(f2;3g;a;b) h(f1;2;3g;a;b)+h(f3g;a;b), violating the definition of submodularity. 5.3.5 Submodularity testing In this section, we discuss how to verify whether h(;a;b) is submodular overF . By Theorem 6, to test for the submodularity, it suffices to compute t i j := max S[fig[f jg2F i; j= 2S h(S[fig;a;b)+ h(S[f jg;a;b) (5.11) for each pairfi; jg, i6= j, and check whether t i j a i =b i + a j =b j . The maximization problem (5.11) involves an exponential number of candidates to be con- sidered; hence, this problem is not trivial for a general feasible regionF . Nevertheless, for the unconstrained problems, i.e.,F = 2 N , using Algorithm 3 discussed below, submodularity testing can be achieved in polynomial time due to the connection between submodularity and monotonic- ity as outlined in Sections 5.3.2 and 5.3.3. Algorithm 3 Algorithm for submodularity testing withF = 2 N Step 1. Sortfa i =b i g n i=1 in the non-increasing order, i.e. a 1 =b 1 a 2 =b 2 a n =b n . Step 2. Compute and compare h([n 1];a;b) with a n1 =b n1 . If h([n 1];a;b)> a n1 =b n1 , sub- modularity fails to hold and stop; otherwise, set i := 1 and go to Step 3. Step 3. Set t in := h([n 1];a;b)+ h([n]nfig;a;b). If t in > a i =b i + a n =b n , submodularity fails to hold and stop; otherwise, set i := i+ 1. Next, if i= n, then go to Step 4; otherwise, go to Step 3. Step 4. Return that submodularity holds. 163 Define[k] :=f1;:::;kg for any k2 N and assumeF = 2 N . Recall thatF 1 =fS2F : n2 Sg andF 2 =fS2F : n = 2 Sg. By Corollary 7, submodularity of h(;a;b) must imply its mono- tonicity overF 1 andF 2 . By Corollary 6 monotonicity overF 2 is equivalent to h([n 1];a;b) a n1 =b n1 . Therefore, submodularity does not hold if h([n 1];a;b)> a n1 =b n1 ; see Step 2 of Algorithm 3. Now assume h([n 1];a;b) a n1 =b n1 , and we need to verify monotonicity of h(;a;b) over F 1 . Note that h([n];a;b) is the weighted average of h([n 1];a;b) and a n =b n ; see equation (5.10) with S=[n1]. Then it follows that h([n];a;b) maxfh([n1];a;b); a n =b n g a n1 =b n1 , where the second inequality results from Step 1. Hence, we conclude that h(;a;b) is monotone overF 1 by Corollary 6. From monotonicity of h(;a;b) overF 1 andF 2 , we find that h(S[fig;a;b) maxfh([n 1];a;b); h([n];a;b)g for all S N and i2 N. In view of (5.11), if i6= n and j6= n, t i j 2max n h([n 1];a;b); h([n];a;b) o 2 a n1 b n1 a i b i + a j b j : If j= n, the optimal value of (5.11) is attained at S= Nnfi;ng by monotonicity, which implies t in = h([n 1];a;b)+ h([n]nfig;a;b). This observation justifies Steps 3 and 4 of Algorithm 3 and concludes our discussion of the proposed approach. 
5.4 Applications In this section, we discuss the implications of our theoretical results in the context of the assortment optimization and the p-choice facility location problems. 164 5.4.1 Assortment optimization problem In the assortment optimization problem, a firm offers a set of products to utility-maximizing cus- tomers. The goal of the firm is to choose an assortment of products that maximizes its expected revenue. It is a core revenue management problem pervasive in practice [207]. In this subsection, we mainly consider this problem under the mixed multinomial logit model (MMNL); see, e.g., [59, 170]. Formally, let N be the set of products that can be offered to customers. Denote by r i the revenue perceived by the firm if a customer chooses product i2 N. Under the MMNL model, each product i2 N is associated with a random weight v ki > 0, and the no-purchase option is associated with weight v k0 > 0; these weights encode the relative preferences for the products by a customer of type k2 M, i.e., set M describes market segments. Given the preference weights v k , if assortment S N is offered, then the probability that a customer in k2 M chooses product i2 S is given by q(i;S;v k )= v ki v k0 +å i2S v ki : The conditional expected revenue from offering assortment S N is r(S;v k )= å i2S r i q(i;S;v k ): Taking the expectation over the random vector v k , we formulate the assortment optimization prob- lem under the MMNL model as max S2F E v [r(S;v)]= å k2M p k r(S;v k ); (5.12) where p k is the probability of a customer to be in segment k and each realization of v can be interpreted as the preferences associated with a given customer of customer segment. We assume 165 that the support of v is finite. Hence, (5.12) can be posed in the form of (5.1), where a ki = p k r i v ki , b ki = v ki and b k0 = v k0 for all k2 M and i2 N. Thus, a ki =b ki = p k r i . Finally, we note that p k 0 for each k2 M. Hence, for submodularity of the objective function in (5.12) it is sufficient to consider the single-ratio functions r(;v k ), k2 M. Therefore, in our discussion below when applying the results of Theorem 6 and Corollary 6 (with ratio a i =b i ), the multiplier p k can be dropped from consideration. 5.4.1.1 Cannibalization and submodularity Intuitively, in retail assortment problems, monotonicity of the revenue function implies that there is limited cannibalization, i.e., the introduction of a new product i (when feasible) always increases the expected revenue perceived by the firm – despite that the revenue obtained from previously offered products in S might decrease slightly. To be more specific, this limited cannibalization phenomenon arises in online advertising: the probability that a given customer clicks on an ad is often quite low, and the advertiser usually profits from offering more ads within the limited number of spots on the webpage. Let r min = min i2N r i and r max = max i2N r i . By Proposition 38 and Corollary 6, we obtain the following results in terms of revenue functions immediately. Corollary 8. If function r(;v) is monotone nondecreasing, then r(;v) is submodular. Corollary 9. Function r(;v) is monotone nondecreasing (and submodular) over 2 N if and only if r min r(N;v). 5.4.1.2 Revenue spread, no-purchase probability and submodularity When the revenues r of all products are identical, assortment optimization problems are known to be submodular maximization problems [15, 82]. 
Intuitively, one would expect that if the revenues are sufficiently close (but not identical), then submodularity should be preserved. Proposition 41 formalizes this intuition: if the gap between the largest and the smallest revenues is bounded above by the odds of no-purchase, then the function is nondecreasing and submodular. 166 Proposition 41. If r max r min r min min S2F 1 q(S;v) q(S;v) ; (5.13) then r(;v) is nondecreasing and submodular, where r max and r min are the largest and smallest revenues, respectively, and q(S;v)=å i2S q(i;S;v) is the probability that an item is purchased. Proof. Equation (5.13) can be rewritten as r max q(S;v) r min for all S2F . Since for any S and i62 S it follows that r(S;v) r max q(S;v) r min r i , we find that (5.8) is satisfied and the function r(;v) is monotone submodular. Proposition 41 provides us with additional intuition on the industries in which the expected rev- enues are submodular functions of the assortment offered. In the online advertisement, where the revenues obtained from clicks are usually similar and the odds of no-purchase are high, we would expect to obtain submodular revenue functions. In a monopoly, the firm offering the assortment would have a large flexibility in setting prices (resulting in a large revenue spread) and the odds of no-purchase would be low (due to the lack of competing alternatives), resulting in a revenue func- tion that is not submodular. In contrast, in a competitive market, the odds of no-purchase would be larger and firms have little or no control over prices (and if the values r i are interpreted as profits instead of revenues, the spread would typically be low), resulting in submodular revenue functions. From Proposition 41 we also gain insights on the differences between revenue management in the airline and hospitality industries, two industries that are often treated as equivalent in the liter- ature [208]. In the hospitality industries, no-purchase odds can be high as shown by the relatively low occupancy rates – 66.1% in the US [136] in 2018; in addition, revenue differences between products are often due to ancillary charges (e.g., breakfast, non-refundable, long stay), which ac- count for a small portion of the baseline price for a room. In such circumstances we would expect revenue functions to be submodular and simple greedy heuristics to perform well. In contrast, in the airline industries no-purchase odds are often smaller – the load factor was 86.1% in the US in 2018 [108] –, and air fares can change dramatically depending on the conditions. Thus, in 167 the airline industry we would expect to encounter non-submodular revenue functions, and simple heuristics may be inadequate. 5.4.1.3 On the greedy algorithm and revenue-ordered assortments Revenue-ordered assortments are optimal for unconstrained assortment optimization under the MNL model, and tend to perform well in practice [207]. Berbeglia and Joret [44] study the revenue- ordered assortments under the general discrete choice model and prove performance guarantees. Proposition 42 (Berbeglia and Joret [44]). Revenue-ordered assortments are a 1 1+log r max r min -approxi- mation for the unconstrained assortment optimization problem under the MMNL choice model, where r max and r min are the largest and smallest revenues, respectively. 
Thus, the quality of revenue-ordered assortments depend on the ratio r max =r min ; in particular, if r max =r min = 1, then the revenue-ordered assortments strategy delivers an optimal solution, and the guarantee degrades as the value of the ratio increases. From Proposition 41, we can also obtain guarantees depending on the ratio r max =r min . Define: a(S)= max k2M q(S;v k )= max k2M å i2S q(i;S;v k ) (5.14) as the maximum probability that a customer from any segment purchases an item when assortment S is offered. Proposition 43. IfF =fS : jSj pg for some positive integer p and r max =r min 1+ 1a(S) a(S) for all S2F , then Algorithm 1 delivers a(1 1=e)-optimal solution for the assortment optimization problem under the MMNL choice model. Unlike Proposition 42, we impose a condition on the ratio r max =r min in Proposition 43; how- ever, if such condition is satisfied, then we obtain an approximation guarantee of(1 1=e) 0:63 for the more general assortment optimization problem under a cardinality constraint. Finally, we also point out that Rusmevichientong et al. [199] prove that if customers are value conscious, i.e., v 1 v 2 ::: v n and r 1 v 1 r 2 v 2 ::: r n v n for all realizations of v, then the 168 revenue ordered assortments are optimal for the unconstrained and cardinality constrained cases. It is easy to check that in this case the solutions obtained from the greedy algorithm correspond pre- cisely with the revenue ordered assortments. Thus, Algorithm 1 delivers optimal solutions as well. Remark 8. We comment that our results so far are only applicable to those logit models with linear fractional objectives. When the revenue function involves a ratio of nonlinear functions, e.g., the nested logit models in assortment optimization [5], how to test and exploit the submodularity is still an open problem and needs more work to be done in the future. 5.4.2 p-choice facility location problem Facility location problems deal with deciding where to locate facilities across a finite set of feasible points, taking into account the needs of customers to be served in such a way that a given economic index is optimized [38]. Submodularity often arises in facility location problems. For example, Benati [42] considers the maximum facility location problem with random utilities (MCFLRU); see also [43, 160]. Since in the MCFLRU all ratios a i =b i are identical, submodularity follows directly from Proposition 41. Dam et al. [78] study the maximum capture problem in facility location under random utility models, where the objective function is a sum of the multiplicative inverses of a nonlinear choice probability generating function (CPGF). The authors [78] show that if the CPGF is increasing and submodular, the resulting problem is a submodular maximization problem. We also refer the reader to [13, 92, 155, 181] for additional studies on the applications of submodularity to facility location problems. In this subsection, we consider a particular class of facility location problems with a fractional 0–1 objective function, referred to as the p-choice facility location problem, which is considered in [209]. In the p-choice facility location problem, a decision-maker has to decide where to locate p facilities in n possible locations to service m demand points, in order to maximize the market share. Formally, let d k > 0 be the demand at customer location k2 M =f1;:::;mg, and v ki > 0 be the utility of location i to customers at k. 
Let S N :=f1;:::;ng,jSj= p, be the set of facilities 169 chosen by the decision-maker. It is assumed that the market share provided by facility j2 S with respect to demand point k is given by: d k v k j å i2S v ki : Let w i > 0 be some weight parameter that represents the importance of locating facility in location i2 N. Then the problem of determining the set of facility locations S that maximizes the weighted market share can be formulated as: max jSj=p å i2S w iå k2M d k v ki å j2S v k j ; which can be reorganized as max jSj=p å k2M d k å i2S v ki w i å i2S v ki : (5.15) Clearly, the model in (5.15) can be formulated as a fractional 0–1 program given by (5.1). Note that from Proposition 40, the objective function in (5.15) is, in general, not submodular since it is homogeneous. Nonetheless, exploiting the equality constraint, we can convert the objec- tive function to a non-homogeneous one. Define v k min =d min i2N fv ki g for some fixedd2(0;1). For any feasible solution S, wherejSj= p, we also have that: å i2S v ki = å i2S v k min + å i2S (v ki v k min )= pv k min + å i2S (v ki v k min ): As a result, (5.15) can be equivalently stated as: max jSj=p å k2M d k å i2S v ki w i pv k min +å i2S (v ki v k min ) ; (5.16) where v ki v k min > 0 and v k min > 0 for all i2 N and k2 M by our construction procedure. 170 Recall our discussion on the links between monotonicity and submodularity in Section 5.3.2. Applying inequality (5.9), we find that a given ratio k in the objective function of (5.16) is mono- tone nondecreasing over setF :=fS N :jSj pg if min i2N v ki w i v ki v k min max jSjp å i2S v ki w i pv k min +å i2S (v ki v k min ) : (5.17) Hence, if (5.17) holds for all ratios k2 M, then the feasibility set in (5.16) can be relaxed to jSj p. Consequently, assumption A3 is satisfied and (5.16) reduces to the maximization problem of a submodular function by Proposition 38. The right-hand side of (5.17) can be interpreted as the best average revenue weighted by market share, or simply the best total revenue that can be obtained from customer segment k. The intuition for (5.17) to hold in the p-choice facility location problem is rather similar to our observations in the assortment optimization problem. Indeed, it is easy to verify, for example, that if all locations have the same utilities and weights, i.e., v ki = v k and w i = w for all i2 N and some v k and w, then (5.17) holds. Moreover, from (5.17) we obtain the following sufficient condition. Proposition 44. Let w max and w min be the maximum and minimum weights, and let v k max be the maximum utility associated with customer segment k. If w min w max + 1 v k max v k min ; (5.18) then the revenue of customer segment k is submodular. Proof. Observe that since w i w min and v ki v ki v k min v k max v k max v k min , we find that v ki w i v ki v k min v k max v k max v k min w min : Moreover, we also find that max jSjp å i2S v ki w i pv k min +å i2S (v ki v k min ) max jSjp w maxå i2S v ki pv k min w max v k max v k min : 171 After rearranging terms corresponding to the sufficient condition v k max v k max v k min w min w max v k max v k min ; we obtain precisely (5.18). Simply speaking, if the considered facility locations are sufficiently similar with respect to their utilities, i.e., v k max v k min 1, then ratios in (5.16) are submodular. Submodularity may be preserved for larger spread of utilities, provided that the weights are sufficiently close. 
If all the considered facility locations are sufficiently similar with respect to their utilities and weights, then (5.15) can be reduced to maximizing a submodular function; consequently, high-quality solutions can be obtained by a greedy approach, e.g., Algorithm 1. 5.4.3 On minimization problems In this note we focus on identifying submodularity in maximization problems, in which case greedy algorithms can be used to obtain near optimal solutions. However, submodularity can be exploited in minimization problems as well. Indeed, the epigraph of a submodular set function is described by its Lov´ asz extension [163], which can be used to improve mixed-integer programming formu- lations via cutting planes. Moreover, even if a given ratio is not submodular, the results presented in this note can be used to decompose any ratio into two components such that one of which is submodular (and strengthening can be done using the submodular component); see, e.g., [27]. 5.5 Conclusion In this note we explore submodularity of the objective function for a broad class of fractional 0–1 programs with multiple-ratios. Under some mild assumptions, we derive the necessary and sufficient condition for a single ratio of two linear functions to be submodular. Therefore, if the derived condition holds for every considered single-ratio function, then simple greedy algorithms 172 can be used to deliver good quality solutions for multiple-ratio fractional 0–1 programs. Finally, we also illustrate applicability of our results in the context of the assortment optimization and facility location problems. 173 6 Towards a general theory for convexification of MINLPs with indicators It is usually instrumental to construct a strong convex relaxation for a mixed-integer nonlinear program. In this chapter, we present several techniques which are useful in establishing convex hull results and illustrate theses results with examples. Some of these examples appear in the literature, and others are new. For most of those existing examples, we give a different proof which is simpler than the original one. 6.1 General techniques for convexifying mixed-integer sets In this section, we summarize the general techniques for convexifying mixed-integer sets. 6.1.1 Disjunctive programming and extended formulation Proposition 6 provides a straightforward way to compute the convex hull of a disjunctive setS . Similarly, one can study the convex envelope of a “disjunctive” function. Consider a a finite family of closed convex functionsf f i (x)g i2[m] onR n , one can define[ i2[m] f i def = inf t :(t;x)2[ i2[m] epi f i ; i.e. epi f =[ i2[m] epi f i . For example, assume f(x;z)= g(x)+d x;z; (x;z) : x(1 z)= 0; z2 f0;1g n , where g :R n !R is a closed convex function. Then f(x;z) is a disjunction of g(x)+ d x;z; (x;z) : x a = 0; z a = 0; z [n]na = 1 ,a2 2 [n] . Applying Proposition 6, one can deduce that 174 clconv f = ( å i2[m] l i f i (x i ) : å i2[m] x i = x; l2D m ) provided that f i ’s have the same recession func- tion. For convenience, we introduce the concept of infimal convolution O i2[m] f i (x), which is defined as O i2[m] f i (x) def = inf ( å i2[m] f i (x i ) : å i2[m] x i = x ) : If m= 2, we rewrite Oi2[2] f i as f 1O f 2 . Since epi Oi2[m] f i =å i2[m] epi f i , Oi2[m] f i is a convex function. We have the following characterization of the infimal convolutions. Proposition 45 (Theorem 16.4 [193]). Consider a a finite family of closed convex functionsf f i (x)g i2[m] onR n . Then O i2[m] f i = å i2[m] f i . 
Conversely, if in addition \ i2[m] ridom f i 6= / 0, then å i2[m] f i = Oi2[m] f i . Proposition 45 is useful to compute the closed form of infimal convolutions. For instance, as- sume f i (x)=l i x > Q i x, where Q i 0, i2[2]. Then( f 1O f 2 )(x)=( f 1O f 2 ) (x)=( f 1 + f 2 ) (x)= 1 l 1 x > (Q 1 ) 1 x+ 1 l 2 x > (Q 2 ) 1 x = x > 1 l 1 (Q 1 ) 1 + 1 l 2 (Q 2 ) 1 1 x. Proposition 6 shows that the operator clconv obeys the distribution law with respect to the set union operator. By contrast, in general clconv(S 1 \S 2 )6= clconvS 1 \ clconvS 2 : However, we have the following particular results. Proposition 46. Consider f :R n !R[f+¥g andSR n . The following convex hull results hold. 1. Graph of f : clconvf(t;x) : t= f(x)g= clconvf(t;x) : t f(x)g\ clconvf(t;x) : t f(x)g: 2. Face ofS : conv(S\fx :ha;xi= bg)=(convS)\fx :ha;xi= bg provided thatSfx : ha;xi bg. 3. Level set of f : convfx : f(x)= 0; x2Xg= convfx : f(x) 0; x2Xg\ convfx : f(x) 0; x2Xg provided that f is a continuous finite function andX is a convex set. Proof. We denote the set on the RHS of each part as ¯ S . It can be seen that ¯ S always contains the convex hull on the LHS and it remains to prove the reverse direction. 175 1. Consider any c2R;a2R n . If c 0, then d (c;a; ¯ S)d (c;a;clconvf(t;x) : t f(x)g)=d (c;a;f(t;x) : t f(x)g) =max x c f(x)+ a > x=d (c;a;f(t;x) : t= f(x)g): Similarly, one can argue thatd (c;a; ¯ S)d (c;a;f(t;x) : t= f(x)g) as c< 0. The conclu- sion follows from Proposition 47. 2. The faciality as a sufficient condition is due to Lemma 3.2 [35]. Consider any ¯ x2 ¯ S convS. there existsfx i g i2[m] S and l2D m such that ¯ x=å i2[m] l i x i . Moreover, since b=ha; ¯ xi=å i2[m] l i a;x i å i2[m] l i b= b, one can deduce that a;x i = b which implies x i 2S\fx :ha;xi= bg. The conclusion follows. 3. This conclusion is due to [210]. Consider any ¯ x2 ¯ S . If f( ¯ x)= 0, then ¯ x2fx : f(x)= 0; x2 Xg and we are done. Otherwise, WLOG we assume f( ¯ x)> 0. Since ¯ x2 convfx : f(x) 0; x2Xg, there existsfx k g k2[K] X such that f(x k ) 0 and ¯ x2 conv x 1 ;;x K . By the mean-value theorem, for each k2[K] there exists z k 2 ¯ x;x k such that f(z k )= 0. Thus, ¯ x2 conv z 1 ;;z K . Since z k 2fx : f(x)= 0; x2Xg8k2[K], the conclusion follows. 6.1.2 Outer construction and valid inequalities Sometimes it is more convenient to establish the convex hull results from the dual perspective. Recall that eachha;xid (a;S) defines a valid inequality forS . It turns out that the strongest valid inequality in this form is adequate for describing clconvS as shown in the next proposition, which follows directly from applying Proposition 3 to indicator functions. Proposition 47. Consider a setS . Then clconvS =fx : sup a ha;xid (a;S)g= domd (;S). Consequently, for a closed convex setT ,T = clconv(S) if and only ifd (;T)=d (;S). 176 Next, we give two examples to illustrate Proposition 47 - the first one is a covering-type bilinear knapsack set, and the second one is a product of a polytope and a simplex. Proposition 48 (Proposition 9 [211]). ConsiderS = ( (x;y)2R 2n : x 0;y 0; å i2[n] x i y i 1 ) . Then its convex hull is given by clconvS = ( (x;y)2R 2n : x 0;y 0; å i2[n] p x i y i 1 ) . Proof. 
Because d (a;b;S)= max ( a > x+ b > y : å i2[n] x i y i 1; x 0; y 0 ) =max ( a > x+ b > y : å i2[n] z 2 i 1; x 0; y 0; z 2 i = x i y i 8i2[n] ) +d(a;b;R 2n ) =max ( å i2[n] a i x i + b i z i x i : å i2[n] z 2 i 1;z 0 ) +d(a;b;R 2n ) =maxf2 å i2[n] p a i b i z i : å i2[n] z 2 i 1; z 0g+d(a;b;R 2n )=2min i2[n] p a i b i +d(a;b;R 2n ); one can deduce that d (x;y;S)= max x > a+ y > b+ 2min i2[n] p a i b i : a 0; b 0 = min ( å i2[n] a i x i + t 2 i a i y i 2min i2[n] t i : a 0; t 0 ) +d(x;y;R 2n + ) = min ( 2 å i2[n] t i p x i y i 2min i2[n] t i : t 0 ) +d(x;y;R 2n + ) (homogeneous) =d x;y; ( (x;y) : x 0; y 0; å i2[n] t i p x i y i min i2[n] t i 8t 0 )! =d x;y; ( (x;y) : x 0; y 0; å i2[n] t i p x i y i min i2[n] t i 8t 0 s.t. min i2[n] t i = 1 )! =d x;y; ( (x;y)2R 2n : x 0;y 0; å i2[n] p x i y i 1 )! : The conclusion follows from Proposition 47. 177 Proposition 49 ([80]). ConsiderS = n (x;y;z)2PD m R K : y > A k x= z k ;8k2[K] o , where P is a polytope,D m is the m-dimensional simplex and A k 2R mn . Then its convex hull is given by clconvS = clconv [ y2f0;1g n n (x;z)2PR K : y > A k x= z k ;8k2[K] o fyg which is a polytope. Proof. Denote the disjunctive set on the RHS of the last identity as ¯ S . It can be seen that ¯ SS , which implies clconv ¯ S clconvS . Thus, d (;S)d (; ¯ S). It suffices to prove d (;S) d (; ¯ S) due to Proposition 47. Consider d (a;b;c;S)= max (x;y;z)2S ha;xi+hb;yi+ hc;zi= max (x;y)2PD m ha;xi+hb;yi+ å k2[K] c k y > A k x= max (x;y)2P ha;xi+ max y2D m * b+ å k2[K] c k A k x;y + . It can be seen that the optimal y to the inner linear maximization problem can be taken as a binary vector. Denote the optimal solution as (x ;y ;z ). Since y 2f0;1g n , (x ;y ;z )2 ¯ S . Thus, d (a;b;c; ¯ S)ha;x i+hb;y i+hc;z i d (a;b;c;S) = max (x;y;z)2S ha;xi+hb;yi+hc;zi. The conclusion follows. 6.1.3 Convexification by part In this part, we present a useful recipe for deriving the convex envelope of a certain function g(x;z) :R n R m !R[f+¥g. We call such a recipe convexification by part. Proposition 50 ([191]). Assume g(x;z) has an affine underestimator and domg6= / 0. Then clconvg(x;z)= sup a2R n ha;xi+ clconv z g(a;z)= sup a2A ha;xi+ clconv z g(z;a); where clconv z g(a;z) def = clconv((g(;z)) (a)) andA def =fa2R n : clconv z g(a;z)<+¥8zg. Proposition 50 suggests three steps to derive clconvg: (i) For fixed a and z, compute the neg- ative partial conjugate inf x g(x;z)ha;xi; (ii) For fixed a, compute the convex envelope of the resulting function with respect to z, i.e. clconv z g(a;z); (iii) For fixed z, take the convex conjugate of clconv z g(;z) to return to the original space. Next we illustrate the above three-step scheme with an example. Before that, we present a generic result of transforming a convex quadratic program into a SDP problem. 178 Proposition 51. Assume W 0 andfx : Hx hg is nonempty. Then the convex quadratic program (QP) maxf 1 2 x > Wx+ c > x : Hx hg is equivalent to the semidefinite program (SDP) min 8 > < > : t : 2 6 4 2(t h > y) c > y > H c H > y W 3 7 5 0;y 0 9 > = > ; : Proof. 
Consider the eigendecomposition of W = U > DU with D= Diag(d 1 ;:::;d m ;0;:::;0) and define ˜ H = HU > , ˜ c= Hc and z= Ux, then the QP is equivalent to max 1 2 z > Dz+ ˜ c > z : ˜ Hz h = min y0 max z 1 2 z > Dz+ ˜ c > z+ y > (h ˜ Hz) =min y0 max z 1 2 m å i=1 d i z 2 i +( ˜ c ˜ H > y) > z+ y > h =min ( ˜ H > y+ ˜ c) 2 i 2d i + y > h : y 0; ( ˜ c ˜ H > y) i = 08i = 2[m] =min 8 > < > : t : 2 6 4 2(t h > y) ( ˜ c ˜ H > y) > ˜ c ˜ H > y D 3 7 5 0;y 0 9 > = > ; (PSD property)( ˜ c ˜ H > y) i = 08i = 2[m]) =min 8 > < > : t : 2 6 4 1 0 0 U > 3 7 5 2 6 4 2(t h > y) ( ˜ c ˜ H > y) > ˜ c ˜ H > y D 3 7 5 2 6 4 1 0 0 U 3 7 5 0; y 0 9 > = > ; ; which is equivalent to the SDP in the proposition. Proposition 52 ([225]). Consider g(x;z)= 1 2 x > Qx+d(x;z;I n ) where Q= 2 6 4 1 a a 1 3 7 5 0;I n def = f(x;z)2R n f0;1g n : x i (1 z i )= 08i2[n]g: Then clconvg(x;z)= min 8 > > > > < > > > > : t : 2 6 6 6 6 4 2t x 1 x 2 x 1 z 1 + a 2 1a 2 l a 1a 2 l x 2 a 1a 2 l z 2 + a 2 1a 2 l 3 7 7 7 7 5 0;(l;z)2P 9 > > > > = > > > > ; ; whereP = (l;z) : maxf0;z 1 + z 2 1gl minfz 1 ;z 2 g; z2[0;1] 2 179 Proof. Since min (x;z)2I n 1 2 x > Qxha;xi= 1 2 h z 1 z 2 a > Q 1 a+ z 1 (1 z 2 )a 2 1 +(1 z 1 )z 2 a 2 2 i 8z2f0;1g 2 ; one can deduce that inf x g(x;z)ha;xi= a 2(1a 2 ) (aa 2 1 +aa 2 2 2a 1 a 2 )z 1 z 2 1 2 a 2 1 z 1 1 2 a 2 2 z 2 , which is a two-dimensional binary bilinear function. Thus, clconv z g(a;z) is described by the McCormick polytope, i.e. clconv z g(a;z)= min (l;z)2P z a 2(1a 2 ) (aa 2 1 +aa 2 2 2a 1 a 2 )l 1 2 a 2 1 z 1 1 2 a 2 2 z 2 , whereP z =fl : maxf0;z 1 + z 2 1gl minfz 1 ;z 2 g;0 z 1g. It follows that clconv z g(a;z)= sup a ha;xi+ clconv z g(a;z) =sup a min l2P z 1 2 a > 2 6 4 z 1 + a 2 1a 2 l a 1a 2 l a 1a 2 l z 2 + a 2 1a 2 l 3 7 5 a+hx;ai = min l2P z sup a 1 2 a > 2 6 4 z 1 + a 2 1a 2 l a 1a 2 l a 1a 2 l z 2 + a 2 1a 2 l 3 7 5 a+hx;ai; where the last identity is due to Proposition 4. The conclusion follows by applying Proposition 51 to the inner unconstrained quadratic program. 6.2 Convexification of structured sets In this part, we present specialized convex hull results for three classes of structured mixed-integer sets. 6.2.1 Permutation-invariant sets A setSf(x;z) :R n R m g is called permutation-invariant with respect to x if for any(x 1 ;;x n ;y)2 S ,(x p(1) ;;x p(n) ;y)2S wherep :[n]![n] is a bijection. 180 Proposition 53 ([146]). Assume A setSf(x;y) :R n R m g is permutation-invariant with respect to x. Then conv(S)= 8 > > > > < > > > > : (x;y) : (u;y)2 convS 0 ; u 1 u n ; å i2[n] u i = å i2[n] x i k å i=1 u i kr k + n å i=1 t k i ; x j t k j + r k ; t k j 0 8k2[n 1]; j2[n] 9 > > > > = > > > > ; ; whereS 0 is any set satisfyingS\f(u;z) : u 1 u n gS 0 convS\f(u;z) : u 1 ::: u n g. The following proposition shows that it is impossible to establish the similar results for sets which are permutation-invariant with respect to coupled variables. For any vector a2R n , any matrix A2R nn , and any indexa;b[n], we denote the subvector of a as a a =(a i ) i2a and the submatrix of A ab as(A i j ) i2a; j2b . What is more, we denote A 1 aa def =(A aa ) 1 . Proposition 54. Assume Q= I+ 1 r 11 > 2S n + , where I is the identity matrix, 1 represents the vector of all 1’s, and r > 0. Problem min 1 2 x > Qxha;xi+hc;zi isNP-hard with rational data a and c. Proof. The conclusion is proved by reducing the following well-known subset sum problem to the stated mixed-integer QP: Given a multiset of integersfa i g n i=1 , decide if there exists a subseta[n] s.t. 
å i2a a i = 0: Let c i = a 2 i 8i2[n], then the MIQP is equivalent to min a[n];y a 1 2 x > a Q aa x a a > a x a + å i2a c i = min a[n] 1 2 a > a Q 1 aa a a + å i2a c i = min a[n] 1 2 a > a I aa 1 a 1 > a r+jaj ! a a + å i2a c i (Sherman–Morrison formula) = min a[n] (å i2a a i ) 2 r+jaj 0; (c i = a 2 i ) where the lower bound 0 is attained if and only ifå i2a a i = 0. The conclusion follows. 181 6.2.2 Chain-function with indicator variables A function f :R n !R[f+¥g is called a chain-function if it can be written as f(x)= å i2[n1] f i (x i ;x i+1 ) where each f i :R 2 !R[f+¥g is a closed convex function with 02 dom f i . For example, f(x)= x > Qx+d(x;[`;u]), where Q2S n + is a tridiagonal matrix and`;u2R n . Notice that the quadratic form can be written as x > Qx= å i2[n1] ¯ a i x 2 i + 2Q i;i+1 x i x i+1 + b i x 2 i+1 for some a;b2R n + s.t. a i b i Q 2 i;i+1 8i2[n 1]. It can be seen in this case, f is a chain-function by defining f i (x i ;x i+1 )= a i x 2 i + 2Q i;i+1 x i x i+1 + b i x 2 i+1 +d(x i ;x i+1 ;fx :` i x i u i ;` i+1 x i+1 u i+1 g). In this part, we considerS =f(s;x;z) : s f(x); x i (1 z i )= 0; z i 2f0;1g8i2[n]g. To study clconvS , we define a graph G(V;E) with V =f0g[[n] and E =f(i; j) : 0 i< j n+ 1g. Note that G is a simple acyclic directed graph. Denote A G as the incidence matrix of G. Consider two vertices v 0 = 0 and v n+1 = n+ 1. Define the v 0 v n+1 path polytopeP def = y2R E : A G y= b; y 0 , where b v 0 = 1;b v n+1 =1 and b v = 0 for all v2 Vnfv 0 ;v n+1 g. Con- sider an mapping T :P![0;1] V , for which y7! z= T(y) is defined by z v 0 = z v n+1 = 0; z v = 1å e2d + (v) y e 8v2 Vnfv 0 ;v n+1 g, where d + (v) represents the set of edges that enters vertex v. Since T is an affine mapping, z= T(y) can be expressed as z= A T y+ b T for a certain matrix A T and vector b T . For each (i; j)2 E, define f i j :R n !R[f+¥g by f i j (x)= å ik< j f k (x k ;x k+1 )+ d(x;fx : x k = 08k s.t. 0 k i or j k n+ 1g), where f 0 def = 0; f n def = 0. Proposition 55. Assume f is a chain-function and f p i (;0)=d(;f0g). Then convS = (s;x;z) : s O 0i< jn+1 f p i j (;y i j ) (x); z= A T y+ b T ;A G y= b;y 0 : Proof. Denote ¯ S as the set of(s;x;y;z) on the right hand side. First observe that the restriction of mapping T overP\N E is a bijection tof0;1g V , by which it is easy to check thatS = proj s;x;z ¯ S\ 182 f(s;x;y;z) : y2N E g. For e=(i; j)2 E, defineS e := epi f p i j (x;y e ). Because epi Oe2E f p e (;y e )= å e2E epi f p e (;y e ), ¯ S can be rewritten as ¯ S =f(s;x;z) : s= å e2E s e ; x= å e2E x e ;(s e ;x e ;y e )2S e 8e2 E;z= A T y+ b T ; y2Pg: Consequently, for an arbitrary linear program overS - min (s;x;y;z)2S a s s+ha z ;zi+ha x ;xi, min (s;x;z)2 ¯ S a s s+ha z ;zi+ha x ;xi (6.1) is a valid convex relaxation. It suffices to show if the optimal value of (6.1) is finite, then the optimal solution to (6.1) is integral. The case where a s < 0 is trivial and hence, we assume a s 0. Substituting out the variable x;z and s, the objective function reduces to min a s ( å e2E s e )+ha z ;T(y)i+ n å t=1 a x t ( å it< j x i j t ); which can be rewritten in the form by ignoring the constant term min å e2E (c e s e + c y e y e +hc x e ;x e i) for certain c e 0;c x e ;c y e . Therefore, (6.1) can be formulated in the form min y;s e ;y e ;x e å e2E (c e s e + c y e y e +hc x e ;x e i) (6.2a) s.t. y2 P; (s e ;y e ;x e )2S e : (6.2b) When y2 P is fixed, (6.2) is separable. 
If we put d e def = min (s e ;1;x e )2S e c e s e + c e +hc x e ;x e i; then the optimal value of each subproblem is min (s e ;y e ;x e )2S e c e s e + c e y e +hc x e ;x e i= y e d e : It implies that the relaxation (6.2) reduces to min y2P å e2E d e y e ; which is equivalent to solve a shortest path problem with weights d. It implies that the optimal y is integral. As a result, z= T(y) is also integral by the observation made at the beginning of the proof. In the special case where f is a triangular quadratic form, the closed form of the infimal con- volution in the proposition is available; see [159]. 183 6.2.3 Triangular function with hierarchical indicator variables A function f :R n !R[f+¥g is called a triangular function if it can be written as f(x) = å n k=1 f k å k i=1 a ki x i , where each f k :R k !R is a closed convex function and a kk = 18k2[n]. For example, each quadratic form f(x)= x > Qx defined by a positive definite matrix Q 0 is a triangular function because it can be written as f(x)=å n k=1 å k i=1 a ki x i 2 for certain coefficients a ki using Cholesky decomposition. In this part, we studyS =f(t;x;z) : t f(x);z 1 z n ;x i (1 z i )= 0;z i 2f0;1g8i2[n]g, where f is a triangular function. Proposition 56. Assume f is a triangular function and f p k (;0)=d(;f0g)8k2[n]. Then clconvS = ( (t;x;z) : t n å k=1 f p k k å i=1 a ki x i ;z k ! ;0 z 1 z n 1 ) : Proof. Denote the convex set on the RHS as ¯ S . Because for each k2[n], f p k (;0)=d(;f0g), it can be seen thatS = ¯ S\f(t;x;z) : z 1 z n ; z2f0;1g n g: Thus, one can deduce that ¯ S is a valid convex relaxation of clconvS . We prove the conclusion by induction over dimension n. In the base case where n= 1, f(x) is a univariate finite convex function, and it is known that the convex hull is described by its perspective function. Assume the conclusion holds for dimension n1. It suffices to show that for an arbitrary linear program over ¯ S , the optimal z is integral. Note that min (t;x;z)2 ¯ S t+ha;xi+hc;zi= min x;z:0z 1 z n 1 n å k=1 f p k k å i=1 a ki x i ;z k ! +ha;xi+hc;zi: Because the objective function is positively homogeneous, one can deduce that at the optimal solution either z 1 = 0 or z n = 1. In the former case, we have x 1 = 0 and the dimension of the problem is reduced by one. Thus, the conclusion follows. In the latter case, the linear program is equivalent to min x;z:0z 1 z n1 1 n å k=1 f p k k å i=1 a ki x i ;z k ! +ha;xi+hc;zi+ min x n f n n å i=1 a ni x n ! = min x;z:0z 1 z n1 1 n å k=1 f p k k å i=1 a ki x i ;z k ! +ha;xi+hc;zi+C; 184 where C= min x n f n n å i=1 a ni x n ! = min s f n (s) is a constant. Thus, the dimension of the problem is also reduced by one. The conclusion follows by the induction hypothesis. Remark 9. It can be shown that the coercivity condition f p k (;0)= d(;f0g) is not required in Proposition 56. 6.2.4 Convexification and decomposition for conic quadratic programs with indicator variables In this subsection, we consider two mixed-integer quadratic conic sets S(Q) def = n (t;x;z) : t x > Qx; x2P; x i (1 z i )= 0; z i 2f0;1g8i2[n] o S 1 2 (Q) def = n (t;x;z) : t p x > Qx; x2P; x i (1 z i )= 0; z i 2f0;1g8i2[n] o ; where Q2S n + andP is a closed convex set. In general, it seems impossible to fully characterize clconvS and clconvS 1 2 due to theNP-hardness of solving the linear optimization problem over them. However, for some special classes of matrix Q, it is possible to describe the convex hulls or construct strong convex relaxations of the mixed integer sets, which include 1. 
diagonal matrices $\mathcal{Q}_D = \{Q \in \mathbb{S}^n_+ : Q_{ij} = 0 \ \forall i \neq j\}$;
2. low-dimensional matrices $\mathcal{Q}_\alpha = \{Q \in \mathbb{S}^n_+ : Q_{ij} = 0 \ \forall i \notin \alpha \text{ or } j \notin \alpha\}$, where $\alpha \subseteq [n]$;
3. tridiagonal matrices $\mathcal{Q}_T = \{Q \in \mathbb{S}^n_+ : Q_{ij} = 0 \ \forall (i,j) \text{ with } |i-j| > 1\}$;
4. Stieltjes matrices $\mathcal{Q}_S = \{Q \in \mathbb{S}^n_+ : Q_{ij} \le 0 \ \forall i \neq j\}$;
5. $H_0$ matrices $\mathcal{Q}_H = \{Q \in \mathbb{S}^n_+ : \bar{Q} \in \mathcal{Q}_S\}$, where $\bar{Q}$ is the comparison matrix of $Q$, defined by $\bar{Q}_{ii} = Q_{ii}$ for all $i \in [n]$ and $\bar{Q}_{ij} = -|Q_{ij}|$ for all $i \neq j$.

Note that each of $\mathcal{Q}_D$, $\mathcal{Q}_\alpha$, $\mathcal{Q}_T$ and $\mathcal{Q}_S$ is defined by the matrix inequality $Q \succeq 0$ together with finitely many linear inequalities/equalities, and is therefore a closed convex cone. Pang and Han [183] show that the set of $H_0$-matrices is also a closed convex cone.

Assume that the convex hull or a strong convex relaxation of $\mathcal{S}(Q)$ or $\mathcal{S}^{\frac{1}{2}}(Q)$ is available for every matrix $Q \in \bigcup_{i \in [m]} \mathcal{Q}_i$, where each $\mathcal{Q}_i$ may be one of the five classes above or any other closed convex set of matrices for which the mixed-integer sets are well understood. A natural question is how to use these specialized results to construct a strong convex relaxation of $\mathcal{S}(Q)$ or $\mathcal{S}^{\frac{1}{2}}(Q)$ for a general $Q \succeq 0$. This question can be answered by the following decomposition strategy. Specifically, assume $\bar{\mathcal{S}}_i(Q)$ (respectively, $\bar{\mathcal{S}}^{\frac{1}{2}}_i(Q)$) is a convex relaxation of $\mathcal{S}(Q)$ (respectively, $\mathcal{S}^{\frac{1}{2}}(Q)$) for $Q \in \mathcal{Q}_i$, and assume $Q$ can be decomposed as $Q = \sum_{i \in [m]} Q_i + R$, where $Q_i \in \mathcal{Q}_i$ and $R \succeq 0$. Then one can rewrite $\mathcal{S}(Q)$ and $\mathcal{S}^{\frac{1}{2}}(Q)$ as
$$\mathcal{S}(Q) = \Big\{(t,x,z) : t \ge \sum_{i \in [m]} t_i + x^\top R x,\; t_i \ge x^\top Q_i x \ \forall i \in [m],\; x \in P,\; x \circ (1-z) = 0,\; z \in \{0,1\}^n\Big\}$$
$$\mathcal{S}^{\frac{1}{2}}(Q) = \Big\{(t,x,z) : t \ge \sqrt{\textstyle\sum_{i \in [m]} t_i^2 + x^\top R x},\; t_i \ge \sqrt{x^\top Q_i x} \ \forall i \in [m],\; x \in P,\; x \circ (1-z) = 0,\; z \in \{0,1\}^n\Big\},$$
where $a \circ b = (a_i b_i)_i$ is the Hadamard product of two vectors $a$ and $b$. Consequently, $\bar{\mathcal{S}}(Q)$ and $\bar{\mathcal{S}}^{\frac{1}{2}}(Q)$ defined below are convex relaxations of $\mathcal{S}(Q)$ and $\mathcal{S}^{\frac{1}{2}}(Q)$, respectively:
$$\bar{\mathcal{S}}(Q) \stackrel{\mathrm{def}}{=} \bigcap_{(Q_1,\dots,Q_m,R) \in \mathcal{D}} \Big\{(t,x,z) : t \ge \sum_{i \in [m]} t_i + x^\top R x,\; (t_i,x,z) \in \bar{\mathcal{S}}_i(Q_i) \ \forall i \in [m]\Big\}$$
$$\bar{\mathcal{S}}^{\frac{1}{2}}(Q) \stackrel{\mathrm{def}}{=} \bigcap_{(Q_1,\dots,Q_m,R) \in \mathcal{D}} \Big\{(t,x,z) : t \ge \sqrt{\textstyle\sum_{i \in [m]} t_i^2 + x^\top R x},\; (t_i,x,z) \in \bar{\mathcal{S}}^{\frac{1}{2}}_i(Q_i) \ \forall i \in [m]\Big\},$$
where $\mathcal{D} \stackrel{\mathrm{def}}{=} \big\{(Q_1,\dots,Q_m,R) : Q = \sum_{i \in [m]} Q_i + R,\; Q_i \in \mathcal{Q}_i \ \forall i \in [m],\; R \succeq 0\big\}$ is a compact convex set. In some cases a closed form of $\bar{\mathcal{S}}(Q)$ or $\bar{\mathcal{S}}^{\frac{1}{2}}(Q)$ is available; see Chapter 2 for details.

6.3 Convexification and optimization

Recall that, by Proposition 1, one can transform a mixed-integer program into a convex program provided that the convex hull of the mixed-integer set $\mathcal{S}$ is available. Under mild conditions (e.g., when $\operatorname{clconv}\mathcal{S}$ is a convex body, that is, a closed convex set with a nonempty interior), a linear program over $\operatorname{clconv}\mathcal{S}$ can be solved by means of the ellipsoid method in time polynomial in the encoding size of $\operatorname{clconv}\mathcal{S}$ (Theorem 5.2.1 [41]). This indicates that if the original mixed-integer program (1.1) is NP-hard, one cannot expect to describe the convex hull of the underlying mixed-integer set in a desirable way (e.g., as a compact extended formulation or via separating inequalities) unless P = NP. However, a number of difficult problems can be solved efficiently in restricted scenarios, i.e., one can compute $\delta^*(a;\mathcal{S})$ for $a \in \mathcal{A}$. In such cases, one can instead construct a valid convex relaxation of the optimization problem.

Proposition 57. Consider a set $\mathcal{S}$. Then the following statements hold.
1. The set $\mathcal{S}_{\mathcal{A}} \stackrel{\mathrm{def}}{=} \big\{x : \sup_{a \in \mathcal{A}} \langle a, x\rangle - \delta^*(a;\mathcal{S}) \le 0\big\}$ is a closed convex set containing $\operatorname{clconv}\mathcal{S}$.
2.
Assume (i) the explicit form of the restriction of d (a;S) over a2A is given by f(a), where f is a closed convex function; (ii)A =fa : Aa b2Cg, whereC is a pointed closed convex cone; (iii) there exists a2 ridom f such that Aa b is an interior ofC . Then S A = n x : f (x+ A > y) b > y; y2C o . 3. For any a2A , the relaxation is exact in the sense that sup x2S A ha;xi= sup x2S ha;xi. Proof. Part (1) directly follows from Proposition 47. Part (2) follows from the duality in conic programming. Specifically, sup a2A ha;xi f(a) = ( f+d(;A)) (x) = ( f O d (A))(x); where the last identity is due to assumption (iii) and Proposition 45. Note that d (x;A)= sup n x > a : Aa b2C o = inf n b > y :A > y= x; y2C o ; 187 which is due to the strong duality theorem for conic programs (Proposition 5). Thus, sup a2A ha;xi f(a)=( f O d (A))(x)= inf f x 1 +d x 2 A : x 1 + x 2 = x =inff f x 1 b T y : x 1 + x 2 = x; x 2 =A > y; y2C g= inf n f x+ A > y b > y : y2C o : To prove Part (3), we take any a2A . By the definition ofS A , x2S A implies thatha;xi d (a;S). Thus, sup x2S A ha;xid (a;S)= sup x2S ha;xi. On the other hand, sinceS A is a valid convex relaxation ofS , one can deduce that sup x2S A ha;xi sup x2S ha;xi. The conclusion follows. Next, we present an example to illustrate Proposition 57. Assume Q is a nonsingular Stieltjes matrix, i.e. Q 0 and Q i j 08i6= j and consider S = (t;x;z) : t 1 2 x > Qx; x(1 z)= 0; x 0; z2f0;1g n : Lemma 8 (Inverse of partitioned matrices; Page 18 [137]). Assume P 0 is partitioned into four blocks P= 2 6 4 A B > B C 3 7 5 , then P 1 = 2 6 4 A 1 0 0 0 3 7 5 + 2 6 4 A 1 B > I 3 7 5 (P=A) 1 BA 1 I ; where P=A := C BA 1 B > 0 is the Schur complement of P with respect to A. Proof. Note that 2 6 4 I 0 BA 1 I 3 7 5 2 6 4 A B > B C 3 7 5 2 6 4 I A 1 B > 0 I 3 7 5 = 2 6 4 A 0 0 P=A 3 7 5 ; 188 which implies P 1 = 2 6 4 I 0 BA 1 I 3 7 5 2 6 4 A 1 0 0 (P=A) 1 3 7 5 2 6 4 I A 1 B > 0 I 3 7 5 : The conclusion follows. Proposition 58. Assume Q is a nonsingular Stieltjes matrix and consider any vector a. If there exists a partition of[n]=a[b such that a a 0 and a b Q ba Q 1 aa a a 0, then the MIQP min x;z 1 2 x > Qx a > x+ c > z s.t. x(1 z)= 0; x 0; z2f0;1g n is equivalent to min x a ;z 1 2 x > a Q aa x a a > a x a + c > z s.t. x a (1 z a )= 0; z2f0;1g n ; which can be solved in polynomial time as a submodular minimization problem. Remark 10. Note that if all a i 6= 0, then a =fi : a i > 0g and b =fi : a i < 0g. In this case, the condition in Proposition 58 can be verified easily. Moreover, this condition generalize the sufficient condition in [16] which requires that all entries of a have the same sign. Proof. We first prove the equivalence of the two MIQPs. To achieve this, it suffices to prove that for any fixed z2f0;1g n , the optimal solution x to the resulting continuous quadratic program satisfies x a 0 and x b = 0. Consider an arbitrary z= 1 a 0 1 b 0 0 , where a 0 a and b 0 b. The optimality 189 condition of the resulting quadratic program is given by the following linear complementarity problem (LCP) 2 6 4 Q a 0 a 0 Q a 0 b 0 Q b 0 a 0 Q b 0 b 0 3 7 5 0 B @ x a 0 x b 0 1 C A 0 B @ a a 0 a b 0 1 C A 0 B @ l a 0 l b 0 1 C A = 0; 0 0 B @ x a 0 x b 0 1 C A ? 0 B @ l a 0 l b 0 1 C A 0; where u? v represents u v= 0. 
It is sufficient to prove the LCP has a solution withl a 0 = 0 and x b 0 = 0, which is equivalent to the consistency of the following inequality system by substituting outl a 0 and x b 0 x a 0 = Q 1 a 0 a 0 a a 0 0; l i = Q ia 0Q 1 a 0 a 0 a a 0 a i 0 8i2b 0 ; The first nonnegativity inequality is trivial due to the property of Stieltjes matrices Q 1 a 0 a 0 0 and a a 0 0. Thus, it remains to prove the second one. Indeed, we denoteg=ana 0 . For any i2b 0 b, it follows from the assumption that 0 Q ia Q 1 aa a a a i = Q ia 0 Q ig 8 > < > : 2 6 4 Q 1 a 0 a 0 0 0 0 3 7 5 + 2 6 4 Q 1 a 0 a 0 Q a 0 g I 3 7 5 (Q aa =Q a 0 a 0) 1 Q ga 0Q 1 a 0 a 0 I 9 > = > ; 0 B @ a a 0 a g 1 C A a i (By Lemma 8) =Q ia 0Q 1 a 0 a 0 a a 0 a i | {z } l i + Q ia 0 Q ig 2 6 4 Q 1 a 0 a 0 Q a 0 g I 3 7 5 (Q aa =Q a 0 a 0) 1 Q ga 0Q 1 a 0 a 0 I 0 B @ a a 0 a g 1 C A | {z } 0 ; where the second underbraced product is nonpositive because all multiplicands are nonnegative except the first one Q ia 0 Q ig 0. 190 The polynomial solvability of the reduced MIQP is due to Proposition 3 [16]. This completes the proof. For eacha[n], defineA a def =fa : a a 0; a b Q ba Q 1 aa a a ; b =[n]nag. Then by Proposi- tion 50,S a = (t;x;z) : t max a2A a a > x+ conv z g(z;a) is a convex relaxation of clconvS , where g(x;z)= 1 2 x > Qx+d(x;z;f(x;z) : x 0; x(1z)= 0g). Define Q 1 a (b)2R bb by Q 1 a (b) i j = Q 1 aa i j if i2a; j2a and 0 otherwise. Proposition 59. SetS a is given by S a = 8 > < > : (t;x;z) : 2 6 4 2t x > a + x > b Q ba Q 1 aa y > a x a + Q 1 aa Q ab x b y a F L a (z a ) 3 7 5 0; x b 0; 0 z 1 9 > = > ; ; where for z 0, F L a (z) def = jaj å k=1 z p(k) (Q 1 a (a k ) Q 1 a (a k1 )) 0, wherep :[jaj]!a is a bijection such that z p(1) z p(jaj) anda k =fp(1);;p(k)g. Proof. We first note that sincea k1 a k and z 0, one can deduce F L a (z)2S jaj + from Lemma 8. For any z2f0;1g n andb =fi : z i = 1g, by Proposition 58 min x g(x;z)ha;xi= min x b 1 2 x > b Q bb x b a > b x b = 1 2 a > b Q 1 bb a b = 1 2 a > a Q 1 a (b)a a def = v(z;a); where v(z) is submodular due to Proposition 3 [16]. Thus, by Proposition 7, clconv z g(a;z)= v L (z;a)+d(z;[0;1] n )= 1 2 a > a F a (z)a a +d(z;[0;1] n ), where the last identity follows from the def- inition of v L (z;a) and F L a (z). Therefore, max a2A a x > a+ conv z g(z;a)= max a:a a 0;a b Q ba Q 1 aa a a x > a a a 1 2 z > a F a (z)a a + x > b a b +d(z;[0;1] n ) = max a a 0 x > a + x > b Q ba Q 1 aa a a 1 2 a > a F a (z a )a a +d(z;[0;1] n )+d(x b ;R b + ): (a b 0) The conclusion follows by applying Proposition 51 to the last quadratic program. 191 Fig. 6.1: 2D Example ofA 0 . Consider the two-dimensional case where Q= 2 6 4 1 d d 1 3 7 5 0 with 0< d< 1. The region ofA 0 def = S a[2] A a is depicted in Figure 6.1. Note that for any a2A a and c2R n , we always have g (a;c)= max (t;x;z)2S a t+ha;xi+hc;zi by Proposition 57. In this sense, informally speaking, the convex relaxation \ a[2] A a is ideal for at least 75% a’s inR n (see Figure 6.1). 192 7 Conclusion In this dissertation, we systematically study the theory of mixed-integer nonlinear programming with indicator variables. In Chapter 2, we derive a new formulation for a signal estimation problem based on the convexification of the quadratic function defined by a M-matrix. The computational results demonstrate the excellent empirical performance of the statistical estimator delivered by the new formulation. 
In Chapter 3, we derive and compare the strength of several SOCP/SDP convex relaxations for convex quadratic programs with indicator variables. We show that the formulation obtained by optimally decomposing a matrix into a sequence of bivariate quadratic summands significantly outperforms the existing ones in the literature. In Chapter 4, we propose a new disjunctive programming formulation for the mixed-integer epigraph of a low-rank convex function. Starting from it, we establish convexification results for the mixed-integer set under study, which include many existing results in the literature as special cases. In Chapter 5, we characterize the necessary and sufficient conditions under which a linear fractional 0-1 function is submodular and discuss their implications in practical settings. It turns out that in certain scenarios of assortment optimization or facility location problems, one can exploit submodularity results to devise approximation algorithms or solve the problem efficiently via MIP solvers.

Finally, we remark that there is little discussion of enumeration algorithms in this dissertation. In fact, as we observed in many computational experiments (e.g., the one in Chapter 4), off-the-shelf MIP solvers often fail to take full advantage of the convexification results we established for MINLP problems, because the solution methods they adopt are mainly directed at MILP problems. Hence, it is crucial to develop efficient search strategies that are better suited to nonlinear structures. To mitigate this issue, one promising research direction is to integrate decision diagrams, a powerful tool for exploring the combinatorial structure of discrete optimization problems, with convexification techniques; this is ongoing work by the author. We hope that the work presented here will motivate more research on MINLP in the future.

Bibliography

[1] Ahmed, S. and Atamtürk, A. (2011). Maximizing a class of submodular utility functions. Mathematical Programming, 128(1):149–169.
[2] Ahuja, R. K., Hochbaum, D. S., and Orlin, J. B. (2004). A cut-based algorithm for the nonlinear dual of the minimum cost network flow problem. Algorithmica, 39:189–208.
[3] Aktürk, M. S., Atamtürk, A., and Gürel, S. (2009). A strong conic quadratic reformulation for machine-job assignment with controllable processing times. Operations Research Letters, 37:187–191.
[4] Alfakih, A. Y., Khandani, A., and Wolkowicz, H. (1999). Solving Euclidean distance matrix completion problems via semidefinite programming. Computational Optimization and Applications, 12(1-3):13–30.
[5] Alfandari, L., Hassanzadeh, A., and Ljubić, I. (2021). An exact method for assortment optimization under the nested logit model. European Journal of Operational Research, 291(3):830–845.
[6] Alizadeh, F. and Goldfarb, D. (2003). Second-order cone programming. Mathematical Programming, 95:3–51.
[7] Amiri, A., Rolland, E., and Barkhi, R. (1999). Bandwidth packing with queuing delay costs: Bounding and heuristic solution procedures. European Journal of Operational Research, 112(3):635–645.
[8] Anderson, R., Huchette, J., Tjandraatmadja, C., and Vielma, J. P. (2019). Strong mixed-integer programming formulations for trained neural networks. In International Conference on Integer Programming and Combinatorial Optimization, pages 27–42. Springer.
[9] Anstreicher, K. and Burer, S. (2010). Computable representations for convex hulls of low-dimensional quadratic forms.
Mathematical Programming, 124(1):33–43. [10] Anstreicher, K. M. and Burer, S. (2021). Quadratic optimization with switching variables: the convex hull for n= 2. Mathematical Programming, 188(2):421–441. [11] Arora, S., Puri, M., and Swarup, K. (1977). The set covering problem with linear fractional functional. Indian Journal of Pure and Applied Mathematics, 8(5):578–588. [12] Atamt¨ urk, A. (2007). Strong formulations of robust mixed 0-1 programming. Mathematical Programming, 108:235–250. 195 [13] Atamt¨ urk, A., Berenguer, G., and Shen, Z.-J. (2012). A conic integer programming approach to stochastic joint location-inventory problems. Operations Research, 60(2):366–381. [14] Atamt¨ urk, A. and G´ omez, A. (2016). Submodularity in conic quadratic mixed 0-1 optimiza- tion. arXiv preprint arXiv:1705.05918. BCOL Research Report 16.02, UC Berkeley. Forth- coming in Operations Research. [15] Atamt¨ urk, A. and G´ omez, A. (2017). Maximizing a class of utility functions over the vertices of a polytope. Operations Research, 65:433–445. [16] Atamt¨ urk, A. and G´ omez, A. (2018). Strong formulations for quadratic optimization with M-matrices and indicator variables. Mathematical Programming, 170:141–176. [17] Atamt¨ urk, A. and G´ omez, A. (2019). Rank-one convexification for sparse regression. arXiv preprint arXiv:1901.10334. [18] Atamturk, A. and Gomez, A. (2019). Rank-one convexification for sparse regression. arXiv preprint arXiv:1901.10334. [19] Atamt¨ urk, A. and G´ omez, A. (2020a). Safe screening rules for` 0 -regression. Forthcoming in 2020 International Conference of Machine Learning. [20] Atamt¨ urk, A. and G´ omez, A. (2020b). Submodularity in conic quadratic mixed 0–1 opti- mization. Operations Research, 68(2):609–630. [21] Atamt¨ urk, A. and G´ omez, A. (2020a). Supermodularity and valid inequalities for quadratic optimization with indicators. arXiv preprint arXiv:2012.14633. [22] Atamt¨ urk, A. and G´ omez, A. (2020b). Supermodularity and valid inequalities for quadratic optimization with indicators. arXiv preprint arXiv:2012.14633. [23] Atamt¨ urk, A., G´ omez, A., and Han, S. (2018). Sparse and smooth signal estimation: Con- vexification of l0 formulations. http://pitt.edu/ agomez/publications/SignalEstimation.pdf. [24] Atamt¨ urk, A. and Narayanan, V . (2007). Cuts for conic mixed-integer programming. In International Conference on Integer Programming and Combinatorial Optimization, pages 16– 29. Springer. [25] Atamt¨ urk, A. and Narayanan, V . (2008). Polymatroids and mean-risk minimization in dis- crete optimization. Operations Research Letters, 36(5):618–622. [26] Atamt¨ urk, A. and Narayanan, V . (2010). Conic mixed-integer rounding cuts. Mathematical Programming, 122:1–20. [27] Atamt¨ urk, A. and Narayanan, V . (2021). Submodular function minimization and polarity. Mathematical Programming, pages 1–11. [28] Bach, F. (2013). Learning with submodular functions: A convex optimization perspective. Now Publishers, Inc. 196 [29] Bach, F. (2016). Submodular functions: from discrete to continuous domains. Mathematical Programming, pages 1–41. [30] Bach, F. (2019). Submodular functions: from discrete to continuous domains. Mathematical Programming, 175(1-2):419–459. [31] Bach, F. R. (2008). Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225. [32] Balas, E. (1979). Disjunctive programming. Annals of discrete mathematics, 5:3–51. [33] Balas, E. (1985). 
Disjunctive programming and a hierarchy of relaxations for discrete opti- mization problems. SIAM Journal on Algebraic Discrete Methods, 6(3):466–486. [34] Balas, E. (1998). Disjunctive programming: Properties of the convex hull of feasible points. Discrete Applied Mathematics, 89(1-3):3–44. [35] Balas, E. (2018). Disjunctive programming. Springer. [36] Balas, E., Tama, J. M., and Tind, J. (1989). Sequential convexification in reverse convex and disjunctive programming. Mathematical Programming, 44(1):337–350. [37] Bao, L. and Intille, S. S. (2004). Activity recognition from user-annotated acceleration data. In International Conference on Pervasive Computing, pages 1–17. Springer. [38] Barros, A. I. (2013). Discrete and fractional programming techniques for location models, volume 3. Springer Science & Business Media. [39] Belotti, P., Bonami, P., Fischetti, M., Lodi, A., Monaci, M., Nogales-G´ omez, A., and Sal- vagnin, D. (2016). On handling indicator constraints in mixed integer programming. Computa- tional Optimization and Applications, 65(3):545–566. [40] Ben-Tal, A., El Ghaoui, L., and Nemirovski, A. (2009). Robust optimization, volume 28. Princeton University Press. [41] Ben-Tal, A. and Nemirovski, A. (2001). Lectures on modern convex optimization: analysis, algorithms, and engineering applications. SIAM. [42] Benati, S. (1996). Submodularity in competitive location problems. Ricerca Operativa. [43] Benati, S. and Hansen, P. (2002). The maximum capture problem with random utilities: Problem formulation and algorithms. European Journal of Operational Research, 143(3):518– 530. [44] Berbeglia, G. and Joret, G. (2017). Assortment optimisation under a general discrete choice model: A tight analysis of revenue-ordered assortments. In Proceedings of the 2017 ACM Conference on Economics and Computation, EC ’17, pages 345–346, New York, NY , USA. ACM. 197 [45] Berman, A. and Plemmons, R. J. (1994). Nonnegative matrices in the mathematical sciences, volume 9. Siam. [46] Bernal, D. E. and Grossmann, I. E. (2021). Convex mixed-integer nonlinear programs derived from generalized disjunctive programming using cones. arXiv preprint arXiv:2109.09657. [47] Bertsimas, D., Cory-Wright, R., and Pauphilet, J. (2019a). A unified approach to mixed-integer optimization: Nonlinear formulations and scalable algorithms. arXiv preprint arXiv:1907.02109. [48] Bertsimas, D., Cory-Wright, R., and Pauphilet, J. (2020a). Mixed-projection conic optimiza- tion: A new paradigm for modeling rank constraints. arXiv preprint arXiv:2009.10395. [49] Bertsimas, D. and King, A. (2015). OR forum – an algorithmic approach to linear regression. Operations Research, 64:2–16. [50] Bertsimas, D., King, A., Mazumder, R., et al. (2016). Best subset selection via a modern optimization lens. The Annals of Statistics, 44:813–852. [51] Bertsimas, D., Korolko, N., and Weinstein, A. M. (2019b). Identifying exceptional responders in randomized trials: An optimization approach. INFORMS Journal on Optimization, 1(3):187– 199. [52] Bertsimas, D., Van Parys, B., et al. (2020b). Sparse high-dimensional regression: Exact scalable algorithms and phase transitions. The Annals of Statistics, 48(1):300–323. [53] Bienstock, D. (1996). Computational study of a family of mixed-integer quadratic program- ming problems. Mathematical programming, 74(2):121–140. [54] Bienstock, D. and Michalka, A. (2014). Cutting-planes for optimization of convex functions over nonconvex sets. SIAM Journal on Optimization, 24(2):643–677. [55] Boland, N., Dey, S. 
S., Kalinowski, T., Molinaro, M., and Rigterink, F. (2017). Bounding the gap between the mccormick relaxation and the convex hull for bilinear functions. Mathematical Programming, 162(1):523–535. [56] Boman, E. G., Chen, D., Parekh, O., and Toledo, S. (2005). On factor width and symmetric H-matrices. Linear Algebra and Its Applications, 405:239–248. [57] Bonami, P. (2011). Lift-and-project cuts for mixed integer convex programs. In International Conference on Integer Programming and Combinatorial Optimization, pages 52–64. Springer. [58] Bonami, P., Lodi, A., Tramontani, A., and Wiese, S. (2015). On mathematical programming with indicator constraints. Mathematical Programming, 151:191–223. [59] Bonnet, C. and Simioni, M. (2001). Assessing consumer response to protected designation of origin labelling: a mixed multinomial logit approach. European Review of Agricultural Economics, 28(4):433–449. 198 [60] Borrero, J. S., Gillen, C., and Prokopyev, O. A. (2016). A simple technique to improve linearized reformulations of fractional (hyperbolic) 0–1 programming problems. Operations Research Letters, 44(4):479–486. [61] Borrero, J. S., Gillen, C., and Prokopyev, O. A. (2017). Fractional 0–1 programming: appli- cations and algorithms. Journal of Global Optimization, 69(1):255–282. [62] Boykov, Y ., Veksler, O., and Zabih, R. (2001). Fast approximate energy minimization via graph cuts. IEEE Transactions on pattern analysis and machine intelligence, 23:1222–1239. [63] Burer, S. and Anstreicher, K. (2020). Quadratic optimization with switching variables: The convex hull for n= 2. arXiv preprint arXiv:2002.04681. [64] Burer, S. and Ye, Y . (2020). Exact semidefinite formulations for a class of (random and non-random) nonconvex quadratic programs. Mathematical Programming, 181:1–17. [65] Calinescu, G., Chekuri, C., P´ al, M., and V ondr´ ak, J. (2011). Maximizing a monotone sub- modular function subject to a matroid constraint. SIAM Journal on Computing, 40(6):1740– 1766. [66] Candes, E. J. and Plan, Y . (2010). Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936. [67] Cand` es, E. J. and Wakin, M. B. (2008). An introduction to compressive sampling. IEEE Signal Processing Magazine, 25:21–30. [68] Candes, E. J., Wakin, M. B., and Boyd, S. P. (2008). Enhancing sparsity by reweighted` 1 minimization. Journal of Fourier analysis and Applications, 14:877–905. [69] Casale, P., Pujol, O., and Radeva, P. (2011). Human activity recognition from accelerometer data using a wearable device. In Iberian Conference on Pattern Recognition and Image Analysis, pages 289–296. Springer. [70] Casale, P., Pujol, O., and Radeva, P. (2012). Personalization and user verification in wearable systems using biometric walking patterns. Personal and Ubiquitous Computing, 16:563–580. [71] Castro, J., Frangioni, A., and Gentile, C. (2014). Perspective reformulations of the CTA problem with l 2 distances. Operations Research, 62(4):891–909. [72] Ceria, S. and Soares, J. (1999). Convex programming for disjunctive convex optimization. Mathematical Programming, 86(3):595–614. [73] C ¸ ezik, M. T. and Iyengar, G. (2005). Cuts for mixed 0-1 conic programming. Mathematical Programming, 104:179–202. [74] Chandrasekaran, R. (1977). Minimal ratio spanning trees. Networks, 7(4):335–342. [75] Chen, S. S., Donoho, D. L., and Saunders, M. A. (2001). Atomic decomposition by basis pursuit. SIAM review, 43:129–159. 199 [76] Conforti, M., Cornu´ ejols, G., Zambelli, G., et al. (2014). Integer programming, volume 271. Springer. 
[77] Cozad, A., Sahinidis, N. V ., and Miller, D. C. (2014). Learning surrogate models for simulation-based optimization. AIChE Journal, 60:2211–2227. [78] Dam, T. T., Ta, T. A., and Mai, T. (2021). Submodularity and local search approaches for maximum capture problems under generalized extreme value models. European Journal of Operational Research. [79] Dantzig, G. B. (1972). Fourier-motzkin elimination and its dual. Technical report, STAN- FORD UNIV CA DEPT OF OPERATIONS RESEARCH. [80] Davarnia, D., Richard, J.-P. P., and Tawarmalani, M. (2017). Simultaneous convexification of bilinear functions over polytopes with application to network interdiction. SIAM Journal on Optimization, 27(3):1801–1833. [81] Dedieu, A., Hazimeh, H., and Mazumder, R. (2020). Learning sparse classifiers: Continuous and mixed integer optimization perspectives. arXiv preprint arXiv:2001.06471. [82] D´ esir, A., Goyal, V ., Segev, D., and Ye, C. (2015). Capacity constrained assortment opti- mization under the Markov chain based choice model. Working paper, Columbia University, New York, NY .http://dx.doi.org/10.2139/ssrn.2626484. [83] D´ esir, A., Goyal, V ., and Zhang, J. (2014). Near-optimal algorithms for capacity constrained assortment optimization. Working paper, Columbia University, NY .http://dx.doi.org/10. 2139/ssrn.2543309. [84] Dey, S. S., Santana, A., and Wang, Y . (2019). New SOCP relaxation and branching rule for bipartite bilinear programs. Optimization and Engineering, 20(2):307–336. [85] Dheeru, D. and Karra Taniskidou, E. (2017). UCI machine learning repository. [86] Dong, H. (2019). On integer and MPCC representability of affine sparsity. Operations Re- search Letters, 47(3):208–212. [87] Dong, H., Ahn, M., and Pang, J.-S. (2019). Structural properties of affine sparsity constraints. Mathematical Programming, 176(1-2):95–135. [88] Dong, H., Chen, K., and Linderoth, J. (2015). Regularization vs. relaxation: A conic opti- mization perspective of statistical variable selection. arXiv preprint arXiv:1510.06083. [89] Dong, H. and Linderoth, J. (2013). On valid inequalities for quadratic programming with continuous variables and binary indicators. In Goemans, M. and Correa, J., editors, Integer Programming and Combinatorial Optimization, pages 169–180, Berlin, Heidelberg. Springer Berlin Heidelberg. [90] Donoho, D. L. (2006). Compressed sensing. IEEE Transactions on Information Theory, 52:1289–1306. 200 [91] Donoho, D. L., Elad, M., and Temlyakov, V . N. (2006). Stable recovery of sparse over- complete representations in the presence of noise. IEEE Transactions on Information Theory, 52:6–18. [92] Du, D., Lu, R., and Xu, D. (2012). A primal-dual approximation algorithm for the facility location problem with submodular penalties. Algorithmica, 63(1-2):191–200. [93] Elhedhli, S. (2005). Exact solution of a class of nonlinear knapsack problems. Operations Research Letters, 33(6):615–624. [94] Fattahi, S., Ashraphijuo, M., Lavaei, J., and Atamt¨ urk, A. (2017). Conic relaxations of the unit commitment problem. Energy, 134:1079–1095. [95] Feldman, J. and Topaloglu, H. (2015). Bounding optimal expected revenues for assortment optimization under mixtures of multinomial logits. Production and Operations Management, 24(10):1598–1620. [96] Fisher, M. L., Nemhauser, G. L., and Wolsey, L. A. (1978). An analysis of approximations for maximizing submodular set functions—II. In Balinski, M. L.and Hoffman, A. J., editor, Polyhedral Combinatorics, pages 73–87. Springer. [97] Frangioni, A. and Gentile, C. (2006). 
Perspective cuts for a class of convex 0–1 mixed integer programs. Mathematical Programming, 106(2):225–236. [98] Frangioni, A. and Gentile, C. (2007). Sdp diagonalizations and perspective cuts for a class of nonseparable miqp. Operations Research Letters, 35:181–185. [99] Frangioni, A., Gentile, C., Grande, E., and Pacifici, A. (2011). Projected perspective refor- mulations with applications in design problems. Operations Research, 59(5):1225–1232. [100] Frangioni, A., Gentile, C., and Hungerford, J. (2019). Decompositions of semidefinite matrices and the perspective reformulation of nonseparable quadratic programs. Mathematics of Operations Research. [101] Frangioni, A., Gentile, C., and Hungerford, J. (2020). Decompositions of semidefinite matrices and the perspective reformulation of nonseparable quadratic programs. Mathematics of Operations Research, 45(1):15–33. [102] Frank, L. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35:109–135. [103] Friedrich, J., Zhou, P., and Paninski, L. (2017). Fast online deconvolution of calcium imag- ing data. PLoS computational biology, 13. [104] Galiana, F. D., Motto, A. L., and Bouffard, F. (2003). Reconciling social welfare, agent profits, and consumer payments in electricity pools. IEEE Transactions on Power Systems, 18(2):452–459. [105] Gao, J. and Li, D. (2011). Cardinality constrained linear-quadratic optimal control. IEEE Transactions on Automatic Control, 56(8):1936–1941. 201 [106] Gao, Y .-m. and Wang, X.-h. (1992). Criteria for generalized diagonally dominant matrices and M-matrices. Linear algebra and its Applications, 169:257–268. [107] Goemans, M. X. and Williamson, D. P. (1995). Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115–1145. [108] Goldstein, M. (2018). Meet the most crowded airlines: Load factor hits all-time high. https://www.forbes.com/sites/michaelgoldstein/2018/07/ 09/meet-the-most-crowded-airlines-load-factor-hits-all-time-high/ #90f0b4354fbd. Accessed: 2019-04-12. Forbes. [109] G´ omez, A. (2019). Outlier detection in time series via mixed-integer conic quadratic opti- mization. http://www.optimization-online.org/DB HTML/2019/11/7488.html. [110] G´ omez, A. (2020). Strong formulations for conic quadratic optimization with indicator variables. Forthcoming in Mathematical Programming. [111] G´ omez, A. (2021). Strong formulations for conic quadratic optimization with indicator variables. Mathematical Programming, 188(1):193–226. [112] G´ omez, A. and Prokopyev, O. (2018). A mixed-integer fractional optimization approach to best subset selection. http://www.optimization-online.org/DB HTML/2018/08/6791.html. [113] Granot, D. and Granot, F. (1976). On solving fractional (0, 1) programs by implicit enumer- ation. INFOR: Information Systems and Operational Research, 14(3):241–249. [114] Grossmann, I. E. (2002). Review of nonlinear mixed-integer and disjunctive programming techniques. Optimization and engineering, 3(3):227–252. [115] Gr¨ otschel, M., Lov´ asz, L., and Schrijver, A. (1981). The ellipsoid method and its conse- quences in combinatorial optimization. Combinatorica, 1:169–197. [116] G¨ unl¨ uk, O. and Linderoth, J. (2010). Perspective reformulations of mixed integer nonlinear programs with indicator variables. Mathematical Programming, 124:183–205. [117] Gupte, A., Kalinowski, T., Rigterink, F., and Waterer, H. (2020). 
Extended formulations for convex hulls of some bilinear functions. Discrete Optimization, 36:100569. [118] Han, S., G´ omez, A., and Atamt¨ urk, A. (2021). The equivalence of optimal per- spective formulation and Shor’s SDP for quadratic programs with indicator variables. https://arxiv.org/abs/2112.04618. [119] Han, S. and G´ omez, A. (2021). Compact extended formulations for low-rank functions with indicator variables. arXiv preprint arXiv:2110.14884. [120] Han, S. and G´ omez, A. (2021). Single-neuron convexification for binarized neural networks. http://www.optimization-online.org/DB HTML/2021/05/8419.html. 202 [121] Hansen, P., De Arag˜ ao, M. V . P., and Ribeiro, C. C. (1990). Boolean query optimization and the 0-1 hyperbolic sum problem. Annals of Mathematics and Artificial Intelligence, 1(1-4):97– 109. [122] Hansen, P., De Arag˜ ao, M. V . P., and Ribeiro, C. C. (1991). Hyperbolic 0–1 programming and query optimization in information retrieval. Mathematical Programming, 52(1-3):255–263. [123] Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 1. Springer series in statistics New York, NY , USA:. [124] Hastie, T., Tibshirani, R., and Tibshirani, R. J. (2017). Extended comparisons of best subset selection, forward stepwise selection, and the lasso. arXiv preprint arXiv:1707.08692. [125] Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical learning with sparsity: The lasso and generalizations. CRC press. [126] Hazimeh, H. and Mazumder, R. (2018). Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms. arXiv preprint arXiv:1803.01454. [127] Hazimeh, H., Mazumder, R., and Saab, A. (2020). Sparse regression at scale: Branch-and- bound rooted in first-order optimization. arXiv preprint arXiv:2004.06152. [128] Hebiri, M., Van De Geer, S., et al. (2011). The smooth-lasso and other ` 1 + ` 2 -penalized methods. Electronic Journal of Statistics, 5:1184–1226. [129] Hijazi, H., Bonami, P., Cornu´ ejols, G., and Ouorou, A. (2012). Mixed-integer nonlinear pro- grams featuring “on/off” constraints. Computational Optimization and Applications, 52:537– 558. [130] Hiriart-Urruty, J.-B. and Lemar´ echal, C. (2004). Fundamentals of convex analysis. Springer Science & Business Media. [131] Hiriart-Urruty, J.-B. and Lemar´ echal, C. (2013). Convex analysis and minimization algo- rithms I: Fundamentals, volume 305. Springer science & business media. [132] Ho-Nguyen, N. and Kılınc ¸-Karzan, F. (2017). A second-order cone based approach for solv- ing the trust-region subproblem and its variants. SIAM Journal on Optimization, 27(3):1485– 1512. [133] Hochbaum, D. S. (2001). An efficient algorithm for image segmentation, Markov random fields and related problems. Journal of the ACM (JACM), 48:686–701. [134] Hochbaum, D. S. (2013). Multi-label markov random fields as an efficient and effective tool for image segmentation, total variations and regularization. Numerical Mathematics: Theory, Methods and Applications, 6(1):169–198. [135] Hochbaum, D. S. and Liu, S. (2018). Adjacency-clustering and its application for yield prediction in integrated circuit manufacturing. Operations Research, 66(6):1571–1585. 203 [136] Hoisington, A. (2018). Hotel management. https://www.hotelmanagement.net/own/ occupancy-hits-30-year-high-u-s. Accessed: 2019-04-12. [137] Horn, R. A. and Johnson, C. R. (2012). Matrix analysis. Cambridge university press. [138] Huang, J., Ma, S., and Zhang, C.-H. 
(2008). Adaptive lasso for sparse high-dimensional regression models. Statistica Sinica, pages 1603–1618. [139] Iwano, K., Misono, S., Tezuka, S., and Fujishige, S. (1994). A new scaling algorithm for the maximum mean cut problem. Algorithmica, 11(3):243–255. [140] Javanmard, A., Montanari, A., and Ricci-Tersenghi, F. (2016). Phase transitions in semidef- inite relaxations. Proceedings of the National Academy of Sciences, 113(16):E2218–E2223. [141] Jeon, H., Linderoth, J., and Miller, A. (2017). Quadratic cone cutting surfaces for quadratic programs with on–off constraints. Discrete Optimization, 24:32–50. [142] Jewell, S. and Witten, D. (2017). Exact spike train inference via ` 0 optimization. arXiv preprint arXiv:1703.08644. [143] Jeyakumar, V . and Li, G. (2014). Trust-region problems with linear inequality constraints: exact sdp relaxation, global optimality and robust optimization. Mathematical Programming, 147(1-2):171–206. [144] Kilinc, M., Linderoth, J., and Luedtke, J. (2010). Effective separation of disjunctive cuts for convex mixed integer nonlinear programs. Technical report, University of Wisconsin-Madison Department of Computer Sciences. [145] Kılınc ¸-Karzan, F. and Yıldız, S. (2015). Two-term disjunctions on the second-order cone. Mathematical Programming, 154:463–491. [146] Kim, J., Tawarmalani, M., and Richard, J.-P. P. (2021). Convexification of permutation- invariant sets and an application to sparse principal component analysis. Mathematics of Oper- ations Research. [147] Kim, S.-J., Koh, K., Boyd, S., and Gorinevsky, D. (2009). ` 1 trend filtering. SIAM review, 51:339–360. [148] Kimura, K. and Waki, H. (2018). Minimization of Akaike’s information criterion in linear regression analysis via mixed integer nonlinear program. Optimization Methods and Software, 33(3):633–649. [149] Kleinberg, J. and Tardos, E. (2002). Approximation algorithms for classification problems with pairwise relationships: Metric labeling and markov random fields. Journal of the ACM (JACM), 49:616–639. [150] Kolmogorov, V . and Zabin, R. (2004). What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26:147–159. 204 [151] Kulik, A., Shachnai, H., and Tamir, T. (2009). Maximizing submodular set functions subject to multiple linear constraints. In Mathieu, C., editor, Proceedings of the twentieth annual ACM- SIAM symposium on Discrete algorithms, pages 545–554. Society for Industrial and Applied Mathematics. [152] Kunnumkal, S. (2015). On upper bounds for assortment optimization under the mixture of multinomial logit models. Operations Research Letters, 43(2):189–194. [153] Kunnumkal, S. and Mart´ ınez-de Alb´ eniz, V . (2019). Tractable approximations for assort- ment planning with product costs. Operations Research, 67(2):436–452. [154] Lee, S. and Grossmann, I. E. (2000). New algorithms for nonlinear generalized disjunctive programming. Computers & Chemical Engineering, 24(9-10):2125–2141. [155] Li, Y ., Du, D., Xiu, N., and Xu, D. (2015). Improved approximation algorithms for the facility location problems with linear/submodular penalties. Algorithmica, 73(2):460–482. [156] Lim, C. H., Linderoth, J., and Luedtke, J. (2018). Valid inequalities for separable concave constraints with indicator variables. Mathematical Programming, 172(1-2):415–442. [157] Lin, X., Pham, M., and Ruszczy´ nski, A. (2014). Alternating linearization for structured regularization problems. Journal of Machine Learning Research, 15:3447–3481. [158] Lin, Y . H. 
and Tian, Q. (2021). Exact approaches for competitive facility location with discrete attractiveness. Optimization Letters, 15:377–389. [159] Liu, P., Fattahi, S., G´ omez, A., and K¨ uc ¸ ¨ ukyavuz, S. (2021). A graph-based decomposition method for convex quadratic optimization with indicators. arXiv preprint arXiv:2110.12547. [160] Ljubi´ c, I. and Moreno, E. (2018). Outer approximation and submodular cuts for maximum capture facility location problems with random utilities. European Journal of Operational Re- search, 266(1):46–56. [161] Lobo, M. S., Vandenberghe, L., Boyd, S., and Lebret, H. (1998). Applications of second- order cone programming. Linear algebra and its Applications, 284:193–228. [162] Locatelli, M. and Schoen, F. (2014). On convex envelopes for bivariate functions over polytopes. Mathematical Programming, 144(1):56–91. [163] Lov´ asz, L. (1983). Submodular functions and convexity. In Mathematical programming the state of the art, pages 235–257. Springer. [164] Lubin, M., Yamangil, E., Bent, R., and Vielma, J. P. (2016). Extended formulations in mixed-integer convex programming. In International Conference on Integer Programming and Combinatorial Optimization, pages 102–113. Springer. [165] Lustig, M., Donoho, D., and Pauly, J. M. (2007). Sparse MRI: The application of com- pressed sensing for rapid MR imaging. Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine, 58:1182–1195. 205 [166] Mahajan, A., Leyffer, S., Linderoth, J., Luedtke, J., and Munson, T. (2017). Minotaur: A mixed-integer nonlinear optimization toolkit. Technical report, ANL/MCS-P8010-0817, Ar- gonne National Lab. [167] Mammen, E., van de Geer, S., et al. (1997). Locally adaptive regression splines. The Annals of Statistics, 25:387–413. [168] Mazumder, R., Friedman, J. H., and Hastie, T. (2011). Sparsenet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106:1125–1138. [169] Mazumder, R., Radchenko, P., and Dedieu, A. (2017). Subset selection with shrinkage: Sparse linear modeling when the SNR is low. arXiv preprint arXiv:1708.03288. [170] McFadden, D. and Train, K. (2000). Mixed MNL models for discrete response. Journal of Applied Econometrics, 15(5):447–470. [171] Megiddo, N. et al. (1979). Combinatorial optimization with rational objective functions. Mathematics of Operations Research, 4(4):414–424. [172] Mehmanchi, E., G´ omez, A., and Prokopyev, O. A. (2019). Fractional 0–1 programs: links between mixed-integer linear and conic quadratic formulations. Journal of Global Optimization, 75(2):273–339. [173] M´ endez-D´ ıaz, I., Miranda-Bront, J. J., Vulcano, G., and Zabala, P. (2014). A branch-and- cut algorithm for the latent-class logit assortment problem. Discrete Applied Mathematics, 164:246–263. [174] Miller, A. (2002). Subset selection in regression. CRC Press. [175] Mittal, S. and Schulz, A. S. (2013). A general framework for designing approximation schemes for combinatorial optimization problems with many objectives combined into one. Operations Research, 61(2):386–397. [176] Modaresi, S., Kılınc ¸, M. R., and Vielma, J. P. (2016). Intersection cuts for nonlinear integer programming: Convexification techniques for structured sets. Mathematical Programming, 155:575–611. [177] Moorthy, K. S. and Png, I. P. (1992). Market segmentation, cannibalization, and the timing of product introductions. Management Science, 38(3):345–359. [178] Nemhauser, G. L., Wolsey, L. A., and Fisher, M. L. (1978). 
An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14(1):265–294. [179] Nemirovski, A. S. and Todd, M. J. (2008). Interior-point methods for optimization. Acta Numerica, 17:191–234. [180] Nevo, D. and Ritov, Y . (2017). Identifying a minimal class of models for high-dimensional data. Journal of Machine Learning Research, 18:797–825. 206 [181] Ortiz-Astorquiza, C., Contreras, I., and Laporte, G. (2017). Formulations and approximation algorithms for multilevel uncapacitated facility location. INFORMS Journal on Computing, 29(4):767–779. [182] Padilla, O. H. M., Sharpnack, J., Scott, J. G., and Tibshirani, R. J. (2018). The DFS fused lasso: Linear-time denoising over general graphs. Journal of Machine Learning Research, 18(176):1–36. [183] Pang, J.-S. and Han, S. (2021). Some strongly polynomially solvable convex quadratic programs with bounded variables. arXiv preprint arXiv:2112.03886. [184] Pilanci, P., Wainwright, M. J., and El Ghaoui, L. (2015). Sparse learning via boolean relax- ations. Mathematical Programming, 151:63–87. [185] Plemmons, R. J. (1977). M-matrix characterizations. I – nonsingular M-matrices. Linear Algebra and its Applications, 18:175–188. [186] Poggio, T., Torre, V ., and Koch, C. (1985). Computational vision and regularization theory. Nature, 317:314. [187] Prokopyev, O. A., Huang, H.-X., and Pardalos, P. M. (2005a). On complexity of uncon- strained hyperbolic 0–1 programming problems. Operations Research Letters, 33(3):312–318. [188] Prokopyev, O. A., Meneses, C., Oliveira, C. A., and Pardalos, P. M. (2005b). On multiple- ratio hyperbolic 0–1 programming problems. Pacific Journal of Optimization, 1(2):327–345. [189] Qin, Z. and Goldfarb, D. (2012). Structured sparsity via alternating direction methods. Journal of Machine Learning Research, 13:1435–1468. [190] Radzik, T. (1998). Fractional combinatorial optimization. In Du, D.-Z. and Pardalos, P. M., editors, Handbook of Combinatorial Optimization, pages 429–478. Springer-Verlag New York. [191] Richard, J.-P. P. and Tawarmalani, M. (2010). Lifting inequalities: a framework for gener- ating strong cuts for nonlinear programs. Mathematical Programming, 121(1):61–104. [192] Rinaldo, A. et al. (2009). Properties and refinements of the fused lasso. The Annals of Statistics, 37:2922–2952. [193] Rockafellar, R. T. (1970a). Convex analysis. [194] Rockafellar, R. T. (1970b). Convex Analysis. [195] Rockafellar, R. T. and Wets, R. J.-B. (2009). Variational analysis, volume 317. Springer Science & Business Media. [196] Rudin, L. I., Osher, S., and Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena, 60:259–268. [197] Rusmevichientong, P., Shen, Z.-J. M., and Shmoys, D. B. (2009). A PTAS for capacitated sum-of-ratios optimization. Operations Research Letters, 37(4):230–238. 207 [198] Rusmevichientong, P., Shen, Z.-J. M., and Shmoys, D. B. (2010). Dynamic assortment op- timization with a multinomial logit choice model and capacity constraint. Operations Research, 58(6):1666–1680. [199] Rusmevichientong, P., Shmoys, D., Tong, C., and Topaloglu, H. (2014). Assortment opti- mization under the multinomial logit model with random choice parameters. Production and Operations Management, 23(11):2023–2039. [200] Sethuraman, S. and Butenko, S. (2015). The maximum ratio clique problem. Computational Management Science, 12(1):197–218. [201] Shen, X., Pan, W., Zhu, Y ., and Zhou, H. (2013). 
On constrained and regularized high- dimensional regression. Annals of the Institute of Statistical Mathematics, 65:807–832. [202] Shepard, E. L., Wilson, R. P., Quintana, F., Laich, A. G., Liebsch, N., Albareda, D. A., Halsey, L. G., Gleiss, A., Morgan, D. T., Myers, A. E., et al. (2008). Identification of animal movement patterns using tri-axial accelerometry. Endangered Species Research, 10:47–60. [203] Shor, N. Z. (1987). Quadratic optimization problems. Soviet Journal of Computer and Systems Sciences, 25:1–11. [204] Sion, M. (1958). On general minimax theorems. Pacific Journal of Mathematics, 8(1):171– 176. [205] Stubbs, R. A. and Mehrotra, S. (1999). A branch-and-cut method for 0-1 mixed convex programming. Mathematical programming, 86(3):515–532. [206] Sviridenko, M. (2004). A note on maximizing a submodular set function subject to a knap- sack constraint. Operations Research Letters, 32(1):41–43. [207] Talluri, K. and van Ryzin, G. (2004). Revenue management under a general discrete choice model of consumer behavior. Management Science, 50(1):15–33. [208] Talluri, K. T. and van Ryzin, G. J. (2006). The theory and practice of revenue management, volume 68. Springer Science & Business Media. [209] Tawarmalani, M., Ahmed, S., and Sahinidis, N. V . (2002). Global optimization of 0-1 hyperbolic programs. Journal of Global Optimization, 24(4):385–416. [210] Tawarmalani, M. and Richard, J.-P. P. (2013). Decomposition techniques in convexification of inequalities. Technical report, Technical report. [211] Tawarmalani, M., Richard, J.-P. P., and Chung, K. (2010). Strong valid inequalities for orthogonal disjunctions and bilinear covering sets. Mathematical Programming, 124(1):481– 512. [212] Tawarmalani, M. and Sahinidis, N. V . (2005). A polyhedral branch-and-cut approach to global optimization. Mathematical Programming, 103(2):225–249. 208 [213] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288. [214] Tibshirani, R. (2011a). Regression shrinkage and selection via the lasso: A retrospective. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73:273–282. [215] Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67:91–108. [216] Tibshirani, R. J. (2011b). The solution path of the generalized lasso. Stanford University. [217] Tibshirani, R. J. et al. (2014). Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics, 42:285–323. [218] Ursulenko, O., Butenko, S., and Prokopyev, O. A. (2013). A global optimization algorithm for solving the minimum multiple ratio spanning tree problem. Journal of Global Optimization, 56(3):1029–1043. [219] Varga, R. S. (1976). On recurring theorems on diagonal dominance. Linear Algebra and its Applications, 13(1-2):1–9. [220] Vielma, J. P., Dunning, I., Huchette, J., and Lubin, M. (2017). Extended formulations in mixed integer conic quadratic programming. Mathematical Programming Computation, 9(3):369–418. [221] V ogel, C. R. and Oman, M. E. (1996). Iterative methods for total variation denoising. SIAM Journal on Scientific Computing, 17:227–238. [222] V ogelstein, J. T., Packer, A. M., Machado, T. A., Sippy, T., Babadi, B., Yuste, R., and Paninski, L. (2010). Fast nonnegative deconvolution for spike train inference from population calcium imaging. 
Journal of Neurophysiology, 104:3691–3704. [223] Wang, A. L. and Kılınc ¸-Karzan, F. (2019). On the tightness of SDP relaxations of QCQPs. http://www.optimization-online.org/DB HTML/2019/11/7487.html. [224] Wang, A. L. and Kılınc ¸-Karzan, F. (2020). The generalized trust region subproblem: solu- tion complexity and convex hull results. Mathematical Programming, pages 1–42. [225] Wei, L., Atamt¨ urk, A., G´ omez, A., and K¨ uc ¸ ¨ ukyavuz, S. (2022). On the convex hull of convex quadratic optimization problems with indicators. arXiv preprint arXiv:2201.00387. [226] Wei, L., G´ omez, A., and Kucukyavuz, S. (2020). Ideal formulations for constrained convex optimization problems with indicator variables. arXiv preprint arXiv:2007.00107. [227] Wilson, R. P., Shepard, E., and Liebsch, N. (2008). Prying into the intimate details of animal lives: Use of a daily diary on animals. Endangered Species Research, 4:123–137. [228] Wilson, Z. T. and Sahinidis, N. V . (2017). The ALAMO approach to machine learning. Computers & Chemical Engineering, 106:785–795. 209 [229] Wu, B., Sun, X., Li, D., and Zheng, X. (2017). Quadratic convex reformulations for semi- continuous quadratic programming. SIAM Journal on Optimization, 27:1531–1553. [230] Xie, W. and Deng, X. (2018). Scalable algorithms for the sparse ridge regression. arXiv preprint arXiv:1806.03756. [231] Yang, A. Y ., Sastry, S. S., Ganesh, A., and Ma, Y . (2010). Fast` 1 -minimization algorithms and an application in robust face recognition: A review. In Image Processing (ICIP), 2010 17th IEEE International Conference on, pages 1849–1852. IEEE. [232] Yıldız, S. and Kılınc ¸-Karzan, F. (2016). Low-complexity relaxations and convex hulls of disjunctions on the positive semidefinite cone and general regular cones. Optimization Online. [233] Zhang, C.-H. et al. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38:894–942. [234] Zhang, Y ., Wainwright, M. J., and Jordan, M. I. (2014). Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. In Conference on Learning Theory, pages 921–948. [235] Zheng, X., Sun, X., and Li, D. (2014a). Improving the performance of miqp solvers for quadratic programs with cardinality and minimum threshold constraints: A semidefinite pro- gram approach. INFORMS Journal on Computing, 26(4):690–703. [236] Zheng, Z., Fan, Y ., and Lv, J. (2014b). High dimensional thresholded regression and shrink- age effect. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76:627– 649. [237] Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429. [238] Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67:301–320. 210 Appendices A Supplementary results for Chapter 2 We now give an extended formulation of W = (d 1 ;d 2 ;r)2R 3 + : r g(z;x;d 1 ;d 2 ); d 1 d 2 1 for fixed(z;x) with z 1 z 2 , arising in the separation problem (2.21). Proposition 60 (Extended formulation of W). A point (d 1 ;d 2 ;r)2 W if and only if d 1 d 2 1, d 1 ;d 2 ;r 0 and there exists v;w2R + such that the set of inequalities r 2x 1 x 2 z 2 + d 1 x 2 1 z 2 + d 2 x 2 2 z 2 1 z 2 1 z 1 v w 2 d 1 v; d 1 x 1 x 2 w is feasible. 211 Proof. 
From Proposition 60, if $z_1 \geq \dots \geq z_n$, then the separation problem (2.21) can be formulated as
\begin{align}
\max_{d, v, w}\;& \sum_{i=1}^{n} \sum_{j=i+1}^{n} |Q_{ij}| \left( -\frac{2 x_i x_j}{z_j} + \frac{d_i^{ij} x_i^2}{z_j} + \frac{d_j^{ij} x_j^2}{z_j} - \left(\frac{1}{z_j} - \frac{1}{z_i}\right) v_{ij} \right) \tag{A.1a}\\
\text{s.t. }& \sum_{j \neq i} |Q_{ij}|\, d_i^{ij} = Q_{ii}, \quad i = 1, \dots, n \tag{A.1b}\\
& w_{ij}^2 \leq d_i^{ij} v_{ij}, \quad d_i^{ij} x_i - x_j \leq w_{ij}, \quad d_i^{ij} d_j^{ij} \geq 1, \quad 1 \leq i < j \leq n \tag{A.1c}\\
& d_i^{ij}, d_j^{ij}, v_{ij}, w_{ij} \geq 0, \quad 1 \leq i < j \leq n. \tag{A.1d}
\end{align}
In order to use the strongest relaxation Q, we introduce an additional variable $t$ and initially formulate (2.3) as the model (M-sep):
\begin{align}
\sum_{i=1}^{n} y_i^2 + \min\;& \sum_{i=1}^{n} -2 y_i x_i + t \tag{A.2a}\\
\text{s.t. }& \sum_{i=1}^{n} s_i + \lambda \sum_{\{i,j\} \in A} (x_i - x_j)^2 \leq t \tag{A.2b}\\
& \sum_{i=1}^{n} z_i \leq k \tag{A.2c}\\
& x_i^2 \leq s_i z_i, \quad i = 1, \dots, n \tag{A.2d}\\
& x_i \leq \|y\|_\infty z_i, \quad i = 1, \dots, n \tag{A.2e}\\
& z \in [0,1]^n,\; x \in \mathbb{R}^n_+,\; s \in \mathbb{R}^n_+, \tag{A.2f}
\end{align}
which is equivalent to the perspective relaxation. Then the convex relaxation is iteratively refined by adding inequalities of the form
\begin{align}
\sum_{\{i,j\} \in A} |Q_{ij}|\, s_{ij} \leq t, \qquad g(z_i, z_j, x_i, x_j, d_i^{ij}, d_j^{ij}) \leq s_{ij} \quad \forall \{i,j\} \in A \tag{A.3}
\end{align}
using the extended formulation in Proposition 15, where the parameters $d$ are obtained by solving the separation problem (A.1).

Remark 11. The number of variables required to represent (A.3) as conic quadratic inequalities can be significantly reduced by exploiting the following observations:

1. In our computations, $d_i^{ij} d_j^{ij} \approx 1$ for most pairs $\{i,j\} \in A$ in optimal solutions of the separation problem (A.1). Thus, if $|d_i^{ij} d_j^{ij} - 1| \leq \varepsilon$ for a precision $\varepsilon$ (e.g., $\varepsilon = 10^{-6}$ in our computations), constraint $g(z_i, z_j, x_i, x_j, d_i^{ij}, d_j^{ij}) \leq s_{ij}$ can be compactly represented by adding $v_{ij}, w_{ij} \in \mathbb{R}_+$ as
\begin{align}
d_i^{ij} v_{ij} \geq d_i^{ij} x_i - x_j, &\qquad d_i^{ij} v_{ij}^2 \leq s_{ij} z_i, \tag{A.4}\\
d_i^{ij} w_{ij} \geq -d_i^{ij} x_i + x_j, &\qquad d_i^{ij} w_{ij}^2 \leq s_{ij} z_j. \tag{A.5}
\end{align}

2. Given parameters $d$ and a point $(z, x)$, most of the constraints in Proposition 15 are not binding. Thus, it is sufficient to add only the binding constraints, corresponding to one of the pieces of function $g$. For example, if $z_i \geq z_j$ and $d_i^{ij} x_i \geq x_j$, then we add only variables $v_{ij}, q_{ij} \in \mathbb{R}_+$ and constraints
\[
d_i^{ij} v_{ij} \geq d_i^{ij} x_i - x_j, \qquad v_{ij}^2 \leq q_{ij} z_i, \qquad d_i^{ij} q_{ij} + s_j \left( d_j^{ij} - \frac{1}{d_i^{ij}} \right) \leq s_{ij}.
\]
Thus each cut requires two additional variables, two linear constraints and one rotated cone constraint for each $\{i,j\} \in A$. Additionally, if $|d_i^{ij} d_j^{ij} - 1| \leq \varepsilon$, then it suffices to add either (A.4) or (A.5), further reducing the number of variables and constraints needed.
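To make the refinement scheme concrete, the following is a minimal sketch of the initial relaxation (A.2) on toy data, before any inequalities of the form (A.3) are added. The use of cvxpy, the toy signal, and the chain graph are illustrative assumptions and not part of the original text; the iterative refinement would then alternate between solving this relaxation and the separation problem (A.1), appending the resulting cuts.

```python
import cvxpy as cp
import numpy as np

y = np.array([0.0, 1.2, 1.1, 0.0, 0.9])     # noisy signal (toy data)
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]     # chain graph A
n, k, lam = len(y), 3, 1.0
M = np.max(np.abs(y))                        # big-M constant ||y||_inf

x = cp.Variable(n, nonneg=True)
s = cp.Variable(n, nonneg=True)
z = cp.Variable(n)
t = cp.Variable()

constraints = [
    cp.sum(s) + lam * sum(cp.square(x[i] - x[j]) for i, j in edges) <= t,  # (A.2b)
    cp.sum(z) <= k,                                                        # (A.2c)
    z >= 0, z <= 1,                                                        # (A.2f)
    x <= M * z,                                                            # (A.2e)
]
# Perspective constraints (A.2d): x_i^2 <= s_i z_i, written as x_i^2 / z_i <= s_i.
constraints += [cp.quad_over_lin(x[i], z[i]) <= s[i] for i in range(n)]

prob = cp.Problem(cp.Minimize(np.sum(y**2) - 2 * y @ x + t), constraints)
prob.solve()
print("lower bound from the initial perspective relaxation:", prob.value)
```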
B Supplementary results for Chapter 3

B.1 Proof of Proposition 25

Proof of Proposition 25. Notice that for $\lambda = x_1 + x_2 - 1 > 0$, $f(x, y, \lambda, d)$ can be rewritten in the form
\[
f(x, y, \lambda, d) = \frac{1}{\Delta}\, \hat{y}^\top \bar{A}\, \hat{y},
\]
where $\Delta = (d_1 d_2 - 1) x_1 x_2 + x_1 + x_2 - 1 > 0$, $\hat{y}^\top = (\sqrt{d_1}\, y_1, \sqrt{d_2}\, y_2)$ and
\[
\bar{A} = \begin{pmatrix} (d_1 d_2 - 1) x_2 + \lambda & -\sqrt{d_1 d_2}\, \lambda \\ -\sqrt{d_1 d_2}\, \lambda & (d_1 d_2 - 1) x_1 + \lambda \end{pmatrix}.
\]
Observe $\det(\bar{A}) = (d_1 d_2 - 1) \Delta$. Hence,
\[
f(x, y, \lambda, d) = \frac{d_1 d_2 - 1}{\det(\bar{A})}\, \hat{y}^\top \bar{A}\, \hat{y} = (d_1 d_2 - 1)\, \hat{y}^\top A^{-1} \hat{y},
\]
where $A$ is the adjugate of $\bar{A}$, i.e.,
\[
A = \begin{pmatrix} (d_1 d_2 - 1) x_1 + \lambda & \sqrt{d_1 d_2}\, \lambda \\ \sqrt{d_1 d_2}\, \lambda & (d_1 d_2 - 1) x_2 + \lambda \end{pmatrix}.
\]
Note that $A \succeq 0$. By the Schur Complement Lemma, $t \geq (d_1 d_2 - 1)\, \hat{y}^\top A^{-1} \hat{y}$ if and only if
\[
\begin{pmatrix} t/(d_1 d_2 - 1) & \hat{y}^\top \\ \hat{y} & A \end{pmatrix} \succeq 0,
\]
i.e.,
\[
\begin{pmatrix} t/(d_1 d_2 - 1) & \sqrt{d_1}\, y_1 & \sqrt{d_2}\, y_2 \\ \sqrt{d_1}\, y_1 & (d_1 d_2 - 1) x_1 + \lambda & \sqrt{d_1 d_2}\, \lambda \\ \sqrt{d_2}\, y_2 & \sqrt{d_1 d_2}\, \lambda & (d_1 d_2 - 1) x_2 + \lambda \end{pmatrix} \succeq 0,
\]
which is further equivalent to
\[
\begin{pmatrix} t/(d_1 d_2 - 1) & y_1 & y_2 \\ y_1 & (d_1 d_2 - 1) x_1 / d_1 + \lambda / d_1 & \lambda \\ y_2 & \lambda & (d_1 d_2 - 1) x_2 / d_2 + \lambda / d_2 \end{pmatrix} \succeq 0.
\]
The conclusion follows by taking $\lambda = x_1 + x_2 - 1$.
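The matrix identities used in this proof can be verified numerically. The snippet below is a spot-check of the reconstruction above: it confirms $\det(\bar{A}) = (d_1 d_2 - 1)\Delta$, the adjugate identity, and the Schur-complement equivalence on sample values. The specific numbers, and the sign convention chosen for the off-diagonal entry of $\bar{A}$, are assumptions of this sketch.

```python
import numpy as np

d1, d2 = 2.0, 1.5
x1, x2 = 0.7, 0.6
y1, y2 = 0.3, -0.4
lam = x1 + x2 - 1                  # lambda = x1 + x2 - 1 > 0
assert lam > 0 and d1 * d2 > 1

Delta = (d1 * d2 - 1) * x1 * x2 + x1 + x2 - 1
Abar = np.array([[(d1 * d2 - 1) * x2 + lam, -np.sqrt(d1 * d2) * lam],
                 [-np.sqrt(d1 * d2) * lam, (d1 * d2 - 1) * x1 + lam]])
assert abs(np.linalg.det(Abar) - (d1 * d2 - 1) * Delta) < 1e-12

A = np.linalg.det(Abar) * np.linalg.inv(Abar)          # adjugate of Abar
yhat = np.array([np.sqrt(d1) * y1, np.sqrt(d2) * y2])
q = (d1 * d2 - 1) * yhat @ np.linalg.inv(A) @ yhat     # (d1 d2 - 1) yhat' A^{-1} yhat
assert abs(q - yhat @ Abar @ yhat / Delta) < 1e-12     # equals (1/Delta) yhat' Abar yhat

# Schur complement: t >= q  iff  [[t/(d1 d2 - 1), yhat'], [yhat, A]] is PSD.
for t in (q - 0.1, q, q + 0.1):
    M = np.block([[np.array([[t / (d1 * d2 - 1)]]), yhat[None, :]],
                  [yhat[:, None], A]])
    psd = bool(np.all(np.linalg.eigvalsh(M) >= -1e-9))
    assert psd == (t >= q - 1e-12)
print("det, adjugate, and Schur-complement identities verified on sample data")
```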
C Supplementary results for Chapter 4

C.1 On computational experiments with branch-and-bound

As mentioned in §4.3.4, the extended formulation (4.21) did not produce good results when used in conjunction with the CPLEX branch-and-bound solver. To illustrate this phenomenon, Table 7.1 shows details on the performance of the solver in a single representative instance; similar behavior was observed in all instances tested. The table shows, from left to right: the time required to solve the convex relaxation via interior point methods and the lower bound produced by this relaxation (note that this is not part of the branch-and-bound algorithm); the time required to process the root node of the branch-and-bound tree and the corresponding lower bound obtained; and the time used to process the branch-and-bound tree, the number of branch-and-bound nodes explored, and the lower bound found after processing the tree (we set a time limit of 10 minutes).

Tab. 7.1: Performance of the CPLEX solver on an instance with n = 500 and k = 10. Default settings are used, and a time limit of 10 minutes is set. The optimal objective value in this particular instance is 1.47.

                    Convex relaxation     Root node          Branch-and-bound
  Method            Time(s)    LB         Time(s)   LB       Time(s)   Nodes    LB
  Without (4.21)    0.2        1.09       3.5       1.20     7.7       460      1.47
  With (4.21)       2.4        1.30       45.9      0.90     600       4,073    0.99

Our a priori expectations were as follows: using inequalities (4.21) should result in harder convex relaxations, and thus fewer nodes explored within a given time limit; on the other hand, due to the improved relaxations and higher-quality lower bounds, the algorithm should be able to prove optimality after exploring substantially fewer nodes. Thus, there should be a tradeoff between the number of nodes to be explored and the time required to process each node. From Table 7.1, we see that there is no such tradeoff in practice.

The performance of the solver without inequalities (4.21) is as expected. While just solving the convex relaxation via interior point methods requires 0.2 seconds, there is an overhead of 3 seconds to process the root node due to preprocessing, cuts, heuristics and additional methods used by the solver (and the quality of the lower bound at the root node is slightly improved as a result). Then, after an additional 4 seconds used to explore 460 nodes, optimality is proven.

The performance of the solver using inequalities (4.21) defied our expectations. In theory, the more difficult convex relaxation can be solved with an overhead of 2 seconds, resulting in a root improvement of $(1.30 - 1.09)/(1.47 - 1.09) = 55.3\%$ (better than the one achieved by default CPLEX). In practice, the overhead is 40 seconds and results in a degradation of the relaxation; that is, the lower bound proved at the root node is worse than the natural convex relaxation of the problem without inequalities (4.21). From that point on, the branch-and-bound progresses slowly due to the more difficult relaxations, and the lower bounds are worse throughout the tree. Even after the time limit of 10 minutes and over 4,000 nodes explored, the lower bound proved by the algorithm is still worse than the natural convex relaxation.

While we cannot be sure about the exact reason for this behavior, we now make an educated guess. Most conic quadratic branch-and-bound solvers such as CPLEX do not use interior point methods to solve relaxations at each node of the branch-and-bound tree, but rather rely on polyhedral outer approximations in an extended space to benefit from the warm-start capabilities of the simplex method. We conjecture that while formulation (4.21) might not be particularly challenging to solve via interior point methods, it might be difficult to construct a good-quality outer approximation of reasonable size. If so, then the actual relaxation used by the solver is possibly a poor-quality linear outer approximation of the feasible region induced by (4.21), and is still difficult to solve, resulting in the worst of both worlds. We point out that the second author has encountered similar counterintuitive behavior with solvers (other than CPLEX) based on linear outer approximations [16], suggesting that this phenomenon indeed stems from the implementation of outer approximations in general, rather than from a particular quirk specific to CPLEX.

C.2 Proof of Theorem 5

Proof of Theorem 5. Denote $\mathcal{S} = \bigcup_{\mathcal{I} : |\mathcal{I}| \leq k} V(\mathcal{I}) \cup R$. Since $V(\mathcal{I}) \subseteq Q$ for all $\mathcal{I}$, $\mathcal{S} \subseteq Q$. Thus,
\[
\delta^*(\cdot\,; \operatorname{conv}(\mathcal{S})) = \delta^*(\cdot\,; \mathcal{S}) \leq \delta^*(\cdot\,; Q).
\]
Due to Proposition 47, it remains to prove the opposite direction $\delta^*(\cdot\,; \mathcal{S}) \geq \delta^*(\cdot\,; Q)$. Since $\operatorname{rank}(f) \leq k$, there exist $g(\cdot)$, $A \in \mathbb{R}^{k \times n}$ and $c$ such that $f(x) = g(Ax) + c^\top x$. Taking any $\alpha = (\alpha_t, \alpha_x, \alpha_z) \in \mathbb{R}^{2n+1}$, if $\alpha_t > 0$, then $\delta^*(\alpha; \mathcal{S}) = \delta^*(\alpha; Q) = +\infty$ because $t$ is unbounded from above. We now assume $\alpha_t \leq 0$ and define $\ell(x, z) \overset{\text{def}}{=} \alpha_t g(Ax) + (\alpha_x + \alpha_t c)^\top x + \alpha_z^\top z$. Then $\delta^*(\alpha; \mathcal{S}) = \max\{\ell(x, z) : (t, x, z) \in \mathcal{S}\}$ and $\delta^*(\alpha; Q) = \max\{\ell(x, z) : (t, x, z) \in Q\}$. Taking any $(\bar{t}, \bar{x}, \bar{z}) \in Q$, consider the following linear program
\begin{align}
v^* \overset{\text{def}}{=} \max_{x \in \mathbb{R}^n}\;& (\alpha_x + \alpha_t c)^\top x \nonumber\\
\text{s.t. }& Ax = A\bar{x} \nonumber\\
& \bar{x}_i x_i \geq 0 \quad \forall i \in \operatorname{supp}(\bar{z}) \tag{C.1}\\
& x_i = 0 \quad \forall i \in [n] \setminus \operatorname{supp}(\bar{z}). \nonumber
\end{align}
Note that linear program (C.1) is always feasible because $\bar{x}$ is a feasible solution to (C.1). What is more, every feasible solution (or direction) $x$ of (C.1) satisfies $x_i(1 - \bar{z}_i) = 0$. If $v^* = +\infty$, then there exists a feasible direction $d$ such that $Ad = 0$ and $(\alpha_x + \alpha_t c)^\top d > 0$. It implies that for any $\rho \geq 0$, $f(\rho d) = g(\rho A d) + \rho c^\top d = g(0) + \rho c^\top d = \rho c^\top d$. Taking $z^d \in \bar{Z}$ such that $\operatorname{supp}(\bar{z}) \subseteq \operatorname{supp}(z^d)$, it is easy to see that $d_i(1 - z_i^d) = 0$ for all $i \in [n]$. Hence, $(\rho c^\top d, \rho d, z^d) \in R \subseteq Q$. Furthermore,
\[
\ell(\rho d, z^d) = (\alpha_x + \alpha_t c)^\top d\, \rho + \alpha_z^\top z^d \to +\infty \quad (\text{as } \rho \to +\infty),
\]
which implies $\delta^*(\alpha; \mathcal{S}) = \delta^*(\alpha; Q) = +\infty$.

If $v^*$ is finite, then, thanks to $\operatorname{rank}(A) \leq k$, there exists an optimal solution $x^*$ to (C.1) such that $|\operatorname{supp}(x^*)| \leq k$.
Since $A\bar{x} = Ax^*$,
\[
t^* \overset{\text{def}}{=} \bar{t} + c^\top (x^* - \bar{x}) \geq f(\bar{x}) + c^\top (x^* - \bar{x}) = g(A\bar{x}) + c^\top \bar{x} + c^\top (x^* - \bar{x}) = g(Ax^*) + c^\top x^* = f(x^*).
\]
Setting $\mathcal{I}^* = \operatorname{supp}(x^*)$, one can deduce that $(t^*, x^*) \in X(\mathcal{I}^*)$, and $\bar{z} \in Z(\mathcal{I}^*)$ inasmuch as $\operatorname{supp}(x^*) \subseteq \operatorname{supp}(\bar{z})$. It implies that $(t^*, x^*, \bar{z}) \in V(\mathcal{I}^*) \subseteq \mathcal{S} \subseteq Q$. What is more, $(\alpha_x + \alpha_t c)^\top x^* \geq (\alpha_x + \alpha_t c)^\top \bar{x}$ indicates that $\ell(x^*, \bar{z}) \geq \ell(\bar{x}, \bar{z})$. Namely, for an arbitrary point $(\bar{t}, \bar{x}, \bar{z}) \in Q$, there always exists a point in $\mathcal{S}$ whose objective value under $\ell(\cdot)$ is at least as large. Therefore, $\delta^*(\alpha; \mathcal{S}) \geq \delta^*(\alpha; Q)$. This finishes the proof of the main conclusion. The last statement of the theorem follows from the fact that if $\{x \in \mathbb{R}^n : Ax = 0,\; x_i \geq 0\ \forall i \in \mathcal{I}_+\} = \{0\}$, then the feasible region of (C.1) is bounded and thus $v^*$ is always finite.
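The key step above — that a linear program whose equality system has $k$ rows admits an optimal basic solution with at most $k$ nonzero entries — can be illustrated with a simplified stand-in for (C.1), with the sign constraints replaced by plain nonnegativity. The random data and the use of scipy below are illustrative assumptions, not part of the original text.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, k = 30, 4
A = rng.standard_normal((k, n))
b = A @ rng.uniform(0, 1, size=n)    # b = A @ x0 for some feasible x0 >= 0
c = rng.uniform(0.1, 1.0, size=n)    # c >= 0 keeps the LP bounded below on {x >= 0}

# Dual simplex returns a basic (vertex) solution, which has at most k nonzeros
# because its positive columns of A must be linearly independent.
res = linprog(c, A_eq=A, b_eq=b, bounds=(0, None), method="highs-ds")
assert res.status == 0
support = int(np.sum(res.x > 1e-9))
print(f"optimal basic solution has {support} nonzeros (k = {k})")
assert support <= k
```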
Abstract
Operational problems involving discrete structures naturally arise in a vast number of applications which provide the impetus for the development of mixed-integer programming (MIP). In this dissertation, we are mainly concerned with the theory of mixed-integer nonlinear programming (MINLP) with indicator variables as well as their applications in other fields including machine learning, statistics, finance, and revenue management.
In parametric learning problems, a regularizer is often added to the loss function to control the overfitting of the model. If the regularizer is convex, then the resulting optimization problem is often easy to solve but may result in estimators with inferior statistical properties. On the other hand, if the regularizer is chosen to be nonconvex, then the problem is hard to solve to global optimality, and the performance of the resulting local minimizer or stationary point relies heavily on the initialization of the algorithm. A convex relaxation of MINLP formulations provides another way to regularize the model. It can be perceived as an adaptive nonconvex regularizer that nevertheless keeps the overall problem convex and thus computationally tractable. In Chapter 2, we study a regression problem with sparsity priors which can be naturally modeled as quadratic optimization with "ℓ0-norm" constraints. Since such problems are hard to solve, the standard approach is to tackle their convex surrogates with the ℓ1-norm regularizer. Alternatively, we propose new iterative (convex) conic quadratic relaxations that exploit not only the "ℓ0-norm" terms, but also the fitness and smoothness functions. These stronger relaxations lead to significantly better estimators than ℓ1-norm approaches and also allow one to utilize affine sparsity priors. In addition, the parameters of the model and the resulting estimators are easily interpretable. Experiments with a tailored Lagrangian decomposition method indicate that the proposed iterative convex relaxations yield solutions within 1% of the exact ℓ0 approach, and can tackle instances with up to 100,000 variables in under one minute.
In Chapter 3, we study the convex quadratic optimization problem with indicator variables (CQI). We first prove the equivalence between the optimal perspective reformulation and Shor's SDP formulation for the CQI. These two formulations have been studied extensively in the literature. What is more, for the 2 × 2 case, we describe the convex hull of the epigraph in the original space of variables, and also give a conic quadratic extended formulation. Then, using the convex hull description for the 2 × 2 case as a building block, we derive an extended SDP relaxation for the general case. This new formulation is stronger than other SDP relaxations proposed in the literature for the problem, including Shor's SDP relaxation, the optimal perspective relaxation as well as the optimal rank-one relaxation. Computational experiments indicate that the proposed formulations are quite effective in reducing the integrality gap of the optimization problems.
In Chapter 4, we study the mixed-integer epigraph of a low-rank convex function with non-convex indicator constraints, which are often used to impose logical constraints on the support of the solutions. Extended formulations describing the convex hull of such sets can easily be constructed via disjunctive programming, although a direct application of this method often yields prohibitively large formulations, whose size is exponential in the number of variables. In that chapter, we propose a new disjunctive representation of the sets under study, which leads to compact formulations with size exponential in the rank of the function, but polynomial in the number of variables. Moreover, we show how to project out the additional variables for the case of rank-one functions, recovering or generalizing known results for the convex hulls of such sets (in the original space of variables).
In Chapter 5, we study multiple-ratio fractional 0-1 programs, a broad class of NP-hard combinatorial optimization problems. In particular, under some relatively mild assumptions we provide a complete characterization of the conditions that ensure that a single-ratio function is submodular. Then we illustrate our theoretical results with the assortment optimization and facility location problems, and discuss practical situations that guarantee submodularity in the considered application settings. In such cases, near-optimal solutions for multiple-ratio fractional 0-1 programs can be found via simple greedy algorithms. In Chapter 6, we summarize the techniques for convexifying mixed-integer sets. In Chapter 7, we conclude the dissertation and discuss further promising research directions.