Applications of Wasserstein Distance in Distributionally Robust Optimization

by

Ye Wang

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Industrial and Systems Engineering)

August 2018

Copyright 2018 Ye Wang

Acknowledgements

It has been an unforgettable journey pursuing a doctoral degree. Without the guidance and assistance of many others, I would not have reached this point.

First of all, my sincere appreciation goes to my adviser, Professor John Gunnar Carlsson. I feel lucky to have met such a brilliant, hardworking, and considerate advisor. John always brought up exciting research ideas, and whenever I got stuck in my research, meeting with him would give me new directions to work on. He also respects my thoughts and choices, and he provided wonderful opportunities to help me achieve my goals. To me, John is the one who was always there to support me unconditionally. My Ph.D. years became very joyful working with John.

I would like to thank my committee members: Professor Ali Abbas, Professor Phebe Vayanos, and Professor Mahdi Soltanolkotabi. Their comments and advice broadened my vision and inspired me with potential research directions. I also appreciate the tremendous effort that the faculty of the University of Southern California put into teaching me and elevating me to a higher level of thoughtfulness and understanding.

I would also like to thank my labmates, Mehdi Behroozi, Xiangfei Meng, and Siyuan Song, for the stimulating discussions and for all the fun we have had over the last few years. I appreciate the aid and support of all of my friends who have been with me through both happiness and hardship, who made my Ph.D. years more interesting and joyful, and who gave me a wonderful sense of exhilaration.

Last but not least, I would like to thank my parents, whose unwavering love, unconditional support, and constant encouragement helped me through the ups and downs of my graduate study. I also would like to express my deepest gratitude to my husband, Xinyuan Qiao, who is the one who brings happiness to my life. I could not have done any of this without him.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Travelling Salesman Problem
  1.2 Entropy
  1.3 Highest Density Region
    1.3.1 Drone Surveillance Problem
    1.3.2 Drone Districting Problem
    1.3.3 Defibrillator Placement Problem
  1.4 Notational Conventions
Chapter 2: Literature Review
  2.1 Stochastic Optimization
  2.2 Distributionally Robust Optimization
  2.3 Distributionally Robust Optimization with Wasserstein Distance
Chapter 3: Preliminaries
Chapter 4: Distributionally Robust TSP
  4.1 Worst-case Distributions with Wasserstein Distance Constraints
    4.1.1 Structure of the Worst-case Distribution
  4.2 Finding the Worst-case Distribution Efficiently
    4.2.1 Variations and Extensions
Chapter 5: Entropy Maximization
  5.1 Structure of the Solution to (5.1)
  5.2 Solving (5.1) Efficiently
Chapter 6: Highest Density Region Optimization
  6.1 Solving (6.6) Efficiently
  6.2 Finding the Worst-case Distribution
Chapter 7: Computational Experiments
  7.1 Selecting the Distance Parameter t
  7.2 Varying Values of n and t
  7.3 A Districting Experiment with Geospatial Data
    7.3.1 Distance-constrained Neighborhood Travelling Salesman Problem
    7.3.2 Districting Strategy
    7.3.3 Districting Criteria
    7.3.4 Results
Chapter 8: Conclusion
Reference List
Appendix A: Proof of Lemma 4
Appendix B: Proof of Theorem 8
Appendix C: Probabilistic analysis of the capacitated VRP
Appendix D: Proof of Theorem 12

List of Tables

7.1 Capture probabilities for 7 regions and 2 partitioning criteria

List of Figures

1.1 Illustration of the Wasserstein distance
1.2 Drone surveillance problem
1.3 Illustration of HDR
1.4 One-dimensional HDR and two-dimensional HDR
2.1 Clustered point sets
4.1 Optimal solution for distributionally robust TSP
7.1 Threshold values of t for significance levels between 0 and 1
7.2 Varying values of n and t
7.3 A 3-dimensional LIDAR scan
7.4 Demonstration of the tour of a drone
7.5 Proof of Lemma 23
7.6 Demonstration of the partition method
7.7 Six metropolitan regions
7.8 Partitions of seven metropolitan regions
B.1 Proof of Theorem 8
D.1 Proof of Theorem 12
Abstract

Recent research on formulating and solving distributionally robust optimization problems has produced many different approaches for describing one's ambiguity set, such as moment-based approaches and statistical-distance-based approaches. In this dissertation, we use the Wasserstein distance to characterize the ambiguity set of distributions, which allows us to circumvent the overestimation that commonly arises when other procedures are used, such as fixing the center of mass and the covariance matrix of the distribution. We consider distributionally robust versions of the Euclidean travelling salesman problem, the entropy maximization problem, and the highest density region optimization problem: as input, we are given a compact, contiguous planar region $\mathcal{R}$ and a realization of sampled demand points in that region, and our objective is to construct a probability distribution on $\mathcal{R}$ that is sufficiently "close" to the empirical distribution consisting of the sampled points and is as "spread out" as possible, in the sense that the asymptotic length of a TSP tour of points drawn from that distribution should be as large as possible, the entropy of the distribution should be as large as possible, and the coverage probability over the highest density region should be as small as possible. Numerical experiments confirm that our approach is useful in a decision support tool for dividing a territory into service districts for a fleet of aerial vehicles when limited data is available.

Chapter 1: Introduction

One of the most complex factors that arises in formulating and solving robust optimization problems is the difficulty of describing one's ambiguity set in a way that is both useful and mathematically tractable. Recent works have seen many different approaches for describing these sets, such as moment-based approaches [20, 25] and statistical-distance-based approaches [12, 35, 4, 5]. The choice of one's ambiguity set often yields qualitative insights into which demand patterns affect the outcome most significantly.

In this dissertation, we consider distributionally robust versions of the Euclidean travelling salesman problem, the entropy maximization problem, and the highest density region optimization problem: as input, we are given a compact, contiguous planar region $\mathcal{R}$ and a realization of sampled demand points in that region, and our objective is to construct a probability distribution on $\mathcal{R}$ that is sufficiently "close" to the empirical distribution consisting of the sampled points and is as "spread out" as possible, in the sense that the asymptotic length of a TSP tour of points drawn from that distribution should be as large as possible, the entropy of the distribution should be as large as possible, and the coverage probability over the highest density region should be as small as possible. In order to characterize our ambiguity set of distributions, we use a statistical metric called the Wasserstein distance, also known as the earth mover's distance or the Kantorovich metric. Conceptually, the Wasserstein distance is simple and intuitive: if we visualize two probability distributions $\mu_1$ and $\mu_2$ as two piles of equal amounts of sand, then the Wasserstein distance between them is simply the minimum amount of work needed to move one pile so that it takes the shape of the other, as suggested in Figure 1.1a.
A particularly attractive feature of the Wasserstein distance, not present in many other statistical metrics, is the ability to directly compare a discrete distribution with a continuous distribution, as illustrated in Figures 1.1e-1.1g. In addition, because the Wasserstein distance is a true metric, the set of all distributions within a certain distance of a reference distribution is a convex set that turns out to admit a simple representation.

1.1 Travelling Salesman Problem

The travelling salesman problem (TSP) asks the following question: "Given a list of customers and the distances between each pair of customers, what is the shortest possible route that visits each customer and returns to the origin customer?" It is an NP-hard problem in combinatorial optimization that is important in operations research and theoretical computer science. The vehicle routing problem (VRP) is a generalization of the travelling salesman problem, which asks: "What is the optimal set of routes for a fleet of vehicles to traverse in order to deliver to a given set of customers?"

Recent research on the robust and stochastic travelling salesman problem and the vehicle routing problem has seen many different approaches for describing the region of ambiguity, such as taking convex combinations of observed demand vectors or imposing constraints on the moments of the spatial demand distribution. One approach that has been used outside the transportation sector is the use of statistical metrics that describe a distance function between two probability distributions. In this dissertation, we consider a distributionally robust version of the Euclidean travelling salesman problem in which we compute the worst-case spatial distribution of demand against all distributions whose Wasserstein distance to an observed demand distribution is bounded from above. This constraint allows us to circumvent the overestimation that commonly arises when other procedures are used, such as fixing the center of mass and the covariance matrix of the distribution.

Figure 1.1: Figure 1.1a shows a Wasserstein distance problem between two univariate distributions $\mu_1$ and $\mu_2$. Figures 1.1b-1.1d show that a Wasserstein mapping can be thought of as an infinite-dimensional generalization of a bipartite matching; here $\mu_1$ and $\mu_2$ are shown in 1.1b and 1.1c, and 1.1d shows a bipartite matching between a large number of samples collected from $\mu_1$ and $\mu_2$. Figures 1.1e-1.1g show an interpretation of a Wasserstein distance problem when $\mu_1$ is a smooth density and $\mu_2$ is atomic. The two distributions are shown in 1.1e, and 1.1f shows the solution to an assignment problem between a large number of samples from $\mu_1$ and the atomic distribution $\mu_2$; a side consequence is that the Lagrange multipliers of this assignment induce a partition of the region $\mathcal{R}$, each of whose cells is associated with one of the elements of $\mu_2$. Figure 1.1g shows this partition together with the atomic distribution $\mu_2$; each cell contains $1/n$ of the mass of the density, which is to be transported to the point contained within it. By previous results [16], these dashed curves are computationally easy to compute. If we let $R_i$ denote the cell associated with each point $x_i$ in the atomic distribution and $f$ denote the density, then the Wasserstein distance is $\sum_i \iint_{R_i} f(x)\,\|x-x_i\|\,dA$.
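The matching interpretation described in the caption above can be reproduced in a few lines of code. The following minimal sketch (our own illustration; the sample sizes and distributions are arbitrary choices) computes the Wasserstein distance between two discrete distributions of $n$ equal-mass points via an optimal assignment:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein_matching(samples1, samples2):
    """Wasserstein (earth mover's) distance between two discrete
    distributions, each given as n points with equal mass 1/n.

    With equal-size, equal-weight supports, the optimal transport plan
    is a bipartite matching, so the distance is the optimal assignment
    cost divided by n."""
    n = len(samples1)
    # Pairwise Euclidean costs ||x - y|| between the two supports.
    cost = np.linalg.norm(samples1[:, None, :] - samples2[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)   # optimal bipartite matching
    return cost[rows, cols].sum() / n          # normalize by the 1/n masses

rng = np.random.default_rng(0)
mu1 = rng.uniform(size=(200, 2))                     # samples from a uniform density
mu2 = rng.normal(loc=0.5, scale=0.1, size=(200, 2))  # samples from a clustered density
print(wasserstein_matching(mu1, mu2))
```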
Our notion of distributional robustness relies on the following famous theorem, originally stated in [3] and further developed in [48], which relates the length of a TSP tour of a set of points to the distribution from which they were sampled:

Theorem 1 (Beardwood-Halton-Hammersley (BHH) Theorem). Suppose that $X = \{X_1, X_2, \ldots\}$ is a sequence of random points i.i.d. according to a probability density function $f(\cdot)$ defined on a compact planar region $\mathcal{R}$. Then with probability one, the length $\mathrm{TSP}(X)$ of the optimal travelling salesman tour through $X$ satisfies
\[
\lim_{N\to\infty} \frac{\mathrm{TSP}(X)}{\sqrt{N}} \;=\; \beta \iint_{\mathcal{R}} \sqrt{f_c(x)}\,dA,
\]
where $\beta$ is a constant and $f_c(\cdot)$ represents the absolutely continuous part of $f(\cdot)$.

Note in particular that $\beta$ does not depend on $f(\cdot)$. It is additionally known that $0.6250 \le \beta \le 0.9204$ and estimated that $\beta \approx 0.7124$; see [2, 3].
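Theorem 1 is easy to probe numerically. The sketch below (our own illustration) draws uniform samples on the unit square, for which the BHH integral equals 1, and tracks the tour length divided by $\sqrt{N}$; since exact TSP solving is NP-hard, a greedy nearest-neighbour tour stands in for $\mathrm{TSP}(X)$, so the ratio settles somewhat above $\beta \approx 0.7124$ while still exhibiting the $\sqrt{N}$ scaling:

```python
import numpy as np

def nn_tour_length(points):
    """Closed greedy nearest-neighbour tour; an upper bound on the optimal
    TSP length, used here only as a cheap stand-in for TSP(X)."""
    unvisited = list(range(1, len(points)))
    cur, total = 0, 0.0
    while unvisited:
        d = np.linalg.norm(points[unvisited] - points[cur], axis=1)
        j = int(np.argmin(d))
        total += d[j]
        cur = unvisited.pop(j)
    return total + np.linalg.norm(points[cur] - points[0])

rng = np.random.default_rng(1)
for N in (100, 400, 1600):
    ratio = np.mean([nn_tour_length(rng.uniform(size=(N, 2))) / np.sqrt(N)
                     for _ in range(5)])
    # For the uniform density on the unit square the BHH integral equals 1,
    # so TSP(X)/sqrt(N) -> beta ~ 0.7124; the greedy tour stabilizes at a
    # somewhat larger constant, but the sqrt(N) scaling is already visible.
    print(N, round(float(ratio), 3))
```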
surveillance resources. In practice, we do not know the true underlying distribution of crimes. In our problem, we are concerned with robustness in the distributional sense [13]: we seek the worse case spatial distribution of crimes, i.e. the distribution that is as spread out as possible, and thereby the percentage of crimes that are captured by drones is as low as possible, while remaining consistent with the historical crime data. We will first introduce Highest Density Region (HDR) and then illustrates how to use the concept to formulate our problem. Many statistical methods involve summarizing a probability distribution by a region of the sample space covering a specific probability. For example, reporting a prediction interval for a future observation involves an implicit summary of the underlying distribution. Highest Density Region (HDR) is often the most appropriate subset to use to summarize a probability distribu- tion since it is capable of exposing the most striking features of the data than most alternative methods[29]. Asamotivatingexample, Figure1.3showsfive 75%probabilityregionsforamixture density comprising anN(0, 1) density and anN(4, 1) density with weights 0.7 and 0.3 respectively. Only HDR shows the bimodality. The definition of HDR is as following. The same definition also applies for discrete variables with the density function replaced by he probability mass function. 6 Figure 1.3: Five different 75% probability regions for a mixture density comprising an N(0, 1) density and an N(4, 1) density with weights 0.7 and 0.3 respectively. Only HDR shows the bimodality. Here, c q denotes the qth quantile, μ denotes the mean, and σ denotes the standard deviation of the density. Definition 2 (HDR). Let f(x) be the density function of a random variable X. Then the 100(1−α)% HDR is the subset R(f α ) of the sample space of X such that R(f α ) ={x :f(x)≥f α } where f α is the largest constant such that Pr(X∈R(f α ))≥ 1−α. HDR occupies the smallest possible volume in the sample space with a given coverage prob- ability. Figure 1.4a shows a 95% HDR for the one-dimensional case and Figure 1.4b shows a 95% HDR for the two-dimensional case. In our problem, instead of choosing the smallest pos- sible sub-region to cover a specific probability as a two-dimensional HDR, we want to find the sub-region with a specific area that covers the largest possible probability. We search through all distributions whose Wasserstein distance to the empirical distribution is sufficiently small to get the smallest of the largest coverage probabilities over the sub-regions. 7 (a) (b) Figure 1.4: Figure 1.4a shows a one-dimensional 95% HDR and Figure 1.4b shows a two- dimensional 95% HDR. 1.3.2 Drone Districting Problem Another application of the distributionally robust HDR optimization problem is in the design of districts for allocating a fleet of drones to visit a collection of customers when demand is uncertain. After districting, one drone would be assigned to a fixed district. Such strategy is desirable for many practical reasons: for example, they provide a simple rule for assigning destination points to drones, they decompose the original problem into independent sub-problems, and they enable one to make high-level strategic decisions in the absence of complete information. In fact, for many problems, it can be shown that there exist simple districting strategies that are asymptotically optimal as the number of demand points becomes large [39, 54]. 
1.3.2 Drone Districting Problem

Another application of the distributionally robust HDR optimization problem is the design of districts for allocating a fleet of drones to visit a collection of customers when demand is uncertain. After districting, each drone is assigned to a fixed district. Such strategies are desirable for many practical reasons: for example, they provide a simple rule for assigning destination points to drones, they decompose the original problem into independent sub-problems, and they enable one to make high-level strategic decisions in the absence of complete information. In fact, for many problems it can be shown that there exist simple districting strategies that are asymptotically optimal as the number of demand points becomes large [39, 54].

The problem of designing such districts is a foundational one in the continuous approximation literature, as can be seen in Chapter 4 of the seminal book [19], for example. Our problem is motivated by an industrial affiliate, a major North American mapping company, that uses a districting strategy to assign vehicles to scan metropolitan regions. This kind of geographic localization, as opposed to explicitly computing turn-by-turn directions for vehicles, is a common strategy that is also used by Google Street View, for example [27]:

    [Google spokesman Larry] Yu initially stated drivers were given specific routes to follow. But a Street View driver, who asked to remain anonymous for employment reasons, said he was simply told to drive around Sonoma County and collect images. Yu retracted his assertion after learning of the driver's statement.

Rather than collecting street-level photographs, our affiliate assembles full 3-dimensional point clouds of buildings and other city structures (among many other data sources, e.g. temperature, radio signals, and sunlight), which necessitates more control over driver behavior than that described by the quote above. The most common way that uncertainty is represented is by assuming that demand follows a known probability density function (which is often further assumed to be uniform); this density then informs the districting decision in some way. We instead design districts in a data-driven fashion: rather than having full distribution information, we assume that we are given only a collection of data samples that inform our districting decision.

1.3.3 Defibrillator Placement Problem

Sudden cardiac arrest is a leading cause of death in North America and is responsible for over 400,000 deaths each year [28]. An automated external defibrillator (AED) is a lightweight, portable device that delivers an electric shock through the chest to the heart. The shock can potentially stop an irregular heartbeat (arrhythmia) and allow a normal rhythm to resume following sudden cardiac arrest. Successful treatment of cardiac arrest is extremely time sensitive, and use of an AED where possible significantly increases the probability of survival. Placement of AEDs in public locations can improve survival by enabling bystanders to treat victims of cardiac arrest prior to the arrival of emergency medical responders, thus shortening the time between collapse and treatment.

In this dissertation, we consider a defibrillator placement problem with a limited number of AEDs, the goal being to ensure their accessibility in the event of an out-of-hospital cardiac arrest emergency. We have public locations such as malls, coffee shops, restaurants, and subway stations, and historical data describing likely cardiac arrest incidents over them. Since we have a limited number of AEDs, we are not able to place them all over the region.
In our problem, we are concerned with robustness in the distributional sense [13]: we seek the worse case spatial distribution of cardiac arrests, i.e. the distribution that is as spread out as possible, and thereby the percentage of cardiac arrests that are resolved by AEDs is as low as possible, while remaining consistent with the historical cardiac arrest data. In other word, we want to find the sub-region with a specific area that covers the largest possible probability. We search through all distributions whose Wasserstein distance to the empirical distribution is sufficiently small to get the smallest of the largest coverage probabilities over the sub-regions. 1.4 Notational Conventions Our notational conventions throughout this dissertation are as follows: integrals over regions of arbitrary dimension are denoted with the triple integral t dV, and integrals over regions inR 2 are denoted with the double integral s dA. The diameter of a regionR, denoted diam(R), is the largest possible distance between two points inR, sup x,y∈R kx−yk. The vector consisting of all 1’s is written e, whose dimension will always be clear from context, and the indicator function of a particular condition and the Dirac delta function are written as 1(·) and δ(·) respectively. The Wasserstein distance between two distributions is writtenD(·,·) and is defined in the next 10 section. Finally, for any univariate function f(x), we say that f(x) ∈ o(g(x)) as x → ∞ if lim x→∞ f(x)/g(x) = 0. 11 Chapter 2 Literature Review 2.1 Stochastic Optimization Optimization models are often used in practice to guide decision making. In many of these models there exist parameters that need to be specified or estimated. For instance, given the following convex optimization formulation minimize x∈X h(x,ξ), whereX is a convex set of feasible solutions andh(x,ξ) is a convex cost function inx that depends on some parameter ξ∈ R m , it is often the case that at the time of optimizing, the parameters have not yet been fully resolved. If we represent the uncertainty in ξ through a distribution f ξ , we can instead resort to minimizing the expected cost, which leads to the following stochastic optimization formulation minimize x∈X E ξ [h(x,ξ)], where the expectation is taken with respect to the random parameter ξ. See [47] for a complete study of stochastic optimization. Unfortunately, even when the stochastic optimization problem is convex, in order to solve it one must often resort to Monte Carlo approximations, which can be computationally challenging (see [46]). Another major drawback is that we assume the underlying 12 true distribution f ξ is known. Even if multiple realizations of ξ are observed, f ξ still may not be known exactly, while use of a distribution different fromf ξ may sometimes result in bad decisions. To address these issues, distributionally robust optimization (DRO) was introduced[45]. 2.2 Distributionally Robust Optimization In Distributionally Robust Optimization (DRO), after defining a setD of possible probability distributions that is assumed to include the true f ξ , the objective function is reformulated with respect to the worst case expected cost over the choice of a distribution in this set, which leads to the following distributionally robust optimization formulation minimize x∈X max f ξ ∈D E ξ [h(x,ξ)], whereD is often called an ambiguity set. The ambiguity setD is a key ingredient of any distributionally robust optimization model. 
A good ambiguity set should be rich enough to contain the true data-generating distribution with high confidence. On the other hand, the ambiguity set should be small enough to exclude pathological distributions, which would incentivize overly conservative decisions. The ambiguity set should also be easy to parameterize from data, and ideally it should facilitate a tractable reformulation of the distributionally robust optimization problem as a structured mathematical program that can be solved with off-the-shelf optimization software.

By far the most common way to define the ambiguity set in distributionally robust optimization is to use the support and the first and second moments of the sample distribution [20, 41, 55], and the papers [18, 25] additionally make use of bounds on "directional deviation measures" that isolate stochastically independent components. One major drawback of this method, which motivates our present work, is its inability to respond to clustering, or even mere multi-modality, in the data points. For example, Figure 2.1 shows a data set that is very clustered, and therefore whose TSP tour should be short, whose entropy should be small, and whose largest coverage probability over fixed-area sub-regions should be large relative to (for example) the uniform distribution. However, this clustered data set actually has precisely the same mean and covariance matrix as the uniform distribution; such an approach therefore frequently leads to an over-conservative solution (or, more generally, a solution that is not faithful to the true unknown demand distribution), even when a large number of samples is available. This over-conservatism is actually noted in Figure 10 of [15].

Figure 2.1: The above point sets in the unit square are extremely clustered, and one would expect that their TSP tour should be short, their entropy should be small, and their cumulative density over the highest density region should be large. However, because their sample mean and covariance matrix are the same as those of the uniform distribution, any robust methodology that uses only mean and covariance information will fail to recognize the clustering, thereby incurring significant overestimation.
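The phenomenon in Figure 2.1 is easy to reproduce. The sketch below (our own construction) generates tightly clustered points and then applies an affine correction so that the sample mean and covariance exactly match those of the uniform distribution on the unit square, namely mean $(1/2,1/2)$ and covariance $I/12$; any method that consults only these moments cannot tell the two data sets apart. (The correction can push a few points slightly outside the square, which does not affect the point being made.)

```python
import numpy as np

rng = np.random.default_rng(3)

# Tightly clustered points in the unit square.
centers = rng.uniform(0.2, 0.8, size=(4, 2))
pts = np.vstack([c + 0.01 * rng.standard_normal((100, 2)) for c in centers])

# Affinely re-standardize so the sample mean and covariance exactly match
# those of the uniform distribution on the unit square (mean 1/2, cov I/12).
mean, cov = pts.mean(axis=0), np.cov(pts.T)
whiten = np.linalg.inv(np.linalg.cholesky(cov))   # cov = L L^T, whiten = L^{-1}
target = np.linalg.cholesky(np.eye(2) / 12.0)
pts = (pts - mean) @ whiten.T @ target.T + 0.5

print(pts.mean(axis=0))   # ~ [0.5, 0.5], same as uniform
print(np.cov(pts.T))      # ~ [[1/12, 0], [0, 1/12]], same as uniform
```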
In order to remedy this, we propose the use of the Wasserstein distance as a means of defining the ambiguity set of demand distributions. The Wasserstein distance is very commonly used in machine learning and statistics [11, 30, 40, 43], and is also mentioned in the context of robust optimization in [21] for its relationship with the Prokhorov metric. To our knowledge, the first direct applications of the Wasserstein distance to optimization problems occurred very recently in [52, 53]; the former uses Carathéodory-type results to reduce the support set of an infinite-dimensional optimization problem to a finite set, and the latter uses the Wasserstein distance as one of several statistical metrics to define risk measures for portfolios.

A variety of other statistical metrics (or pseudo-metrics) have been used previously for solving distributionally robust optimization problems; such metrics include the Kullback-Leibler divergence, the Hellinger distance, the $\chi^2$-distance, the total variation distance, and the Kolmogorov-Smirnov statistic. A few examples follow:

• The paper [4] solves a variety of robust linear programs using $\varphi$-divergences, a wide class of pseudo-metrics to which several of the aforementioned quantities belong.
• The paper [6] gives a highly flexible framework that builds ambiguity sets using classical statistical hypothesis tests, including the $\chi^2$ test and the Kolmogorov-Smirnov test.
• The paper [12] computes robust financial portfolios using the Kullback-Leibler divergence.
• The paper [31] shows how to solve robust dynamic programs whose distributional ambiguity sets are defined using the Kullback-Leibler divergence.
• The paper [35] proposes a robust lot-sizing model whose distributional ambiguity set is defined via the $\chi^2$ goodness-of-fit test.

The major reason why the Wasserstein distance is a particularly appropriate choice for our problem of interest is that it allows one to directly compare a discrete distribution (such as the empirical distribution consisting of a collection of data points) with a continuous distribution, as we have previously noted in Figure 1.1; this is not possible with (for example) the Kullback-Leibler divergence, the Hellinger distance, or the total variation distance.

2.3 Distributionally Robust Optimization with Wasserstein Distance

Many papers on distributionally robust optimization with Wasserstein distance have appeared very recently. The following are a few examples:

• The paper [23] proposes a maximin approach for portfolio selection, which accommodates scenario uncertainty and probability ambiguity, where ambiguity is modelled by restricting the Wasserstein distance to the basic probability model.
• The paper [33] considers a distributionally robust version of the Euclidean travelling salesman problem in which the worst-case spatial distribution of demand is computed against all distributions whose Wasserstein distance to an observed demand distribution is bounded from above.
• The paper [44] derives the dual reformulation of distributionally robust stochastic optimization with Wasserstein distance under a general setting, allowing one to obtain a precise structural description of the worst-case distribution, and connects distributionally robust stochastic optimization to classical robust programming.
• The paper [22] shows how to apply complementary slackness principles to solve a very large family of distributionally robust optimization problems subject to Wasserstein distance constraints; their use of convex duality theory is closely related to our own derivations in Chapters 4, 5 and 6.
• The paper [34] introduces RWPI (Robust Wasserstein Profile Inference), which can be used to optimally choose the regularization parameter in machine learning algorithms such as the generalized LASSO (Least Absolute Shrinkage and Selection operator) and regularized logistic regression.

Chapter 3: Preliminaries

In order to retain mathematical rigor, we find the following results useful:

Definition 3 (Wasserstein distance). Let $\mu_1$ and $\mu_2$ denote two probability measures defined on a compact planar region $\mathcal{R}$. The Wasserstein distance between $\mu_1$ and $\mu_2$, written $\mathcal{D}(\mu_1,\mu_2)$, is defined as
\[
\mathcal{D}(\mu_1,\mu_2) := \inf_{\pi\in\Pi(\mu_1,\mu_2)} \iiiint_{\mathcal{R}\times\mathcal{R}} \|x-y\|\,d\pi(x,y), \tag{3.1}
\]
where $\Pi(\mu_1,\mu_2) \ni \pi(x,y)$ is defined as the set of all probability measures on $\mathcal{R}\times\mathcal{R}$ whose marginals are $\mu_1$ and $\mu_2$, that is, the set of all probability measures that satisfy $\pi(A\times\mathcal{R}) = \mu_1(A)$ and $\pi(\mathcal{R}\times B) = \mu_2(B)$ for all measurable subsets $A,B\subseteq\mathcal{R}$.
The Wasserstein distance can be thought of as a generalization of an assignment problem: for example, when $\mu_1$ and $\mu_2$ are discrete distributions consisting of $n$ points each with equal mass, the Wasserstein distance between the two is simply computed as the cost of a bipartite matching (multiplied by a normalization term of $1/n$). This interpretation is suggested in Figure 1.1.

Finally, the Wasserstein distance between a discrete distribution consisting of points $\{x_1,\ldots,x_n\}$ with uniform probabilities $1/n$ and a continuous probability density function $f$ defined on a compact planar region $\mathcal{R}$ can be obtained by solving the following infinite-dimensional optimization problem:
\[
\begin{aligned}
\underset{I_1(\cdot),\ldots,I_n(\cdot)}{\text{minimize}}\quad & \sum_{i=1}^{n} \iint_{\mathcal{R}} \|x-x_i\|\, f(x)\, I_i(x)\, dA\\
\text{s.t.}\quad & \iint_{\mathcal{R}} f(x)\, I_i(x)\, dA = 1/n \quad \forall i\\
& \sum_{i=1}^{n} I_i(x) = 1 \quad \forall x\in\mathcal{R}\\
& I_i(x) \ge 0 \quad \forall i,\ \forall x\in\mathcal{R};
\end{aligned}
\]
here the value $I_i(x)$ simply describes the amount of the distribution at point $x\in\mathcal{R}$ that should be moved to point $x_i$. The lemma below summarizes some basic results on the Wasserstein distance between a probability density and an empirical distribution:

Lemma 4. Let $f$ denote a probability density function on a compact planar region $\mathcal{R}$ and let $\hat f$ denote an atomic distribution consisting of distinct points $x_1,\ldots,x_n\in\mathcal{R}$, each having probability mass $1/n$. Then the following statements are true:

1. The Wasserstein distance $\mathcal{D}(f,\hat f)$ is the optimal objective value of the concave maximization problem
\[
\underset{\lambda\in\mathbb{R}^n}{\text{maximize}} \quad \iint_{\mathcal{R}} f(x)\,\min_i\{\|x-x_i\|-\lambda_i\}\,dA \quad\text{s.t.}\quad e^T\lambda = 0, \tag{3.2}
\]
where $e\in\mathbb{R}^n$ denotes a vector whose entries are all 1's.

2. For any $\lambda$, a valid supergradient for the objective function of (3.2) is the vector $g\in\mathbb{R}^n$ defined by setting $g_i = -\iint_{R_i} f(x)\,dA$, where each $R_i$ is a connected piecewise hyperbolic region characterized by
\[
R_i = \{x\in\mathcal{R} : \|x-x_i\|-\lambda_i \le \|x-x_j\|-\lambda_j\ \forall j\neq i\};
\]
that is, for any other $\lambda'$, we have
\[
\iint_{\mathcal{R}} f(x)\,\min_i\{\|x-x_i\|-\lambda'_i\}\,dA \;\le\; \iint_{\mathcal{R}} f(x)\,\min_i\{\|x-x_i\|-\lambda_i\}\,dA + g^T(\lambda'-\lambda).
\]

3. If $\lambda^*$ is a maximizer of (3.2), then an optimal Wasserstein mapping between $f$ and $\hat f$ is obtained by defining
\[
R^*_i = \{x\in\mathcal{R} : \|x-x_i\|-\lambda^*_i \le \|x-x_j\|-\lambda^*_j\ \forall j\neq i\}
\]
for each $i$ and transporting all of the mass of each $R^*_i$ to its associated point $x_i$.

4. If $f(x)>0$ for all $x\in\mathcal{R}$, then there exists a unique maximizer $\lambda^*$.

Proof. Statement 1 is a well-known special case of the Kantorovich duality theorem; see for example Theorem 1.3 of [50] for specific details. Proofs of statements 2-4 are routine and can be found in Appendix A.

The following classical result from [38] will be useful in confirming the existence of an optimal solution of the problem that we will construct:

Theorem 5 (Lagrange Duality). Let $\phi$ be a real-valued convex functional defined on a convex subset $\Omega$ of a vector space $X$, and let $G$ be a convex mapping of $X$ into a normed space $Z$. Suppose there exists an $x_1$ such that $G(x_1) < \theta$, where $\theta$ is the zero element, and that $\inf\{\phi(x) : x\in\Omega,\ G(x)\le\theta\}$ is finite. Then
\[
\inf_{x\in\Omega,\ G(x)\le\theta} \phi(x) \;=\; \max_{z^*\ge\theta}\ \inf_{x\in\Omega}\ \phi(x) + \langle G(x), z^*\rangle,
\]
and the maximum on the right is achieved by some $z^*_0\in Z^*$ such that $z^*_0\ge\theta$, where $Z^*$ denotes the dual space of $Z$ and $\langle\cdot,\cdot\rangle$ denotes the evaluation of a linear functional, i.e. $z^*(G(x))$. If the infimum on the left is achieved by some $x_0\in\Omega$, then $\langle G(x_0), z^*_0\rangle = 0$, and $x_0$ minimizes $\phi(x) + \langle G(x), z^*_0\rangle$ over all $x\in\Omega$.

The key insight that Theorem 5 yields is the reduction of an infinite-dimensional problem to a finite-dimensional problem by way of Lagrange duality. Statement 3 of Lemma 4, for example, is a direct consequence of complementary slackness; it says that if $\|x-x_i\|-\lambda^*_i < \|x-x_j\|-\lambda^*_j$, then the optimal Wasserstein mapping from $f$ to $\hat f$ must not transport any mass near $x$ to the point $x_j$. The economic interpretation of the regions $R^*_i$ relative to the dual variables $\lambda^*_i$ can be found in [17]; in a nutshell, the sub-regions $R^*_i$ that characterize the mapping are equivalent to market regions induced by a mill pricing scheme at each of the points $x_i$.
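The dual characterization (3.2) is directly computable. The following minimal sketch runs projected supergradient ascent on $\lambda$, using statement 2 of Lemma 4 for the ascent direction and a fixed Monte Carlo sample in place of the integral over $\mathcal{R}$; the step sizes and sample counts are our own arbitrary choices:

```python
import numpy as np

def wasserstein_dual(points, f_sampler, n_mc=20_000, iters=300):
    """Estimate D(f, f_hat) via the concave dual (3.2),
        max { E_f[ min_i (||X - x_i|| - lam_i) ] : e^T lam = 0 },
    using projected supergradient ascent (statement 2 of Lemma 4), with a
    fixed Monte Carlo sample standing in for the integral over R."""
    n = len(points)
    X = f_sampler(n_mc)
    dists = np.linalg.norm(X[:, None, :] - points[None, :, :], axis=2)
    lam, best = np.zeros(n), -np.inf
    for k in range(1, iters + 1):
        vals = dists - lam
        best = max(best, vals.min(axis=1).mean())     # dual objective estimate
        cells = np.argmin(vals, axis=1)               # cell R_i of each sample
        g = -np.bincount(cells, minlength=n) / n_mc   # supergradient: g_i = -mass(R_i)
        lam += (1.0 / np.sqrt(k)) * (g - g.mean())    # project step onto e^T lam = 0
    return best

rng = np.random.default_rng(4)
pts = rng.uniform(size=(10, 2))
# Distance between the uniform density on the unit square and the 10-point
# empirical distribution; at optimality every cell carries mass 1/n.
print(wasserstein_dual(pts, lambda m: rng.uniform(size=(m, 2))))
```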
Chapter 4: Distributionally Robust TSP

4.1 Worst-case Distributions with Wasserstein Distance Constraints

The input to our problem is a set of distinct demand points $x_1,\ldots,x_n$ in a compact planar region $\mathcal{R}$, which we assume are sampled from some (unknown) distribution function $f$. By rearranging the terms of Theorem 1, we can write
\[
\mathrm{TSP}(x_1,\ldots,x_n) \;=\; \beta\sqrt{n}\iint_{\mathcal{R}}\sqrt{f(x)}\,dA + o(\sqrt{n})
\]
with probability one as $n\to\infty$. Since $\beta$ is a constant and $\sqrt{n}$ is independent of the distribution $f$ by definition, we conclude that the "worst" distribution, whose induced TSP workload is as large as possible (subject to whatever other constraints might be present), is precisely the distribution that maximizes $\iint_{\mathcal{R}}\sqrt{f(x)}\,dA$.

We now let $\hat f$ denote the empirical distribution on these $n$ points $x_i$. We will search through all distributions $f$ whose Wasserstein distance to $\hat f$ is sufficiently small, i.e. for which $\mathcal{D}(f,\hat f)\le t$; here $\mathcal{D}(\cdot,\cdot)$ is the Wasserstein distance from Definition 3 and $t$ is a parameter that will be discussed in Section 7.1. The problem of finding the worst-case TSP distribution, subject to the Wasserstein distance constraint, is then written as the infinite-dimensional convex optimization problem
\[
\begin{aligned}
\underset{f}{\text{maximize}}\quad & \iint_{\mathcal{R}}\sqrt{f(x)}\,dA\\
\text{s.t.}\quad & \mathcal{D}(f,\hat f) \le t\\
& \iint_{\mathcal{R}} f(x)\,dA = 1\\
& f(x)\ge 0\quad\forall x\in\mathcal{R}.
\end{aligned}\tag{4.1}
\]
This is our problem of interest throughout this chapter. We will embed $f$ in the Banach space $L^1(\mathcal{R})$, which consists of all functions that are absolutely Lebesgue integrable on $\mathcal{R}$.

4.1.1 Structure of the Worst-case Distribution

Here we describe the solution to (4.1). To begin, we apply Lemma 4 to express the distance constraint $\mathcal{D}(f,\hat f)\le t$ in (4.1) differently, obtaining the equivalent formulation
\[
\begin{aligned}
\underset{f\in L^1(\mathcal{R})}{\text{maximize}}\quad & \iint_{\mathcal{R}}\sqrt{f(x)}\,dA\\
\text{s.t.}\quad & \iint_{\mathcal{R}} f(x)\,\min_i\{\|x-x_i\|-\lambda_i\}\,dA \le t\quad\forall\lambda : e^T\lambda = 0\\
& \iint_{\mathcal{R}} f(x)\,dA = 1\\
& f(x)\ge 0\quad\forall x\in\mathcal{R}.
\end{aligned}\tag{4.2}
\]
This is an infinite-dimensional problem with an infinite-dimensional constraint space, and it is therefore best addressed using Theorem 5; before doing so, we find the following result useful:

Lemma 6. There exists a unique optimal solution $f^*$ to problem (4.2), and $f^*(x)>0$ for all $x\in\mathcal{R}$.
Standard arguments then show that the true worst-case distribution f ∗ is precisely the solution to the problem maximize f∈L 1 (R) x R p f(x)dA s.t. (4.3) n X i=1 x R ∗ i kx−x i kf(x)dA ≤ t x R ∗ i f(x)dA = 1 n ∀i f(x) ≥ 0 ∀x∈R (this is an immediate consequence of the fact that the optimal objective cost to (4.3) varies continuously as the vectorλ ∗ , which defines the partitionR ∗ 1 ,...,R ∗ n , is perturbed). This problem 23 has a finite-dimensional constraint space, and it is routine to apply Theorem 5 to (4.3) to derive the dual problem minimize ν≥0 1 4 n X i=1 x R ∗ i 1 ν 0 kx−x i k +ν i dA +ν 0 t + 1 n (ν 1 +··· +ν n ) s.t. (4.4) ν 0 kx−x i k +ν i ≥ 0 ∀x∈R ∗ i ∀i whereby we conclude that the optimal solution f ∗ to (4.3) must take the form f ∗ (x) = 1 4(ν ∗ 0 kx−x i k +ν ∗ i ) 2 (4.5) on each sub-region R ∗ i . This satisfies f(x)> 0 for all x∈R and completes the proof. The functional form for the optimal f ∗ can in fact be simplified further: Theorem 7. The worst-case distribution that solves problem (4.2), and therefore (4.1), takes the form f ∗ (x) = 1 4(ν ∗ 0 min i {kx−x i k−λ ∗ i } +ν ∗ 1 ) 2 (4.6) with ν ∗ 0 ,ν ∗ 1 ≥ 0 and e T λ ∗ = 0. Proof. Proof. Themajordifferencebetweentheformoff ∗ aswrittenaboveandtheformdescribed in (4.5) is the fact that the expression in (4.5) is not guaranteed to vary continuously as we move from one region R ∗ i to another; the expression (4.6) is continuous by inspection. We first note that the constraint x R f(x) min i {kx−x i k−λ i }dA≤t ∀λ : e T λ = 0 24 can be restricted to merely the compact set Λ := λ∈R n : e T λ = 0, λ i ≤ diam(R)∀i because, ifλ i > diam(R) for somei, thenkx−x i k−λ i < 0 for allx and the constraint is obviously satisfied. We will apply Theorem 5 where X =L 1 (R), Ω is the subset of the non-negative orthant in L 1 (R) that integrates to 1, and Z consists of all continuous functions on Λ, i.e. Z = C (Λ) (note that Z satisfies the interior point requirement of Theorem 5 because inequalities are simply taken elementwise in Λ). We define φ(x ) : X→R and G(x ) : X→ Z as the maps sending f7→ x R p f(x)dA and f7→ x R f(x) min i {kx−x i k−λ i }dA−t respectively, where the right-hand side of the second expression is regarded as a continuous func- tion of λ. The dual space Z ∗ consists of all regular signed Borel measures on Λ (this is the Riesz representation theorem; see e.g. [42]). However, Lemma 6 shows that f ∗ (x) > 0 onR, and therefore the optimal λ ∗ that solves problem (3.2) is unique by statement 4 of Lemma 4. This implies that G(x ) = s R f(x) min i {kx−x i k−λ i }dA−t< 0 wheneverλ6=λ ∗ , and therefore, since hG(x ),z ∗ i = 0 at optimality, it must be the case that z ∗ is zero everywhere except for (possibly at) λ ∗ . Thus, we conclude that z ∗ is an evaluation functional at λ ∗ (multiplied by a scalar), so that hG(x ),z ∗ i =q ∗ x R f(x) min i {kx−x i k−λ ∗ i }dA−t ! for all feasible f, where q ∗ ≥ 0 is some scalar. Theorem 5 then says that f ∗ must also be the solution to the problem 25 maximize f∈L 1 (R) x R p f(x)dA +q ∗ x R f(x) min i {kx−x i k−λ ∗ i }dA−t ! s.t. x R f(x)dA = 1 f(x) ≥ 0 ∀x∈R or equivalently, the problem maximize f∈L 1 (R) x R p f(x)dA s.t. x R f(x) min i {kx−x i k−λ ∗ i }dA ≤ t x R f(x)dA = 1 f(x) ≥ 0 ∀x∈R. It is routine to verify that the constraint s R f(x)dA = 1 can be replaced with an inequality (in a nutshell, this is because we are allowed to make f(x) as large as we like whenkx−x i k−λ ∗ i ≤ 0 for some index i). 
Thus, we can apply Theorem 5 again to the problem
\[
\begin{aligned}
\underset{f\in L^1(\mathcal{R})}{\text{maximize}}\quad & \iint_{\mathcal{R}}\sqrt{f(x)}\,dA\\
\text{s.t.}\quad & \iint_{\mathcal{R}} f(x)\,\min_i\{\|x-x_i\|-\lambda^*_i\}\,dA \le t\\
& \iint_{\mathcal{R}} f(x)\,dA \le 1\\
& f(x)\ge 0\quad\forall x\in\mathcal{R}
\end{aligned}
\]
to derive the 2-dimensional dual problem
\[
\begin{aligned}
\underset{\nu_0,\nu_1}{\text{minimize}}\quad & \iint_{\mathcal{R}}\frac{1}{4\left(\nu_0\min_i\{\|x-x_i\|-\lambda^*_i\}+\nu_1\right)}\,dA + \nu_0 t + \nu_1\\
\text{s.t.}\quad & \nu_0\min_i\{\|x-x_i\|-\lambda^*_i\}+\nu_1 \ge 0\quad\forall x\in\mathcal{R}\\
& \nu_0,\nu_1\ge 0;
\end{aligned}\tag{4.7}
\]
the optimality conditions of (4.7) describe precisely the desired form of $f^*$, which completes the proof.

Figure 4.1: Two views of an example of $f^*(x)$ as described in Theorem 7, where there are $n = 10$ points. In 4.1a, the shading represents $f^*(x)$ and the dashed lines indicate the optimal Wasserstein map between $f^*$ and $\hat f$. By construction, the vector $\lambda^*$ induces those dashed lines, as described in Lemma 4.

Figure 4.1 shows a surface plot of a worst-case distribution $f^*$.
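The closed form in Theorem 7 is cheap to evaluate once the dual values are known. The sketch below plugs placeholder values of $\lambda$, $\nu_0$, $\nu_1$ (not the solution of any actual instance) into (4.6) and estimates its integral by Monte Carlo; the duals that actually solve (4.7) are precisely those for which this integral equals 1:

```python
import numpy as np

def worst_case_tsp_density(x, data_pts, lam, nu0, nu1):
    """Evaluate the Theorem 7 form
        f*(x) = 1 / (4 (nu0 * min_i{||x - x_i|| - lam_i} + nu1)^2).
    lam, nu0, nu1 are placeholder duals here, not outputs of a solver."""
    d = np.linalg.norm(x[:, None, :] - data_pts[None, :, :], axis=2)
    return 1.0 / (4.0 * (nu0 * (d - lam).min(axis=1) + nu1) ** 2)

rng = np.random.default_rng(5)
data = rng.uniform(size=(10, 2))
nodes = rng.uniform(size=(200_000, 2))   # Monte Carlo nodes on the unit square
f = worst_case_tsp_density(nodes, data, np.zeros(10), nu0=3.0, nu1=1.0)
print(f.mean())   # MC estimate of the integral of f*; duals solving (4.7)
                  # would make this equal to 1
```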
Remark 1. Many problems in distributionally robust optimization have objective functions and constraints that are linear in terms of the unknown distribution $f$ (for example, the expectation operator). For such problems, Carathéodory-type theorems imply that the worst-case distribution will consist of a finite number of points, even when one uses ambiguity sets defined by the Wasserstein distance; see for example [52]. As a consequence of this fact, it is sometimes the case that one can determine the worst-case distribution using only one iteration of a finite-dimensional optimization problem; this turns out to hold (for the Wasserstein metric) in [22, 53], for example. Because of the non-linearity in our objective, our worst-case distribution $f^*$ is smooth; we will describe an iterative scheme for finding it in Section 4.2.

Remark 2. One of the salient attributes of the worst-case distribution $f^*$ is that the presence of the square root in the objective of (4.1) establishes an inverse proportionality between the optimal solution $f^*(x)$ and the square of the distance to one of the data points $x_i$ (with some additional additive and multiplicative weights from the dual variables $\nu^*$ and $\lambda^*$). This same inverse proportionality is shared by the classical geographic gravity model [1], which is "the most common formulation of the spatial interaction method" and has historically been used to model a wide variety of demographic phenomena such as population migration, spatial utility for retail stores, and trip distributions between cities. This would appear to lend credibility to our solution $f^*$, inasmuch as it takes a form that closely matches that of distributions for related problems.

Remark 3. The earlier paper [15] considers a problem closely related to (4.1) in which one has constraints on the mean and covariance of $f$ instead of the constraint on $\mathcal{D}(f,\hat f)$; that problem is written as
\[
\begin{aligned}
\underset{f\in L^1(\mathcal{R})}{\text{maximize}}\quad & \iint_{\mathcal{R}}\sqrt{f(x)}\,dA\\
\text{s.t.}\quad & \iint_{\mathcal{R}} x\,f(x)\,dA = \mu\\
& \iint_{\mathcal{R}} x x^T f(x)\,dA \preceq \Sigma + \mu\mu^T\\
& \iint_{\mathcal{R}} f(x)\,dA = 1\\
& f(x)\ge 0\quad\forall x\in\mathcal{R}.
\end{aligned}\tag{4.8}
\]
As we have noted in Figure 2.1, mean and covariance information may not be sufficient to give any useful information about the ambiguity set of distributions. It is more desirable to describe this ambiguity set in a way that is guaranteed to make better use of sample points as they become available.

It is well known (e.g. Section 2 of [14]) that the Wasserstein distance between the empirical distribution $\hat f$ and the true distribution $f$ converges to zero with probability one as $n\to\infty$. This, coupled with some routine analysis, guarantees that the objective cost of our proposed formulation (4.1) will eventually be the same as that of the ground truth distribution:

Theorem 8. Let $X = \{X_1,X_2,\ldots\}$ be a sequence of random points i.i.d. according to an absolutely continuous probability density function $\bar f(\cdot)$ defined on a compact planar region $\mathcal{R}$. For any positive integer $n$, let $\hat f_n$ denote the empirical distribution on the points $\{X_1,\ldots,X_n\}$. Then with probability one there exists a sequence $\{t_1,t_2,\ldots\}$, converging to 0, such that the optimal objective value of the problem
\[
\begin{aligned}
\underset{f\in L^1(\mathcal{R})}{\text{maximize}}\quad & \iint_{\mathcal{R}}\sqrt{f(x)}\,dA\\
\text{s.t.}\quad & \mathcal{D}(f,\hat f_n) \le t_n\\
& \iint_{\mathcal{R}} f(x)\,dA = 1\\
& f(x)\ge 0\quad\forall x\in\mathcal{R}
\end{aligned}\tag{4.9}
\]
approaches the ground truth (i.e. $\iint_{\mathcal{R}}\sqrt{\bar f(x)}\,dA$) as $n\to\infty$.

Proof. See Appendix B.

4.2 Finding the Worst-case Distribution Efficiently

The preceding section established that the worst-case distribution that solves (4.1) can be expressed in terms of optimal vectors $\lambda^*\in\mathbb{R}^n$ and $\nu^*\in\mathbb{R}^2$. This section describes a simple method for calculating $\lambda^*$ and $\nu^*$ efficiently by way of an analytic center cutting plane method [9]. Recall that our problem of interest, as written in (4.2), is
\[
\begin{aligned}
\underset{f\in L^1(\mathcal{R})}{\text{maximize}}\quad & \iint_{\mathcal{R}}\sqrt{f(x)}\,dA\\
\text{s.t.}\quad & \iint_{\mathcal{R}} f(x)\,\min_i\{\|x-x_i\|-\lambda_i\}\,dA \le t\quad\forall\lambda : e^T\lambda = 0\\
& \iint_{\mathcal{R}} f(x)\,dA = 1\\
& f(x)\ge 0\quad\forall x\in\mathcal{R};
\end{aligned}
\]
thus, it is certainly true that if we fix any specific value $\bar\lambda$ such that $e^T\bar\lambda = 0$, then the following problem is a relaxation of (4.2) and hence has an objective value that is at least as large as that of (4.2):
\[
\begin{aligned}
\underset{f\in L^1(\mathcal{R})}{\text{maximize}}\quad & \iint_{\mathcal{R}}\sqrt{f(x)}\,dA\\
\text{s.t.}\quad & \iint_{\mathcal{R}} f(x)\,\min_i\{\|x-x_i\|-\bar\lambda_i\}\,dA \le t\\
& \iint_{\mathcal{R}} f(x)\,dA = 1\\
& f(x)\ge 0\quad\forall x\in\mathcal{R}.
\end{aligned}
\]
It is natural to consider the problem of selecting the particular value of $\bar\lambda$ that makes the above relaxation as tight as possible. In fact, our proof of Theorem 7 says that there exists a particular value of $\bar\lambda$, namely $\lambda^*$, for which the above relaxation is actually tight; in other words, the optimal distribution $f^*$ is the solution to the problem
\[
\begin{aligned}
\underset{f\in L^1(\mathcal{R})}{\text{maximize}}\quad & \iint_{\mathcal{R}}\sqrt{f(x)}\,dA\\
\text{s.t.}\quad & \iint_{\mathcal{R}} f(x)\,\min_i\{\|x-x_i\|-\lambda^*_i\}\,dA \le t\\
& \iint_{\mathcal{R}} f(x)\,dA = 1\\
& f(x)\ge 0\quad\forall x\in\mathcal{R}
\end{aligned}
\]
for an appropriately chosen vector $\lambda^*$. Thus, the problem of finding $\lambda^*$ actually reduces to the optimization problem
\[
\begin{aligned}
\underset{\lambda\in\mathbb{R}^n}{\text{minimize}}\quad & \max_{f\in\Omega(\lambda)}\ \iint_{\mathcal{R}}\sqrt{f(x)}\,dA\\
\text{s.t.}\quad & e^T\lambda = 0\\
& \lambda_i\le\operatorname{diam}(\mathcal{R})\quad\forall i,
\end{aligned}\tag{4.10}
\]
where $\Omega(\lambda)$ is the subset of $L^1(\mathcal{R})$ consisting of all functions $f$ such that
\[
\iint_{\mathcal{R}} f(x)\,\min_i\{\|x-x_i\|-\lambda_i\}\,dA \le t, \qquad \iint_{\mathcal{R}} f(x)\,dA = 1, \qquad f(x)\ge 0\ \ \forall x\in\mathcal{R}.
\]
Of course, the inner problem of maximizing over $f$ given $\lambda$ is easily solved, because the gradient vector for the dual problem
\[
\begin{aligned}
\underset{\nu_0,\nu_1}{\text{minimize}}\quad & \iint_{\mathcal{R}}\frac{1}{4\left(\nu_0\min_i\{\|x-x_i\|-\lambda_i\}+\nu_1\right)}\,dA + \nu_0 t + \nu_1\\
\text{s.t.}\quad & \nu_0\min_i\{\|x-x_i\|-\lambda_i\}+\nu_1 \ge 0\quad\forall x\in\mathcal{R}\\
& \nu_0,\nu_1\ge 0,
\end{aligned}\tag{4.11}
\]
as derived in the proof of Theorem 7, can be computed explicitly. Thus, we simply require a better understanding of problem (4.10):

Lemma 9. The (outer) objective function of problem (4.10) is quasiconvex, i.e. its sub-level sets are convex.

Proof. For notational compactness, let $G(\lambda)$ denote the objective function of (4.10). Recall [10] that $G(\lambda)$ is quasiconvex if and only if, for any $\lambda^1,\lambda^2$ and any $\theta\in[0,1]$, we have $G(\theta\lambda^1 + (1-\theta)\lambda^2)\le\max\{G(\lambda^1),G(\lambda^2)\}$. Let $\bar\lambda = \theta\lambda^1 + (1-\theta)\lambda^2$ and let $\bar f$ denote the distribution that maximizes $\iint_{\mathcal{R}}\sqrt{f(x)}\,dA$ over all $f\in\Omega(\bar\lambda)$; it will suffice to prove that either $\bar f\in\Omega(\lambda^1)$ or $\bar f\in\Omega(\lambda^2)$. By definition, we have
\[
\iint_{\mathcal{R}}\bar f(x)\,\min_i\{\|x-x_i\|-\bar\lambda_i\}\,dA \le t,
\]
and the left-hand side of the above inequality is a concave function of $\bar\lambda$ (if we fix the function $\bar f$).
Thus, if we let $S$ denote the line segment joining $\lambda^1$ and $\lambda^2$ (which, of course, contains $\bar\lambda$), we see that the problem
\[
\underset{\lambda\in S}{\text{minimize}}\quad \iint_{\mathcal{R}}\bar f(x)\,\min_i\{\|x-x_i\|-\lambda_i\}\,dA
\]
must realize its minimizer on the boundary of $S$ (since we are minimizing a concave function), i.e. at the point $\lambda^1$ or $\lambda^2$. Therefore, it must be the case that $\iint_{\mathcal{R}}\bar f(x)\min_i\{\|x-x_i\|-\lambda^1_i\}\,dA\le t$ or $\iint_{\mathcal{R}}\bar f(x)\min_i\{\|x-x_i\|-\lambda^2_i\}\,dA\le t$, which completes the proof.

The following theorem describes a cutting plane oracle for the outer problem (4.10):

Theorem 10. Let $\bar\lambda$ satisfy $e^T\bar\lambda = 0$ and let $\bar f$ be the solution to the inner problem of (4.10) (i.e. $\bar f$ maximizes $\iint_{\mathcal{R}}\sqrt{f(x)}\,dA$ over all $f\in\Omega(\bar\lambda)$). Then the vector $\bar g\in\mathbb{R}^n$ defined by setting
\[
\bar g_i = -\iint_{\bar R_i}\bar f(x)\,dA
\]
for all $i$, where $\bar R_i$ is defined as
\[
\bar R_i = \{x\in\mathcal{R} : \|x-x_i\|-\bar\lambda_i \le \|x-x_j\|-\bar\lambda_j\ \forall j\neq i\},
\]
defines a valid cutting plane for problem (4.10); that is, if $\bar g^T(\lambda'-\bar\lambda)\le 0$ for some $\lambda'$ satisfying $e^T\lambda' = 0$, and $f'$ is the solution to the inner problem of (4.10) associated with $\lambda'$, then $\max_{f\in\Omega(\lambda')}\iint_{\mathcal{R}}\sqrt{f(x)}\,dA \ge \max_{f\in\Omega(\bar\lambda)}\iint_{\mathcal{R}}\sqrt{f(x)}\,dA$.

Proof. Statement 2 of Lemma 4 says that, for any other $\lambda'$, the assumption that $\bar g^T(\lambda'-\bar\lambda)\le 0$ yields
\[
\iint_{\mathcal{R}}\bar f(x)\min_i\{\|x-x_i\|-\lambda'_i\}\,dA \;\le\; \iint_{\mathcal{R}}\bar f(x)\min_i\{\|x-x_i\|-\bar\lambda_i\}\,dA + \bar g^T(\lambda'-\bar\lambda) \;\le\; \iint_{\mathcal{R}}\bar f(x)\min_i\{\|x-x_i\|-\bar\lambda_i\}\,dA \;\le\; t,
\]
which implies that $\bar f\in\Omega(\lambda')$ and therefore that $\max_{f\in\Omega(\lambda')}\iint_{\mathcal{R}}\sqrt{f(x)}\,dA \ge \max_{f\in\Omega(\bar\lambda)}\iint_{\mathcal{R}}\sqrt{f(x)}\,dA$, as desired.

We now have a fast method for generating cutting planes associated with problem (4.10), and thereby recovering the distribution $f^*$ that solves problem (4.1); see Algorithm 1 for a formal description.

Algorithm 1: WorstTSPDensity
Input: A compact planar region $\mathcal{R}$ containing a set of distinct points $x_1,\ldots,x_n$, interpreted as an empirical distribution $\hat f$; a distance parameter $t$; and a tolerance $\epsilon$.
Output: An $\epsilon$-approximation of the distribution $f^*$ that maximizes $\iint_{\mathcal{R}}\sqrt{f(x)}\,dA$ subject to the constraint that $\mathcal{D}(f,\hat f)\le t$.
/* This is a standard analytic center cutting plane method applied to problem (4.10), which has an n-dimensional variable space. */
Set UB $=\infty$ and LB $=-\infty$;
Set $\Lambda = \{\lambda\in\mathbb{R}^n : e^T\lambda = 0,\ \lambda_i\le\operatorname{diam}(\mathcal{R})\ \forall i\}$;
while UB $-$ LB $>\epsilon$ do
    Let $\bar\lambda$ be the analytic center of $\Lambda$;
    /* Build an upper-bounding $\bar f$ for the original problem (4.1). */
    Let $\bar\nu_0,\bar\nu_1$ be the solution to problem (4.11) with $\bar\lambda$ as an input;
    Let $\bar f(x) = \frac{1}{4}\left(\bar\nu_0\min_i\{\|x-x_i\|-\bar\lambda_i\}+\bar\nu_1\right)^{-2}$;
    Let UB $=\iint_{\mathcal{R}}\sqrt{\bar f(x)}\,dA$;
    /* Build a lower-bounding $\tilde f$ that is feasible for (4.1) by construction. */
    Let $\bar R_i = \{x\in\mathcal{R} : \|x-x_i\|-\bar\lambda_i\le\|x-x_j\|-\bar\lambda_j\ \forall j\neq i\}$ for $i = 1,\ldots,n$;
    Let $\tilde\nu\in\mathbb{R}^{n+1}$ be the solution to problem (4.4) with inputs $\bar R_1,\ldots,\bar R_n$;
    Let $\tilde f$ be defined by setting $\tilde f(x) = \frac{1}{4}(\tilde\nu_0\|x-x_i\|+\tilde\nu_i)^{-2}$ on each $\bar R_i$;
    Let LB $=\iint_{\mathcal{R}}\sqrt{\tilde f(x)}\,dA$;
    Let $g_i = -\iint_{\bar R_i}\bar f(x)\,dA$ for $i = 1,\ldots,n$;
    Let $\mathcal{H} = \{\lambda\in\mathbb{R}^n : g^T\lambda\ge g^T\bar\lambda\}$ and set $\Lambda = \Lambda\cap\mathcal{H}$;
end
return $\tilde f$;

Algorithm 1: Algorithm WorstTSPDensity takes as input a compact planar region containing a set of $n$ distinct points, a distance threshold $t$, and a tolerance $\epsilon$.
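The following compressed sketch mimics the structure of Algorithm 1 under several simplifications that we flag explicitly: the analytic-center localization step is replaced by a plain projected step along the Theorem 10 cut direction, the inner dual (4.11) is solved by crude grid search, and all integrals are Monte Carlo estimates on the unit square. It is illustrative scaffolding only, not the algorithm as stated:

```python
import numpy as np

rng = np.random.default_rng(6)
pts = rng.uniform(size=(8, 2))            # data points x_1..x_n
n, t = len(pts), 0.05
X = rng.uniform(size=(20_000, 2))         # Monte Carlo nodes standing in for R
D = np.linalg.norm(X[:, None, :] - pts[None, :, :], axis=2)

def inner_dual(lam):
    """Crude grid-search solve of the 2-D dual (4.11) for fixed lam;
    returns the dual objective (an upper bound) and its (nu0, nu1)."""
    m = (D - lam).min(axis=1)
    best = (np.inf, 0.0, 1.0)
    for nu0 in np.linspace(0.0, 40.0, 41):
        lo = max(1e-6, -nu0 * m.min())    # keep nu0*m + nu1 >= 0 on R
        for nu1 in np.linspace(lo + 1e-3, lo + 4.0, 61):
            val = (1.0 / (4.0 * (nu0 * m + nu1))).mean() + nu0 * t + nu1
            best = min(best, (val, nu0, nu1))
    return best

lam = np.zeros(n)
for k in range(1, 26):
    ub, nu0, nu1 = inner_dual(lam)
    fbar = 1.0 / (4.0 * (nu0 * (D - lam).min(axis=1) + nu1) ** 2)  # form (4.6)
    cells = np.argmin(D - lam, axis=1)
    # Theorem 10 cut: g_i = -(mass of cell i under fbar); improving lam'
    # satisfy g^T(lam' - lam) > 0, so we step along g, projected onto
    # the hyperplane e^T lam = 0.
    g = -np.bincount(cells, weights=fbar, minlength=n) / fbar.sum()
    lam += (0.5 / np.sqrt(k)) * (g - g.mean())
print(inner_dual(lam)[0])   # upper bound on the optimal value of (4.1)
```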
4.2.1 Variations and Extensions

The analysis of Section 4.2 motivates three possible extensions of the existing framework:

Uneven data weights: By definition, the empirical distribution $\hat f$ of the points $x_1,\ldots,x_n$ consists of a collection of $n$ atomic masses, one at each point and each having mass $1/n$. It is easy to envision scenarios in which one desires uneven weights: for example, one might use an exponential weighting scheme to emphasize more recent measurements, or one might use different weights to distinguish between activities on weekends versus weekdays (or other seasonal effects). If we require point $x_i$ to have a mass $q_i$ associated with it, then we can find the worst-case distribution $f^*$ by solving problem (4.2), with the one change that we replace the restriction $e^T\lambda = 0$ with the restriction $q^T\lambda = 0$; the form of $f^*$ is otherwise unchanged.

Capacitated vehicles: Our approach can also be adapted to problems in which vehicles have capacities and originate from a central depot located at the origin. To do so, suppose that each vehicle can visit $\kappa$ destinations before returning to the depot. The following theorem from [26] provides useful upper and lower bounds for the cost of a capacitated vehicle routing tour:

Theorem 11. Let $X = \{x_1,\ldots,x_n\}$ be a set of demand points in the plane serviced by a fleet of vehicles with capacity $\kappa$ that originate from a single depot located at the origin. The length of the optimal set of capacitated VRP tours of $X$, written $\mathrm{VRP}(X)$, satisfies
\[
\max\left\{\frac{2}{\kappa}\sum_{i=1}^n\|x_i\|,\ \mathrm{TSP}(X)\right\} \;\le\; \mathrm{VRP}(X) \;\le\; 2\left\lceil\frac{|X|}{\kappa}\right\rceil\cdot\frac{\sum_{i=1}^n\|x_i\|}{|X|} + (1-1/\kappa)\,\mathrm{TSP}(X). \tag{4.12}
\]
The probabilistic version of this, as derived in Appendix C, uses the BHH theorem (Theorem 1 of this dissertation) to characterize the length of the TSP term:
\[
\sqrt{n}\cdot\max\left\{\frac{2}{s}\iint_{\mathcal{R}}\|x\|f(x)\,dA,\ \beta\iint_{\mathcal{R}}\sqrt{f_c(x)}\,dA\right\} \;\lesssim\; \mathrm{VRP}(X) \;\lesssim\; \sqrt{n}\cdot\left(\frac{2}{s}\iint_{\mathcal{R}}\|x\|f(x)\,dA + \beta\iint_{\mathcal{R}}\sqrt{f_c(x)}\,dA\right), \tag{4.13}
\]
where we set $s = \kappa/\sqrt{n}$ and adopt the notation "$\lesssim$" to denote an "approximate" inequality, both of which are also explained in Appendix C. It is immediately obvious that the upper and lower bounds are within a factor of 2 of one another. Applying the same analysis as in Section 4.1, the worst-case distribution that maximizes the right-hand side of (4.13) subject to a Wasserstein distance constraint takes the form
\[
f^*(x) = \frac{1}{4\left(\nu^*_0\min_i\{\|x-x_i\|-\lambda^*_i\}+\nu^*_1-\frac{2}{s}\|x\|\right)^2};
\]
its level sets, i.e. those curves on which $\nu^*_0(\|x-x_i\|-\lambda^*_i)+\nu^*_1-\frac{2}{s}\|x\|$ is constant, consist of piecewise components of so-called Cartesian ovals [37].

Higher dimensions: The BHH theorem (Theorem 1) is also applicable in higher dimensions; the general form says that, when the service region $\mathcal{R}$ belongs to $\mathbb{R}^d$, we have
\[
\lim_{N\to\infty}\frac{\mathrm{TSP}(X)}{N^{(d-1)/d}} = \beta_d\iiint_{\mathcal{R}} f_c(x)^{(d-1)/d}\,dV
\]
for dimension-dependent constants $\beta_d$. Applying the same analysis as in Section 4.1, the worst-case distribution that maximizes the right-hand side of the above, subject to a Wasserstein distance constraint, takes the form
\[
f^*(x) = \frac{(d-1)^{d-1}}{d^d}\cdot\frac{1}{\left(\nu^*_0\min_i\{\|x-x_i\|-\lambda^*_i\}+\nu^*_1\right)^d}.
\]

Chapter 5: Entropy Maximization

The differential entropy of a random variable $X$ that follows a probability density function $f$ with support $\mathcal{R}$ is defined as
\[
h(f) = \iint_{\mathcal{R}} -f(x)\log f(x)\,dA,
\]
which is an extension of the well-known Shannon entropy to continuous probability distributions.
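As a quick numerical companion to this definition, the sketch below estimates $h(f)$ by Monte Carlo using samples drawn from $f$ itself, and checks the estimate against the closed-form entropy of a bivariate Gaussian; the particular density and sample size are arbitrary choices of ours:

```python
import numpy as np

def differential_entropy_mc(f, sampler, m=200_000):
    """Monte Carlo estimate of h(f) = -E_f[log f(X)], drawing X ~ f."""
    x = sampler(m)
    return -np.mean(np.log(f(x)))

# Sanity check on a 2-D Gaussian with covariance s^2 I, whose differential
# entropy is log(2*pi*e*s^2) in closed form.
s = 0.2
rng = np.random.default_rng(7)
f = lambda x: np.exp(-np.sum(x**2, axis=1) / (2 * s**2)) / (2 * np.pi * s**2)
sampler = lambda m: s * rng.standard_normal((m, 2))
print(differential_entropy_mc(f, sampler), np.log(2 * np.pi * np.e * s**2))
```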
(5.1) D(f, ˆ f) ≤ t x R f(x)dA = 1 f(x) ≥ 0 ∀x∈R. 37 This is our problem of interest throughout this chapter. We will embed f in the Banach space L 1 (R), hereafter abbreviated simply to L 1 , in order to apply Lemma 4 so as to characterize the constraint thatD(f, ˆ f)≤t. Since it is well-known (e.g. Section 2 of [14]) that the empirical distribution ˆ f converges with probability one with respect to the Wasserstein distance as n→∞, it is not surprising that our proposed formulation (5.1) is guaranteed to make better use of sample points as they become available comparing to traditional methods like mean- and covariance-based approaches: Theorem 12. Let X ={X 1 ,X 2 ,...} be a sequence of random points i.i.d. according to an absolutely continuous probability density function ¯ f(·) defined on a compact planar regionR. For any positive integern, let ˆ f n denote the empirical distribution on points{X 1 ,...,X n }. Then with probability one there exists a sequence{t 1 ,t 2 ,...}, converging to 0, such that the optimal objective value of the problem maximize f∈L 1 x R −f(x) logf(x)dA s.t. (5.2) D(f, ˆ f n ) ≤ t n x R f(x)dA = 1 f(x) ≥ 0 ∀x∈R approaches the ground truth (i.e. s R − ¯ f(x) log ¯ f(x)dA) as n→∞. Proof. See Section D of the Appendix. 38 5.1 Structure of the Solution to (5.1) To begin, we apply Lemma 4 to express the distance constraintD(f, ˆ f)≤ t in (5.1) differently, obtaining the equivalent formulation maximize f∈L1 x R −f(x) logf(x)dA s.t. (5.3) x R f(x) min i {kx−x i k−λ i }dA ≤ t ∀λ : e T λ = 0 x R f(x)dA = 1 f(x) ≥ 0 ∀x∈R. This is an infinite-dimensional problem with an infinite-dimensional constraint space and is there- fore best addressed using Theorem 5; before doing so, we find the following result useful: Lemma 13. There exists a unique optimal solution f ∗ to problem (5.3), and f ∗ (x) > 0 for all x∈R. Proof. The fact that the optimal solution is unique (provided one exists) is an immediate con- sequence of the fact that the differential entropy function in (5.3) is strictly concave. To prove existence, let{f j } denote a sequence of feasible inputs to (5.3) whose objective values converge to a supremum. For each f j , let λ j denote a value of λ that induces the optimal Wasserstein mapping between f j and ˆ f as described in Lemma 4, i.e. that solves problem (3.2). It is easy to verify that the iterates λ j lie in the compact set Λ, defined by Λ := λ∈R n : e T λ = 0, λ i ≤ diam(R)∀i , because anyλ lying outside Λ would force some sub-regions to be empty. Therefore, the sequence {λ j } must have a convergent subsequence with a limit λ ∗ , inducing a partition R ∗ 1 ,...,R ∗ n as in 39 statement 2 of Lemma 4 that satisfies s R ∗ i f(x)dA = 1/n for all i. Standard arguments then show that the true worst-case distribution f ∗ is precisely the solution to the problem maximize f∈L1 x R −f(x) logf(x)dA s.t. (5.4) n X i=1 x R ∗ i kx−x i kf(x)dA ≤ t x R ∗ i f(x)dA = 1 n ∀i f(x) ≥ 0 ∀x∈R (this is an immediate consequence of the fact that the optimal objective cost to (5.4) varies continuously as the vectorλ ∗ , which defines the partitionR ∗ 1 ,...,R ∗ n , is perturbed). This problem has a finite-dimensional constraint space, and it is routine to apply Theorem 5 to (5.4) to derive the dual problem minimize ν n X i=1 x R ∗ i (2(ν 0 kx−x i k +ν i )− 1)e (ν0kx−xik+νi−1) dA +ν 0 t + 1 n (ν 1 +··· +ν n ) s.t. 
(5.5) ν 0 kx−x i k +ν i ≥ 0 ∀x∈R ∗ i ∀i ν 0 ≥ 0 whereby we conclude that the optimal solution f ∗ to (5.4) must take the form f ∗ (x) =e (ν ∗ 0 kx−xik+ν ∗ i −1) on each sub-region R ∗ i . This satisfies f(x)> 0 for all x∈R and completes the proof. The functional form for the optimal f ∗ can in fact be simplified further: 40 Theorem 14. The worst-case distribution that solves problem (5.3), and therefore (5.1), takes the form f ∗ (x) =e (ν ∗ 0 mini{kx−xik−λ ∗ i }+ν ∗ 1 −1) with ν ∗ 0 ,ν ∗ 1 ≥ 0 and e T λ ∗ = 0. Proof. We first note that the constraint x R f(x) min i {kx−x i k−λ i }dA≤t ∀λ : e T λ = 0 can be restricted to merely to the compact set Λ := λ∈R n : e T λ = 0, λ i ≤ diam(R)∀i because, ifλ i > diam(R) for somei, thenkx−x i k−λ i < 0 for allx and the constraint is obviously satisfied. We will apply Theorem 5 where X = L 1 , Ω is the subset of the non-negative orthant in L 1 that integrates to 1, and Z consists of all continuous functions on Λ, i.e. Z =C (Λ) (note that Z satisfies the interior point requirement of Theorem 5 because inequalities are simply taken elementwise in Λ). We define φ(x ) : X→R and G(x ) : X→ Z as the maps sending f7→ x R −f(x) logf(x)dA and f7→ x R f(x) min i {kx−x i k−λ i }dA−t respectively, where the right-hand side of the second expression is regarded as a continuous func- tion of λ. The dual space Z ∗ consists of all regular signed Borel measures on Λ (this is the Riesz representation theorem; see e.g. [42]). However, Lemma 13 shows that f ∗ (x) > 0 onR, and 41 therefore the optimal λ ∗ that solves problem (3.2) is unique by statement 4 of Lemma 4. This implies that G(x ) = s R f(x) min i {kx−x i k−λ i }dA−t< 0 wheneverλ6=λ ∗ , and therefore, since hG(x ),z ∗ i = 0 at optimality, it must be the case that z ∗ is zero everywhere except for (possibly at) λ ∗ . Thus, we conclude that z ∗ is an evaluation functional at λ ∗ (multiplied by a scalar), so that hG(x ),z ∗ i =q ∗ x R f(x) min i {kx−x i k−λ ∗ i }dA−t ! for all feasible f, where q ∗ ≥ 0 is some scalar. Theorem 5 then says that f ∗ must also be the solution to the problem maximize f∈L 1 x R −f(x) logf(x)dA +q ∗ x R f(x) min i {kx−x i k−λ ∗ i }dA−t ! s.t. x R f(x)dA = 1 f(x) ≥ 0 ∀x∈R or equivalently, the problem maximize f∈L 1 x R −f(x) logf(x)dA s.t. x R f(x) min i {kx−x i k−λ ∗ i }dA ≤ t x R f(x)dA = 1 f(x) ≥ 0 ∀x∈R. 42 It is routine to verify that the constraint s R f(x)dA = 1 can be replaced with an inequality (in a nutshell, this is because we are allowed to make f(x) as large as we like whenkx−x i k−λ ∗ i ≤ 0 for some index i). Thus, we can apply Theorem 5 again to the problem maximize f∈L 1 x R −f(x) logf(x)dA s.t. x R f(x) min i {kx−x i k−λ ∗ i }dA ≤ t x R f(x)dA ≤ 1 f(x) ≥ 0 ∀x∈R to derive the 2-dimensional dual problem minimize ν0,ν1 x R 2(ν 0 min i {kx−x i k−λ ∗ i } +ν 1 )− 1 e (ν0 mini{kx−xik−λ ∗ i }+ν1−1) dA +ν 0 t +ν 1 s.t. (5.6) ν 0 min i {kx−x i k−λ ∗ i } +ν 1 ≥ 0 ∀x∈R ν 0 ,ν 1 ≥ 0 ; the optimality conditions of (5.6) describe precisely the desired form of f ∗ , which completes the proof. 5.2 Solving (5.1) Efficiently The preceding section established that the worst-case distribution that solves (5.1) can be ex- pressed in terms of optimal vectorsλ ∗ ∈R n andν ∗ ∈R 2 . This section describes a simple method 43 for calculating λ ∗ and ν ∗ efficiently by way of an analytic center cutting plane method [9]. As derived in the proof of Theorem 14, the optimal distribution f ∗ is the solution to the problem maximize f∈L 1 x R −f(x) logf(x)dA s.t. 
x R f(x) min i {kx−x i k−λ ∗ i }dA ≤ t x R f(x)dA = 1 f(x) ≥ 0 ∀x∈R for an appropriately chosen vector λ ∗ (it is obvious that, if we substitute any other value λ in the above, we obtain an upper bound for our original problem (5.3)). The problem of finding λ ∗ actually reduces to the optimization problem minimize λ∈R n max f∈Ω(λ) x R −f(x) logf(x)dA s.t. (5.7) e T λ = 0 λ i ≤ diam(R) ∀i where Ω(λ) is the subset of L 1 consisting of all functions f such that x R f(x) min i {kx−x i k−λ i }dA ≤ t x R f(x)dA = 1 f(x) ≥ 0 ∀x∈R. 44 Of course, the inner problem of maximizingf givenλ is easily solved because the gradient vector for the dual problem minimize ν0,ν1 x R 2(ν 0 min i {kx−x i k−λ i } +ν 1 )− 1 e (ν0 mini{kx−xik−λi}+ν1−1) dA +ν 0 t +ν 1 s.t. (5.8) ν 0 min i {kx−x i k−λ i } +ν 1 ≥ 0 ∀x∈R ν 0 ,ν 1 ≥ 0 ; as derived in the proof of Theorem 14, can be computed explicitly. Thus, we simply require a better understanding of problem (5.7): Lemma 15. The (outer) objective function of problem (5.7) is quasiconvex, i.e. its sub-level sets are convex. Proof. For notational compactness, let G(λ) denote the objective function of (5.7). Recall [10] that G(λ) is quasiconvex if and only if, for any λ 1 ,λ 2 and any θ∈ [0, 1], we have G(θλ 1 + (1−θ)λ 2 )≤ max{G(λ 1 ),G(λ 2 )}. Let ¯ λ =θλ 1 + (1−θ)λ 2 and let ¯ f denote the distribution that maximizes s R −f(x) logf(x)dA over all f∈ Ω( ¯ λ); it will suffice to prove that either ¯ f∈ Ω(λ 1 ) or ¯ f∈ Ω(λ 2 ). By definition, we have x R ¯ f(x) min i {kx−x i k− ¯ λ i }dA≤t, and the left-hand side of the above inequality is a concave function in ¯ λ (if we fix the function ¯ f). Thus, if we letS denote the line segment joining λ 1 and λ 2 (which, of course, contains ¯ λ), we see that the problem minimize λ∈S x R ¯ f(x) min i {kx−x i k−λ i }dA 45 must realize its minimizer on the boundary ofS (since we are minimizing a concave function), i.e. the point λ 1 or λ 2 . Therefore, it must be the case that s R ¯ f(x) min i {kx−x i k−λ 1 i }dA≤t or s R ¯ f(x) min i {kx−x i k−λ 2 i }dA≤t, which completes the proof. The following theorem describes a cutting plane oracle for the outer problem (5.7): Theorem 16. Let ¯ λ satisfy e T ¯ λ = 0 and let ¯ f be the solution to the inner problem of (4.10) (i.e. ¯ f maximizes s R −f(x) logf(x)dA over all f∈ Ω( ¯ λ)). Then the vector ¯ g∈R n defined by setting ¯ g i =− x ¯ Ri ¯ f(x)dA for all i, where ¯ R i is defined as ¯ R i = x∈R :kx−x i k− ¯ λ i ≤kx−x j k− ¯ λ j ∀j , defines a valid cutting plane for problem (5.7); that is, if ¯ g T (λ 0 − ¯ λ)≤ 0 for some λ 0 satis- fying e T λ 0 = 0, and f 0 is the solution to the inner problem of (5.7) associated with λ 0 , then max f∈Ω(λ 0 ) s R −f(x) logf(x)dA≥ max f∈Ω( ¯ λ) s R −f(x) logf(x)dA. Proof. Statement 2 of Lemma 4 says that, for any other λ 0 , the assumption that ¯ g T (λ 0 − ¯ λ)≤ 0 yields x R ¯ f(x) min i {kx−x i k−λ 0 i }dA≤ x R ¯ f(x) min i {kx−x i k− ¯ λ i }dA+¯ g T (λ 0 − ¯ λ)≤ x R ¯ f(x) min i {kx−x i k− ¯ λ i }dA≤t which implies that ¯ f∈ Ω(λ 0 ) and therefore that max f∈Ω(λ 0 ) s R −f(x) logf(x)dA≥ max f∈Ω( ¯ λ) s R −f(x) logf(x)dA as desired. 46 We now have a fast method for generating cutting planes associated with problem (5.7) and thereby recovering the distribution f ∗ that solves problem (5.1); see Algorithm 2 for a formal description. Input: A compact, planar regionR containing a set of distinct points x 1 ,...,x n which are interpreted as an empirical distribution ˆ f, a distance parameter t, and a tolerance . 
Output: An -approximation of the distribution f ∗ that maximizes s R −f(x) logf(x)dA subject to the constraint thatD(f, ˆ f)≤t. /* This is a standard analytic center cutting plane method applied to problem (5.7), which has an n-dimensional variable space. */ Set UB =∞ and LB =−∞; Set Λ ={λ∈R n : e T λ = 0, λ i ≤ diam(R)∀i}; while UB−LB> do Let ¯ λ be the analytic center of Λ; /* Build an upper bounding ¯ f for the original problem (5.1). */ Let ¯ ν 0 , ¯ ν 1 be the solution to problem (5.8) with ¯ λ as an input; Let ¯ f(x) =e (¯ ν0 mini{kx−xik− ¯ λi}+¯ ν1−1) ; Let UB = s R − ¯ f(x) log ¯ f(x)dA; /* Build a lower bounding ˜ f that is feasible for (5.1) by construction. */ Let ¯ R i ={x∈R :kx−x i k− ¯ λ i ≤kx−x j k− ¯ λ j ∀j} for i ={1,...,n}; Let ˜ ν∈R n+1 be the solution to problem (5.5) with inputs ¯ R 1 ,..., ¯ R n ; Let ˜ f be defined by setting ˜ f(x) =e (˜ ν0kx−xik+˜ νi−1) on each ¯ R i ; Let LB = s R − ˜ f(x) log ˜ f(x)dA; Let g i =− s ¯ Ri ¯ f(x)dA for i ={1,...,n}; LetH ={λ∈R n : g T λ≥ g T ¯ λ} and set Λ = Λ∩H; end return ˜ f; Algorithm 2: Algorithm MaximumEntropyDensity takes as input a compact planar region containing a set of n distinct points, a distance threshold t, and a tolerance . 47 Chapter 6 Highest Density Region Optimization The input to our problem is a set of distinct landmark pointsx 1 ,...,x n in a compact planar region R, which we assume are sampled from some (unknown) probability density function f. LetS denote the sub-region ofR with area equal to a given non-negative parameter α (α≤Area(R)). The highest density region is theS with the largest coverage probability over it. The largest coverage probability function of a random variable X that follows a probability density function f with a supportR is defined as h(f) = max S⊆R:Area(S)=α s S f(x)dA, which is convex in f. We now let ˆ f denote the empirical distribution on thesen pointsx i . We will search through all distributions f whose Wasserstein distance to ˆ f is sufficiently small, i.e. whereD(f, ˆ f)≤t; here D(·,·) is the Wasserstein distance from Definition 3 and t is a parameter that will be discussed in Section 7.1. The problem of finding the worst-case highest density region distribution, subject to theWassersteindistanceconstraint,isthenwrittenastheinfinite-dimensionalconvexoptimization problem minimize f max S⊆R:Area(S)=α x S f(x)dA s.t. (6.1) D(f, ˆ f) ≤ t x R f(x)dA = 1 f(x) ≥ 0 ∀x∈R. 48 This is our problem of interest throughout this section. To begin, we express the distance con- straintD(f, ˆ f)≤t in (6.1) explicitly, obtaining the following problem minimize fi max S⊆R:Area(S)=α x S n X i=1 f i (x)dA s.t. (6.2) n X i=1 x R kx−x i kf i (x)dA ≤ t x R f i (x)dA = 1 n ∀i f i (x) ≥ 0 ∀i,x∈R, here the functionf i (x) simply describes the amount of the distribution at pointx∈S that should be moved to point x i . Lemma 17. Problem (6.2) is equivalent to the following problem minimize fi,u,y αy + x R u(x)dA s.t. (6.3) n X i=1 x R kx−x i kf i (x)dA ≤ t x R f i (x)dA ≥ 1 n ∀i u(x) +y− n X i=1 f i (x) ≥ 0 ∀x∈R f i (x) ≥ 0 ∀i,x∈R u(x) ≥ 0 ∀x∈R y ≥ 0. Proof. We introduce an auxiliary non-negative variable y∈R, and rewrite the objective function in (6.2) as minimize fi,y αy + x R max(0, n X i=1 f i (x)−y)dA. 49 If we choosey to be the largest value that is smaller or equal than the density at any point in the highest density region, i.e. P n i=1 f i (x)≥y∀x∈S ∗ α . 
We have αy + x R max(0, n X i=1 f i (x)−y)dA = αy + max S⊆R:Area(S)=α x S ( n X i=1 f i (x)−y)dA = max S⊆R:Area(S)=α x S n X i=1 f i (x)dA =s y (f i ). Since s y (f i ) is attained for a particular choice of y, we obtain that s y (f i ) is bounded below by the minimum over y max S⊆R:Area(S)=α x S n X i=1 f i (x)dA≥ minimize y αy + x R max(0, n X i=1 f i (x)−y)dA. On the other hand, for every y, we have s y (f i ) = max S⊆R:Area(S)=α x S ( n X i=1 f i (x)−y +y)dA = αy + max S⊆R:Area(S)=α x S ( n X i=1 f i (x)−y)dA ≤ αy + max S⊆R:Area(S)=α x S max(0, n X i=1 f i (x)−y)dA ≤ αy + x R max(0, n X i=1 f i (x)−y)dA. Since the upper bound above is valid for every y, it remains valid when minimizing over y, and we have max S⊆R:Area(S)=α x S n X i=1 f i (x)dA≤ minimize y αy + x R max(0, n X i=1 f i (x)−y)dA. 50 Therefore we have max S⊆R:Area(S)=α x S n X i=1 f i (x)dA = minimize y αy + x R max(0, n X i=1 f i (x)−y)dA. Then we can get the following equivalent problem minimize fi,y αy + x R max(0, n X i=1 f i (x)−y)dA s.t. (6.4) n X i=1 x R kx−x i kf i (x)dA ≤ t x R f i (x)dA = 1 n ∀i f i (x) ≥ 0 ∀i,x∈R. y ≥ 0. Let u(x) = max(0, P n i=1 f i (x)−y). After variable substitution in (6.4), we can get the following problem minimize fi,u,y αy + x R u(x)dA s.t. (6.5) n X i=1 x R kx−x i kf i (x)dA ≤ t x R f i (x)dA = 1 n ∀i u(x) +y− n X i=1 f i (x) ≥ 0 ∀x∈R f i (x) ≥ 0 ∀i,x∈R u(x) ≥ 0 ∀x∈R y ≥ 0. 51 It’s routine to verify that the constraint s R f i (x)dA = 1 n can be replaced with an inequality (in a nutshell, this is because we are allowed to make f i (x) as small as we like). Thus, we get the following problem minimize fi,u,y αy + x R u(x)dA s.t. n X i=1 x R kx−x i kf i (x)dA ≤ t x R f i (x)dA ≥ 1 n ∀i u(x) +y− n X i=1 f i (x) ≥ 0 ∀x∈R f i (x) ≥ 0 ∀i,x∈R u(x) ≥ 0 ∀x∈R y ≥ 0, which completes the proof. This is our final primal problem. To circumvent proving the existence of optimal solution, we look at the following pre-dual problem instead maximize ν0,νi,λ −ν 0 t + 1 n n X i=1 ν i s.t. (6.6) x R λ(x)dA ≤ α ν 0 kx−x i k +λ(x)−ν i ≥ 0 ∀i,x∈R λ(x) ≤ 1 ∀x∈R ν 0 ≥ 0 ν i ≥ 0 ∀i λ(x) ≥ 0 ∀x∈R. 52 We call the problem pre-dual problem because the primal problem is the dual of it, which is proven in the lemma below: Lemma 18. Problem (6.3) is the dual of Problem (6.6). Proof. WewillembedλintheBanachspaceL 1 (R),hereafterabbreviatedsimplytoL 1 ,whichcon- sists of all functions that are absolutely Lebesgue integrable onR. This is an infinite-dimensional problem with an infinite-dimensional constraint space, and it is routine to apply Theorem 5 to derive the dual problem minimize fi,u,y αy + x R u(x)dA s.t. n X i=1 x R kx−x i kf i (x)dA ≤ t x R f i (x)dA ≥ 1 n ∀i u(x) +y− n X i=1 f i (x) ≥ 0 ∀x∈R f i (x) ≥ 0 ∀i,x∈R u(x) ≥ 0 ∀x∈R y ≥ 0. The dual of problem (6.6) is exactly the same as problem (6.3), which is our final primal problem. That’s why we call problem (6.6) pre-dual problem. According to Theorem 5, if there exists a finite optimal solution to the pre-dual problem, i.e. problem (6.6), there must also exists a finite optimal solution to the primal problem with the same objective value, i.e. problem (6.3). Therefore, we will first solve the pre-dual problem in section 6.1 and then use complementary slackness to find optimal primal solution in section 6.2. 53 6.1 Solve (6.6) Efficiently Problem (6.6) is an infinite-dimensional problem with an infinite-dimensional constraint space, and thus hard to solve directly. 
We can obtain an equivalent formulation with a finite-dimensional variable space in the lemma below: Lemma 19. Problem (6.6) is equivalent to the following problem maximize ν0,νi −ν 0 t + 1 n n X i=1 ν i s.t. (6.7) x R max(0, max i {ν i −ν 0 kx−x i k})dA ≤ α ν i ≤ 1 ∀i ν 0 ≥ 0 ν i ≥ 0 ∀i. Proof. Since λ(x)≥ 0 and λ(x)≥ν i −ν 0 kx−x i k∀i, we have λ(x)≥ max(0, max i {ν i −ν 0 kx−x i k}) ∀x∈R. Since λ(x)≤ 1∀x∈R, we have max(0, max i {ν i −ν 0 kx−x i k})≤ 1 ∀x∈R, 54 therefore we have ν i ≤ 1 ∀i. Then we can rewrite the problem as following maximize ν0,νi,λ −ν 0 t + 1 n n X i=1 ν i s.t. (6.8) x R λ(x)dA ≤ α λ(x) ≥ max(0, max i {ν i −ν 0 kx−x i k}) ∀x∈R ν i ≤ 1 ∀i ν 0 ≥ 0 ν i ≥ 0 ∀i. Given values of ν 0 and ν i , λ(x) is irrelevant to the objective function. We want to make λ(x) as smallaspossibletosatisfy s R λ(x)dA≤αwithoutviolatingλ(x)≥ max(0, max i {ν i −ν 0 kx−x i k}). Therefore, we can simply letλ(x) = max(0, max i {ν i −ν 0 kx−x i k}), obtaining the equivalent for- mulation maximize ν0,νi −ν 0 t + 1 n n X i=1 ν i s.t. x R max(0, max i {ν i −ν 0 kx−x i k})dA ≤ α ν i ≤ 1 ∀i ν 0 ≥ 0 ν i ≥ 0 ∀i, which completes the proof. This is a convex optimization problem with a finite-dimensional variable space and thereby we can use first-order methods to solve it. For notational compactness, letν = [ν 0 ,ν 1 ...ν n ], which is the optimization variable. Let h(ν) =−ν 0 t + 1 n P n i=1 ν i , which is the objective function. Let 55 g(ν) = s R max(0, max i {ν i −ν 0 kx−x i k})dA, which is the constraint function. In this section, we use an analytical center cutting plane method[9] to solve problem (6.7). When cutting plane oracle is queried at ¯ ν∈R (n+1) , • if ¯ ν is feasible, we have deep objective cut−∇h(¯ ν) T ν≤−∇h(¯ ν) T ¯ ν−h (k) best +h(¯ ν), where ∇h(¯ ν) T = [−t, 1 n ... 1 n ] and h (k) best is the best (largest) objective value found so far in the algorithm; • if ¯ ν is not feasible, we have deep feasibility cut∇g(¯ ν) T ν≤∇g(¯ ν) T ¯ ν +α−g(¯ ν), where ∇g(¯ ν) is the gradient vector of the constraint function g. We will show how to calculate ∇g(¯ ν) in the following. Lemma 20. LetR i be the set of pointsx for whichν i −ν 0 kx−x i k is maximal, i.e. ν i −ν 0 kx−x i k≥ ν j −ν 0 kx−x j k∀j. Let B i be the set of points x for which ν i −ν 0 kx−x i k≥ 0. The gradient vector of the constraint function g takes the form ∇g(ν) = [− n X i=1 x Ri∩Bi kx−x i kdA, Area(R 1 ∩B 1 )... Area(R n ∩B n )]. Proof. The partial derivative of g(ν) with respect to ν 0 is ∂g(ν) ∂ν 0 = ∂ ∂ν 0 x R max(0, max i {ν i −ν 0 kx−x i k})dA = x R ∂ ∂ν 0 max(0, max i {ν i −ν 0 kx−x i k})dA = − n X i=1 x Ri∩Bi kx−x i kdA. 56 The partial derivative of g(ν) with respect to ν i is ∂g(ν) ∂ν i = ∂ ∂ν i x R max(0, max i {ν i −ν 0 kx−x i k})dA = x R ∂ ∂ν i max(0, max i {ν i −ν 0 kx−x i k})dA = Area(R i ∩B i ). Therefore, the gradient vector of the constraint function g takes the form ∇g(ν) = [− n X i=1 x Ri∩Bi kx−x i kdA, Area(R 1 ∩B 1 )... Area(R n ∩B n )], which completes the proof. See Algorithm 3 for a formal description of the analytical center cutting plane method. Input: A compact, planar regionR containing a set of distinct landmark points x 1 ,...,x n which are interpreted as an empirical distribution ˆ f, an area parameter α, a distance parameter t, and a tolerance . Output: An -approximation of the variable ν ∗ that maximizes h(ν) =−ν 0 t + 1 n P n i=1 ν i subject to the constraint that g(ν) = s R max(0, max i {ν i −ν 0 kx−x i k})dA≤α. 
/* This is a standard analytic center cutting plane method applied to problem (6.7), which has an (n + 1)-dimensional variable space. */ Set the initial polyhedron P 0 ={ν∈R (n+1) :ν≥ 0,ν 0 ≤ 1,ν i ≤ 1 t ∀i}; Set k = 0; while u (k) best −h (k) best > do Let ¯ ν be the analytic center of P k ; Query cutting plane oracle at ¯ ν; /* If ¯ ν is feasible, we have deep objective cut */ If g(¯ ν)≤α, letH ={−∇h(¯ ν) T ν≤−∇h(¯ ν) T ¯ ν−h (k) best +h(¯ ν)}; /* If ¯ ν is not feasible, we have deep feasibility cut */ Else, letH ={∇g(¯ ν) T ν≤∇g(¯ ν) T ¯ ν +α−g(¯ ν)}; Set P k+1 =P k ∩H; If P k+1 =φ, quit; Set k =k + 1; end return h (k) best ; Algorithm 3: Algorithm WorstHDRDensity takes as input a compact planar region contain- ing a set of n distinct points, an area parameter α, a distance parameter t, and a tolerance . 57 6.2 Find the Worse-case Distribution From the preceding section, we can solve problem (6.7) efficiently using the analytic center cut- ting plane method. Now assume we have optimal values of ν ∗ 0 > 0 (otherwise uniform distri- bution would be feasible) and ν ∗ i ≥ 0. Then it must be feasible and optimal to set λ ∗ (x) = max(0, max i {ν ∗ i −ν ∗ 0 kx−x i k}). We will find the optimal distribution f ∗ via complementary slackness using the optimal pre-dual variable ν ∗ . The optimal distribution f ∗ is characterized in the following theorem: Theorem 21. The worse-case distribution f ∗ that solves problem (6.3), and therefore (6.2), is a mixed-type distribution with possibly Dirac delta functions at landmark points and uniform distribution over some hyperbolic and circular regions. Proof. We have the pre-dual problem maximize ν0,νi,λ −ν 0 t + 1 n n X i=1 ν i s.t. x R λ(x)dA ≤ α ν 0 kx−x i k +λ(x)−ν i ≥ 0 ∀i,x∈R λ(x) ≤ 1 ∀x∈R ν 0 ≥ 0 ν i ≥ 0 ∀i λ(x) ≥ 0 ∀x∈R, 58 and the primal problem minimize fi,u,y αy + x R u(x)dA s.t. n X i=1 x R kx−x i kf i (x)dA ≤ t x R f i (x)dA ≥ 1 n ∀i u(x) +y− n X i=1 f i (x) ≥ 0 ∀x∈R f i (x) ≥ 0 ∀i,x∈R u(x) ≥ 0 ∀x∈R y ≥ 0. • Ifλ ∗ (x) = 1,x must be a landmark point. We haveu ∗ (x) = P n i=1 f ∗ i (x)−y ∗ andu ∗ (x)≥ 0, implying that P n i=1 f ∗ i (x)≥y ∗ . This means we could have a delta function at the landmark point x. • If 0 < λ ∗ (x) < 1, we have u ∗ (x) = P n i=1 f ∗ i (x)− y ∗ and u ∗ (x) = 0, implying that P n i=1 f ∗ i (x) = y ∗ . This means there is a positive density equal to y ∗ at x, and it gets transported to arg max i {ν ∗ i −ν ∗ 0 kx−x i k}. • Ifλ ∗ (x) = 0, we haveu ∗ (x)≥ P n i=1 f ∗ i (x)−y ∗ andu ∗ (x) = 0, implying that P n i=1 f ∗ i (x)≤ y ∗ . – If ν ∗ i −ν ∗ 0 kx−x i k< 0∀i, then f ∗ i (x) = 0∀i, and therefore we have P n i=1 f ∗ i (x) = 0. – If max i {ν ∗ i −ν ∗ 0 kx−x i k}) = 0, we still have P n i=1 f ∗ i (x)≤y ∗ . This means the density at x is bounded above by y ∗ , and thus there is no delta function at x. Basedontheabovereasoning, itfollowsthateachworst-casedistributionf ∗ i (x)ischaracterized as follows: 59 (1) Let R i be the set of points x for which ν ∗ i −ν ∗ 0 kx−x i k is maximal, i.e. ν ∗ i −ν ∗ 0 kx−x i k≥ ν ∗ j −ν ∗ 0 kx−x j k ∀j. (2) Let B i be the set of points x for which ν ∗ i −ν ∗ 0 kx−x i k≥ 0. (3) Wesetf ∗ i (x)tobeequaltosome(unknown)valuey ∗ onR i ∩B i , andequaltosome(unknown) delta function value at x i itself. (4) To determine what the value of y ∗ and the deltas should be, we set d i = s Ri∩Bi kx−x i kdA. Since ν ∗ 0 > 0, the constraint P n i=1 s R kx−x i kf ∗ i (x)dA≤ t must be binding, and therefore we set y ∗ = t P i d i . 
The delta components are whatever is left over to make sure each f ∗ i (x) integrates to 1/n. 60 Chapter 7 Computational Experiments In this section, we first show how to select the distance parameter t in the Wasserstein distance constraint D(f, ˆ f) ≤ t. Then we apply our theoretical results of the highest density region optimization problem to two computational experiments: the first experiment shows the impact of increasing the number of samplesn, and the second is a districting strategy in which we divide a map into pieces and assign a drone to each piece in order to maximize the number of targets to be captured. 7.1 Selecting the Distance Parameter t From the preceding discussion, it is clear that the parameter t in the Wasserstein distance con- straintD(f, ˆ f) ≤ t from our original problem (4.1) has a significant impact on the problem solution. Of course, in practice, we do not have any way of a priori calculating an exact value of t. However, in order to estimate t in a data-driven fashion, the following result is helpful: Theorem 22. Let ˆ f 1 and ˆ f 2 denote empirical distributions associated with two sets of samples of n points from a distribution f. Then 1 2 ED( ˆ f 1 , ˆ f 2 )≤ ED(f, ˆ f 1 )≤ ED( ˆ f 1 , ˆ f 2 ). 61 Proof. This is due to [14], and follows from Jensen’s inequality and the triangle inequality. TheaboveresultisusefulbecausethedistancebetweenthetwoempiricaldistributionsD( ˆ f 1 , ˆ f 2 ) is simply the cost of a minimum-weight bipartite matching between the elements of ˆ f 1 and ˆ f 2 , multiplied by a factor of 1/n. Thus, one simple, “back-of-the-envelope” procedure to select the distance parameter t would be to sample two sets of n points each, let c be equal to the cost of the minimum-weight bipartite matching between them, and set t =γc with γ∈ [1/2, 1]. If we desire rigorous probabilistic bounds on t, more sophisticated machinery is required. Theorem 6.15 of [51] gives a useful bound on the Wasserstein distance between two probability density functions f 1 and f 2 by D(f 1 ,f 2 )≤ x R kx 0 −xk·|f 1 (x)−f 2 (x)|dA for anyx 0 ∈R. Theorem 1(i) of [8] relates the right-hand side of the above to the relative entropy H(f 1 |f 2 ) between f 1 and f 2 by the expression x R kx 0 −xk·|f 1 (x)−f 2 (x)|dA≤ 3 2 + log x R e 2kx−x0k f 2 (x)dA ! p H(f 1 |f 2 ) + 1 2 H(f 1 |f 2 ) , where we define H(f 1 |f 2 ) = x R f 1 (x) log f 1 (x) f 2 (x) dA. Let r = min x0∈R max x∈R kx−x 0 k denote the “radius” ofR, whence log s R e 2kx−x0k f 2 (x)dA≤ loge 2r = 2r. Thus, ifD(f 1 ,f 2 )≥t, we have t≤D(f 1 ,f 2 ) ≤ x R kx 0 −xk·|f 1 (x)−f 2 (x)|dA ≤ 3 2 + 2r p H(f 1 |f 2 ) + 1 2 H(f 1 |f 2 ) =⇒ H(f 1 |f 2 ) ≥ 8r− 2 √ 16r 2 + 16rt + 24r + 12t + 9 + 4t + 6 3 + 4r . (7.1) 62 Next, the paper [7] shows that, for any distribution f with empirical distribution ˆ f, we have lim sup n→∞ 1 n log Pr(D(f, ˆ f)≥t)≤−α(t), where the function α(t) is defined as α(t) = inf g:D(f,g)≥t H(f|g). the result (7.1) establishes that α(t)≥ (8r− 2 √ 16r 2 + 16rt + 24r + 12t + 9 + 4t + 6)/(3 + 4r), and therefore we find that lim sup n→∞ 1 n log Pr(D(f, ˆ f)≥t) ≤ − 8r− 2 √ 16r 2 + 16rt + 24r + 12t + 9 + 4t + 6 3 + 4r =⇒ Pr(D(f, ˆ f)≥t) > exp −n 8r− 2 √ 16r 2 + 16rt + 24r + 12t + 9 + 4t + 6 3 + 4r ! , (7.2) wherethenotation“>”reflectstheapproximateinequalitythatresultsfromdroppingthe“lim sup” term. Thus, given a desired significance level 1−β, we can construct a threshold distance t by equating the right-hand side of (7.2) to 1−β and solving for t. 
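To make the back-of-the-envelope procedure described earlier in this section concrete, the following is a minimal sketch in Python; the function name and the use of SciPy's Hungarian-method solver are our own illustrative choices under the stated assumptions, not part of the dissertation's implementation. It computes the minimum-weight bipartite matching cost between two equally sized sample sets, divides by $n$ (which is exactly $\mathcal D(\hat f_1,\hat f_2)$), and scales by $\gamma\in[1/2,1]$ as suggested by Theorem 22.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def wasserstein_threshold(sample1, sample2, gamma=1.0):
        # The Wasserstein distance between two empirical distributions on
        # n points each is the cost of a minimum-weight bipartite matching
        # between them, divided by n; Theorem 22 then suggests setting
        # t = gamma * D(f1_hat, f2_hat) with gamma in [1/2, 1].
        n = len(sample1)
        assert len(sample2) == n
        # n-by-n matrix of pairwise Euclidean distances.
        cost = np.linalg.norm(sample1[:, None, :] - sample2[None, :, :], axis=2)
        row, col = linear_sum_assignment(cost)  # Hungarian method
        return gamma * cost[row, col].sum() / n

    # Example: two independent samples of n = 100 points in the unit square.
    rng = np.random.default_rng(seed=1)
    t = wasserstein_threshold(rng.random((100, 2)), rng.random((100, 2)))

In practice, one would obtain the two samples by randomly splitting the available data points in half, as is done in the districting experiment of Section 7.3.4.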
Figure 7.1 shows a plot of these threshold values of $t$ as a function of $\beta$, for $n = 100$ samples in the unit square.

Figure 7.1: Threshold values of $t$ for significance levels between 0 and 1, where $n = 100$ points are sampled in the unit square. For example, at $1-\beta = 0.9$ we have $t = 0.102$; this means that, if 100 samples are drawn from any distribution $f$ in the unit square, then there is at least a 90% probability that $\mathcal D(f,\hat f)\le 0.102$.

7.2 Varying Values of n and t

In this section, we conduct computational experiments on highest density region optimization, showing the impact of increasing the number of samples $n$ and the distance parameter $t$. Let $\mathcal R$ be the unit square, let the area parameter be $\alpha = 0.4$, and as a ground truth distribution $\bar f$ we use an even mixture of two truncated Gaussian distributions with means $\mu_1 = (0.400, 0.187)$ and $\mu_2 = (0.795, 0.490)$ and covariance matrices $\Sigma_1 = \Sigma_2 = \begin{pmatrix} 0.040 & 0\\ 0 & 0.040\end{pmatrix}$. This mixture was chosen because it satisfies $\max_{S\subseteq\mathcal R:\,\mathrm{Area}(S)=\alpha}\iint_S \bar f(x)\,dA = 0.7635$ and therefore represents a compromise between extreme clustering (which would have this maximum close to 1) and a perfectly uniform distribution (which would have it equal to 0.4). For $n\in\{2,\dots,100\}$, we performed 10 independent experiments in which we drew $n$ samples from $\bar f$ and then obtained the worst-case HDR distribution $f^*$ by solving problem (6.1) via Algorithm 3 (hence $99\times 10$ experiments in total). For each experiment, we defined our distance constraint using Theorem 22 by setting $t$ to be the cost of a minimum-weight bipartite matching between two independent collections of samples of size $n$ from $\bar f$ (multiplied by a factor of $1/n$). Figure 7.2a shows a plot of the worst-case HDR probabilities $\max_{S\subseteq\mathcal R:\,\mathrm{Area}(S)=\alpha}\iint_S f^*(x)\,dA$ as $n$ varies, and Figure 7.2b shows the same data, only using the Wasserstein distance threshold $t$ as the independent variable. Not surprisingly, it is clear that the worst-case probability increases as $n$ increases and as $t$ decreases. Figure 7.2b suggests that the worst-case probability, measured as a function of $t$, decreases in a convex fashion. For purposes of comparison, Figure 7.2c shows the estimates of $\max_{S\subseteq\mathcal R:\,\mathrm{Area}(S)=\alpha}\iint_S \bar f(x)\,dA$ obtained when one uses a uniform kernel density estimator.

Figure 7.2: Figure 7.2a shows the worst-case probabilities that are computed during the $99\times 10$ executions of our algorithm; the gray plots indicate the results obtained from individual samples and the thick line indicates the sample averages of the 10 trials for fixed $n$. Figure 7.2b shows the same data set, only we plot the worst-case probabilities as a function of the Wasserstein distance threshold $t$ rather than as a function of $n$; the gray points indicate individual experiments and the dark points again indicate the sample averages of the 10 trials for fixed $n$. For purposes of comparison, Figure 7.2c shows the estimates of $\max_{S\subseteq\mathcal R:\,\mathrm{Area}(S)=\alpha}\iint_S \bar f(x)\,dA$ obtained when one uses a uniform kernel density estimator; that is, if we draw $n$ samples $x_1,\dots,x_n$ from $\bar f$, then we define an estimator $\tilde f$ by setting $\tilde f(x) = \frac{1}{C}\sum_i \mathbf 1(\|x-x_i\|\le r)$, where $r$ is a "bandwidth" parameter and $C$ is a normalization constant.
As in the preceding two figures, the gray plots indicate the results from individual samples, the thick lines indicate sample averages of 10 trials for fixed n, and the dashed line indicates the true value of max S⊆R:Area(S)=α s S ¯ f(x)dA; furthermore, as indicated, we used 5 different values of r between 0.10 and 0.36; note that the estimate max S⊆R:Area(S)=α s S ˜ f(x)dA is highly sensitive to the choice of r. 65 7.3 A Districting Experiment with Geospatial Data In this section, we describe an experiment in which we divide a service region into districts so as to allocate the workloads of a fleet of drones. This experiment is much more elaborate than that of the preceding section because we use data from an industrial collaborator, rather than simply generate from a Gaussian mixture distribution. Our dataset comes from an industrial affiliate, a major North American mapping company. In a nutshell, the problem faced by the affiliate is to collect map data using vehicles that are equipped with rotating LIDAR, positioning sensors, and high-resolution panoramic cameras. Scans of this kind usually consist of two stages: the first stage, called a rough scan, is essentially a large-scale, multi-vehicle Chinese Postman Problem in which the goal is to traverse every street (or a large collection of streets) in a region. The second stage, which turns out to be equally time-consuming and is the subject of this experiment, is called a targeted scan, and consists of visiting those points in the region for which the measurements in the first stage were not adequate; this might be because the driver missed a street, the sensor’s view was obstructed, or because an alternate vantage point of a particular location is needed. It is often the case that, when performing a 3-dimensional scan of a structure (see Figure 7.3 for an example), multiple trips to that structure are needed to accurately recover the original shape from the measurements. (a) (b) Figure 7.3: A 3-dimensional LIDAR scan taken from a vehicle travelling along a city street (7.3a); these measurements are aggregated together to form point cloud representations of entire cities, as shown in 7.3b. 66 Instead of using vehicles for the second stage, i.e. the targeted scan stage, we use drones to visit targets. Drones have a low per-mile cost, can operate without human intervention, and can travel at high speeds while being unaffected by road traffic, and therefore are ideal for the targeted scan stage. We describe here an algorithm for designing drone districts to maximize the number of targets to be captured during the targeted scan stage, given constraints on the number of drones and the total distance that each drone can travel (due to battery life). This problem satisfies the following four attributes: 1. The service region is a compact Euclidean domain R, and distances between points are Euclidean, or close to some “natural” metric such as the ` 1 or ` ∞ norm. Each service region is the metropolitan region surrounding a major city. 2. During each service period, a fleet of aerial vehicles (i.e. drones) must visit a collection of destination points inR that are sampled from a probability dis- tribution f. The destination points are determined as follows: once the rough scan (the first stage) has been completed, the raw data from all vehicles are dumped to an in-house cloud computing service, where they are “scrubbed” to identify those locations that must be re-scanned. 
Since this scrubbing process is performed in parallel across many CPUs, and because certain errors take longer to find than others (in fact, many errors are explicitly found only by human analysts), these locations can essentially be thought of as occurring randomly. 3. The goal is to partitionR into districts, with one district per drone, in a way that is “optimal” as the number of destination points becomes large. We seek a districting strategy because it is much simpler to implement than explicitly computing drone tours, and also because additional destination points may arrive in an online fashion as they are identified by the cloud computing service. Districting strategies (as opposed to explicit routing strategies) are very common in the mapping industry because of the ease 67 of operation that is afforded. In addition, since the number of drones as well as the total distance that each drone can travel (due to battery life) are limited, the primary goal is to maximize the number of targets to be captured during the targeted scan stage. 4. The distribution f is unknown at the time that the districting decision is to be made, and there is only a set of independent samples from f available to make the districting decision. The districting decision must be made soon after the rough scan is completed, at which point there are a limited number of sample points available that are obtained from scrubbing the initial dataset. The sample points are generated overnight, after the rough scans are completed. Their distribution is non-uniform throughout the region because of spatial clustering: for example, erroneous data points often arise due to a complex 3-dimensional object (e.g. a sculpture or some elaborate architecture) or if the sun is shining at an unusual angle, so that many of the measurements in a localized area are affected. The assumption of independence is a strong one and requires some care. Simply put, it arises due to the fact that scrubbing takes place across many CPUs simultaneously. As an extreme example, if one were to use a single CPU to process the dataset, it would scan through the measurements and detect errors in the same order that the vehicle visited them. Thus, the samples would appear in an extremely dependent way. Conversely, if we distribute the scanning process across many CPUs, then each CPU is responsible for only a small leg of the vehicle’s tour, and two CPUs might detect errors in two entirely separate locations at the same time. As more CPUs are used, we see less dependence between (temporally) consecutive samples. Thepurposeofthisexperimentistwo-fold: first, weaimtodemonstratethattheproposedcon- tinuous approximation techniques are actually useful for solving practical problems, and second, we then seek to show that our proposed methodology is superior to that of existing approaches. 68 There are 7 different service regionsR that will be studied, all of which are metropolitan regions in Western Europe. 7.3.1 Distance-constrainedNeighborhoodTravellingSalesmanProblem We can model the tour of drone in each district as a distance-constrained neighborhood travelling salesman problem (DCNTSP). Each drone has a tour length constraint l because of battery life, and also a surveillance radius r which is determined by the camera angle and flying height. Our objective is to capture as many target points as possible during the tour. Figure 7.4a shows the tour of a drone that can capture part of the target points. 
Figure 7.4b shows the optimal tour of a drone in the sense that the percentage of the target points that can be captured by the drone is the largest. (a) (b) Figure 7.4: Figure 7.4a shows the tour of a drone that can capture part of the target points. Figure 7.4b shows the optimal tour of a drone in the sense that the percentage of the target points that can be captured by the drone is the largest. To formally state the problem, we have length ` and a radiusr, and we want a path of length ` that visits as many points as possible within radius r. Call this DCNTSP(p 1 ,...,p n ;r,`). We claim that lim `→∞;r=α/(2`) lim n→∞ sup 1 n DCNTSP(p 1 ,...,p n ;r,`) =HDR(f;α) 69 with probability one when the p i ’s are iid samples from f. In fact, the upper bound lim `→∞;r=α/(2`) lim n→∞ sup 1 n DCNTSP(p 1 ,...,p n ;r,`) ≤HDR(f;α) is the only interesting case because it is obvious that we achieve the lower bound in the limit (just drive through the densest areas). Lemma 23. LetL be a loop in the plane with length ` and let r > 0, and assume that r≤ ` 2π . The area of an r-neighborhood ofL is at most Area(N r (L))≤ 2r`. Proof. Assume without loss of generality thatL forms the boundary of a convex region and that L is piecewise linear, i.e. polygonal. The r-neighborhood ofL has an “inner” portion R in and an “outer” portionR out as shown in Figure 7.5a, and we can also see from Figure 7.5a that the area of the “outer” portionR out is always exactlyπr 2 +r`. It is also obvious that the outer perimeter of R out is exactly ` + 2πr. The area of the “inner” portion R in is a little more complicated to bound; first, for any r 0 ≤r, we letL r 0 denote the closed curve insideL consisting of points that are exactly r 0 away from their nearest point inL (thus, our originalL would simply be written asL 0 under this notation). It is of course possible thatL r 0 =∅ for sufficiently larger 0 . SinceR in is simply the union of all curvesL r 0 over r 0 ∈ [0,r], the coarea formula [36] says that the area of R in is obtained by simply integrating the length ofL r 0 from r 0 = 0 to r 0 =r: Area(R in ) = Z r 0 length(L r 0) dr 0 . Let R 0 out denote the “outer” portion of the r 0 -neighborhood ofL r 0, as shown in Figure 7.5b. We then haveR 0 out ⊆R in , which by convexity implies that the outer perimeter of R 0 out is less than or 70 (a) (b) Figure 7.5: (a) shows the neighborhoodsR in andR out ; note thatR out is the union of the hatched circularsectorsandshadedrectangles, andtherefore Area(R out ) =πr 2 +r`and perimeter(R out ) = ` + 2πr. (b) shows an inner loopL r 0 and illustrates thatR 0 out ⊆R in for allr 0 (which characterize R 0 out ). equal to the length ofL, i.e. `. However, by the same reasoning as our calculation of the outer perimeter of R out , we also see that the outer perimeter of R 0 out is exactly length(L r 0) + 2πr 0 , and therefore ` = length(L)≥ perimeter(R 0 out ) = length(L r 0) + 2πr 0 =⇒ length(L r 0) ≤ `− 2πr 0 provided thatL r 0 exists, i.e. that length(L r 0)> 0. Thus, we see that Area(R in ) = Z r 0 length(L r 0) dr 0 ≤ Z r 0 max{`− 2πr 0 , 0} dr 0 =r`−πr 2 from which the desired result follows, since Area(N r (L)) = Area(R in ) + Area(R out ). Lemma 24. Consider fixed values of ` and r =α/(2`), and letL be a loop of length `. Suppose that the region (the unit square) is broken into identical squares of side length δ. Let s be the number of squares that touch N r (L). Then s≤ 2r` δ 2 + 2 √ 2` δ . 71 Proof. 
It is easy to see that any square that touches N r (L) must be fully contained within N r+δ √ 2 (L). Since Area(N r+δ √ 2 (L)) ≤ 2(r + δ √ 2)` by the preceding lemma, the maximum number of squares that could be fully contained in N r+δ √ 2 (L) is 2(r +δ √ 2)` δ 2 = 2r` δ 2 + 2 √ 2` δ as desired. Nowsupposethattheunitsquareisbrokenintosquaresofsidelengthδ,andletMAXSQ(p 1 ,...,p n ;δ,s) be the maximum number of points we can cover by selectings squares (so, just pick thes densest squares; this is like a discretized HPD). It is obvious that, for any δ and s = 2r` δ 2 + 2 √ 2` δ , we therefore have DCNTSP(p 1 ,...,p n ;r,`)≤MAXSQ(p 1 ,...,p n ;δ,s) since the right hand side involves the selection of any s squares and the left-hand side involves a restriction that we have to form a closed path. Let’s assume that the densityf is actually a piecewise density function defined on a grid with squares of side length (this assumption is a weak one because could be arbitrarily small). We’ll write f(x) = 1/ 2 X k=1 a k 1(x∈ k ) where k denotes the k-th square and 1(·) denotes an indicator function. Assume WLOG that the squares are arranged in decreasing order of a k , i.e. a 1 ≥ a 2 ≥···. The variables with this density can be thought of as being produced by an experiment where one chooses a square k with probability 2 a k and then chooses a point at random from k . Let M i,k be the Bernoulli random variable that indicates if sample i belongs to square k. The strong law of large numbers 72 says that the fraction of points in square k, i.e. 1 n P n i=1 M i,k , converges to 2 a k with probability one. Therefore, we see that 1 n MAXSQ(p 1 ,...,p n ;,s)→ 2 s X i=1 a k =HDR(f; 2 s) with probability one, and if we like, we can make the grid even finer by selecting squares of side length δ =/m with fixed m∈Z: 1 n MAXSQ(p 1 ,...,p n ;δ,s)→δ 2 s X i=1 a k =HDR(f;δ 2 s). We’re basically done. For any fixed ` and r =α/(2`), let δ =/d` 2 e . We have 1 n DCNTSP(p 1 ,...,p n ;r,`)≤ 1 n MAXSQ(p 1 ,...,p n ;δ,s = 2r` δ 2 + 2 √ 2` δ ) and we just established that the right-hand side converges to 1 n MAXSQ(p 1 ,...,p n ;δ,s = 2r` δ 2 + 2 √ 2` δ )→HDR(f;δ 2 s) =HDR(f; 2r` + 2 √ 2δ`) =HDR(f; 2r` + 2 √ 2` d` 2 e ) which is a deterministic quantity that converges toHDR(f; 2r`) =HDR(f;α) as`→∞ as desired. 7.3.2 Districting Strategy We use a simple districting strategy to partition the service regionR into rectangular districts. For example, if we want to divideR into four pieces, we only need three variables, i.e., one for the vertical line and two for the horizontal lines. This is demonstrated in Figure 7.6. 73 (a) (b) Figure 7.6: Figure7.6a shows sample points over a region. Figure 7.6b shows using three variables to partition the region into four districts. After dividingR into districts D 1 ,...,D m , we want to find the worst-case distribution that maximizes the total capture probability of all districts, the problem of interest becomes minimize fi m X j=1 max Dj⊆R:Area(S)=α x S n X i=1 f i (x)dA s.t. (7.3) n X i=1 x R kx−x i kf i (x)dA ≤ t x R f i (x)dA = 1 n ∀i f i (x) ≥ 0 ∀i,x∈R. This problem is almost identical to problem 6.2, and we omit how to solve it here for brevity. 7.3.3 Districting Criteria We tried two different partition criteria in our experiments: Wasserstein robust partitioning Problem (7.3) in Section 7.3.2 of this dissertation describes the worst-case distribution that maximizes the total capture probability of all districts, 74 subject to the Wasserstein distance constraint. 
Therefore, we can search through all feasible partitions to find the one whose worst-case total capture probability is largest.

Bisection partitioning: Given a set of sample points $x_1,\dots,x_n\in\mathcal R$, another natural approach is to divide the service region into rectangles so that the number of sample points in each rectangle is approximately equal (a minimal code sketch of this criterion is given after Figure 7.7 below).

7.3.4 Results

We applied the two aforementioned districting criteria to problem instances defined on 7 different metropolitan regions using data provided by our industrial affiliate, where the goal is to partition each region into $m = 4$ districts with the area parameter $\alpha = 1/12$; that is, we want to use 4 drones to cover 1/3 of the region's area. When using the Wasserstein partitioning criterion, we set the distance threshold $t$ using the result from Theorem 22 with $\gamma = 1$. Specifically, we split the sample points into two equally-sized (random) sets $\hat f_1$ and $\hat f_2$ and computed a minimum-weight bipartite matching between the two; this process was repeated over 100 independent trials. We found this bound preferable to the rigorous result from (7.2) because, even for very low significance levels $1-\beta$, the distance threshold becomes unrealistically high (more specifically, the uniform distribution then lies in the distributional ambiguity region, which is plainly not the case, as the plots make clear). Six of these regions are shown in Figure 7.7, and Figure 7.8 shows the output of our method (as well as the other districting criterion) when applied to the seven metropolitan regions. In each of these experiments, the sample points (i.e. the larger markers in Figures 7.7 and 7.8) were provided by our industrial affiliate and were identified during the overnight scrubbing process that occurred after the vehicles' rough scans were finished. These sample points were identified automatically by software; most of the remaining destinations were detected by human analysts. The complete results for all 7 metropolitan regions are shown in Table 7.1. Note that the Wasserstein partition outperforms the bisection partition in four out of seven instances, and shows the same results as the bisection partition in the remaining three instances.

Region       # points   # samples   area (km²)   Wasserstein   bisection   no partition
Amsterdam    808        26          2528         0.7302        0.7302      0.7302
Bergen       972        19          2880         0.8025        0.7881      0.8025
Copenhagen   860        26          1680         0.7988        0.7884      0.7988
Helsinki     867        41          4372         0.7105        0.6390      0.7197
Oslo         952        29          3460         0.7374        0.7374      0.7374
Rotterdam    881        21          2652         0.7264        0.7264      0.7264
Zürich       965        30          1616         0.7741        0.7575      0.7741

Table 7.1: Capture probabilities for 7 regions and 2 partitioning criteria; for each region, the optimal criterion is highlighted and underlined.

Table 7.1 also shows an additional piece of information: the column labelled "no partition", which is obtained by computing the worst-case capture probability of the entire region with the area parameter set to $4\alpha$, i.e., 1/3. The Wasserstein partition is almost as good as no partition at all, which means that the Wasserstein partition does not introduce much cost for districting.

Figure 7.7: Six of the metropolitan regions surveyed in the computational study (Amsterdam, Geneva, Göteborg, Malmö, Oslo, and Zürich); the larger markers indicate data samples $x_i$ and the smaller markers indicate additional locations that were realized later.
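To make the bisection criterion of Section 7.3.3 concrete, here is a minimal sketch (again in Python, with illustrative function names of our own, and assuming distinct coordinates so that the median splits are well defined): it cuts at the median $x$-coordinate of the samples and then at the median $y$-coordinate of each half, mirroring the three-variable parameterization of Figure 7.6, so that each of the $m = 4$ rectangles contains roughly $n/4$ sample points.

    import numpy as np

    def bisection_partition(points):
        # One vertical cut at the median x-coordinate, then one
        # horizontal cut in each half at that half's median
        # y-coordinate (the three variables of Figure 7.6).
        x_cut = np.median(points[:, 0])
        left = points[points[:, 0] <= x_cut]
        right = points[points[:, 0] > x_cut]
        return x_cut, np.median(left[:, 1]), np.median(right[:, 1])

    def district_of(p, cuts):
        # Index in {0, 1, 2, 3} of the rectangle containing point p.
        x_cut, y_left, y_right = cuts
        if p[0] <= x_cut:
            return 0 if p[1] <= y_left else 1
        return 2 if p[1] <= y_right else 3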
Figure 7.8: The seven metropolitan regions (Amsterdam, Bergen, Copenhagen, Helsinki, Oslo, Rotterdam, and Zürich), each divided using the Wasserstein partition and the bisection partition, respectively; the larger markers indicate data samples $x_i$ and the smaller markers indicate additional locations that were realized later.

Chapter 8

Conclusion

In the preceding chapters, by using the Wasserstein distance to define our ambiguity set, we formulated distributionally robust versions of the Euclidean travelling salesman problem, the entropy maximization problem, and the highest density region optimization problem, and we developed analytic center cutting plane methods to solve these problems efficiently. We conducted a simple computational experiment on highest density region optimization, showing the impact of increasing the number of samples $n$ and the distance parameter $t$. We also conducted a more elaborate districting experiment using actual geospatial data, in which we divided a service region into districts so as to allocate the workloads of a fleet of drones. We compared the results of the Wasserstein partition and the bisection partition in seven metropolitan regions; the Wasserstein partition performed at least as well as the bisection partition in all of them, and strictly better in four. In the future, we plan to use more advanced partition schemes, such as power diagrams, to further elaborate our computational experiments. We are also interested in other optimization problems that use the Wasserstein distance to define the ambiguity set.

Reference List

[1] J. E. Anderson. The gravity model. Technical report, National Bureau of Economic Research, 2010.
[2] D. Applegate, W. Cook, D. S. Johnson, and N. J. A. Sloane. Using large-scale computation to estimate the Beardwood-Halton-Hammersley TSP constant. Presentation at 42 SBPO, 2010.
[3] J. Beardwood, J. H. Halton, and J. M. Hammersley. The shortest path through many points. Mathematical Proceedings of the Cambridge Philosophical Society, 55(4):299–327, 1959.
[4] A. Ben-Tal, D. Den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.
[5] A. Ben-Tal, D. Bertsimas, and D. B. Brown. A soft robust model for optimization under ambiguity. Operations Research, 58(4):1220–1234, 2010.
[6] D. Bertsimas, V. Gupta, and N. Kallus. Data-driven robust optimization. arXiv preprint arXiv:1401.0212, 2013.
[7] F. Bolley, A. Guillin, and C. Villani. Quantitative concentration inequalities for empirical measures on non-compact spaces. Probability Theory and Related Fields, 137(3-4):541–593, 2007.
[8] F. Bolley and C. Villani. Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. In Annales de la Faculté des sciences de Toulouse: Mathématiques, volume 14, pages 331–352, 2005.
[9] S. Boyd. Localization and cutting-plane methods. http://stanford.edu/class/ee364b/lectures/localization_methods_slides.pdf, 2014.
[10] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[11] C. Caillerie, F. Chazal, J. Dedecker, and B. Michel.
Deconvolution for the Wasserstein metric and geometric inference. In Geometric Science of Information, pages 561–568. Springer, 2013.
[12] G. C. Calafiore. Ambiguous risk measures and optimal robust portfolios. SIAM Journal on Optimization, 18(3):853–877, 2007.
[13] G. C. Calafiore and L. El Ghaoui. On distributionally robust chance-constrained linear programs. Journal of Optimization Theory and Applications, 130(1):1–22, 2006.
[14] G. Canas and L. Rosasco. Learning probability measures with respect to optimal transport metrics. In Advances in Neural Information Processing Systems, pages 2492–2500, 2012.
[15] J. G. Carlsson and E. Delage. Robust partitioning for stochastic multivehicle routing. Operations Research, 61(3):727–744, 2013.
[16] J. G. Carlsson and R. Devulapalli. Dividing a territory among several facilities. INFORMS Journal on Computing, 25(4):730–742, 2012.
[17] J. G. Carlsson, E. Carlsson, and R. Devulapalli. Shadow prices in territory division. Networks and Spatial Economics, 2015.
[18] X. Chen, M. Sim, and P. Sun. A robust optimization perspective on stochastic programming. Operations Research, 55(6):1058–1071, 2007.
[19] C. Daganzo. Logistics Systems Analysis. Springer, 2005.
[20] E. Delage and Y. Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612, 2010.
[21] E. Erdoğan and G. Iyengar. Ambiguous chance constrained problems and robust optimization. Mathematical Programming, 107(1-2):37–61, 2006.
[22] P. M. Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. arXiv preprint arXiv:1505.05116, 2015.
[23] G. Pflug and D. Wozabal. Ambiguity in portfolio selection. Quantitative Finance, 7(4):435–442, 2007.
[24] A. L. Gibbs and F. E. Su. On choosing and bounding probability metrics. International Statistical Review, 70(3):419–435, 2002.
[25] J. Goh and M. Sim. Distributionally robust optimization and its tractable approximations. Operations Research, 58(4-part-1):902–917, 2010.
[26] M. Haimovich and A. H. G. Rinnooy Kan. Bounds and heuristics for capacitated routing problems. Mathematics of Operations Research, 10(4):527–542, 1985.
[27] N. Halverson. Google claims right to post photos from private land. The Press Democrat, August 21, 2008.
[28] Heart and Stroke Foundation. Statistics, 2013.
[29] R. J. Hyndman. Computing and graphing highest density regions. The American Statistician, 50(2):120–126, 1996.
[30] A. Irpino and R. Verde. A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. In Data Science and Classification, pages 185–192. Springer, 2006.
[31] G. N. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.
[32] E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620–630, 1957.
[33] J. G. Carlsson and M. Behroozi. Wasserstein distance and the distributionally robust TSP. Submitted for review, 2015.
[34] J. Blanchet, Y. Kang, and K. Murthy. Robust Wasserstein profile inference and applications to machine learning. arXiv preprint arXiv:1610.05627, 2017.
[35] D. Klabjan, D. Simchi-Levi, and M. Song. Robust stochastic lot-sizing by means of histograms. Production and Operations Management, 22(3):691–710, 2013.
[36] S. G. Krantz and H. R. Parks. Geometric Integration Theory. Cornerstones. Birkhäuser, 2008.
[37] E. H. Lockwood. A book of curves.
Cambridge University Press, 1967.
[38] D. G. Luenberger. Optimization by Vector Space Methods. John Wiley & Sons, 1968.
[39] M. Pavone, K. Savla, and E. Frazzoli. Sharing the load. IEEE Robotics & Automation Magazine, 16:52–61, 2009.
[40] E. Pampalk, A. Flexer, G. Widmer, et al. Improvements of audio-based music similarity and genre classification. In ISMIR, volume 5, pages 634–637. London, UK, 2005.
[41] I. Popescu. Robust mean-covariance solutions for stochastic optimization. Operations Research, 55(1):98–112, 2007.
[42] H. L. Royden and P. Fitzpatrick. Real Analysis, volume 32. Macmillan, New York, 1988.
[43] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.
[44] R. Gao and A. J. Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.
[45] H. Scarf. A min-max solution of an inventory problem. Studies in the Mathematical Theory of Inventory and Production, pages 201–209, 1958.
[46] A. Shapiro. Stochastic programming by Monte Carlo simulation methods. Stochastic Programming E-Print Series, 2000.
[47] A. Shapiro, D. Dentcheva, et al. Lectures on Stochastic Programming: Modeling and Theory. 2014.
[48] J. M. Steele. Probability Theory and Combinatorial Optimization. CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics, 1987.
[49] B. P. Tice. Unmanned aerial vehicles: the force multiplier of the 1990s. Airpower Journal, 1991.
[50] C. Villani. Topics in Optimal Transportation. American Mathematical Society, 2003.
[51] C. Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.
[52] D. Wozabal. A framework for optimization under ambiguity. Annals of Operations Research, 193(1):21–47, 2012.
[53] D. Wozabal. Robustifying convex risk measures for linear portfolios: a nonparametric approach. Operations Research, 62(6):1302–1315, 2014.
[54] W. Xie and Y. Ouyang. Optimal layout of transshipment facility locations on an infinite homogeneous plane. Transportation Research Part B: Methodological, 75:74–88, 2015.
[55] S. Zymler, D. Kuhn, and B. Rustem. Distributionally robust joint chance constraints with second-order moment information. Mathematical Programming, 137(1-2):167–198, 2013.

Appendix A

Proof of Lemma 4

Proofs of statements 2 through 4 follow below:

Proof of statement 2. We seek to show that

$\iint_{\mathcal R} f(x)\,\min_i\{\|x-x_i\|-\lambda'_i\}\,dA \;\le\; \iint_{\mathcal R} f(x)\,\min_i\{\|x-x_i\|-\lambda_i\}\,dA + g^T(\lambda'-\lambda)\,,$

which is equivalent to showing that

$\iint_{\mathcal R} f(x)\,\min_i\{\|x-x_i\|-\lambda'_i\}\,dA \;\le\; \sum_{i=1}^n\Big(\iint_{R_i} f(x)\,(\|x-x_i\|-\lambda_i)\,dA + g_i(\lambda'_i-\lambda_i)\Big)\,.$

Consider the right-hand side of the above; for each $i$, we have

$\iint_{R_i} f(x)\,(\|x-x_i\|-\lambda_i)\,dA + g_i(\lambda'_i-\lambda_i) \;=\; \iint_{R_i} f(x)\,(\|x-x_i\|-\lambda_i)\,dA - (\lambda'_i-\lambda_i)\iint_{R_i} f(x)\,dA \;=\; \iint_{R_i} f(x)\,(\|x-x_i\|-\lambda'_i)\,dA$

and therefore, if we define regions $R'_1,\dots,R'_n$ in the obvious way by

$R'_i = \big\{x\in\mathcal R : \|x-x_i\|-\lambda'_i \le \|x-x_j\|-\lambda'_j\ \forall j\ne i\big\}\,,$

we see that

$\iint_{\mathcal R} f(x)\,\min_i\{\|x-x_i\|-\lambda'_i\}\,dA \;=\; \sum_{i=1}^n \iint_{R'_i} f(x)\,(\|x-x_i\|-\lambda'_i)\,dA \;\le\; \sum_{i=1}^n \iint_{R_i} f(x)\,(\|x-x_i\|-\lambda'_i)\,dA\,;$

here the equality holds by definition of the $R'_i$, and the inequality holds because the partition $R'_1,\dots,R'_n$ is obtained by taking the minimal value of $\|x-x_i\|-\lambda'_i$ at each point, and is therefore minimal over all partitions of $\mathcal R$. This completes the proof.

Proof of statement 3. We observe that the vector $-\frac{1}{n}e\in\mathbb R^n$ must be a supergradient at $\lambda^*$; this simply follows from the KKT conditions of (3.2), which is a finite-dimensional problem.
Appendix B

Proof of Theorem 8

Purely for ease of exposition, we assume that $\mathcal{R}$ is the unit square. Section 2.1 of [14] says that $D(\hat{f}_n, \bar{f}) \to 0$ with probability one, because the Wasserstein distance metrizes weak convergence whenever $\mathcal{R}$ is compact. Thus, setting $t_n = D(\hat{f}_n, \bar{f})$ for all $n \ge 1$ gives us a sequence that converges to 0 with probability one, with the added feature that $\bar{f}$ is feasible for problem (4.9) by construction. Next, for each $n$, the triangle inequality says that the set of distributions $f$ on $\mathcal{R}$ such that $D(f, \bar{f}) \le 2t_n$ must contain the set of distributions where $D(f, \hat{f}_n) \le t_n$. Thus, an upper bound for problem (4.9) – which is itself always an upper bound for the ground truth cost $\iint_{\mathcal{R}} \sqrt{\bar{f}(x)}\,dA$ by our construction of $t_n$ – is given by the problem
\[
\begin{array}{lll}
\underset{f \in L^1(\mathcal{R})}{\text{maximize}} & \displaystyle\iint_{\mathcal{R}} \sqrt{f(x)}\,dA & \qquad \text{(B.1)} \\
\text{s.t.} & D(f, \bar{f}) \le 2t_n & \\
& \displaystyle\iint_{\mathcal{R}} f(x)\,dA = 1 & \\
& f(x) \ge 0 \quad \forall x \in \mathcal{R};
\end{array}
\]
it will therefore suffice to verify that the optimal objective value of this problem approaches the ground truth cost as $t_n \to 0$. We will relax problem (B.1) one step further by using an alternate metric to the Wasserstein distance, namely the Prokhorov metric $D_P(\cdot, \cdot)$, defined by
\[
D_P(\mu_1, \mu_2) = \inf\bigl\{\varepsilon > 0 : \mu_1(B) \le \mu_2(B^\varepsilon) + \varepsilon \text{ for all Borel sets } B \text{ on } \mathcal{R}\bigr\},
\]
where $B^\varepsilon = \{x : \inf_{y \in B} d(x, y) \le \varepsilon\}$. Theorem 2 of [24] says that for any two distributions $f$ and $g$ on $\mathcal{R}$, we have $(D_P(f, g))^2 \le D(f, g)$, and therefore we can study the relaxation of (B.1) given by
\[
\begin{array}{lll}
\underset{f \in L^1(\mathcal{R})}{\text{maximize}} & \displaystyle\iint_{\mathcal{R}} \sqrt{f(x)}\,dA & \qquad \text{(B.2)} \\
\text{s.t.} & D_P(f, \bar{f}) \le \sqrt{2t_n} & \\
& \displaystyle\iint_{\mathcal{R}} f(x)\,dA = 1 & \\
& f(x) \ge 0 \quad \forall x \in \mathcal{R}
\end{array}
\]
as $n \to \infty$, whence $t_n \to 0$ with probability one. For ease of notation, we will define $\varepsilon = \sqrt{2t_n}$.

[Figure B.1: A division of the unit square $\mathcal{R}$ into $N^2 = 16$ grid cells. The larger square $S_{10}$ has side length $1/N + 2/N^3$ and contains $s_{10}$.]

Let $N$ be a positive integer and suppose that $\varepsilon = 1/N^3$. We then divide $\mathcal{R}$ into $N^2$ square grid cells $s_i$ with side length $1/N$. The distance constraint $D_P(f, \bar{f}) \le \varepsilon$ implies that for each $B = s_i$, we have $\iint_{s_i} f(x)\,dA \le \iint_{S_i} \bar{f}(x)\,dA + \varepsilon$, where $S_i$ is the square of side length $1/N + 2/N^3$ that contains $s_i$ (see Figure B.1). Define $m_i = N^2 \iint_{S_i} \bar{f}(x)\,dA$ for each $i$ and consider the relaxation of (B.2) given by
\[
\begin{array}{lll}
\underset{f \in L^1(\mathcal{R})}{\text{maximize}} & \displaystyle\iint_{\mathcal{R}} \sqrt{f(x)}\,dA & \qquad \text{(B.3)} \\
\text{s.t.} & \displaystyle\iint_{s_i} f(x)\,dA \le \frac{m_i}{N^2} + \varepsilon \quad \forall i & \\
& \displaystyle\iint_{\mathcal{R}} f(x)\,dA = 1 & \\
& f(x) \ge 0 \quad \forall x \in \mathcal{R}.
\end{array}
\]
If we ignore the constraint that $\iint_{\mathcal{R}} f(x)\,dA = 1$, then clearly, our optimal solution $f^*$ would simply have $\iint_{s_i} f^*(x)\,dA = m_i/N^2 + \varepsilon$ for each $i$. This problem has a finite-dimensional constraint space, and it is straightforward to see that its optimal solution $f^*$ must be piecewise constant on each piece $s_i$, so that $f^* = q_i^*$ on each $s_i$, defined by
\[
\frac{q_i^*}{N^2} = \frac{m_i}{N^2} + \varepsilon, \qquad \text{or equivalently} \qquad q_i^* = m_i + 1/N.
\]
Thus, the optimal objective value of (B.3) is at most
\[
\frac{1}{N^2} \sum_{i=1}^{N^2} \sqrt{q_i^*} = \frac{1}{N^2} \sum_{i=1}^{N^2} \sqrt{m_i + 1/N} \;\le\; \frac{1}{N^2} \sum_{i=1}^{N^2} \sqrt{m_i} + \frac{1}{N^2} \sum_{i=1}^{N^2} \sqrt{1/N} = \frac{1}{N^2} \sum_{i=1}^{N^2} \sqrt{m_i} + \sqrt{1/N};
\]
it is routine to verify that $\frac{1}{N^2} \sum_{i=1}^{N^2} \sqrt{m_i} \to \iint_{\mathcal{R}} \sqrt{\bar{f}(x)}\,dA$ (the only reason that this is not simply the definition of an integral is that the squares $S_i$ that characterize the $m_i$ have an area of $(1/N + 2/N^3)^2$ rather than $1/N^2$), which thereby completes the proof.
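To see the grid argument in action, the sketch below (illustrative only) uses the hypothetical density $\bar{f}(x,y) = 4xy$ on the unit square, for which $\iint_{\mathcal{R}} \sqrt{\bar{f}(x)}\,dA = 8/9$; for simplicity the masses $m_i$ are computed over $s_i$ itself rather than the enlarged squares $S_i$, a difference of order $1/N^3$ that does not affect the limit, exactly as the proof remarks:

import numpy as np

# Sketch of the grid bound from the proof: (1/N^2) * sum_i sqrt(m_i + 1/N)
# should approach the ground truth cost, integral sqrt(fbar) dA = 8/9, as N grows.
# fbar(x, y) = 4xy is a hypothetical density, not one used in the dissertation.
def fbar(x, y):
    return 4.0 * x * y

ground_truth = 8.0 / 9.0

for N in (4, 16, 64, 256):
    sub = 4                                   # subgrid per cell for the cell masses
    K = N * sub
    xs = (np.arange(K) + 0.5) / K
    X, Y = np.meshgrid(xs, xs)
    # m_i = N^2 * integral_{s_i} fbar dA, approximated by the midpoint rule
    m = fbar(X, Y).reshape(N, sub, N, sub).mean(axis=(1, 3))
    bound = np.sqrt(m + 1.0 / N).sum() / N**2
    print(f"N={N:4d}  upper bound={bound:.4f}  ground truth={ground_truth:.4f}")

The printed values decrease toward $8/9$ roughly at the rate $\sqrt{1/N}$, matching the error term in the estimate above.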
Appendix C

Probabilistic analysis of the capacitated VRP

We first note that, if $n$ samples are drawn from a distribution $f$, then $\mathbb{E}\left(\sum_{i=1}^n \|x_i\|\right) = n \iint_{\mathcal{R}} \|x\| f(x)\,dA$. The representation of capacity constraints via the substitution $\kappa = s\sqrt{n}$ is a standard and useful technique that can be seen in Section 4.2 of [19]. By exchanging the expectation and $\max\{\cdot,\cdot\}$ operators, we can express the bound (4.12) as
\[
\max\left\{ \frac{2\sqrt{n}}{s} \iint_{\mathcal{R}} \|x\| f(x)\,dA,\; \beta\sqrt{n} \iint_{\mathcal{R}} \sqrt{f_c(x)}\,dA \right\} + o(\sqrt{n}) \;\le\; \mathbb{E}\,\mathrm{VRP}(X) \;\le\; \frac{2\sqrt{n}}{s} \iint_{\mathcal{R}} \|x\| f(x)\,dA + \left(1 - \frac{1}{s\sqrt{n}}\right) \beta\sqrt{n} \iint_{\mathcal{R}} \sqrt{f_c(x)}\,dA + o(\sqrt{n}).
\]
Note that $\lceil \sqrt{n}/s \rceil$ is simply the number of vehicles needed to provide service. Since we are interested in the limiting behavior as $n \to \infty$, we have $\lceil \sqrt{n}/s \rceil \sim \sqrt{n}/s$ and $1/(s\sqrt{n}) \to 0$, so that we can write
\[
\sqrt{n} \cdot \max\left\{ \frac{2}{s} \iint_{\mathcal{R}} \|x\| f(x)\,dA,\; \beta \iint_{\mathcal{R}} \sqrt{f_c(x)}\,dA \right\} \;\lesssim\; \mathrm{VRP}(X) \;\lesssim\; \sqrt{n} \cdot \left( \frac{2}{s} \iint_{\mathcal{R}} \|x\| f(x)\,dA + \beta \iint_{\mathcal{R}} \sqrt{f_c(x)}\,dA \right)
\]
as desired, where the approximate inequalities $\lesssim$ simply reflect the fact that we have disregarded the $o(\sqrt{n})$ terms.
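As a rough illustration of how the two sides of this asymptotic sandwich behave, the sketch below (illustrative assumptions throughout) evaluates them for a uniform demand density on the unit square with the depot at the origin, so that $\iint_{\mathcal{R}} \sqrt{f_c(x)}\,dA = 1$, using the customary numerical estimate $\beta \approx 0.7124$ for the Euclidean TSP constant:

import numpy as np

# Sketch of the two sides of the asymptotic capacitated-VRP bound.  The uniform
# density, the capacity parameter s, and the numerical value of beta are all
# illustrative assumptions, not choices made in the dissertation.
rng = np.random.default_rng(1)
n = 10_000
X = rng.random((n, 2))                        # n demand points, uniform on [0,1]^2
beta = 0.7124                                 # usual estimate of the Euclidean TSP constant
E_norm = np.linalg.norm(X, axis=1).mean()     # Monte Carlo estimate of E||x||
I = 1.0                                       # integral of sqrt(f_c) for the uniform density

for s in (0.5, 2.0, 8.0):                     # vehicle capacity kappa = s * sqrt(n)
    radial = (2.0 / s) * E_norm
    lower = np.sqrt(n) * max(radial, beta * I)
    upper = np.sqrt(n) * (radial + beta * I)
    print(f"s={s:4.1f}  lower ~ {lower:8.1f}   upper ~ {upper:8.1f}")

Since $\max\{a,b\} \ge (a+b)/2$, the two sides always agree to within a factor of two; the gap is widest when the radial and TSP terms are comparable, and closes in either extreme of $s$.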
Appendix D

Proof of Theorem 12

Purely for ease of exposition, we assume that $\mathcal{R}$ is the unit square. Section 2.1 of [14] says that $D(\hat{f}_n, \bar{f}) \to 0$ with probability one, because the earth mover's distance metrizes weak convergence on a state space of bounded diameter. Thus, setting $t_n = D(\hat{f}_n, \bar{f})$ for all $n \ge 1$ gives us a sequence that converges to 0 with probability one, with the added feature that $\bar{f}$ is feasible for problem (5.2) by construction. Next, for each $n$, the triangle inequality says that the set of distributions $f$ on $\mathcal{R}$ such that $D(f, \bar{f}) \le 2t_n$ must contain the set of distributions where $D(f, \hat{f}_n) \le t_n$. Thus, an upper bound for problem (5.2) – which is itself always an upper bound for the ground truth cost $\iint_{\mathcal{R}} -\bar{f}(x) \log \bar{f}(x)\,dA$ by our construction of $t_n$ – is given by the problem
\[
\begin{array}{lll}
\underset{f \in L^1}{\text{maximize}} & \displaystyle\iint_{\mathcal{R}} -f(x) \log f(x)\,dA & \qquad \text{(D.1)} \\
\text{s.t.} & D(f, \bar{f}) \le 2t_n & \\
& \displaystyle\iint_{\mathcal{R}} f(x)\,dA = 1 & \\
& f(x) \ge 0 \quad \forall x \in \mathcal{R};
\end{array}
\]
it will therefore suffice to verify that the optimal objective value of this problem approaches the ground truth cost as $t_n \to 0$. We will relax problem (D.1) one step further by using an alternate metric to the earth mover's distance, namely the Prokhorov metric $D_P(\cdot,\cdot)$, defined by
\[
D_P(\mu_1, \mu_2) = \inf\bigl\{\varepsilon > 0 : \mu_1(B) \le \mu_2(B^\varepsilon) + \varepsilon \text{ for all Borel sets } B \text{ on } \mathcal{R}\bigr\},
\]
where $B^\varepsilon = \{x : \inf_{y \in B} d(x, y) \le \varepsilon\}$, in which the metric $d$ is the Euclidean distance in our problem. Theorem 2 of [24] says that for any two distributions $f$ and $g$ on $\mathcal{R}$, we have $(D_P(f, g))^2 \le D(f, g)$, and therefore we can study the relaxation of (D.1) given by
\[
\begin{array}{lll}
\underset{f \in L^1}{\text{maximize}} & \displaystyle\iint_{\mathcal{R}} -f(x) \log f(x)\,dA & \qquad \text{(D.2)} \\
\text{s.t.} & D_P(f, \bar{f}) \le \varepsilon & \\
& \displaystyle\iint_{\mathcal{R}} f(x)\,dA = 1 & \\
& f(x) \ge 0 \quad \forall x \in \mathcal{R}
\end{array}
\]
as $\varepsilon \to 0$, where it is implicitly assumed that $\sqrt{2t_n} \le \varepsilon$. Let $N$ be a positive integer and suppose that $\varepsilon = 1/N^3$. We then divide $\mathcal{R}$ into $N^2$ square grid cells $s_i$ with side length $1/N$. The distance constraint $D_P(f, \bar{f}) \le \varepsilon$ implies that for each $B = s_i$, we have $\iint_{s_i} f(x)\,dA \le \iint_{S_i} \bar{f}(x)\,dA + \varepsilon$, where $S_i$ is the square of side length $1/N + 2/N^3$ that contains $s_i$ (see Figure D.1).

[Figure D.1: A division of the unit square $\mathcal{R}$ into $N^2 = 16$ grid cells. The larger square $S_{10}$ has side length $1/N + 2/N^3$ and contains $s_{10}$.]

Define $m_i = N^2 \iint_{S_i} \bar{f}(x)\,dA$ for each $i$ and consider the relaxation of (D.2) given by
\[
\begin{array}{lll}
\underset{f \in L^1}{\text{maximize}} & \displaystyle\iint_{\mathcal{R}} -f(x) \log f(x)\,dA & \qquad \text{(D.3)} \\
\text{s.t.} & \displaystyle\iint_{s_i} f(x)\,dA \le \frac{m_i}{N^2} + \varepsilon \quad \forall i & \\
& \displaystyle\iint_{\mathcal{R}} f(x)\,dA = 1 & \\
& f(x) \ge 0 \quad \forall x \in \mathcal{R}.
\end{array}
\]
If we ignore the constraint that $\iint_{\mathcal{R}} f(x)\,dA = 1$, then clearly, our optimal solution $f^*$ would simply have $\iint_{s_i} f^*(x)\,dA = m_i/N^2 + \varepsilon$ for each $i$. This problem has a finite-dimensional constraint space, and it is straightforward to see that its optimal solution $f^*$ must be piecewise constant on each piece $s_i$, so that $f^* = q_i^*$ on each $s_i$, defined by
\[
\frac{q_i^*}{N^2} = \frac{m_i}{N^2} + \varepsilon, \qquad \text{or equivalently} \qquad q_i^* = m_i + 1/N.
\]
Thus, the optimal objective value of (D.3) is at most
\[
\frac{-1}{N^2} \sum_{i=1}^{N^2} q_i^* \log q_i^* = \frac{-1}{N^2} \sum_{i=1}^{N^2} (m_i + 1/N) \log(m_i + 1/N),
\]
which goes to $\frac{-1}{N^2} \sum_{i=1}^{N^2} m_i \log m_i$ as $N \to \infty$. It is routine to verify that $\frac{-1}{N^2} \sum_{i=1}^{N^2} m_i \log m_i \to \iint_{\mathcal{R}} -\bar{f}(x) \log \bar{f}(x)\,dA$ (the only reason that this is not simply the definition of an integral is that the squares $S_i$ that characterize the $m_i$ have an area of $(1/N + 2/N^3)^2$ rather than $1/N^2$), which thereby completes the proof.
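The same grid computation adapts immediately to the entropy objective; in the sketch below (again with the hypothetical density $\bar{f}(x,y) = 4xy$, and with $m_i$ computed over $s_i$ rather than $S_i$ for simplicity) the relaxation value approaches the ground-truth entropy $\iint_{\mathcal{R}} -\bar{f}(x) \log \bar{f}(x)\,dA$ as $N$ grows:

import numpy as np

# Sketch of the entropy analogue: -(1/N^2) * sum_i (m_i + 1/N) * log(m_i + 1/N)
# should approach -integral fbar log fbar dA as N grows.  fbar is again a
# hypothetical density, and the enlargement of s_i to S_i is omitted for simplicity.
def fbar(x, y):
    return 4.0 * x * y

def cell_masses(N, sub=4):
    # m_i = N^2 * integral_{s_i} fbar dA, approximated by the midpoint rule
    K = N * sub
    xs = (np.arange(K) + 0.5) / K
    X, Y = np.meshgrid(xs, xs)
    return fbar(X, Y).reshape(N, sub, N, sub).mean(axis=(1, 3))

ref = cell_masses(512, sub=1)                 # fine-grid reference for the entropy
truth = -(ref * np.log(ref)).mean()

for N in (4, 16, 64, 256):
    m = cell_masses(N)
    val = -((m + 1.0 / N) * np.log(m + 1.0 / N)).mean()
    print(f"N={N:4d}  relaxation value={val:.4f}  ground truth ~ {truth:.4f}")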
Abstract
Recent research on formulating and solving distributionally robust optimization problems has produced many different approaches for describing one's ambiguity set, such as moment-based and statistical-distance-based approaches. In this dissertation, we use the Wasserstein distance to characterize the ambiguity set of distributions, which allows us to circumvent the overestimation that commonly arises when other procedures are used, such as fixing the center of mass and the covariance matrix of the distribution.

We consider distributionally robust versions of the Euclidean travelling salesman problem, the entropy maximization problem, and the highest density region optimization problem: as input, we are given a compact, contiguous planar region R and a realization of sampled demand points in that region, and our objective is to construct a probability distribution on R that is sufficiently "close" to the empirical distribution consisting of the sampled points and is as "spread out" as possible, in the sense that the asymptotic length of a TSP tour of points drawn from that distribution should be as large as possible, the entropy of the distribution should be as large as possible, and the coverage probability over the highest density region should be as small as possible. Numerical experiments confirm that our approach is useful in a decision support tool for dividing a territory into service districts for a fleet of aerial vehicles when limited data is available.