GENERALIZED OPTIMAL LOCATION PLANNING

by

Parisa Ghaemi

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2012

Copyright 2012 Parisa Ghaemi

Dedication

To My Family

Acknowledgments

I would like to express my deepest gratitude to my advisor, Professor John Wilson, whose guidance and support have made this journey truly magnificent. He has not only been my advisor but also a friend who has taught me so many things beyond research. I will never forget the conversations we had during these years. I would also like to acknowledge the members of the USC GIS Research Laboratory for their support and friendship. It is hard to express how thankful I am to my dearest supportive parents, my lovely daughter, my amazing sisters and my dear family for their relentless support and encouragement in all aspects of life, and for making me the person I am today. Above all, I would like to express my deepest appreciation to my dearest husband for his endless support and encouragement, his great patience and deep love, not only during my studies but in all aspects of my life.

Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Motivation
  1.2 What are Optimal Location Queries?
    1.2.1 Traditional Optimal Location Query Problem Statement
  1.3 Goals of the Dissertation
    1.3.1 Optimal Network Location Queries
    1.3.2 Multi-criteria Optimal Location Queries
    1.3.3 Dynamic Optimal Location Queries
  1.4 Outline of the Dissertation
2 Related Work
3 Optimal Network Location Queries
  3.1 Introduction
  3.2 Related Work
  3.3 Problem Definition
  3.4 Expansion-based Optimal Network Location Queries (EONL)
  3.5 Bound-Based Optimal Network Location (BONL)
    3.5.1 Bound-based Optimal Location with Upper Bound (BONL-U)
    3.5.2 Bound-based Optimal Location with Minimum Bound (BONL-M)
  3.6 Complexity Analysis
  3.7 Experimental Evaluation
    3.7.1 Experimental Setup
    3.7.2 Experimental Results
  3.8 Conclusions
4 Multi-Criteria Optimal Location Queries
  4.1 Introduction
  4.2 Related Work
    4.2.1 Optimal Location
    4.2.2 Skyline Queries
  4.3 Problem Definition
    4.3.1 Multi-Criteria Optimal Location (MCOL)
    4.3.2 Maximal Reverse Skyline (MaxRSKY)
  4.4 Baseline Solution
  4.5 Basic-Filtering Approach
    4.5.1 Data Structure Computation
      4.5.1.1 Skyline Computation and SSR Construction
      4.5.1.2 Overlapping Data Structure
    4.5.2 Query Processing
  4.6 Grid-based-Filtering Approach
  4.7 Experimental Evaluation
    4.7.1 Experimental Setup
    4.7.2 Experimental Results
      4.7.2.1 Feasibility Study
      4.7.2.2 Comparing the cost of skyline computation/building SSRs with the cost of computing overlaps among SSRs
      4.7.2.3 Empirical Analysis
  4.8 Conclusions and Future Work
5 Dynamic Optimal Location Queries
  5.1 Introduction
  5.2 Related Work
  5.3 Problem Definition
    5.3.1 Continuous Maximal Reverse Nearest Neighbor Query (CMaxRNN)
    5.3.2 Update Operations
  5.4 Terminology
  5.5 Precomputation
  5.6 Update Operation Execution
    5.6.1 Delete Operation
    5.6.2 Insert Operation
    5.6.3 Weight-Update Operation
  5.7 Complexity Analysis
  5.8 Experimental Evaluation
    5.8.1 Experimental Setup
    5.8.2 Experimental Results
      5.8.2.1 Feasibility Study
      5.8.2.2 Empirical Analysis
      5.8.2.3 Case Study
  5.9 Conclusions
6 Conclusions and Future Work
References

List of Tables

3.1 Marked Edge Table (MET)
3.2 Edge collapsing technique
3.3 Pair-wise Overlapping Table (POT)
3.4 Five real datasets for sites
3.5 Average distance of optimal network location and optimal location derived by L1 and L2 approaches (size of the entire area is 6.2 km x 9 km)
3.6 Comparing the execution time of BONL-U with the two grid-based landmark selection techniques
4.1 Overlap Table (OT)
4.2 Comparing the cost of skyline computation/building SSRs with the cost of computing overlaps among SSRs
5.1 Four synthetic datasets for objects and sites
5.2 Comparing the execution time of SMaxRNN and CMaxRNN queries
5.3 Summary of the result of CMaxRNN queries on FoodTrucks

List of Figures

3.1 Optimal Location Queries
3.2 Road network model
3.3 Local Networks
3.4 Overlap segment of multiple local networks
3.5 Local Bounds
3.6 Upper bound value computation of the network distances
3.7 Non-overlapping case
3.8 Execution times of the algorithms with a single site dataset and four skewed object datasets
3.9 Execution times of the algorithms with a single site dataset and four uniformly distributed object datasets
3.10 Execution times of the algorithms with uniformly distributed objects
3.11 Average size of local bounds radii (meters)
3.12 Comparing the execution time of the EONL algorithm with the FGP-OTF algorithm [XYL11] in seconds
4.1 Example site and object datasets in 2-dimensional space
4.2 Skyline and reverse skyline query examples along with corresponding SSR and DR regions
4.3 An illustration of the Skyline Search Region (SSR) and the Dominance Region (DR)
4.4 Example SSR Regions
4.5 The data structure computation procedure (Basic-Filtering approach)
4.6 MaxRSKY Query Processing (Basic-Filtering approach)
4.7 An illustration of the Grid-based-Filtering approach
4.8 The data structure computation procedure (Grid-based-Filtering approach)
4.9 MaxRSKY Query Processing (Grid-based-Filtering approach)
4.10 Comparing the execution times of the Basic-Filtering and Grid-based-Filtering approaches with a real dataset
4.11 Execution times of the Basic-Filtering and Grid-based-Filtering approaches on synthetic datasets with different distributions
4.12 Magnifying the execution times of the Grid-based-Filtering approach illustrated in Figure 4.11
4.13 Execution times of the Basic-Filtering and Grid-based-Filtering approaches on synthetic datasets with varying object cardinality and a fixed-size site dataset
4.14 Execution times of the Basic-Filtering and Grid-based-Filtering approaches on synthetic datasets with varying site cardinality and a fixed-size object dataset
4.15 Execution times of the Grid-based-Filtering approach with various grid granularities
5.1 Schema of the precomputed data structure
5.2 The precomputation procedure
5.3 Optimal location computation based on MET
5.4 Execution times of CMaxRNN and SMaxRNN on S1
5.5 Execution times of CMaxRNN and SMaxRNN on S2
5.6 Execution times of the Delete/Insert/Weight-Update operations on S1
5.7 Execution times of the Delete/Insert/Weight-Update operations on S2
5.8 Comparing the execution times of the Insert operation on S2 (uniform) and S4 (normal)
5.9 Comparing the execution times of the Weight-Update operation on S2 (uniform) and S4 (normal)
5.10 Distribution of the execution times of the Insert operation on S2 (uniform weight) and S3 (normal weight)
5.11 Distribution of the execution times of the Weight-Update operation on S2 (uniform weight) and S3 (normal weight)
5.12 Execution times of CMaxRNN for Insert operations in the interval from 9:57 to 11:11 a.m.
5.13 Execution times of CMaxRNN for Delete operations in the interval from 12:50 to 2:07 p.m.

Abstract

Optimal location queries have been widely used in spatial decision support systems and marketing in recent years. Given a set S of sites and a set O of weighted objects, a "basic optimal location query" finds the location(s) where introducing a new site maximizes the total weight of the objects that are closer to the new site than to any other site. Due to the intrinsic computational complexity of the optimal location problem, researchers have often resorted to making simplifying assumptions in order for the proposed solutions to scale with large datasets. However, there are many real-world applications where such restrictive assumptions may not hold.

In this dissertation, we relax three of the aforementioned simplifying assumptions and correspondingly propose solutions for three popular variations of the basic optimal location problem, namely the "optimal network location problem", the "multi-criteria optimal location problem" and the "dynamic optimal location problem". These variations of the original problem allow for considering network distance (rather than p-norm distance), multiple preference criteria (rather than distance as the single preference criterion), and dynamic objects and sites (rather than static ones), respectively.

In Chapter 3, we introduce two complementary approaches for efficient computation of optimal network location (ONL) queries, namely EONL (short for "Expansion-based ONL") and BONL (short for "Bound-based ONL"), which enable efficient computation of ONL queries with object-datasets containing uniform and skewed distributions, respectively. Thereafter, we experimentally compare our proposed approaches and discuss their use cases with different real-world applications. Our experimental results with real datasets show that given uniformly distributed object-datasets (i.e., datasets with uniform spatial distributions), EONL is an order of magnitude faster than BONL, whereas with object-datasets with skewed distributions BONL outperforms EONL. Therefore, EONL and BONL have their own exclusive use cases in real-world applications and are complementary.
In Chapter 4, we formalize the multi-criteria optimal location problem as the maximal reverse skyline query (MaxRSKY) and introduce two filter-and-refine approaches, termed "Basic-Filtering" and "Grid-based-Filtering", that allow for efficient computation of MaxRSKY queries. The latter approach is an enhanced solution because it avoids redundant computation by filtering out the irrelevant parts of the search space for improved efficiency. Our extensive empirical analysis with both real-world and synthetic datasets shows that our enhanced solution is more efficient in computing answers for MaxRSKY queries with large datasets containing thousands of sites and objects. For datasets where the Basic-Filtering approach answers a MaxRSKY query in hours, the Grid-based-Filtering approach completes the same computation in minutes.

In Chapter 5, we formalize dynamic optimal network location queries as Continuous Maximal Reverse Nearest Neighbor (CMaxRNN) queries on spatial networks, and present a scalable and exact solution for CMaxRNN query computation. In our proposed approach we avoid computing the optimal location query from scratch; instead, we compute the query incrementally to leverage computations from past queries. Our experimental results on a real-world dataset show that CMaxRNN queries are about two orders of magnitude faster than running the optimal location query from scratch.

Chapter 1

Introduction

1.1 Motivation

Consider the hypothetical case in which the Los Angeles Fire Department (LAFD) seeks to respond to fire emergencies in Los Angeles (LA) County in a timely manner, but budgetary limits mean that it falls short of what is needed to assure timely responses throughout the whole service area. Assume that it can only afford to build one new fire station. This problem is an example of an optimal location planning problem, since the LAFD seeks an optimal site for this station that covers as much of the demand to be served as possible (demand here represents an estimate of the number of fire response calls anticipated over a period of time in a neighborhood). There are a number of other similar questions that a city planner might want to answer: "Where is the best location for a new public library, a school, or a new park?", among others.

In many cases, planners and/or decision makers need to study the location planning problem from different perspectives and test different scenarios to identify the final optimal solution. Let's consider a real scenario, which is motivated by a project completed by Ghaemi et al. [GSS+09]. In this scenario, we want to design a spatial decision support tool to support interactive park/open space planning in Southern California. The main goal is to provide an exploratory tool for decision makers (e.g., municipalities and community-based groups) to identify candidate sites that can promote park and open space access for local residents. In addition, this tool aims to enable decision makers to have a better view of the planning problem and to test different scenarios by varying the objectives, constraints, and parameters of the models. For instance, decision makers may want to study where the best location is to place a new park so that it attracts the maximum number of people. Alternatively, they may want to explore what happens when people consider other factors besides proximity to pick the most convenient park (e.g., their interest in specific facilities and/or amenities of parks).
For example, they may want to focus on the health of toddlers and find the best location to serve as many toddlers as possible. By reviewing different scenarios, changing input datasets, and considering different constraints (budgets, time limits, etc.), decision makers will be able to identify a series of optimal locations.

1.2 What are Optimal Location Queries?

The first study on optimal location planning can be traced back to Weber ([Web09]), who described a location planning problem associated with an industrial setting as follows: "Find a location for a manufacturing facility that receives raw material from two point sources and ships its final products to a point-specified market." Thus, Weber's problem suggests finding a fourth point (the factory) among three points (two raw-material points and one market point) in order to minimize the transport cost (Euclidean distance). The Weber problem is known as the first location-allocation problem or single-facility location problem (Church et al. [CM09]), in which the aim is to locate one manufacturing facility at continuous locations in the Euclidean plane. Weber's original work has been expanded over time into different problem domains, such as multi-facility location problems, location planning on networks/metric spaces, and siting facilities at discrete locations, among others.

Optimal location planning is multidisciplinary and continues to be of interest to researchers representing a variety of fields, ranging from business to geography, mathematics, operations research, and the database community in computer science. Studies accomplished in each field discuss the topic from a different perspective. For instance, in Operations Research (OR), most studies are focused on mathematical/geometric modeling of the location problem or on formulating it as an optimization problem. They provide an optimal solution or a nearly optimal (approximate) solution by applying various optimization methods (e.g., integer programming, linear/non-linear programming formulations), or using heuristic-based approximation methods (e.g., simulated annealing, greedy heuristics, and genetic algorithms). In addition, they may provide an optimal solution by making some assumptions to simplify the problem (e.g., representing a general road network graph as a tree network). While efficient, most of the proposed methods in OR are not scalable to large datasets due to their computational complexity, since the general optimal location problem is known to be an NP-hard problem. In recent years, the availability of large datasets and the need to provide an exact optimal solution to location planning problems have increased. Because of this, a number of complementary solutions have been proposed to answer scalable optimal location queries in the computer science database community. In these approaches, the main focus is on providing an exact solution for optimal location problems with large datasets in a timely manner. Basically, the proposed approaches reduce the computational complexity of identifying the optimal location by considering the regions with a higher potential of containing the optimal location.

As a general view, optimal location queries deal with locating one or more sites in order to optimize one or more objectives. In different applications, the term "optimal location" might mean different things.
For instance, in opening a new public library, the optimal location might be the site that would maximize the number of patrons for whom this is the closest library. However, the optimal location to place a new fire station might be the location that minimizes the maximum distance of buildings to their closest fire station (this problem in general is called the 1-center problem, as originally described by Hakimi ([Hak64], [Hak65])). Alternatively, the optimal region to build a new park is the region where locating a new park would minimize the weighted distance from local residents to their closest park (this problem is an example of the 1-median problem, as originally described by Hakimi ([Hak64], [Hak65])). Among all aforementioned objectives, in this dissertation we focus on the objective of maximizing the total number of demand points in placing a new facility site. In the next section, we formally describe the basic optimal location query, which assumes this objective.

1.2.1 Traditional Optimal Location Query Problem Statement

Given a set S of sites and a set O of weighted objects, the basic optimal location query (OL) computes a location where introducing a new site would maximize the total weight of the objects that are closer to the new site than to any other site. In other words, the basic optimal location query can be defined as finding the site with the maximum bichromatic reverse nearest neighbor (BRNN) set. The BRNN query on a given site s returns all the objects o ∈ O whose nearest neighbor site is s, i.e., there is no other site s′ ∈ S such that Dist(o, s′) < Dist(o, s). Thus, the optimal location query can be formulated as a problem where we seek to find a location such that, if a new site s is placed there, the size of the BRNN set of s, BRNN(s), is maximized.
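To make this formulation concrete, the following brute-force sketch (ours, not part of the dissertation: the names and data are hypothetical, distance is L2, and candidate locations are restricted to a finite set rather than the continuous plane) scores each candidate by the total weight of its would-be BRNN set:

```python
import math

def dist(p, q):
    # Euclidean (L2) distance; the basic query admits the L1 norm as well.
    return math.hypot(p[0] - q[0], p[1] - q[1])

def influence(candidate, sites, objects):
    # Total weight of the objects that would be closer to the candidate
    # than to any existing site, i.e., the weight of BRNN(candidate).
    return sum(w for loc, w in objects
               if dist(loc, candidate) < min(dist(loc, s) for s in sites))

def optimal_location(candidates, sites, objects):
    # Brute force: keep the candidate whose BRNN set has maximum total weight.
    return max(candidates, key=lambda c: influence(c, sites, objects))

sites = [(0.0, 0.0), (10.0, 0.0)]
objects = [((2.0, 1.0), 3), ((4.0, 0.0), 6), ((9.0, 2.0), 5)]
print(optimal_location([(3.0, 0.0), (5.0, 1.0), (8.0, 0.0)], sites, objects))
```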
In the aforementioned problem statement, there are four constraints:

1. Object points are always assigned to the closest sites, or they have a tendency to go to the nearest sites. By default, L2/Euclidean or L1-norm distances ([Sam90]) are the distance measures used to calculate the distance between the object points and their nearest neighbor site.

2. Proximity is a single criterion and the only factor considered by objects in picking their convenient site.

3. Object and site points are both static, and they do not change location over time.

4. Object and site points can be located at continuous locations, anywhere in the plane.

Relaxing some of the above constraints helps answer a broader range of optimal location queries, and the goal of this dissertation is to relax some of the constraints assumed by the basic optimal location solutions, as explained in the next section.

1.3 Goals of the Dissertation

This dissertation examines three variations of basic optimal location queries by relaxing the first three constraints mentioned in the previous section. The first study, on "Optimal Network Location Queries", relaxes the first constraint by considering the network distance as the distance measure between object points and their nearest neighbor site. This topic was first introduced by Ghaemi et al. [GSWBK10], and a follow-up paper on this topic will be described in detail in Section 3.2 ([XYL11]). The first study is presented in detail, along with an extensive set of experiments, in Chapter 3. In the second study, "Multi-criteria Optimal Location Queries", we aim to relax the second constraint by considering multiple criteria rather than a single criterion in selecting the convenient site. This study will be explained in Chapter 4. In the third study, "Dynamic Optimal Location Queries", our goal is to relax the third constraint by assuming that site and object points are not static and may change their geographic location over time. The latter study (Ghaemi et al. [GSWBK12]) will be presented in Chapter 5. In the following subsections, I briefly discuss each of these studies.

1.3.1 Optimal Network Location Queries

The existing works on optimal location queries assume either L2/Euclidean distance or L1-norm distance. However, in many real-world applications, objects and sites are located on a spatial network as specified by the underlying network (roads, railways, rivers, etc.). With this work, we focus on optimal network location (ONL) queries, i.e., optimal location queries in which objects and sites reside on a spatial network. We introduce two complementary approaches, namely EONL (short for Expansion-based ONL) and BONL (short for Bound-based ONL), which enable efficient computation of ONL queries with datasets of uniform and skewed distributions, respectively. Moreover, with an extensive experimental study we verify and compare the efficiency of our proposed approaches with real-world datasets. In particular, we show that BONL is more effective when the given object-dataset has a skewed spatial distribution, whereas EONL outperforms BONL with uniformly distributed objects. Finally, we demonstrate the importance of considering network distance (rather than p-norm distance) with ONL queries.

1.3.2 Multi-criteria Optimal Location Queries

In the basic optimal location query, we assumed that people have a tendency to go to the nearest site. Thus, proximity is the single criterion to be considered in selecting the convenient site. However, sometimes the closest site may not be the best choice, and people may have different criteria/preferences in selecting their convenient site. From a technical perspective, considering proximity as the only factor yields a single-criterion spatial query, which can be solved by maximizing reverse nearest neighbor queries. However, by considering different criteria/preferences, we are dealing with a multi-criteria optimal location query. In our second study, we reduce the problem of multi-criteria optimal location queries to the problem of maximizing bichromatic reverse skyline queries, and we propose two efficient approaches, namely a Basic-Filtering approach and a Grid-based-Filtering approach, to compute maximum reverse skyline queries for optimal location planning. Moreover, with an extensive experimental study we verify and compare the efficiency of our proposed approaches with both real and synthetic datasets.

1.3.3 Dynamic Optimal Location Queries

In the basic optimal location query, the assumption is that site and object points do not change location over time. But in the real world this may happen frequently. Food trucks are a good example. The food truck business has grown more popular in the United States and is bringing more varied cuisine to the streets. Food trucks frequently change their locations during the day and usually use Twitter to let their clientele know where they can catch the truck to purchase their favorite food. With this application, not only are the sites (i.e., food trucks) moving, but the objects (i.e., customers) also change location as they commute throughout the day.
The food truck problem discussed above is an example of a general family of problems that are best addressed by Dynamic Optimal Location (DOL) queries, where the site and object points change their location over time. One could naïvely solve the DOL query by repeated execution of the basic optimal location query. However, such repeated re-execution is inefficient, and its results can become stale between executions. In this study, we focus on DOL queries and propose an "incremental" approach that avoids redundant computation and continuously maintains the optimal result by updating prior computations as needed. With an extensive experimental study we verify and evaluate the efficiency of our proposed approach with both synthetic and real-world datasets.

1.4 Outline of the Dissertation

The remainder of this dissertation is organized as follows. In Chapter 2, we review the related work. Chapter 3 introduces optimal network location queries and presents two complementary approaches, EONL and BONL, for efficient computation of ONL queries with datasets of uniform and skewed distributions. In Chapter 4, we introduce multi-criteria optimal location queries and present two approaches, namely Basic-Filtering and Grid-based-Filtering, for addressing this problem. Finally, as another variant of optimal location queries, we introduce dynamic optimal location queries and propose an efficient and incremental approach to solve this problem in Chapter 5. The final chapter (Chapter 6) summarizes the major directions for future research in other variants of optimal location queries.

Chapter 2

Related Work

Optimal location queries have been studied by researchers in geography, operations research (OR) and database systems. In geography as well as OR, a substantial amount of work has been performed on location planning problems that address different objectives (e.g., p-median and p-center problems ([Hak64], [Hak65]), covering problems, etc.), which are not within the scope of this dissertation. Thus, in this section we only include the works relevant to the focus of this dissertation.

In geography and OR, most optimal location query problems (also called facility location problems) are formulated as covering problems. These involve locating n sites to cover all or most of the (so-called) demand objects, assuming a fixed service distance for sites. Covering problems are generally classified into two main classes. The first is the Location Set Covering Problem (LSCP) [TSRB71], which seeks to position a minimum number of sites in such a way that each and every demand object has at least one site placed within some threshold distance. The second class is the Maximal Covering Location Problem (MCLP) [CR74], which seeks to establish a set of m sites to maximize the total weight of the "covered" objects, where an object is considered covered if it is located within a specified distance from the closest facility. Many other problems in this class extend the original MCLP by imposing various placement restrictions for sites [MS82, Chu84], assuming various types of objects (points, lines and polygons) [MT07], and considering varying definitions of coverage [BK02].

While geography and OR-based solutions are effective and address various types of optimal location problems, many of these solutions fail to scale with real datasets that consist of large numbers of sites and objects due to their computational complexity.
Accordingly, a number of complementary solutions have been proposed by the database community to provide scalable optimal location query answers. One is the Bichromatic Reverse Nearest Neighbor (BRNN) query [KM00, YL01, SRAA01]. The optimal location query can be formulated as a BRNN maximization problem, with which we try to locate a new site s such that the size of the BRNN set of s is maximized; note, however, that computing the BRNN set of a given site and finding the location that maximizes it are distinct problems. Another relevant problem involves finding the top-k most influential sites [XZKD05]. Here, the influence of a site s is defined as the total weight of the objects in the BRNN set of s. With this problem, the most influential sites are sought among a set of existing sites, whereas with the optimal location problem, the goal is to locate a new site with maximum influence. Zhang et al. ([ZDX06]), on the other hand, solved the min-dist optimal location query in spatial databases. This particular query minimizes the average distance from each object to its closest site if a new site is built. While efficient, their approach is not applicable to our problems. Finally, Wong et al. [WOY+09] and Du et al. [DZX05] have both tackled the optimal location problem. Their approaches both form a spatial bound around each object o such that it includes a location l if and only if o is closer to l than to any other site. The intersection areas where these bounds overlap are the best candidate locations for introducing a new site. Therefore, to compute the optimal location query one can start with the areas with the maximum number of overlapping bounds and avoid other areas, reducing the search space and improving query efficiency. While efficient, both of the aforementioned approaches assume p-norm space (namely, [WOY+09] assumes L2 and [DZX05] assumes L1), which cannot support optimal location queries on spatial networks, as we will present in our experimental results in Section 3.7.2, the "Accuracy" subsection.

Chapter 3

Optimal Network Location Queries

P. Ghaemi, K. Shahabi, J. P. Wilson, and F. Banaei-Kashani. Optimal network location queries. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 478-481, 2010.

3.1 Introduction

Optimal location queries are computationally complex to answer. The existing work on OL queries considers L1 or L2/Euclidean distance metrics as the measure of distance between objects and sites and proposes efficient solutions in these p-norm metric spaces ([WOY+09], [DZX05]). However, in many real-world applications objects and sites are located on a spatial network (e.g., roads, railways, and rivers), and therefore the approaches that assume p-norm distance fail to apply. We show this by an example as follows. Figure 3.1 compares the result of a simple optimal location query assuming L2 and L1 distance between objects and sites versus the result of the same query assuming the actual distance on the spatial network (i.e., the network distance). With this sample query, a set S of two sites S1 and S2, and a set O of three objects O1, O2, and O3 with equal weights are located on a road network (shown by thick lines). Figure 3.1(a) depicts the approach proposed for optimal location query computation in L2 space [WOY+09], where the intersection of multiple circles represents the identified optimal region R1.
As shown, the optimal region R1 and the actual optimal network location, i.e., the network segment n1n2, are completely disjoint. Similarly, Figure 3.1(b) illustrates the optimal location query approach proposed for L1 space [DZX05]. The hatched area (comprising the rectangular areas R2 and R3) is the optimal region in L1 space, which significantly overestimates the actual optimal location n1n2.

Figure 3.1: Optimal Location Queries. (a) L2 space result versus network space result. (b) L1 space result versus network space result.

We further verify the importance of assuming network distance with ONL queries in Section 3.7.2 via experiments, and we show that in 75% of the cases the results of optimal location queries in L1 and L2 spaces are totally disjoint from the actual optimal network location, with less than 20% overlap in the remainder of the cases.

In this chapter, for the first time we introduce two complementary approaches for efficient computation of ONL queries, namely EONL (short for Expansion-based ONL) and BONL (short for Bound-based ONL), which enable efficient computation of ONL queries with object-datasets containing uniform and skewed distributions, respectively.

We argue that the dominating computational complexity with ONL queries is two-fold (this also applies to regular optimal location queries). To answer any ONL query, one first has to compute a spatial neighborhood around each (and every) object o of the given object-dataset such that if s is the nearest site to object o, any new site s′ introduced within the locality of o will be closer to o than the distance between s and o. Second, one must compute the overlap among object neighborhoods to identify the optimal network location, which is a network segment (or a set of segments) where the neighborhoods of a subset of objects with maximum total weight overlap.

Accordingly, with our two proposed algorithms, EONL and BONL, we focus on reducing the computational complexity of the latter and former steps of ONL queries, respectively. In particular, with Expansion-based ONL (EONL) we simply compute the neighborhood of an object by expanding the network around the object until we reach the nearest site s to the object, which adds a costly computation at the first step in answering ONL queries. However, during network expansion we identify and record potential overlaps between neighborhoods of the objects to avoid redundant computation at the second step, thus providing efficient computation of overlaps among object neighborhoods at the second step. On the other hand, with Bound-based ONL (BONL), we avoid the costly network expansion at the first step and instead approximate object localities by an upper bound. In particular, we introduce two bound estimation techniques, which correspondingly result in two variations of BONL. Subsequently, at the second step we compute the overlap among the actual object neighborhoods by network expansion in the event that the object bounds overlap.

Our experimental results with real datasets show that given uniformly distributed object-datasets (i.e., datasets with uniform spatial distributions), EONL is an order of magnitude faster than BONL, whereas with object-datasets with skewed distributions BONL outperforms EONL.
We attribute the difference in efficiency of the two approaches with the two types of datasets to the fact that with skewed/clustered datasets there is less overlap between the neighborhood bounds of the objects, and hence less need for expansion at the second step. In the real world, skewed and uniform distributions of the object-datasets correspond to, for example, the typical distributions of people/customers in urban and rural areas, respectively. Therefore, EONL and BONL have their own exclusive use cases in real-world applications and are complementary.

The key contributions of this chapter can be summarized as follows:

1. We define and formalize the optimal network location query problem.

2. We introduce two complementary approaches for efficient computation of optimal network location queries.

3. We experimentally compare our proposed approaches and discuss their use cases with different real-world applications.

The remainder of this chapter is organized as follows. Section 3.2 reviews the related work. Section 3.3 formally defines optimal network location queries in spatial network databases. Sections 3.4 and 3.5 introduce our proposed expansion-based and bound-based solutions for optimal network location queries, respectively. In Section 3.6, we present a complexity analysis of our proposed approaches. In Section 3.7, we evaluate our proposed solutions via experiments with real-world data. Section 3.8 concludes the study and discusses directions for future research.

3.2 Related Work

Similar to the basic optimal location queries, there is a large body of literature on optimal network location queries in both the operations research and database communities. Church and Meadows ([CM79]) first introduced the Location Set Covering Problem (LSCP) and the Maximal Covering Location Problem (MCLP) on networks. In their approach, the p-maximal covering problem on a network can be modeled by an integer programming formulation and solved using linear programming. In addition, Megiddo et al. ([MZH83]) presented a polynomial-time algorithm for the maximum coverage problem on a tree network, since the problem is NP-hard on a general graph ([RS90]). Berman and Wang ([BW08]) formulated the probabilistic 1-maximal covering problem and developed a normal approximation approach to search for an optimal solution. None of the above approaches is applicable to ONL queries, since they are largely limited to cases where the networks are trees, they provide approximate optimal solutions, or they are not efficient with large datasets. Our goal is to provide an exact, scalable solution for optimal location queries on a general road network.

In the database community, there are some existing works on query processing techniques for spatial networks. Most of these techniques are related to the nearest neighbor (NN) query, proposed by Papadias et al. [PZM03], Kolahdouzan et al. [KS04], Jensen et al. [JKPT03], and Samet et al. [SSA], or its variants, e.g., continuous NN queries studied by Mouratidis et al. [MLDM06], approximate NN queries presented by Sankaranarayanan et al. ([SSA09], [SS09]), and aggregate NN queries introduced by Yiu et al. [YMP05]. None of those approaches can address the problem of optimal network location queries, due to the fundamental differences between the nearest neighbor and optimal location queries. In addition, there are some existing works on the Bichromatic Reverse Nearest Neighbor (BRNN) query, studied by Korn et al. [KM00], Yang et al.
[YL01], and Stanoi et al. [SRAA01], and on the top-k most influential sites, proposed by Xia et al. [XZKD05]; both problems are orthogonal to optimal location queries. Wong et al. [WOY+09] and Du et al. [DZX05] tackled the optimal location problem. While efficient, both of the aforementioned approaches assume p-norm space (namely, [WOY+09] assumes L2 and [DZX05] assumes L1), which, as shown in Figure 3.1, cannot support optimal location queries on spatial networks. Our proposed solutions utilize network distance to address optimal network location queries.

More recently, Xiao et al. [XYL11] proposed a unified framework that efficiently addresses three variants of optimal location queries in road networks. One of these variants, called Competitive Location Queries, is the same problem we have defined as ONL queries in this chapter. To address this problem, they divide the edges of the network into small intervals and find the optimal location on each interval. To avoid the exhaustive search on all edges, in their optimized method called FGP-OTF, they first partition the road network graph into sub-graphs and process them in descending order of their likelihood of containing the optimal locations. Their extensive experiments show the high performance of the FGP-OTF approach in terms of running time and memory consumption. However, our experiment in Section 3.7 shows that the FGP-OTF approach does not perform efficiently with a large road network dataset and a set of object points with a nearly uniform weight distribution. Our experimental results also illustrate that a single approach for ONL queries might not perform efficiently for different distributions of object and site points (e.g., uniform vs. skewed distributions).

3.3 Problem Definition

In this section, we formalize the problem of optimal network location as a Maximum Overlap Segment (MaxOSN) problem. Assume we have a set S of sites (e.g., public schools, libraries, restaurants) in a 2D environment. Also, we have a set O of objects with a weight w(o) for each object o. For instance, object o might be a residential building/property, where w(o) represents the number of people living in this building. A MaxOSN query returns a subset of the spatial network (i.e., a segment or collection of segments) where introducing a new site would maximize the total weight of the objects that are closer to the new site than to any other site. We assume both sites and objects are located on a spatial network, e.g., a road network. We model the road network as a graph G(N, E), where N is the set of intersections/nodes and E is the set of edges of the road network. Each edge e(a, b) has a travel cost. In this study, we assume the cost of each edge e is proportional to the distance between the two end points a and b of e. Accordingly, the network distance dN(a, b) between any two nodes a and b is the travel cost of the path with least cost from a to b.

Figure 3.2: Road network model

Figure 3.2 shows a road network with 14 nodes and weighted edges, four objects o1, o2, o3, and o4 with weights 3, 6, 5, and 4, respectively, and three sites s1, s2, and s3. Below, we first define our terminology. Thereafter, we describe the MaxOSN query problem.
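Before turning to the definitions, here is a minimal sketch of this graph model (the adjacency-list representation and all names are our own illustration, not the dissertation's implementation), where the network distance dN is computed with Dijkstra's algorithm:

```python
import heapq

# Directed weighted graph G(N, E): node -> list of (neighbor, travel cost).
G = {'a': [('b', 2.0), ('c', 3.0)],
     'b': [('d', 4.0)],
     'c': [('d', 1.0)],
     'd': []}

def network_distance(G, src, dst):
    # dN(src, dst): travel cost of the least-cost path (Dijkstra).
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > dist[u]:
            continue                      # stale queue entry
        for v, w in G[u]:
            if d + w < dist.get(v, float('inf')):
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return float('inf')                   # dst unreachable

print(network_distance(G, 'a', 'd'))      # 4.0, via a -> c -> d
```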
DEFINITION 1 (LOCAL NETWORK). Given an object o, the local network LN(o) of o is a sub-network expanded at object o that contains all points on the road network with a network distance less than or equal to the network distance between o and its nearest site s; i.e.:

LN(o) = { q | q ∈ e, dN(o, q) ≤ dN(o, s) }, where e ∈ E and s = argmin_{p∈S} dN(o, p).

Figure 3.3: Local Networks

In Figure 3.3, site s1 is the nearest site to the object o1, where dN(o1, s1) = 5. LN(o1) is identified by expansion, i.e., starting from o1 we traverse all possible paths up to a network distance equal to 5, and we delimit LN(o1) by marking the ending points, namely markers (shown as arrows in Figure 3.3). We term this delimitation process edge marking. The expanded network consists of a set of local edges connecting the associated object to all marked ending points. It is important to note that local edges can fully or partially cover an actual edge of the road network. For example, the local edges of LN(o1) are o1n2, o1n1, o1n4 and o1n (shown as bold lines in Figure 3.3). Each local edge e is also assigned an influence value, denoted by I(e), which is equal to the weight of the corresponding object. For instance, all local edges in LN(o1) have an influence value equal to 3 (i.e., the weight of object o1).

DEFINITION 2 (OVERLAPPING LOCAL NETWORKS). A local network LN(o1) overlaps a local network LN(o2) if LN(o1) ∩ LN(o2) ≠ ∅. In such cases, there exists at least one local edge e1 in LN(o1) which intersects a local edge e2 in LN(o2). For instance, in Figure 3.3, LN(o1) overlaps with LN(o2), since the local edge o1n2 in LN(o1) overlaps with the local edge o2n3 in LN(o2).

Figure 3.4: Overlap segment of multiple local networks

DEFINITION 3 (OVERLAP SEGMENT). Given two overlapping local networks LN(o1) and LN(o2), an overlap segment s is a network segment where two overlapping local edges e1 and e2 from the two local networks intersect; i.e.:

s = { q | q ∈ e1, q ∈ e2 }, where e1 ∈ LN(o1), e2 ∈ LN(o2), and LN(o1) ∩ LN(o2) ≠ ∅.

Accordingly, the influence value of segment s, Is, is defined as Is = I(e1) + I(e2). For example, in Figure 3.3, the overlap segment jn2 is identified by overlapping the local edges o1n2 and o2n3. Also, its influence value is equal to 9. The definition of the overlap segment can be generalized to more than two local edges: given multiple local networks and multiple markers on each edge, the overlap segment on the edge can be identified by considering the direction and length of the overlapping local edges (we discuss this process, called edge collapsing, in detail in Section 3.4 and Table 3.2). For instance, Figure 3.4 shows the overlap segment jk identified by overlapping local edges ak and bj.

DEFINITION 4 (MAXIMUM OVERLAP SEGMENT QUERY (MaxOSN)). Given a set O of objects and a set S of sites, the MaxOSN query returns the optimal network location p, the set of overlap segment(s) with maximum influence value Io:

p = { s | s ∈ OS, Is = max_{s′∈OS} Is′ }, where OS is the set of overlap segments.

For instance, in the road network illustrated in Figure 3.3, the MaxOSN query returns the set of overlap segments {o3n8, o3n5}, where each segment has an optimal influence value Io = 11.
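As a concrete illustration of Definition 1, the following sketch (ours; it simplifies by assuming objects and sites sit on graph nodes, whereas in the dissertation they may lie anywhere on an edge and markers must be placed at exact cut-off points on partially covered edges) expands LN(o) by running Dijkstra from o and stopping once the nearest site's distance is exceeded:

```python
import heapq

def local_network(G, o, sites):
    # Expand from object o until its nearest site is settled (Dijkstra),
    # then keep expanding only up to that radius; every node settled within
    # the radius belongs to LN(o).
    dist, pq = {o: 0.0}, [(0.0, o)]
    settled, radius = {}, None
    while pq:
        d, u = heapq.heappop(pq)
        if u in settled:
            continue
        if radius is not None and d > radius:
            break                          # beyond dN(o, nearest site)
        settled[u] = d
        if radius is None and u in sites:
            radius = d                     # nearest site reached
        for v, w in G[u]:
            if d + w < dist.get(v, float('inf')):
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return settled, radius                 # nodes of LN(o) and dN(o, s)
```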
3.4 Expansion-based Optimal Network Location Queries (EONL)

As we mentioned in Section 3.1, answering an ONL query is a two-phase process. At the first phase, one needs to build the local networks of all objects, whereas at the second phase the local networks of the objects are intersected in order to identify the overlap segment(s) with maximum influence value (i.e., the optimal location/segment). With EONL, we focus on reducing the computational complexity of the second phase. In particular, at the first phase EONL simply uses network expansion to build the local networks. At the second phase, assuming we have n objects (and therefore n local networks), one should compute the overlap among 2^n combinations of local networks. In this case, if (for example) one of the network range-query processing techniques proposed by Papadias et al. [PZM03] is used for overlap computation, the total computational complexity would be in the order of O(2^|O| (|N| log |N| + |E|)). Obviously, this approach is not scalable. Instead, with EONL we identify the potential optimal segments while expanding local networks at the first phase, and leverage this information at the second phase to efficiently compute the segment(s) with the maximum influence value. To be specific, while expanding the local networks at the first phase, for each edge we record all ending points (i.e., the points that mark the border of the local networks of the objects) that lie over the edge. Subsequently, at the second phase we use the information recorded at the first phase to compute a score for each edge, which is equal to the total weight of the objects whose local networks fully or partially cover that edge. One should observe that a higher score for an edge indicates a higher potential of containing an optimal segment. Next, through a refinement process we sort the edges based on their scores in descending order and, starting from the edges with higher scores, we use a technique termed edge collapsing to compute the actual overlap segment(s) on each edge. It is important to note that through this refinement process we only have to compute the actual overlap segment(s) for an edge if the score of the edge is more than the influence value of the actual segments computed thus far. With our experiments, we observe that EONL only computes the actual overlap segments for a limited subset of the network edges before it identifies the optimal location/segment; hence, it provides effective pruning of the search space for better efficiency. Below, we explain how we implement EONL in more detail as a six-step approach:

Step 1 (Expanding local networks and edge marking): For each object point o, we first expand the local network of object o, LN(o), using the Dijkstra algorithm [Dij59] (from a single source to a single destination, i.e., the nearest site to the object o). Then, we mark the ending points/markers of the local networks on the edges and record the markers.

Step 2 (Constructing Marked Edge Table): Once markers are generated, we construct a table called the Marked Edge Table (MET). Table 3.1 shows a sample subset of the marked edges of Figure 3.3. Each row of the Marked Edge Table (MET) is an entry of the form (e, M, Sc(e)), where M is the set of markers marked on edge e (including the starting and ending node of edge e) and Sc(e) is the score of edge e, which is equal to the total weight of the objects whose local networks fully or partially cover edge e. The MET helps us to identify the overlapping segments with the maximum influence value using the edge collapsing technique described in Step 5.

Table 3.1: Marked Edge Table (MET)

e    M            Sc(e)
kp   {k, n1, p}   3
kj   {k, n3, j}   9
hg   {h, n8, g}   11
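A minimal sketch of the bookkeeping in Steps 1 and 2 (the data layout for local-network footprints and markers is our own simplification, not the dissertation's data structure; the final sort anticipates Step 3 below):

```python
from collections import defaultdict

def build_met(coverage, weights, edge_nodes):
    # coverage: object -> list of (edge, marker) pairs; `marker` is the point
    # where LN(object) ends on that edge, or None if the edge is fully covered.
    # edge_nodes: edge -> (start node, end node).
    met = defaultdict(lambda: {'M': set(), 'Sc': 0})
    for obj, covered in coverage.items():
        for edge, marker in covered:
            entry = met[edge]
            entry['M'].update(edge_nodes[edge])   # end nodes count as markers
            if marker is not None:
                entry['M'].add(marker)
            entry['Sc'] += weights[obj]           # edge (partially) covered
    # Entries in descending score order, as Step 3 requires.
    return sorted(met.items(), key=lambda kv: kv[1]['Sc'], reverse=True)
```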
Step 3 (Sorting MET): We sort all entries in the MET in descending order of Sc(e), based on our observation that the optimal solution is mostly derived from the entries with larger Sc(e) values.

Step 4 (Initializing optimal result set): At the end, the EONL algorithm returns the set of optimal overlap segments (So) with the optimal influence value Io. We first initialize both to empty values.

Step 5 (Identifying overlap segments using the edge collapsing technique): From the set of marked edges in the MET, we identify the optimal overlap segments using the edge collapsing technique. Below we describe the edge collapsing technique applied to a marked edge e from the MET. First, we split edge e into a set of segments, SG(e), where each segment lies between two consecutive markers. Then, for each segment s of the set SG(e) we investigate which local networks cover segment s. Accordingly, we compute the influence value of segment s by summing the weights of the corresponding local networks. For each edge of the MET, the optimal overlap segment (os) is the segment with the highest influence value among all segments in SG(e). It is important to note that edge collapsing may produce more than one optimal overlap segment on an edge. As an example, we apply edge collapsing to the second marked edge of the MET (edge kj in Table 3.1). Considering the three markers recorded on it, the set of segments is kn3, n3j. The first segment, kn3, is only covered by local network LN(o1). However, the second segment, n3j, is covered by both local networks LN(o1) and LN(o2), which results in a higher influence value compared to segment kn3. Therefore, the actual overlap segment of the edge kj is the segment n3j, with an influence value equal to 9.

Table 3.2 represents four possible cases by which two local edges e1 and e2 might overlap each other. In the original figures, dashed lines represent the local edges e1 and e2, a solid line represents the actual edge ab of the road network, and m1 and m2 are two markers. The "Overlap Segment" column summarizes how the edge collapsing technique computes the overlap segment (os) with the maximum influence value (Is) in each case.

Table 3.2: Edge collapsing technique (the "Overlapping Local Edges" column, a diagram of each configuration, is not reproducible here)

Case   Overlap Segment
1      os = am1; Is = I(e1) + I(e2)
2      os = ab; Is = I(e1) + I(e2)
3      If I(e1) > I(e2): os = am1, Is = I(e1); else: os = bm2, Is = I(e2)
4      os = m2m1; Is = I(e1) + I(e2)

It is important to note that while we could apply the edge collapsing technique to all marked edges, we do not need to process a marked edge whose score is smaller than the current optimal influence value. Thus, we can prune any marked edge with Sc(e) < Io. For the remaining edges, we update Io to the influence value Is of the resulting actual overlap segment whenever Is exceeds the current Io.

Step 6 (Finding the maximum influence value): When the algorithm terminates, So contains the set of optimal overlap segment(s) with the optimal influence value Io.
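Before turning to the full listing (Algorithm 1 below), here is a minimal sketch of edge collapsing in isolation (our simplification: each marked edge is modeled as the interval [0, length] and each local edge as a sub-interval carrying its influence value):

```python
def collapse_edge(length, local_edges):
    # local_edges: (lo, hi, influence) intervals on the edge [0, length], each
    # the footprint of one object's local network on this edge.
    cuts = sorted({0.0, length} | {p for lo, hi, _ in local_edges for p in (lo, hi)})
    best_val, best_segs = -1, []
    for lo, hi in zip(cuts, cuts[1:]):     # segments between consecutive markers
        val = sum(i for a, b, i in local_edges if a <= lo and hi <= b)
        if val > best_val:
            best_val, best_segs = val, [(lo, hi)]
        elif val == best_val:
            best_segs.append((lo, hi))
    return best_segs, best_val             # overlap segment(s) on e and Is

# Case 4 of Table 3.2: e1 covers [0, 7], e2 covers [3, 10] on a 10-unit edge;
# the overlap segment is [3, 7] with Is = I(e1) + I(e2) = 9.
print(collapse_edge(10.0, [(0.0, 7.0, 3), (3.0, 10.0, 6)]))
```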
Algorithm 1 shows our implementation of the EONL algorithm:

Algorithm 1 EONL Algorithm
1: For each o ∈ O
2:   Expand the local network of object o
3:   Mark ending points/markers on edges
4: Construct the Marked Edge Table (MET)
5: Sort the MET based on Sc(e)
6: Initialize So and Io to ∅
7: For each marked edge e of the MET
8:   If Sc(e) ≥ Io
9:     Apply edge collapsing to edge e
10:    Retrieve Is and the optimal overlap segment(s)
11:    Update Io = Is
12:    Update So to the set of overlap segments with maximum influence value Is
13: Return the optimal solution set So and Io

Here, we illustrate the application of the EONL algorithm using the example depicted in Figure 3.2. Assume we have performed the local network expansion for the four objects o1, . . . , o4 and all ending points are marked on edges, as shown using arrows in Figure 3.3 (recall that the starting and ending nodes of edges are considered markers, which are not illustrated using arrows in Figure 3.3). We construct the MET and sort its entries based on their Sc(e) values. The first edge in the MET is hg. By applying the edge collapsing technique to hg, we retrieve the optimal overlap segment o3n8 with an influence value Is equal to 11. Then, we update Io to 11. Thereafter, we perform the iterative steps on the remainder of the marked edges. Among the 14 marked edges of the road network shown in Figure 3.2, only the marked edge ih satisfies the condition Sc(e) ≥ Io. Therefore, the remainder of the marked edges can be pruned out. By applying edge collapsing on ih, the overlap segment o3n5 is derived, which leaves Io unchanged. At this point the algorithm terminates, since all marked edges have been processed. Therefore, the optimal network location query on the dataset shown in Figure 3.3 returns the overlap segments {o3n5, o3n8} with an optimal influence value of 11.

3.5 Bound-Based Optimal Network Location (BONL)

Similar to EONL, our bound-based optimal network location (BONL) approach is implemented as a two-phase process. However, with BONL we avoid the computational complexity of network expansion at the first phase by approximating the local networks with their corresponding spatial bounds. In particular, we define a (circular) spatial bound around each object o such that it is guaranteed to contain the local network of the object. For example, given an object point o and its nearest site s in the spatial network, one can use the Euclidean Restriction property proposed by Papadias et al. [PZM03] to define such a circular bound with radius equal to or greater than dN(o, s), which guarantees containment of the local network of o. Figure 3.5 shows the local bounds of four objects o1, o2, o3, and o4, as well as their corresponding local networks. The weight of the local bound lb for an object, denoted by w(lb), is defined to be equal to the weight of the corresponding object. In order to form the local bound for an object using the Euclidean Restriction property, BONL must compute the (exact or approximate) distance between the object and its corresponding nearest site in the spatial network. Toward that end, we propose two variations of BONL. With BONL-U (i.e., BONL with upper bound), we approximate the local bound of an object by an upper bound, which is derived using two different landmark selection techniques. On the other hand, with BONL-M (i.e., BONL with minimum bound), we introduce an efficient approach to compute the exact distance between an object and its nearest site.
While BONL-M always provides a more accurate approximation of the local networks, in our study we also considered BONL-U as an option with potentially more efficient bound computation. We explain our bound computation approaches for BONL-U and BONL-M in Sections 3.5.1 and 3.5.2, respectively. Here, assuming that the local bounds (either the upper bound with BONL-U or the exact/minimum bound with BONL-M) are computed in the first phase of BONL, we explain the second phase, in which the ONL query is answered. In the second phase, we need to overlap the computed spatial bounds and prioritize the investigation of those overlapping areas that have a higher potential of covering the optimal segments (similar in spirit to the MET and the edge collapsing technique of EONL). It is important to mention that overlapping the spatial bounds helps us predict the areas that might cover the optimal segments; however, to identify the exact optimal overlap segments we still need to expand the local networks within the spatial bounds and retrieve the optimal overlap segments using our edge collapsing technique. Below we explain our implementation of BONL in more detail.

With BONL, once the local bounds of the objects are identified, for each local bound lb we find the list of other local bounds that overlap lb; we call this list the overlapping list OL(lb) of lb. Lemma 1 states the condition for identifying overlapping bounds:

LEMMA 1. Local bound lb_1 with radius r_1 overlaps local bound lb_2 with radius r_2 if and only if r_1 + r_2 ≥ |o_1o_2|, where o_1 and o_2 are the centers of the circular bounds lb_1 and lb_2, respectively.

PROOF. Obvious. □

Figure 3.5: Local Bounds

Once the overlapping list of each local bound is generated, we construct a Pair-wise Overlapping Table (POT), where each row is an entry of the form (lb, OL(lb)). We refer to OL(lb) simply as OL. The entries of the POT are sorted in descending order of w(OL), where w(OL) = Σ_{lb ∈ OL} w(lb) (recall that w(lb) is the weight of local bound lb). Table 3.3 shows the POT constructed for the example depicted in Figure 3.5.

Table 3.3: Pair-wise Overlapping Table (POT)
  lb   | OL(lb)
  lb_1 | lb_2, lb_3, lb_4
  lb_3 | lb_1, lb_2
  lb_2 | lb_1, lb_3
  lb_4 | lb_1

Finally, starting from the first entry, BONL processes each entry of the POT to find the optimal segments as follows:

Step 1 (Expanding local networks): For each entry (lb, OL) in the POT, we take the OL list and expand the corresponding local networks, as well as the local network of lb, using the Dijkstra algorithm, marking all ending points on each edge of the network.

Step 2 (Identifying overlap segments): For each entry, we identify the overlap segments of the overlapping local networks derived in the previous step using the edge collapsing technique described in Section 3.4.

Step 3 (Finding the maximum influence value): Among all identified overlap segments, we pick the one with the maximum influence value I_o as the optimal solution.

It is important to note that we do not need to expand the local networks of every entry: an entry can be skipped if another entry's actual influence value is already greater. Specifically, we can prune any entry of the POT for which w(OL) + w(lb) < I_o, where I_o is the optimal influence value found so far.
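The first phase of BONL, forming the POT, can be sketched as follows in C#. This is a simplified sketch under assumed names: the quadratic pair-wise test stands in for whatever indexing the actual implementation uses, and Lemma 1 supplies the overlap predicate.

using System;
using System.Collections.Generic;
using System.Linq;

class LocalBound
{
    public double X, Y;      // center o of the circular bound
    public double Radius;    // at least d_N(o, s), per the Euclidean Restriction
    public double Weight;    // w(lb), the weight of the corresponding object
}

static class PotBuilder
{
    // Lemma 1: two circular bounds overlap iff r1 + r2 >= |o1 o2|.
    static bool Overlaps(LocalBound a, LocalBound b)
    {
        double dx = a.X - b.X, dy = a.Y - b.Y;
        return a.Radius + b.Radius >= Math.Sqrt(dx * dx + dy * dy);
    }

    // Builds the POT: one entry (lb, OL(lb)) per bound, sorted by w(OL) descending.
    public static List<(LocalBound lb, List<LocalBound> ol)> Build(List<LocalBound> bounds)
    {
        var pot = new List<(LocalBound lb, List<LocalBound> ol)>();
        foreach (var lb in bounds)
        {
            var ol = bounds.Where(b => b != lb && Overlaps(lb, b)).ToList();
            pot.Add((lb, ol));
        }
        // Entries with heavier overlapping lists are examined first.
        return pot.OrderByDescending(e => e.ol.Sum(b => b.Weight)).ToList();
    }
}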
3.5.1 Bound-based Optimal Location with Upper Bound (BONL-U)

With BONL-U, the upper bound on the network distance between each object point and its nearest site is computed based on a set of landmarks (a selected subset of the site points) and the Dijkstra algorithm [Dij59]. This approach is inspired by the ALT algorithm of Goldberg et al. [GH05]; however, ALT computes a lower bound on shortest path distances based on A* search, landmarks, and the triangle inequality. Our upper bound computation entails carefully choosing a small (constant) number of landmarks and then computing shortest path distances between all nodes of the spatial network and from/to each of these landmarks using the Dijkstra algorithm. Upper bounds are then computed in constant time from these distances.

Calculating "From" and "To" distances: Since each edge of our experimental road network (the LA County road network) is directional, we calculate the shortest path distances between all nodes of the road graph both "from" and "to" all landmark points using the Dijkstra algorithm (Figure 3.6(a)).

Figure 3.6: Upper bound value computation of the network distances. (a) Distance from and to landmarks; (b) upper bound value calculation.

Calculating the upper bound value: Figure 3.6(b) illustrates how we compute the upper bound Ud_N(o, s) on the network distance d_N(o, s). It is obtained from the network distance between o and a landmark L, d_N(o, L), and the network distance d_N(L, s), via the triangle inequality: Ud_N(o, s) = d_N(o, L) + d_N(L, s) ≥ d_N(o, s). To calculate the upper bound between object point o and its nearest site s, we first compute the bound between o and every site point through every possible landmark point. Then, from all computed upper bound values we pick the minimum as Ud_N(o, s).

Landmark selection: Finding good landmarks is critical for the overall performance of the upper bound computation. The optimal choice is the one that yields an upper bound very close to the actual network distance. In the following we discuss two alternative landmark selection techniques: uniform and weighted grid-based selection. In both techniques we guarantee that any two landmarks are spatially farther apart than a specified range. In the uniform grid-based approach, we randomly select a constant number of landmarks in a series of grid cells spanning the LA County road network. With the weighted approach, we select more landmarks in regions with more site points. For this purpose, we count the number of sites falling within each cell, denoted T_c, and assign k·T_c/|S| landmarks to that grid cell, where k is the total number of landmarks and |S| is the total number of site points in the entire dataset. The grid cells measured 10 km on a side, given that larger cells gave results similar to the uniform selection strategy, while grid sizes smaller than 10 km generated numerous grid cells with no assigned landmarks. Our experimental results (see Section 3.7.2) show that uniform grid-based landmark selection outperforms weighted grid-based landmark selection in terms of computation cost.

Our experimental results also showed that the BONL-U algorithm performs poorly due to the cost of the upper bound computation with landmark selection. The drawback of using landmarks is that the radius of the local bounds, Ud_N(o, s), is always larger than the actual distance. This produces large numbers of overlapping local bounds and networks, which leads to relatively high computation cost for the BONL algorithm.
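The upper bound computation described above can be sketched in C# as follows. This is a minimal sketch with assumed array layouts; toLandmark and fromLandmark are hypothetical names for the precomputed Dijkstra distance tables.

using System;

static class BonlU
{
    // Upper bound on d_N(o, s*) for object node o and its (unknown) nearest
    // site s*: the minimum over landmarks L and sites s of d_N(o,L) + d_N(L,s).
    // toLandmark[L][n] = d_N(n, L) and fromLandmark[L][n] = d_N(L, n) are
    // assumed precomputed with Dijkstra, one run per landmark on the original
    // and the reversed directed graph, respectively.
    public static double UpperBound(int o, int[] sites,
                                    double[][] toLandmark, double[][] fromLandmark)
    {
        double best = double.PositiveInfinity;
        for (int L = 0; L < toLandmark.Length; L++)
            foreach (int s in sites)
            {
                // Triangle inequality: d_N(o,L) + d_N(L,s) >= d_N(o,s).
                double ub = toLandmark[L][o] + fromLandmark[L][s];
                if (ub < best) best = ub;
            }
        return best; // radius Ud_N(o, s) of the local bound under BONL-U
    }
}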
In the next subsection, we introduce bound-based optimal location queries with minimum bound (BONL-M), in which we improve the local bounds by computing the actual network distance between objects and their nearest sites.

3.5.2 Bound-based Optimal Location with Minimum Bound (BONL-M)

With BONL-M, we compute the actual network distance between each object point and its nearest site, d_N(o, s), using the following three-step approach:

Step 1 (Reversing the road network graph): We first reverse the road network graph. Let G(N, E) represent a road network, where N is the set of nodes and E is the set of edges. We define G′(N, E′) as the reverse graph of the road network G if for each edge e(a, b) ∈ E there exists a reverse edge e′(b, a) ∈ E′.

Step 2 (Calculating the network distance of each node to its nearest site): We then calculate the shortest distance between each node and its nearest site using the Dijkstra algorithm. Toward this end, we run the Dijkstra algorithm from each site point s and traverse all nodes of the graph. When traversing each node n, we store a value g_n, representing the shortest path distance between site s and node n. Each time we pick a new site s, we check the g_n value while traversing the nodes; if the current g_n value is greater than the shortest distance between node n and site s, we update g_n to that shorter distance. After processing all site points, the g_n values stored with the nodes represent the shortest path distances between the nodes and their nearest sites.

Step 3 (Computing the network distance of each object to its nearest site): We calculate the network distance of each object to its nearest site using the g_n values computed in the previous step.

This three-step approach calculates the actual network distance d_N(o, s), and these values are used in place of the upper bound values of BONL-U. Our experiments demonstrate that the BONL-M approach reduced the radii of the local bounds and improved performance compared to the BONL-U approach.
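Step 2 admits a compact sketch. Instead of the per-site Dijkstra runs with g_n updates described above, the version below performs one multi-source Dijkstra on the reversed graph with all sites as sources at distance zero, which computes the same g_n values; the adjacency-list layout and the use of .NET's PriorityQueue are assumptions.

using System;
using System.Collections.Generic;

static class NearestSite
{
    // reversed[a] lists (b, w) for each original edge (b, a), i.e., the
    // adjacency list of the *reversed* road network G'(N, E').
    public static double[] Compute(List<(int to, double w)>[] reversed, IEnumerable<int> sites)
    {
        int n = reversed.Length;
        var g = new double[n];
        Array.Fill(g, double.PositiveInfinity);
        // PriorityQueue<TElement, TPriority> is available since .NET 6;
        // any binary heap works in its place.
        var pq = new PriorityQueue<int, double>();
        foreach (int s in sites) { g[s] = 0; pq.Enqueue(s, 0); }
        while (pq.TryDequeue(out int u, out double d))
        {
            if (d > g[u]) continue; // stale queue entry
            foreach (var (v, w) in reversed[u])
                if (g[u] + w < g[v]) { g[v] = g[u] + w; pq.Enqueue(v, g[v]); }
        }
        return g; // g[n] = network distance from node n to its nearest site
    }
}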
3.6 Complexity Analysis

In this section, we analyze the computational complexity of our proposed approaches.

BONL-U: Below, we discuss the computational complexity of the various tasks of BONL-U:

Landmark selection: The landmark selection step takes O(k²|N|) time (recall that k is the number of selected landmark points).

Calculating "From" and "To" distances: Given k landmark points, computing the "from" and "to" distances takes O(k(|N| log|N| + |E|)) and O(|N|(|N| log|N| + |E|)) time, respectively. In total, the running time of the upper bound computation would be O(|N|(|N| log|N| + |E|)), which is extremely high for very large road networks. To improve the running time of this step, we reverse the road graph (O(|E|)) and calculate the distances "from" the landmarks to the nodes. This technique improves the running time to O(k(|N| log|N| + |E|)).

Calculating the upper bound value: This step takes O(k|O||S|), and the total running time for computing Ud_N(o, s) is O(k²|N|) + O(k(|N| log|N| + |E|)) + O(k|O||S|).

Forming local bounds: This step takes O(1) time.

Constructing the POT: This step takes O(|O|²) time, since there are |O| local bounds.

Sorting the POT: Sorting takes O(|O| log|O|) time.

Expanding the local networks: Since the number of overlapping local bounds can theoretically be as large as |O|, expanding the local networks takes O(|O|(|N| log|N| + |E|)) time. We then mark all ending points on the edges, which requires O(|E|) time. Note that the edge marking step cannot be performed during the Dijkstra expansion of Step 1, because Dijkstra's algorithm assumes that objects and sites fall on network nodes, whereas in our scenario ending points may fall on edges.

Identifying overlap segments with the maximum influence values: The edge collapsing technique takes O(|E||O|²) time. Thus, with |O| entries in the POT, the edge collapsing step has complexity O(|E||O|³).

Finding the maximum influence value: This task takes O(1) time.

The dominating factors in the overall running time of BONL-U are O(k(|N| log|N| + |E|)) + O(|O|²|N| log|N| + |O|³|E|).

BONL-M: The complexity of BONL-M is similar to that of BONL-U, except for the computation of the actual network distances, which requires the following steps:

Reversing the road network graph: This step can be done in O(|E|) time.

Calculating the network distance of each node to its nearest site: The running time of this step is O(|S|(|N| log|N| + |E|)).

Computing the network distance of each object to its nearest site: This step can be done in O(|O|) time.

The dominating factors in the overall running time of BONL-M are O(|S|(|N| log|N| + |E|)) + O(|O|²|N| log|N| + |O|³|E|).

EONL: In this case, we reduce the running time of the optimal network location query by eliminating the cost of the upper/minimum bound computation. The costs of constructing the MET and sorting the table are O(|E|) and O(|E| log|E|), respectively. The overall running time is O(|E| log|E|) + O(|O|(|N| log|N| + |E|)) + O(|E||O|²); because edge collapsing is performed only once per edge, its complexity improves to O(|E||O|²) with EONL. The dominating factors in the overall running time of EONL are O(|O||N| log|N| + |O|²|E|).
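Collecting the dominating terms derived above side by side makes the comparison explicit (this block only restates bounds already established in this section):

\begin{align*}
T_{\mathrm{BONL\text{-}U}} &= O\!\left(k(|N|\log|N|+|E|)\right) + O\!\left(|O|^{2}|N|\log|N| + |O|^{3}|E|\right)\\
T_{\mathrm{BONL\text{-}M}} &= O\!\left(|S|(|N|\log|N|+|E|)\right) + O\!\left(|O|^{2}|N|\log|N| + |O|^{3}|E|\right)\\
T_{\mathrm{EONL}} &= O\!\left(|O||N|\log|N| + |O|^{2}|E|\right)
\end{align*}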
3.7 Experimental Evaluation

We next describe the setup used for the experiments, and then present and discuss the results.

3.7.1 Experimental Setup

All experiments were performed on an Intel Core Duo 3 GHz system with 4 GB of RAM, dual-booting Windows 7 and Fedora 16 Linux. The algorithms are implemented in Microsoft C# on the .NET 3.5 platform. The reason we chose a dual-boot system is that we later compare our implemented approach with that of Xiao et al. [XYL11], whose approach was programmed in C++ on a Linux machine. We use a spatial network of |N| = 375,691 nodes and |E| = 871,715 bidirectional edges, representing the LA County road network. The spatial network covers 130 km × 130 km and is cleaned to form a connected graph.

We use real-world datasets for objects and sites. Objects are population data derived from the LandScan Global Population Database (Bhaduri et al. [BBCD02]; see http://www.ornl.gov/landscan/ for additional details), compiled on a 30″ × 30″ latitude/longitude grid. The centroid of each grid cell is treated as the location of an object, and the population within the grid cell as the weight of the object. Objects not located on road network edges were snapped to the closest edge of the road network. In total we have |O| = 9,662 objects. The weights of the objects are distributed nearly uniformly, with an average of 1,100. For each experiment, we use a subset of object points selected from this base dataset, as described in each of the experiments.

We also deployed five site datasets consisting of Johnny Rockets restaurants, McDonald's restaurants, hospitals, schools, and all fast food restaurants (i.e., outlets) in LA County (including McDonald's and Johnny Rockets). The cardinality of each site dataset is shown in Table 3.4. All sites, objects, nodes, and edges are stored in memory-resident data structures.

Table 3.4: Five real datasets for sites
  Dataset           | Cardinality
  Johnny Rockets    | 28
  McDonald's        | 328
  Hospitals         | 308
  Schools           | 2,621
  Fast Food Outlets | 19,160

3.7.2 Experimental Results

Below we present the results of the four series of experiments that we ran on the aforementioned datasets.

Accuracy: We first verified that optimal location queries in L1 and L2/Euclidean space are not applicable to spatial networks. For this test, we selected four datasets with 20, 40, 60, and 85 object points randomly selected from the population data (DS_1 to DS_4). All four sets of object points were located on the LA County road network. For site points, we selected a subset of seven McDonald's sites (Figure 3.7 shows only four of them). We applied the L2 [WOY+09] and L1 [DZX05] distance approaches and identified the optimal location in each case. Then, we ran the EONL algorithm on each dataset and retrieved the corresponding optimal network location. This experiment showed that in 75% of the cases (which we call set A) the optimal locations derived by the L1/L2 approaches did not overlap the optimal network location derived by EONL, and when they did overlap, there was less than 20% common coverage. Figure 3.7 shows one of the non-overlapping cases of set A (circles represent objects and triangles represent sites). For the cases in set A, the average distances between the optimal network location and the optimal locations derived by the L1 and L2 approaches (⟨N, L1⟩ and ⟨N, L2⟩) are comparable to the size of the entire area covered by these datasets (see Table 3.5). This verifies that using the existing L1 and L2 approaches for optimal location queries on spatial network databases is inaccurate and likely to return irrelevant results.

We also observed that the maximum influence value returned by the optimal network location query is 13% and 12% higher than the values returned by the optimal location queries under the L1 and L2 approaches, respectively, and would therefore identify larger numbers of customers for those interested in running these kinds of queries.

Figure 3.7: Non-overlapping case

Table 3.5: Average distance between the optimal network location and the optimal locations derived by the L1 and L2 approaches (the entire area measures 6.2 km × 9 km)
  Dataset | ⟨N, L1⟩ (meters)          | ⟨N, L2⟩ (meters)
  DS_1    | overlaps (< 20% coverage) | overlaps (< 20% coverage)
  DS_2    | 4,998                     | 5,305
  DS_3    | 4,995                     | 2,743
  DS_4    | 6,663                     | 6,396
  Average | 5,552                     | 4,814

Execution time: To evaluate the execution times of our proposed approaches, we ran two experiments. In the first, we considered a fixed site dataset and used object datasets of different sizes and spatial distributions. In the second, we fixed the object dataset and used various site datasets. Below, we describe each experiment in more detail.

Effect of the object dataset: For this experiment, we sub-sampled four subsets of objects from the base dataset, with sizes 366 (C1), 567 (C2), 1,049 (C3), and 1,533 (C4). We sub-sampled the objects with two different spatial distributions: uniform and skewed. To select each object, we randomly picked the X and Y indices of the grid cell corresponding to the object using a uniform or skewed distribution, respectively.
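The sub-sampling just described can be sketched as follows. The text does not specify the skew law, so the sketch assumes a clipped Gaussian centered on the grid as a purely illustrative choice; the seed and spread are likewise arbitrary.

using System;
using System.Collections.Generic;

static class Sampler
{
    static readonly Random Rng = new Random(42);

    static int Uniform(int cells) => Rng.Next(cells);

    static int Skewed(int cells)
    {
        // Box-Muller draw; concentrates picks near the middle of the grid.
        double u1 = 1.0 - Rng.NextDouble(), u2 = Rng.NextDouble();
        double z = Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);
        int cell = (int)(cells / 2.0 + z * cells / 8.0);
        return Math.Clamp(cell, 0, cells - 1);
    }

    // Draws n objects as (x, y) grid-cell indices, uniform or skewed.
    public static List<(int x, int y)> Sample(int n, int cells, bool skewed)
    {
        var result = new List<(int x, int y)>();
        for (int i = 0; i < n; i++)
            result.Add(skewed ? (Skewed(cells), Skewed(cells))
                              : (Uniform(cells), Uniform(cells)));
        return result;
    }
}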
For the fixed site dataset, we picked the set of Johnny Rockets restaurants, which has a small number of site points compared to the other site datasets summarized in Table 3.4. Thereafter, we applied the BONL-M and EONL approaches to the aforementioned datasets and computed the execution times (as we show later, BONL-M outperforms BONL-U, hence we excluded the latter variant from this experiment).

Figure 3.8: Execution times of the algorithms with a single site dataset and four skewed object datasets

Figure 3.8 depicts the results of this experiment. We observe that when the object dataset is small (C1) and its distribution is skewed, the execution time of BONL-M is higher than that of EONL. This is because the cost of computing the radii of the local bounds, O(|E|) + O(|S|(|N| log|N| + |E|)), is comparable to the cost of expanding the local networks, O(|O|(|N| log|N| + |E|)), when the number of object points |O| is low. However, with the larger object datasets (C2 to C4), the performance of BONL-M increasingly improves relative to EONL, because with a skewed object distribution the number of overlapping local bounds is significantly reduced. Therefore, the cost of overlap computation with BONL-M becomes less than the cost of local network expansion with EONL.

On the other hand, with uniformly distributed object points, EONL outperforms BONL-M on all object datasets (Figure 3.9). This is because with a uniform distribution of object points the number of overlapping local bounds is always high, which results in a higher cost of identifying overlap segments compared to the cost of local network expansion.

Figure 3.9: Execution times of the algorithms with a single site dataset and four uniformly distributed object datasets

Effect of the site dataset: For this experiment, we applied all three algorithms to the four site datasets of Table 3.4, and selected the uniformly distributed population data (all 9,662 points) as the fixed object dataset. Thereafter, we computed the execution times to compare their performance. Figure 3.10 shows that EONL has the highest performance, beating BONL-M and BONL-U by factors of 6 and 12 on average, respectively. The three algorithms behaved similarly on the small McDonald's, Hospitals, and Schools datasets. The Hospitals sites were more skewed, and this variability meant that the expansion and edge marking took longer in the parts of the graph with few hospitals (see Figure 3.10). Also, although the fast food dataset is large, the execution time of EONL remains low; this is because the complexity of EONL, O(|E| log|E|) + O(|O|(|N| log|N| + |E|)) + O(|E||O|²), is independent of the number of site points |S|.

Magnitude of local bounds: The radii of the local bounds improved by an average of 53% when using BONL-M in place of the BONL-U algorithm. Figure 3.11 shows how the radius of the local bounds was reduced by using BONL-M in place of BONL-U for each of the aforementioned datasets.
Figure 3.10: Execution times of the algorithms with uniformly distributed objects

Furthermore, we observed that the Hospitals dataset has the highest average local bound radii under both algorithms, because the skewed site distribution means that the expansion of the local network traverses a longer path before reaching the nearest site.

Figure 3.11: Average size of local bound radii (meters)

Landmark selection: For this experiment, we selected 100 landmarks and applied the BONL-U algorithm to four datasets. The results in Table 3.6 show that the weighted grid-based approach took more time for three of the four datasets, especially the sparser McDonald's and Hospitals datasets. This is because the weighted approach assigns more landmark points to the areas with more site points; hence, Dijkstra expansion takes more time in the areas with lower site density.

Table 3.6: Comparing the execution time of BONL-U with the two grid-based landmark selection techniques
                  | McDonald's | Hospitals | Schools  | Fast Foods
  BONL-U Uniform  | 10 min     | 21.5 min  | 10.5 min | 22 min
  BONL-U Weighted | > 1 hour   | > 1 hour  | 16.5 min | 31 min

Comparison with the FGP-OTF method [XYL11]: In this experiment, we compare the performance of the optimal location queries introduced in [XYL11] with our proposed approach. Of the several techniques presented in [XYL11], we focused on the FGP-OTF method, since it was reported as the most efficient in terms of execution time. We applied the FGP-OTF algorithm and the EONL algorithm to the four site datasets of Table 3.4, and selected the uniformly distributed population data with 9,662 points as the fixed object dataset. The FGP-OTF algorithm has a user-defined parameter with values in (0, 1]. For this experiment, we ran FGP-OTF with different values of this parameter in the range [0.0001, 1]; the value 0.001 resulted in the lowest computation cost, and the corresponding execution time is reported in Figure 3.12. Thereafter, we identified the optimal location derived by each approach to compare their accuracy. Both approaches reported the same set of segments as the optimal location. Then, we computed their execution times to compare their performance. Figure 3.12 shows that the EONL approach outperforms the FGP-OTF approach. Despite the fact that FGP-OTF avoids an exhaustive search over all edges of the network by partitioning the network into sub-graphs and pruning the edges of some sub-graphs, it still shows significant computational overhead on these datasets. In this experimental setup, as mentioned before, the weights of the objects are distributed nearly uniformly. While running FGP-OTF we observed that the upper bound values of the weights of the sub-graphs were relatively similar during the graph partitioning process. This yields fewer pruned sub-graphs and edges, respectively. As a result, on a large road network like the one used in this experiment (recall |E| = 871,715 and |N| = 375,691), the FGP-OTF algorithm incurred a high computation cost, taking hours for some datasets (Figure 3.12).

It is important to note that, according to [XYL11], FGP-OTF can find the optimal location in minutes on road networks of a similar order of size (e.g., |E| = 223,000 and |N| = 174,955). However, in that case, unlike our experimental dataset, the weight distribution of the objects is skewed (i.e., non-uniform).
As a result, the pruning algorithm of FGP-OTF effectively filters out a large number of sub-graphs and their corresponding edges, and efficiently identifies the optimal location. However, as discussed above, this efficiency is not guaranteed for all types of datasets.

Figure 3.12: Comparing the execution time of the EONL algorithm with the FGP-OTF algorithm [XYL11] in seconds

3.8 Conclusions

In this chapter, we proposed a set of scalable solutions for the problem of optimal location with objects and sites located on spatial networks. Accordingly, we proposed EONL and BONL as two complementary approaches for the efficient computation of optimal network location queries over datasets with different spatial distributions. In particular, we showed that avoiding network expansion with BONL is more effective when the given object dataset has a skewed spatial distribution, whereas EONL outperforms BONL with uniformly distributed objects. We verified and compared the performance of our proposed solutions with rigorous complexity analysis as well as extensive experimental evaluation using real-world data.

Chapter 4
Multi-Criteria Optimal Location Queries

4.1 Introduction

The "optimal location" problem is embedded in many applications of spatial decision support systems and marketing tools. In this problem, given a set S of sites and a set O of objects in a metric space, one must compute the optimal location where introducing a new site maximizes the number of objects that would choose the new site as their preferred site among all sites. For instance, a city planner must solve an optimal location problem to answer questions such as "where is the optimal location to open a new public library such that the number of patrons closest to the new library (i.e., the patrons that would perhaps prefer the new library to any other library) is maximized?"

An important limitation of the existing solutions for the optimal location problem stems from a common simplifying assumption that only one criterion, the metric distance between objects and sites, determines the preferred site for each object. In other words, the preferred site for an object is always assumed to be the closest site to the object. However, there are numerous real-world applications in which one needs to consider multiple criteria (perhaps including distance) to choose the most preferred site for each object. The extension of the optimal location problem that allows multiple criteria in selecting the preferred site for each object is termed multi-criteria optimal location (or MCOL, for short).

For an instance of the MCOL problem, consider the following market analysis application. In order to decide on the ideal specifications of its next product, a laptop manufacturer wants to identify the most preferred/desired combination of laptop specifications in the market. For example, the current most preferred combination of laptop specifications might be ⟨5 lb, 8 GB, 2.3 GHz, 14 in⟩, where the numbers stand for the weight, memory capacity, CPU speed, and display size of the laptop, respectively. One can formulate this problem as an MCOL problem, where each site represents an existing laptop product in the market with known specifications, and each object represents a buyer in the market with known preferences for the specifications of his/her desired laptop (the preferences of the buyers can be determined, for example, by compiling their web search queries).
In this case, the laptop specifications (i.e., weight, memory capacity, CPU speed, and display size) are the criteria that the objects (buyers) use to determine their preferred site (laptop). Accordingly, by solving this MCOL problem, the manufacturer can identify the specifications of a new laptop product (i.e., the new, optimally located site) such that the number of potential buyers is maximized. Similarly, a cell phone company can identify the features (e.g., the monthly voice service allowance in minutes, text service allowance in number of text messages, and data service allowance in GB) of a new cell phone plan that would attract the largest number of potential subscribers with different usage statistics.

While the MCOL problem is well studied in the operations research community, the existing solutions for this problem are not only approximate, without guaranteed error bounds, but also, more importantly, unscalable solutions that only work with very small site and object datasets. In this study, for the first time we focus on developing an efficient and exact solution for MCOL that can scale to large datasets containing thousands of sites and objects.

Toward that end, we first formalize the MCOL problem as a maximal reverse skyline query (MaxRSKY). Given a set of sites and a set of objects in a d-dimensional space, a MaxRSKY query returns a location in the d-dimensional space where, if a new site s is introduced, the size of the (bichromatic) reverse skyline set of s is maximized. To the best of our knowledge, this study is the first to define and study MaxRSKY queries.

Second, we develop a baseline solution for MaxRSKY which derives an answer for the query by: (1) computing the skyline set and the corresponding skyline region for every object (the skyline region of an object is the region where, if a new site is introduced, it becomes a skyline site for the object); and (2) for each subset of the set of skyline regions computed for all objects, overlapping all combinations of regions to identify the maximum overlap region (i.e., the region where the largest number of skyline regions in the subset intersect). One can observe that among all maximum overlap regions identified over all subsets of skyline regions, the one with the largest number of overlapping regions is where, if a new site is introduced, its reverse skyline set is maximized. We call this region the maximal overlap region. Our baseline solution illustrates the intrinsic computational complexity of the MaxRSKY query, and shows that the dominating cost of computing an answer for MaxRSKY is due to the second step, i.e., the maximal overlap region computation.

Accordingly, in order to reduce the cost of overlap computation, we propose a filter-based solution that effectively reduces the search space for the maximal overlap region computation. Our solution achieves efficiency by: (1) prioritizing the maximum overlap computation across the subsets by considering each subset's potential for including the maximal overlap region; and (2) avoiding redundant maximum overlap computation for the subsets that cannot possibly include the maximal overlap region. While our proposed solution significantly improves the efficiency of the MaxRSKY computation, we observe that under certain circumstances, depending on the dataset characteristics, it can lose effectiveness in filtering the search space and hence perform less efficiently.
Therefore, to address this issue we further extend this solution and propose an enhanced solution that uses a grid (rather than the skyline regions themselves) for subset prioritization. Consequently, our enhanced solution offers data independence. Our extensive empirical analysis with both real-world and synthetic datasets shows that our enhanced solution is invariably efficient in computing answers for MaxRSKY queries with large datasets containing thousands of sites and objects.

The remainder of this chapter is organized as follows. Section 4.2 reviews the related work. Section 4.3 formally defines the MCOL problem and formalizes it as a MaxRSKY query. In Sections 4.5 and 4.6, we present our solutions for efficient computation of the MaxRSKY query. Section 4.7 evaluates our proposed solutions via experiments, and Section 4.8 concludes this chapter.

4.2 Related Work

In this section, we review the related work in two main categories. First, we discuss previous work on the problem of optimal location. Thereafter, we review the skyline query processing literature.

4.2.1 Optimal Location

Among the variations of the optimal location problem, multi-criteria optimal location (MCOL), a.k.a. multi-objective or multi-attribute optimal location, has been widely studied by researchers in the operations research (OR) community (e.g., Farahani et al. [FSA10] and [FH11], Hekmatfar et al. [HS09], Larichev et al. [LO01], Cohon et al. [Coh78], Szidarovszky et al. [SGD86], and Hwang et al. [HM79]). However, given the computational complexity of the MCOL problem, most of the existing solutions: (1) resort to heuristics that can only approximate the optimal location without guaranteed error bounds; and, more importantly, (2) fail to scale to real-world datasets that often consist of thousands of sites and objects (rather than the tens of sites and objects usually assumed by the existing solutions).

The database community, on the other hand, has shown interest in developing efficient and exact solutions for these problems. However, so far all of the proposed solutions have focused on the basic (single-criterion) optimal location problem. In particular, Wong et al. [WOY+09] and Du et al. [DZX05] formalized the basic optimal location problem as a maximal reverse nearest neighbor (MaxRNN) query, and presented two scalable approaches to solve the problem in p-norm space (assuming the L2-norm and L1-norm, respectively). Thereafter, Ghaemi et al. [GSWBK10] and Xiao et al. [XYL11] continued these studies and proposed solutions for MaxRNN assuming network distance. Finally, Zhou et al. [ZWL+11] presented an efficient solution to the extended MaxRkNN problem, which computes the optimal location where introducing a new site maximizes the number of objects that consider the new site one of their k nearest sites. To the best of our knowledge, we are the first to tackle the MCOL problem by developing an efficient and exact solution that can scale to large datasets containing thousands of sites and objects.

4.2.2 Skyline Queries

Skyline queries were first studied theoretically as maximal vectors [KLP75, BKST78, BCL90]. However, it was Borzsonyi et al. [BKS01] who first introduced the concept to the database community and showed the need for scalable solutions to process skyline queries on large datasets.
Since then, numerous efficient algorithms have been proposed for processing static and dynamic skyline queries, such as BNL [BKS01], D&C [BKS01], Bitmap [TEO01], SFS [CGGL03], Index [TEO01], NN [KRR02], and BBS [PFCS05]. Moreover, several variations of skyline queries have been proposed and studied, among which the reverse skyline query is the most relevant to this chapter. The reverse skyline of a query object q returns the objects whose dynamic skyline contains q. Dellis et al. [DS07] first introduced reverse skyline queries in a monochromatic context (involving a single dataset). Lian et al. [LC08] extended the definition of reverse skyline to a bichromatic scenario, and proposed an algorithm for efficient computation of reverse skylines on uncertain data. Later, Wu et al. [WTW+09] further studied bichromatic reverse skyline queries and proposed the most efficient query processing solution known so far for certain datasets. However, it is important to note that the reverse skyline query and the maximal reverse skyline query (MaxRSKY) are two orthogonal problems. For our focus problem (i.e., MaxRSKY) we can leverage any efficient solution for reverse skyline computation; as we show in Section 4.4, our main challenge is to identify a location which is in the reverse skyline set of a maximal number of objects.

4.3 Problem Definition

In this section, we first formally define the problem of multi-criteria optimal location (MCOL). Then, we formalize this problem as a maximal reverse skyline query (MaxRSKY).

4.3.1 Multi-Criteria Optimal Location (MCOL)

Suppose we have a set S of sites s(s^1, s^2, ..., s^d), where s^i is the value of the i-th attribute of site s, as well as a set O of objects o(o^1, o^2, ..., o^d) in the same d-dimensional space, where o^i indicates the preference of o on the i-th attribute. For example, considering our laptop market analysis application from Section 4.1, each laptop is a site with four attributes, namely weight, memory capacity, CPU speed, and display size. Similarly, each potential buyer is represented by an object with four preferences corresponding to the four aforementioned attributes. Figure 4.1 illustrates six sites/laptops s_1 to s_6, each characterized by two attributes, weight and memory capacity (for simplicity of presentation, we hereafter consider a 2-dimensional space without loss of generality). In the same figure, three objects/buyers o_1 to o_3 are shown by indicating their preferences for the weight and memory capacity of laptops in the same 2-dimensional space.

Figure 4.1: Example site and object datasets in 2-dimensional space

Accordingly, we define the MCOL problem as follows. Given a set S of sites with d attributes and a set O of objects with d preferences corresponding to the same attributes, the multi-criteria optimal location problem seeks a location in the d-dimensional space such that introducing a new site at this location maximizes the number of objects that each consider the new site among their set of "preferred sites". A site s is a preferred site for object o if, given the preferences of o, there is no other site s′ in S that is more "preferred" by o as compared to s; in turn, intuitively, for an object o we say a site s is more preferred than a site s′ if, considering its preferences collectively, o has no reason to choose s′ over s.
For example, in Figure 4.1 the set of preferred sites for the objecto 1 isfs 2 ;s 3 g; note that while foro,s 2 ands 3 are not preferred over each other, they both are preferred as compared to all other sitess 1 ,s 4 ,s 5 , ands 6 . 4.3.2 Maximal Reverse Skyline (MaxRSKY) In this section, we first review the formal definitions of dynamic skyline and bichromatic reverse skyline query. Thereafter, we define the maximal reverse skyline (MaxRSKY) query, which is equivalent to and formalizes the MCOL problem. DEFINITION 1 (DYNAMIC SKYLINE QUERY): Given a set S of sites with d attributes and a query object o in the same d-dimensional space, the dynamic skyline query with respect too, termed DSL(o), returns all sites inS that are not “dominated” by other sites with respect to o. We say a site s 1 2 S dominates a site s 2 2 S with respect too if (1) for all 1 i d,js i 1 q i jjs i 2 q i j; and (2) there exists at least onej (1jd) such that s j 1 q j < s j 2 q j For example, as shown in Figure 4.2, the skyline set for the objecto 1 isDSL(o 1 ) = fs 2 ;s 3 g. Note thats 0 2 ands 0 6 are transformed proxies of the sitess 2 ands 6 with respect too 1 , respectively. DEFINITION 2 (BICHROMATIC REVERSE SKYLINE QUERY): LetS andO be the sets of sites and objects in ad-dimensional space, respectively. Given a query site s2S, the bichromatic reverse skyline query with respect tos returns all objectso2O such thats is in the dynamic skyline set ofo, i.e.,s2DSL(o) For instance, in Figure 4.2 the reverse skyline set ofs 2 isfo 1 g, becauseDSL(o 1 ) = fs 2 ;s 3 g,DSL(o 2 ) =fs 1 ;s 4 ;s 5 g,DSL(o 3 ) =fs 6 g, and therefore,s 2 only belongs to DSL(o 1 ). 51 Figure 4.2: Skyline and reverse skyline query examples along with corresponding SSR and DR regions DEFINITION 3 (MAXIMAL REVERSE SKYLINE QUERY (MaxRSKY)): LetS and O be the sets of sites and objects in a d-dimensional space, respectively. The MaxRSKY query returns a location in this d-dimensional space where if a new site s is introduced, the size of the (bichromatic) reverse skyline set ofs is maximal It is easy to observe that the MaxRSKY query and MCOL problem are equivalent, because maximizing the reverse skyline set of the newly introduced sites equivalently maximizes the number of objects whose sets of preferred sites includes. 4.4 Baseline Solution Central to the solution for maximizing the reverse skyline is the concept of Skyline Search Region (SSR) and Dominance Region (DR) (introduced by Papadias[PZM03]). 52 Figure 4.3: An illustration of the Skyline Search region (SSR) and the Dominance Region (DR) The skyline search region (SSR) is part of the data space that contains the points domi- nating some skyline points. Accordingly, we define the dominance region (DR) as part of the space that contains the points dominated by at least one skyline point. Consider for instance the example in Figure 4.3 with skyline pointsfs 1 , s 5 , s 6 , s 7 g. The SSR of object o is the shaded area bounded by the skyline points and the two axes. The complementary region is the dominance region of objecto, DR(o). Note that the SSR region does not include the skyline points since a skyline point does not dominate itself. In addition, it is important to mention that the SSR region resembles a convex polygon withn vertices which at most hasjSj vertices wherejSj represents the number of site points. LEMMA1. (Proved in [DS07]; see Lemma 8) For a given object pointo, letDSL(o) be the set of dynamic skyline ofo. Letq be a query point. 
4.4 Baseline Solution

Central to the solution for maximizing the reverse skyline are the concepts of the Skyline Search Region (SSR) and the Dominance Region (DR), introduced by Papadias et al. [PZM03].

Figure 4.3: An illustration of the Skyline Search Region (SSR) and the Dominance Region (DR)

The skyline search region (SSR) is the part of the data space that contains the points dominating some skyline point. Accordingly, we define the dominance region (DR) as the part of the space that contains the points dominated by at least one skyline point. Consider, for instance, the example in Figure 4.3 with skyline points {s_1, s_5, s_6, s_7}. The SSR of object o is the shaded area bounded by the skyline points and the two axes. The complementary region is the dominance region of object o, DR(o). Note that the SSR does not include the skyline points themselves, since a skyline point does not dominate itself. In addition, it is important to mention that the SSR resembles a convex polygon with n vertices, which has at most |S| vertices, where |S| is the number of site points.

LEMMA 1. (Proved in [DS07]; see Lemma 8.) For a given object point o, let DSL(o) be the dynamic skyline set of o, and let q be a query point. If the query point q dominates a point p from DSL(o), then object point o is a reverse skyline of q.

The intuition behind the above lemma is that when we want to determine whether a point q is in the reverse skyline set of object point o, we only need to examine the region to which point q belongs. If the query point q is inside SSR(o), then point o is a reverse skyline of point q; if q falls in DR(o), then o is not a reverse skyline of q. For instance, in Figure 4.3 point q_1 is located inside the SSR(o) region; thus, point o is in the reverse skyline set of q_1, because q_1 dominates some of the skyline points of o (e.g., s_5). On the contrary, point q_2 falls inside DR(o) and is dominated by some point of the dynamic skyline of o, such as s_6; therefore, point o is not a reverse skyline of q_2.

Based on this observation, we propose our baseline approach for maximizing reverse skyline queries. Given a set S of sites and a set O of objects, we implement our baseline approach as a two-phase process as follows:

1. We compute the dynamic skyline set DSL(o) ⊆ S of sites for each object o. Then, we construct the corresponding SSR regions, SSR(o). This step produces |O| regions.

2. We intersect the SSR regions of all object points retrieved in the previous step. Considering |O| regions, this step involves overlapping 2^|O| combinations of SSR regions. The region with the maximum number of overlapping SSR regions is the place to introduce a new site point s that maximizes the number of object points in the reverse skyline set of s.

However, the proposed baseline approach suffers from two computational complexity problems:

1. Given the computational complexity of computing a skyline query on the one hand, and the large size of the object and site datasets on the other, computing the SSR regions of all object points is costly.

2. Computing the overlap among the SSR regions of all object points requires exponential time. Therefore, it can be computationally prohibitive when there is a large number of object points in the dataset.

Based on our experimental results (to be discussed in Section 4.7), the cost of computing skyline points and constructing SSR regions is small compared to the cost of computing the overlap among SSR regions. Therefore, in this study we focus on reducing the computational complexity of computing the overlap among SSR regions and identifying the maximum number of overlaps. In line with this goal, we propose two approaches, termed Basic-Filtering and Grid-based-Filtering, for answering MaxRSKY queries.

4.5 Basic-Filtering Approach

As mentioned before, the Basic-Filtering approach consists of two main phases: data structure computation and query processing. In the following, we describe each component in detail.

4.5.1 Data Structure Computation

4.5.1.1 Skyline Computation and SSR Construction

In the data structure computation phase, we first compute the skyline sets of all object points and build their corresponding SSR regions. Accordingly, for each object o ∈ O, we compute DSL(o) ⊆ S. Each object o partitions the d-dimensional space into 2^d orthants, each identified by a number in the range [0, 2^d − 1]. For example, in Figure 4.4(a), where d = 2, o_1 partitions the space into four orthants (quadrants 0, ..., 3). For simplicity, we hereafter use 2D examples, since our solution generalizes to any dimensionality in a straightforward manner.
Since all orthants are symmetric and we are interested in the absolute distances to the site points, we can transform all site points into orthant 0 and compute the skyline points in orthant 0. As illustrated in Figure 4.4(a), in order to compute DSL(o_1), sites s_2 and s_6 are transformed into orthant 0 (s′_2, s′_6); DSL(o_1) includes {s′_2, s_3}. Thereafter, based on the derived skyline points, we build the SSR region in orthant 0 and, respectively, the symmetric SSR regions in the other orthants. In Figure 4.4(a) the hatched area shows SSR(o_1) in orthant 0 and the shaded area shows the SSR region across all four quadrants. Accordingly, the shaded regions in Figures 4.4(b) and (c) show the SSR regions of objects o_2 and o_3, respectively. As we can see in Figures 4.4(b) and (c), the SSR regions in the four quadrants are not necessarily symmetric, since they may be bounded by the two axes. Figure 4.4(d) shows the three SSR regions of o_1, o_2, and o_3 in a single view. The time complexity of computing the skyline sets and constructing the SSR regions is O(|O||S|²) and O(|O|), respectively; thus, the overall running time of this step is O(|O||S|²).

Figure 4.4: Example SSR regions. (a) Dynamic skyline of o_1 and its corresponding SSR region; (b) dynamic skyline of o_2 and its corresponding SSR region; (c) dynamic skyline of o_3 and its corresponding SSR region; (d) the three SSR regions illustrated in a single view.

4.5.1.2 Overlapping Data Structure

Assuming we have n SSR regions (and therefore n polygons) from the previous step, one would have to compute the overlap between 2^n combinations of polygons. In this case, if (for example) one of the computational geometry techniques proposed by Dobkin et al. [DK83] is used for the geometric intersection of n convex polytopes in a d-dimensional space, the total computational complexity would be on the order of O(2^n)·O(n log^(d−1) n + k log^(d−1) m), where k is the number of intersecting pairs and m is the maximum number of vertices of any polytope. Obviously, this approach does not scale to a large number of polygons. Instead, the main idea behind our Basic-Filtering approach is to precompute, for each SSR region, the likelihood that it contains the maximum number of overlaps, and to maintain a ranking of the regions by this likelihood from high to low. In particular, we compute the optimality likelihood of each SSR region as a "score" that reflects the total number of SSRs overlapping that region. Obviously, the higher the score of an SSR region, the greater the chance of finding an optimal location within (or at least partly within) that region. Motivated by this idea, for each SSR region r we find the list of SSR regions that overlap r, denoted L(r). Accordingly, the total number of regions listed in L(r) is recorded as the score of region r, denoted Sc(r). Table 4.1 shows the lists of overlapping SSRs for the example of Figure 4.4(d). Table 4.1 is called an overlap table (OT). Each row of the overlap table is called an entry, of the form (r, L(r), Sc(r)), or briefly (r, L, Sc). It is important to note that regions SSR_1 and SSR_3 do not overlap each other, since they share only intersection points on their boundaries, which are not included in the SSR regions.

Table 4.1: Overlap Table (OT)
  r     | L(r)           | Sc(r)
  SSR_1 | {SSR_2}        | 1
  SSR_2 | {SSR_1, SSR_3} | 2
  SSR_3 | {SSR_2}        | 1

Figure 4.5 shows the procedure we follow to populate the data structure:

1: For each o ∈ O
2:   Compute DSL(o)
3:   Build SSR(o)
4: Construct overlap table; (r, L(r), Sc(r))
5: Sort overlap table based on Sc(r)
Figure 4.5: The data structure computation procedure (Basic-Filtering approach)

Below, we explain how we implement this procedure in three steps:
1. Computing dynamic skyline sets and constructing SSR regions: As mentioned before, for each object o we first compute the dynamic skyline set of o, DSL(o). Then, we construct the corresponding SSR region, SSR(o), which is bounded by the derived skyline points and the two axes. In order to support the overlapping of SSR regions, we build an R-tree over all SSRs created in this step. The time complexity of computing the skyline sets and constructing the SSR regions is O(|O||S|²) and O(|O|), respectively.

2. Computing pair-wise overlapping SSRs and populating the OT: Once the DSLs and SSRs are generated, we populate the overlap table entries with the values described next. For each region entry r, we perform a range query to find all SSRs that overlap r. Then, we compute and store the region score Sc(r) (as described above). The OT represents our region-optimality-likelihood list, to be used in the computation of the optimal location (described next and presented in Figure 4.6). Let τ(N) be the running time of a range query over a dataset of size N. Since there are |O| SSRs, this step requires O(|O|)·τ(|O|). In the literature, τ(N) is theoretically bounded: let k be the greatest result size of a range query (i.e., the greatest number of SSRs that overlap a given SSR); since a range query can be executed in O(k + log|O|) time [Cha86], this step can be performed in O(|O|(k + log|O|)).

3. Sorting the overlap table: Finally, we sort all entries of the OT in descending order of Sc(r) to identify the regions with a higher likelihood of optimality. This step requires O(|O| log|O|).

The overall running time of the data structure computation procedure is O(|O||S|²) + O(|O|(k + log|O|)) + O(|O| log|O|).
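The data structure computation can be sketched as follows in C#. For self-containment, the sketch replaces the R-tree range query of step 2 with a brute-force pair-wise test supplied as a delegate; the OtEntry type and all names are illustrative.

using System;
using System.Collections.Generic;
using System.Linq;

class OtEntry<T>
{
    public T Region;          // r : an SSR region
    public List<T> Overlaps;  // L(r)
    public int Score;         // Sc(r) = |L(r)|
}

static class OverlapTable
{
    // Builds the OT sorted by Sc(r) descending. The overlaps delegate stands
    // in for the geometric SSR-intersection test (or an R-tree range query).
    public static List<OtEntry<T>> Build<T>(List<T> ssrs, Func<T, T, bool> overlaps)
    {
        var table = new List<OtEntry<T>>();
        for (int i = 0; i < ssrs.Count; i++)
        {
            var l = new List<T>();
            for (int j = 0; j < ssrs.Count; j++)
                if (i != j && overlaps(ssrs[i], ssrs[j]))
                    l.Add(ssrs[j]); // pair-wise overlap only, hence an over-estimate
            table.Add(new OtEntry<T> { Region = ssrs[i], Overlaps = l, Score = l.Count });
        }
        // Regions most likely to contain the optimal location come first.
        return table.OrderByDescending(e => e.Score).ToList();
    }
}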
4.5.2 Query Processing

In this section we discuss how the MaxRSKY query is answered using the precomputed data structure. During query processing, we use the information recorded in the overlap table in the first phase, namely the score of each SSR region, which equals the total number of regions overlapping that region in a pair-wise relation. This score is an over-estimate of the actual number of overlapping regions; one should observe that a higher score for a region indicates a higher potential of containing an optimal location. Through a refinement process, we then sort the regions by score in descending order and, starting from the regions with the highest scores, use an efficient technique to compute the actual set of overlapping regions for each entry of the overlap table. It is important to note that during this refinement we only have to compute the actual overlap(s) of an entry if the score of its region exceeds the influence value of the actual overlaps computed thus far. In our experiments, we observe that Basic-Filtering computes the actual overlaps of only a limited subset of regions before it identifies the optimal location; hence, it filters the search space effectively for better efficiency.

Figure 4.6 shows the procedure we follow to answer a MaxRSKY query:

1: Initialize S_o to ∅ and I_o to 0
2: For each SSR region r of the OT
3:   If Sc(r) ≥ I_o
4:     For each SSR region r′ ∈ L(r)
5:       Compute the intersection points between regions r and r′
6:       For each intersection point q found
7:         Perform a point query from q to find all SSRs covering q
8:         Let S be the result of the above point query
9:         I_s = |S|
10:        If I_s > I_o
11:          Update I_o = I_s and S_o = S
12: Return the optimal solution set S_o and I_o
Figure 4.6: MaxRSKY query processing (Basic-Filtering approach)

Below, we explain how we implement this computation in three steps:

Step 1 (Initializing the optimal result set): Suppose S_o is the set of SSRs whose intersection corresponds to the optimal region returned by a MaxRSKY query, and assume S_o has an optimal influence value, denoted I_o, which represents the number of regions belonging to the set S_o. In this step, we initialize S_o to the empty set and I_o to zero. The initialization can be accomplished in O(1).

Step 2 (Identifying overlap regions for each OT entry): For each entry (r, L, Sc) of the OT, we identify the optimal overlap regions by performing the following sub-steps:

1. We compare all pairs of SSRs in L and check whether each pair overlaps. If so, we compute the set Q of all intersection points between any two overlapping SSRs in L. Each pair check can be performed in O(|S| log|S|): according to [Mou04], the intersection of two simple polygons with n and m vertices can be detected in O((n + m) log(n + m)) time. In addition, given two simple polygons with n and m vertices, there are O(nm) intersection points between the edges of the two polygons, which can be computed in O(nm) time [Mou04]. Therefore, computing the intersection points of two SSR regions requires O(|S|²) time. Since there are at most O(|O|²) pairs in each entry of the OT, the total running time of this sub-step is O(|O|²(|S| log|S| + |S|²)).

2. For each point q ∈ Q, we perform a point query from q to find the set S of SSRs covering q. Accordingly, we compute the influence value of S by counting the number of regions belonging to S. We update S_o and I_o if the influence value I_s of S is larger than the current I_o. Let τ(N) be the running time of a point query over a dataset of size N. Since there are at most O(|O|²|S|²) intersection points per entry of the OT, processing them requires O(|O|²|S|²·τ(|O|)); across entries this sub-step takes O(k|O|²|S|²·τ(|O|)), where k is the greatest number of SSRs overlapping an SSR (i.e., the greatest size of L in an entry (r, L) of the OT). With the techniques described in [Cha86], τ(|O|) = O(k + log|O|), and thus the running time of this sub-step is O(k|O|²|S|²(k + log|O|)).

Although the aforementioned sub-steps find an optimal solution, they are inefficient because they process all possible pairs of overlapping SSRs. In fact, some entries of the OT need not be considered and processed if there exists another entry whose intersection has a larger influence value (see line 3 in Figure 4.6). This technique is called influence-based pruning [WOY+09]; it prunes large numbers of candidate pairs with a low likelihood of containing the optimal result, and significantly improves the efficiency of the MaxRSKY computation. Considering that there are at most |O| entries in the OT, the total running time of Step 2 is O(|O|³(|S| log|S| + |S|²) + k|O|³|S|²(k + log|O|)).

Step 3 (Finding the maximum influence value): Once the computation terminates, S_o contains the set of optimal overlap region(s), with the largest value of I_s as our final influence value I_o.
As mentioned before, the complexities of data structure computation and query processing are O(|O||S|²) + O(|O|(k + log|O|)) + O(|O| log|O|) and O(|O|³(|S| log|S| + |S|²) + k|O|³|S|²(k + log|O|)), respectively. Therefore, the cost of data structure computation is negligible compared to that of query processing, and the overall running time of the MaxRSKY computation is O(|O|³(|S| log|S| + |S|²) + k|O|³|S|²(k + log|O|)).

4.6 Grid-based-Filtering Approach

As mentioned before, the Basic-Filtering approach provides an efficient solution for MaxRSKY queries. However, our experimental results in Section 4.7.2 show that with a large dataset this approach suffers from two main drawbacks that affect its performance:

1. For each entry of the OT, there is a large number of pair-wise overlapping regions, which results in an over-estimate of the score of each entry.

2. Due to the over-estimated scores, the influence-based pruning method is much less effective at filtering out the entries with a low likelihood of containing the optimal region. Consequently, a large number of OT entries are processed during the computation.

Both drawbacks arise during the MaxRSKY computation because the entities involved in the pair-wise overlap computation are regions with large geometric extents. Therefore, many entities have a large set of pair-wise overlapping regions, most of which have no impact on identifying the optimal location. Figure 4.7(a) illustrates this effect: the list of regions pair-wise overlapping SSR_1 is {(SSR_1, SSR_2), (SSR_1, SSR_3), (SSR_1, SSR_4)}, whereas the actual overlapping set is {(SSR_1, SSR_2, SSR_4)}. Accordingly, the pair (SSR_1, SSR_3) has no impact on identifying the optimal location.

Figure 4.7: An illustration of the Grid-based-Filtering approach. (a) Overlapping of four SSRs; (b) imposing a grid on SSRs.

To avoid this problem in the MaxRSKY computation, we propose the Grid-based-Filtering approach. With this approach, we impose a grid over the SSR regions and spatially subdivide them into a regular grid of squares (or, in general, hypercubes). By imposing the grid, the SSRs are decomposed into a set of smaller entities (grid cells), and these entities become the unit of overlap. Therefore, in the OT, entries are based on grid cells and have the form (c, L(c), Sc(c)), where c is a grid cell, L(c) is the list of SSR regions partially or fully covering cell c, and Sc(c) is the total number of SSR regions overlapping cell c. Given the finer resolution of the overlap units in Grid-based-Filtering, the elements listed in L(c) are closer to the actual overlapping set than in Basic-Filtering. As a result, the pair-wise overlaps that have no impact on identifying the optimal result are eliminated from the computation. For instance, in Figure 4.7(b), for the three grid cells 1, 2, and 3, the corresponding lists L(1), L(2), and L(3) are {SSR_1, SSR_2, SSR_4}, {SSR_1}, and {SSR_1, SSR_3}, respectively, and the corresponding score values Sc(c) are 3, 1, and 2, which equal the actual values. Given that the lists L(c) provide a more accurate view of the actual overlaps and that the Sc(c) values are close to the actual ones, both the overlap computation and the influence-based pruning are performed far more efficiently.
These effects (the more accurate L(c) lists and the tighter Sc(c) scores) significantly improve the performance of Grid-based-Filtering (results to be discussed in Section 4.7.2).

In terms of implementation, the Grid-based-Filtering approach consists of two main phases, Data Structure Computation and Query Processing, similar to the Basic-Filtering approach. However, both components differ in some respects, which we describe next.

Data Structure Computation: In the data structure computation phase, similar to the Basic-Filtering approach, we first compute the DSL(o) and SSR(o) of each object point o (Lines 1-3 in Figure 4.8). Then, we impose a regular grid of squares, G, over the SSRs (Line 4). We use the term |G| to represent the total number of grid cells in G. The side length (cell size) is typically chosen so that either there are not many empty cells to traverse, or the expected number of regions overlapping each cell is bounded. For each grid cell c, its list of overlapping SSRs and the corresponding score value are computed and stored in OT. Thereafter, OT is sorted based on the Sc(c) values.

Similar to Basic-Filtering, the time complexity of computing the skyline sets and constructing the SSR regions is O(|O||S|²) and O(|O|), respectively (Lines 1-3, Figure 4.8). While constructing each SSR region r, we identify those grid cells which are covered partially or fully by region r. Once all SSRs are constructed, we have a list of overlapping regions for each cell; therefore, there is no need for a range query to compute all pairwise overlapping regions for a given cell. Since there are |G| grid cells, imposing the grid (Line 4) and constructing OT require O(|G|) time (Line 5). Accordingly, sorting OT takes O(|G| log|G|). Therefore, the total running time of data structure computation in the Grid-based-Filtering approach is O(|O||S|²) + O(|G| log|G|).

Query Processing: In this component, all steps of the MaxRSKY computation are the same as described for Basic-Filtering (Section 4.5.2). However, as we mentioned before, the unit of overlap is the precomputed grid cells stored in OT, whereas in Basic-Filtering the unit of overlap is the SSR regions. Below, we describe the time complexity of query processing in the Grid-based-Filtering approach according to the three steps discussed in Section 4.5.2:

Step 1 (Initializing the optimal result set): The initialization can be done in O(1).

1: For each o ∈ O
2:   Compute DSL(o)
3:   Build SSR(o)
4: Impose a grid on the SSRs
5: Construct the overlap table (c, L(c), Sc(c))
6: Sort the overlap table based on Sc(c)

Figure 4.8: The data structure computation procedure (Grid-based-Filtering approach)

Step 2 (Identifying overlap regions for each OT entry):

1. For each entry of OT (grid cell c), we compute a set Q of all intersection points between cell c and each overlapping region listed in L(c). Given a square (grid cell) and a convex polygon (SSR region), there would be four intersection points between the edges of the two shapes, which can be computed in O(4) time. Therefore, computing the intersection points of a grid cell and an SSR region requires O(4) time. Since there are at most O(|O|) overlapping regions for each entry of OT, the total running time of this sub-step is O(4|O|).

2. For each point q ∈ Q, we perform a point query for q to find the set S of SSRs covering q. Accordingly, we compute the influence value of S by counting the number of SSRs belonging to S. We update S_o and I_o if the influence value I_s of S is larger than the current I_o.
Since there are at most O(4) intersection points for one pair and the cost of a point query is O(k + log|O|), the running time of this sub-step is O(4k(k + log|O|)) (recall that k is the greatest number of SSRs overlapping with a given cell). Considering that there are at most |G| entries in OT, the total running time of Step 2 is O(4|O||G| + 4k|G|(k + log|O|)).

Step 3 (Finding the maximum influence value): This step can be done in O(1).

The time complexity of query processing in Grid-based-Filtering is O(4|O||G| + 4k|G|(k + log|O|)). Considering the running time of data structure computation, the total cost of the MaxRSKY computation using Grid-based-Filtering is O(|O||S|²) + O(4|O||G| + 4k|G|(k + log|O|)).

4.7 Experimental Evaluation

We next describe the setup we used for the experiments and then present and discuss the results.

1: Initialize S_o to ∅ and I_o to 0
2: For each grid cell c of the OT table
3:   If Sc(c) ≥ I_o
4:     For each SSR region r′ ∈ L(c)
5:       Compute the intersection points between grid cell c and region r′
6:       For each intersection point q found
7:         Perform a point query for q to find all SSR regions covering q
8:         Let S be the result of the above point query
9:         I_s = |S|
10:        If I_s > I_o
11:          Update I_o = I_s and S_o = S
12: Return the optimal solution set S_o and I_o

Figure 4.9: MaxRSKY query processing (Grid-based-Filtering approach)

4.7.1 Experimental Setup

All experiments were performed on an Intel Core 2.2GHz machine with 4 GB of RAM, running Windows 7 and the .NET platform 3.5. The algorithms are implemented in Microsoft C#. We use both real-world and synthetic datasets for objects and sites.

Real-world dataset: We used a real-world dataset, namely CarDB, in our experiments. The used-car database CarDB is a 6-dimensional dataset with attributes referring to Make, Model, Year, Price, Mileage and Location. This dataset contains 19,000 tuples extracted from Yahoo! Autos (autos.yahoo.com). The two numerical attributes Price and Mileage are considered in our experiments.

Synthetic dataset: We synthesized extensive datasets, varying two parameters: distribution and cardinality. We deployed datasets of three distributions typical in the skyline literature ([BKS01]): independent, correlated and anti-correlated. In an independent dataset, every point is randomly distributed in the dataspace. In a correlated dataspace, if a point has a low value on one dimension, it very likely also has a small value on the other dimensions. Conversely, in an anti-correlated dataset, if a point has a low value on one dimension, it tends to have a large value on the other dimensions. The dataspace is normalized to the range [1, 100] on every dimension. On this dataspace, we imposed a 250 × 250 grid with a cell size of 1 × 1, which yields 62,500 grid cells. All datasets are two-dimensional and their cardinality varies from 500 to 10K across the different experiments, which are described next.

4.7.2 Experimental Results

Below we present the results of the three series of experiments that we ran on the aforementioned datasets.

4.7.2.1 Feasibility Study

For this experiment we used the real-world dataset (Yahoo! Autos) and extracted five series of data with a fixed-size site dataset (1,000 points) and varying object cardinality in the range [1,000, ..., 9,000]. Then, we applied both the Basic-Filtering and Grid-based-Filtering approaches to the aforementioned dataset and computed their execution times. For the Grid-based-Filtering approach we imposed a grid of 250 × 250 grid cells, each cell with a resolution of 2000 × 10.
As illustrated in Figure 4.10, Grid-based-Filtering outperforms Basic-Filtering by a factor of 22 on average. The time complexity of the two approaches (discussed in Sections 4.5 and 4.6) confirms this observation. The time complexities of Grid-based-Filtering and Basic-Filtering are O(|O||S|²) + O(4|O||G| + 4k|G|(k + log|O|)) and O(|O|³(|S| log|S| + |S|²) + k|O|³|S|²(k + log|O|)), respectively. Therefore, the running time of Grid-based-Filtering grows linearly as the number of objects increases, whereas the running time of Basic-Filtering is polynomially (cubically) proportional to the number of object points. This observation confirms that the Basic-Filtering approach is not suitable for efficient computation of MaxRSKY with large datasets.

Figure 4.10: Comparing the execution times of the Grid-based-Filtering and Basic-Filtering approaches with the real dataset

One may observe that the execution time of the Grid-based-Filtering approach grows faster for the last series of data despite its linear relation to the cardinality of objects. This is because the execution time of Grid-based-Filtering is proportional not only to |O| but also to k. As we mentioned before, k represents the greatest number of SSRs overlapping with a grid cell; thus, the value of this factor also increases with the size of the object dataset.

It is important to note that in this experiment an "Out of Memory" condition occurred for the last series of data while computing the execution time of Basic-Filtering. This is because the larger the number of objects, the more SSRs are constructed; hence, the size of L and the number of intersection points become larger, which resulted in memory shortages during the MaxRSKY computation.

Table 4.2: Comparing the cost of skyline computation/building SSRs with the cost of computing overlaps among SSRs

Sites   Objects   C1 (min)   C2 (min)
1000    2000      0.08       10.36
2000    4000      0.35       78.2
3000    6000      0.82       246.2
4000    8000      1.52       410.13
5000    10000     2.49       1002.6

4.7.2.2 Comparing the cost of skyline computation/building SSRs with the cost of computing overlaps among SSRs

For this experiment, we used a synthetic dataset consisting of five series of 2-dimensional site and object points with an anti-correlated distribution. The cardinality of site points in these five series is 1,000, 2,000, 3,000, 4,000 and 5,000, respectively, and the number of corresponding objects is double that. We applied the Basic-Filtering approach to the aforementioned dataset and separately computed the execution time of skyline computation/building SSRs (C1) and the execution time of computing the overlaps among SSRs/identifying the optimal location (C2). Table 4.2 presents the results of this experiment. As shown in Table 4.2, the execution time C2 is about two orders of magnitude greater than C1. In addition, the cost of the latter computation increases significantly with larger object and site datasets. Therefore, in this study we focused on reducing the computational complexity of computing the overlaps among SSRs and identifying the optimal location. It is important to note that datasets with correlated and independent distributions show the same behavior; hence, they are not reported here.

4.7.2.3 Empirical Analysis

In order to evaluate the execution times of our proposed approaches, we implemented a set of experiments with synthetic datasets. Below, we describe each experiment in more detail.
Effect of site and object distribution on the Basic-Filtering and Grid-based-Filtering approaches: For this experiment, we used a synthetic dataset consisting of five series of 2-dimensional site and object points with fixed cardinality and different distributions. The cardinality of site points in these five series is 1,000, 2,000, 3,000, 4,000 and 5,000, respectively, and the number of corresponding objects is double that. The site and object points in each combination share the same distribution: both independent, both correlated, or both anti-correlated. We applied the Basic-Filtering and Grid-based-Filtering approaches to the aforementioned dataset and computed their execution times. Figures 4.11 and 4.12 depict the results of our experiments. We observe that the Grid-based-Filtering approach is about two orders of magnitude faster than Basic-Filtering on average for all distributions. We also observe that the execution time of Basic-Filtering deteriorates rapidly with an increase in the number of object and site points, whereas for Grid-based-Filtering there is no significant growth in execution time (see Figure 4.12 for a more accurate view). The time complexity of the two approaches (discussed in Sections 4.5 and 4.6) confirms these observations. The dominating factor of the time complexity of the Basic-Filtering approach is O(|O|³|S|²), which is polynomially proportional to the number of site points (quadratic) and to the number of object points (cubic); hence, its execution time grows significantly with larger data sizes. However, the dominating factor of the time complexity of the Grid-based-Filtering approach is O(4|O||G| + 4k|G|(k + log|O|)), which is proportional to the grid size |G| as well as to |O|.

Figure 4.11: Execution times of the Basic-Filtering and Grid-based-Filtering approaches on synthetic datasets with different distributions

In addition, Figures 4.11 and 4.12 illustrate that for both approaches the longest execution times occurred when datasets with independent distributions were used, followed by the datasets with correlated distributions and lastly the datasets with anti-correlated distributions. With independent distributions, both object and site points are uniformly distributed in the dataspace. Thus, for a given object its skyline set is scattered across the dataspace, which results in large SSR regions. The larger the SSRs, the greater the time required for the overlap computation and for identifying the optimal location. In anti-correlated distributions, by contrast, both object and site points are closely distributed, which results in small SSRs and faster execution times for the overlap and MaxRSKY computations. Although the distribution of the correlated datasets used for the evaluation of skyline queries ([BKS01]) is clustered, it is sparser than that of the anti-correlated datasets; therefore, their SSR sizes and the corresponding overlap computation are larger than for the anti-correlated datasets. It is important to note that in this experiment two "Out of Memory" conditions occurred for the last two data series while computing the execution time of Basic-Filtering with independent datasets. The SSRs are large in these instances, so the size of L and the number of intersection points are large, which resulted in memory shortages during the MaxRSKY computations.
Figure 4.12: Magnifying the execution times of the Grid-based-Filtering approach illustrated in Figure 4.11

Effect of site and object cardinality on the Basic-Filtering and Grid-based-Filtering approaches: In order to evaluate the effect of site and object cardinality on our proposed approaches, we implemented two experiments. In the first, we considered a fixed site dataset and used object datasets of various sizes. In the second experiment, we fixed the object dataset and used various site datasets. Below, we describe each experiment in more detail.

Effect of Object Dataset: For this experiment, we used a synthetic dataset consisting of five series of 2-dimensional object points and a fixed-size site dataset (1,000 points). The cardinality of object points in these five series is 2,000, 3,000, 4,000, 5,000 and 6,000, respectively. Both sites and objects have anti-correlated distributions. We applied the Basic-Filtering and Grid-based-Filtering approaches to the aforementioned dataset and computed their execution times. Figure 4.13 depicts the results of our experiment. This diagram shows that the execution time of both approaches grows with an increase in the number of object points, since they deal with a larger number of SSRs and overlaps among them. However, the growth trend of the execution time of Basic-Filtering is polynomial, whereas that of Grid-based-Filtering is linear with respect to the number of object points. The time complexity of both approaches confirms this effect. The total costs of the MaxRSKY computation using Basic-Filtering and Grid-based-Filtering are O(|O|³(|S| log|S| + |S|²) + k|O|³|S|²(k + log|O|)) and O(|O||S|²) + O(4|O||G| + 4k|G|(k + log|O|)), respectively. Hence, we observe that Grid-based-Filtering outperforms Basic-Filtering by a factor of 20 on average (Figure 4.13). Since both approaches behave similarly for datasets with correlated and independent distributions, those results are not reported here.

Figure 4.13: Execution times of the Basic-Filtering and Grid-based-Filtering approaches on synthetic datasets with varying object cardinality and a fixed-size site dataset

Effect of Site Dataset: For this experiment, we used a synthetic dataset consisting of five series of 2-dimensional site points and a fixed-size object dataset (6,000 points). The cardinality of site points in these five series is 1,000, 2,000, 3,000, 4,000 and 5,000, respectively. Both sites and objects have anti-correlated distributions. We applied the Basic-Filtering and Grid-based-Filtering approaches to the aforementioned dataset and computed their execution times. Figure 4.14 demonstrates that the execution time of both approaches decreases with an increase in the number of site points, since they deal with a smaller number of skyline points and a smaller number of SSRs, respectively.
We also observed that Grid-based-Filtering outperforms Basic-Filtering by a factor of 17 on average since, as discussed earlier, the running time of Grid-based-Filtering is proportional to the grid size whereas the running time of Basic-Filtering is polynomially proportional to the number of site and object points.

Figure 4.14: Execution times of the Basic-Filtering and Grid-based-Filtering approaches on synthetic datasets with varying site cardinality and a fixed-size object dataset

Effect of grid cell size on the Grid-based-Filtering approach: With this experiment, we studied the effect of changing the granularity of the imposed grid on the execution time of the Grid-based-Filtering approach. As mentioned earlier in Section 4.7.1, the default grid was 250 × 250 with a cell size of 1 × 1. For this experiment, we varied the grid cell size over the range [0.04, ..., 80] (Figure 4.15). We applied Grid-based-Filtering to a dataset with 1,000 site points and 2,000 object points with an anti-correlated distribution and computed its execution times for the various grid cell sizes. As illustrated in Figure 4.15, the execution time of Grid-based-Filtering (the dashed curve) fluctuates with changes in grid cell size. These fluctuations occur because the size of L(c) for each entry in the OT table and the number of pruned entries may vary from one grid cell size to another. However, the trend line of the execution times (the solid polynomial curve) shows that for grid cell sizes in the middle of the range (e.g., the cell sizes between 0.0625 and 0.675 in Figure 4.15), the execution time is low compared to the other cell size values. For coarser granularities (i.e., cell sizes greater than 0.675), the execution time goes up because with larger cell sizes we deal with larger numbers of pairwise overlaps, which results in higher execution times. In the worst case, when the cell size approaches the area of the entire dataspace, the performance of the Grid-based-Filtering approach degrades to that of the Basic-Filtering approach. Also, for finer granularities (e.g., cell sizes less than 0.05), the execution time deteriorates slightly. This is because splitting up the grid beyond an optimal cell size provides no further improvement in the efficiency of the Grid-based-Filtering approach, while larger numbers of grid cells result in larger numbers of entries in OT and, correspondingly, higher execution times.

Figure 4.15: Execution times of the Grid-based-Filtering approach with various grid granularities

4.8 Conclusions and Future Work

In this study, we proposed, for the first time, a solution for the problem of maximizing reverse skyline queries. Accordingly, we proposed two approaches, Basic-Filtering and Grid-based-Filtering, for efficient computation of MaxRSKY, with a focus on reducing the cost of the overlap computation among SSRs. We verified and compared the performance of our proposed solutions with a rigorous complexity analysis as well as an extensive experimental evaluation using real-world and synthetic datasets.
With the proposed approach for multi-criteria optimal location queries discussed in this chapter, we assumed that the attributes of sites and objects are all non-spatial attributes (e.g., the "Weight" and "Memory" attributes considered in the laptop marketing scenario). However, spatial attributes, such as the geographic locations of sites and objects, might also be of interest; these are not included in this study. As a future study, a new research challenge would be to consider the spatial attributes besides the non-spatial ones and look into the MCOL problem in two ways: (1) by adapting the existing solutions in order to include the spatial attributes; or (2) by enhancing the existing solutions in order to address the MCOL problem considering spatial attributes.

Chapter 5

Dynamic Optimal Location Queries

P. Ghaemi, K. Shahabi, J. P. Wilson, and F. Banaei-Kashani. Continuous maximal reverse nearest neighbor query on spatial networks. In Proceedings of the 20th SIGSPATIAL International Conference on Advances in Geographic Information Systems, page forthcoming, 2012.

5.1 Introduction

A common and important limitation of the existing solutions for the optimal network location query is the assumption that sites and objects rarely (if ever) change their location over time. However, there are numerous real-world applications where sites and/or objects are moving entities that frequently change location. Examples of such applications are food truck location planning, disaster-response facility location planning, and mobile police unit assignment, to name a few. For instance, with the food truck location planning application, food trucks, which frequently change their locations during the day, can use optimal location queries to determine the best location to stop next (i.e., where they can serve the most customers). With this application, not only the sites (i.e., food trucks) are moving, but the objects (i.e., customers) also change location as they commute throughout the day. Similarly, with the disaster-response planning application, aid supply and support units (sites) must be placed (on the fly) where they can serve the most victims (objects). Likewise, the demand for aid in different areas is likely to change frequently over time as inspections identify new victims and the identified victims receive aid during the disaster response.

The existing solutions for the optimal location query consume hours (or tens of minutes at best) to compute the optimal location; nevertheless, they are applicable to classic optimal location applications because the locations of the sites and objects (most probably) remain unchanged during the query computation. However, with dynamic applications such as those described above, since sites and/or objects frequently move, the result generated by such solutions is most probably invalid by the time the computation is complete; hence, these solutions are inapplicable. For example, according to the data collected from a real-world food truck application with only 32 trucks, one of the trucks may change location as often as every two minutes, whereas each run of the optimal location query on average takes about 39 minutes to complete (see Section 5.8.2 for more details).

To be able to support dynamic applications, one should avoid computing the optimal location query from scratch and, instead, compute the query incrementally to leverage computations from past queries. We term such queries Dynamic Optimal Network Location Queries (or DONL queries, for short).
DONL queries continuously provide optimal network locations by incrementally updating the result of the query at each time t as sites and objects change location over time. We should mention that while some similar problems have been studied under the topic of "dynamic location modeling" by the geography and operations research communities (e.g., the studies proposed by Horner et al. [HD07] and Erlenkotter et al. [Erl75]), by design their solutions only scale to problems with very small site and object datasets, and can only approximate the exact result by applying heuristics (see Section 5.2).

In this study, we formalize DONL queries as Continuous Maximal Reverse Nearest Neighbor (CMaxRNN) queries on spatial networks, and present a scalable and exact solution for CMaxRNN query computation (hereafter, we use the terms DONL and CMaxRNN interchangeably). We argue that answering any basic optimal network location query includes two main components. First, one has to compute a spatial neighborhood around each (and every) object o of the given object dataset such that if s is the nearest site to object o, any new site s′ introduced within the locality of o will be closer to o than the distance between s and o. The intersection areas where these neighborhoods overlap are the best candidate locations for introducing a new site. Therefore, in the second phase one must compute the overlap among the object neighborhoods and identify the optimal network location, which is a network segment (or a set of segments) with maximum total weight. This approach is also applicable to dynamic optimal network location queries. However, repeating the execution of the two aforementioned steps for the entire dataset every time a site point moves or the weight of an object changes results in a great amount of computational cost and resource consumption.

Instead, with our proposed approach we present a framework for incrementally monitoring MaxRNN queries on spatial networks to avoid redundant computation. We first precompute a data structure and store in it the status of the spatial neighborhoods, the overlaps among them, and the optimal network location. This data structure is constructed in a way that supports CMaxRNN queries and is efficiently updateable. At any time instant t, upon receiving an update to the location of the site and/or object points, we leverage the information precomputed in the initial phase and identify the part of the network that is impacted by these changes. Then, we locally update the spatial neighborhoods of the objects within this locality. Thereafter, the status of the overlaps among the neighborhoods and the new optimal network location are efficiently updated and stored in the data structure to be used for future dynamic queries.

The remainder of this chapter is organized as follows. Section 5.2 reviews the related work. Section 5.3 formally defines CMaxRNN queries on spatial networks and Section 5.4 formally defines the terminology used throughout the remainder of the chapter. In Section 5.5, we present our index structure and introduce our proposed approach, and in Section 5.6 we present the procedures we have developed to execute the update operations. In Section 5.7 we present a complexity analysis of our proposed approach. Section 5.8 evaluates our proposed approach via experiments and Section 5.9 offers some conclusions.
5.2 Related Work

Since the pioneering work of Ballou [Bal68], OR (operations research) researchers have shown continuing interest in dynamic location modeling. Such models typically result in a schedule or plan for opening and/or closing facilities (sites) at specific times and locations in response to changes in parameters (e.g., the demand of the objects, or the location/relocation cost of facilities) over a time horizon. Common dynamic location/relocation models can deal with single facilities (studied by Wesolowsky et al. [Wes73]) and multiple facilities (proposed by Wesolowsky et al. [WT75], Erlenkotter et al. [Erl75], Daskin et al. [DWH92] and Drezner et al. [Dre95]), as well as dynamic location/relocation and time-dependent facility locations (introduced by Drezner et al. [DW91] and Farahani et al. [FDA09]), where the demand changes over time. However, given the computational complexity of most dynamic location modeling problems, existing solutions mostly comprise heuristics that can only "guesstimate" the optimal location without any guaranteed error bounds. More importantly, due to their computational complexity these solutions/models fail to scale to real-world datasets that often consist of large numbers of sites and objects.

On the other hand, given similar scalability issues with existing solutions for the generic family of "location problems", the database community has recently shown interest in developing scalable solutions for these problems. However, to the best of our knowledge, we are the first to introduce and address the CMaxRNN (or DONL) problem, providing a solution which is both exact and scalable. Below, we review the two closest types of related (but orthogonal) location problems previously studied by the database community; namely, the problem of maximal reverse nearest neighbor (MaxRNN), which assumes static site and object datasets, and the problem of continuous reverse nearest neighbor monitoring. The CMaxRNN problem can be thought of as a combination of these two problems.

Wong et al. [WOY+09] and Du et al. [DZX05] both tackled the problem of MaxRNN queries. While efficient, both of the aforementioned approaches assume a p-norm space ([WOY+09] assumes L2 and [DZX05] assumes L1); hence, their solutions do not apply to spatial networks. Zhou et al. [ZWL+11] presented an efficient solution to the MaxBRkNN problem, which finds an optimal region such that setting up a service site in this region guarantees the maximum number of customers who would consider the site one of their k nearest service locations. Their approach assumes an L2 space, which is not applicable to our problem. The problem of MaxRNN on spatial networks has been studied both by Ghaemi et al. [GSWBK10] and Xiao et al. [XYL11]. Ghaemi et al. [GSWBK10] introduced two complementary approaches which enable efficient computation of optimal network location queries on datasets with uniform and skewed distributions, respectively. Xiao et al. [XYL11] also proposed a unified framework that addresses three variants of optimal location queries on spatial networks. They divide the edges of the network into small intervals and find the optimal location on each interval. To avoid an exhaustive search over all edges, they first partition the road network graph into sub-graphs and process them in descending order of their likelihood of containing the optimal locations. In both aforementioned approaches, the assumption is that objects and sites are static and do not change location over time.
Therefore, these approaches are not applicable to CMaxRNN queries.

Continuous monitoring of RNN queries has received considerable attention. The first continuous RNN monitoring solution was presented by Benetis et al. [BJKS06]; however, they assume that the velocities of the objects are known. The first work that does not assume any knowledge of the objects' motion patterns was presented by Xia et al. [XZ06]. Their proposed solution is based on the six-regions approach. Kang et al. [KMS+07] used the concept of half-space pruning for addressing continuous RNN queries. Wu et al. [WYCT08] proposed a solution for continuous monitoring of RkNN which is similar to the six-regions based RNN monitoring approach studied by Xia et al. [XZ06]. Cheema et al. [CLZZ11] focused on continuous bichromatic RkNN queries where only the data objects move. Sun et al. [SJLS08] studied the continuous monitoring of RNN queries on spatial networks, but their approach is only applicable to bichromatic RNN queries and also assumes that the query points do not move. Recently, Cheema et al. [CZL+12] presented a technique for continuously monitoring RkNN queries on spatial networks where both the objects and queries continuously change their locations. However, none of the aforementioned approaches considers continuous maximization of RNNs.

5.3 Problem Definition

In this section we formally define the problem of CMaxRNN queries. Consider a universal set S of sites (e.g., the set of food trucks in our food truck location planning application), and a set O of objects with a weight w_o for each object o ∈ O (e.g., each object can represent the group of people residing in a building, with the number of building occupants as the current weight of the group/object). We assume both sites and objects are located on a spatial network (i.e., a road network). At each time t, a site can be either in- or out-of-service, and sites can switch between these states throughout the day. Moreover, an in-service site can relocate at any time. On the other hand, to model the change of demand, we assume the weight of each object is time-dependent (e.g., in our running example, people can move from one building to another). Note that with this model we can also capture relocation of the objects. Next, we first define the problem of CMaxRNN queries. Thereafter, we reduce this problem to a series of update operations which, if supported efficiently, allow one to continuously maintain a precomputation of the MaxRNN (i.e., the optimal location) as the site and object datasets change.

5.3.1 Continuous Maximal Reverse Nearest Neighbor Query (CMaxRNN)

Intuitively, a CMaxRNN query is a continuous query that at time t returns a network segment (or a set of segments) where introducing a new site would maximize the total weight of the objects that are closer to the new site than to any other site. More formally, given a set S′ of in-service sites and a set O of objects with weights w_o(t) at time t, CMaxRNN returns a subset of the spatial network (i.e., a segment or collection of segments) where introducing a new site s would maximize the total weight of the objects in the bichromatic reverse nearest neighbor (BRNN) set of s. Here, we remind the reader that the BRNN query on a given site s returns all the objects o ∈ O whose nearest neighbor site is s, i.e., for which there is no other site s′ ∈ S′ such that Dist(o, s′) < Dist(o, s).
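As a small illustration of the BRNN definition, consider the case where each object's current nearest site is already known (the precomputed data structure of Section 5.5 stores exactly this). The BRNN set and the influence of a site then reduce to a single scan, as the C# sketch below shows; the types and names here are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;

// An object with its time-dependent weight sampled at the current instant,
// and a cached pointer to its nearest in-service site.
public record NetworkObject(int Id, double CurrentWeight, int NearestSiteId);

public static class Brnn
{
    // BRNN(s): all objects whose nearest site is s.
    public static List<NetworkObject> BrnnSet(IEnumerable<NetworkObject> objects, int siteId)
        => objects.Where(o => o.NearestSiteId == siteId).ToList();

    // Total weight served by s: the quantity CMaxRNN maximizes for a new site.
    public static double Influence(IEnumerable<NetworkObject> objects, int siteId)
        => BrnnSet(objects, siteId).Sum(o => o.CurrentWeight);
}
```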
5.3.2 Update Operations

Our assumed changes in the site and object datasets (i.e., state changes of the sites, relocation of the sites, and weight changes of the objects) can be captured by one (or a combination) of the following three "update operations":

Site Delete Operation (termed Delete, for short), with which a site s is removed from the set S′ of in-service sites at some time t;

Site Insert Operation (termed Insert, for short), with which a site s is added to the set S′ of in-service sites at some time t at the optimal location; and

Object Weight Change Operation (termed Weight-Update, for short), with which the weight w_o(t) of an object o changes at some time t.

For example, site relocation can be implemented as a Delete followed by an Insert. Note that we assume sites are always inserted at the optimal location. We argue that once MaxRNN is precomputed, for efficient execution of CMaxRNN (i.e., to maintain MaxRNN at every time t), one only needs to support efficient execution of the aforementioned update operations, which capture all changes in the site and object datasets. With a naïve approach, one could recompute MaxRNN from scratch each time one of the update operations occurs. However, as we mentioned before, this approach fails to scale with large datasets. Accordingly, we propose an incremental solution that consists of two components: (1) identifying and precomputing a set of state variables that are required for incremental computation of MaxRNN (i.e., the optimal location), and organizing and storing these variables in a data structure that allows for their efficient update; and (2) efficiently executing the update operations by updating the precomputed data structure that maintains the state variables. Next, after presenting our terminology in Section 5.4, we discuss the two aforementioned components of our solution in Sections 5.5 and 5.6, respectively.

5.4 Terminology

In this section we formally define the terminology used in the remainder of this chapter. For the definitions of Local Network, Overlapping Local Networks and Overlap Segment, we refer the reader to Section 3.3. Below, we define the remainder of the preliminaries:

DEFINITION 1 (IMPACTED OBJECTS): In Delete and Insert operations, the objects whose NN site changes are called Impacted Objects (I-OBJ). When an existing site s is removed from the set of in-service sites, a number of object points are impacted by losing their NN site; these are the objects in the RNN set of site s. Likewise, adding a new site s to the set of in-service sites impacts the objects attracted to the new NN site s. Besides these two operations, in a Weight-Update operation the objects whose weights change over time are called impacted objects as well.

DEFINITION 2 (IMPACTED EDGES): The edges belonging to the local networks of impacted objects are defined as Impacted Edges (I-EDG). Upon receiving a Delete or Insert operation, the impacted edges lose markers or receive new markers. In a Weight-Update operation, however, the markers remain unchanged whereas the influence values of the impacted edges change.

DEFINITION 3 (SNAPSHOT MAXRNN QUERY): Given a set O of objects and a set S of sites, the SMaxRNN query computes a subset of the spatial network (i.e., a segment or collection of segments) where introducing a new site s would maximize the total weight of the objects in the bichromatic reverse nearest neighbor (BRNN) set of s. This is equivalent to the basic optimal network location query (i.e., Definition 4 of Section 3.3), where both objects and sites are considered static.
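Before turning to the precomputation, the following sketch shows one hypothetical way the three update operations of Section 5.3.2 might be encoded as events, and in particular how a site relocation decomposes into a Delete followed by an Insert. The encoding is purely illustrative; the dissertation does not prescribe one.

```csharp
using System.Collections.Generic;

// Hypothetical event types for the three update operations.
public abstract record UpdateOp(double Time);
public record Delete(double Time, int SiteId) : UpdateOp(Time);
public record Insert(double Time) : UpdateOp(Time);   // always at the current optimal location
public record WeightUpdate(double Time, int ObjectId, double NewWeight) : UpdateOp(Time);

public static class Operations
{
    // Site relocation = a Delete followed by an Insert at the (new) optimal location.
    public static IEnumerable<UpdateOp> Relocate(double t, int siteId)
    {
        yield return new Delete(t, siteId);
        yield return new Insert(t);
    }
}
```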
5.5 Precomputation

In order to be able to continuously compute the optimal location with CMaxRNN, the main idea behind our proposed solution is to precompute, for each network edge, the likelihood that it contains the optimal location, and to maintain a ranking of the edges by this likelihood from high to low. With this precomputation, one can efficiently identify the optimal location at which to insert a new site (i.e., when an Insert operation is executed) by starting from the edges with higher likelihood and avoiding the edges with lower likelihood during the search process (rather than exhaustively searching for the optimal location over the entire network). In particular, we compute the optimality likelihood of each edge as a "score" that reflects the total weight of the objects whose local networks overlap (at least partly) with the edge. Obviously, the higher the score of an edge, the greater the chance of finding an optimal segment on the edge (a segment for which the sum of the weights of the objects whose local networks all overlap the entire segment is maximized among all network segments).

While the precomputed, ranked edge-optimality-likelihood list allows for efficient computation of the optimal location during an Insert, we also need to maintain this ranked list as update operations are executed. In particular, with Insert and Delete operations, a new site is respectively added to or removed from the set of in-service sites, which may affect the local networks of some objects and, in turn, the optimality likelihood of some edges. Similarly, with the Weight-Update operation, the weights of a number of objects change, and accordingly the scores of the impacted edges may change. To enable incremental maintenance of the ranked edge-optimality-likelihood list during execution of the update operations, in addition to the precomputed list itself, we precompute and maintain the local network of each object. With the latter precomputation, we can quickly identify the impacted objects and the corresponding edges as update operations are executed, and hence we can localize the execution of the update operations (rather than recomputing across the whole network). In the remainder of this section, we present our precomputation procedure along with the data structure used to store the precomputed measures.

Figure 5.1 shows the schema of our precomputed data structure. The "Site" table, the "Object" table, and the "Edge" table (also called the Marked Edge Table, or MET for short) are implemented as dynamic arrays, and maintain information about sites, objects and spatial network edges, respectively. It is important to note that the notion of the MET was described in detail in Section 3.4. As depicted in the diagram, for each site we maintain the location of the site as well as the network edge it resides on; for each object, in addition to location and edge, we maintain the nearest site of the object as well as (the edges that comprise) its local network; and for each edge we maintain the corresponding network nodes that delimit the edge, the computed score of the edge (as described above), and a list of pointers to the markers residing on the edge. Finally, the "Marker" table maintains information about all markers and is stored as a hash table to allow for efficient lookup and update during execution of the update operations.

Figure 5.1: Schema of the precomputed data structure
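A C# transcription of this schema might look as follows. All field names are illustrative, and an edge-local offset stands in for the exact location encoding, which Figure 5.1 does not fix; this is a sketch of the idea, not the actual tables.

```csharp
using System.Collections.Generic;

public class SiteRow
{
    public int SiteId;
    public int EdgeId;                  // the edge the site resides on
    public double Offset;               // position along that edge (assumed encoding)
}

public class ObjectRow
{
    public int ObjectId;
    public int EdgeId;
    public double Offset;
    public double Weight;               // w_o(t), kept current by Weight-Update
    public int NearestSiteId;           // NN site: enables fast RNN retrieval
    public List<int> LocalNetworkEdgeIds = new();    // edges comprising LN(o)
}

public class EdgeRow                    // one entry of the Marked Edge Table (MET)
{
    public int EdgeId;
    public int NodeA, NodeB;            // network nodes delimiting the edge
    public double Score;                // Sc(e), the optimality likelihood
    public List<int> MarkerIds = new(); // pointers to markers residing on the edge
}

public class MarkerRow
{
    public int MarkerId;
    public int ObjectId;                // whose local network ends here
    public int EdgeId;
    public double Offset;
}

public class PrecomputedStructure
{
    public List<SiteRow> Sites = new();                // dynamic arrays
    public List<ObjectRow> Objects = new();
    public List<EdgeRow> Met = new();                  // kept sorted by Score
    public Dictionary<int, MarkerRow> Markers = new(); // hash table for fast lookup
}
```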
Figure 5.2 represents the procedure we follow to populate the precomputed data structure. To emphasize, the precomputation procedure is executed only once and offline (as opposed to the update operations, which are executed online). Below, we explain how we implement this procedure in three steps:

1. Expanding local networks and marking the edges: After populating the generic site, object and edge information into the corresponding tables of our data structure, for each object o we first expand the local network of o, LN(o), using the Dijkstra algorithm, and we stop when we reach the nearest site to o. Then, we mark the ending points/markers of the local networks on the edges and record the markers (this process is called edge marking and was introduced in Section 3.3).

2. Populating the MET: Once markers are generated, we populate the MET entries with the corresponding markers. In addition, for each edge entry e, we compute and store the edge score Sc(e) (as described above). The MET represents our edge-optimality-likelihood list, to be used for the computation of the optimal location/segment (as will be presented in Figure 5.3 below).

3. Sorting the MET: Finally, we sort all entries in the MET in descending order of Sc(e), to identify the edges with higher likelihood of optimality.

The time complexity of the MET population step is O(|E|) and that of the MET sorting step is O(|E| log|E|). Thus, the overall running time of our precomputation procedure is O(|E| log|E|) + O(|O|(|N| log|N| + |E|)).

1: For each o ∈ O
2:   Expand the local network of object o
3:   Mark ending points/markers on edges
4: Construct the Marked Edge Table (MET)
5: Sort the MET based on Sc(e)

Figure 5.2: The precomputation procedure

1: Initialize S_o to ∅ and I_o to 0
2: For each marked edge e of the MET
3:   If Sc(e) ≥ I_o
4:     Apply edge collapsing to edge e
5:     Retrieve I_s and the optimal overlap segment(s)
6:     Update I_o = I_s
7:     Update S_o to the set of overlap segments with maximum influence value I_s
8: Return the optimal solution set S_o and I_o

Figure 5.3: Optimal location computation based on the MET

Before we move on to discuss how the update operations are implemented based on the precomputed data structure, we present a key computation which is frequently invoked during the execution of the update operations: deriving the optimal location/segment based on the MET. Figure 5.3 represents the procedure we follow to perform this computation. Below, we explain how we implement this computation in three steps:

1. Initializing the optimal result set: Assume the set of optimal location(s)/segment(s) is denoted by S_o, with the optimal influence value I_o. At this step, we initialize S_o to the empty set and I_o to zero.

2. Identifying overlap segments on each edge: From the set of marked edges in the MET, we identify the optimal overlap segments by a process called edge collapsing (described in detail in Section 3.4). First, we split the edge e into a set of segments, SG(e), where each segment is the part of edge e located between two consecutive markers. Then, for each segment s of SG(e), we identify the local networks overlapping with e which fully cover s. Accordingly, we compute the influence value of segment s by summing up the influence values of the corresponding local networks. For each edge entry in the MET, the optimal overlap segment s_o is the segment with the highest influence value among all segments in SG(e). Note that edge collapsing may produce more than one optimal overlap segment on each edge.

3. Finding the maximum influence value: After collapsing each edge, we update S_o and I_o if the influence value I_{s_o} of s_o is larger than the current I_o. Once the computation terminates, S_o includes the set of optimal overlap segment(s) with the optimal influence value I_o.
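Restated in C# over the EdgeRow type of the previous sketch, the loop of Figure 5.3 looks as follows. The edge-collapsing routine itself (Section 3.4) is passed in as a delegate, since only its contract matters here; this is a sketch, not the dissertation's code.

```csharp
using System;
using System.Collections.Generic;

public record OverlapSegment(int EdgeId, double From, double To);

public static class OptimalLocation
{
    // metSortedByScore: MET entries in descending order of Sc(e).
    // collapse: the edge-collapsing routine; it returns the best segment(s)
    // on the edge together with their influence value I_s.
    public static (List<OverlapSegment> SOpt, double IOpt) Compute(
        IEnumerable<EdgeRow> metSortedByScore,
        Func<EdgeRow, (List<OverlapSegment> Best, double Influence)> collapse)
    {
        var sOpt = new List<OverlapSegment>();
        double iOpt = 0;
        foreach (var e in metSortedByScore)
        {
            if (e.Score < iOpt) break;   // Sc(e) bounds I_s on e; MET is sorted, so stop
            var (best, inf) = collapse(e);
            if (inf > iOpt) { iOpt = inf; sOpt = new List<OverlapSegment>(best); }
            else if (inf == iOpt) sOpt.AddRange(best);   // keep ties
        }
        return (sOpt, iOpt);
    }
}
```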
The complexity of the edge collapsing computation is O(|E||O|²) (as discussed in more detail in Section 3.6).

5.6 Update Operation Execution

In this section, we present the procedures we have developed to execute the update operations based on the data structure precomputed in Section 5.5.

5.6.1 Delete Operation

Once a site s is deleted from the set of in-service sites, the status of the corresponding impacted objects and edges must be updated in the data structure as follows:

1. Retrieving the impacted objects: All objects belonging to the RNN set of the site s are considered impacted objects, because these objects are the ones losing their NN site s. With our proposed precomputed data structure, computing the RNN set for s is very efficient, because the NN site of each object is already stored in the Object table and can be used for quick RNN computation.

2. Removing the local networks of the impacted objects: With our precomputed data structure, this is simply accomplished by removing the corresponding markers from the MET and Marker tables. In this way, we avoid the costly recomputation of the local networks of all impacted objects.

3. Expanding the new local networks of the impacted objects and identifying their new NN sites: To identify the new NN site of each impacted object, one needs to expand a new local network for the object. However, we observe that the distance between the new NN site and the impacted object is always larger than or equal to that between s and the object. Therefore, we can derive the new local network simply by extending the previous local network, avoiding re-expansion of the network from scratch.

4. Populating the tables: At this step, given the new local networks, we add the new markers of the impacted objects to the MET and Marker tables.

5. Updating the scores of the impacted edges in the MET: Finally, the score of each impacted edge e that loses markers or receives new markers is updated as follows:

NewScore(e) = OldScore(e) + [Score(new markers) − Score(old markers)]

5.6.2 Insert Operation

Similar to the Delete operation, once a site s is added to the set of in-service sites, the status of the corresponding impacted objects and edges must be updated in the data structure as follows. Note that the computations required at each step are similar to those of the corresponding steps in executing the Delete operation as discussed above; here, we avoid repeating such details:

1. Inserting the new site at the optimal location: At the very first step, the optimal location computation process depicted in Figure 5.3 is invoked to identify the (segment of an) edge e which is the optimal location for the new insert. Accordingly, the precomputed data structure is updated with the new site information.

2. Retrieving the impacted objects: Based on Lemma 1, the objects whose local networks include the edge e are considered the impacted objects. These objects can be quickly identified from the Object table given the precomputed information about the local networks of the objects.

3. Removing the local networks of the impacted objects: See Step 2 of the Delete operation execution for details.
4. Expanding the new local networks of the impacted objects and identifying their new NN sites: Similar to Step 3 of the Delete operation execution, we can compute the new local networks based on the previous local network expansions, but here by contracting the expansion instead.

5. Populating the tables: See Step 4 of the Delete operation execution for details.

6. Updating the scores of the impacted edges in the MET: See Step 5 of the Delete operation execution for details.

5.6.3 Weight-Update Operation

The execution procedure for the Weight-Update operation can be summarized as follows (the steps are self-explanatory by now):

1. Retrieving the impacted objects whose weights change.

2. Updating the scores of the impacted edges belonging to the local networks of the impacted objects in the MET.

5.7 Complexity Analysis

In this section, we analyze the computational complexity of the three aforementioned operations.

Delete Operation: Below, we discuss the computational complexity of the various tasks in the Delete operation.

Retrieving the impacted objects: As mentioned earlier, the RNN sets of the site points are computed during the precomputation phase. Accordingly, given the Edge ID of edge e, all impacted objects can simply be retrieved by accessing the Site, Edge and Marker tables in sequence (see Figure 5.1). Therefore, this step takes about O(|O|). This cost is very low compared to the approach of computing the RNN set by constructing the Voronoi cell of each in-service site ([SET09]). The NVD can be constructed using the parallel Dijkstra algorithm [EH00] with the Voronoi generators as multiple sources (recall that the cost of running parallel Dijkstra is O(|O|(|N| log|N|) + |E|)).

Removing the local networks of the impacted objects: The cost of removing the markers of the corresponding impacted objects is O(|E||O|), since in the worst case all O(|O|) objects might be impacted and all local networks might overlap each individual edge of the graph.

Expanding the local networks of the impacted objects: Since the maximum number of overlapping local networks is theoretically equal to |O|, expanding the local networks takes O(|O|(|N| log|N| + |E|)).

Adding the markers of the impacted objects to the MET and Marker tables: Marking all ending points on edges requires O(|O||E|) time.

Updating the scores of the impacted edges in the MET: This step takes O(|E|) time.
Weight-Update Operation: The overall running time of the CMaxRNN query in response to an Weight-Update operation isO(jEjlogjEj) +O(jEjjOj 2 ) since there is no cost of expansion involved in this particular operation. 5.8 Experimental Evaluation We next describe the setup we used for the experiments and then present and discuss the results. 5.8.1 Experimental Setup All experiments are performed on an Intel Core 2.2GHz, 4 GB of RAM, running Win- dows 7 and the .NET platform 3.5. The algorithms are implemented in MicrosoftC]. We use a spatial network ofjNj = 375,691 nodes andjEj = 871,715 bidirectional edges, 95 representing the LA County road network. The spatial network covers 130 km * 130 km and is cleaned to form a connected graph. We use both real-world and synthetic datasets for objects and sites. All sites, objects, nodes and edges are stored in memory-resident data structures. Real-world dataset: In our real-world dataset, objects are population data derived from LandScan Global Population Database (Bhaduri et al. [BBCD02]; see http://www.ornl.gov/landscan/ for additional details) and compiled on a 30” x 30” lat- itude/longitude grid. The centroid of each grid cell is treated as the location of each object and the population within each grid cell as the weight of the object. For the objects which are not located on road network edges, we snapped them to the clos- est edge of the road network. In total we havejOj= 9,662 objects. The weights of objects are distributed nearly uniformly with an average weight of 1,100. Sites are food truck locations derived from a mobile web application operated by TruxMap (http://www.foodtrucksmap.com/la/). TruxMap is a live map of all the roaming food trucks in Los Angeles and elsewhere. Once a food truck has scheduled a start or stop through its Twitter account or an iPhone application using GPS, a marker is automati- cally generated on the TruxMap food truck tracker. Synthetic dataset: We synthesized four datasets (S 1 , S 2 , S 3 , andS 4 ) with different combinations of uniform and normal (mean = 1, standard deviation = 3:2) size and spatial distributions (see Table 5.1 for additional details). To select each object/site point, we randomly picked both X and Y dimensions of the point using the uniform or normal distribution. For each dataset, both objects and sites are either uniformly dis- tributed or have normal distribution. With S 1 , we considered a fixed site-dataset and various object-datasets whereas withS 2 ,S 3 , andS 4 we used a fixed object-dataset and various site-datasets. Also, we considered datasets with two different weight object dis- tributions. For instance, inS 1 ,S 2 , andS 4 the weight of object datasets are all uniformly 96 Table 5.1: Four synthetic datasets for objects and sites Datasets Object Size Site Size Spatial Distri- bution Weight Distri- bution S 1 2000, 5000, 10000, 20000 500 Uniform Uniform S 2 20000 500, 1000, 2000, 5000 Uniform Uniform S 3 20000 500, 1000, 2000, 5000 Uniform Normal S 4 20000 500, 1000, 2000, 5000 Normal Uniform distributed with a weight equal to 1. However, inS 3 the weight of objects are normally distributed (mean = 1, standard deviation = 10:2). 5.8.2 Experimental Results Below we present the results of the three series of experiments that we ran on the afore- mentioned datasets. 5.8.2.1 Feasibility Study We first verified that the SMaxRNN query is not applicable to the CMaxRNN Queries. 
5.8.2 Experimental Results

Below we present the results of the three series of experiments that we ran on the aforementioned datasets.

5.8.2.1 Feasibility Study

We first verified that the SMaxRNN query is not applicable to CMaxRNN queries. For this test, we selected our real-world dataset with 9,662 object points; for site points, we retrieved the locations of 32 food trucks tracked by TruxMap on a given day. We observed that on this particular day the locations of the site points changed as frequently as every two minutes. We first applied the SMaxRNN query to the dataset and computed the execution time. Then, we performed the CMaxRNN algorithm and retrieved its corresponding execution times in response to the three Delete, Insert and Weight-Update operations. The execution time of each operation was obtained by averaging the execution times of 100 runs. We observed that the SMaxRNN query takes about 39 minutes to identify the optimal location (Table 5.2), whereas the CMaxRNN query takes about 68, 37 and 19 seconds for Delete, Insert and Weight-Update, respectively. This experiment confirmed the assertion that using the SMaxRNN approach for continuously computing the MaxRNN set on spatial network databases is not feasible.

Table 5.2: Comparing the execution times of the SMaxRNN and CMaxRNN queries

                CMaxRNN
SMaxRNN         Delete        Insert        Weight-Update
39 minutes      68 seconds    37 seconds    19 seconds

5.8.2.2 Empirical Analysis

In order to evaluate the execution times of our proposed approach, we implemented a set of experiments with synthetic datasets. Below, we describe each experiment in more detail.

Effect of site and object cardinality on CMaxRNN and SMaxRNN: For this experiment, we selected the S1 and S2 datasets, applied the SMaxRNN and CMaxRNN approaches to them, and computed their execution times. In order to compute the execution time of each operation (Delete/Insert/Weight-Update), we sampled 100 iterations and took the average of their execution times as the result. Figures 5.4 and 5.5 depict the results of our experiments. We observe that on both datasets CMaxRNN is orders of magnitude faster than SMaxRNN. This is because all objects in the dataset and their corresponding local edges are engaged in the SMaxRNN computation, whereas in CMaxRNN the computation is limited to the impacted objects and impacted edges.

Figure 5.4: Execution times of CMaxRNN and SMaxRNN on S1

Figure 5.5: Execution times of CMaxRNN and SMaxRNN on S2

Effect of site and object cardinality on the three Delete, Insert and Weight-Update operations: For this experiment, we selected the S1 and S2 datasets, applied the CMaxRNN approach for the three operations, and computed their execution times. Again, for each operation (Delete/Insert/Weight-Update) we sampled 100 iterations and took the average of their execution times as the result. Figures 5.6 and 5.7 depict the results of our experiments. We observe in Figure 5.6 that the execution time of all operations increases as the number of objects in dataset S1 increases: with fixed site points in S1, an increase in the number of objects makes the cost of local network expansion higher.
5.8.2.2 Empirical Analysis

In order to evaluate the execution times of our proposed approach, we implemented a set of experiments with synthetic datasets. Below, we describe each experiment in more detail.

Effect of site and object cardinality on CMaxRNN and SMaxRNN: For this experiment, we selected both the S1 and S2 datasets, applied the SMaxRNN and CMaxRNN approaches to them, and computed their execution times. In order to compute the execution time of each operation (Delete/Insert/Weight-Update), we sampled 100 iterations and used the average of their execution times as the result. Figures 5.4 and 5.5 depict the results of our experiments. We observe that on both datasets CMaxRNN is orders of magnitude faster than SMaxRNN. This is because all objects in the dataset and their corresponding local edges are engaged in the SMaxRNN computation, whereas in CMaxRNN the computation is limited to the impacted objects and impacted edges.

Figure 5.4: Execution times of CMaxRNN (Delete, Insert, Weight-Update) and SMaxRNN on S1 (execution time in seconds versus number of objects).

Figure 5.5: Execution times of CMaxRNN (Delete, Insert, Weight-Update) and SMaxRNN on S2 (execution time in seconds versus number of sites).

Effect of site and object cardinality on the three Delete, Insert and Weight-Update operations: For this experiment, we selected both the S1 and S2 datasets, applied the CMaxRNN approach to the three operations, and computed their execution times, again averaging 100 iterations per operation.

Figure 5.6: Execution times of the Delete/Insert/Weight-Update operations on S1 (execution time in seconds versus number of objects).

Figure 5.7: Execution times of the Delete/Insert/Weight-Update operations on S2 (execution time in seconds versus number of sites).

Figures 5.6 and 5.7 depict the results of our experiments. We observe in Figure 5.6 that the execution time of all operations increases as the number of objects in dataset S1 increases: with fixed site points in S1, an increase in the number of objects makes the cost of local network expansion higher. However, in S2 (Figure 5.7), with an increase in the number of site points, the cost of expansion (i.e., the cost of CMaxRNN in response to the three operations) decreases, since the object points reach their NN site faster.

Effect of the spatial distribution of site and object datasets on the Delete, Insert and Weight-Update operations: First, we studied the effect of datasets with different spatial distributions on the Insert operation. The results for the Delete operation are qualitatively similar. For this experiment, we selected the S2 (uniform) and S4 (normal) datasets. Thereafter, we applied the CMaxRNN approach to the aforementioned datasets and computed both its execution time and the number of impacted objects (I-OBJ) in response to Insert operations. In order to compute the execution time, we ran 100 Insert iterations and used the average of their execution times as the result. It is important to note that in each run the new site is placed on the location returned as the optimal location by the previous CMaxRNN query run.

Figure 5.8: Comparing the execution times of the Insert operation on S2 (uniform) and S4 (normal) (average execution time in seconds and number of impacted objects, versus number of sites).

Figure 5.8 presents the results of our experiments. The left Y-axis represents the average execution time and the right Y-axis shows the number of impacted objects (I-OBJ). We observe that the number of I-OBJ is higher with the normal than with the uniform distribution. This is because the density of object points varies more across the road network for object datasets with a normal distribution; therefore, the RNN query returns a higher number of I-OBJ in the more non-uniformly distributed areas. Accordingly, the average execution time with a normal distribution is higher than with a uniform distribution, in proportion to the number of I-OBJ and the extent of their network expansion. In Figure 5.8, however, the average execution times of the two distributions look similar because the values are low compared to the Y-axis scale and are not distinguishable.

Second, we focused on the effect of datasets with different spatial distributions on the Weight-Update operation. We applied the CMaxRNN approach to the S2 and S4 datasets and computed both the execution times and the number of impacted edges (I-EDG) in response to Weight-Update operations. In order to compute both aforementioned results, we ran 100 Weight-Update iterations and used their averages as the results reported here. For each Weight-Update operation run, we randomly selected two objects and moved the population data of the first one to the second one (a sketch of this workload generator follows below).
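A minimal sketch of this Weight-Update workload generator might look as follows; the SpatialObject and CMaxRnnIndex types are hypothetical stand-ins for the dissertation's data structures, and only the random pair selection and weight transfer are taken from the text.

using System;
using System.Collections.Generic;

// Hypothetical types standing in for the actual implementation.
class SpatialObject { public double Weight; }

class CMaxRnnIndex
{
    public void WeightUpdate(SpatialObject o) { /* incremental update, not shown */ }
}

// One step of the Weight-Update workload: move the population weight of a
// randomly chosen object onto another randomly chosen object.
class WeightUpdateWorkload
{
    private readonly Random rng = new Random();

    public void Step(IList<SpatialObject> objects, CMaxRnnIndex index)
    {
        // Assumes at least two objects in the dataset.
        int from = rng.Next(objects.Count);
        int to = rng.Next(objects.Count);
        while (to == from) to = rng.Next(objects.Count); // two distinct objects

        objects[to].Weight += objects[from].Weight;
        objects[from].Weight = 0.0;

        index.WeightUpdate(objects[from]); // both endpoints changed weight
        index.WeightUpdate(objects[to]);
    }
}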
Figure 5.9: Comparing the execution times of the Weight-Update operation on S2 (uniform) and S4 (normal) (average execution time in seconds and number of impacted edges, versus number of sites).

Figure 5.9 depicts the results of our experiments. The left Y-axis represents the average execution time and the right Y-axis shows the number of I-EDG. We observe that in both datasets (uniform and normal), with an increase in the number of site points, the average execution time improves. This is because, with fixed object points and a larger number of site points, the cost of network expansion decreases. We also observe that both the average execution time and the I-EDG count are higher for the dataset with a normal distribution than for the one with a uniform distribution. This is because, with a normal distribution, it takes longer for object points to reach their NN site; hence, expanding their local networks and computing the optimal location takes longer. The size of their local networks also becomes larger, which causes a higher number of I-EDG when dealing with Weight-Update operations.

Effect of the weight distribution of the object dataset on the Delete, Insert and Weight-Update operations: With this experiment, we focus on the effect of the weight distribution of the object datasets on the Insert and Weight-Update operations. The effect on the Delete operation is qualitatively similar. We selected one series of data from S2 (uniform weights) and one from S3 (normal weights), each with 1,000 site points and 20,000 object points. Thereafter, we applied the CMaxRNN approach and computed its execution time in response to the Insert and Weight-Update operations. The resulting execution time was computed by averaging over 100 runs of each operation, performed as in the previous experiments. Our results showed that for both operations the average execution time under uniform and normal weight distributions is similar. To examine this further, we studied the distribution of execution times for each operation. Figure 5.10 presents the distribution of execution times in response to an Insert operation. As illustrated, low execution times are more frequent with normal weights (the hashed bars) than with uniform weights (the dotted bars). This is because, in object datasets with normal weight distributions, there exist a number of objects with high weight values which dominate the MaxRNN set. More weight concentrated in the MaxRNN set means fewer impacted objects, which results in lower execution times for the CMaxRNN query. We also observe that, in response to a Weight-Update operation (Figure 5.11), the distributions of execution times under the two weight distributions are similar. As mentioned earlier, the execution time of a Weight-Update is proportional to the number of impacted edges; changes in the weights of objects affect only the weights of the impacted edges, not their number. Therefore, the performance of Weight-Update operations is not influenced by changes in the weights of objects.
Figure 5.10: Distribution of the execution times of the Insert operation on S2 (uniform weights) and S3 (normal weights) (frequency versus execution time in seconds).

Figure 5.11: Distribution of the execution times of the Weight-Update operation on S2 (uniform weights) and S3 (normal weights) (frequency versus execution time in seconds).

5.8.2.3 Case Study

With this experiment, we studied the behavior of the CMaxRNN algorithm in response to changes in the locations of 24 food trucks roaming on the LA County road network on a sample day. We retrieved their location data from TruxMap from 10:00 a.m. to 9:00 p.m. and stored this information in a time table (a sketch of this replay procedure follows Table 5.3). As presented in Table 5.3, the minimum interval between the arrivals of two trucks, or between the arrival of one truck and the departure of another, is 2 minutes; the maximum interval is 2 hours and 24 minutes. As for the object dataset, we selected the real-world object dataset (9,662 objects), aggregated their weights, and created a new set with 483 objects. We assumed that we offer a decision support system for food truck owners and that they can ask our system for the optimal location before they decide to change the location of their trucks. Thereafter, we performed the CMaxRNN queries and computed execution times. Figure 5.12 presents the execution times of CMaxRNN for the interval from 9:57 to 11:11 a.m. in response to the Insert operation. Figure 5.13 presents the execution times of CMaxRNN for the Delete operation on the same trucks studied in Figure 5.12 (but for the interval between 12:50 and 2:07 p.m.).

Figure 5.12: Execution times of CMaxRNN for Insert operations in the interval from 9:57 to 11:11 a.m.

Figure 5.13: Execution times of CMaxRNN for Delete operations in the interval from 12:50 to 2:07 p.m.

Table 5.3 provides a summary of the results of this experiment and shows how CMaxRNN returned the optimal result in a couple of seconds, whereas the minimum interval between two operations is about 2 minutes. Therefore, CMaxRNN supports continuous answers to MaxRNN queries for a real-world application. Also, comparing the execution time of SMaxRNN with the CMaxRNN averages verifies that repeated execution of SMaxRNN is not feasible for providing continuous answers to MaxRNN queries. We also observed that the execution time of CMaxRNN queries in response to both Delete and Insert operations is proportional to the number of impacted objects: with a larger number of impacted objects, expanding their corresponding local networks and identifying the optimal location takes more time.

Table 5.3: Summary of the results of CMaxRNN queries on FoodTrucks

Number of food trucks: 24
Number of Insert operations: 24
Number of Delete operations: 24
Minimum interval between two operations: 2 minutes
Maximum interval between two operations: 2 hours and 24 minutes
SMaxRNN execution time: 5.14 minutes
Average CMaxRNN execution time in response to an Insert operation: 19 seconds
Average CMaxRNN execution time in response to a Delete operation: 41 seconds
Average number of impacted objects in response to Insert operations: 13
Average number of impacted objects in response to Delete operations: 23
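For illustration, the case study can be thought of as replaying the day's time table against the incrementally maintained index; a minimal sketch follows (the TruckEvent record and the CMaxRnnService methods are hypothetical names, not the dissertation's API).

using System;
using System.Collections.Generic;

// Hypothetical event record: a truck arriving (Insert) or departing (Delete).
enum EventKind { Insert, Delete }

class TruckEvent
{
    public DateTime Time;
    public EventKind Kind;
    public double X, Y; // truck location, snapped to the road network
}

class CMaxRnnService
{
    public void Insert(double x, double y) { /* incremental update, not shown */ }
    public void Delete(double x, double y) { /* incremental update, not shown */ }
}

class CaseStudyDriver
{
    // Replays the day's time table in order; after each operation the
    // incrementally maintained optimal location is ready to be reported
    // to truck owners who query the system before relocating.
    public void Replay(IEnumerable<TruckEvent> timeTable, CMaxRnnService index)
    {
        foreach (TruckEvent e in timeTable) // events assumed sorted by time
        {
            if (e.Kind == EventKind.Insert)
                index.Insert(e.X, e.Y);  // truck arrives: a new site
            else
                index.Delete(e.X, e.Y);  // truck departs: site removed
        }
    }
}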
5.9 Conclusions

In this chapter we introduced, for the first time, the problem of continuously maximizing the bichromatic reverse nearest neighbor for objects and sites located on spatial networks. Accordingly, we proposed an incremental approach for efficient computation of these queries. We evaluated our proposed solution and demonstrated its efficiency through rigorous complexity analysis as well as an extensive experimental study using real-world and synthetic datasets.

Chapter 6

Conclusions and Future Work

This dissertation introduced three variants of optimal location queries, namely optimal network location queries, multi-criteria optimal location queries and dynamic optimal location queries, by relaxing some of the constraints assumed with the basic optimal location queries.

The first study, on optimal network location queries, relaxed the first constraint by considering the network distance as the distance measure between object points and their nearest neighbor site. With this study, we introduced two complementary approaches, EONL and BONL, which enable efficient computation of optimal network location queries with datasets of uniform and skewed distributions, respectively. Our experimental results with real-world datasets showed that EONL and BONL have their own exclusive use cases in real-world applications, given datasets with different spatial distributions (uniform or skewed).

In the second study, on multi-criteria optimal location queries, we relaxed the second constraint by considering multiple criteria rather than a single criterion in selecting the convenient site. With this study, we reduced the problem of multi-criteria optimal location queries to the problem of maximizing reverse skyline queries. Accordingly, we proposed two new approaches, Basic-Filtering and Grid-based-Filtering, with the goal of reducing the cost of overlaps among SSRs and identifying the optimal location.

In the third study, on dynamic optimal location queries, we relaxed the third constraint by assuming that site and object points are not static and instead change their geographic locations over time. With this study, we formalized the problem of DONL queries as Continuous Maximal Reverse Nearest Neighbor (CMaxRNN) queries on spatial networks. Thereafter, we introduced an index structure that allows for efficient and incremental updates of MaxRNN query results on spatial networks.

There are many open optimal location query research challenges which have not yet been studied by the database community. As we mentioned before, relaxing the constraints assumed in basic optimal location queries leads to new research topics. For instance, one of the key assumptions made in basic optimal location queries is that both objects and sites are point geometries. However, in real-world applications objects and sites may have non-point geometries such as lines, multi-lines, polygons, and multi-polygons (see the studies by Murray and Tong [MT07] and Miller [Mil96] in operations research). As an example, we refer the reader to the real-world application developed by Ghaemi et al. [GSS+09] described in Chapter 1. The main goal of that project is to identify candidate sites for building new parks (polygon or multi-polygon features) that can promote park and open space access for local residents. This application is an example of a general problem: the computation of optimal location queries with non-point data.

Also, as we discussed before, optimal location queries may have different objectives. For instance, in most of the work and solutions proposed thus far, "optimality" is defined as maximizing the total weight of the objects that are closer to the new site than to any other site. However, there exist other real-world applications where optimality might mean something different. For example, the optimal location for a new fire station might be the location that minimizes the maximum distance of buildings to their closest fire station. This is an instance of a general problem in operations research called the 1-center problem, originally introduced by Hakimi ([Hak64], [Hak65]). This problem is also known as the MinMax problem and has been studied by Cabello et al. ([CDBL+05], [CDBL+10]). Therefore, another research challenge to address could be the efficient computation of optimal location queries with the objective of minimizing the maximum distance between objects and their nearest site. There is recent work on the MinMax problem which considers the road network distance as the distance measure ([XYL11]); however, in the literature there is no scalable solution reported for addressing MinMax problems under the L1 and L2 distance measures.
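To make the contrast between the two objectives concrete, they can be written as follows (this formalization is ours, with d the chosen distance measure, i.e., network, L1 or L2; O the objects; w their weights; and S the existing sites). The basic optimal location query seeks

    x* = argmax_x  sum over { o in O : d(o, x) < min_{s in S} d(o, s) } of w(o),

whereas the MinMax (1-center) variant instead seeks

    x* = argmin_x  max_{o in O}  min_{s in S ∪ {x}} d(o, s),

i.e., the location for the new site that minimizes the largest object-to-nearest-site distance once the new site is added.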
With the proposed approach for multi-criteria optimal location queries (discussed in Chapter 4), we assumed that a single site will be placed in response to a MaxRSKY query. However, in some real applications, such as laptop marketing analysis, we might be interested in advertising multiple laptops to the market at the same time. Therefore, another research challenge could be to find the l regions which have the greatest size of the bichromatic reverse skyline set. We call this problem lMaxRSKY.

With the proposed approach for dynamic optimal location queries (discussed in Chapter 5), we assumed that multiple operations received sequentially are responded to serially, in the order in which they were received. Thus, the execution time of computing CMaxRNN for multiple operations is equal to the sum of their individual execution times. However, in some real-world applications (e.g., disaster-response planning applications), requests for multiple operations might occur simultaneously. Therefore, another research challenge could be to develop a batch solution for CMaxRNN queries and parallelize some of its steps toward providing faster responses for multiple operations.

References

[Bal68] R. H. Ballou. Dynamic warehouse location analysis. Journal of Marketing Research, 5:271–276, 1968.

[BBCD02] B. L. Bhaduri, E. A. Bright, P. R. Coleman, and J. E. Dobson. LandScan: Locating people is what matters. Geoinformatics, 5(2):34–37, 2002.

[BCL90] J. L. Bentley, K. L. Clarkson, and D. B. Levine. Fast linear expected-time algorithms for computing maxima and convex hulls. In Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '90, pages 179–187, 1990.

[BJKS06] R. Benetis, C. S. Jensen, G. Karciauskas, and S. Saltenis. Nearest and reverse nearest neighbor queries for moving objects. VLDB Journal, 15(3):229–249, 2006.

[BK02] O. Berman and D. Krass. The generalized maximal covering location problem. Computers and Operations Research, 29(6):563–581, 2002.

[BKS01] S. Börzsönyi, D. Kossmann, and K. Stocker. The skyline operator. In Proceedings of the 17th International Conference on Data Engineering, pages 421–430, 2001.

[BKST78] J. L. Bentley, H. T. Kung, M. Schkolnick, and C. D. Thompson. On the average number of maxima in a set of vectors and applications. Journal of the ACM, 25:536–543, October 1978.
[BW08] O. Berman and J. Wang. The probabilistic 1-maximal covering problem on a network with discrete demand weights. Journal of the Operational Research Society, 59:1398–1405, 2008.

[CDBL+05] S. Cabello, J. M. Díaz-Báñez, S. Langerman, C. Seara, and I. Ventura. Reverse facility location problems. In Proceedings of the 17th Canadian Conference on Computational Geometry (CCCG '05), pages 68–71, 2005.

[CDBL+10] S. Cabello, J. M. Díaz-Báñez, S. Langerman, C. Seara, and I. Ventura. Facility location problems in the plane based on reverse nearest neighbor queries. European Journal of Operational Research, 202(1):99–106, 2010.

[CGGL03] J. Chomicki, P. Godfrey, J. Gryz, and D. Liang. Skyline with presorting. In ICDE, pages 717–719, 2003.

[Cha86] B. Chazelle. Filtering search: A new approach to query-answering. SIAM Journal on Computing, 15:703–724, 1986.

[Chu84] R. L. Church. The planar maximal covering location problem. Regional Science, 24:185–201, 1984.

[CLZZ11] M. A. Cheema, X. Lin, W. Zhang, and Y. Zhang. Influence zone: Efficiently processing reverse k nearest neighbour queries. In ICDE, 2011.

[CM79] R. L. Church and M. E. Meadows. Location modeling utilizing maximum service distance criteria. Geographical Analysis, 11:358–373, 1979.

[CM09] R. L. Church and A. T. Murray. Business Site Selection, Location Analysis, and GIS. John Wiley & Sons, Inc., 2009.

[Coh78] J. L. Cohon. Multiobjective Programming and Planning. Mathematics in Science and Engineering, vol. 140. Academic Press, 1978.

[CR74] R. L. Church and C. ReVelle. The maximal covering location problem. Papers of the Regional Science Association, 32:101–118, 1974.

[CZL+12] M. A. Cheema, W. Zhang, X. Lin, Y. Zhang, and X. Li. Continuous reverse k nearest neighbors queries in Euclidean space and in spatial networks. VLDB Journal, 21(1):69–95, 2012.

[Dij59] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269–271, 1959.

[DK83] D. P. Dobkin and D. G. Kirkpatrick. Fast detection of polyhedral intersection. Theoretical Computer Science, 27(3):241–253, 1983.

[Dre95] Z. Drezner. Dynamic facility location: The progressive p-median problem. Location Science, 3:1–7, 1995.

[DS07] E. Dellis and B. Seeger. Efficient computation of reverse skyline queries. In Proceedings of the 33rd International Conference on Very Large Databases, VLDB '07, pages 291–302, 2007.

[DW91] Z. Drezner and G. O. Wesolowsky. Facility location when demand is time dependent. Naval Research Logistics, 38:763–777, 1991.

[DWH92] M. S. Daskin, W. J. Hopp, and B. Medina. Forecast horizons and dynamic facility location planning. Annals of Operations Research, 40:125–152, 1992.

[DZX05] Y. Du, D. Zhang, and T. Xia. The optimal location query. In Proceedings of Advances in Spatial and Temporal Databases, pages 163–180, 2005.

[EH00] M. Erwig. The graph Voronoi diagram with applications. Networks, 36:156–163, 2000.

[Erl75] D. Erlenkotter. A comparative study of approaches to dynamic location problems. European Journal of Operational Research, 6:133–143, 1975.

[FDA09] R. Z. Farahani, Z. Drezner, and N. Asgari. Single facility location and relocation problem with time-dependent weights and discrete planning horizon. Annals of Operations Research, 167(1):353–368, 2009.

[FH11] R. Z. Farahani and M. Hekmatfar. Facility Location: Concepts, Models, Algorithms and Case Studies. Contributions to Management Science. Physica-Verlag HD, 2011.

[FSA10] R. Z. Farahani, M. SteadieSeifi, and N. Asgari. Multiple criteria facility location problems: A survey. Applied Mathematical Modelling, 34:1689–1709, 2010.
[GH05] A. V. Goldberg and C. Harrelson. Computing the shortest path: A* search meets graph theory. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 156–165, 2005.

[GSS+09] P. Ghaemi, J. Swift, C. Sister, J. P. Wilson, and J. Wolch. Design and implementation of a web-based platform to support interactive environmental planning. Computers, Environment and Urban Systems, 33(6):482–491, 2009.

[GSWBK10] P. Ghaemi, K. Shahabi, J. P. Wilson, and F. Banaei-Kashani. Optimal network location queries. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 478–481, 2010.

[GSWBK12] P. Ghaemi, K. Shahabi, J. P. Wilson, and F. Banaei-Kashani. Continuous maximal reverse nearest neighbor query on spatial networks. In Proceedings of the 20th SIGSPATIAL International Conference on Advances in Geographic Information Systems, forthcoming, 2012.

[Hak64] S. L. Hakimi. Optimum locations of switching centers and the absolute centers and medians of a graph. Operations Research, 12:450–459, 1964.

[Hak65] S. L. Hakimi. Optimum distribution of switching centers in a communication network and some related graph theoretic problems. Operations Research, 13:462–475, 1965.

[HD07] M. W. Horner and J. A. Downs. The graph Voronoi diagram with applications. Transportation Research Board, pages 47–54, 2007.

[HM79] C. L. Hwang and A. S. M. Masud. Multiple Objective Decision Making, Methods and Applications: A State-of-the-Art Survey. Lecture Notes in Economics and Mathematical Systems. Springer-Verlag, 1979.

[HS09] M. Hekmatfar and M. SteadieSeifi. Multi-Criteria Location Problem. Contributions to Management Science. Physica-Verlag HD, 2009.

[JKPT03] C. S. Jensen, J. Kolář, T. B. Pedersen, and I. Timko. Nearest neighbor queries in road networks. In Proceedings of the 11th ACM International Symposium on Advances in Geographic Information Systems, GIS '03, pages 1–8, 2003.

[KLP75] H. T. Kung, F. Luccio, and F. P. Preparata. On finding the maxima of a set of vectors. Journal of the ACM, 22:469–476, 1975.

[KM00] F. Korn and S. Muthukrishnan. Influence sets based on reverse nearest neighbor queries. In SIGMOD, pages 201–212, 2000.

[KMS+07] J. M. Kang, M. F. Mokbel, S. Shekhar, T. Xia, and D. Zhang. Continuous evaluation of monochromatic and bichromatic reverse nearest neighbors. In ICDE, 2007.

[KRR02] D. Kossmann, F. Ramsak, and S. Rost. Shooting stars in the sky: An online algorithm for skyline queries. In VLDB, pages 275–286, 2002.

[KS04] M. Kolahdouzan and C. Shahabi. Voronoi-based k nearest neighbor search for spatial network databases. In VLDB, pages 840–851, 2004.

[LC08] X. Lian and L. Chen. Monochromatic and bichromatic reverse skyline search over uncertain databases. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 213–226, 2008.

[LO01] O. I. Larichev and D. L. Olson. Multiple Criteria Analysis in Strategic Siting Problems. Kluwer Academic Publishers, 2001.

[Mil96] H. J. Miller. GIS and geometric representation in facility location problems. International Journal of Geographical Information Science, 10:791–816, 1996.

[MLDM06] K. Mouratidis, M. L. Yiu, D. Papadias, and N. Mamoulis. Continuous nearest neighbor monitoring in road networks. In Proceedings of the 32nd International Conference on Very Large Databases, pages 43–54, 2006.
[Mou04] D. M. Mount. Geometric intersection. In Handbook of Discrete and Computational Geometry, chapter 38, pages 857–876, 2004.

[MS82] A. Mehrez and A. Stulman. The maximal covering location problem with facility placement on the entire plane. Regional Science, 22:361–365, 1982.

[MT07] A. T. Murray and D. Tong. Coverage optimization in continuous space facility siting. International Journal of Geographical Information Science, 21(7):757–776, 2007.

[MZH83] N. Megiddo, E. Zemel, and S. L. Hakimi. The maximum coverage location problem, 1983.

[PFCS05] D. Papadias, Y. Tao, G. Fu, and B. Seeger. Progressive skyline computation in database systems. ACM Transactions on Database Systems (TODS), 30, 2005.

[PZM03] D. Papadias, J. Zhang, and N. Mamoulis. Query processing in spatial network databases. In VLDB, pages 802–813, 2003.

[RS90] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1990.

[Sam90] H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1990.

[SET09] M. Safar, D. Ebrahimi, and D. Taniar. Voronoi-based reverse nearest neighbor query processing on spatial networks. Multimedia Systems, 15(5):295–308, 2009.

[SGD86] F. Szidarovszky, M. E. Gershon, and L. Duckstein. Techniques for Multiobjective Decision Making in Systems Management. Advances in Industrial Engineering. Elsevier, 1986.

[SJLS08] H. Sun, C. Jiang, J. Liu, and L. Sun. Continuous reverse nearest neighbor queries on moving objects in road networks. In WAIM '08, pages 238–245, 2008.

[SRAA01] I. Stanoi, M. Riedewald, D. Agrawal, and A. El Abbadi. Discovery of influence sets in frequently updated databases. In VLDB, pages 99–108, 2001.

[SS09] J. Sankaranarayanan and H. Samet. Distance oracles for spatial networks. In Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE '09, pages 652–663, 2009.

[SSA] H. Samet, J. Sankaranarayanan, and H. Alborzi. Scalable network distance browsing in spatial databases. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 43–54, 2008.

[SSA09] J. Sankaranarayanan, H. Samet, and H. Alborzi. Path oracles for spatial networks. Proceedings of the VLDB Endowment, 2:1210–1221, 2009.

[TEO01] K. Tan, P. Eng, and B. C. Ooi. Efficient progressive skyline computation. In Proceedings of the 27th International Conference on Very Large Data Bases, VLDB '01, pages 301–310, 2001.

[TSRB71] C. Toregas, R. Swain, C. ReVelle, and L. Bergman. The location of emergency service facilities. Operations Research, 19(6):1363–1373, 1971.

[Web09] A. Weber. Über den Standort der Industrien, Erster Teil: Reine Theorie des Standortes. Tübingen: Mohr, 1909.

[Wes73] G. O. Wesolowsky. Dynamic facility location. Management Science, 7:1241–1248, 1973.

[WOY+09] R. C. Wong, M. T. Özsu, P. S. Yu, A. W. Fu, and L. Liu. Efficient method for maximizing bichromatic reverse nearest neighbor. In VLDB, pages 1126–1149, 2009.

[WT75] G. O. Wesolowsky and W. G. Truscott. The multiperiod location-allocation problem with relocation of facilities. Management Science, 22:57–65, 1975.

[WTW+09] X. Wu, Y. Tao, R. C. Wong, L. Ding, and J. X. Yu. Finding the influence set through skylines. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT '09, pages 1030–1041, 2009.
[WYCT08] W. Wu, F. Yang, C. Chan, and K. Tan. Continuous reverse k-nearest-neighbor monitoring. In MDM '08. IEEE Computer Society, 2008.

[XYL11] X. Xiao, B. Yao, and F. Li. Optimal location queries in road network databases. In Proceedings of the 27th ICDE Conference, 2011.

[XZ06] T. Xia and D. Zhang. Continuous reverse nearest neighbor monitoring. In ICDE '06, 2006.

[XZKD05] T. Xia, D. Zhang, E. Kanoulas, and Y. Du. On computing top-t most influential spatial sites. In VLDB, pages 946–957, 2005.

[YL01] C. Yang and K. I. Lin. An index structure for efficient reverse nearest neighbor queries. In ICDE, pages 51–60, 2001.

[YMP05] M. L. Yiu, N. Mamoulis, and D. Papadias. Aggregate nearest neighbor queries in road networks. TKDE, 17, 2005.

[ZDX06] D. Zhang, Y. Du, and T. Xia. Progressive computation of the min-dist optimal-location query. In Proceedings of the 32nd International Conference on Very Large Databases, VLDB '06, pages 643–654, 2006.

[ZWL+11] Z. Zhou, W. Wu, X. Li, M. Lee, and W. Hsu. MaxFirst for MaxBRkNN. In ICDE '11, pages 828–839, 2011.
Abstract
Optimal location queries have been widely used in spatial decision support systems and marketing in recent years. Given a set S of sites and a set O of weighted objects, a "basic optimal location query" finds the location(s) where introducing a new site maximizes the total weight of the objects that are closer to the new site than to any other site. Due to the intrinsic computational complexity of the optimal location problem, researchers have often resorted to making simplifying assumptions in order for the proposed solutions to scale with large datasets. However, there are many real-world applications where such restrictive assumptions may not hold. In this dissertation, we relax three of the aforementioned simplifying assumptions and correspondingly propose solutions for three popular variations of the basic optimal location problem, namely the "optimal network location problem", the "multi-criteria optimal location problem" and the "dynamic optimal location problem". These variations of the original problem allow for considering network distance (rather than p-norm distance), multiple preference criteria (rather than distance as the single preference criterion), and dynamic objects and sites (rather than static ones), respectively.

In Chapter 3, we introduce two complementary approaches for efficient computation of optimal network location (ONL) queries, namely EONL (short for "Expansion-based ONL") and BONL (short for "Bound-based ONL"), which enable efficient computation of ONL queries with object-datasets containing uniform and skewed distributions, respectively. Thereafter, we experimentally compare our proposed approaches and discuss their use cases with different real-world applications. Our experimental results with real datasets show that, given uniformly distributed object-datasets (i.e., datasets with uniform spatial distributions), EONL is an order of magnitude faster than BONL, whereas with object-datasets with skewed distributions BONL outperforms EONL. Therefore, EONL and BONL have their own exclusive use cases in real-world applications and are complementary.

In Chapter 4, we formalize the multi-criteria optimal location problem as the maximal reverse skyline query (MaxRSKY) and introduce two filter-and-refine approaches, termed "Basic-Filtering" and "Grid-based-Filtering", that allow for efficient computation of MaxRSKY queries. The latter approach is an enhanced solution because it avoids redundant computation by filtering out the irrelevant parts of the search space for improved efficiency. Our extensive empirical analysis with both real-world and synthetic datasets shows that our enhanced solution is more efficient in computing answers for MaxRSKY queries with large datasets containing thousands of sites and objects. For datasets where the "Basic-Filtering" approach responds to a MaxRSKY query in hours, the "Grid-based-Filtering" approach completes this computation in minutes.

In Chapter 5, we formalize dynamic optimal location queries as Continuous Maximal Reverse Nearest Neighbor (CMaxRNN) queries on spatial networks and present a scalable, exact solution for CMaxRNN query computation. In our proposed approach, rather than computing each optimal location query from scratch, we compute the query incrementally to leverage computations from past queries. Our experimental results on a real-world dataset show that CMaxRNN queries are about two orders of magnitude faster than running the optimal location query from scratch.