SCALABLE PROCESSING OF SPATIAL QUERIES

by

Seyed Jalal Kazemitabar

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2016

Copyright 2016 Seyed Jalal Kazemitabar

To my parents for their continuous support since I was born.

Acknowledgments

Thanks to my advisor and my family.

Contents

1 Introduction
  1.1 Summary of Thesis Work
  1.2 Related Publications
2 Related Work
  2.1 Proximity Detection Query
    2.1.1 Continuous Range Query
    2.1.2 Proximity Detection Query
  2.2 Maximal Reverse Skyline Query
    2.2.1 Optimal Location
    2.2.2 Multi-dimensional Data Processing using MapReduce
3 Scaling the Communication: The Proximity Detection Query
  3.1 Introduction
  3.2 Proximity Query Processing System
    3.2.1 Proximity Query Definition
    3.2.2 System Architecture and Query Processing
  3.3 Problem Definition and Solution Overview
  3.4 Phase I: Query Categorization
  3.5 Phase II: Probe Ordering
    3.5.1 Data Structure
    3.5.2 Probe Ordering Algorithm
    3.5.3 Probe Selection Approach
  3.6 Empirical Evaluation
    3.6.1 Experiment Methodology
    3.6.2 Results
4 Scaling the Computation: The Maximal Reverse Skyline Query
  4.1 Introduction
  4.2 Problem Definition
    4.2.1 Multi-Criteria Optimal Location (MCOL)
    4.2.2 Maximal Reverse Skyline (MaxRSKY)
  4.3 Baseline Solution
  4.4 Basic Filtering Approach
    4.4.1 Precomputation
    4.4.2 Query Processing
  4.5 Multi-Granular Grid-based Filtering Approach
    4.5.1 Spatial Partitioning using Grids
    4.5.2 Dynamic Data-Aware Grid-Size Setting
    4.5.3 Load-Balanced Distributed Processing
  4.6 Experimental Results
    4.6.1 Cost of Computing SSRs vs. Cost of Computing Overlap among SSRs
    4.6.2 Baseline Filtering Approach vs. Basic Filtering Approach
    4.6.3 Effect of the Scoring Function
    4.6.4 Sensitivity to Cell Size in the Plain Grid-Based Approach
    4.6.5 Multi-Granular Grid-Based Approach
    4.6.6 Effect of Site and Object Cardinality
    4.6.7 Effect of Site and Object Distribution
    4.6.8 Effect of Load Balancing Mechanism
    4.6.9 Effect of Dimension Cardinality
    4.6.10 Effect of Cluster Node Cardinality
Appendices
  .1 The Boolean Overlap Detection Problem
    .1.1 Lemmas
    .1.2 Computational Complexity
5 Conclusion and Future Work
Reference List

List of Figures

3.1 Two objects involved in a proximity query
3.2 A batch of proximity queries. All queries can be answered by probing u_1, u_3
3.3 Mobile region for u_i at time t
3.4 Partitions of the real line and their analogous categories for the proximity query
3.5 Solving queries based on the distance between their mobile regions. The status of a proximity query is determined if an object (e.g., u_j) is in the gray area formed by the other one (i.e., u_i)
3.6 There are chances the query is satisfied only by probing one object. u_i has a higher chance to solve the query
3.7 C_j completely falls in the blind area. u_i must be probed first
3.8 There are chances the query is unsatisfied only by probing one object. u_i has a higher chance to solve the query
3.9 Cost breakdown at the default settings
3.10 Object tracking cost (location update + probe messages)
3.11 Server-side computation cost
4.1 Example site and object datasets in 2-dimensional space
4.2 SSR region for object o_1
4.3 Overview of our filtering approach
4.4 Running example SSR regions
4.5 Precomputation for the Basic Filtering Approach
4.6 MaxRSKY query processing with the Basic Filtering Approach (Map function)
4.7 An illustration of the plain grid-based filtering approach
4.8 Solving the MaxRSKY query using the multi-granular grid-based filtering approach
4.9 Distribution of coarse-grained cells across the reducers
4.10 MaxRSKY query processing with the Multi-Granular Grid-Based Filtering Approach (Reduce function)
4.11 Baseline filtering vs. basic filtering
4.12 Effect of the scoring function on the basic filtering approach
4.13 Effect of cell granularity on the plain grid-based approach
4.14 Multi-granular grid-based approach with the default parameters
4.15 Effect of object set cardinality
4.16 Effect of site set cardinality
4.17 Effect of data distribution
4.18 Effect of load-balancing mechanism
4.19 Effect of the number of dimensions
4.20 Effect of the number of worker nodes
21 Representing a cell as an SSR

List of Tables

3.1 Summary of Parameters
3.2 Effect of Category 3 on Probing Users
4.1 Common Notations
4.2 Overlap Table (OT)
4.3 Summary of Parameters
4.4 Cost of computing SSRs vs. cost of computing overlap among SSRs

Abstract

In recent years, geospatial data have been produced in mass, e.g., through billions of smartphones and wearable devices. The current exponential growth in data generation by mobile devices on the one hand, and the rate and complexity of recent spatial queries on the other, highlight the importance of scalable query processing techniques. Traditional database technology, which operates on centralized architectures to process persistent and less dynamic spatial objects, does not meet the requirements for scalable geospatial data processing.

In this thesis, we specifically focus on two primary challenges in scaling spatial queries, i.e., the communication and computation costs, while guaranteeing the correctness of query results. We utilize techniques such as batch processing and parallelized frameworks to address these challenges.

We address the location tracking cost towards achieving scalability in communication-intensive queries. The location tracking cost between the moving objects and the query processing server is a key factor in processing many moving-object continuous queries. The challenge is that increasing the number of queries and objects would require frequent location updates, which drain the battery power of mobile devices. Thus, existing approaches would not scale unless query correctness is compromised. In this thesis, we propose batch processing of spatial queries as a method to optimize the location tracking cost, scaling to large numbers of queries and objects without either compromising query correctness or using excessive battery power. In our approach, the queries are categorized into independent groups and then processed in parallel. We specifically apply our approach to the proximity detection query and optimize the communication cost while processing millions of queries.

Processing some spatial queries has become more resource-intensive in recent years. This is due to various reasons such as the introduction of queries that are more computationally complex compared to the classic ones, as well as an increase in the input size (e.g., the number of GPS-enabled devices). In this thesis, we propose optimized algorithms and utilize MapReduce to process a complex spatial problem, i.e., the Multi-Criteria Optimal Location (MCOL) problem. First, we formalize it as a Maximal Reverse Skyline (MaxRSKY) query.
For the first time, we present an optimized solution that scales to millions of objects over a cluster of MapReduce nodes. Specifically, rather than batch processing the query, which is typical of a MapReduce solution, we first partition the space and run a precomputation phase in which we identify potential regions hosting the optimum solution, and then load-balance the regions across the reducers in a dynamic way to reduce the total execution time.

Chapter 1

Introduction

Processing and analyzing geospatial data, i.e., data with geographic and spatial attributes, has recently gained attention in academia and industry [19, 78]. The data received in geospatial applications need to be processed in a short time to meet the low-latency requirements of continuous queries. Among all geospatial applications, large-scale ones are growing both in quantity and scale due to recent advancements in sensing technology and the popularity of social media and smartphones. For example, consider the following two applications:

Example 1: Location-Based Services (LBS): Online advertisement and dating applications are among many LBSs that receive real-time location updates and continuously run geospatial queries. As an example scenario, a user might be interested in receiving special deals from stores within a small range. The value of such information deteriorates as he passes by the stores. In a similar vein, a Foursquare application, which is hosted by millions of users in the U.S. alone, can run a kNN query to find the nearest friends who have just checked in to restaurants around a client. The market share of these applications is growing fast for various reasons, including the current appeal of social media.

Example 2: Intelligent Transportation Systems (ITS): Valuable real-time geo-tagged data are produced by loop detectors and GPS devices in cars, taxis, and trucks, and by government agencies in big cities around the world [32, 78]. For example, millions of drivers in LA County would like to know the fastest route or the time to destination while driving, and expect this information to be promptly updated right after an accident happens on the road ahead. Once again, the scale is growing every day as manufacturers inject new cars into road networks and more sensors are installed throughout roadways.

Large-scale geospatial applications often share the following features:

1. Parallelizable: These applications include independent modules that can run in parallel. For instance, every Foursquare client looking for his friends would submit a separate query that can be processed in parallel with others.

2. Resource intensive: Due to recent changes, namely cost reduction in sensor technology, availability of GPS-enabled devices, and popularity of social networking, geospatial applications encounter a continuous rise in load. There are numerous success stories of companies offering LBSs that faced thousand-percent load growth within months. Moreover, the large quantity of records received in real time on one side, and the low-latency requirements on the other, demand huge computational resources and bandwidth. A shopping or driving individual interacting with geospatial applications expects seamless responses to his queries.

3. Changing load: Geospatial applications frequently confront change in their load. In fact, they encounter fluctuating load interrupted by (predictable) peaks. An LBS has to serve the largest number of queries during mid-day hours. Similarly, an ITS receives many queries during rush hour.
This is while both applications are almost idle during the early hours after midnight.

1.1 Summary of Thesis Work

Processing massive amounts of spatial data and queries in geospatial applications poses considerable scalability requirements. However, traditional database technology does not meet the requirements for scalable geospatial data processing. On the one hand, such technology primarily operates on persistent and less dynamic data objects, which does not match the dynamics of location-based applications such as the many that involve smartphones. On the other hand, the existing query processing techniques are primarily designed for a centralized architecture, which is bound by the limitations of scaling up rather than scaling out and is thus inadequate to handle non-classic queries. All of this calls for a shift towards scalable query processing approaches.

In this thesis, we address the communication and computation costs as the two primary challenges in scaling spatial queries. Communication between a moving object, e.g., a mobile device, and a query processing server is the most common form of communicating location information to answer spatial queries, e.g., in location-based services. Correct, rather than approximate, processing of queries requires the mobile device to frequently detect and communicate its location to the server. This would quickly drain the battery power of the mobile device and lead to user dissatisfaction.

We analyzed the different types of messages that are typically communicated between a mobile client and a server, and noticed that two of them together contribute to the location tracking cost, which is the major cost in client-server communication for spatial queries: the probe and location update messages. Between these two types of messages, the probe message, i.e., polling the object to obtain its current location, plays the primary role, since it involves two units of messages, to and from the mobile client. With effective probing, one can monitor the current location of the objects with sufficient accuracy for the existing queries, by striking a balance between the communication cost of probing and the accuracy of the knowledge about the objects' current locations. In this thesis, for the first time, we formulate optimal probing as a batch processing problem and propose a method to prioritize probing the objects such that the total number of probes required to answer all queries is minimized.

Specifically, we focus on the proximity detection query, which continuously monitors the proximity of two given moving objects in order to determine whether their distance is within a certain desired threshold at each timestamp. We propose a two-phase probe optimization technique that efficiently computes a near-minimal subset of probes that can resolve a given batch of queries. In the first phase, considering the proximity threshold and approximate object locations of each proximity query, we categorize every query in the batch to determine whether it requires probing, and if so, identify the probe(s) that with highest probability can resolve the query (if queries are considered individually).
Subsequently, in the second phase, given the categorization of the queries from the first phase and considering the interdependencies between queries due to the query objects they share, we use a systematic probe selection approach that intelligently chooses an order of probes that minimizes the total number of probes required to resolve all queries in the batch. Our probe optimization technique is parallelizable, and with extensive experiments we show that it scales well to support hundreds of millions of queries, while on average it incurs 30% less communication cost than the best existing proximity query answering technique. This is a considerable improvement, since out of the two users involved in an unanswered query, at least one needs to be probed in order to solve the query.

The second challenge that we address in this thesis is the scalability of compute-intensive spatial queries. Classic spatial queries such as range, Nearest Neighbor (NN), and k Nearest Neighbors (kNN) have traditionally been solved in small scale over a centralized server. However, with the recent growth in the number of GPS-enabled devices on the one hand, and the introduction of more complicated queries on the other, there is a need for new algorithms that are tailored to process queries over decentralized architectures.

Scalable algorithms have recently been proposed to address the classic spatial queries over decentralized architectures (see Chapter 2.2.2). In this thesis, we introduce a complex non-classic spatial query, and for the first time we propose an efficient algorithm to solve this query in large scale and parallelize it on the MapReduce framework. We first target a real-world market research problem and formalize it as a spatial query. The spatial query is CPU-intensive and does not scale to even thousands of objects [26]. We propose three solutions to this query, where each solution evolves over the previous one. Experiments show that our final solution successfully scales to millions of objects. We next briefly explain the problem and our solution.

Given a set S of sites and a set O of objects in a metric space, the Optimal Location (OL) problem is about computing a location in the space where introducing a new site (e.g., a laptop) maximizes the number of the objects (e.g., customers) that would choose the new site as their "preferred" site among all sites. However, the existing solutions for the optimal location problem assume that there is only one criterion to determine the preferred site for each object (i.e., the metric distance between objects and sites), whereas in numerous real-world applications multiple criteria are used as preference measures. In this thesis, for the first time we develop an efficient and exact solution for the so-called Multi-Criteria Optimal Location (MCOL) problem that can scale with large datasets. Toward that end, we first formalize the MCOL problem as a maximal reverse skyline query (MaxRSKY). Given a set of sites and a set of objects in a d-dimensional space, a MaxRSKY query returns a location in the space where, if a new site s is introduced, the size of the (bichromatic) reverse skyline set of s is maximal. To the best of our knowledge, this thesis is the first to define and study the MaxRSKY query at a large scale. Accordingly, we propose a filter-based solution, termed the multi-granular grid-based approach, that effectively prunes the search space for efficient identification of the optimal location. Our extensive empirical analyses show that our approach is invariably efficient in computing answers for MaxRSKY queries over large datasets containing millions of objects and sites.
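To make the MaxRSKY semantics concrete before the technical chapters, the sketch below scores a candidate location by counting the objects whose bichromatic reverse skyline would contain it, using the standard dynamic-dominance test with respect to each object. This brute-force scoring is exactly what our filtering approaches are designed to avoid; it is given only as an illustration of the query's definition, and all function names are ours.

```python
def dominates(a, b, o):
    """Site a dynamically dominates site b w.r.t. object o if a is at
    least as close to o on every dimension and strictly closer on one."""
    closer_eq = all(abs(ad - od) <= abs(bd - od) for ad, bd, od in zip(a, b, o))
    strictly = any(abs(ad - od) < abs(bd - od) for ad, bd, od in zip(a, b, o))
    return closer_eq and strictly

def reverse_skyline_size(candidate, sites, objects):
    """Number of objects o for which no existing site dominates the
    candidate w.r.t. o, i.e., the size of the candidate's bichromatic
    reverse skyline set."""
    return sum(1 for o in objects
               if not any(dominates(s, candidate, o) for s in sites))

def naive_maxrsky(candidates, sites, objects):
    """Exhaustive MaxRSKY over a finite candidate set:
    O(|candidates| * |S| * |O|) dominance tests."""
    return max(candidates, key=lambda c: reverse_skyline_size(c, sites, objects))
```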
The rest of this thesis is organized as follows. In Chapter 2, we discuss the related work on processing spatial queries at scale, and specifically the (limited-scale) work done on the proximity detection and MaxRSKY queries. In Chapter 3, we address the communication cost in the context of the proximity detection problem, where we describe the different types of messages and propose an optimized approach for reducing the number of communicated messages towards correctly processing the queries. Next, in Chapter 4, we discuss our filter-based approaches tailored to the MapReduce framework to address the MaxRSKY query. We conclude the dissertation and discuss directions for future work in Chapter 5.

1.2 Related Publications

Parts of this thesis have been published in spatial database conferences and journals. The list includes:

Related to Chapter 1:
• Kazemitabar, Seyed Jalal, Farnoush Banaei-Kashani, and Dennis McLeod. "Geostreaming in cloud." Proceedings of the 2nd ACM SIGSPATIAL International Workshop on GeoStreaming, 2011 [77].
• Kazemitabar, Seyed Jalal, Ugur Demiryurek, Mohamed Ali, Afsin Akdogan, and Cyrus Shahabi. "Geospatial stream query processing using Microsoft SQL Server StreamInsight." Proceedings of the VLDB Endowment 3, no. 1-2 (2010): 1537-1540 [78].

Related to Chapter 3:
• Kazemitabar, Seyed Jalal, Farnoush Banaei-Kashani, Seyed Jalil Kazemitabar, and Dennis McLeod. "Efficient batch processing of proximity queries by optimized probing." In Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 84-93, 2013 [76].

Related to Chapter 4:
• Kazemitabar, Seyed Jalal, Abhinav Sharma, Farnoush Banaei-Kashani, and Dennis McLeod. "Scalable Maximal Reverse Skyline Query Processing using MapReduce." To be submitted, 2016 [79].
• Banaei-Kashani, Farnoush, Parisa Ghaemi, Bahman Movaqar, and Seyed Jalal Kazemitabar. "Efficient Maximal Reverse Skyline Query Processing." Submitted, 2016 [26].

Chapter 2

Related Work

2.1 Proximity Detection Query

2.1.1 Continuous Range Query

Each proximity detection query can be mapped to a corresponding continuous range query. However, that is a perspective we would like to avoid. Each proximity query, as we will define shortly, has a customized range (its ε-neighborhood), which together with the two involved objects makes each query unique. Following the paradigm of a range query to answer a single proximity query requires extra computation to find all matching objects initially, and then filtering out all but at most one of the matching results, making it computationally unjustifiable. Since applications generally include a large number of queries, this approach does not scale even to a fair number of queries. On the other hand, using our proximity detection method, hundreds of millions of queries run in a minute on a single computer.¹

¹ As a secondary difference with the continuous range query, in current applications of the proximity detection query each object is in a relationship with a tiny subset of interest, and not with all moving objects.
[70] processes static range queries efficiently by comput- ing safe regions that minimize location updates. Query results do not change as long as every object stays in its associated safe area. [37] utilizes the safe region concept to process moving range queries. The primary focus there is on computation-efficient processing of static rather than moving objects. None of these work focus on min- imizing the communication cost for processing moving range queries over moving objects which –despite the described computational overhead for adaptation– is the closest spatial query to the one we address here. 2.1.2 Proximity Detection Query Preamble: Processing spatial queries over moving objects requires a location tracking policy to locate objects. Location information is transmitted either through a source initiated message, i.e., a location update, or a destination (e.g., server) initi- ated message, i.e., a probe. Existing proximity detection methods focus and differ on how they send source-initiated messages. As such, we first explain different location update policies and then review current approaches to solve proximity queries. We finally explain how our work fits in the literature. Various location update policies can be used to help track objects: periodic, distance-based, zone-based position update, and dead reckoning [44, 105]. The first two policies require the object to transmit location update messages at fixed time intervals or upon traversing fixed distance blocks. Such methods are naive and barely 11 communication-efficient since real-world objects follow a dynamic rather than uni- form movement pattern. Dead reckoning and subsequent proximity detection meth- ods such as [119] require additional information on the movement patterns of objects. Real movement parameters are typically not available and should be replaced by pre- defined values, resulting in inefficient communication [128]. Most proximity detec- tion methods, including current state-of-art methods, utilize the zone-based position update policy in which an object notifies the server once it leaves a geographic zone. Such a zone (or region) is defined with regards to the queries and can be either static or moving. For the rest of this section, we explain proximity detection methods that use static and moving geographic zones to track objects. The proximity detection query was first introduced in [21] where objects actively process mutual queries and communicate in a distributed environment. For each pair involved in a query, a strip of width divides the space into two static half pages as safe regions. A query result remains unchanged as long as both objects stay in their safe regions. Each object keeps track of all its associated half pages and informs the corresponding object if it leaves a half page. Once informed, the other object then processes the query and sends back a new strip defining new safe regions for the two objects. Utilizing a distributed architecture does not scale to a fair number of objects and queries as an object needs to multicast its updated location information to fellow objects when it leaves multiple safe regions in a single move, resulting in huge growth in communication cost. In contrast, using a query processing server prevents 12 excessive communication as only the central server and not peer objects need to receive updated location information. Some proximity detection methods have used static regions together with a client-server architecture to locate objects approximately. 
In [118], a dynamically centered circle surrounds each object to serve as its safe region. The circles of two objects involved in a query are guaranteed to be farther apart than the associated proximity distance threshold. In a more recent method, the server divides the space into static grid cells [126]. Objects update the server when they leave a cell, helping the server answer many queries based on an object's current cell rather than its exact location. All objects involved in unanswered queries are probed, and the grid size is adaptively updated to reduce server processing time. The method is further extended to road networks in [127]. Keeping static approximate regions for moving objects results in frequent location updates sent by dynamic objects and is not communication-efficient [44].

Moving geographic zones have recently been used to track the location of objects. [44] proposes a vector-based update policy in which an object moving in a road network is encompassed in a moving circle and propagates a location update message when it leaves this region. Reactive Mobile Detection (RMD), the current state-of-the-art proximity detection method, utilizes the aforementioned policy and automatically tunes the size of the moving regions to reduce the communication cost while processing proximity queries [128]. Specifically, it reacts to a location update message by increasing the radius of the moving region according to a scale factor. Similarly, it reacts to a probe message by decreasing the radius.

Our proximity detection method differentiates itself from the above work in two aspects. First, we focus on probing rather than location updates to enable communication-efficient proximity detection. While a location update message is primarily initiated by the movement patterns of the moving object itself, deciding which objects to probe is within the control of the query processor and has potential for optimization. Moreover, unlike a source-initiated location update, a server-initiated probe is always followed by another message from the object, causing more communication cost to both the server and the object, and thus asking for special attention. As the second differentiating feature, while all existing works process queries individually, we consider them as a batch to further optimize probing.

2.2 Maximal Reverse Skyline Query

In this section, we review the related work under two main categories. First, we discuss the previous work on the optimal location problem. Thereafter, we present a review of the literature on multi-dimensional query processing using MapReduce.

2.2.1 Optimal Location

Among other variations of the optimal location problem, the multi-criteria optimal location (MCOL), a.k.a. multi-objective or multi-attribute optimal location, has been widely studied by researchers in the operations research (OR) community [45, 58, 59, 68, 72, 85, 115]. However, given the computational complexity of the MCOL problem, most of the existing solutions 1) resort to the use of heuristics that
Given similar scalability and accuracy issues with the existing solutions from the OR community for the general family of optimal location problems, recently the database community has shown interest in developing efficient and exact solutions for these problems. However, so far all proposed solutions from this community have focused on the basic (single-criterion) optimal location problem. In particular, Wong et al. [122] and Du et al. [54] formalized the basic optimal location problem as maximal reverse nearest neighbor (MaxRNN) query, and presented two scalable approaches to solve the problem in p-norm space (assuming L2-norm and L1-norm, respectively). Thereafter, Ghaemi et al. [61, 62, 63] and Xiao et al. [125] con- tinued the previous studies and proposed solutions for MaxRNN assuming network distance. Zhou et al. [134] presented an efficient solution to solve the extended MaxRkNN problem, which computes the optimal location where introducing a new site maximizes the number of objects that consider the new site as one of their k nearest sites. Zhang et al. [133] target a novel spatial query and name it as an optimal location problem. Specifically, given a set of object types (e.g., the set of types school, supermarket, and bus stop) and a set of groups where each group is a set of objects (e.g., group A consisting of school A, supermarket B, and bus stop A), the goal is to find a location (e.g., residential community A) where an aggregated cost function over all the groups is minimized (e.g., sum of distances to objects in group A is smaller than the sum of distance to all other groups). While using a 15 similar naming, the problem they addressed is different from the problem we discuss here. Recently, we have presented a solution to solve the MCOL problem over a few thousand objects and sites using a single machine [27]. To the best of our knowledge, we are the first to tackle the MCOL problem toward developing an efficient and exact solution that can scale with large datasets containing millions of sites and objects. 2.2.2 Multi-dimensional Data Processing using MapReduce The database community has been recently tackling data processing problems using the MapReduce framwork in order to meet the large size of real-world prob- lems. Efforts have been focused on enabling index structures and batch processing of queries over MapReduce mainly through framing existing established solutions. R-tree, as a hierarchical index structure, has been implemented in [35]. Voronoi diagrams, as a planar index structure, has been built over two dimensional data and then used to solve range and kNN queries in [17]. MD-HBase builds k-d tree and quad-tree index structures on HBase, a datastore built on top of Hadoop which is an opensource implementation of MapReduce [100, 121]. Fundamental multi-dimensional queries including various forms of spatial joins have been adapted to MapReduce. Range queries over trajectory data are processed in [93]. k nearest neighbor, k nearest group, earth mover’s distance similarity, and multi-way spatial join are among recent work on implementing variations of join operators over MapReduce [66, 71, 92, 130, 131]. Skyline and reverse skyline queries are implemented in [104]. The main challenge addressed in many of these work is the high communication cost of shuffling which is caused by having to copy a large 16 dataset, e.g., one side of the join operation, several times to different reducers or mappers accross the pipeline. 
The common idea is to apply a pruning heuristic to partition the data and avoid joining all data elements of the two (or more) datasets. This reduces the I/O cost and consequently the run time of the algorithm.

Recently, software systems have been introduced that are based on Hadoop and provide typical spatial indexes and queries. Hadoop-GIS is a software package that provides basic indexing and querying facilities on top of Hive, a data warehousing infrastructure for Hadoop [16, 117]. Perhaps the most notable work is SpatialHadoop [57]. Support for multi-dimensional data has been natively embedded in Hadoop by modifying its architecture at different layers. Basic spatial indexes and queries, computational geometry operations, and a high-level spatial query language are efficiently embedded in its storage, operations, and language layers [55, 56, 57].

None of the above works has addressed the MaxRSKY query. It is important to note that the reverse skyline query and MaxRSKY are two orthogonal problems. While for our focus problem (i.e., MaxRSKY) we can leverage any efficient solution for reverse skyline computation, as we show in Section 4.3 our main challenge is to identify a location which is on the reverse skyline set of a maximal number of objects.

Chapter 3

Scaling the Communication: The Proximity Detection Query

3.1 Introduction

One of the required steps toward answering continuous location-based queries over moving objects is to obtain the current (approximate and/or exact) locations of the objects from the moving objects. A well-known instance of such location-based queries is the proximity detection query (or proximity query, for short), which continuously monitors the proximity of two given moving objects in order to determine whether their distance is within a certain desired threshold at each time instant t. A proximity query notifies two users that share a common interest about their being physically close. The users might or might not know each other beforehand. Proximity queries are used, for example, in large-scale location-based social networking applications (such as foursquare™) to find the user's friends that currently happen to be in proximity of the user; to continuously identify and recommend suppliers of the items that a user may demand as she moves through an extensive urban area (e.g., the city of Los Angeles); and to recognize teammates or opponents of a player who reside in her proximity in massively multi-player online games (MMOGs) with thousands of users (e.g., PlanetSide-2™).

Figure 3.1 illustrates the need for obtaining approximate and (sometimes) exact locations of the moving objects to answer proximity queries. In this figure, u_1 and u_2 (black filled circles) are two query objects shown at a certain time instant t, ε_{1,2} is the proximity threshold between u_1 and u_2 (visualized as a circle with a dashed line), and C_1 and C_2 are the bounding circles for u_1 and u_2, depicting the current approximate locations of the two objects, respectively.

[Figure 3.1: Two objects involved in a proximity query.]

In a generic client-server proximity query processing system, the query processor (often a central server with all objects as its clients) is aware of the approximate location of each object (i.e., C_1 and C_2 in Figure 3.1), and whenever a moving object leaves its approximate location bound, the object sends a so-called location update to the query processor to update its current approximate location.
While in some cases knowing the approximate location of the objects is sufficient for the query processor to answer a proximity query at the corresponding time t, in many cases the query processor needs to further obtain the current exact location of one or both of the objects to be able to answer the query. For example, in Figure 3.1, knowing the approximate locations C_1 and C_2 of the two objects is insufficient for answering the query, because depending on where exactly u_1 and u_2 are located in their bounding circles, their actual distance might be less or more than the threshold ε_{1,2}. In this case, the query processor can obtain the exact location of u_1 by so-called probing the object u_1. As shown in the figure, knowing the exact location of u_1 is sufficient for the query processor to deduce that u_1 and u_2 are not in the ε_{1,2} proximity of each other at time t.

Obtaining the location of the moving objects is the most frequent procedure performed to process proximity queries, resulting in a major communication overhead which hinders developing scalable proximity query processing systems. Accordingly, in the literature a variety of communication-efficient proximity query answering solutions have been proposed to reduce the communication cost of obtaining object locations (see Section 2.1 for a review of these solutions). However, these solutions have almost unanimously focused on reducing the cost of "location updates" and have mainly ignored the cost of "probing". This is true despite the fact that these two operations are often both performed very frequently, while the cost of each probe is twice that of a location update (because it involves sending an additional request from the server to the client/object).

In this chapter, we focus on developing an optimized probing technique for efficient proximity query answering. Our query answering solution supports numerous queries at the scale of the aforementioned applications. To the best of our knowledge, this work is the first to consider probe optimization towards communication-efficient proximity query answering. It is important to note that our proposed technique is complementary to the existing solutions that focus on the orthogonal problem of location update optimization.

The main idea behind our proposed probe optimization technique is to consider proximity queries as a batch of queries (rather than individually). With this batch processing approach, we can derive the minimum set of probes needed to resolve all queries in the batch, which is often much smaller than the total number of probes required if queries are considered individually. To realize the benefit of batch processing, one should observe that with typical applications each object is often involved in several proximity queries. To further elaborate, see Figure 3.2, which illustrates a typical scenario in which none of the proximity queries q_1 to q_4 can be answered merely based on the approximate location information, and hence probing is required to resolve the queries.

[Figure 3.2: A batch of proximity queries; all queries can be answered by probing u_1 and u_3. (a) Proximity queries: q_1 = ⟨u_1, u_2⟩, q_2 = ⟨u_1, u_3⟩, q_3 = ⟨u_1, u_4⟩, q_4 = ⟨u_3, u_4⟩, each with threshold ε_{i,j} = 4. (b) Constellation of objects.]

The current state-of-the-art proximity query answering solutions address each query individually and end up probing all involved objects u_1 to u_4 (i.e., four probes) to resolve the queries. However, one can observe that in this case probing only u_1 and u_3 (i.e., only two probes) is sufficient to answer all queries.
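Because each object typically participates in several queries, choosing probees over the whole batch resembles a covering problem on the query graph of Figure 3.2. The greedy sketch below conveys that intuition only; it optimistically assumes that probing either endpoint resolves a query, and it is not the two-phase technique developed in this chapter (all names are ours).

```python
def greedy_probe_set(queries):
    """Greedily pick objects that appear in the most unresolved queries.
    `queries` is a list of (u, v) object-id pairs; returns a set of
    probees that touches every query."""
    unresolved = set(queries)
    probees = set()
    while unresolved:
        # Count how many unresolved queries each object participates in.
        counts = {}
        for u, v in unresolved:
            counts[u] = counts.get(u, 0) + 1
            counts[v] = counts.get(v, 0) + 1
        best = max(counts, key=counts.get)
        probees.add(best)
        unresolved = {q for q in unresolved if best not in q}
    return probees

# Figure 3.2's batch: q1=(1,2), q2=(1,3), q3=(1,4), q4=(3,4) -> {1, 3}
print(greedy_probe_set([(1, 2), (1, 3), (1, 4), (3, 4)]))
```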
However, deriving the minimum set of probes needed to answer a batch of queries (possibly containing millions of queries) is challenging. With a naive approach, one may attempt to consider all possible combinations/subsets of potential probes and determine the smallest subset that resolves all queries. However, such a naive approach is not practical considering 1) the potentially large number of queries in a batch, which leads to a humongous number of possible probe combinations, and 2) the real-time requirements of continuous proximity query processing, which allow only one time slot for processing a batch of queries at each time instant t.

To address this challenge, we propose a two-phase probe optimization technique that efficiently computes a near-minimal subset of probes which can resolve a given batch of queries. In the first phase, considering the proximity threshold and approximate object locations of each proximity query, we categorize every query in the batch to determine whether it requires probing, and if so, identify the probe(s) that with highest probability can resolve the query (if queries are considered individually). Subsequently, in the second phase, given the categorization of the queries from the first phase and considering the interdependencies between queries due to the query objects they share, we use a systematic probe selection approach that intelligently chooses an order of probes that minimizes the total number of probes required to resolve all queries in the batch. Our probe optimization technique is parallelizable, and with extensive experiments we show that it scales well to support hundreds of millions of queries, while on average it incurs 30% less communication cost than the best existing proximity query answering technique. This is a considerable improvement, since out of the two users involved in an unanswered query, at least one needs to be probed in order to solve the query.

The chapter is organized as follows. Section 2.1 reviews existing algorithms. The assumed system architecture and proximity query processing algorithm are described in Section 3.2. Section 3.3 defines the probing problem and presents an overview of our approach. Section 3.4 covers Phase I, where queries are classified and queries with a straightforward optimum probing decision are discovered. Section 3.5 explains Phase II, where the probing problem over a batch of unanswered queries is modelled as a decision process, and an optimal solution as well as a fast near-optimal solution are provided. We also briefly explain the optimal solution, which is computationally complex. Experimental results are presented in Section 3.6.

3.2 Proximity Query Processing System

In this section we provide an overview of our assumed query processing system. We start by defining the proximity query and then explain the assumed system architecture. We describe the query processing routine by first categorizing the transmitted messages and then providing optimizations towards communication-efficient query processing.

3.2.1 Proximity Query Definition

The proximity query q_ij is the problem of detecting whether a pair of mobile objects ⟨u_i, u_j⟩ are in ε-neighbourhood, i.e., closer to each other than the distance threshold ε_ij. A proximity query is satisfied if the two involved objects are closer than ε_ij distance units and unsatisfied otherwise, resulting in a numeric query result of 1 or 0, respectively. We say that u_i and u_j are in a proximity relationship iff
A proximity query is satisfied if the two involved objects are closer than ij distance units and unsatisfied otherwise; resulting in a numeric value of 1 or 0 as the query result. We say that u i and u j are in a proximity relationship iff 24 there exists a query q ij . Without loss of generality, we assume that any proximity relationship is mutual, meaning that q ij =q ji ≡q i,j ∧ ij = ji ≡ i,j For brevity of notation, we use instead of i,j when the pair of objects are known from the context. A set of proximity queries is called a batch,Q. All the m =|Q| queries in a batch need to be processed periodically in epochs, i.e., intervals of length ΔT. The sequenceh0, ΔT, 2ΔT,...i shows the start times for consecutive epochs. Probing a mobile object is defined as inquiring it for its exact mobility infor- mation. The target mobile object, aka probee, always replies back by providing its current location and velocity. We assume that no prior knowledge is available on movement trajectories of objects. Also, the communication infrastructure guarantees lossless message trans- mission; meaning that messages issued by a sender are retrieved correctly on the receiver side. Therearenregisteredobjectsinthesystem. Eachobjectisawareofitsmobility information. Specifically, u i knows its exact locationhu i .X(t),u i .Y (t)i. All objects measuretheirpositionsatthestartofeachepoch. Thevelocityoftheobject,u i .V (t), is calculated as the average speed between two subsequent positionings (i.e., epochs). 25 3.2.2 System Architecture and Query Processing We assume a client-server architecture to address a set of proximity queries. Objects have their exact mobility information and would unveil it to the server if requested to do so. The server processes the set of all queries and informs the objects about the results. For each object, the server stores the corresponding registered queries and a region showing the approximate object location. In order for the query results to be correct, the server needs to process all queries in each epoch and decide their satisfiability. For privacy reasons it is not attractive to involve objects in query processing. Efficient tracking of moving objects by the server calls for a tracking policy that has specific features. First, the server should maintain approximate rather than exactobjectlocations. Althoughhavingexactobjectlocationmakesqueryprocessing trivial, imposing such a requirement makes objects communicate their location very often. Second, similar to a mobile object, its associated region on the server needs to be moving as otherwise, the object moves out of the region frequently. This feature poses considering a separate customized region for each mobile object which by itself enables fine-tuning the region to mobility behaviour and proximity queries of that object. Approximate information on object locations inserts uncertainty into query processing. Guaranteeing correct query processing in presence of such uncertainty requiresobjectstocontributeandinformtheserveroncetheyleavetheirapproximate location; asotherwise, theserverwouldfollowanincorrectapproximationanddeclare false query results. 26 Figure 3.3: Mobile region for u i at time t. Similar to [44, 128], we follow a vector-based update policy to track objects. As shown in Figure 3.3, u i is surrounded by a virtual mobile region C t (u i ) which is a moving circle centred at c t (u i ) with radius λ t (u i ). 
3.2.2 System Architecture and Query Processing

We assume a client-server architecture to address a set of proximity queries. Objects have their exact mobility information and unveil it to the server if requested to do so. The server processes the set of all queries and informs the objects about the results. For each object, the server stores the corresponding registered queries and a region showing the approximate object location. In order for the query results to be correct, the server needs to process all queries in each epoch and decide their satisfiability. For privacy reasons it is not attractive to involve objects in query processing.

Efficient tracking of moving objects by the server calls for a tracking policy with specific features. First, the server should maintain approximate rather than exact object locations. Although having exact object locations makes query processing trivial, imposing such a requirement makes objects communicate their location very often. Second, similar to the mobile object itself, its associated region on the server needs to be moving, as otherwise the object would move out of the region frequently. This feature entails a separate customized region for each mobile object, which by itself enables fine-tuning the region to the mobility behaviour and proximity queries of that object.

Approximate information on object locations inserts uncertainty into query processing. Guaranteeing correct query processing in the presence of such uncertainty requires objects to contribute and inform the server once they leave their approximate location; otherwise, the server would follow an incorrect approximation and declare false query results.

[Figure 3.3: Mobile region for u_i at time t.]

Similar to [44, 128], we follow a vector-based update policy to track objects. As shown in Figure 3.3, u_i is surrounded by a virtual mobile region C_t(u_i), which is a moving circle centred at c_t(u_i) with radius λ_t(u_i). The circle moves with a nominal velocity V_t(u_i), which can differ from that of the object, u_i.V(t). For brevity, we omit the timestamp t when referring to information in a snapshot (thus λ_i and C_i instead of λ_t(u_i) and C_t(u_i)). Both the object u_i and the server are aware of C(u_i), while regularly only the object knows its own exact location and velocity. For privacy reasons, the server does not share the mobile region of u_i with any other object (including those involved in a query with u_i).

Within our client-server architecture, there are three situations in which communication is required to correctly process proximity queries over moving objects.

First, when an object leaves its mobile region, the server needs to correct the mobile region, as it is no longer a valid approximation. The object issues a (source-initiated) location update message to the server. This message includes the exact location ⟨u_i.X, u_i.Y⟩ and velocity u_i.V of the object, as well as its id i. The server then refreshes c(u_i) and V(u_i) to mirror these values. This event costs one message per object that leaves its mobile region.

Second, when the approximate location information of u_i and u_j is not sufficient to decide the satisfiability of q_{i,j}, the server makes an intelligent selection and probes one of the two objects for its exact location. The object then replies with its exact mobility information. In case the query cannot be solved solely by probing one object, the server probes the other one as well and answers the query based on the exact locations of both objects. We consider a cost of two for a probe message, knowing that it always entails a reply message from the object side. This event costs two or four messages per uncertain query. The cost, however, can be aggregated across queries, meaning that an object involved in multiple uncertain queries still needs to be probed, and to reply, only once to address all unanswered queries it is involved in. In this chapter, we focus on probe optimization and propose an algorithm that selects probees by looking at the whole batch of queries rather than each individual query.

The third situation that initiates client-server communication happens when the server observes that a query result has changed. In this case it notifies both objects involved in the query. This event costs one message per object per updated query result.

Following [128], we fine-tune the size of the mobile region to reduce the communication cost. The mobile region contracts and expands after each location update or probe message. We use a scale factor, α, to customize this amount. Upon contraction of C(u_i), λ_i ← λ_i/(2α), and upon an expansion, λ_i ← α·λ_i. Both the client and the server are aware of this rule and automatically update the mobile region without having to communicate.
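A minimal sketch of the mobile-region bookkeeping just described, applied symmetrically on both endpoints; the class and field names are ours, and the trigger directions (contract on probe, expand on location update) follow the reactive policy of [128] summarized in Section 2.1.2:

```python
import math

class MobileRegion:
    """Moving circle C(u_i) mirrored on client and server: center c,
    radius lam, and a nominal velocity that may differ from the
    object's true velocity."""

    def __init__(self, center, lam, velocity, t0):
        self.c0, self.lam, self.v, self.t0 = center, lam, velocity, t0

    def center_at(self, t):
        """The region drifts with its nominal velocity."""
        dt = t - self.t0
        return (self.c0[0] + self.v[0] * dt, self.c0[1] + self.v[1] * dt)

    def contains(self, pos, t):
        cx, cy = self.center_at(t)
        return math.hypot(pos[0] - cx, pos[1] - cy) <= self.lam

    def recenter(self, pos, velocity, t):
        """Server reaction to a location update: mirror the reported state."""
        self.c0, self.v, self.t0 = pos, velocity, t

    def contract(self, alpha):
        """Reaction to a probe: lam <- lam / (2 * alpha)."""
        self.lam /= 2 * alpha

    def expand(self, alpha):
        """Reaction to a location update: lam <- alpha * lam."""
        self.lam *= alpha

def client_tick(region, exact_pos, velocity, t):
    """Client-side check at the start of an epoch: emit a location update
    only if the object has left its mobile region."""
    if not region.contains(exact_pos, t):
        return {"type": "location_update", "pos": exact_pos, "v": velocity}
    return None
```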
Each probe message initiated during query processing involves twice communi- cation cost compared to a source initiated location update message. Moreover, as its name suggests, a source initiated message is mainly caused by movement patterns of the client, e.g. because the client took an exit on the right in the highway. This is while an intelligent server-side algorithm can solve all queries with less probes. Therefore, probing deserves special attention and would be our main focus in this chapter. Our probing algorithm consists of two phases. Having a set of queries, a fun- damental question is what subset of queries (and their corresponding objects) have a chance to affect the total number of probees. This question is answered in Phase I, 29 where we identify and analyse all possible categories for a query in terms of mutual object locations. In fact, these categories altogether form the state space for any proximity detection query at any time. We discover categories that are solvable without probing any objects and further exclude them from the batch. All queries classified as other categories will be further processed in Phase II. One interesting category includes those queries for which an optimal probing decision can be made definitely rather than probabilistically. For such queries, probing a specific object first would have no contribution in solving the query. This conclusion is made with- out considering any object movement patterns or location distribution. There are also categories of queries for which no trivial optimal probe choice is available. Phase I is explained in Section 3.4. Analysing each query individually as done in Phase I brings valuable insights into query processing. However, as illustrated in Figure 3.2, more optimized probing decisions can be made by considering all unanswered queries globally rather than individually. PhaseIIdealswithqueriesthatmightaffectotherqueriesinthebatch; meaning that probing some objects involved in some of these queries earlier than some others might help reduce the number of total probees. Two questions rise here. First, what are the exact queries that have a tied destiny in terms of affecting each other’s probing decision. Second, assuming a set of related queries whose answers are not known to the server, what is the best ordering to probe the involved objects so that the number of probees is minimized. These questions are addressed in Section 3.5. 30 Ideally, a probing strategy would start with simultaneously probing one of the two objects involved in every query since only the location of those two objects and no one else’s is important to answer that query. This way, all queries would be answered in at most two probe iterations, leaving no space for an ordering on probees. However, in the absence of knowledge on exact object locations, there is a potential to make better probing decisions by interacting with the objects and dividing the original problem into smaller ones that include smaller number of queries, thus dealing with less uncertainty. While our algorithm enables making a one-time decision to order all objects and apply all probe decisions in two iterations, we prefer to follow a longer iterative probing process that fits the criteria of ΔT time units before the next epoch starts. 3.4 Phase I: Query Categorization For any query q i,j , we assume without loss of generality that λ i >λ j meaning that the mobile region surroundingu i is larger than that ofu j . 
3.4 Phase I: Query Categorization

For any query q_{i,j}, we assume without loss of generality that λ_i > λ_j, meaning that the mobile region surrounding u_i is larger than that of u_j. Figure 3.4 shows that the real line can be partitioned into five non-overlapping intervals. For any query at any time, the real-valued distance d = dist(c_i, c_j) falls in one of these intervals, and the query is accordingly classified under the corresponding category. In this section we analyse each category in detail.

[Figure 3.4: Partitions of the real line and their analogous categories for the proximity query.]

For each u_i, consider two important circles with radii ε − λ_i and ε + λ_i that have a common center c_i, as defined in Section 3.2.2. Let I_i and O_i denote the inner area of the smaller circle and the outer area of the larger circle, respectively. In Figures 3.5 to 3.8, we illustrate these areas in gray, and accordingly they will be referred to as gray areas (related to C_i). What is in between, the annulus, is named the blind area (related to C_i). I_i and O_i have an interesting geometric characteristic, as described in the following lemma:

Lemma 3.1. Let I_i and O_i denote the inner area of the smaller circle and the outer area of the larger circle, respectively. The ε-neighborhood of any point in I_i (O_i) includes (excludes) the entire C_i.

Hence, if another object u_j falls in a corresponding gray area of C_i, the server can solve q_{i,j} by merely probing u_j, deducing q_{i,j} to be true if u_j is located in I_i and false if it happens to be in O_i. Conversely, when u_j resides in the annulus-shaped white area of C_i, i.e., the blind area, having the location information of u_i is a prerequisite to determining the status of the query.
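Lemma 3.1 turns directly into the test the server runs after a single probe; a minimal sketch (our names):

```python
import math

def resolve_by_single_probe(probed_pos, c_i, lam_i, eps):
    """Operational form of Lemma 3.1: once u_j's exact location is
    known, q_ij is decided without probing u_i iff the probed point
    lies in a gray area of C(u_i). Returns 1 if the eps-neighborhood
    of the point contains all of C(u_i) (point in I_i), 0 if it
    excludes all of C(u_i) (point in O_i), and None otherwise."""
    d = math.dist(probed_pos, c_i)
    if d < eps - lam_i:      # inner gray area I_i
        return 1
    if d > eps + lam_i:      # outer gray area O_i
        return 0
    return None              # blind area: u_i must be probed too
```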
The probability of an object residing in this subarea can be quantified. Assuming a uniform distribution for the location of an object within its mobile region, p_{ij}, the probability of successfully solving query q_{i,j} given that only u_i is probed, can be calculated as:

    p_{ij} = |S_i| / (π λ_i^2)

where |S_i| represents the area of the gray subarea S_i in Figure 3.6b. However, λ_i > λ_j entails that p_{ij} > p_{ji}, which means probing u_i is more likely to solve the uncertain query q_{i,j} than probing u_j. In other words, by hiding in the larger circle C(u_i), u_i inserts more ambiguity into the problem than u_j does, and disclosing that information contributes more (in probability) to solving the query. Thus, the chances are higher that probing u_i will solve the query.

Figure 3.6: There are chances the query is satisfied by probing only one object; u_i has a higher chance to solve the query. (a) Only if u_j is located in S_j can probing this object alone solve the query. (b) Only if u_i is located in S_i can probing this object alone solve the query.

Category 3: ε − λ_i + λ_j < d < ε + λ_i − λ_j

Figure 3.7 shows the case where C(u_j) completely sits in the blind area, meaning that no point in C(u_j), including ⟨u_j.X, u_j.Y⟩, can solely solve the query q_{i,j}. More specifically, for any point x ∈ C(u_j), there is a y ∈ C(u_i) that is inside the ε-neighborhood of x, and there is also a z ∈ C(u_i) that is outside of it. On the other hand, C(u_i) intersects both gray areas in Figure 3.7b, implying that there are some points in C(u_i) that are inside the ε-neighborhood of all points in C(u_j) (thus q_{i,j} = 1) and some other points in C(u_i) that are outside the ε-neighborhood of all points in C(u_j) (thus q_{i,j} = 0). Therefore, the server should definitely probe u_i first since, unlike for u_j, there is a positive chance of finding the answer immediately. The server might later need to probe the second object if the exact location of the first one cannot solve the query (thus, another definite decision).

Figure 3.7: C_j completely falls in the blind area; u_i must be probed first. (a) C_j completely falls in the blind area, so probing u_j cannot solely solve the query. (b) Parts of C_i fall under the gray area, so there are chances that probing u_i alone can solve the query.

Category 4: ε + λ_i − λ_j < d < ε + λ_i + λ_j

Similar to Category 2, there is no definite probing strategy for this case. As shown in Figure 3.8, there are some points in C(u_j) that are outside the ε-neighbourhood of all points in C(u_i) and vice versa (thus q_{i,j} = 0 for these points). Similar to our discussion under Category 2, λ_i > λ_j entails that p_{ij} > p_{ji}.

Figure 3.8: There are chances the query is unsatisfied by probing only one object; u_i has a higher chance to solve the query. (a) Only if u_j is located in S_j can probing this object alone solve the query. (b) Only if u_i is located in S_i can probing this object alone solve the query.

Category 5: d > ε + λ_i + λ_j

Figure 3.5b represents the case where any pair of points in the two circles are outside the ε-neighbourhood of each other. In other words, the minimum distance between the mobile regions is larger than the proximity threshold:

    minDist(C(u_i), C(u_j)) > ε_{i,j}

In this case, q_{i,j} is unsatisfied and the query result is 0.

All queries in Figure 3.2 belong to Categories 2 and 4. The probabilistic metric provided in this section minimizes the probing cost for solving an individual query.
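Since the five interval bounds depend only on the center distance d, the threshold ε_{i,j}, and the radii λ_i ≥ λ_j, Phase I classification reduces to a handful of comparisons per query. The following minimal sketch illustrates this (our own rendering; the function and variable names are hypothetical, not from the system described here):

    import math

    def dist(c_i, c_j):
        return math.hypot(c_i[0] - c_j[0], c_i[1] - c_j[1])

    def categorize(d, eps, lam_i, lam_j):
        # Assumes lam_i >= lam_j, as in Section 3.4.
        assert lam_i >= lam_j
        if d < eps - lam_i - lam_j:
            return 1   # solved without probing: q = 1
        if d < eps - lam_i + lam_j:
            return 2   # probabilistic choice; u_i is the likelier solver
        if d < eps + lam_i - lam_j:
            return 3   # definite choice: probe u_i first
        if d < eps + lam_i + lam_j:
            return 4   # probabilistic choice; u_i is the likelier solver
        return 5       # solved without probing: q = 0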
However, our main objective of minimizing the aggregated probing cost over all queries remains unsolved, as this analysis lacks a broad perspective on the whole batch of queries. In the next section we provide optimized probing algorithms for processing queries in Categories 2 to 4.

3.5 Phase II: Probe Ordering

Solving any of the queries forwarded to this phase requires probing one or both involved objects. Given a batch of unanswered queries in an epoch, our algorithm suggests an ordering for probing objects. We first model the received batch of queries as a graph in Section 3.5.1. Next, in Section 3.5.2, we discover related queries by partitioning the batch (i.e., the graph) into subsets where probing objects in each subset affects the total number of probees only in the affiliated subset. Subsets are processed independently, and new probees are introduced iteratively by applying our algorithm to each individual subset. In Section 3.5.3, we suggest two approaches for selecting objects as probees in each iteration. We first briefly describe a baseline approach that yields the optimal solution, which we then set aside due to its high computational complexity. We sacrifice the optimality of that approach in favour of the performance gained by a greedy alternative: in our Enhanced Probe Selection method, we avoid the complexity by assigning each object a value and picking the most valuable objects in order to reduce the probe cost (Section 3.5.3).

3.5.1 Data Structure

In order to capture how queries in Phase II affect each other, we need to maintain a global view of the whole batch of unanswered queries. We model the batch as a graph Γ = ⟨U, E, W⟩, where U = {u_1, ..., u_n} represents the set of n mobile objects as nodes. An unanswered query q_{i,j} is represented by an edge e_{i,j} ∈ E between u_i and u_j. W : U → {0, 1} tags each node with a binary value showing whether the server has decided to probe the object, i.e., a must-probe node, or this is not yet decided, i.e., a regular node of the graph.

3.5.2 Probe Ordering Algorithm

Algorithm 1 depicts our batch proximity query processing algorithm, run in each epoch. The server first receives the location update messages from objects. It then classifies queries in Phase I and filters out those that can be solved without issuing any probe messages. Next, in Phase II, the still-unanswered set of queries is modelled as a graph Γ.

Deciding to probe objects in some queries might affect the probing decision for objects in some other queries. One important question is what exact subset of queries have the potential to affect each other in terms of probing the involved objects. After probing an object, there might be affiliated queries that remain unsolved, mandating probes of other fellow objects and hence introducing them as new must-probe objects. The consecutive occurrence of this situation is what we call the cascade effect. In order to find the exact group of queries with tied destinies, we start from an arbitrary query and imagine the most extensive cascade. A cascade effect can cover a connected component of the graph, but not beyond that.

Recognizing the connected components of (a large) Γ reduces the probe decision problem to selecting from the nodes of a (smaller) single component. This enables more optimized probing decisions, since all unrelated objects are excluded from the available decision choices.
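As a concrete illustration of this data structure, the sketch below (our own minimal rendering, not the system's implementation) builds Γ as an adjacency map with must-probe tags and extracts its connected components by depth-first search:

    from collections import defaultdict

    def build_graph(queries):
        # queries: iterable of (u_i, u_j) pairs, one per unanswered query.
        adj = defaultdict(set)
        for u_i, u_j in queries:
            adj[u_i].add(u_j)
            adj[u_j].add(u_i)
        tags = {u: 0 for u in adj}   # W: 0 = regular, 1 = must-probe
        return adj, tags

    def connected_components(adj):
        # Each component can then be processed independently (Section 3.5.2).
        seen, components = set(), []
        for start in adj:
            if start in seen:
                continue
            stack, comp = [start], set()
            while stack:
                u = stack.pop()
                if u in comp:
                    continue
                comp.add(u)
                stack.extend(adj[u] - comp)
            seen |= comp
            components.append(comp)
        return components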
Moreover, concurrent probes in different components logistically conform to the limitations present in real-world applications: considering the network turnaround time, a fully sequential probing scenario involving a large number of objects is not operational within a short interval bounded by ΔT time units.

We now explain Algorithm 2, our iterative algorithm for addressing unanswered queries in one connected component. At each moment, the server needs to make a decision to select and probe objects from a component. Decision making is either trivial or non-trivial, depending on the tags of the nodes in the connected component. Given a connected component, the server needs to decide which node to probe first. This is a non-trivial decision, as probing a node has consequences: it might not solve some affiliated queries, and it also affects which nodes have to be probed next. We suggest two approaches in Section 3.5.3 that address the probe selection problem. After probing a node u_i, if an affiliated query q_{i,j} is still unanswered, the exact locations of both objects are required to solve the query. Consequently, u_j is tagged as must-probe and the server trivially schedules a probe for it during the next iteration, i.e., the recursive call to Algorithm 2 (line 25).

Ideally, the server continues to probe must-probe nodes in consecutive iterations until no remaining node has a must-probe tag. Only after these iterations does the server resume making a non-trivial probe selection (Section 3.5.3). However, following this method in large real-world applications may not be operational, as the network delay caused by consecutive probes may exceed the short duration of an epoch. We relax this problem by making two modifications to our original algorithm. First, we merge the trivial and non-trivial decision iterations and probe new must-probe objects together with those selected by a non-trivial decision in every iteration. The list of probees in iteration i is represented as l_i in Algorithm 2. Second, when making a non-trivial probing decision for a component, we select multiple objects rather than only one. The process can be considered as selecting the top-k most appropriate objects. The number of objects selected in each iteration is a system-defined parameter, proportional to the number of objects in the component and the length of an epoch, ΔT. These modifications are reflected in Algorithm 2, and we use this algorithm in our experiments.

After probing a node (line 7), we remove all its incident edges from the component and evaluate the affiliated queries (lines 10-18). Newly detected must-probe nodes are scheduled to be probed during the next iteration (line 20). Removing incident edges after a probe can break a component into smaller ones. The server detects the smaller connected components and processes them independently during the next iteration (lines 21-25). The initial call to Algorithm 2 is made through Algorithm 1, where l_1 is initialized to all nodes in Category 3. These nodes are then trivially probed in the first iteration.

Algorithm 1 Batch Query Processing
1: Algorithm Batch Query Processor (Q)
2: Input: A batch of proximity queries Q
3: Generates ordered list L = l_1, l_2, ..., l_s, where l_{1..s} are disjoint sets of probees selected during s iterations; ∪_{i=1..s} l_i is a subset of all objects
4: Receive location updates from objects and initialize the set U of objects whose exact locations are available
   ▷ Phase I
5: Classify queries according to the categories provided in Section 3.4
6: Process queries in Categories 1 and 5 and exclude them from Q
   ▷ Phase II
7: Create the graph Γ = ⟨U, E⟩ from the batch Q
8: i ← 1
9: Find connected components of Γ
   ▷ Parallel execution for every component Γ_j ⊂ Γ
10: for Γ_j ⊂ Γ do
11:     if |Γ_j| > 1 then
12:         Insert into l_{1j} any object from Γ_j that satisfies the condition in Category 3
13:         ProcessSubgraph(Γ_j, i, l_{ij})
14: Notify objects

3.5.3 Probe Selection Approach

Deciding to select a user for a probe in the presence of uncertainty about exact user locations can be well modelled by a Markov Decision Process (MDP) [75]. We now briefly explain how the MDP framework maps to our problem but cannot be practically applied to it. The (A) decision maker taking (B) actions to interact with the (C) environment translates to the (A′) server (B′) probing objects to address the (C′) set of unanswered queries over users. A state of the environment is represented as a graph Γ, as explained in Section 3.5.1. The number of iterations h in a finite-horizon optimal behaviour maps to the total number of objects, n. Using available learning algorithms such as backward induction, the server automatically finds the optimal behavioural strategy, dubbed a policy, probabilistically, and selects proper probees. An optimal policy is guaranteed to exist since both the state and action spaces are finite in our representation [28]. Iterative policy learning algorithms for this non-deterministic environment take polynomial time per iteration in terms of the number of states. As the encoding of a state, i.e., Γ, is exponential in the number of objects, each iteration of these algorithms runs in pseudopolynomial time, and thus not in polynomial time in terms of the number of objects. Hence, this approach does not scale to even a fair number of users and queries.

The above baseline approach follows a dynamic programming approach and regards the immediate as well as the delayed (i.e., future) consequences of a probe. To avoid the complexity of the optimal solution, our Enhanced Probe Selection (EPS) method follows a greedy approach and adheres to the immediate effect of probing each object. To ease computation, EPS myopically approximates the immediate number of solved queries and ignores the subsequent consequences. This comes at the cost of losing the optimality of our probing algorithm. We prioritize the objects according to the value assigned to them by the Value function. Put in the context of our MDP approach, EPS approximates the optimal value functions by roughly calculating immediate rewards. We prioritize objects by calculating their value:

    Value(u_i) = λ_i/λ_v + ... + λ_i/λ_w

where u_v, ..., u_w are the objects in Γ that are involved in a query with u_i. This function considers the number of unanswered queries as well as the relative size of mobile regions. An object involved in many unanswered queries becomes a good probe candidate according to our proposed function. The rationale is that the more unanswered queries an object is involved in, the higher the chances that at least one such query demands probing both objects; thus, probing such an object might prevent probing some other objects. Another case is when the server's knowledge of an object's location is too approximate. An object having a larger mobile region compared to its fellow objects causes more ambiguity, as we discussed in Phase I, and is thus a better choice for probing.
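A minimal sketch of the EPS prioritization (our own illustration over the graph of Section 3.5.1; the names value and top_k_probees are hypothetical, and the Value formula is transcribed from above):

    import heapq

    def value(u_i, adj, lam):
        # Sum of lambda_i / lambda_v over all neighbors u_v of u_i:
        # grows with the number of unanswered queries u_i is involved
        # in, and with the relative size of u_i's mobile region.
        return sum(lam[u_i] / lam[u_v] for u_v in adj[u_i])

    def top_k_probees(component, adj, lam, k):
        # Greedy, myopic selection of the k most valuable objects
        # in one connected component (the non-trivial decision).
        return heapq.nlargest(k, component,
                              key=lambda u: value(u, adj, lam))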
Analytical Evaluation: The EPS approach is proposed due to the computational complexity of the baseline solution. The main task here is to sort objects based on their calculated values. This takes O(n log n + m) in the worst case, making it perform much faster than its optimal counterpart.

3.6 Empirical Evaluation

3.6.1 Experiment Methodology

Experiments were run on dual-core AMD Opteron nodes with 2.0 GHz CPUs and 4 GB RAM. For scalability tests including large numbers of queries, dual-core Intel Xeon nodes with 2.0 GHz CPUs and 64 GB RAM were used. Each experiment was run on a single node operated by Linux.

We used the network-based moving object data generator [34] to generate the movement of users during 100 epochs. The duration of each epoch, ΔT, equals one minute. The data generator allows us to set the maximum speed limit for users and adjusts the speed based on the density of users as well as the speed limits of the underlying street or highway.

Experiments were run on the San Francisco Bay Area road network. This region has an approximate area of 7000 sq. miles with an original population of 7 million people [4]. The network covers a spatial domain of [0, 98]^2 miles, including the highways and streets of nine counties in the Bay Area. According to a recent estimate by the Department of Motor Vehicles, more than 22 million automobiles were registered in this area during 2011 [5]. All this information makes the area a good candidate for testing location-based applications.

We focused on a friend-finder application as the mainstream application in our experiments and set the parameters accordingly. Table 3.1 shows the default values for the parameters: 400,000 users are in relationships with an average of 10 close friends or family members while commuting at a speed of 8 meters/sec., which is approximately 17.7 MPH, i.e., the average speed in the streets of the Bay Area [9]. The users would like to be notified if their close friends enter or leave their ε-neighborhood of 2 miles.

Table 3.1: Summary of Parameters.

    Parameter                            Default    Range
    Number of users                      400K       50K - 800K
    Speed limit (meters/sec.)            8          1.5 - 27
    Avg. queries/user                    10         5 - 800
    Proximity distance (miles)           2          0.1 - 40
    Initial mobile region radius λ       0.4        0.01 - 10
    Scale factor α                       1.6        1.01 - 16
    Algorithm                            EPS        RMD, EPS

The communication cost of our proximity detection algorithm was evaluated against a wide range of potential applications and audiences by varying different parameters. We also implemented the current state-of-the-art method, RMD, and compared it to our method [128]. For fairness, we ran experiments to find proper equivalent internal parameter values for their method on the given road network. Results show that our method outperforms RMD in all settings and saves more than 30% in location tracking cost compared to this closest counterpart.

3.6.2 Results

Location Tracking Cost Evaluation

Tracking cost, as the major communication overhead, is composed of location update messages initiated by users' moving dynamics, and server-side probe messages, which are followed by a response from users providing their exact location. Figure 3.9 shows the tracking cost breakdown at the default setting. With almost similar location update costs, our Enhanced Probe Selection approach, EPS, incurs much less probing cost in each epoch compared to RMD.

Figure 3.9: Cost breakdown at the default settings.

Figure 3.10 shows the average tracking cost with respect to various parameters. Figure 3.10a shows the effect of proximity distance on the tracking cost.
For very small and very large values, most queries are easily filtered, since most pairs are relatively far apart or close enough not to trouble the server. However, for intermediate values, the chances increase that a pair falls in the fuzzy margin between the two bounds, thus increasing the communication cost.

The scalability of our algorithm was tested against large numbers of users and queries. We modelled applications with hundreds of millions of queries in Figure 3.10b, in which each user shares interests with hundreds of other users in the city. Figure 3.10c shows the tracking cost versus the number of users. In all cases our method incurs at least 30% fewer messages to solve proximity queries compared to RMD. The effect of commute speed is analysed in Figure 3.10d. Walking users, and those driving at the average and maximum speeds in the streets and highways of the Bay Area, were modelled in our experiments according to available speed statistics [9, 96].

The sensitivity of our approach to internal parameters was analysed by varying the initial mobile region radius, λ, and the scale factor, α. Figure 3.10e depicts that the tracking cost is barely affected by significant changes in the initial λ. This stability is because our method fine-tunes the mobile region for each user separately: it reacts abruptly to the probes and location updates caused by the user's movement behaviour and the queries the user is involved in. Figure 3.10f shows the effect of the scale factor on tracking cost. It can be observed that for a wide range of α values, the communication cost of EPS is less than that of RMD in its best case.

Effect of Category 3

An optimal decision can be made in a definite way when a query belongs to this category. Thus, the more unanswered queries belong to this case, the closer the algorithm is to the minimum probing cost. Table 3.2 shows the average percentage of users that satisfied this situation in an epoch compared to the total number of users involved in unanswered queries (Categories 2 to 4). Even for small numbers of queries (e.g., a limited number of friends in a friend-finder application), a considerable portion of users were involved in Category 3. Probing these users early in query processing addresses many other unanswered queries accordingly.

Table 3.2: Effect of Category 3 on Probing Users.

    Queries/user    Involved users probed through Category 3
    5               10.70%
    10              13.29%
    25              17.42%
    50              20.31%
    100             22.6%
    200             24.44%
    400             26.08%
    600             27%
    800             27.62%

CPU Usage

Figure 3.11 shows the computation cost of our algorithm. Processing queries for a large number of users in a friend-finder application takes no more than a couple of seconds in each epoch (Figure 3.11a). The computation cost for the scalability tests is shown in Figure 3.11b. We simulated an application with 240 million queries, and the single-threaded implementation of our algorithm managed to process all queries within the one-minute ΔT duration. For processing larger numbers of queries we hit hardware limitations and had to sacrifice CPU cycles to make the experiments feasible within the given memory limits.

Algorithm Iterations

According to Algorithm 2, the users having the maximum values in each connected component are probed for their exact location. In order to expedite query processing, a larger number of users can be chosen from each connected component (line 6).
For experiments including large numbers of queries, in each iteration we selected from each component the top 1% of users having the highest values. Our experiments show that even with a large number of queries, the algorithm ran no more than seven iterations to finish query processing in each epoch. This is due to the fact that after each iteration, some edges of a component are removed (i.e., queries are solved), being replaced by multiple smaller components that unleash more potential for parallel probing and processing.

Algorithm 2 Processing an Independent Batch in Parallel
1: Algorithm ProcessSubgraph (Γ_j, i, l_{ij})
2: Input: Connected component Γ_j ⊂ Γ, probing iteration i, confirmed list of objects to be probed in this iteration
3: Updates sublist l_{ij} of probees and initiates the next iteration
4: repeat
5:     Use a Probe Selection Approach (Section 3.5.3) to rank objects and select one to be inserted into l_{ij}
6: until a system-defined percentage of objects are selected
7: Probe all objects in l_{ij}
8: l_i ← l_i ∪ l_{ij}
9: Receive exact locations. U ← U ∪ l_{ij}
   ▷ Post-probe query evaluation
10: mustProbeList ← ∅
11: for u_p ∈ l_{ij} do
12:     for ⟨u_p, u_q⟩ ∈ Γ_j do
13:         if the query can be solved using exact locations
14:         in U then
15:             Remove ⟨u_p, u_q⟩ from Γ_j
16:         else
17:             Insert u_q into mustProbeList_j
18: Remove edgeless nodes (objects) from Γ_j
19: i ← i + 1
20: l_i ← l_i ∪ mustProbeList_j
21: Find connected components of Γ_j
   ▷ Parallel execution for every component in Γ_j that includes unanswered queries
22: for Γ_{jk} ⊂ Γ_j do
23:     if |Γ_{jk}| > 1 then
24:         Insert into l′_{ik} any object from Γ_{jk} that is in mustProbeList_j
25:         ProcessSubgraph(Γ_{jk}, i, l′_{ik})

Figure 3.10: Object tracking cost (location update + probe messages). (a) Epsilon. (b) Avg. relationships/user. (c) Objects. (d) Speed limit. (e) Tuning lambda. (f) Tuning alpha.

Figure 3.11: Server-side computation cost. (a) Users. (b) Queries.

Chapter 4

Scaling the Computation: The Maximal Reverse Skyline Query

4.1 Introduction

The problem of "optimal location" is a common problem with many applications in spatial decision support systems and marketing tools. In this problem, given a set S of sites and a set O of objects in a metric space, one must compute the optimal location where introducing a new site maximizes the number of objects that would choose the new site as their preferred site among all sites. For instance, a city planner must solve an optimal location problem to answer questions such as "where is the optimal location to open a new public library such that the number of patrons in the proximity of the new library (i.e., the patrons that would perhaps prefer the new library as their nearest library among all libraries) is maximized?".
An important limitation of the existing solutions for the optimal location problem stems from a common simplifying assumption that there is only one criterion for determining the preferred site for each object, namely, the metric distance between objects and sites. In other words, the preferred site for an object is always assumed to be the closest site to the object. However, there are numerous real-world applications in which one needs to consider multiple criteria (possibly including distance) to choose the most preferred site for each object. The extension of the optimal location problem which allows for using multiple criteria in selecting the preferred site for each object is termed Multi-Criteria Optimal Location (or MCOL, for short).

For an instance of the MCOL problem, consider the following market analysis application. In order to decide on the ideal specifications of its next product to be released, a laptop manufacturer wants to identify the most preferred/desired combination of laptop specifications in the market. For example, the current most preferred combination of laptop specifications could be <5 lb, 8 GB, 2.3 GHz, 14 in>, where the numbers stand for the weight, memory capacity, CPU speed, and display size of a laptop, respectively. One can formulate this problem as an MCOL problem, where each site represents an existing laptop product in the market with known specifications, and each object represents a buyer in the market with known preferences for the specifications of his/her desired laptop (the preferences of the buyers can be obtained, for example, by collecting and compiling their web search queries). In this case, laptop specifications (i.e., weight, memory capacity, CPU speed, and display size) are the criteria that objects (buyers) use to determine their preferred site (laptop). Accordingly, by solving this MCOL problem, the manufacturer can identify the specifications of a new laptop product (i.e., the new site which is optimally located) such that the number of potential buyers is maximized. Similarly, a cell phone company can identify the features (e.g., the monthly voice service allowance in minutes, text service allowance in number of text messages, and data service allowance in GB) of a new cell phone plan that would attract the largest number of potential subscribers with different usage statistics.

Real-world MCOL problems appear at large scale. Amazon reported more than three hundred million active customers in 2015 [1, 2]. A search on the website for a typical keyword such as "perfume" currently returns about one hundred and fifty thousand items, where each item has a different size, price, gender specificity, intensity (i.e., concentration of fragrance and alcohol), type, etc. With a global market size of forty billion dollars, a fragrance producer would gain valuable insight into the features of a potential new fragrance by solving an MCOL problem involving a large number of users as objects [10]. A similar story holds for many other industries, such as the footwear industry, on which Americans spent more than twenty billion dollars in 2015 and for which a search on the Amazon website returns more than one and a half million results [11]. In a separate scenario, a world-renowned chain restaurant or laptop manufacturer can benefit from batch processing of many MCOL problems, each pertaining to a different region where other local competitors are active. It is therefore critical for any solution to an MCOL problem to scale to large numbers of objects and sites.
While the MCOL problem has previously been studied in the operations research community, the existing solutions for this problem are not only approximate solutions without any guaranteed error bound but also, more importantly, unscalable solutions that merely apply to very small (i.e., hundreds of) object and site datasets (Section 2.2.1 reviews such solutions); hence, they are inapplicable to real-world applications. In this chapter, for the first time we focus on developing an efficient and exact solution for MCOL that can scale to large datasets containing millions of objects and sites.

Toward that end, we first formalize the MCOL problem as the maximal reverse skyline query (MaxRSKY). Given a set of sites and a set of objects in a d-dimensional space, the MaxRSKY query returns a location in the d-dimensional space where, if a new site s is introduced, the size of the (bichromatic) reverse skyline set of s is maximal. To the best of our knowledge, this chapter is the first to define and study the MaxRSKY query at large scale. Second, we develop a baseline solution for MaxRSKY which derives an answer for the query by 1) computing the skyline set and the corresponding skyline region for every object (the skyline region for an object is a region where, if a new site is introduced, it will become a skyline site for the object), and 2) for each skyline region computed for an object, overlapping all regions to identify the maximum overlap region (i.e., the region where the largest number of skyline regions intersect). One can observe that among all maximum overlap regions identified for all skyline regions, the one with the largest number of overlapping regions is where, if a new site is introduced, its reverse skyline set is maximal. We call this region the maximal overlap region. Our baseline solution illustrates the intrinsic computational complexity of the MaxRSKY query and shows that the dominating cost of computing an answer for MaxRSKY is due to the second step, i.e., maximal overlap region computation.

Accordingly, in order to reduce the cost of overlap computation, we propose a filter-based solution, termed the basic approach, which effectively reduces the search space for maximal overlap region computation. The basic approach achieves efficiency by 1) prioritizing maximum overlap computation for skyline regions by considering each region's potential of including the maximal overlap region, and 2) avoiding redundant maximum overlap computation for the regions that cannot possibly include the maximal overlap region. Experiments show that the basic approach computes answers for MaxRSKY queries with datasets containing thousands of sites and objects in a two-dimensional space on a single machine [27].

While the above solution is more efficient than any other existing algorithm, in this chapter we make two further contributions in order to scale to the large size of real-world problems, thus scaling the number of objects and sites from thousands to millions. First, given that only highly efficient algorithms can manage to run against large datasets in a reasonable amount of time, we propose a solution, termed the multi-granular grid-based approach, which uses a grid (rather than the skyline regions themselves) for region prioritization. The size of the grid cells is dynamically tuned depending on the concentration of the skyline regions in the space, thus achieving both data-awareness and data-independence in processing the search space.
Second, the presence of various independent computations in our algorithm makes it a perfect match for a parallelized framework such as MapReduce. Thus, we extended our centralized solution over the MapReduce framework. Computing skyline and overlap regions, as well as filtering them, is done in parallel for different objects, skyline regions, and grid cells in our MapReduce-based basic and grid-based solutions. Simple adoption of the MapReduce framework does not achieve the maximal performance boost for our solution, as the execution time of a MapReduce task is highly affected by its longest-running Mapper or Reducer. We modified our solution to balance the load across Reducers, so the execution time is further reduced. Our extensive empirical analyses show that our MapReduce-based implementation is invariably efficient in computing answers for MaxRSKY queries with large datasets containing millions of sites and objects.

The remainder of this chapter is organized as follows. Section 2.2 reviews the related work. Section 4.2 formally defines the MCOL problem and formalizes it as the MaxRSKY query. In Section 4.3, we present a baseline solution over MapReduce, and following that, in Sections 4.4 and 4.5, we present our proposed solutions for scalable computation of the MaxRSKY query. Section 4.6 evaluates our proposed solutions via experiments. Finally, we conclude the work and discuss our directions for future research in Chapter 5.

4.2 Problem Definition

In this section, we first formally define the problem of Multi-Criteria Optimal Location (MCOL). Then, we formalize this problem as the maximal reverse skyline query (MaxRSKY).

4.2.1 Multi-Criteria Optimal Location (MCOL)

Suppose we have a set S of sites s = (s^1, s^2, ..., s^d), where s^i is the value of the i-th attribute of the site s, as well as a set O of objects o = (o^1, o^2, ..., o^d) in the same d-dimensional space, where o^i indicates the preference of o on the i-th attribute. For example, considering our laptop market analysis example from Section 4.1, each laptop is a site with four attributes, namely, weight, memory capacity, CPU speed, and display size. Similarly, each potential buyer is represented by an object with four preferences corresponding to the four aforementioned attributes. Figure 4.1 illustrates six sites/laptops s_1 to s_6, each characterized by two attributes, weight and memory capacity (for simplicity of presentation, hereafter we consider a 2-dimensional space without loss of generality). In the same figure, three objects/buyers o_1 to o_3 are shown by indicating their preferences on the weight and memory capacity of laptops in the same 2-dimensional space.

Figure 4.1: Example site and object datasets in 2-dimensional space.

Accordingly, we define the MCOL problem as follows. Given a set S of sites with d attributes and a set O of objects with d preferences corresponding to the same attributes, the multi-criteria optimal location problem seeks a location/region (or set of locations/regions) in the d-dimensional space such that introducing a new site in this location maximizes the number of objects that each consider the new site among their set of "preferred" sites. Intuitively, a site s is a preferred site for object o if, given the preferences of o, there is no other site s′ in S that is more "preferred" by o as compared to s; in other words, for an object o we say a site s is more preferred than a site s′ if, considering its preferences collectively, o has no reason to choose s′ over s.
For example, in Figure 4.1 the set of preferred sites for the object o_1 is {s_2, s_3}; note that while s_2 and s_3 are not preferred over each other for o_1, they both are preferred over all other sites s_1, s_4, s_5, and s_6.

4.2.2 Maximal Reverse Skyline (MaxRSKY)

In this section, for the sake of self-containment, we first review the formal definitions of the dynamic skyline query and the bichromatic reverse skyline query. Thereafter, we define the maximal reverse skyline (MaxRSKY) query, which is equivalent to and formalizes the MCOL problem.

DEFINITION 1 (DYNAMIC SKYLINE QUERY): Given a set S of sites with d attributes and a query object o in the same d-dimensional space, the dynamic skyline query with respect to o, termed DSL(o), returns all sites in S that are not "dominated" by other sites with respect to o. We say a site s_1 ∈ S dominates a site s_2 ∈ S with respect to o iff 1) for all 1 ≤ i ≤ d, |s_1^i − o^i| ≤ |s_2^i − o^i|, and 2) there exists at least one j (1 ≤ j ≤ d) such that |s_1^j − o^j| < |s_2^j − o^j|.

For example, as shown in Figure 4.2, the skyline set for the object o_1 is DSL(o_1) = {s_2, s_3}. Note that s′_2 and s′_6 are proxies of the sites s_2 and s_6 transformed to the first quadrant with respect to the reference point o_1.

DEFINITION 2 (BICHROMATIC REVERSE SKYLINE QUERY): Let S and O be the sets of sites and objects in a d-dimensional space, respectively. Given a query site s ∈ S, the bichromatic reverse skyline query with respect to s returns all objects o ∈ O such that s is in the dynamic skyline set of o, i.e., s ∈ DSL(o).

For instance, in Figure 4.1 the reverse skyline set of s_2 is {o_1}, because DSL(o_1) = {s_2, s_3}, DSL(o_2) = {s_1, s_5}, DSL(o_3) = {s_6}, and therefore s_2 belongs only to DSL(o_1).

DEFINITION 3 (MAXIMAL REVERSE SKYLINE QUERY (MaxRSKY)): Let S and O be the sets of sites and objects in a d-dimensional space, respectively. The MaxRSKY query returns a location in this d-dimensional space where, if a new site s is introduced, the size of the (bichromatic) reverse skyline set of s is maximal.

Figure 4.2: SSR region for object o_1.

It is easy to observe that the MaxRSKY query and the MCOL problem are equivalent, because maximizing the reverse skyline set of the newly introduced site s equivalently maximizes the number of objects whose sets of preferred sites include s. In this work we present a scalable solution to the MaxRSKY query over the MapReduce framework. Table 4.1 shows the common notations used in this chapter.

Table 4.1: Common Notations.

    O       Set of objects
    S       Set of sites
    d       Number of dimensions in the space
    r       An SSR region
    L(r)    List of SSRs that overlap with a region r
    G       Set of grid cells
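Definitions 1 and 2 translate directly into code. The following minimal sketch (our own illustration; the names dominates, dynamic_skyline, and reverse_skyline are hypothetical) computes DSL(o) by discarding dominated sites and then checks the reverse skyline membership of a given site:

    def dominates(s1, s2, o):
        # s1 dominates s2 w.r.t. o (Definition 1): s1 is at least
        # as close to o on every attribute and strictly closer on
        # at least one. Sites and objects are d-tuples of numbers.
        at_least_as_close = all(abs(a - c) <= abs(b - c)
                                for a, b, c in zip(s1, s2, o))
        strictly_closer = any(abs(a - c) < abs(b - c)
                              for a, b, c in zip(s1, s2, o))
        return at_least_as_close and strictly_closer

    def dynamic_skyline(sites, o):
        # DSL(o): all sites not dominated by any other site w.r.t. o.
        return [s for s in sites
                if not any(dominates(t, s, o)
                           for t in sites if t is not s)]

    def reverse_skyline(sites, objects, s):
        # Bichromatic reverse skyline of s (Definition 2).
        return [o for o in objects if s in dynamic_skyline(sites, o)]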
4.3 Baseline Solution

Central to the solution for maximizing the reverse skyline is the concept of the Skyline Search Region (SSR) [51]. The skyline search region for object o (or SSR(o)) is the part of the data space containing points that are not dominated by any of the skyline sites of the object o. For instance, considering the running example in Figure 4.2 with skyline points {s_2, s_3} for object o_1, the skyline search region SSR(o_1) is the shaded area bounded by the skyline points and the two axes. Note that an SSR does not include the skyline points themselves, since a skyline point does not dominate itself.

LEMMA 1. (see [51] for proof) For a given object point o, let DSL(o) be the set of dynamic skyline sites for o. Let q be a query point. If q ∈ SSR(o), then o is in the reverse skyline of q.

Accordingly, we propose our two-step baseline solution for the maximal reverse skyline query as follows:

1. Compute the dynamic skyline set DSL(o) ⊆ S of sites for each object o ∈ O. Subsequently, construct the corresponding SSR for each object o ∈ O. This step produces |O| regions.

2. Intersect the SSRs constructed in the previous step to compute the maximal reverse skyline region. Given |O| SSRs, this step involves three parts:

(a) Perform a spatial join over the set of SSRs. For each SSR, there are O(|O|) other SSRs that overlap with it in the worst case. Every such SSR is examined to identify all the intersection points between the two SSRs.

(b) For each intersection point, the number of SSRs that overlap the point is calculated.

(c) The point(s) hosting the maximum number of overlapping SSRs is returned as the answer.

However, the proposed baseline approach suffers from computational complexities that render its use impractical given the often large sizes of object and site datasets. The dominating cost is incurred in the second step, where we identify the location that hosts the maximal number of overlapping SSRs. This is because:

1. A high quantity of SSRs is present given the scale targeted in our problem.

2. The approach excessively performs two basic operations involved in processing the MaxRSKY query. The first is identifying all the intersection points introduced per pair-wise SSR intersection operation. The second is finding all SSRs that overlap each such intersection point by running a point query. Both operations are time-consuming, especially because SSRs are concave d-polytopes, i.e., ugly shapes with many vertices.

Discussion: The complexity of the baseline solution equals

    |O| × O(|O|) × (cost_id + k × cost_pq) ≡ O(|O|^3)    (4.1)

where cost_id is the cost of identifying all the intersection points between two SSRs, k is the greatest number of intersection points between two SSRs, and cost_pq is the cost of a point query, i.e., finding the count of SSRs that overlap a specific intersection point. cost_id + k × cost_pq is a costly operation, as described in item 2 above. We implemented an R-Tree over MapReduce to index all SSRs in order to expedite the spatial join and also the point queries [35]. However, experiments show that SSRs are not uniformly distributed in the space, and running a range query to find all SSRs overlapping a range is still of O(|O|) complexity. As a result, the cost of the baseline approach is cubic in the number of objects, i.e., O(|O|^3).

The term k is correlated with two factors: the number of dimensions and the number of sites in the DSLs. Each DSL includes only a tiny subset of the sites in S, because each site s ∈ DSL(o) dominates a group of other sites in S. In general, the number of sites, i.e., |S|, only contributes to building the DSLs in step 1 and does not play a considerable role in increasing the cost of the MaxRSKY query. In fact, an excessive number of sites results in shrunken and well-shaped SSRs and expedites MaxRSKY query processing (see Section 4.6 for further details). Scalable methods for finding DSLs have been discussed in Section 2.2.2 and are not our focus in this research.
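Via Lemma 1, the point query of step 2(b) (cost_pq in Equation 4.1) reduces to dominance tests against each object's skyline set: a point q lies in SSR(o) exactly when no site in DSL(o) dominates q with respect to o and q is not a skyline point itself. A minimal sketch reusing the dominates predicate from the previous sketch (names our own):

    def in_ssr(q, o, dsl):
        # q is inside SSR(o) iff q is not a skyline point and no
        # skyline site of o dominates q w.r.t. o.
        if q in dsl:
            return False
        return not any(dominates(s, q, o) for s in dsl)

    def count_overlaps(q, objects_with_dsl):
        # Point query of step 2(b): the number of SSRs covering q.
        return sum(1 for o, dsl in objects_with_dsl
                   if in_ssr(q, o, dsl))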
Given the above discussion, performing the MaxRSKY query at large scale is a costly operation. Therefore, in this study we focus on reducing the computational complexity of the second step of the baseline solution, i.e., overlap computation, through efficient utilization of distributed computing resources as well as algorithmic enhancements. Toward this end, in Section 4.4 we present our basic filtering solution for efficient MaxRSKY computation. Thereafter, in Section 4.5 we further enhance the basic approach by proposing our multi-granular grid-based filtering approach.

4.4 Basic Filtering Approach

The idea is to prioritize the SSRs and then filter some of them from being joined with all other SSRs. We do this by performing a less expensive initial computation to gain insight into which SSRs have a higher chance of including the maximal reverse skyline location. Next, we process these candidates according to their priority until we identify a location which is guaranteed to have a higher number of reverse skylines than any other location in the remaining unprocessed SSRs.

As shown in Figure 4.3, our filtering approach consists of two main components: Precomputation and Query Processing. In the following, we describe each component in detail.

Figure 4.3: Overview of our filtering approach.

4.4.1 Precomputation

Skyline Computation and SSR Construction

In the precomputation component, we first compute the skyline set of all object points and build their corresponding SSR regions. Accordingly, for each object o ∈ O, we compute DSL(o) ⊆ S. Each object o partitions the d-dimensional space into 2^d orthants Ω_i, each identified by a number in the range [0, 2^d − 1]. For example, in Figure 4.4 (a), where d = 2, o_4 partitions the space into four orthants (quadrants Ω_0, ..., Ω_3). For simplicity and without loss of generality, hereafter we use 2D examples. Since all orthants are symmetric and we are interested in the absolute distance between site points, we can transform all site points to Ω_0 and compute the skyline points in Ω_0. As illustrated in Figure 4.4 (a), in order to compute DSL(o_4), sites s_2 and s_6 are transformed to Ω_0 (s′_2, s′_6). DSL(o_4) includes {s′_2, s_3}. Thereafter, based on the derived skyline points, we build the SSR region in Ω_0 and, respectively, the other symmetric SSR regions in the other orthants. In Figure 4.4 (a), the hatched area demonstrates SSR(o_4) in Ω_0 and the shaded area presents the SSR region in all four quadrants. Accordingly, the shaded regions in Figure 4.4 (b) and (c) present the SSR regions of objects o_2 and o_3, respectively. As we can see in Figure 4.4 (b) and (c), the SSR regions in the four quadrants are not necessarily symmetric, since they might be bounded by the two axes. Figure 4.4 (d) illustrates the three SSR regions of o_1, o_2, and o_3 overlapped in a single view.

Figure 4.4: Running example SSR regions. (a) Dynamic skyline of o_4 and its corresponding SSR region. (b) Dynamic skyline of o_2 and its corresponding SSR region. (c) Dynamic skyline of o_6 and its corresponding SSR region. (d) The three SSR regions illustrated in a single view.

The map and reduce steps for computing skyline points are as follows.

Map1: Given an input of the form <o, S>, where o ∈ O, the map function simply forwards the input to a reducer by producing the output tuple <reducer_id, <o, S>>. We introduce this mapper to spread and parallelize the workload across a configurable set of reducers.

Reduce1: Given an input list of tuples of the form <o, S>, the reducer finds the dynamic skyline set for each input object.
The reducer maintains a set of sites that are not yet dominated by any other site. In each iteration, a site s is checked against this set and replaces any site that it dominates. Once all sites have been tested, the set contains the dominating sites DSL(o). The function outputs a value of the form <o, DSL(o)> per input tuple.

It can be observed that using the MapReduce framework allows parallelizing the DSL calculation, because the skyline calculation is independent across objects. The set of objects is partitioned into smaller subsets, each processed by a multi-threaded mapper that calculates the DSLs for its assigned objects in parallel using a multi-core node. The cost of DSL calculation is linear in the number of objects and thus negligible compared to the cost of the subsequent phases of the basic filtering approach (see [104] for optimized methods of calculating the dynamic skyline set).

As far as storing the results is concerned, the default storage medium for MapReduce is HDFS, which is mostly appropriate for batch processing workloads with sequential access patterns such as the one in our baseline approach. However, given the need for random access to the objects and SSRs in subsequent phases, we store the map output in HBase.
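The non-dominated-set maintenance performed by Reduce1 can be sketched as follows (an illustrative rendition in place of the actual Hadoop code, reusing the dominates predicate from Section 4.2.2's sketch):

    def reduce1(o, sites):
        # Maintain the set of sites not yet dominated w.r.t. o:
        # each incoming site evicts the sites it dominates and is
        # kept only if nothing already in the set dominates it.
        skyline = []
        for s in sites:
            if any(dominates(t, s, o) for t in skyline):
                continue
            skyline = [t for t in skyline if not dominates(s, t, o)]
            skyline.append(s)
        return o, skyline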
Overlapping Data Structure

The idea is to precompute, for each SSR region, the likelihood of containing the maximum number of overlaps, and to maintain a ranking of those regions by this likelihood from high to low. In particular, we compute the optimality likelihood for each SSR region as a "score" that reflects the total number of SSRs overlapping with this region. Obviously, the higher the score of an SSR region, the greater the chance of finding an optimal location within (or at least partly in) this SSR region. Motivated by this idea, for each SSR region r, we find the list of SSR regions that overlap with r, denoted by L(r). Accordingly, the total number of regions listed in L(r) is recorded as the score of region r, denoted by Sc(r). Table 4.2, called the overlap table (OT), shows the list of overlapping SSRs of Figure 4.4 (d). Each row of the overlap table is called an entry and has the form (r, L(r), Sc(r)), or briefly (r, L, Sc). For simplicity, in Table 4.2, r_1, r_2, and r_3 represent SSR_1, SSR_2, and SSR_3, respectively. It is important to note that regions r_1 and r_3 do not overlap each other, since they have only common intersection points on their boundaries, which are not included in SSR regions.

Table 4.2: Overlap Table (OT).

    r      L(r)         Sc(r)
    r_2    {r_1, r_3}   2
    r_1    {r_2}        1
    r_3    {r_2}        1

Figure 4.5 shows the procedure we follow to populate the precomputed data structure:

Precomputation Procedure
1: For each o ∈ O
2:     Compute DSL(o)
3:     Build SSR(o)
4: Construct overlap table; (r, L(r), Sc(r))
5: Sort overlap table based on Sc(r)

Figure 4.5: Precomputation for the Basic Filtering Approach.

Below, we explain how we implement this procedure in three steps:

1. Computing Dynamic Skyline Sets and Constructing SSR Regions: As mentioned before, for each object o we first compute the dynamic skyline set of o, DSL(o). Then, we construct the corresponding SSR region, SSR(o), which is bounded by the derived skyline points and the two axes. In order to support the overlapping of SSR regions, we build an R-tree over all SSRs created in this step. The time complexity of computing the skyline sets and constructing the SSR regions is O(|O||S|^2) and O(|O|), respectively, i.e., linear in the number of objects.

2. Computing Pair-Wise Overlapping SSRs and Populating the OT: Once the DSLs and SSRs are generated, we populate the overlap table entries as follows. For each region entry r, we perform a range query to find all SSRs that overlap with r. Then, we compute and store the region score Sc(r) (as described above). The OT represents our region-optimality-likelihood list, to be used for the computation of the optimal location (described next, as presented in Figure 4.6). Since there are |O| SSRs, and given that an SSR overlaps with O(|O|) other SSRs, the cost of OT generation is quadratic in the number of objects. For the sake of completeness, we have proposed a lemma to detect whether two SSRs overlap. The lemma, its proof of correctness, and the computational complexity analysis for d-dimensional data are available in Appendix .1 for the interested reader. As a brief explanation, our overlap detection lemma utilizes the definition of "domination" (Section 4.2) to detect whether two SSRs overlap.

3. Sorting the Overlap Table: Finally, we sort all entries in the OT in descending order of Sc(r) to identify the regions with a higher likelihood of optimality. This step is of O(|O| log |O|) complexity.

The MapReduce implementation of Step 1 above was explained in Section 4.4.1. Toward Step 2, we first follow the method described in [35] to implement an R-tree over MapReduce to index all SSRs. Next, the OT records are generated as follows:

Map2: Let an input tuple be of the form <r, cte.>, where r is an SSR region. The map function queries the R-tree to find all MBRs that overlap with r. Let r′ be the SSR region enclosed in an MBR; r′ is inserted into L(r) if it overlaps with r and is ignored otherwise. The score of r is calculated as Sc(r) = |L(r)| and is persisted in HBase, as we explain ahead. We leverage HBase by identifying a commonality between our algorithm and the storage mechanism of HBase: HBase sorts the inserted tuples in ascending order of their keys, irrespective of the insertion order. Given that our algorithm involves sorting the OT tuples as well, we can dissolve the O(|O| log |O|) cost of sorting in Step 3 above by adjusting the output <key, value> tuples produced by the Map function. Given two regions r_1 and r_2, the corresponding OT tuples are implicitly sorted by HBase if key(r_1) < key(r_2). Following this argument, our Map function outputs a tuple of the form <key = <INTEGER_MAX − Sc(r), r>, value = L(r)> for a provided region r and stores it in HBase.

Reduce: No reduce operation is involved.
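As a concrete illustration of this key trick (a sketch under our own assumptions about key serialization; actual HBase client code differs), the row key can embed the inverted score as a big-endian prefix so that an ascending scan returns regions in descending order of Sc(r):

    import struct

    INT_MAX = 2**31 - 1

    def ot_row_key(region_id, score):
        # Big-endian bytes compare in numeric order, so the prefix
        # INT_MAX - Sc(r) makes HBase's ascending row order equal
        # to descending score order; region_id breaks ties.
        return struct.pack(">I", INT_MAX - score) + region_id.encode()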
4.4.2 Query Processing

In this section, we discuss how the MaxRSKY query is efficiently computed using the precomputed data structure. During query processing, we use the information recorded in the overlap table in the first phase as a score for each SSR region, which is equal to the total number of regions overlapping this region in a pair-wise relation. This score provides an over-estimate of the actual number of overlapping regions. One should observe that a higher score for a region indicates a higher potential of containing an optimal location. Next, through a refinement process, we visit the regions in descending order of their scores and, starting from the regions with higher scores, use an efficient technique to compute the actual set of overlapping regions for each entry of the overlap table. It is important to note that through this refinement process we only have to compute the actual overlap(s) for an entry if the score of the region is more than the influence value of the actual overlaps computed thus far.

Map3: Figure 4.6 illustrates the procedure we follow in a Mapper to answer a MaxRSKY query:

MaxRSKY Query Computation (Map Function)
1: Initialize q_o and S_o to ∅, and I_o to 0
2: For each SSR region r of the OT records assigned to this Mapper
3:     If Sc(r) ≥ I_o
4:         For each SSR region r′ ∈ L(r)
5:             Compute the intersection points between regions r and r′
6:             For each intersection point q found
7:                 Perform a point query from q to find all SSRs covering q
8:                 Let S be the result of the above point query
9:                 I_s = |S|
10:                If I_s > I_o
11:                    Update q_o = q, I_o = I_s, S_o = S
12: Return local optimal solution set S_o and I_o

Figure 4.6: MaxRSKY Query Processing with the Basic Filtering Approach (Map function).

Below, we explain how we implement this procedure in three steps:

Step 1 (Initializing the Optimal Result Set): Each Mapper receives as input a subset of the OT tuples, i.e., (r, L(r), Sc(r)), that were generated by Map2 (Section 4.4.1), and outputs the MaxRSKY region(s) that are locally optimal given the Mapper's input. Suppose S_o is a set of SSRs whose intersection corresponds to the local optimal region returned by a Mapper. Also, S_o has a local optimal influence value, denoted by I_o, which represents the number of regions belonging to the set S_o. At this step, we initialize S_o to the empty set and I_o to zero. The initialization is done in O(1).

Step 2 (Identifying Overlap Regions for Each OT Entry): For each entry (r, L, Sc) of the OT, we identify the optimal overlap locations by performing the following sub-steps:

1. We compare r with all SSRs in L and compute a set Q of all intersection points between r and an overlapping SSR in L.

2. For each point q ∈ Q, we perform a point query for q to find the set S of SSRs covering q. Accordingly, we compute the influence value of S, i.e., I_s, by counting the number of SSRs belonging to S. We update S_o and I_o if the influence value I_s of S is larger than the current I_o.

Although the aforementioned sub-steps can find an optimal solution, they are inefficient because they have to process all possible pairs of overlapping SSRs. In fact, some entries of the OT need not be considered and processed if there exists another entry whose intersection has a larger influence value (see line 3 in Figure 4.6). This technique is called influence-based pruning [122]; it prunes many candidate SSRs that have no chance of introducing an unknown optimal result and thus significantly improves the efficiency of MaxRSKY computation.

Step 3 (Finding the Local Maximum Influence Value): Once the computation terminates, S_o includes the set of local optimal overlap location(s) with the largest value of I_s across all SSRs passed to the Mapper. Specifically, the Map function outputs a value of the form <cte., <q, I_o, S_o>>.

Reduce2: The reduce function finds the global maximum by returning the tuples that have the maximum I_o value across all input tuples.

For ease of presentation, a Mapper as described in Figure 4.6 introduces only one local optimal overlap location, even in the presence of multiple locations sharing the same highest influence value. However, both the Mappers and the Reducers in our approach can simply return all such optimal locations without incurring extra computational complexity.
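The Mapper's refinement loop of Figure 4.6 can be sketched as follows (a minimal illustration assuming helper routines intersection_points and ssr_index.covering for the geometric operations; this is not the system's actual Hadoop code):

    def map3(ot_entries, ssr_index):
        # ot_entries: (r, L(r), Sc(r)) tuples, pre-sorted in
        # descending order of Sc(r) by the HBase key trick above.
        q_o, S_o, I_o = None, set(), 0
        for r, overlaps, score in ot_entries:
            if score < I_o:
                continue           # influence-based pruning (line 3)
            for r2 in overlaps:
                for q in intersection_points(r, r2):
                    S = ssr_index.covering(q)    # point query
                    if len(S) > I_o:
                        q_o, S_o, I_o = q, S, len(S)
        return q_o, I_o, S_o      # local optimum for this Mapper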
Discussion: The baseline approach blindly incurs the cost of cost_id + k × cost_pq for all pairs of overlapping SSRs (Equation 4.1). This is avoided in the basic approach by first solving the simpler boolean decision problem of whether there exists at least one such intersection point for the two SSRs. At that point, we have an insight into which SSRs have a higher chance of including the maximal reverse skyline location and can further refine them to find the optimal location. Equation 4.2 shows the computational complexity of the basic approach:

    |O| × O(|O|) × cost_boolean_overlap(SSR)              (precomputation)
    + α × |O| × O(|O|) × (cost_id + k × cost_pq)          (query processing)    (4.2)

where cost_boolean_overlap(SSR) is the cost of detecting whether two SSRs overlap (a boolean decision) and α ∈ [0, 1] is the filtering power. The equation below uses Equations 4.1 and 4.2 to compare the asymptotic complexity of the basic vs. the baseline approach:

    O(|O|^2) × cost_boolean_overlap(SSR) + α × O(|O|^3) < O(|O|^3)    (4.3)

This equation, together with the experiments, explains why the basic approach is more efficient than the baseline approach and thus scales to larger problem sizes given the same set of compute resources.

4.5 Multi-Granular Grid-based Filtering Approach

Our experimental results (see Section 4.6) show that the basic approach suffers from two main drawbacks which affect its performance when dealing with large datasets:

1. Equations 4.1 and 4.2 include factors of O(|O|^3) and O(|O|^2) complexity. When the value of |O| is in the order of hundreds of thousands or millions, such complexity results in extremely long run times regardless of the (limited) number of nodes available in a parallel processing infrastructure.

2. For each entry of the OT, there is a large number of pair-wise overlapped regions, which results in an overestimated value for the score of each entry compared to the actual number of overlaps in a location. Due to the over-estimated scores, the influence-based pruning method has a less significant impact on filtering out the entries with less likelihood of containing the optimal location. Consequently, a large number of OT entries are processed during computation. In terms of Equation 4.2, α has a not-so-small value (i.e., α does not approach 0). Therefore, there is a large number of entries with a large set of pair-wise overlapping regions, most of which have no impact on identifying the optimal location. Figure 4.7 (a) illustrates this effect: the list of pair-wise overlaps with SSR_1 is {(SSR_1, SSR_2), (SSR_1, SSR_3), (SSR_1, SSR_4)}, whereas the actual overlapping set is {(SSR_1, SSR_2, SSR_4)}. Accordingly, the pair (SSR_1, SSR_3) has no impact on identifying the optimal location.

Figure 4.7: An illustration of the plain grid-based filtering approach. (a) Overlapping of four SSRs. (b) Imposing a grid on SSRs.

To avoid the aforementioned issues with the basic filtering approach, we propose the multi-granular grid-based filtering approach for scaling to large datasets. Our approach is based on key concepts that we explain here.

4.5.1 Spatial Partitioning using Grids

SSRs, as ugly data-dependent concave d-polytopes, are central elements in solving the MaxRSKY query. However, introducing a well-shaped, tunable element that reduces the dependency of the algorithm on SSRs, and yet does not have to scale at the pace of |O|, can result in tangible improvements in run time. Of course, the correctness of the solution should not be compromised in this process.
4.5 Multi-Granular Grid-based Filtering Approach

Our experimental results (see Section 4.6) show that the basic approach suffers from two main drawbacks which affect its performance when dealing with large datasets:

1. Equations 4.1 and 4.2 include factors with O(|O|^3) and O(|O|^2) complexity. When the value of |O| is in the order of hundreds of thousands or millions, such complexity results in extremely long run times regardless of the (limited) number of nodes available in a parallel processing infrastructure.

2. For each entry of OT, there is a large number of pair-wise overlapped regions, which results in an overestimated value for the score of each entry as compared to the actual number of overlaps at a location. Due to the overestimated scores, the influence-based pruning method has a less significant impact on filtering those entries with less likelihood of containing the optimal location. Consequently, a large number of OT entries are processed during the computation; in terms of Equation 4.2, α would have a not-so-small value (α ↛ 0). Therefore, there is a large number of entries with a large set of pair-wise overlapping regions, most of which have no impact on identifying the optimal location. Figure 4.7(a) illustrates this effect. In this figure, the list of regions pair-wise overlapping with SSR_1 is {(SSR_1, SSR_2), (SSR_1, SSR_3), (SSR_1, SSR_4)}, whereas the actual overlapping set is {(SSR_1, SSR_2, SSR_4)}. Accordingly, the pair (SSR_1, SSR_3) has no impact on identifying the optimal location.

[Figure 4.7: An illustration of the plain grid-based filtering approach. (a) Overlapping of four SSRs; (b) imposing a grid on the SSRs.]

To avoid the aforementioned issues with the basic filtering approach, we propose the multi-granular grid-based filtering approach for scaling to large datasets. Our approach is based on key concepts that we explain below.

4.5.1 Spatial Partitioning using Grids

SSRs, as irregular, data-dependent, concave d-polytopes, are the central elements in solving the MaxRSKY query. However, introducing a well-shaped, tunable element that reduces the dependency of the algorithm on the SSRs, and yet does not have to scale at the pace of |O|, can result in tangible run-time improvements. Of course, the correctness of the solution must not be compromised in the process.

With our grid-based approach, we impose a grid over the SSR regions and spatially subdivide them into a regular grid of squares (or, generally, hypercubes) of equal side length. By imposing the grid, the SSRs are decomposed into a set of smaller entities (grid cells), and those entities are considered the unit of overlapping. Therefore, the entries of OT are based on grid cells and defined in the form (c, L(c), Sc(c)), where c represents a grid cell, L(c) is the list of SSR regions covering cell c partially or fully, and Sc(c) reflects the total number of SSR regions overlapping with cell c. Considering the finer resolution of the overlapping unit in grid-based filtering, the elements listed in L are closer to the actual overlapping set than in basic filtering. As a result, the pair-wise overlaps that have no impact on identifying the optimal result are eliminated from the computation. For instance, in Figure 4.7(b), for the three grid cells 1, 2, and 3, the corresponding lists L(1), L(2), and L(3) are computed as {(SSR_1, SSR_2, SSR_4)}, {(SSR_1)} and {(SSR_1, SSR_3)}, respectively. Also, their corresponding score values Sc(c) are 3, 1 and 2, respectively, which equal the actual values. Given that the list L provides a more accurate view of the actual overlaps and that the Sc(c) values are close to the actual ones, both the overlap computation and the influence-based pruning are performed more efficiently. These effects significantly improve the performance of grid-based filtering (results are discussed in Section 4.6).

In terms of implementation, one can slightly modify the basic filtering approach to incorporate the grid into the algorithm. The precomputation step would include building the DSLs as before, followed by scoring the grid cells and building the OT. Each tuple in OT is of the form <c, L(c), Sc(c)>, where c is a grid cell from the set of all cells C. The query processing phase would order the cells based on their score. Given the cell with the next highest score, we run point queries to find the maximum number of SSRs that overlap with any point in the cell; this is required to ensure the correctness of the algorithm in the presence of grid cells. Accordingly, the cost of the MaxRSKY query is updated to:

\[
\underbrace{|C| \times O(|O|) \times cost^{cell}_{boolean\_overlap}}_{\text{precomputation}}
\;+\;
\overbrace{\alpha_{cell} \times |C| \times O(|O|^{2})}^{\text{query processing}}
\tag{4.4}
\]

Applying a grid excels over the basic approach in various setups. An example scenario, which is very likely in the presence of a large number of objects, is when the SSRs densely cover regions of the space. Compared to the basic approach, processing the few cells in such a region avoids the cost of processing the many more SSRs located in the same region. The grid-based approach also improves the cost of pair-wise overlap detection, which is a basic operation in solving the MaxRSKY query: in the precomputation step, detecting whether a well-shaped cell overlaps with an SSR is less expensive than detecting whether two concave SSRs overlap. In other words, cost^{cell}_{boolean_overlap} in Equation 4.4 is less than cost^{SSR}_{boolean_overlap} in Equation 4.2 (see Appendix .1 for further information).
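The cell scoring itself can be sketched as follows, assuming each SSR exposes an axis-aligned bounding box through a hypothetical mbr helper and that a predicate overlaps_cell(r, cell) refines the MBR hit against the actual SSR; cells are identified by integer coordinate tuples on a regular grid of side cell_size.

    from itertools import product
    from collections import defaultdict

    def score_cells(ssrs, mbr, overlaps_cell, cell_size, d):
        """Map each grid cell to the SSRs overlapping it, so that
        L(c) = cells[c] and Sc(c) = len(cells[c])."""
        cells = defaultdict(list)
        for r in ssrs:
            lo, hi = mbr(r)    # assumed helper: (lower, upper) corner tuples
            # Filter: candidate cells are those intersecting the MBR.
            ranges = [range(int(lo[k] // cell_size),
                            int(hi[k] // cell_size) + 1) for k in range(d)]
            for cell in product(*ranges):
                if overlaps_cell(r, cell):   # refine against the actual SSR
                    cells[cell].append(r)
        return cells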
4.5.2 Dynamic Data-Aware Grid-Size Setting

Using grid cells as introduced above provides a performance improvement over the basic approach, as the experiments in Section 4.6 show. However, there are challenges that make the mere use of a regular grid, as described in Section 4.5.1, inefficient, if not impractical, at large scale:

1. Scoring all the cells is costly in the presence of a large number of fine-grained cells, not to mention a large number of objects (see the cost of the precomputation step in Equation 4.4).

2. Finding an appropriate cell size that results in a small run time is challenging. On the one hand, selecting a value that is too big or too small as the grid cell size does not scale. On the other hand, there is no known and reliable rule of thumb for tuning the cell size that would guarantee a short execution time for the MaxRSKY query over a large constellation of SSRs.

Given the above, in our multi-granular grid-based approach we start by imposing a coarse-grained grid over the space and scoring the very few resulting cells. Next, following the ranking idea proposed in the previous section, we use a priority queue, such as a max-heap, to order the cells according to their score. The top of the heap holds the cell with the highest score; this cell overlaps with more SSRs than any other cell in the heap. We pop this big cell and break it into smaller ones by subdividing it into two parts per dimension. Next, we compute the scores of these new cells and push them back into the heap. We repeat the process of popping the highest-scored cell from the heap and breaking it into smaller sub-cells until we pop a point, i.e., a cell of size zero. At this point, we have found a point that hosts the maximum number of overlapping SSRs compared to any other coarse- or fine-grained cell in the heap. The score of a point equals the actual count of overlapping SSRs and is no longer an overestimation. Thus, we can claim that we have found an answer to the MaxRSKY query. One can optionally continue this process to find any other point that potentially has the same score.

Figure 4.8 illustrates our grid-based approach over the same SSRs introduced in Figure 4.7. Had we followed the plain grid-based filtering approach, all the cells in Figure 4.8a would be scored. Instead, here we divide the space into four coarse-grained cells (labeled 1 through 4 in Figure 4.8b) and score them accordingly. Next, the cells are pushed into a max-heap, where we iteratively pop the highest-scored cell, break it into smaller child cells, and push the children back into the heap. We continue this process in Figures 4.8c through 4.8f, at which point the top element in the heap is a point with a higher score than any remaining element in the heap. This point is a MaxRSKY solution. If we continue the process and the points popped from the heap happen to have the same score as the MaxRSKY point, then we obtain the set of all MaxRSKY solutions, forming a MaxRSKY region where introducing a site anywhere in the region still yields the highest number of overlapping SSRs compared to any other location. The MaxRSKY region is marked in black in Figure 4.8f.

[Figure 4.8: Solving the MaxRSKY query using the multi-granular grid-based filtering approach. (a) A grid over the SSRs of Figure 4.7; (b) dividing the space into coarse-grained cells and calculating their scores; (c) popping cell 3 as the highest-scored cell, whose sub-cells are then scored and pushed into the heap; (d) popping cell 5; (e) popping cell 10; (f) popping cell 14, whose inner points form the MaxRSKY region (black); (g) colored regions show the parent cells whose child cells were neither scored nor processed.]

Our multi-granular grid-based approach filters a considerable portion of the search space without having to look into those regions and precompute scores for their sub-regions (Figure 4.8g). In other words, our approach is data-aware and only focuses on the parts of the space that have a chance of hosting the solution points. Contrary to constructing a quadtree, which involves partitioning the space into fine-grained sub-spaces in all the regions that include data, our dynamic partitioning approach involves only a tiny subset of cells at different granularities [109]. This results in a significant performance gain when processing the query at scale.
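One practical note for a Python rendering of this loop: the standard heapq module provides a min-heap, so the max-heap over cell scores can be emulated by pushing negated scores, much like the MAX_INT − score row keys used for HBase in Section 4.5.3 below. A minimal sketch with made-up cell ids:

    import heapq

    # Emulate a max-heap of scored cells with Python's min-heap by
    # negating scores (ties broken arbitrarily by cell id).
    heap = []
    for cell_id, score in [("c1", 7), ("c2", 12), ("c3", 3)]:
        heapq.heappush(heap, (-score, cell_id))

    neg_score, top_cell = heapq.heappop(heap)
    print(top_cell, -neg_score)   # prints: c2 12  (the highest-scored cell)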
The SSRs that overlap with a sub-cell can be identified in an optimized manner using only the subset of SSRs that overlap with the parent cell. This leverage is due to the space partitioning approach that we take here. We further explain this and prove the correctness of our filtering approach through the following lemmas:

Lemma 4.1. The set of SSRs that overlap a sub-cell is a subset of the SSRs that overlap its parent cell. In other words, given cells c_parent and c_child, where c_parent covers c_child, it can be proven that L(c_child) ⊆ L(c_parent).

We use the above lemma to prove the correctness of our algorithm. Given Lemma 4.1 and the definition of the scoring function as explained in Section 4.5.1, one can observe that:

Lemma 4.2. The score of a sub-cell is no more than the score of its parent cell. In other words, Sc(c_child) ≤ Sc(c_parent).

Also,

Lemma 4.3. The score of a cell overestimates, i.e., is greater than or equal to, the actual number of SSRs that intersect any given point in that cell.

Lemma 4.4. A point is a cell of size zero. The score of a point equals the actual number of SSRs that overlap with it. This can be calculated by running a point query against the set of SSRs that overlap with the parent cell.

Theorem 4.5. The first point returned by the multi-granular grid-based filtering approach is a MaxRSKY location.

Proof: If there were a point that intersected a higher number of SSRs, then it would have a higher score and would have been popped from the heap before this point. The lack of such a point, together with the above lemmas, proves that there is no other point, in any other cell in the space, that overlaps with a higher number of SSRs.

The above argument explains why filtering a coarse-grained parent cell without "precomputing" its sub-cells does not affect the correctness of our algorithm. The experiments in Section 4.6 show the great filtering power of our proposed approach and its key contribution to scaling to large datasets.
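To illustrate how Lemma 4.1 is exploited in practice, the sketch below splits a popped cell into its 2^d children and scores each child against only the parent's overlap list rather than the full SSR set. The predicate overlaps_cell(r, corner, size), which tests an SSR against a cell given by its lower corner and side length, is again a hypothetical placeholder.

    from itertools import product

    def split_and_score(corner, size, parent_ssrs, overlaps_cell, d):
        """Split a cell into 2^d children; by Lemma 4.1, L(child) is a
        subset of L(parent), so only parent_ssrs need to be tested."""
        half = size / 2.0
        children = []
        for offsets in product((0.0, half), repeat=d):
            child = tuple(corner[k] + offsets[k] for k in range(d))
            l_child = [r for r in parent_ssrs if overlaps_cell(r, child, half)]
            children.append((child, half, l_child, len(l_child)))
        return children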
Having explained the algorithmic contributions of our approach, we next explain how we tune MapReduce to balance the processing load across its resources, so that higher scalability is achieved in processing the MaxRSKY query.

4.5.3 Load-Balanced Distributed Processing

Like other parallel processing frameworks, the elapsed time of a MapReduce job depends on the completion of its longest-running task. Efficient utilization of resources in a job relies on evenly balancing the load across all the computational resources, so that idle resources are minimized. Towards this, the MapReduce framework runs a partition function after executing the map function to ensure that an even number of intermediate keys is assigned to each reducer, thus avoiding data skew [49, 90]. While this approach works for many data-intensive problems, the MaxRSKY query, like many problems in the field of data analytics, is compute intensive. The key to scalable execution of such problems is to avoid computational skew by considering the value of the data and not just its amount [83]. In other words, a load-balancing mechanism that results in better utilization of resources should make its decisions based on the content of the data rather than its size, thus requiring data-awareness.

Within our grid-based method, we take a round-robin approach to balance the load across the reducers. Once the initial coarse-grained cells are scored, we assign the cells to the reducers in a round-robin fashion, so that each reducer gets a mix of high- and low-scored cells. This minimizes the idle time of the reducers by decreasing the difference in their elapsed times. Figure 4.9a illustrates our load-balancing mechanism for a MaxRSKY query with four dimensions. Following the algorithm proposed in Section 4.5.2, we halve the space in each dimension, so a total of 2^4 = 16 coarse-grained cells is obtained. Next, these cells are scored and sorted accordingly. For ease of presentation, the highest-scored cell is labeled c_1 and the lowest-scored cell is labeled c_16. Assuming n reducers running on a cluster, our load-balancing technique assigns to reducer r_k the set of cells {c_i : i mod n ≡ k}. Let us assume our MapReduce cluster is configured to initiate four reducers. The cells c_1, c_5, c_9, c_13 would then be assigned to reducer r_1. Our load-balancing approach has the following advantages:

1. No assumption is made on the distribution of the objects and sites.

2. No sampling or other extra operation that is common among other load-balancing approaches is required [106]. The data-awareness is inherently gained as part of the algorithm in the basic and multi-granular grid-based approaches.

3. Not only is the load more evenly balanced among the processing nodes, but a higher filtering power within each reducer is also achieved as a result of the planned variance between the scores of the cells assigned to a reducer. In other words, besides reducing the inter-reducer execution-time difference, the intra-reducer execution time is optimized as well. This results in scaling to larger datasets.

To the last point, consider a load-balancing mechanism based on the sorted distribution of cells, as illustrated in Figure 4.9b. The variance of the scores of the coarse-grained cells within each reducer is smaller in this case compared to the round-robin distribution. As a result, filtering the remaining cells becomes harder, because a larger number of cells in the heap have to be broken into fine-grained cells before a point is found whose influence value exceeds the score of every other cell in the heap.

[Figure 4.9: Distribution of coarse-grained cells across the reducers. (a) Round-robin distribution; (b) sorted distribution. Axes: reducer id (x) vs. cell score (y).]

We further compare the performance of the different load-balancing methods in Section 4.6.
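Returning to the round-robin assignment above, the following is a minimal runnable sketch, assuming the coarse-grained cells arrive sorted by descending score; reducer r_k receives exactly the cells whose rank i satisfies i mod n ≡ k, matching Figure 4.9a.

    def round_robin_assign(sorted_cells, num_reducers):
        """Deal score-sorted cells to reducers like a deck of cards, so
        each reducer receives a mix of high- and low-scored cells."""
        buckets = [[] for _ in range(num_reducers)]
        for i, cell in enumerate(sorted_cells):
            buckets[i % num_reducers].append(cell)
        return buckets

    # With 16 coarse cells c1 (highest score) .. c16 and four reducers,
    # the first reducer receives [c1, c5, c9, c13], as in Figure 4.9a.
    cells = ["c%d" % i for i in range(1, 17)]
    assert round_robin_assign(cells, 4)[0] == ["c1", "c5", "c9", "c13"]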
The MapReduce implementation of our multi-granular grid-based approach involves three jobs:

1. SSR generation: given a set of objects and sites, the associated SSRs are created (see Map1 and Reducer1 in Section 4.4.1).

2. Calculating the scores of the coarse-grained cells: this job can be considered a counterpart to the precomputation step in the basic approach, with the difference that the OT generation here is much faster due to processing far fewer elements, i.e., the coarse-grained cells in the table, where each element incurs much less computation cost to calculate its score.

3. Load-balancing and fine-grained query processing: the coarse-grained cells are evenly dispatched to the different processing nodes, where local MaxRSKY points are identified. The final solution is the point with the maximum score among the identified local maxima.

Below we explain the map and reduce functions for the second and third jobs.

Map2: Let an input tuple be of the form <r, cte>, where r is an SSR region. The map function discovers the cells that overlap with r, and then produces output tuples of the form <cell_id, r>. This is done by calculating the MBR of r, and then running a range query against the grid to find all overlapping cells. Given the subset of cells that overlap the MBR, we then check the actual SSR against every single cell to see if they overlap. There are two subtleties here. First, by prioritizing the detection of whether a cell overlaps an MBR, as opposed to its enclosed SSR, we follow a filter-and-refinement strategy that avoids, as much as possible, paying the cell-vs.-SSR intersection detection cost by first solving the easier cell-vs.-MBR intersection detection problem. This allows scaling to larger datasets and dimensionalities given the same compute resources. Second, after using the grid for tunable space partitioning and scoring in Sections 4.5.1 and 4.5.2, in this map function we use the same grid for a different purpose, i.e., as an index structure for running a range query, thus utilizing the grid for dual purposes.

Reduce2: Given a list of input tuples of the form <cell_id, list of SSRs>, the reducer calculates the score of the cell and stores in HBase one record of the form <key = <MAX_INT − score, cell_id>, value = list of SSRs> per input tuple.

Map3: Given an input tuple in the form produced by Reduce2, the map function simply dispatches it to the proper reducer by producing an output tuple <key = reducer_id, value = <cell_id, score, list of overlapping SSRs>>. Due to the small number of coarse-grained cells, this map task can be done instantly on a single node. reducer_id can simply be calculated as (counter++) % |R|, where counter is an integer variable within the map task and |R| is the number of reducers, which is a configurable parameter in MapReduce.

Reduce3: Each reducer receives a list of values of the form <cell_id, score, list of overlapping SSRs> and uses the algorithm shown in Figure 4.10 to return the point(s) with the highest score. Finally, the MaxRSKY solution is identified by selecting the point(s) having the maximum score across the reducer outputs.

Multi-Granular Grid-Based MaxRSKY Query Computation (Reduce Function)
1: Initialize the local optimal solution set S_o = ∅, the maximum score I_o = 0, and Finished = False
2: Insert all the input cells into the max-heap according to their scores
3: Repeat
4:   Pop an element from the heap
5:   If it is a point, q
6:     If score(q) = I_o or S_o = ∅
7:       S_o = S_o ∪ {q}
8:       I_o = score(q)
9:     Else
10:      Finished = True
11:  Else
12:    Break this (parent) cell into 2^d child cells
13:    For each SSR that overlaps the parent cell
14:      Run a range query to find the overlapping child cells and increase their scores by one
15:    Push the child cells into the heap according to their calculated scores
16: Until Finished = True or the heap is empty
17: Return S_o and I_o

Figure 4.10: MaxRSKY Query Processing with the Multi-Granular Grid-Based Filtering Approach (Reduce function)
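For concreteness, below is a minimal single-process Python rendering of the reduce-side loop of Figure 4.10. The predicate overlaps_cell(r, corner, size) is the same hypothetical placeholder as before, and a cell whose side reaches min_size is treated as a point whose score is taken to be exact, which approximates the point query of Lemma 4.4.

    import heapq
    from itertools import product

    def reduce_maxrsky(scored_cells, overlaps_cell, d, min_size=1e-9):
        """scored_cells: iterable of (corner, size, ssr_list) coarse cells.
        Returns (local optimal points, influence value), as in Figure 4.10."""
        heap, counter = [], 0
        for corner, size, ssrs in scored_cells:
            heap.append((-len(ssrs), counter, corner, size, ssrs))
            counter += 1
        heapq.heapify(heap)
        s_opt, i_opt = [], 0
        while heap:
            neg_score, _, corner, size, ssrs = heapq.heappop(heap)
            if size <= min_size:                     # a point: score is exact
                if -neg_score == i_opt or not s_opt:
                    s_opt.append(corner)
                    i_opt = -neg_score
                else:
                    break                            # strictly lower: finished
            else:                                    # split into 2^d children
                half = size / 2.0
                for offsets in product((0.0, half), repeat=d):
                    child = tuple(corner[k] + offsets[k] for k in range(d))
                    l_child = [r for r in ssrs if overlaps_cell(r, child, half)]
                    if l_child:
                        heapq.heappush(
                            heap, (-len(l_child), counter, child, half, l_child))
                        counter += 1
        return s_opt, i_opt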
4.6 Experimental Results

In this section, we present the results of our empirical study of the proposed solutions. Table 4.3 shows the range of values for the different parameters that were extensively tested in our experiments. A default value is noted for each parameter in the table; we use the default values unless otherwise specified for a parameter under study.

Table 4.3: Summary of Parameters

Parameter | Default | Range
Number of objects |O| | 200K | 5K - 3.2M
Number of sites |S| (% of space) | 5% | 0.1% - 50%
Dimensions d | 4 | 2 - 6
Object distribution | correlated | correlated, anti-correlated, independent
Site distribution | independent | correlated, anti-correlated, independent
Range per dimension | 32 | 32 - 1024
MapReduce nodes | 4 | 4 - 16
Load balancing | round-robin | round-robin, sorted, random
Algorithm | multi-granular cell-based | baseline; basic with MBR-MBR, SSR-SSR, and MBR-SSR scoring functions; plain cell-based; multi-granular cell-based

Our default values represent common real-world applications such as online marketplaces, where the interests of clients, i.e., objects, have been recorded through their online search behavior. This can be done through the Amazon or Google Shopping websites, in which vendors offer variations of a product, i.e., sites. Using all the default values provides a scenario with two hundred thousand objects in a four-dimensional space, where each dimension has a range of thirty-two distinct values. The default number of sites equals five percent of the possible points in the space, i.e., roughly fifty-two thousand sites out of more than one million possible desired specifications for a client. Objects follow a correlated distribution, e.g., due to market trends or to products going viral through word of mouth, and sites follow an independent distribution. The effect of different data distributions has been evaluated in our experiments to cover different markets. The default load-balancing mechanism is round-robin, and the default algorithm is the multi-granular grid-based filtering approach. We use a cluster of five MapReduce nodes, where one node is the master and the four other nodes do the actual query processing.

4.6.1 Cost of Computing SSRs vs. Cost of Computing Overlap among SSRs

For this experiment, we used a dataset consisting of 5,000 objects in a 3-dimensional space. As explained earlier, all other parameters take their default values. As a result, the cardinality of sites equals 5% of the points in the space; thus there are 1,638 sites. We applied the baseline approach to the aforementioned dataset and separately computed the execution time of the skyline and SSR computation and the execution time of computing the overlaps among SSRs to identify the optimal location. Table 4.4 presents the results of this experiment. The overlap detection cost is orders of magnitude greater than the dynamic skyline generation cost. A scalable solution for finding the dynamic skyline set using MapReduce has been introduced in [104].
In addition, the cost of the overlap computation increases significantly with larger datasets. Therefore, in this study we focused on reducing the computational complexity of computing the overlaps among SSRs and identifying the optimal location.

Table 4.4: Cost of computing SSRs vs. cost of computing overlap among SSRs

Dynamic Skyline | 39s
RTree Construction | 2m:03s
ImportToHbase (DSL Table) | 27s
Overlap Detection | 23h:37m:18s

4.6.2 Baseline Filtering Approach vs. Basic Filtering Approach

We applied the basic filtering approach to the same dataset of Section 4.6.1 and computed its execution time. As illustrated in Figure 4.11, basic filtering outperforms the baseline approach by a factor of 3. The figure distinguishes the time for creating the overlap table from the overlap detection time of the basic approach. Spending a small amount of time to gain insight into which SSRs have a higher chance of including the optimal location prevented several hours of extra overlap detection incurred in the baseline approach.

The time complexity of the two approaches was compared in Equation 4.3 in Section 4.4.2. The baseline approach incurs a fixed cubic cost in the number of objects due to its plain batch processing of the SSRs, while the basic approach first incurs a quadratic cost for precomputation, and then provides some level of filtering while still involving a cubic cost (with a lower coefficient than the baseline approach). While the observation in Figure 4.11 verifies that the basic filtering approach is more efficient, neither of the approaches scales to more than a couple of thousand objects.

[Figure 4.11: Baseline filtering vs. basic filtering. Execution time in hours per filtering approach, split into OT generation and overlap detection.]

4.6.3 Effect of the Scoring Function

We next applied different scoring functions to the basic filtering approach. Our goals in this experiment were, first, to test the limits of this approach and, second, to evaluate how a brief versus a detailed preprocessing phase affects the overall execution time. We used the same 3-dimensional setup introduced in Section 4.6.1, except that 10,000 instead of 5,000 objects were used. The following three scoring functions were used to score the SSRs:

1. SSR-SSR: This is the same scoring function we introduced in Section 4.4. The score of an object is incremented if its SSR overlaps the SSR of another object.

2. MBR-SSR: The score is incremented when the MBR of this object overlaps the SSR of another object.

3. MBR-MBR: The score is incremented when the MBR of this object overlaps the MBR of another object.

[Figure 4.12: Effect of the scoring function on the basic filtering approach. (a) Execution time in hours, split into OT generation and overlap detection; (b) number of filtered SSRs per scoring function.]

The first scoring function is the most informative of the three, while the third is the least informative. We analyzed the boolean overlap detection cost of the three scoring functions in Appendix .1. As illustrated in Figure 4.12, our observations confirm that while the SSR-SSR scoring function is the most costly to compute, it results in a higher filtering power in the query processing phase, and thus a shorter overall execution time.
This is while the MBR-MBR scoring function is the quickest to calculate, but the least insightful regarding the actual number of overlapping SSRs. The scores here are by far higher than the actual number of overlaps. Thus, the influence value of a processed SSR typically does not surpass the (overestimated) score of the next SSR in the list, requiring the algorithm to continue processing several SSRs without being able to filter them (Figure 4.12b). It took the MapReduce cluster almost thirty hours to run the basic approach over just 10,000 objects. This is because the approach contains factors with O(|O|^3) and O(|O|^2) complexity, and because even with the SSR-SSR scoring function the scores are overestimated, resulting in thousands of SSRs being processed before the optimal location is identified. All of the above confirm that SSR-based scoring for the purpose of finding the optimal location does not scale to the large datasets targeted in our study.

4.6.4 Sensitivity to Cell Size in the Plain Grid-Based Approach

With this experiment, we studied the effect of changing the granularity of the imposed grid on the execution time of the plain grid-based filtering approach. We used the default parameters for this experiment. The grid-based approach differs from basic filtering in that the fundamental processing element here is a grid cell, i.e., a partition of the space, as opposed to the data-dependent SSRs. We made two primary observations, explained below.

First, the plain grid-based approach is capable of running our default scenario, i.e., 200,000 objects in a 4-dimensional space, within the reasonable time of 7 hours and 20 minutes. This is a great achievement compared to the basic approach, which took much longer to process 10,000 objects in a 3-dimensional space, and it indicates that the grid-based approach can scale to large datasets due to its emphasis on space partitioning, as well as on ranking the partitions.

[Figure 4.13: Effect of cell granularity on the plain grid-based approach. Splits per dimension (2-32, x-axis) vs. execution time in hours (y-axis), split into OT generation and overlap detection; the coarsest and finest settings were terminated.]

Second, the plain grid-based approach is sensitive to the granularity of the cells. As illustrated in Figure 4.13, the execution time of plain grid-based filtering fluctuates with changes in grid cell size. These fluctuations occur because the precomputation time to form the OT table and the number of pruned entries vary from one grid cell size to another. However, for grid cell values in the middle of the range (e.g., cell sizes 8 and 4 in Figure 4.13), the execution time is lower than for the rest of the range. For coarser granularities, the execution time increases because larger cell sizes lead to larger numbers of point queries, which result in a higher execution time. With finer granularities, the execution time deteriorates due to the large precomputation cost. The coarsest and finest granularity experiments were manually terminated after they were unable to identify the optimal location within a considerable amount of time. For example, in the experiment where each dimension is split into 32 pieces, only 8% of the entries in the OT had been generated after more than three days of execution.

The plain grid-based approach requires setting a common size for all cells before the execution starts. This is unlike the multi-granular approach, which dynamically sets different sizes for cells in different areas of the space depending on the data distribution.
An implication of this is that the plain approach would not scale to larger numbers of dimensions, as the number of fixed-size cells grows exponentially. For example, for the case of splitting each dimension into 8 pieces in the experiment above, the number of cells would grow from 8^4 to 8^5 in a 5-dimensional space, which means an eight-fold increase in the precomputation cost solely as a result of the increased number of entries in the OT. This slows down the plain grid-based approach compared to multi-granular filtering, regardless of the other bottlenecks that the two approaches might commonly confront in high dimensions.

4.6.5 Multi-Granular Grid-Based Approach

Figure 4.14 depicts the performance of our multi-granular grid-based filtering approach using the default settings. The MaxRSKY computation over 200,000 objects in a 4-dimensional space is processed in only slightly more than an hour; a duration that is orders of magnitude shorter than the execution times of the basic and baseline approaches and five times shorter than the minimum run time of the plain grid-based approach. This is due to the approach being linear in the number of objects involved, and to its dynamically filtering cells of different sizes without manually selecting a fixed cell size, as was the case in the plain grid-based approach.

We next analyze the filtering power of the multi-granular approach. There is a difference in the way one can analyze the filtering power of the basic and plain grid-based approaches versus the multi-granular approach. The former approaches involve only one round of scoring, i.e., when forming the overlap table, after which the scored entities are ranked and filtered accordingly. With our dynamic approach, however, we have iterations of scoring and ranking over cells of different sizes. As a result, a simple comparison of the total number of unprocessed entities, e.g., cells, between the former and latter approaches is not reasonable. Figure 4.14b shows the filtering power of our multi-granular approach. The filtering power is different at each granularity. There is more value in not having to process a coarse-grained cell than a fine-grained cell, because the former is the parent of several instances of the latter. For example, in a 4-dimensional space, filtering a size-8 cell is as valuable as filtering the thousands of size-1 cells inside it (if not more valuable, due to the overhead involved).

[Figure 4.14: Multi-granular grid-based approach with the default parameters. (a) Execution time in hours, split into scoring coarse-grained cells and dynamic query processing; (b) filtering power:]

cell size | instances in the heap | portion never explored deeper
16 | 0 | 0%
8 | 146 | 57%
4 | 1596 | 91%
2 | 2434 | 93%
1 | 2726 | 90%

4.6.6 Effect of Site and Object Cardinality

In order to evaluate the effect of site and object cardinality on our proposed approaches, we implemented two experiments. In the first, we considered a fixed site dataset and used object datasets of various sizes. In the second experiment, we fixed the object dataset and used various site datasets. Below, we describe each experiment in more detail.

Effect of Object Dataset: For this experiment, we tested the scalability of our approach across large numbers of objects. Figure 4.15 depicts the results of our experiment.
The figure shows that the execution time increases with the number of object points, since query processing involves more SSRs and, hence, more overlap computation. The growth trend in the figure is linear with respect to the number of object points. This is because the cost of the grid-based approach is linear in the number of objects, unlike the basic and baseline approaches, which have cubic complexity in the number of objects. Our observations show that the multi-granular grid-based approach scales to solving the MaxRSKY problem over millions of objects. This is an orders-of-magnitude improvement over our earlier work, the best existing approach, which scales only to 1,000 objects in the same 4-dimensional space and cannot scale further due to limited memory and the lack of parallel computational resources [27].

[Figure 4.15: Effect of object set cardinality. Number of objects (0.1M-3.2M, x-axis) vs. execution time in hours (y-axis), split into scoring coarse-grained cells and dynamic query processing.]

Effect of Site Dataset: For this experiment, we used site cardinalities proportional to 0.1, 1, 5, 10, 25, and 50 percent of the more than 1 million possible points in the space. Figure 4.16 demonstrates that the execution time consistently decreases as the number of site points increases. This is because as the number of site points grows, the SSRs shrink and consequently reduce the computation cost in two respects. First, the number of their dynamic skyline points, i.e., |DSL|, decreases. This turns them into cube-like shapes and results in a decrease in the boolean overlap detection cost (see Appendix .1), so less cost is incurred per overlap detection operation. Second, due to their shrunken size, they overlap with fewer cells, causing a reinforcing chain of reductions in the frequency of overlap detection: a parent cell overlaps with fewer SSRs, so all its children have a smaller set of initial SSRs to check for overlap; this effect is then conveyed to the next iterations of scoring until the end of query processing.

[Figure 4.16: Effect of site set cardinality. Site-to-space ratio in percent (x-axis) vs. execution time in minutes (y-axis, log scale), split into scoring coarse-grained cells and dynamic query processing; one configuration is marked as terminated.]

4.6.7 Effect of Site and Object Distribution

In order to evaluate the effect of site and object distribution on our grid-based approach, we implemented two experiments. In the first, we considered an independent distribution for the sites and used object datasets with various distributions. In the second experiment, we considered a correlated distribution for the objects and used site datasets with different distributions. The two sets of experiments share the default experiment of correlated objects against independent sites. Below, we describe each experiment in more detail.

Effect of Object Distribution: We applied the grid-based filtering approach to datasets with correlated, anti-correlated, and independent objects and computed their execution times. Figure 4.17a depicts the results of our experiment.
The longest execution times occurred when datasets with independent distributions were used, followed by the dataset with a correlated distribution of objects, and lastly the dataset with an anti-correlated distribution. With independent distributions, both object and site points are uniformly distributed in the data space. Thus, for a given object, its skyline set is scattered across the data space, which results in large SSR regions. The larger the SSRs, the greater the time required for overlap computation and for identifying the optimal location. The correlated and anti-correlated distributions of objects represent real-world applications where the interests of clients are affected by others, e.g., through common trends that go viral through mass advertisement or word of mouth.

Effect of Site Distribution: We applied the grid-based filtering approach to datasets with correlated, anti-correlated, and independent sites and computed their execution times. Figure 4.17b depicts the results of our experiments. With a correlated distribution, both object and site points are closely distributed, which results in tiny SSRs and faster execution times for the overlap and MaxRSKY computation. Although the distribution of the anti-correlated site dataset used for the evaluation of skyline queries is clustered, the size of the SSRs is quite large, and hence their overlap computation is computationally more time consuming than for the other distributions. A sample real-world scenario for this dataset is when the interests of the clients have suddenly shifted, perhaps due to a breakthrough in technology, e.g., the introduction of the next big thing by one of the players in the market. This would make most of the existing products irrelevant and uncorrelated with the new desires of the clients.

[Figure 4.17: Effect of data distribution. (a) Effect of object distribution: execution time in hours; (b) effect of site distribution: execution time in minutes (log scale); both split into scoring coarse-grained cells and dynamic query processing.]

4.6.8 Effect of Load Balancing Mechanism

Figure 4.18 depicts the effect of our load-balancing technique. The experiments are consistent with our analysis in Section 4.5.3 in that the round-robin distribution of coarse-grained cells allows a higher variance between the scores, thus boosting the intra-reducer filtering power. The experiment with round-robin load-balancing performs 23% and 33% better than the MaxRSKY computation with random and sorted load-balancing, respectively. Similar superiority was maintained in our experiments over more complex MaxRSKY problems.

[Figure 4.18: Effect of the load-balancing mechanism. Round-robin, random, and sorted load balancing (x-axis) vs. execution time in minutes (y-axis), split into scoring coarse-grained cells and dynamic query processing.]

4.6.9 Effect of Dimension Cardinality

Figure 4.19 depicts the effect of the number of dimensions on the execution time. To make the experiments on low dimensions more challenging, we extended the range, i.e., the number of distinct values per dimension, from 32 to 1024 and 64 for the 2- and 3-dimensional spaces (2D and 3D), respectively, so that we have at least one million distinct points in the space for all the experiments. Despite this change, the experiments for 2 and 3 dimensions took no more than three minutes.
For the experiments in 4D, 5D, and 6D, we used the default range of 32 values per dimension. This means that more than 32 million and more than 1 billion points were present in the latter two experiments, respectively. While the experiment in 5D took 19 hours to finish, we terminated the experiment for 6D after two days. We used Equation .7 and the execution time of the experiment over the 5D space to estimate the execution time for 6D. The average cardinality of sites in the skyline set for 5D was 23 sites, i.e., |DSL| = 23. We calculated this metric over a sample set of objects in the 6D space as 32 sites, i.e., |DSL| = 32. Using Equation .7, our method would take about 5 days to identify the optimal location in the studied 6-dimensional space.

[Figure 4.19: Effect of the number of dimensions. Dimensions 2 through 6 (x-axis) vs. execution time in minutes (y-axis, log scale), split into scoring coarse-grained cells and dynamic query processing; the 6D value is estimated.]

4.6.10 Effect of Cluster Node Cardinality

Figure 4.20 shows the effect of the number of nodes in the MapReduce cluster on the execution time for the 5D dataset studied in Section 4.6.9. It can be observed that increasing the number of nodes reduces the execution time. However, the increase in performance does not match the amount of compute resources added to the cluster. We tested the effect of node cardinality on other experimental setups and did not observe a considerable reduction in execution time.

[Figure 4.20: Effect of the number of worker nodes. 4, 8, and 16 nodes (x-axis) vs. execution time in hours (y-axis), split into scoring coarse-grained cells and dynamic query processing.]

Increasing the number of nodes allows more parallel processing, which is beneficial in reducing the run time. However, as the number of nodes increases, fewer coarse-grained cells are assigned to each node, and so the findings of the other nodes regarding which coarse- and fine-grained cells can be filtered remain hidden from a given node. This occurs because the MapReduce framework was originally designed to run on a shared-nothing architecture, e.g., for solving embarrassingly parallel problems. As we explain in Chapter 5, one of our future directions is to address this issue and fully benefit from the potential of a parallel processing framework through changing the processing platform, along with minor changes to our already parallelized solutions.

Appendices

.1 The Boolean Overlap Detection Problem

.1.1 Lemmas

Lemma .6. A point is within SSR(o) if none of the unfolded points of DSL(o) dominates the point.

DEFINITION 4 (LEFT-OVERLAP): SSR_1 left-overlaps SSR_2 if any of the unfolded points of DSL_1 is within SSR_2.

Lemma .7. SSR_1 overlaps with SSR_2 if either of the below holds:
• SSR_1 left-overlaps SSR_2, or
• SSR_2 left-overlaps SSR_1.

Theorem .8. SSR_1 overlaps with SSR_2 if
• at least one point in SSR_1 is not dominated by any of the vertices of SSR_2, or
• at least one point in SSR_2 is not dominated by any of the vertices of SSR_1.

Proof. Deduced from Lemmas .6 and .7 and Definition 4.

Lemma .9. A cell can be represented as an SSR by placing

1. a hypothetical object o in the center of the cell;

2. d sites, each cellsize/2 units away from the object in the direction of each dimension (sites s_1, s_2 in Figure 21);

3. another d sites, each mirroring a site introduced in step 2 above (sites s'_1, s'_2 in Figure 21).

[Figure 21: Representing a cell as an SSR]

It can be observed that the 2×d sites together form an SSR in the form of a cell.

.1.2 Computational Complexity

The cost of detecting whether SSR_1 left-overlaps SSR_2 can be described as:

\[
2^{d} \cdot |DSL_1| \cdot 2^{d} \cdot |DSL_2| \;=\; 2^{2d} \cdot |DSL_1| \cdot |DSL_2|
\tag{.5}
\]

This follows Lemma .6, where we first unfold the sites in DSL_1, and then check whether any of the unfolded points of DSL_2 dominates any of the unfolded points associated with DSL_1. The total cost of boolean overlap detection is:

\[
2^{2d} \cdot |DSL_1| \cdot |DSL_2| + 2^{2d} \cdot |DSL_2| \cdot |DSL_1| \;=\; 2^{2d+1} \cdot |DSL_1| \cdot |DSL_2|
\tag{.6}
\]

This directly follows Lemma .7. To calculate the boolean overlap detection cost in the grid-based approach, we use Lemma .9 to convert a cell into an SSR. This results in the following cost:

\[
2^{2d+1} \cdot 2d \cdot |DSL_2|
\tag{.7}
\]

which is less than the cost in Equation .6 in most scenarios, given that the number of dimensions is typically much smaller than the size of the dynamic skyline set. Finally, the cost of detecting whether two cells or MBRs overlap can be obtained in a similar way:

\[
2^{2d+1} \cdot 2d \cdot 2d \;=\; 2^{2d+3} \cdot d^{2}
\tag{.8}
\]
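As a quick numeric illustration of Equations .6 through .8 (a sketch; the unit is dominance checks), the snippet below evaluates the three boolean overlap detection costs for d = 4 dimensions and skyline sets of 23 sites, a cardinality borrowed from the 5D measurement in Section 4.6.9 purely for illustration:

    def cost_ssr_ssr(d, dsl1, dsl2):
        return 2 ** (2 * d + 1) * dsl1 * dsl2        # Equation .6

    def cost_cell_ssr(d, dsl2):
        return 2 ** (2 * d + 1) * 2 * d * dsl2       # Equation .7

    def cost_cell_cell(d):
        return 2 ** (2 * d + 3) * d * d              # Equation .8

    # SSR-SSR is ~3x costlier than cell-SSR here, and cell-cell is cheapest:
    print(cost_ssr_ssr(4, 23, 23), cost_cell_ssr(4, 23), cost_cell_cell(4))
    # 270848 94208 32768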
Chapter 5

Conclusion and Future Work

In this thesis, for the first time we argued that in addition to efficient "location update", efficient "probing" can also significantly improve the communication scalability of proximity query answering. Accordingly, we proposed a probe optimization technique that considers proximity queries in batch and minimizes the number of probes required to answer the entire batch of queries. Furthermore, we showed that our proposed probe optimization technique is parallelizable, and hence allows for scale-out. Our experiments demonstrate that our proposed proximity query answering approach enables continuous processing of hundreds of millions of proximity queries.

Moreover, for the first time we proposed a scalable solution to the problem of maximal reverse skyline query computation. Accordingly, we proposed a dynamic grid-based filtering approach for efficient computation of MaxRSKY, with a focus on avoiding the cost of overlap computation among a large number of SSRs through space partitioning and dynamic, data-aware pruning of the space. We further utilized the MapReduce framework to parallelize our approach, and we balanced the processing load for further resource utilization and scalability. We verified and compared the performance of our proposed solutions with rigorous complexity analysis as well as an extensive experimental evaluation.

We plan to pursue our study on scaling communication efficiency in two directions. First, we believe that our proposed probe optimization approach can be generalized to enhance the efficiency of other spatial queries on moving objects. Accordingly, we intend to investigate and address the specific challenges of probe optimization in the context of this family of queries. Second, given that our proposed probe optimization approach is parallelizable, we will seek enhancements of our proposed technique by leveraging the features offered by cloud computing.

We plan to extend our work on computational scalability in multiple directions. First, we would like to port our parallel MaxRSKY solutions to a more advanced parallel processing framework such as Apache Spark, a successor of MapReduce, and have the processes running on different nodes pause after a few iterations of execution and share their findings on the influence value of their potential locally optimal locations, so that a larger number of cells can be filtered across all nodes. The second future direction is to explore the moving-object MCOL problem, where the objects are not static over time, e.g.,
a scenario where user interests change over time. Using the existing method for dynamic objects would require us to re-run the whole algorithm upon any change in object locations. Last but not least, we want to devise more efficient multi-dimensional overlap detection algorithms, so that our MaxRSKY query processing approach can be applied to applications with a high number of dimensions.

Reference List

[1] Amazon. http://www.amazon.com.
[2] Annual number of worldwide active Amazon customer accounts. http://www.statista.com/statistics/237810/number-of-active-amazon-customer-accounts-worldwide/.
[3] Apache HBase. http://hbase.apache.org/.
[4] Bay Area Census. http://www.bayareacensus.ca.gov/.
[5] Department of Motor Vehicles statistics for 2011. http://apps.dmv.ca.gov/about/profile/official.pdf.
[6] Microsoft SQL Azure. http://www.microsoft.com/windowsazure/sqlazure/.
[7] MongoDB. http://www.mongodb.org/.
[8] Neo4j. http://neo4j.org/.
[9] San Francisco commute speeds drop dramatically. http://sanfrancisco.cbslocal.com/2011/12/06/san-francisco-commute-speeds-drop-dramatically/.
[10] Size of the global fragrance market. http://www.statista.com/statistics/259221/global-fragrance-market-size/.
[11] Footwear industry statistics. http://www.statisticbrain.com/footwear-industry-statistics/.
[12] Smartphone sales to hit 1bn a year for first time in 2013. http://www.guardian.co.uk/business/2013/jan/06/smartphone-sales-1bn-2013.
[13] United States Facebook statistics. http://www.socialbakers.com/facebook-statistics/united-states.
[14] Vertica. http://www.vertica.com/.
[15] D. J. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. S. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. Zdonik. The design of the Borealis stream processing engine. In Second Biennial Conference on Innovative Data Systems Research (CIDR 2005), Asilomar, CA, January 2005.
[16] A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. Saltz. Hadoop GIS: a high performance spatial data warehousing system over MapReduce. Proceedings of the VLDB Endowment, 6(11):1009-1020, 2013.
[17] A. Akdogan, U. Demiryurek, F. Banaei-Kashani, and C. Shahabi. Voronoi-based geospatial query processing with MapReduce. In Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on, pages 9-16. IEEE, 2010.
[18] A. Akdogan, U. Demiryurek, F. Banaei-Kashani, and C. Shahabi. Voronoi-based geospatial query processing with MapReduce. In Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on, pages 9-16, Nov. 30-Dec. 3, 2010.
[19] M. Ali, B. Chandramouli, J. Fay, C. Wong, S. Drucker, and B. S. Raman. Online visualization of geospatial stream data using the WorldWide Telescope. In Proceedings of the International Conference on Very Large Data Bases (VLDB), September 2011.
[20] M. Ali, B. Chandramouli, J. Fay, C. Wong, S. Drucker, and B. S. Raman. Online visualization of geospatial stream data using the WorldWide Telescope. In Proceedings of the International Conference on Very Large Data Bases (VLDB), September 2011.
[21] A. Amir, A. Efrat, J. Myllymaki, L. Palaniappan, and K. Wampler. Buddy tracking - efficient proximity detection among mobile friends. In INFOCOM, 2004.
[22] G. Anthes. Invasion of the mobile apps. Commun. ACM, 54(9):16-18, Sept. 2011.
[23] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems.
In Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '02, pages 1-16, New York, NY, USA, 2002. ACM.
[24] S. Babu and J. Widom. Continuous queries over data streams. SIGMOD Record, 30(3):109-120, 2001.
[25] B. Bamba, L. Liu, A. Iyengar, and P. S. Yu. Distributed processing of spatial alarms: A safe region-based approach. In Proceedings of the 2009 29th IEEE International Conference on Distributed Computing Systems, ICDCS '09, pages 207-214, Washington, DC, USA, 2009. IEEE Computer Society.
[26] F. Banaei-Kashani, P. Ghaemi, B. Movaqar, and S. J. Kazemitabar. Efficient maximal reverse skyline query processing. Submitted, 2016.
[27] F. Banaei-Kashani, P. Ghaemi, B. Movaqar, and S. J. Kazemitabar. Efficient maximal reverse skyline query processing. GeoInformatica, manuscript submitted for publication.
[28] N. Bäuerle and U. Rieder. Markov Decision Processes with Applications to Finance. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.
[29] R. Bellman. Dynamic Programming. Dover Publications, 1957.
[30] J. L. Bentley, K. L. Clarkson, and D. B. Levine. Fast linear expected-time algorithms for computing maxima and convex hulls. In Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '90, pages 179-187, 1990.
[31] J. L. Bentley, H. T. Kung, M. Schkolnick, and C. D. Thompson. On the average number of maxima in a set of vectors and applications. J. ACM, 25:536-543, October 1978.
[32] A. Biem, E. Bouillet, H. Feng, A. Ranganathan, A. Riabov, O. Verscheure, H. Koutsopoulos, and C. Moran. IBM InfoSphere Streams for scalable, real-time, intelligent transportation services. In Proceedings of the 2010 International Conference on Management of Data, SIGMOD '10, pages 1093-1104, New York, NY, USA, 2010. ACM.
[33] S. Börzsönyi, D. Kossmann, and K. Stocker. The skyline operator. In Proceedings of the 17th International Conference on Data Engineering, pages 421-430, 2001.
[34] T. Brinkhoff. A framework for generating network-based moving objects. GeoInformatica, 6(2):153-180, June 2002.
[35] A. Cary, Z. Sun, V. Hristidis, and N. Rishe. Experiences on processing spatial data with MapReduce. In Scientific and Statistical Database Management, pages 302-319. Springer, 2009.
[36] B. Chazelle. Filtering search: A new approach to query-answering. SIAM J. Comput., 15:703-724, 1986.
[37] M. Cheema, L. Brankovic, X. Lin, W. Zhang, and W. Wang. Continuous monitoring of distance-based range queries. Knowledge and Data Engineering, IEEE Transactions on, 23(8):1182-1199, Aug. 2011.
[38] M. Cheema, L. Brankovic, X. Lin, W. Zhang, and W. Wang. Continuous monitoring of distance-based range queries. Knowledge and Data Engineering, IEEE Transactions on, 23(8):1182-1199, 2011.
[39] M. Cheema, X. Lin, W. Zhang, and Y. Zhang. Influence zone: Efficiently processing reverse k nearest neighbors queries. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 577-588, April 2011.
[40] M. Cheema, W. Zhang, X. Lin, and Y. Zhang. Efficiently processing snapshot and continuous reverse k nearest neighbors queries. VLDBJ, accepted in Jan. 2012.
[41] L. Chen and X. Lian. Dynamic skyline queries in metric spaces. In Proceedings of the 11th International Conference on Extending Database Technology: Advances in Database Technology, EDBT '08, pages 333-343, 2008.
[42] S. Chen, B. C. Ooi, and Z. Zhang. An adaptive updating protocol for reducing moving object database workload. Proc. VLDB Endow., 3(1-2):735-746, Sept. 2010.
[43] J. Chomicki, P.
Godfrey, J. Gryz, and D. Liang. Skyline with presorting. In ICDE, pages 717-719, 2003.
[44] A. Civilis, C. S. Jensen, and S. Pakalnis. Techniques for efficient road-network-based tracking of moving objects. IEEE Trans. on Knowl. and Data Eng., 17(5):698-712, May 2005.
[45] J. L. Cohon. Multiobjective Programming and Planning. Mathematics in Science and Engineering, 140. Academic Press, New York, 1978.
[46] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. MapReduce online. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation, NSDI '10, pages 21-21, Berkeley, CA, USA, 2010. USENIX Association.
[47] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, pages 143-154, New York, NY, USA, 2010. ACM.
[48] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 3rd edition, 2009.
[49] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, 2008.
[50] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In Proceedings of Twenty-first ACM SIGOPS Symposium on Operating Systems Principles, SOSP '07, pages 205-220, New York, NY, USA, 2007. ACM.
[51] E. Dellis and B. Seeger. Efficient computation of reverse skyline queries. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB '07, pages 291-302, 2007.
[52] K. Deng, X. Zhou, and H. T. Shen. Multi-source skyline query processing in road networks. In ICDE, 2007.
[53] D. P. Dobkin and D. G. Kirkpatrick. Fast detection of polyhedral intersection. Theoretical Computer Science, 27(3):241-253, 1983.
[54] Y. Du, D. Zhang, and T. Xia. The optimal location query. In Proceedings of Advances in Spatial and Temporal Databases, pages 163-180, 2005.
[55] A. Eldawy, Y. Li, M. F. Mokbel, and R. Janardan. CG_Hadoop: computational geometry in MapReduce. In Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 284-293. ACM, 2013.
[56] A. Eldawy and M. F. Mokbel. Pigeon: A spatial MapReduce language. In Data Engineering (ICDE), 2014 IEEE 30th International Conference on, pages 1242-1245. IEEE, 2014.
[57] A. Eldawy and M. F. Mokbel. SpatialHadoop: A MapReduce framework for spatial data. In Proceedings of the IEEE International Conference on Data Engineering, 2015.
[58] R. Farahani and M. Hekmatfar. Facility Location: Concepts, Models, Algorithms and Case Studies. Contributions to Management Science. Physica-Verlag HD, 2011.
[59] R. Z. Farahani, M. SteadieSeifi, and N. Asgari. Multiple criteria facility location problems: A survey. Applied Mathematical Modelling, 34:1689-1709, 2010.
[60] B. Gedik, H. Andrade, K.-L. Wu, P. S. Yu, and M. Doo. SPADE: the System S declarative stream processing engine. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1123-1134, New York, NY, USA, 2008. ACM.
[61] P. Ghaemi, K. Shahabi, J. P. Wilson, and F. Banaei-Kashani. Optimal network location queries. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2010.
[62] P. Ghaemi, K. Shahabi, J. P. Wilson, and F. Banaei-Kashani.
Continuous maximal reverse nearest query on spatial networks. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2012.
[63] P. Ghaemi, K. Shahabi, J. P. Wilson, and F. Banaei-Kashani. A comparative study of two approaches for supporting optimal network location queries. GeoInformatica, 18(2), 2014.
[64] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP '03, pages 29-43, New York, NY, USA, 2003. ACM.
[65] T. Groenfeldt. Are SSDs ready for the enterprise? http://www.ciozone.com/index.php/Server-Technology-Zone/Are-SSDs-Ready-for-the-Enterpriseu.html.
[66] H. Gupta, B. Chawda, S. Negi, T. A. Faruquie, L. V. Subramaniam, and M. Mohania. Processing multi-way spatial joins on map-reduce. In Proceedings of the 16th International Conference on Extending Database Technology, pages 113-124. ACM, 2013.
[67] B. He, M. Yang, Z. Guo, R. Chen, B. Su, W. Lin, and L. Zhou. Comet: batched stream processing for data intensive distributed computing. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, pages 63-74, New York, NY, USA, 2010. ACM.
[68] M. Hekmatfar and M. SteadieSeifi. Multi-Criteria Location Problem. Contributions to Management Science. Physica-Verlag HD, 2009.
[69] Y. Hsueh, R. Zimmermann, and W. Ku. Adaptive safe regions for continuous spatial queries over moving objects. In Database Systems for Advanced Applications, pages 71-76. Springer, 2009.
[70] H. Hu, J. Xu, and D. L. Lee. A generic framework for monitoring continuous spatial queries over moving objects. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD '05, pages 479-490, New York, NY, USA, 2005. ACM.
[71] J. Huang, R. Zhang, R. Buyya, and J. Chen. Melody-join: Efficient earth mover's distance similarity joins using MapReduce. In Data Engineering (ICDE), 2014 IEEE 30th International Conference on, pages 808-819. IEEE, 2014.
[72] C. Hwang and A. Masud. Multiple Objective Decision Making, Methods and Applications: A State-of-the-Art Survey. Lecture Notes in Economics and Mathematical Systems. Springer-Verlag, 1979.
[73] C. Hwang and K. Yoon. Multiple Attribute Decision Making: Methods and Applications: A State-of-the-Art Survey. Lecture Notes in Economics and Mathematical Systems. Springer, 1981.
[74] A. Ishii and T. Suzumura. Elastic stream computing with clouds. IEEE International Conference on Cloud Computing, pages 195-202, 2011.
[75] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.
[76] S. J. Kazemitabar, F. Banaei-Kashani, S. J. Kazemitabar, and D. McLeod. Efficient batch processing of proximity queries by optimized probing. In Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 84-93. ACM, 2013.
[77] S. J. Kazemitabar, F. Banaei-Kashani, and D. McLeod. Geostreaming in cloud. In Proceedings of the 2nd ACM SIGSPATIAL International Workshop on GeoStreaming, IWGS '11, pages 3-9, New York, NY, USA, 2011. ACM.
[78] S. J. Kazemitabar, U. Demiryurek, M. Ali, A. Akdogan, and C. Shahabi. Geospatial stream query processing using Microsoft SQL Server StreamInsight. Proceedings of the VLDB Endowment, 3:1537-1540, September 2010.
[79] S. J. Kazemitabar, A. Sharma, F. Banaei-Kashani, and D. McLeod.
Scalable maximal reverse sklyine query processing using mapreduce. Submitted, 2016. [80] D. Kossmann and T. Kraska. Data management in the cloud: Promises, state- of-the-art, and open questions. Datenbank-Spektrum, 10(3):121–129, 2010. [81] D. Kossmann, F. Ramsak, and S. Rost. Shooting stars in the sky: An online algorithm for skyline queries. In In VLDB, pages 275–286, 2002. [82] H. T. Kung, F. Luccio, and F. P. Preparata. On finding the maxima of a set of vectors. Journal of the ACM, 22:469–476, 1975. [83] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skew-resistant parallel pro- cessing of feature-extracting scientific user-defined functions. In Proceedings of the 1st ACM symposium on Cloud computing, pages 75–86. ACM, 2010. [84] A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44:35–40, April 2010. [85] O. Larichev and D. L. Olson. Multiple Criteria Analysis in Strategic Siting Problems. Kluwer Academic Publishers, 2001. [86] J.Lee,Y.Lee,S.Kang,S.Lee,H.Jin,B.Kim,andJ.Song. Bmq-index: Shared and incremental processing of border monitoring queries over data streams. In Mobile Data Management, 2006. MDM 2006. 7th International Conference on, pages 38–38. IEEE, 2006. 122 [87] M.-W. Lee and S.-w. Hwang. Robust distributed indexing for locality-skewed workloads. In Proceedings of the 21st ACM international conference on Infor- mation and knowledge management, CIKM ’12, pages 1342–1351, New York, NY, USA, 2012. ACM. [88] X. Li, P. Karras, L. Shi, K. Tan, and C. Jensen. Cooperative scalable moving continuous query processing. In Mobile Data Management (MDM), 2012 IEEE 13th International Conference on, pages 69–78. IEEE, 2012. [89] X. Lian and L. Chen. Monochromatic and bichromatic reverse skyline search over uncertain databases. In Proceedings of the 2008 ACM SIGMOD interna- tional conference on Management of data, SIGMOD ’08, pages 213–226, 2008. [90] J. Lin et al. The curse of zipf and limits to parallelization: A look at the stragglers problem in mapreduce. In 7th Workshop on Large-Scale Distributed Systems for Information Retrieval, volume 1, 2009. [91] D. Logothetis, C. Olston, B. Reed, K. C. Webb, and K. Yocum. Stateful bulk processing for incremental analytics. In Proceedings of the 1st ACM symposium on Cloud computing, SoCC ’10, pages 51–62, New York, NY, USA, 2010. ACM. [92] W. Lu, Y. Shen, S. Chen, and B. C. Ooi. Efficient processing of k near- est neighbor joins using mapreduce. Proceedings of the VLDB Endowment, 5(10):1016–1027, 2012. [93] Q.Ma, B.Yang, W.Qian, andA.Zhou. Queryprocessingofmassivetrajectory data based on mapreduce. In Proceedings of the first international workshop on Cloud data management, pages 9–16. ACM, 2009. [94] S. Madden, D. DeWitt, and M. Stonebraker. Database parallelism choices greatly impact scalability. http://www.vertica.com/2007/10/30/ database-parallelism-choices-greatly-impact-scalability. [95] S. Madden, D. DeWitt, and M. Stonebraker. Data base parallelism choices greatly impact scalability. The Database Column: A multi-author blog on database technology and innovation, http://www. databasecolumn. com/2007/10/database-parallelismchoices. html, 2007. [96] B. J. Mohler, W. B. Thompson, S. H. Creem-Regehr, H. L. Pick, and W. H. Warren. Visual flow influences gait transition speed and preferred walking speed. Experimental Brain Research, 181(2):221–228, 2007. 123 [97] M.F.Mokbel, X.Xiong, andW.G.Aref. Sina: scalableincrementalprocessing of continuous queries in spatio-temporal databases. 
In Proceedings of the 2004 ACM SIGMOD international conference on Management of data, SIGMOD ’04, pages 623–634, New York, NY, USA, 2004. ACM. [98] D. M. Mount. Geometric intersection. In Handbook of Discrete and Computa- tional Geometry, chapter 38, pages 857–876, 2004. [99] K. Mouratidis, M. L. Yiu, D. Papadias, and N. Mamoulis. Continuous nearest neighbor monitoring in road networks. In Proceedings of the 32nd interna- tional conference on Very large data bases, VLDB ’06, pages 43–54. VLDB Endowment, 2006. [100] S. Nishimura, S. Das, D. Agrawal, and A. E. Abbadi. Md-hbase: a scalable multi-dimensional data infrastructure for location aware services. In Mobile Data Management (MDM), 2011 12th IEEE International Conference on, vol- ume 1, pages 7–16. IEEE, 2011. [101] C. Olston, G. Chiou, L. Chitnis, F. Liu, Y. Han, M. Larsson, A. Neumann, V. B. N. Rao, V. Sankarasubramanian, S. Seth, C. Tian, T. ZiCornell, and X. Wang. Nova: continuous Pig/Hadoop workflows. In Proceedings of the 2011 international conference on Management of data, SIGMOD ’11, pages 1081–1090, New York, NY, USA, 2011. ACM. [102] D. Papadias, G. Fu, J. M. Chase, and B. Seeger. Progressive skyline compu- tation in database systems. ACM Trans. Database Syst, 30:2005, 2005. [103] D. Papadias, J. Zhang, and N. Mamoulis. Query processing in spatial network databases. In In VLDB, pages 802–813, 2003. [104] Y. Park, J.-K. Min, and K. Shim. Parallel computation of skyline and reverse skyline queries using mapreduce. Proceedings of the VLDB Endow- ment, 6(14):2002–2013, 2013. [105] P. Pesti, L. Liu, B. Bamba, A. Iyengar, and M. Weber. Roadtrack: scaling location updates for mobile clients on road networks with query awareness. Proc. VLDB Endow., 3(1-2):1493–1504, Sept. 2010. [106] S. R. Ramakrishnan, G. Swart, and A. Urmanov. Balancing reducer skew in mapreduce workloads using progressive sampling. In Proceedings of the Third ACM Symposium on Cloud Computing, page 16. ACM, 2012. 124 [107] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. In Proceedings of the 2001 conference on Applica- tions, technologies, architectures, and protocols for computer communications, SIGCOMM ’01, pages 161–172, New York, NY, USA, 2001. ACM. [108] R. Rea. Ibm infosphere streams: Redefining real time analytics, 2010. [109] H. Samet. The quadtree and related hierarchical data structures. ACM Com- puting Surveys (CSUR), 16(2):187–260, 1984. [110] B. Satzger, W. Hummer, P. Leitner, and S. Dustdar. Esc: Towards an elastic stream computing platform for the cloud. Cloud Computing, IEEE Interna- tional Conference on, pages 348–355, 2011. [111] M. Sharifzadeh and C. Shahabi. The spatial skyline queries. In Proceedings of the 32nd international conference on Very large data bases, VLDB ’06, pages 751–762, 2006. [112] Y. Simmhan, B. Cao, M. Giakkoupis, and V. K. Prasanna. Adaptive rate stream processing for smart grid applications on clouds. In Proceedings of the 2nd international workshop on Scientific cloud computing, ScienceCloud ’11, pages 33–38, New York, NY, USA, 2011. ACM. [113] Y. Simmhan, Q. Zhou, and V. K. Prasanna. Semantic information integra- tion for smart grid applications. In Green IT: Technologies and Applications, chapter 19, pages 361–380. Springer Berlin Heidelberg, 2011. [114] R. Sutton, D. Precup, and S. Singh. Between mdps and semi-mdps: A frame- work for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999. [115] F. Szidarovszky, M. Gershon, and L. 
Duckstein. Techniques for multiobjective decision making in systems management. Advances in industrial engineering. Elsevier, 1986. [116] K. Tan, P. Eng, and B. C. Ooi. Efficient progressive skyline computation. In Proceedings of the 27th International Conference on Very Large Data Bases, VLDB ’01, pages 301–310, 2001. [117] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2):1626–1629, 2009. 125 [118] G. Treu and A. Küpper. Efficient proximity detection for location based ser- vices. In Proceedings of the 2nd Workshop on Positioning, Navigation and Communication (WPNC), 2005. [119] G.Treu, T.Wilder,andA.Küpper. Efficientproximitydetectionamongmobile targets with dead reckoning. In Proceedings of the 4th ACM international workshop on Mobility management and wireless access, MobiWac ’06, pages 75–83, New York, NY, USA, 2006. ACM. [120] H. Wang, R. Zimmermann, and W.-S. Ku. Distributed continuous range query processing on moving objects. In DEXA, pages 655–665, 2006. [121] T. White. Hadoop: the definitive guide: the definitive guide. " O’Reilly Media, Inc.", 2009. [122] R. C. Wong, M. T. Ozsu, P. S. Yu, A. W. Fu, and L. Liu. Efficient method for maximizing bichromatic reverse nearest neighbor. In In VLDB, pages 1126– 1149, 2009. [123] R. C.-W. Wong, M. T. Özsu, A. W.-C. Fu, P. S. Yu, L. Liu, and Y. Liu. Maximizingbichromaticreversenearestneighborforlp-normintwo-andthree- dimensional spaces. The VLDB Journal, 20(6):893–919, Dec. 2011. [124] X. Wu, Y. Tao, R. C. Wong, L. Ding, and J. X. Yu. Finding the influence set through skylines. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT ’09, pages 1030–1041, 2009. [125] X.Xiao, B.Yao, andF.Li. Optimallocationqueriesinroadnetworkdatabases. In Proceedings 27th ICDE Conference, 2011. [126] Z.XuandA.Jacobsen. Adaptivelocationconstraintprocessing. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data, SIGMOD ’07, pages 581–592, New York, NY, USA, 2007. ACM. [127] Z. Xu and H.-A. Jacobsen. Processing proximity relations in road networks. In Proceedings of the 2010 international conference on Management of data, SIGMOD ’10, pages 243–254, New York, NY, USA, 2010. ACM. [128] M. L. Yiu, L. H. U, S. Šaltenis, and K. Tzoumas. Efficient proximity detection among mobile users via self-tuning policies. Proc. VLDB Endow., 3(1-2):985– 996, Sept. 2010. 126 [129] E. Zeitler and T. Risch. Massive scale-out of expensive continuous queries. Proceedings of the VLDB Endowment, 4, 2011. [130] C. Zhang, F. Li, and J. Jestes. Efficient parallel knn joins for large data in mapreduce. In Proceedings of the 15th International Conference on Extending Database Technology, pages 38–49. ACM, 2012. [131] D. Zhang, C.-Y. Chan, and K.-L. Tan. Nearest group queries. In Proceedings of the 25th International Conference on Scientific and Statistical Database Man- agement, page 7. ACM, 2013. [132] J. Zhang. Towards personal high-performance geospatial computing (HPC-G): perspectives and a case study. In Proceedings of the ACM SIGSPATIAL Inter- national Workshop on High Performance and Distributed Geographic Informa- tion Systems, HPDGIS ’10, pages 3–10, New York, NY, USA, 2010. ACM. [133] J. Zhang, W.-S. Ku, M.-T. Sun, X. Qin, and H. Lu. Multi-criteria optimal location query with overlapping voronoi diagrams. In EDBT, 2014. [134] Z. Zhou, W. 
Wu, X. Li, M. Lee, and W. Hsu. Maxfirst for maxbrknn. In ICDE’11, pages 828–839, 2011. 127
Abstract
In recent years, geospatial data have been produced en masse, e.g., by billions of smartphones and wearable devices. The exponential growth in data generated by mobile devices on the one hand, and the rate and complexity of recent spatial queries on the other, highlight the importance of scalable query processing techniques. Traditional database technology, which operates on centralized architectures to process persistent and relatively static spatial objects, does not meet the requirements of scalable geospatial data processing.

In this thesis, we focus on two primary challenges in scaling spatial queries, namely the communication and computation costs, while guaranteeing the correctness of query results. We employ techniques such as batch processing and parallelized frameworks to address these challenges.

First, we address the location tracking cost toward achieving scalability in communication-intensive queries. The cost of communicating locations between the moving objects and the query processing server is a key factor in processing many continuous queries over moving objects. The challenge is that increasing the number of queries and objects requires frequent location updates, which drain the battery of mobile devices; hence, existing approaches do not scale unless query correctness is compromised. We propose batch processing of spatial queries as a method to optimize the location tracking cost and scale to large numbers of queries and objects without compromising query correctness or consuming excessive battery power. In our approach, the queries are categorized into independent groups and then processed in parallel. We apply this approach to the proximity detection query and optimize the communication cost while processing millions of queries.

Second, processing some spatial queries has become more resource-intensive in recent years, both because newly introduced queries are more computationally complex than classic ones and because the input size (e.g., the number of GPS-enabled devices) has grown. We propose optimized algorithms and utilize MapReduce to process a complex spatial problem, the Multi-Criteria Optimal Location (MCOL) problem. We first formalize it as a Maximal Reverse Skyline (MaxRSKY) query and then present, for the first time, an optimized solution that scales to millions of objects over a cluster of MapReduce nodes. Specifically, rather than batch-processing the query as is typical of a MapReduce solution, we first partition the space and run a precomputation phase that identifies candidate regions that may host the optimal solution, and then dynamically load-balance these regions across the Reducers to reduce the total execution time.
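To make the batching idea concrete, the listing below is a minimal sketch, not the dissertation's actual algorithm, of how proximity queries might be partitioned into independent groups: two queries are placed in the same group when their regions of interest overlap, so each group can be processed in parallel without affecting the others. The Query class, the circular-overlap criterion, and the union-find layout are all illustrative assumptions.

    import java.util.*;

    // A sketch of grouping proximity queries into independent batches.
    // Two queries are unioned when their circular regions of interest overlap,
    // so each resulting group can be processed in parallel without affecting
    // the others. The Query class and the overlap criterion are illustrative
    // assumptions, not the algorithm used in the thesis.
    public class QueryGrouping {
        static class Query {
            final double x, y, radius;                 // center and reach of the query's region
            Query(double x, double y, double radius) { this.x = x; this.y = y; this.radius = radius; }
            boolean overlaps(Query other) {            // circles overlap iff center distance <= sum of radii
                double dx = x - other.x, dy = y - other.y, r = radius + other.radius;
                return dx * dx + dy * dy <= r * r;
            }
        }

        static int find(int[] parent, int i) {         // union-find with path halving
            while (parent[i] != i) { parent[i] = parent[parent[i]]; i = parent[i]; }
            return i;
        }

        static List<List<Query>> group(List<Query> queries) {
            int n = queries.size();
            int[] parent = new int[n];
            for (int i = 0; i < n; i++) parent[i] = i;
            // Naive O(n^2) pairwise check; a spatial index would replace this at scale.
            for (int i = 0; i < n; i++)
                for (int j = i + 1; j < n; j++)
                    if (queries.get(i).overlaps(queries.get(j)))
                        parent[find(parent, i)] = find(parent, j);
            Map<Integer, List<Query>> groups = new HashMap<>();
            for (int i = 0; i < n; i++)
                groups.computeIfAbsent(find(parent, i), k -> new ArrayList<>()).add(queries.get(i));
            return new ArrayList<>(groups.values());
        }

        public static void main(String[] args) {
            List<Query> qs = List.of(new Query(0, 0, 2), new Query(1, 1, 2), new Query(10, 10, 1));
            System.out.println(group(qs).size() + " independent groups");  // prints "2 independent groups"
        }
    }

Similarly, the dynamic load balancing of candidate regions across Reducers can be approximated with the classic longest-processing-time heuristic. The sketch below assumes that a per-region cost estimate is available and that the number of Reducers is fixed; both are illustrative assumptions, as neither is specified in this abstract.

    import java.util.*;

    // A sketch of balancing candidate regions across Reducers with the
    // longest-processing-time (LPT) heuristic: sort regions by estimated cost,
    // then repeatedly assign the next region to the currently least loaded
    // Reducer. The per-region cost estimates and the fixed Reducer count are
    // illustrative assumptions.
    public class RegionBalancer {
        static int[] assign(double[] regionCost, int numReducers) {
            Integer[] order = new Integer[regionCost.length];
            for (int i = 0; i < order.length; i++) order[i] = i;
            Arrays.sort(order, (a, b) -> Double.compare(regionCost[b], regionCost[a]));  // costliest first
            // Min-heap of {load, reducerId}: the head is always the least loaded Reducer.
            PriorityQueue<double[]> reducers =
                new PriorityQueue<>(Comparator.comparingDouble((double[] r) -> r[0]));
            for (int r = 0; r < numReducers; r++) reducers.add(new double[]{0.0, r});
            int[] assignment = new int[regionCost.length];
            for (int idx : order) {
                double[] least = reducers.poll();      // take the least loaded Reducer
                assignment[idx] = (int) least[1];
                least[0] += regionCost[idx];           // add this region's cost to its load
                reducers.add(least);
            }
            return assignment;
        }

        public static void main(String[] args) {
            double[] costs = {9, 7, 6, 5, 4, 3};
            System.out.println(Arrays.toString(assign(costs, 3)));  // resulting loads: 12, 11, 11
        }
    }

Assigning the costliest regions first bounds the load imbalance of the greedy step and mitigates straggler Reducers, which is the effect the dynamic balancing described above aims for.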