SPATIAL QUERY PROCESSING USING VORONOI DIAGRAMS

by

Mehdi Sharifzadeh

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2007

Copyright 2007 Mehdi Sharifzadeh

Dedication

To my parents and my wife

Acknowledgments

I would like to thank my academic advisor, Professor Cyrus Shahabi, for his great support throughout my PhD studies at the University of Southern California. Cyrus's high academic standards and great vision were always a source of inspiration to me. I would also like to thank Professor Craig Knoblock, Professor Antonio Ortega, Professor Gaurav Sukhatme, and Professor Karen Liu for serving on my PhD qualification and dissertation committees. I am grateful for their support, guidance, and insight.

I am also thankful to all my friends and labmates for their advice and help. During my graduate studies at USC, I shared most of my life with them as my close family members in the U.S.

My very special thanks go to my family for their encouragement and support. My parents gave me the courage to pursue my dreams no matter how far they are. My dear wife, Leila, was my best friend with her love, encouragement, support, advice, and patience over these years. It is impossible to express my appreciation to them.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
1 Chapter One: Introduction
  1.1 Motivation and Problem Statement
  1.2 Thesis Statement
  1.3 Incorporating Voronoi Diagrams into R-trees
  1.4 Road Map
2 Chapter Two: Background
  2.1 Voronoi Diagram
  2.2 Delaunay Graph
  2.3 Convex Hull
  2.4 R-tree
3 Chapter Three: Related Work
  3.1 Voronoi Diagram
  3.2 Nearest Neighbor Queries
  3.3 Spatial Skyline Queries
4 Chapter Four: VoR-Tree: Incorporating Voronoi Diagrams into R-trees
  4.1 Introduction
  4.2 VoR-tree
    4.2.1 Operations
  4.3 I/O-Optimal Query Processing
  4.4 Query Processing using VoR-tree
    4.4.1 k Nearest Neighbor Query (kNN)
    4.4.2 Reverse k Nearest Neighbor Query (RkNN)
    4.4.3 k Aggregate Nearest Neighbor Query (kANN)
    4.4.4 Spatial Skyline Query (SSQ)
5 Chapter Five: Performance Evaluation
6 Chapter Six: Conclusions
References
A Case Studies
  A.1 Optimal Sequenced Route Query
  A.2 Learning Thematic Maps from Labeled Spatial Data
  A.3 Spatial Aggregation in Sensor Network Databases
B Spatial Skyline Query
  B.1 Introduction
  B.2 Formal Problem Definition
    B.2.1 General Skyline Query
    B.2.2 Spatial Skyline Query
  B.3 Foundation
    B.3.1 Theories
  B.4 Solutions
    B.4.1 B2S2: Branch-and-Bound Spatial Skyline Algorithm
      B.4.1.1 B2S2 Correctness
    B.4.2 VS2: Voronoi-based Spatial Skyline Algorithm
      B.4.2.1 VS2 Correctness
  B.5 Continuous Spatial Skyline Query
    B.5.1 Voronoi-based Continuous SSQ (VCS2)
      B.5.1.1 VCS2 Correctness
  B.6 Non-spatial Attributes
  B.7 Performance Evaluation
  B.8 Summary and Future Directions

List of Tables

4.1 Trace of VR-kNN for k = 3
B.1 Summary of notations used in Appendix B
B.2 B2S2 for the example of Figure B.6
B.3 VS2 for the example of Figure B.8
B.4 VCS2 for the example of Figure B.11
B.5 USGS dataset
B.6 Experimental results for extreme scenarios

List of Figures

1.1 Searching for a better query result in the search region (SR) using an R-tree
1.2 The ordinary Voronoi diagram of nine points, the point p and its Voronoi cell V(p)
1.3 Searching for a better query result in the search region using a Voronoi diagram
2.1 Ordinary Voronoi diagram of a set of nine points
2.2 Properties V-3 and V-4 of Voronoi diagrams
2.3 Delaunay graph
2.4 Convex hull
2.5 Points indexed by an R-tree
4.1 a) Voronoi diagram and b) VoR-tree of the points shown in Figure 2.5
4.2 Inserting the point x in VoR-tree
4.3 Search region of a for nearest neighbor query given the query point q
4.4 Search region of a hypothetical query
4.5 1NN algorithm using VoR-tree
4.6 kNN algorithm using VoR-tree
4.7 a) Improving over BFS, b) p in R2NN(q)
4.8 Lemma 1
4.9 VR-RkNN for k = 2
4.10 RkNN algorithm using VoR-tree
4.11 k Aggregate Nearest Neighbor query for Q = {q_1, q_2, q_3} and function a) f = sum, b) f = max
4.12 Search region of p for function f
4.13 VR-kANN for k = 3
4.14 kANN algorithm using VoR-tree
4.15 Finding the cell containing centroid q
4.16 Dominance regions of a) p_1, and b) {p_1, p_3}
4.17 SSQ algorithm using VoR-tree
5.1 I/O vs. k for ab) kNN, and cd) RkNN
5.2 I/O vs. ab) k, and cd) MBR(Q) for kANN
5.3 I/O vs. ab) |Q|, and cd) MBR(Q) for SSQ
A.1 California county map as a typical thematic map
A.2 Merging Voronoi cells corresponding to the points with a common label (e.g., county)
A.3 Two non-uniformly distributed sensor networks
B.1 A set of restaurants P = {r_1, ..., r_4} and team members Q = {m_1, ..., m_3}
B.2 A 2-dimensional database of six objects
B.3 The spatial skyline of a set of nine points
B.4 a) Theorem 10, b) Theorem 12, and c) Theorem 13
B.5 Pseudo-code of the B2S2 algorithm
B.6 Points indexed by an R-tree
B.7 Pseudo-code of the VS2 algorithm
B.8 Example points for VS2
B.9 Visible region of q_1 in CH(Q)
B.10 Change patterns of the convex hull of Q when the location of q changes to q'
B.11 Example points for VCS2
B.12 Query cost vs. number of query points and the area covered by Q
B.13 Query cost and continuous SSQ

Abstract

Spatial query processing in spatial databases, Geographic Information Systems (GIS), and on-line maps attempts to extract specific geometric relations among spatial objects. As a prominent category of spatial queries, the class of nearest neighbor queries retrieves spatial objects that minimize specific functions in terms of their distance to a given object (e.g., the closest data point to a query point).
The most efficient algorithms that address nearest neighbor queries utilize the R-tree index structure to avoid I/O operations for the groups of data objects that do not contain the final query result. However, they still incur unnecessary I/O operations, as R-trees are not efficient at the elaborate exploration of the portion of the data space that includes the result.

In this dissertation, we propose a new index structure, termed VoR-tree, that incorporates Voronoi diagrams into the R-tree index structure for I/O-optimal processing of nearest neighbor queries on point datasets. The neighborhood information encoded in Voronoi diagrams allows a VoR-tree-based algorithm to optimally explore the data space towards the result. We complement our efforts by proposing I/O-optimal algorithms for the k nearest neighbor, reverse k nearest neighbor, k aggregate nearest neighbor, and spatial skyline query types. These algorithms perform the least amount of I/O with respect to the information provided in their underlying VoR-tree. Therefore, they find the query result with fewer I/O operations than their best competitor for each query type.

Chapter 1
Introduction

1.1 Motivation and Problem Statement

Tremendous innovations in sensing devices and data acquisition methods, together with the explosive growth of data processing techniques in different application contexts, have generated petabytes of data to be processed. A remarkable portion of this data belongs to the prominent category of spatial data. The satellite images taken from different regions of the Earth, the medical images of a human body, and the digitized city maps generated by cartographers are examples of spatial data. The commonality among these datasets is that they are all composed of geometric objects (e.g., points or lines) related according to a single coordinate system.
For instance, the dataset representing the road network of a city consists of a set of 2-dimensional points (intersections) and connected line segments (roads) in the space of latitude and longitude (the Earth's surface). The category of spatial data is associated with geometric interpretations and hence demands special data processing techniques. Spatial database management systems (SDBMS) enable efficient hosting of spatial data and provide facilities to answer questions regarding their relationships. These questions, termed spatial queries, seek data objects possessing a specific geometric relation with a given spatial object or among themselves. The geometric relation inquired about is mainly specified via a query function in terms of the distance metric defined in the data space. As an example, consider a spatial dataset containing the locations of all restaurants in the city of Los Angeles. Hence, the data space includes 2-dimensional points in the latitude/longitude coordinate system.

Figure 1.1: Searching for a better query result in the search region (SR) using an R-tree

Suppose that the distance metric representing the distance between any two locations in this space is the length of the straight line segment connecting them (i.e., Euclidean distance). Considering this dataset together with Euclidean distance in the space of R^2, a Nearest Neighbor (NN) query is a spatial query that returns the closest restaurant to a given location. This query is formalized as finding the data object(s) that minimize the Euclidean distance to a given spatial object in the space of R^2. That is, it exposes the closeness relation among the data objects with respect to a given object.

The class of nearest neighbor (NN) queries constitutes an important class of spatial queries, especially in the geospatial domain. These queries search for data objects that minimize a distance-based function with reference to one or more query objects (e.g., points).
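Stated as code, such a query is simply a distance minimization over the dataset. The following is a minimal brute-force sketch of the basic NN query (the restaurant coordinates are invented for illustration); the index-based algorithms discussed in this dissertation return the same answer while avoiding the linear scan:

```python
import math

def nearest_neighbor(points, q):
    # Brute-force NN: scan all points and keep the one minimizing the
    # Euclidean distance to the query point q. O(n) per query; index-based
    # methods aim to answer the same query while touching far fewer points.
    return min(points, key=lambda p: math.dist(p, q))

# Hypothetical restaurant locations (longitude, latitude) and a query point.
restaurants = [(-118.28, 34.02), (-118.25, 34.05), (-118.30, 34.06)]
q = (-118.26, 34.04)
print(nearest_neighbor(restaurants, q))
```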
Examples are k Nearest Neighbor (kNN) [RKV95, HS99], Reverse k Nearest Neighbor (RkNN) [KM00, SAA00, TPL04], k Aggregate Nearest Neighbor (kANN) [PTMH05], and skyline queries [BKS01, PTFS05, SS06b]. The applications of NN queries are numerous in the geospatial decision-making, location-based services, and sensor network domains. For example, in urban planning, one may need to find the set of parks that are closest to a set of houses (termed spatial skyline queries in [SS06b]).

Many studies in the past decade focus on devising different algorithms to address NN queries efficiently. The most successful of these algorithms are those that utilize R-trees [Gut84] as their underlying index structures. The general approach is to use the R-tree in two phases. First, the R-tree's hierarchical structure is used to quickly arrive at the neighborhood of the result set. Second, the R-tree nodes intersecting with the local neighborhood (search region) of an initial answer are investigated to refine this answer and find all the members of the result set. To illustrate, consider Figure 1.1, which shows rectangles corresponding to the nodes of an R-tree. Each rectangle represents the minimum bounding rectangle of the points stored at the leaves of the subtree of the R-tree rooted at a specific node N_i. An R-tree-based algorithm for NN processing examines these nodes in the order given by a function in terms of their distance to the query point(s) (e.g., q). Hence, during the first phase, the algorithm is quickly directed to a leaf node such as N_1 which contains an initial answer a to the query (i.e., a point close to q [1]). This directed traversal of the R-tree benefits from the hierarchical grouping of points in R-tree nodes. The algorithm does not need to examine the groups of points corresponding to nodes such as N_4 that are farther than a to q.
Given the initial answer a, we know that any better result for the query must be located in a limited area, the search region SR of a, shown in dark grey (e.g., for a 1NN query, the SR of a is the circle centered at q which passes through a). Hence, in the second phase the algorithm tries to examine only the nodes, such as N_3, that intersect with this area. While using R-trees is very efficient for the first phase, the algorithm usually ends up unnecessarily investigating many nodes of which none, or only a small subset, of the included points belong to the actual result set (e.g., N_2 and N_3). The reason is that either the exact shape of SR is not easy to compute with respect to the query, or the heuristics used to determine the intersection of a node with SR are too conservative. Therefore, many R-tree-based NN algorithms usually perform excessive I/O operations.

[1] The closeness is determined based on the distance-based function used to specify the query.

Figure 1.2: The ordinary Voronoi diagram of nine points, the point p and its Voronoi cell V(p)

Given a set of data points, a general Voronoi diagram uniquely partitions the space into disjoint regions. The region (cell) corresponding to a point o covers the points in space that are closer to o than to any other data point. Figure 1.2 shows the ordinary Voronoi diagram of nine points in R^2 with respect to Euclidean distance. Our studies on different spatial queries [SS06b, SS06a, SS] show that Voronoi diagrams are extremely efficient in exploring an NN search region, while, due to the lack of an efficient access method, their arrival at this region is slow (see Appendix A). Consider the search region illustrated in Figure 1.3. An NN algorithm that utilizes the Voronoi diagram of the data points to get to this region requires a directed traversal of the diagram. It starts from a random point and iteratively visits the neighboring points towards the boundaries of the region.
At each iteration, the Voronoi neighbor of the current point that is closer to q is visited. Hence, the traversal is again directed by the distance-based function specifying the query. For a dataset of n points uniformly distributed in a square-shaped area, this takes O(sqrt(n/2)) steps on average. In the worst case, the algorithm performs an exhaustive examination of the dataset. However, once the algorithm visits the first point whose Voronoi cell intersects with the search region, it benefits from the connectivity provided by the neighboring cells to completely explore the region. That is, it visits the cells that jointly cover the region, as if tiling the space with the visited cells (grey cells in Figure 1.3). Hence, it minimizes the number of data points that must be examined to refine the initial answer to the query.

Figure 1.3: Searching for a better query result in the search region using a Voronoi diagram

Motivated by the above observation, in this dissertation we design a spatial index structure based on the R-tree and the Voronoi diagram, and develop algorithms utilizing this structure for I/O-optimal processing of the class of nearest neighbor queries on point datasets. The coarse-granule rectangular nodes of the R-tree enable us to get to the search region of an NN query in logarithmic time, while the fine-granule polygons of the Voronoi diagram allow us to optimally tile or cover the region and find the result. Section 1.2 includes our thesis statement and Section 1.3 discusses our main contribution.

1.2 Thesis Statement

The thesis statement of this dissertation is:

A spatial index structure that incorporates the Voronoi diagram into the R-tree can be utilized for I/O-optimal processing of the class of nearest neighbor queries on point datasets.

1.3 Incorporating Voronoi Diagrams into R-trees

In this dissertation, our agenda is to incorporate Voronoi diagrams into the R-tree index structure to benefit from the best of two worlds.
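The directed Voronoi traversal described in Section 1.1, which repeatedly hops to the Voronoi neighbor closest to the query, is the exploration capability this combination borrows from Voronoi diagrams. A minimal sketch, assuming each point's Voronoi neighbors have already been precomputed (this is exactly the information a VoR-tree stores), and using collinear toy points whose Voronoi cells are vertical slabs, so each point's Voronoi neighbors are simply its adjacent points:

```python
import math

def greedy_voronoi_walk(points, neighbors, start, q):
    # Directed traversal of the Voronoi-neighbor (Delaunay) graph: from the
    # current point, step to the Voronoi neighbor closest to the query q;
    # stop when no neighbor improves the distance. The stopping point is the
    # generator whose Voronoi cell contains q, i.e., the nearest neighbor of q.
    current, visited = start, [start]
    while True:
        best = min(neighbors[current], key=lambda i: math.dist(points[i], q))
        if math.dist(points[best], q) >= math.dist(points[current], q):
            return current, visited
        current = best
        visited.append(current)

# Toy example: collinear points; Voronoi neighbors are the adjacent points.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
winner, path = greedy_voronoi_walk(pts, nbrs, 0, (3.2, 0.0))
print(winner, path)  # walks 0 -> 1 -> 2 -> 3 and stops at p_3
```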
The resulting data structure, termed VoR-tree, is a regular R-tree enriched by the Voronoi cell and Voronoi neighbors of each point, stored together with the point's geometry in its data record. VoR-tree is different from access methods for Voronoi diagrams such as the Voronoi history graph [Hag03], the os-tree [Man01], and the D-tree [XZLL04], which utilize the general concept exploited by Voronoi diagrams in specific application settings. Instead, VoR-tree benefits from the best of two worlds: the coarse-granule hierarchical grouping of R-trees and the fine-granule exploration capability of Voronoi diagrams. Unlike similar approaches [ZL01] that index the geometry of Voronoi cells, VoR-tree indexes the actual data objects. Hence, all R-tree-based query processing algorithms are still applicable to VoR-trees. However, adding the connectivity provided by the Voronoi diagram enables us to propose I/O-optimal algorithms for different NN queries. Here, optimality means that we examine only the disk pages including the points whose Voronoi cells, or those of their immediate neighbors, intersect with the current search region (see Section 4.3). While both Voronoi diagrams and R-trees are defined for the space of R^d, we focus on 2-d points, which are widely available and queried in geospatial applications.

To show the efficiency of VoR-trees for NN query processing, in Chapter 4 we study four NN query types and their state-of-the-art R-tree-based algorithms: 1) kNN and Best-First Search (BFS) [HS99], 2) RkNN and TPL [TPL04], 3) kANN and MBM [PTMH05], and 4) Spatial Skyline query (SSQ) and VS2/B2S2 [SS06b]. For each query type, we first define the query. Then, we propose our VoR-tree-based algorithm, followed by the proof of its correctness and its complexities. We show through analysis that our VoR-tree-based algorithms have better I/O complexity than their best competitor for each query type. Our extensive experiments verify that they outperform the pure R-tree-based approaches in terms of I/O cost by wide margins in many scenarios.

1.4 Road Map

The remainder of this dissertation is organized as follows. In Chapter 2, we provide the background for Voronoi diagrams and related structures and discuss their properties. In Chapter 3, we review the related work. In Chapter 4, we propose our index structure that incorporates Voronoi diagrams into the R-tree and subsequently present our I/O-optimal algorithms for processing nearest neighbor queries utilizing this index structure. Chapter 5 shows the results of our experiments with real-world datasets. Finally, Chapter 6 draws the conclusions.

Chapter 2
Background

2.1 Voronoi Diagram

The Voronoi diagram of a given set P = {p_1, ..., p_n} of n points in R^d partitions the space of R^d into n regions. Each region includes all points in R^d with a common closest point in the given set P according to a distance metric D(., .) [OBSC00]. That is, the region corresponding to the point p in P contains all the points q in R^d for which we have

    D(q, p) <= D(q, p')  for all p' in P, p' != p    (2.1)

The equality holds for the points on the borders of p's and p''s regions. Incorporating arbitrary distance metrics D(., .) into Equation 2.1 results in different variations of Voronoi diagrams. A thorough discussion of these variations is presented in [OBSC00]. Figure 2.1 shows the ordinary Voronoi diagram of nine points in R^2 where the distance metric is Euclidean. We refer to the region V(p) containing the point p as its Voronoi cell. With Euclidean distance in R^2, V(p) is a convex polygon. Each edge of this polygon is a segment of the perpendicular bisector of the line segment connecting p to another point of the set P. We call each of these edges a Voronoi edge and each of its end-points (vertices of the polygon) a Voronoi vertex of the point p. For each Voronoi edge of the point p, we refer to the corresponding point in the set P as a Voronoi neighbor of p. We use VN(p) to denote the set of all Voronoi neighbors of p. We also refer to the point p as the generator of the Voronoi cell V(p). Finally, the set given by

    VD(P) = {V(p_1), ..., V(p_n)}    (2.2)

is called the Voronoi diagram of the set P with respect to the distance function D(., .). Throughout the rest of this dissertation, we use the Euclidean distance function in R^2 (i.e., D = L_2 and P a subset of R^2). Also, we simply use Voronoi diagram to denote the ordinary Voronoi diagram of a set of points in R^2.

Figure 2.1: Ordinary Voronoi diagram of a set of nine points

We review some important properties of Voronoi diagrams [OBSC00]. We utilize these properties in Chapter 4.

Property V-1: The Voronoi diagram of a set P of points, VD(P), is unique.

Property V-2: The average number of vertices per Voronoi cell of the Voronoi diagram of a set of points in R^2 does not exceed six. That is, the average number of Voronoi neighbors of each point of P is at most six. Notice that this property holds regardless of the distribution of the points of P in R^2. We use this property to derive the complexity of our query processing algorithms in Chapter 4.

Figure 2.2: Properties V-3 and V-4 of Voronoi diagrams

Property V-3: Given the Voronoi diagram of P, the nearest point of P to a point p in P is one of the Voronoi neighbors of p. That is, the closest point to p is one of the generator points whose Voronoi cells share an edge with V(p). In the Voronoi diagram illustrated in Figure 2.2, the closest generator point to p_1 is p_2, which is one of p_1's Voronoi neighbors. In Chapter 4, we utilize the generalization of this property, given below, in our kNN query processing algorithm.

Property V-4: Let p_1, ..., p_k be the k > 1 nearest points of P to a point q (i.e., p_i is the i-th nearest neighbor of q).
Then, p_k is a Voronoi neighbor of at least one point p_i in {p_1, ..., p_{k-1}} (p_k in VN(p_i); see [KS] for a proof). In Figure 2.2, the closest generator point to the point q is p_1, whose Voronoi cell contains q. The second and third nearest points to q (p_2 and p_3) are both Voronoi neighbors of p_1.

2.2 Delaunay Graph

Consider an undirected graph DG(P) = G(V, E) with the set of vertices V = P. For each two points p and p' in V, there is an edge connecting p and p' in G iff p' is a Voronoi neighbor of p in the Voronoi diagram of P. The graph G is called the Delaunay graph of the points in P. This graph is a connected planar graph. Figure 2.3 illustrates the Delaunay graph corresponding to the points of Figure 2.1. In Chapter 4, we traverse the Delaunay graph of the database points to find the set of skyline points. Also, we traverse this graph to find the set of reverse nearest neighbors of a query point in the same chapter.

Figure 2.3: Delaunay graph

2.3 Convex Hull

The convex hull of the points in P, a subset of R^d, is the unique smallest convex polytope (polygon when d = 2) which contains all the points in P [dBvKOS00]. Figure 2.4 shows the convex hull of the points of Figure 2.1 as a hexagon. The set of vertices of the convex hull of the points in P is a subset of P. We use convex point to denote any of these vertices and non-convex point to refer to all other points of P. In Figure 2.4, p' and p are convex and non-convex points, respectively. We also use CH(P) and CH_v(P) to refer to the convex hull of P and the set of its vertices, respectively.

Figure 2.4: Convex hull

It is clear that the shape of the convex hull of a set P depends only on the convex points in P. Consequently, the location of any non-convex point p in P does not affect the shape of CH(P). The following property shows the relation between the Voronoi diagram and the convex hull of the same set of points.

Property V-5: In the ordinary Voronoi diagram of the points of a set P, a subset of R^2, the Voronoi cell of p_i in P is unbounded iff p_i is on the boundary of the convex hull of P (i.e., p_i in CH_v(P)). Comparing Figures 2.1 and 2.4, it is clearly seen that the Voronoi cell of any convex point such as p' is unbounded.

Throughout the remainder of this dissertation, we assume that when required we clip the unbounded Voronoi cells using a universal bounding rectangle that includes all the database points. For a wide range of applications, especially in the geospatial domain, such a rectangle can be easily defined (e.g., a rectangle covering the entire city).

2.4 R-tree

The R-tree [Gut84] is the most prominent index structure widely used for spatial query processing. As a straightforward extension of B-trees to multidimensional spaces, R-trees group the data points in R^d using d-dimensional rectangles. The grouping is based on the closeness of the points.

Figure 2.5: Points indexed by an R-tree

Figure 2.5 shows the R-tree built using the set P = {p_1, ..., p_13} of points in R^2. Here, the capacity of each node is three entries. The leaf nodes N_1, ..., N_5 store the coordinates of the grouped points together with optional pointers to their corresponding records. Each intermediate node (e.g., N_6) contains the Minimum Bounding Rectangle (MBR) of each of its child nodes (e.g., e_1 for node N_1) and a pointer to the disk page storing the child. The same grouping criterion is used to group intermediate nodes into upper-level nodes. Therefore, the MBRs stored in the single root of the R-tree collectively cover the entire dataset P. In Figure 2.5, the root node R contains MBRs e_6 and e_7 enclosing the points in nodes N_6 and N_7, respectively.
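R-tree search algorithms prune a node when even the closest corner of its MBR is too far from the query to improve the best answer found so far. A minimal sketch of this lower bound for 2-d axis-aligned rectangles (the coordinates are invented for illustration):

```python
import math

def mindist(mbr, q):
    # Minimum possible Euclidean distance between the query point q and any
    # point inside an axis-aligned MBR given as ((xmin, ymin), (xmax, ymax)).
    # Per axis, measure how far q lies outside the rectangle (0 if inside).
    (xmin, ymin), (xmax, ymax) = mbr
    dx = max(xmin - q[0], 0.0, q[0] - xmax)
    dy = max(ymin - q[1], 0.0, q[1] - ymax)
    return math.hypot(dx, dy)

# A node whose mindist exceeds the distance of the current best answer can
# be skipped without reading its disk page.
print(mindist(((2, 2), (5, 4)), (0, 0)))  # q outside the MBR
print(mindist(((2, 2), (5, 4)), (3, 3)))  # q inside the MBR: 0.0
```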
R-tree-based processing algorithms utilize some metrics to bound their search space using the MBRs stored in the nodes. The widely used function ismindist(N;q) which returns the minimum possible distance between a pointq and any point in the MBR of 13 nodeN. Figure 2.5 showsmindist(N 6 ;q) andmindist(N 7 ;q) forq. In Chapter 4, we show how R-tree-based approaches use this lower-bound function. 14 Chapter 3 Related Work 3.1 Voronoi Diagram Although V oronoi diagrams first were used in 1644 by Rene Descartes, they were named after the mathematician Georgy Fedoseevich V oronoi in 1905. In 1850, a German math- ematician Johann Peter Gustav Lejeune Dirchlet applied them to 2- and 3-dimensional points [OBSC00]. Different variations of V oronoi diagrams have been used as index structures for the nearest neighbor search. In [Hag03], Hagedoorn introduces a directed acyclic graph based on V oronoi diagrams. He uses the data structure to answer exact nearest-neighbor queries with respect to general distance functions in O(log 2 n) time using only O(n) space. In [Man01], Maneewongvatana propose a hierarchical index structure for point data, termed overlapped split tree (os-tree). Each node of os-tree is associated with a convex polygon referred as the cover which includes all the points in space whose nearest neighbor is a data point associated with the node. Hence, the cover of node n contains the union of the V oronoi cells of points stored in the subtree rooted by n. That is, the same principal used in V oronoi diagrams is utilized in os-trees to partition the space using subset of V oronoi edges into regions each have different sets of nearest neighbors. Subsequently, an NN query is addressed by recursively visiting os-tree nodes starting from the root towards the nodes whose cover includes the query point. At leaves, the candidate points are examined and pruned towards the final answer. In [XZLL04], Xu et al. 
study indexing location data in location-based wireless services. They propose 15 D-tree, an index structure that can be used to efficiently process NN queries (planar point queries in their terminology). D-tree simply indexes the point data using the subsets of the points’ V oronoi edges that partition the entire set into two subsets. This partitioning principle is used in all levels of the tree. Many studies also focus on utilizing individual V oronoi cells for query processing. Korn et al. [KM00] describe four examples of the V oronoi cell computation problem drawn from different spatial/vector space domains in which the influence set of a given point is required. Stanoi et al. in [SRAA01] combine the properties of V oronoi cells (influence sets in their terminology) with the efficiency of R-trees to retrieve reverse nearest neighbors of a query point from the database. Zhang et al. [ZZP + 03] deter- mine the so-called validity region around a query point as the V oronoi cell of its nearest neighbor. The cell is the region within which the result of the nearest neighbor query remains valid as the location of the query point is changing. To provide an efficient similarity search mechanism in a peer-to-peer data network, Banaei-Kashani et al. pro- pose that each node maintains its V oronoi cell based on its local neighborhood in the content space [BKS04]. As a more practical example, Kolahdouzan et al. [KS] propose a V oronoi-based data structure to improve the performance of exactk nearest neighbor search in spatial network databases. 3.2 Nearest Neighbor Queries Numerous algorithms for k nearest neighbor (kNN) queries in spatial databases have been proposed. A majority of these algorithms are based on utilizing spatial index structures such as R-tree and usually perform in two filter and refinement steps. Rous- sopoulos et al. in [RKV95] present a branch-and-bound R-tree traversal algorithm that uses two mindist and minmaxdist metrics to prune R-tree nodes. 
Cheung et al. in [CF] simplify this algorithm without reducing its efficiency. Korn et al. in [KSF+96] present a multi-step kNN search, and Seidl et al. in [SK98] propose an optimal version of this method. Hjaltason et al. [HS99] propose Best-First Search (BFS), an incremental nearest neighbor algorithm that utilizes an index structure and a priority queue. BFS is an I/O-optimal algorithm with respect to the R-tree, as it performs the least I/O given the information provided by the R-tree it uses. Jung et al. in [JP02] propose an algorithm to find the shortest distance between any two points by partitioning a large graph into layers of smaller subgraphs and pushing up the pre-computed shortest paths between the borders of the subgraphs in a hierarchical manner. Jensen et al. in [JKPT03] discuss data models and graph representations for NN queries in road networks and provide alternative solutions. Papadias et al. in [PZMT] propose a solution for NN queries in network databases that generates and expands a search region around the query point. In [KS], Kolahdouzan et al. propose a solution for NN queries in road network databases that is based on network Voronoi diagrams. Other variations of kNN queries have also been studied, and their solutions are usually motivated by the solutions of the regular k nearest neighbor query. Sistla et al. in [SWCS] first identified the importance of continuous nearest neighbor (CNN) queries and described modeling methods and query languages for the expression of these queries. Song et al. in [SR] propose the first algorithms for CNN queries, based on performing several point-NN queries at predefined sample points. Tao et al. in [TPS] propose a solution for CNN queries based on performing one single query for the entire path. Ferhatosmanoglu et al. in [FSAA01] introduce the problem of constrained NN queries, where the nearest neighbors in a specific range or direction are requested.
Korn et al. in [KM] introduce the problem of reverse nearest neighbor (RNN) queries. They propose a pre-computed index structure, termed the RNN-tree, that efficiently addresses only R1NN queries. Yang et al. in [YL] propose an index structure, the Rdnn-tree, that can efficiently address both RNN and NN queries. Stanoi et al. in [SRAA01] utilize an interesting property of first RNNs (discussed in Chapter 4) to avoid the pre-computation phase of the previous two solutions. They perform six constrained NN queries to find the candidate results. This phase is followed by six more NN queries to discard the false hits. Benetis et al. in [BJKS06] propose to use the TPR-tree, a time-parameterized index structure for continuously moving objects, to address nearest and reverse nearest neighbor queries for moving objects. In their setting, both query and data points are moving and the query result is retrieved for an entire given time period. Koudas et al. in [KOT04] discuss approximate NN queries with guaranteed error for streams, where access to the entire data is not feasible. Finally, Tao et al. in [TPL04] propose an R-tree-based solution for RkNN queries in multidimensional dynamic databases. Their solution, termed TPL, does not require prior knowledge of k and hence does not perform any pre-computation. As a filter-refinement approach, TPL traverses the R-tree to search for candidate results and prune the false hits. The two steps are integrated so that each R-tree node is accessed at most once. The class of aggregate nearest neighbor (ANN) queries has been introduced by Papadias et al. in [PTMH05]. They study different realizations of this query and propose several R-tree-based solutions. Their superior approach, Minimum Bounding Method (MBM), utilizes the mindist metric to examine only the nodes that most likely contain the result. They further extend their algorithms to the two scenarios of disk-resident queries and approximate ANN retrieval.
They also provide a theoretical cost model for their algorithms.

3.3 Spatial Skyline Queries

The best-known skyline algorithm that can answer SSQs is the Branch-and-Bound Skyline algorithm (BBS) [PTFS05]. BBS is a progressive optimal algorithm for the general skyline query. In the setting of BBS, a dynamic skyline query specifies a new n-dimensional space based on the original d-dimensional data space. First, each point p in the database is mapped to the point p̂ = (f_1(p), ..., f_n(p)), where each f_i is a function of the coordinates of p. Then, the query returns the general (i.e., static) skyline of the new space (the corresponding points in the original space). We can define the spatial skyline query as a special case of the dynamic query. Given the query set Q, we use f_i = D(p, q_i) to map each point p to p̂. Therefore, our spatial skyline can be defined as a special case of the dynamic skyline presented in [PTFS05]. While BBS is a nice general algorithm for any function f, since it has no knowledge of the geometry of the problem space, it is not as efficient as our proposed algorithms for the spatial case where f is a distance function. The Block Nested Loop (BNL) approach [BKS01] for general skyline computation can also address SSQs. BNL outperforms BBS when the number of skyline points is small and the dimensionality is high [PTFS05]. Although with a large set of query points for SSQ the number of derived attributes is high, the number of original attributes used by BBS remains unchanged; the points usually have only two dimensions in real-world examples. This makes BBS the best competitor approach, and hence we compare our techniques only with BBS in Section B.7. There are studies in the area of spatial databases related to SSQ. Papadias et al. [PTMH05] proposed efficient algorithms to find the point with minimum aggregate distance to the query points in Q. The aggregate distance is a monotone function over the distances to the points of Q.
The optimum point is one of the spatial skyline points with respect to Q. Their algorithm seeks only the optimum point, so no dominance check is required. Huang and Jensen [HJ04] studied the interesting problem of finding locations of interest which are not dominated with respect to only two attributes: their network distance to a single query location q and the detour distance from q's predefined route through the road network. Their proposed algorithms rely on existing nearest neighbor and range query approaches to find a candidate set. They then apply naive in-memory skyline computation on the candidate set to extract the result. While their in-route skyline query is distance-based, similar to SSQ, it focuses on a specific application. Our SSQ problem, however, targets a different and broader range of applications, for which we propose efficient customized skyline computation algorithms in vector space.

Chapter 4
VoR-Tree: Incorporating Voronoi Diagrams into R-trees

4.1 Introduction

The introduction of R-trees [Gut84] (and their extensions) for indexing multi-dimensional data marked a new era in developing novel R-tree-based algorithms for various forms of Nearest Neighbor (NN) queries. These algorithms utilize the simple rectangular grouping principle used by the R-tree, which represents close data points with their Minimum Bounding Rectangle (MBR). They generally use the R-tree in two phases. In the first phase, starting from the root node, they iteratively search for an initial result a. To find a, they visit/extract the nodes that minimize a function of the distance(s) between the query point(s) and the MBR of each node. Meanwhile, they use heuristics to prune the nodes that cannot possibly contain the answer. During this phase, the R-tree's hierarchical structure enables these algorithms to find the initial result a in logarithmic time. Now, the local neighborhood of a may contain a better result.
The portion of the data space close to a that may contain a result a′ better than a is denoted as the Search Region (SR) of a for an NN query [HS99]. To complete the result, an R-tree-based algorithm must explore this portion of space and examine any possibly better data. The best approach is to visit/examine only the points in the SR of a in the direction that most likely contains a better result (e.g., from a towards the query point q for a better NN). However, with R-tree-based algorithms the only way to retrieve a point in this neighborhood is through the R-tree's leaf nodes. Hence, in the second phase, a blind traversal must repeatedly go up the tree to visit higher-level nodes and then come down the tree to visit their descendants and the leaves to explore this neighborhood. This traversal is combined with pruning those nodes that do not intersect the SR of a and hence contain no point of the SR. Here, different algorithms use alternative thresholds and heuristics to decide which R-tree nodes should be investigated further and which ones should be pruned. While the employed heuristics are always safe, i.e., they cover the entire SR and hence guarantee the completeness of the result, they are highly conservative for two reasons: 1) They use the distance to the coarse-granule MBR of the points in a node N as a lower bound for the actual distances to the points in N. This lower-bound metric is not tight enough for many queries (e.g., RkNN). 2) For some queries (e.g., kANN), the irregular shape of the SR makes it difficult to identify intersecting nodes using a heuristic. As a result, the algorithm examines all the nodes intersecting a larger superset of the SR. That is, the conservative heuristics prevent the algorithm from pruning many nodes/points that are not even close to the actual result.
In sum, to explore a portion of the space starting from a certain location a and subsequently apply a diffusive examination of the points in that local neighborhood, R-tree-based algorithms have to blindly visit/examine higher-level nodes using conservative heuristics. The reason is that the coarse-granule grouping used by R-trees does not provide a complete distance-based connectivity among the data points. Voronoi diagrams are extremely efficient in exploring a local neighborhood in a geometric space. Their dual representation, the Delaunay graph, connects any two data points whose corresponding cells share a border (and hence are close in a certain direction). Thus, to explore the neighborhood of a point a, it suffices to start from the Voronoi cell containing a and repeatedly traverse unvisited neighboring Voronoi cells (as if we tile the space using visited cells). The fine-granule polygons of the Voronoi diagram allow an optimal coverage of any complex-shaped neighborhood. This makes Voronoi diagrams efficient structures for exploring the search region while processing NN queries. Moreover, the search region of many NN queries can be redefined as a limited neighborhood through the edges of the Delaunay graph. Consequently, an algorithm can traverse any complex-shaped SR without requiring the MBR-based heuristics of the R-tree (e.g., in Section 4.4 we prove that the reverse k-th NN of a point p is at most k edges far from p in the Delaunay graph of the data). In this chapter, we first propose the VoR-tree, an index structure for data points that incorporates Voronoi diagrams into the R-tree. Adding the connectivity provided by the Voronoi diagram enables us to utilize the VoR-tree in I/O-optimal algorithms for different NN queries. We study four NN query types and their state-of-the-art R-tree-based algorithms: 1) kNN and Best-First Search (BFS) [HS99], 2) RkNN and TPL [TPL04], 3) kANN and MBM [PTMH05], and 4) the Spatial Skyline query (SSQ) and VS^2/B^2S^2 [SS06b].
Throughout the chapter, we discuss the best approaches in the literature for each NN query type in detail. For each query, we propose our VoR-tree-based algorithm followed by the proof of its correctness and its complexities. Finally, through extensive experiments using two real-world datasets, we evaluate the performance of our algorithms. For 1NN queries, our algorithm uses the hierarchical structure of the VoR-tree and the geometry of the Voronoi cells stored at leaf nodes to prune many nodes visited by the competitor R-tree-based BFS approach [HS99]. For kNN queries, our incremental algorithm uses an important property of Voronoi diagrams to retrieve/examine only the points neighboring the first (k−1) closest points to the query point. Our experiments verify that our algorithm outperforms BFS [HS99] in terms of I/O cost (number of accessed disk pages; up to 18% improvement). For RkNN queries, we prove that the result is also at most k edges far from the query point q in the Delaunay graph of the data. Unlike TPL [TPL04], our algorithm is scalable with respect to k and outperforms TPL in terms of I/O cost by at least 3 orders of magnitude. For kANN queries, our algorithm, through a diffusive exploration of the irregular-shaped SR, prunes many nodes/points examined by the MBM algorithm [PTMH05]. It accesses a small fraction of the disk pages accessed by MBM (a 50% decrease in I/O when the MBR of the query points covers 4% of the entire data space). Finally, we use VoR-trees to address SSQs [SS06b]. We consider SSQ an NN query, as it finds the points that collectively minimize the distances to a set of query points under the definition of spatial domination. Built on the Voronoi-based algorithm VS^2 of [SS06b], our I/O-optimal algorithm provides proven guarantees on the number of visited Voronoi cells. Unlike VS^2, it incrementally returns the result ordered by a given arbitrary monotone function.
Our experiments show that our VoR-tree-based algorithm is more I/O-efficient than the R-tree-based algorithm B^2S^2 of [SS06b] (up to 17% less I/O).

4.2 VoR-tree

In this section, we show how we use an R-tree to index the Voronoi diagram of a set of points together with the actual points. We refer to the resulting index structure as the VoR-tree, an R-tree of point data augmented with the points' Voronoi diagram. Suppose that we have stored all the data points of a set P in an R-tree. For now, assume that we have pre-built VD(P), the Voronoi diagram of P. As we discussed in Section 2.4, each leaf node of the R-tree stores a subset of the data points of P. The leaves also include pointers to the records containing extra information about the corresponding points. In the record of the point p stored in leaf node N, we store the set of Voronoi neighbors of p (i.e., VN(p)) and the vertices of the Voronoi cell of p (i.e., V(p)). The above instance of the R-tree built using the points in P is the VoR-tree of P. Figure 4.1a shows the Voronoi diagram of the same points indexed by the R-tree in Figure 2.5. To bound the Voronoi cells with infinite edges (e.g., V(p_3)), we clip them using a large rectangle bounding the entire area covered by the points in P (the dotted rectangle). Figure 4.1b illustrates the VoR-tree of the points of P. For simplicity, it shows only the contents of leaf node N_2, including points p_4, p_5, and p_6, the generators of the grey Voronoi cells depicted in Figure 4.1a. The record associated with each point p in N_2 includes both the Voronoi neighbors and the Voronoi vertices of p in a common sequential order. We refer to this record as the Voronoi record of p. Each Voronoi neighbor of p maintained in this record is actually a pointer to the disk page storing the neighbor point's information (including its Voronoi record). In Section 4.4, we use these pointers to navigate within the Voronoi diagram.
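The leaf-level layout just described (a generator point, pointers to its neighbors' records, and its cell vertices) can be sketched as follows; the class and field names here are ours, not the thesis's:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]

@dataclass
class VoronoiRecord:
    point: Point                                          # the generator point p
    neighbors: List[int] = field(default_factory=list)    # page ids of the VN(p) records
    cell: List[Point] = field(default_factory=list)       # vertices of V(p), clipped

@dataclass
class VoRLeaf:
    mbr: Tuple[float, float, float, float]                # MBR of the points in this leaf
    records: List[VoronoiRecord] = field(default_factory=list)
```

Storing the neighbor list as page ids is what lets a query traverse the Delaunay graph directly from disk, without going back up through the R-tree.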
In sum, the VoR-tree of P is an R-tree on P blended with the Voronoi diagram VD(P) and the Delaunay graph DG(P) of P. It is clear that the Voronoi neighbors and the Voronoi cell of the same point can be computed from each other. However, the VoR-tree stores both of these sets to avoid the computation cost when both are required. For applications focusing on specific queries, only the set required by the corresponding query processing algorithm can be stored.

4.2.1 Operations

In this section, we show the basic operations required to build and maintain a VoR-tree. Given the set of points P, the batch operation to build the VoR-tree of P first uses a classic approach such as Fortune's sweepline algorithm [OBSC00] to build the Voronoi diagram of P. The Voronoi neighbors and the vertices of the cell of each point are then stored in its record.

[Figure 4.1: a) Voronoi diagram and b) VoR-tree of the points shown in Figure 2.5]

Finally, it uses an efficient bulk construction approach for R-trees [dBSW97] to index the points in P together with their Voronoi records. The resulting R-tree is the VoR-tree of P. For dynamic maintenance of VoR-trees, we require insert, delete and update operations. To insert a new point x into a VoR-tree, we first use the VoR-tree (or the corresponding R-tree) to locate p, the closest point to x in the set P. This is the point whose Voronoi cell includes x. Then, we insert x into the corresponding R-tree. Finally, we need to build/store the Voronoi cell/neighbors of x and subsequently update those of its neighbors in their Voronoi records.

[Figure 4.2: Inserting the point x into a VoR-tree]
We use the algorithm for incremental building of Voronoi diagrams presented in [OBSC00]. Figure 4.2 shows this scenario. Inserting the point x, residing in the Voronoi cell of p_1, changes the Voronoi cells/neighbors (and hence the Voronoi records) of points p_1, p_2, p_3 and p_4. We first clip V(p_1) using the perpendicular bisector of the line segment xp_1 (i.e., the line B(x, p_1)) and store the new cell in p_1's record. We also update the Voronoi neighbors of p_1 to include the new point x. Then, we select the Voronoi neighbor of p_1 corresponding to one of the (possibly two) Voronoi edges of p_1 that intersect B(x, p_1) (e.g., p_2). We apply the same process using the bisector line B(x, p_2) to clip and update V(p_2). Subsequently, we add x to the Voronoi neighbors of p_2. Similarly, we iteratively apply the process to p_3 and p_4 until p_1 is selected again. At this point the algorithm terminates; as the result, it has updated the Voronoi records of the points p_i and computed V(x) as the region removed from the clipped cells. The Voronoi neighbors of x are also set to the generator points of the updated Voronoi cells. Finally, we store the updated Voronoi cells and neighbors in the Voronoi records of the affected points. Notice that finding the affected points p_1, ..., p_4 is straightforward using the Voronoi neighbors and the geometry of the Voronoi cells stored in the VoR-tree. To delete a point x from a VoR-tree, we first locate the Voronoi record of x using the corresponding R-tree. Then, we access its Voronoi neighbors through this record. The cells and neighbors of these points must be updated after the deletion of x. To perform this update, we use the algorithm in [OBSC00]. It simply uses the intersections of the perpendicular bisectors of each pair of neighbors of x to update their Voronoi cells.

[Figure 4.3: Search Region of a for a nearest neighbor query given the query point q]
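The core geometric step shared by these maintenance operations is clipping a convex cell by a perpendicular bisector. A minimal sketch of that step, assuming cells are counter-clockwise lists of 2-D vertex tuples (helper names are ours, not the thesis's):

```python
def side(q, x, p):
    """> 0 when q is strictly closer to p than to x (i.e., on p's side of B(x, p))."""
    return (q[0] - x[0])**2 + (q[1] - x[1])**2 - (q[0] - p[0])**2 - (q[1] - p[1])**2

def clip_by_bisector(cell, x, p):
    """Keep the part of convex polygon `cell` on p's side of the bisector B(x, p).
    Standard half-plane clipping; `side` is linear along each edge because the
    quadratic terms cancel, so t = fa / (fa - fb) finds the crossing exactly."""
    out = []
    n = len(cell)
    for i in range(n):
        a, b = cell[i], cell[(i + 1) % n]
        fa, fb = side(a, x, p), side(b, x, p)
        if fa >= 0:                       # vertex a survives the clip
            out.append(a)
        if (fa > 0) != (fb > 0) and fa != fb:
            t = fa / (fa - fb)            # edge crosses the bisector: add the crossing
            out.append((a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1])))
    return out
```

For insertion, V(p_1) is clipped by B(x, p_1); the pieces removed from the affected cells are what the text assembles into V(x).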
We also remove x from the records of its neighbors and add any possible new neighbor to these records. At this point, it is safe to delete x from the corresponding R-tree. The update operation of the VoR-tree, changing the location of x, is performed as a delete followed by an insert. The average time and I/O complexities of all three operations are constant. With both insert and delete operations, only the Voronoi neighbors of the point x (and hence its Voronoi record) are changed. These changes must also be applied to the Voronoi records of these points, which are directly accessible through that of x. According to Property V-2, the average number of Voronoi neighbors of a point is six. Therefore, the average time and I/O complexities of the insert/delete/update operations on VoR-trees are constant.

4.3 I/O-Optimal Query Processing

In this section, we define the concept of I/O-optimality for a query processing algorithm. To provide a formal definition, we first define the Search Region of a query. Both definitions have been adopted from [HS99].

Definition 1. Given a database P of points in R^d, a query Q and a database point a ∈ P, the Search Region (SR) of a for query Q is the portion of R^d that may contain a result a′ better than a.

Definition 1 simply states that if we have already retrieved a potential answer a ∈ P for the query Q, the SR of a is the subset of R^d that may contain a database point a′ that is a better answer to Q. To illustrate, consider Figure 4.3, where the filled dots show the database points and the single white dot is the query point q. The point a is a potential answer for the nearest neighbor (NN) query that looks for the closest database point to q. Any database point closer than a to q (a better answer for our NN query) resides in the disk C centered at q with radius D(q, a). This disk, shown in grey in Figure 4.3, is the SR of the point a for the NN query given q.
In Section 4.4, we show the SRs of the four nearest neighbor queries for which we provide query processing algorithms. Now, we define I/O-optimality for a query processing algorithm.

Definition 2. A query processing algorithm A is I/O-optimal with respect to the index structure it employs iff A finds the query result with the least amount of I/O with respect to the information provided in the index structure.

Definition 2 states that an algorithm is I/O-optimal if and only if it optimally utilizes the index structure to first find an approximate answer to the query Q and then accesses the minimum number of disk pages to explore the search region of this answer for possibly better results. The optimality of the first step is usually accomplished by utilizing the grouping principle used by the index structure (e.g., the rectangular hierarchy of the R-tree and VoR-tree) as well as a set of heuristics to get close to the portion of the data space that contains the result. In the second step, an I/O-optimal algorithm only examines the disk pages including at least one point inside the SR of the current answer. With R-tree-based algorithms, I/O-optimality means that the algorithm only accesses those nodes of the R-tree whose MBRs intersect the SR. That is, it efficiently utilizes the extra information provided in the index structure about the data points to prune the search space. Any R-tree node that does not intersect the current SR (that of the best answer found so far) cannot contain a better answer and hence is not required to be accessed. For example, Hjaltason and Samet's BFS algorithm [HS99] for kNN queries is I/O-optimal with respect to the R-tree, as it only examines the nodes whose MBRs intersect the circular SR of the NN query. Papadias et al. also provide an I/O-optimal R-tree-based algorithm for general skyline queries in [PTFS05].

[Figure 4.4: Search Region of a hypothetical query]
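For the circular SR of an NN query, the R-tree pruning test above reduces to comparing mindist against the radius D(q, a); a sketch, with MBRs given as (x_lo, y_lo, x_hi, y_hi) tuples (function names are ours):

```python
import math

def mindist(mbr, q):
    """Minimum possible distance from point q to any point inside the MBR."""
    xlo, ylo, xhi, yhi = mbr
    dx = max(xlo - q[0], 0.0, q[0] - xhi)   # 0 when q is within the x-extent
    dy = max(ylo - q[1], 0.0, q[1] - yhi)   # 0 when q is within the y-extent
    return math.hypot(dx, dy)

def intersects_sr(mbr, q, bestdist):
    """A node can contain a better NN only if its MBR intersects the SR disk."""
    return mindist(mbr, q) < bestdist
```

Since mindist is a lower bound on the distance to every point in the node, discarding a node with mindist ≥ bestdist is always safe; as the chapter notes, the bound's looseness is exactly why such pruning is conservative.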
Now, we discuss I/O-optimality for the algorithms that utilize VoR-trees as their underlying index structure. The extra information provided in the Voronoi records of a VoR-tree facilitates exploring the SR of a query to examine each of its points. Figure 4.4 illustrates the SR of a hypothetical query (shown in dark grey) as well as the Voronoi cells of the data points in the same neighborhood. An I/O-optimal VoR-tree-based algorithm traverses the Voronoi diagram of the data points through the pointers provided in the Voronoi records and examines only those points whose Voronoi cells intersect the SR. As shown in the figure, the set of Voronoi cells of these points (such as p_1) jointly covers the entire SR. Hence, the algorithm must access the disk pages including the Voronoi records of these points. However, to limit the traversal to the borders of the SR and also guarantee a complete traversal of the SR, the algorithm must also visit the disk pages storing those points (such as p_2) for which at least one immediate Voronoi neighbor intersects the SR. In Figure 4.4, the I/O-optimal algorithm accesses only those disk pages that store the Voronoi records of the points whose cells are depicted in grey. Therefore, we utilize the following definition in designing our query processing algorithms in Section 4.4:

Definition 3. A VoR-tree-based query processing algorithm A is I/O-optimal iff A accesses only those disk pages that store at least one point whose Voronoi cell, or that of one of its immediate Voronoi neighbors, intersects the search region of the query.

4.4 Query Processing using VoR-tree

In this section, we discuss our algorithms for processing different nearest neighbor queries using VoR-trees. For each query, we first review its state-of-the-art algorithm. Then, we present our algorithm, showing how maintaining Voronoi records in the VoR-tree boosts the query processing capabilities of the corresponding R-tree.
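The traversal pattern behind Definition 3 can be sketched as a breadth-first expansion over the Delaunay adjacency: a point's record is visited if its own cell intersects the SR, and expansion stops exactly one ring past the SR border. This is a sketch under our own naming; the intersection predicate is query-specific:

```python
from collections import deque

def traverse_sr(start, neighbors, cell_intersects_sr):
    """Diffusive SR traversal: expand only across points whose cell intersects
    the SR. `neighbors` maps a point id to its Voronoi neighbors (Delaunay
    edges); `cell_intersects_sr(p)` is the query-specific predicate.
    Returns (points inside the SR, all record pages accessed)."""
    visited, inside = {start}, []
    queue = deque([start])
    while queue:
        p = queue.popleft()
        if cell_intersects_sr(p):
            inside.append(p)
            for n in neighbors[p]:        # neighbors of an SR cell get accessed too
                if n not in visited:
                    visited.add(n)
                    queue.append(n)
    return inside, visited
```

The returned `visited` set matches Definition 3: it contains the SR points plus their immediate neighbors, and nothing farther out.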
4.4.1 k Nearest Neighbor Query (kNN)

Given a query point q, a k Nearest Neighbor (kNN) query finds the k closest data points to q. Given the data set P, it finds k points p_i ∈ P for which D(q, p_i) ≤ D(q, p) for all points p ∈ P \ {p_1, ..., p_k} [RKV95]. We use kNN(q) = {p_1, ..., p_k} to denote the ordered result set; p_i is the i-th NN of q. The I/O-optimal algorithm for finding kNNs using an R-tree is Best-First Search (BFS) [HS99]. BFS traverses the nodes of the R-tree from the root down to the leaves. It maintains the visited nodes N in a min-heap sorted by their mindist(N, q). Consider the R-tree of Figure 4.1a (the VoR-tree without Voronoi records) and the query point q. BFS first visits the root R and adds its entries together with their mindist() values to the heap H (H = {(N_6, 0), (N_7, 2)}). Then, at each step, BFS accesses the node at the top of H and adds all its entries into H. Extracting N_6, we get H = {(N_7, 2), (N_3, 3), (N_2, 7), (N_1, 17)}. Then, we extract N_7 to get H = {(N_5, 2), (N_3, 3), (N_2, 7), (N_1, 17), (N_4, 26)}. Next, we extract the point entries of N_5, where we find p_14 as the first potential NN of q. Now, as the mindist() of the first entry of H is less than the distance of the closest point to q we have found so far (bestdist = D(p_14, q) = 5), we must visit N_3 to explore any of its potentially better points. Extracting N_3, we realize that its closest point to q is p_8 with D(p_8, q) = 8 > bestdist, and hence we return p_14 as the nearest neighbor (NN) of q (NN(q) = p_14). As an incremental algorithm, BFS can continue its iterations to return all k nearest neighbors of q in ascending order of their distance to q. Here, bestdist is the distance of the k-th closest point found so far to q. For a general query Q on the set P, we define the search region (SR) of a point p ∈ P as the portion of R^2 that may contain a result better than p in P.
BFS is considered I/O-optimal, as at each iteration it visits only the nodes intersecting the SR of its best candidate result p (i.e., the circle centered at q with radius equal to D(q, p)). However, as the above example shows, nodes such as N_3, while intersecting the SR of p_14, might have no point closer than p_14 to q. We show how one can utilize the Voronoi records of the VoR-tree to avoid visiting these nodes. First, we show our VR-1NN algorithm for processing 1NN queries (see Figure 4.5 for the pseudo-code). VR-1NN works similarly to BFS. The only difference is that once VR-1NN finds a candidate point p, it accesses the Voronoi record of p. Then, it checks whether the Voronoi cell of p contains the query point q (Line 08 of Figure 4.5). If the answer is positive, it returns p (and exits), as p is the closest point to q according to the definition of a Voronoi cell. Incorporating this containment check in the VR-1NN algorithm avoids visiting (i.e., prunes) node N_3 in the above example, as V(p_14) contains q.

Algorithm VR-1NN (point q)
01. minheap H = {(R, 0)}; bestdist = ∞;
02. repeat
03.   remove the first entry e from H;
04.   if e is a leaf node
05.     for each point p of e
06.       if D(p, q) < bestdist
07.         bestNN = p; bestdist = D(p, q);
08.     if V(bestNN) contains q return bestNN;
09.   else // e is an intermediate node
10.     for each child node e' of e
11.       insert (e', mindist(e', q)) into H;
Figure 4.5: 1NN algorithm using VoR-tree

Algorithm VR-kNN (point q, integer k)
01. NN(q) = VR-1NN(q);
02. minheap H = {(NN(q), D(NN(q), q))};
03. Visited = {NN(q)}; counter = 0;
04. while (counter < k)
05.   remove the first entry p from H;
06.   output p; increment counter;
07.   for each Voronoi neighbor of p such as p'
08.     if p' ∉ Visited
09.       add (p', D(p', q)) into H and p' into Visited;
Figure 4.6: kNN algorithm using VoR-tree

To extend VR-1NN to general kNN processing, we utilize Property V-4 of Voronoi diagrams discussed in Chapter 2.
This property states that, in Figure 4.1 where the first NN of q is p_14, the second NN of q (p_4) is a Voronoi neighbor of p_14. Also, the third NN of q (p_8) is a Voronoi neighbor of either p_14 or p_4 (or both, as in this example). Therefore, once we find the first NN of q, we can easily explore a limited neighborhood around its Voronoi cell to find the other NNs (e.g., we examine only the Voronoi neighbors of NN(q) to find the second NN of q). Figure 4.6 shows the pseudo-code of our VR-kNN algorithm. It first uses VR-1NN to find the first NN of q (p_14 in Figure 4.1a). Then, it adds this point to a min-heap H sorted on the ascending distance of each point entry to q (H = {(p_14, 5)}). Subsequently, each following iteration removes the first entry from H, returns it as the next NN of q, and adds all its Voronoi neighbors to H. Assuming k = 3 in the above example, Table 4.1 shows the trace of the VR-kNN iterations.

Table 4.1: Trace of VR-kNN for k = 3
step | action                        | H
1    | p_14 = 1st NN, add VN(p_14)   | (p_4, 7), (p_8, 8), (p_12, 13), (p_13, 18)
2    | p_4 = 2nd NN, add VN(p_4)     | (p_8, 8), (p_12, 13), (p_6, 13), (p_7, 14), (p_5, 16), (p_13, 18)
3    | p_8 = 3rd NN, terminate.      | -

Correctness: The correctness of VR-kNN follows from the correctness of BFS and the definition of Voronoi diagrams.

Complexity: We compute I/O complexities in terms of the Voronoi records and R-tree nodes retrieved by the algorithm. Once VR-kNN finds NN(q), it executes exactly k iterations, each extracting the Voronoi neighbors of one point. Property V-2 states that the average number of these neighbors is constant. Hence, the I/O complexity of VR-kNN is O(Φ(|P|) + k), where Φ(|P|) is the complexity of finding the first NN of q using the VoR-tree (or R-tree). The time complexity can be determined similarly.

Improvement over BFS: We show how, for the same kNN query, VR-kNN accesses a smaller number of disk pages (or VoR-tree nodes) compared to BFS. Figure 4.7a shows a query point q and 3 nodes of a VoR-tree with 8 entries per node.
With the corresponding R-tree, BFS accesses node N_1, where it finds p_1, the first NN of q. To find q's 2nd NN, BFS visits both nodes N_2 and N_3, as their mindist is less than D(p_2, q) (p_2 is the closest 2nd NN found in N_1). However, VR-kNN does not access N_2 and N_3. It looks for the 2nd NN among the Voronoi neighbors of p_1, which are all stored in N_1. Even when it returns p_2 as the 2nd NN, it looks for the 3rd NN in the same node, as N_1 contains all the Voronoi neighbors of both p_1 and p_2. The above example represents a sample of the many kNN query scenarios in which VR-kNN achieves a better I/O performance than BFS.

[Figure 4.7: a) Improving over BFS, b) p ∈ R2NN(q)]

4.4.2 Reverse k Nearest Neighbor Query (RkNN)

Given a query point q, a Reverse k Nearest Neighbor (RkNN) query retrieves all the data points p ∈ P that have q as one of their k nearest neighbors. Given the data set P, it finds all p ∈ P for which D(q, p) ≤ D(p, p_k), where p_k is the k-th nearest neighbor of p in P [TPL04]. We use RkNN(q) to denote the result set. Figure 4.7b shows a point p together with p_1 and p_2, p's 1st and 2nd NNs, respectively. The point p is closer to p_1 than to q, and hence p is not in R1NN(q). However, q is inside the circle centered at p with radius D(p, p_2), and hence q is closer to p than p_2 is to p. As p_2 is the 2nd NN of p, p is in R2NN(q). The TPL algorithm for RkNN search, proposed by Tao et al. in [TPL04], uses a two-step filter-refinement approach on an R-tree of the points. TPL first finds a set of candidate RNN points S_cnd by a single traversal of the R-tree, visiting its nodes in ascending distance from q and performing smart pruning. The pruned nodes/points are kept in a
In Figure 4.7b, consider the perpendicular bisector B(q,p_1) which divides the space into two half-planes. Any point such as p located on the same half-plane as p_1 (denoted as B_{p_1}(q,p_1)) is closer to p_1 than to q. Hence, p cannot be in R1NN(q). That is, any point p in the half-plane B_{p_1}(q,p_1) defined by the bisector of qp_1 for an arbitrary point p_1 cannot be in R1NN(q). With R1NN queries, TPL uses this criterion to prune the points that are in B_{p_1}(q,p_1) of another point p_1. It also prunes the nodes N that are in the union of the B_{p_i}(q,p_i) for a set of candidate points p_i. The reason is that each point in N is closer to one of the p_i's than to q and hence cannot be in R1NN(q). A similar reasoning holds for general RkNN queries. Considering B_{p_1}(q,p_1) and B_{p_2}(q,p_2), p is not inside the intersection of these two half-planes (the region in black in Figure 4.7b). It is also outside the intersection of the corresponding half-planes of any two arbitrary points of P. Hence, it is not closer to any two points than to q, and therefore p is in R2NN(q).

With an R1NN query, pruning a node N using the above criterion means incrementally clipping N by the bisector lines of n candidate points into a convex polygon N_res, which takes O(n²) time. The residual region N_res is the part of N that may contain candidate RNNs of q. If the computed N_res is empty, then it is safe to prune N (and add it to S_rfn). This filtering method is more complex with RkNN queries, where N must be clipped with each of the C(n,k) combinations of bisectors of the n candidate points. To overcome this complexity, TPL uses a conservative trim function which guarantees that no possible RNN is pruned. With R1NN, trim incrementally clips the MBR of the clipped N_res from the previous step. With RkNN, as clipping with C(n,k) combinations, each with k bisector lines, is prohibitive, trim utilizes heuristics and approximations. TPL's filter step is applied in rounds.
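The half-plane test behind TPL's pruning is a one-line distance comparison; a minimal sketch (squared distances avoid the square root):

```python
def closer_to_p_than_q(x, p, q):
    """Bisector half-plane test: True iff x lies on p's side of the
    perpendicular bisector of segment qp, i.e. D(x, p) <= D(x, q)."""
    dxp = (x[0] - p[0]) ** 2 + (x[1] - p[1]) ** 2
    dxq = (x[0] - q[0]) ** 2 + (x[1] - q[1]) ** 2
    return dxp <= dxq

def prune_r1nn(x, q, candidates):
    """x cannot be in R1NN(q) if it is closer to some candidate than to q."""
    return any(closer_to_p_than_q(x, p, q) for p in candidates)
```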
Each round first eliminates those candidates of S_cnd which are pruned by at least k entries in S_rfn. Then, it adds to the final result RkNN(q) those candidates which are guaranteed not to be pruned by any entry of S_rfn. Finally, it queues more nodes from S_rfn to be accessed in the next round as they might prune some candidate points. The iteration over refinement rounds terminates when no candidate is left (S_cnd = ∅).

Figure 4.8: Lemma 1

While TPL utilizes smart pruning techniques, there are two drawbacks: 1) For k > 1, the conservative filtering of nodes in the trim function fails to prune nodes that could be discarded. This results in an increased number of candidate points [TPL04]. 2) For many query scenarios, the number of entries kept in S_rfn is much higher than the number of candidate points, which increases the workspace required for TPL. It also delays the termination of TPL, as more refinement rounds must be performed.

Improvement over TPL: Similar to TPL, our VR-RkNN algorithm also utilizes a filter-refinement approach. However, utilizing properties of the Voronoi diagram of P, it eliminates the exhaustive refinement rounds of TPL. It uses the Voronoi records of the VoR-tree of P to examine only a limited neighborhood around a query point to find its RNNs. First, the filter step utilizes two important properties of RNNs to define this neighborhood, from which it extracts a set of candidate points and a set of points required to prune false hits (see Lemmas 2 and 3 below). Then, the refinement step finds the kNNs of each candidate in the latter set and eliminates those candidates that are closer to their k-th NN than to q.

Figure 4.9: VR-RkNN for k = 2

We discuss the properties used by the filter step. Consider the Delaunay graph DG(P). We define gd(p,p'), the graph distance between two vertices p and p' of DG(P) (points of P), as the minimum number of edges connecting p and p' in DG(P).
For example, in Figure 2.3 we have gd(p,p') = 1 and gd(p,p'') = 2.

Lemma 1. Let p_k ≠ p be the k-th closest point of set P to a point p ∈ P. The upper bound of the graph distance between the vertices p and p_k in the Delaunay graph of P is k (i.e., gd(p,p_k) ≤ k).

Proof. The proof is by induction. Consider the point p in Figure 4.8. First, for k = 1, we show that gd(p,p_1) ≤ 1. Property V-4 of Voronoi diagrams, discussed in Section 2.1, states that p_1 is a Voronoi neighbor of p; p_1 is an immediate neighbor of p in the Delaunay graph of P and hence we have gd(p,p_1) = 1. Now, assuming gd(p,p_i) ≤ i for 0 ≤ i ≤ k, we show that gd(p,p_{k+1}) ≤ k+1. Property V-4 states that p_{k+1} is a Voronoi neighbor of at least one p_i ∈ {p_1,...,p_k}. Therefore, we have gd(p,p_{k+1}) ≤ max(gd(p,p_i)) + 1 ≤ k + 1.

In Figure 4.9, consider the query point q and the Voronoi diagram VD(P ∪ {q}) (q added to VD(P)). Lemma 1 states that if q is one of the kNNs of a point p, then we have gd(p,q) ≤ k; p is no farther than graph distance k from q in the Delaunay graph of P ∪ {q}. This yields the following lemma:

Lemma 2. If p is one of the reverse kNNs of q, then we have gd(p,q) ≤ k in the Delaunay graph DG(P ∪ {q}).
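The graph distance gd() of Lemma 1 is an ordinary shortest-path length in an unweighted graph, so a breadth-first search suffices; a minimal sketch over an adjacency list standing in for the Delaunay graph:

```python
from collections import deque

def graph_distance(adj, src, dst):
    """BFS over a Delaunay-graph adjacency list: minimum number of edges
    between two vertices (the gd() used in Lemmas 1 and 2)."""
    if src == dst:
        return 0
    seen = {src}
    frontier = deque([(src, 0)])
    while frontier:
        v, d = frontier.popleft()
        for n in adj[v]:
            if n == dst:
                return d + 1
            if n not in seen:
                seen.add(n)
                frontier.append((n, d + 1))
    return float("inf")  # disconnected (cannot happen in a Delaunay graph)
```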
The filter step adds to its candidate set only those points that are closer than graph distance k+1 from q in DG(P ∪ {q}) (Lemma 2). From all candidate points inside each partition S_i (defined as in Figure 4.9), it keeps only the k closest ones to q and discards the rest (Lemma 3). Notice that both filters are required. In Figure 4.9, the black point p_9, while at graph distance 2 from q, is eliminated during our search for R2NN(q) as it is the 3rd closest point to q in partition S_4. Similarly, p_10, the closest point to q in S_6, is eliminated as gd(q,p_10) is 3.

To verify each candidate p, in the refinement step we must examine whether p is closer to q than to p's k-th NN (i.e., p_k). Lemma 1 states that the upper bound of gd(p,p_k) is k (gd(p,p_k) ≤ k). Candidate p can also be k edges far from q (gd(p,q) ≤ k). Hence, p_k can be at graph distance 2k from q (all black and grey points in Figure 4.9). All other points (shown as grey crosses) are not required to filter the candidates. Thus, it suffices to visit/keep this set R and compare its points to q with respect to the distance to each candidate p. However, visiting the points of R through VD(P) takes exponential time as the size of R grows exponentially with k.

Algorithm VR-RkNN (point q, integer k)
01. minheap H = {}; Visited = {};
02. for 1 ≤ i ≤ 6: minheap S_cnd(i) = {};
03. VN(q) = FindVoronoiNeighbors(q);
04. for each point p in VN(q)
05.   add (p,1) into H; add p into Visited;
06. while (H is not empty)
07.   remove the first entry (p, gd(q,p)) from H;
08.   i = sector around q that contains p;
09.   p_n = last point in S_cnd(i) (infinity if empty);
10.   if gd(q,p) ≤ k and D(q,p) ≤ D(q,p_n)
11.     add (p, D(q,p)) to S_cnd(i);
12.   for each Voronoi neighbor p' of p
13.     if p' ∉ Visited
14.       gd(q,p') = gd(q,p) + 1;
15.       add (p', gd(q,p')) into H and p' into Visited;
16. for each candidate set S_cnd(i)
17.   for each of the first k points p in S_cnd(i)
18.     p_k = k-th NN of p;
19.     if D(q,p) ≤ D(p_k,p): output p;

Figure 4.10: RkNN algorithm using VoR-tree
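The per-partition filter of Lemma 3 needs a way to map a point to one of the 6 equi-angular sectors around q. A minimal sketch (the orientation of the sector boundaries is an assumption; only the equal 60-degree widths matter):

```python
import math

def sector(q, p, n=6):
    """Index (1..n) of the equi-angular partition around q that contains p,
    as used by the space division of Lemma 3. Sector 1 is assumed to start
    at the positive x-axis."""
    ang = math.atan2(p[1] - q[1], p[0] - q[0]) % (2 * math.pi)
    return int(ang / (2 * math.pi / n)) + 1
```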
To overcome this exponential behavior, VR-RkNN instead finds the k-th NN of each candidate p, which takes only O(k²) time for the at most 6k candidates.

Figure 4.10 shows the pseudo-code of the VR-RkNN algorithm. VR-RkNN maintains 6 sets S_cnd(i) containing the candidate points of each partition. Each set S_cnd(i) is a minheap storing the (at most) k NNs of q inside partition S_i. First, VR-RkNN finds the Voronoi neighbors of q as if q were added to VD(P). This is easily done using the insert operation of the VoR-tree without actually inserting q (see Section 4.2.1). In the filter step, VR-RkNN uses a minheap H sorted on the graph distance of its point entries to q to traverse VD(P) in ascending gd() from q. It first adds all neighbors p_i of q to H with gd(q,p_i) = 1. In Figure 4.9, the points p_1,...,p_4 are added to H. Then, VR-RkNN iterates over the top entry of H. At each iteration, it removes the top entry p. If p, inside S_i, passes both filters defined by Lemmas 2 and 3, the algorithm adds (p, D(q,p)) to the candidate set of partition S_i (S_cnd(i); e.g., p_1 to S_1). It also accesses the Voronoi record of p, through which it adds the Voronoi neighbors of p to H (incrementing their graph distance). The filter step terminates when H becomes empty. In our example, the first iteration adds p_1 to S_cnd(1), and p_5, p_6 and p_7 with graph distance 2 to H. After the last iteration, we have S_cnd(1) = {p_1,p_5}, S_cnd(2) = {p_6,p_7}, S_cnd(3) = {p_4,p_11}, S_cnd(4) = {p_3,p_8}, S_cnd(5) = {p_2,p_12}, and S_cnd(6) = {}.

The refinement step starts at Line 17 of the pseudo-code (Figure 4.10). It examines the points in each S_cnd(i) and adds them to the final result iff they are closer to q than to their k-th NN (R2NN = {p_1,p_2}). Finding the k-th NN is straightforward using an approach similar to the VR-kNN of Section 4.4.1.

Correctness:

Lemma 4. Given a query point q, VR-RkNN correctly finds RkNN(q).

Proof. It suffices to show that the filter step of VR-RkNN is safe.
That is, it does not prune the points that might be in the result set. This follows from building the filter based on Lemmas 1, 2, and 3.

Complexity: Once VR-RkNN finds NN(q), it starts finding the k closest points to q in each partition. It requires retrieving O(k) Voronoi records to find these candidate points, as they are at most k edges far from q. Finding the k NNs of each candidate point also requires accessing O(k) Voronoi records. Therefore, the I/O complexity of VR-RkNN is O(Φ(|P|) + k²) where Φ(|P|) is the complexity of finding NN(q).

Figure 4.11: k Aggregate Nearest Neighbor query for Q = {q_1, q_2, q_3} and function a) f = sum, b) f = max

4.4.3 k Aggregate Nearest Neighbor Query (kANN)

Given the set Q = {q_1,...,q_n} of query points, a k Aggregate Nearest Neighbor query (kANN) finds the k data points in P with the smallest aggregate distance to Q. We use kANN(Q) to denote the result set. The aggregate distance adist(p,Q) is defined as f(D(p,q_1),...,D(p,q_n)) where f is a monotonically increasing function [PTMH05]. For example, considering P as meeting locations and Q as attendees' locations, a 1ANN query with f = sum finds the meeting location that minimizes the total travel distance of all attendees (p minimizes f = sum in Figure 4.11a). With f = max, it finds the location that leads to the earliest time at which all attendees arrive (assuming equal fixed speed; Figure 4.11b). Positive weights can also be assigned to query points (e.g., adist(p,Q) = Σ_{i=1}^{n} w_i D(p,q_i) where w_i ≥ 0). Throughout this section, we use the functions f and adist() interchangeably.

The best R-tree-based solution for kANN queries is the MBM algorithm [PTMH05]. Similar to BFS for kNN queries, MBM visits only the nodes of the R-tree that may contain a result better than the best one found so far.
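The aggregate distance defined above, and the brute-force kANN it induces, can be sketched in a few lines; this reference version is useful for validating an index-based algorithm on small inputs.

```python
import math

def adist(p, Q, f=sum):
    """Aggregate distance of point p to query set Q:
    f(D(p,q1), ..., D(p,qn)) for a monotone f (sum, max, ...)."""
    return f(math.dist(p, q) for q in Q)

def kann_bruteforce(P, Q, k, f=sum):
    """Reference kANN: the k data points with smallest aggregate distance."""
    return sorted(P, key=lambda p: adist(p, Q, f))[:k]
```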
Based on two heuristics, it utilizes two corresponding functions that return lower bounds on the adist() of any point in a node N to prune N: 1) amindist(N,M) = f(nm,...,nm) where nm = mindist(N,M) is the minimum distance between the two rectangles N and M, the minimum bounding box of Q, and 2) amindist(N,Q) = f(nq_1,...,nq_n) where nq_i = mindist(N,q_i) is the minimum possible distance between q_i and the points of node N.

Figure 4.12: Search region of p for function a) f = sum, b) f = max

For each node N, MBM first examines whether amindist(N,M) is larger than the aggregate distance of the current best result p (bestdist = adist(p,Q)). If the answer is positive, MBM discards N. Otherwise, it examines whether the second lower bound amindist(N,Q) is larger than bestdist. If yes, it discards N, as N cannot contain a better answer than p. Otherwise, MBM visits N's children. Once MBM finds a data point, it updates its current best result; it terminates when no better point can be found.

We show that MBM's conservative heuristics, which are based on the rectangular grouping of points into nodes, do not properly suit the shape of kANN's search region SR (the portion of space that may contain a better result). Hence, they fail to prune many nodes. Figures 4.12a and 4.12b illustrate the SRs of a point p for kANN queries with aggregate functions f = sum and f = max, respectively (regions in grey). The point p' is in the SR of p iff we have adist(p',Q) ≤ adist(p,Q). The equality holds on SR's boundary (denoted as SRB). For f = sum (and weighted sum), SR has an irregular circular shape that depends on the query cardinality and distribution (an ellipse for 2 query points) [PTMH05]. For f = max, SR is the intersection of n circles centered at the q_i's with radius max(D(p,q_i)). The figure shows the SRBs of several points as contour lines defined as the locus of points p ∈ R² where adist(p,Q) = c (constant).

Figure 4.13: VR-kANN for k = 3
As shown, the SR of p' is completely inside the SR of p iff we have adist(p',Q) < adist(p,Q). The figure also shows the centroid q of Q, defined as the point in R² that minimizes adist(q,Q). Notice that q is inside the SRs of all points.

Once MBM finds p as a candidate answer to kANN, it tries to examine only the nodes N that intersect with the SR of p. However, the conservative lower-bound function amindist() causes MBM not to prune nodes such as N in Figure 4.12a that fall completely outside the SR of p (amindist(N,Q) = 3+8+2 = 13 < adist(p,Q) = 14).

Now, we describe our VR-kANN algorithm for kANN queries using the example shown in Figure 4.13. VR-kANN uses the VoR-tree to traverse the Voronoi cells covering the search space of kANN. It starts with the Voronoi cell that contains the centroid q of Q. The generator of this cell is the closest point of P to q (p_q = p_2 in Figure 4.13).

Algorithm VR-kANN (set Q, integer k, function f)
01. p_q = FindCentroidNN(Q, f);
02. minheap H = {(p_q, 0)};
03. minheap RH = {(p_q, adist(p_q,Q))};
04. Visited = {p_q}; counter = 0;
05. repeat
06.   remove the first entry p from H;
07.   while the first entry p' of RH has
08.       adist(p',Q) ≤ amindist(V(p),Q)
09.     remove p' from RH; output p';
10.     increment counter; if counter = k terminate;
11.   for each Voronoi neighbor p' of p
12.     if p' ∉ Visited
13.       add (p', amindist(V(p'),Q)) into H;
14.       add (p', adist(p',Q)) into RH;
15.       add p' into Visited;

Figure 4.14: kANN algorithm using VoR-tree

VR-kANN uses a minheap H of points p sorted on amindist(V(p),Q), which is the minimum possible adist of any point in the Voronoi cell of p. This function plays the same role as the two lower-bound functions used in the heuristics of MBM: it sets a lower bound on the adist() of the points in a Voronoi cell. Later, we show how we efficiently compute this function. VR-kANN also keeps the k points with minimum adist() in a result heap RH with their adist() values as keys. With the example of Figure 4.13, VR-kANN first adds (p_2, 0) and (p_2, 40) into H and RH, respectively.
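The traversal of Figure 4.14 can be sketched as two heaps, one ordering cells to visit by their lower bound and one ordering candidate results by their exact aggregate distance. This is a hedged sketch, not the VoR-tree implementation: centroid_nn, neighbors, adist_point, and amindist_cell are caller-supplied stand-ins for FindCentroidNN, the Voronoi records, adist(), and amindist(V(p),Q).

```python
import heapq

def vr_kann(k, centroid_nn, neighbors, adist_point, amindist_cell):
    """Sketch of VR-kANN: traverse Voronoi cells outward from the cell
    containing Q's centroid; report a candidate once no unvisited cell
    can hold a point with a smaller aggregate distance."""
    pq = centroid_nn()
    H = [(0.0, pq)]                        # cells to visit, by amindist
    RH = [(adist_point(pq), pq)]           # candidate results, by adist
    visited = {pq}
    out = []
    while H and len(out) < k:
        bound, p = heapq.heappop(H)
        while RH and RH[0][0] <= bound:    # no better point can appear
            out.append(heapq.heappop(RH)[1])
            if len(out) == k:
                return out
        for n in neighbors(p):
            if n not in visited:
                visited.add(n)
                heapq.heappush(H, (amindist_cell(n), n))
                heapq.heappush(RH, (adist_point(n), n))
    while RH and len(out) < k:             # diagram exhausted: drain RH
        out.append(heapq.heappop(RH)[1])
    return out
```

A one-dimensional stand-in (points on a line, cells as intervals) is enough to exercise the logic.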
At each iteration, VR-kANN first deheaps the first entry p from H. It reports any point p_i in RH whose adist() is less than amindist(V(p),Q). The reason is that the SR of p_i has already been traversed by the points inserted into H, and no result better than p_i can be found (see Lemma 5). Finally, it inserts all non-visited Voronoi neighbors of p into H and RH (p_1, p_3, p_4, p_6, and p_7; p_1 and p_2 are the best 1st and 2nd ANNs in RH). The algorithm stops when it reports the k-th point of RH. In our example, VR-kANN subsequently visits the neighbors of p_1, p_3, p_4, p_5, p_6, and p_8, where it reports p_1 and p_2 as the first two ANNs. It reports p_3 and stops when it deheaps p_10 and p_7. Figure 4.14 shows the pseudo-code of VR-kANN.

Figure 4.15: Finding the cell containing centroid q

The only two remaining pieces of VR-kANN are the functions FindCentroidNN(), used to find the closest data point p_q to the centroid q, and amindist(V(p),Q), which finds the minimum possible adist() for the points in V(p). We show how we compute these functions.

Centroid Computation: When q can be exactly computed (e.g., for f = max it is the center of the smallest circle containing Q), VR-kANN performs a 1NN search using the VoR-tree and retrieves p_q. However, for many functions f, the centroid q cannot be precisely computed [PTMH05]. With f = sum, q is the Fermat-Weber point, which can only be approximated numerically. As VR-kANN only requires the closest point to q (not q itself), we provide an algorithm similar to gradient descent to find p_q.¹ Figure 4.15 illustrates this algorithm. We first start from a point close to q and find its closest point p_1 using VR-1NN (e.g., the geometric centroid of Q, with x = (1/n) Σ_{i=1}^{n} q_i.x and y = (1/n) Σ_{i=1}^{n} q_i.y, for f = sum).
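For f = sum, the descent direction used by this procedure is the gradient of the sum of Euclidean distances to the query points. A minimal sketch of that gradient (it is undefined exactly at a query point, where the distance term vanishes):

```python
import math

def sum_dist_gradient(x, y, Q):
    """Partial derivatives of adist((x,y), Q) for f = sum of Euclidean
    distances. Returns (d/dx, d/dy); the negated vector is the descent
    direction toward the Fermat-Weber point."""
    gx = sum((x - qx) / math.hypot(x - qx, y - qy) for qx, qy in Q)
    gy = sum((y - qy) / math.hypot(x - qx, y - qy) for qx, qy in Q)
    return gx, gy
```

At the minimizer the gradient vanishes; for Q = {(0,0), (2,0)} any point on the connecting segment is a minimizer.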
Second, we compute the partial derivatives of f = adist(q,Q) with respect to the variables q.x and q.y:

∂_x = ∂adist(q,Q)/∂x = Σ_{i=1}^{n} (x − x_i) / sqrt((x − x_i)² + (y − y_i)²)
∂_y = ∂adist(q,Q)/∂y = Σ_{i=1}^{n} (y − y_i) / sqrt((x − x_i)² + (y − y_i)²)        (4.1)

Computing ∂_x and ∂_y at point p_1, we get a direction d_1. Drawing a ray r_1 originating from p_1 in direction d_1 enters the Voronoi cell of p_2, intersecting its boundary at point x_1. We compute the direction d_2 at x_1 and repeat the same process using a ray r_2 originating from x_1 in direction d_2, which enters V(p_q) at x_2. Now, as we are inside V(p_q), which includes the centroid q, all subsequent rays circulate inside V(p_q). Detecting this situation, we return p_q as the closest point to q.

¹The method SPM in [PTMH05] uses a similar approach to approximate the centroid.

Minimum aggregate distance in a Voronoi cell: The function amindist(V(p),Q) can be conservatively computed as adist(vq_1,...,vq_n) where vq_i = mindist(V(p),q_i) is the minimum distance between q_i and any point in V(p). However, when the centroid q is outside V(p), the minimum adist() occurs on the boundary of V(p). Based on this fact, we find a better lower bound for amindist(V(p),Q). For point p_1 in Figure 4.15, if we compute the direction d_1 and ray r_1 as stated for the centroid computation, we realize that adist(p',Q) (p' ∈ V(p)) is minimum for a point p' on the edge v_1v_2 that intersects with r_1. The reason is the circular convex shape of the SRs for adist(). Therefore, amindist(V(p),Q) returns adist(vq_1,...,vq_n) where vq_i = mindist(v_1v_2, q_i) is the minimum distance between q_i and the Voronoi edge v_1v_2.

Correctness:

Lemma 5. Given a query set Q, VR-kANN correctly and incrementally finds kANN(Q) in the ascending order of the adist() values.

Proof. It suffices to show that when VR-kANN reports p, it has already examined/reported all the points p' where adist(p',Q) ≤ adist(p,Q).
VR-kANN reports p when for all the cells V in H we have amindist(V,Q) ≥ adist(p,Q). That is, all these visited cells are outside the SR of p. In Figure 4.13, p_2 is reported when H contains only the cells on the boundary of the grey area, which contains the SR of p_2. As VR-kANN starts visiting the cells from the cell V(p_q) containing the centroid q, when reporting any point p it has already examined/inserted into RH all points in the SR of p. As RH is a minheap on adist(), the results are in the ascending order of their adist() values.

Complexity: The Voronoi cells of the visited points constitute an almost-minimal set of cells covering the SR of the result (including k points). These cells are within a close edge distance of the returned k points. Hence, the number of points visited by VR-kANN is O(k). Therefore, the I/O complexity of VR-kANN is O(Φ(|P|) + k) where Φ(|P|) is the complexity of finding the closest point to the centroid q.

General aggregate functions: In general, any kANN query with an aggregate function for which the SR of a point is continuous is supported by the pseudo-code provided for VR-kANN. This covers a large category of widely used functions such as sum, max and weighted sum. With functions such as f = min, each SR consists of n different circles centered at the query points of Q. As a result, Q has more than one centroid for function f. To answer a kANN query with these functions, we need to change VR-kANN to perform a parallel traversal of VD(P) starting from the cells containing each of the n_c centroids.

4.4.4 Spatial Skyline Query (SSQ)

Given the set Q = {q_1,...,q_n} of query points, the Spatial Skyline query (SSQ) returns the set S(Q) including those points of P which are not spatially dominated by any other point of P. The point p spatially dominates p' iff we have D(p,q_i) ≤ D(p',q_i) for all q_i ∈ Q and D(p,q_j) < D(p',q_j) for at least one q_j ∈ Q [SS06b]. Figure 4.16a shows a set of nine data points and two query points q_1 and q_2.
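The spatial dominance definition above translates directly into a check, and into a naive skyline computation usable as a reference on small inputs:

```python
import math

def spatially_dominates(p, p2, Q):
    """p spatially dominates p2 iff p is at least as close to every query
    point and strictly closer to at least one (definition in the text)."""
    closer_eq = all(math.dist(p, q) <= math.dist(p2, q) for q in Q)
    strictly = any(math.dist(p, q) < math.dist(p2, q) for q in Q)
    return closer_eq and strictly

def spatial_skyline_bruteforce(P, Q):
    """Naive S(Q): points of P not spatially dominated by any other point."""
    return [p for p in P
            if not any(spatially_dominates(o, p, Q) for o in P if o != p)]
```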
The point p_1 spatially dominates the point p_2, as both q_1 and q_2 are closer to p_1 than to p_2. Here, S(Q) is {p_1, p_3}. Consider the circles C(q_i,p_1) centered at the query point q_i with radius D(q_i,p_1). Obviously, q_i is closer to p_1 than to any point outside C(q_i,p_1). Therefore, p_1 spatially dominates any point such as p_2 which is outside all circles C(q_i,p_1) for all q_i ∈ Q (the grey region in Figure 4.16a). For a point p, this region is referred to as the dominance region of p [SS06b].

Figure 4.16: Dominance regions of a) p_1, and b) {p_1, p_3}

We introduced SSQ in [SS06b], in which two algorithms, B²S² and VS², were proposed. Both algorithms utilize the following facts (see Appendix B for proofs):

Lemma 6. Any point p ∈ P which is inside the convex hull of Q (CH(Q)), or whose Voronoi cell V(p) intersects with CH(Q), is a skyline point (p ∈ S(Q)). We use "definite skyline points" to refer to these points.

Lemma 7. The set of skyline points of P depends only on the set of vertices of the convex hull of Q (i.e., CH_v(Q)).

The R-tree-based B²S² is a customization of a general skyline algorithm, termed BBS [PTFS05], for SSQ. It tries to avoid expensive dominance checks for the definite skyline points inside CH(Q), identified in Theorems 10 and 13, and also prunes unnecessary query points to reduce the cost of each examination (Theorem 12). VS² employs the Voronoi diagram of the data points to find the first skyline point, whose local neighborhood contains all other points of the skyline. The algorithm traverses the Voronoi diagram of the data points of P in the order specified by a monotone function of their distances to the query points. Utilizing VD(P), VS² finds all definite skyline points without any dominance check.
While both B²S² and VS² process SSQs efficiently, there are two drawbacks: 1) B²S² still uses the rectangular grouping of points together with the conservative mindist() function in its filter step and hence, similar to MBM for kANN queries, it fails to prune many nodes. To show a scenario, we first define the dominance region of a set S as the union of the dominance regions of all points of S (grey region in Figure 4.16b for S = {p_1, p_3}). Any point in this region is spatially dominated by at least one point p ∈ S. Now, consider the MBR of R-tree node N in Figure 4.16b. B²S² does not prune N (it visits N), as we have mindist(N,q_2) < D(p_1,q_2) and mindist(N,q_1) < D(p_3,q_1), and hence N is dominated neither by p_1 nor by p_3. However, N is completely inside the dominance region of {p_1, p_3} and cannot contain any skyline point. 2) VS², while computationally more efficient than B²S², provides no guarantee on its I/O-optimality. Also, the algorithm does not support arbitrary monotone functions for ordering the result.

We propose our VoR-tree-based algorithm for SSQ, which incrementally returns the skyline points ordered by a monotone function provided by the user (progressive, similar to BBS [PTFS05] and B²S²). To start, we first study the search region of a set S in SSQ. This is the region that may contain points that are not spatially dominated by a point of S. Therefore, SR is simply the complement of the dominance region of S (white region in Figure 4.16b). It is straightforward to see that the SR of S is a continuous region, as it is defined based on the union of a set of concentric circles C(q_i,p_i). An I/O-optimal algorithm, once it finds a set of skyline points S, must examine only the points inside the SR of S. Our VR-S² algorithm shown in Figure 4.17 satisfies this principle. VR-S² reports the skyline points in the ascending order of a user-provided monotone function f = adist().
It maintains a result minheap RH that includes the candidate skyline points sorted on their adist() values. To maintain the order of the output, we only add these candidate points into the final ordered result S(Q) when no point with a smaller adist() can be found. The algorithm's traversal of VD(P) is the same as that of VR-kANN with aggregate function adist() (compare the two pseudo-codes). Likewise, VR-S² uses a minheap H sorted on amindist(V(p),Q).

Algorithm VR-S² (set Q, function f)
01. compute the convex hull CH(Q);
02. p_q = FindCentroidNN(Q, f);
03. minheap H = {(p_q, 0)};
04. minheap RH = {(p_q, adist(p_q,Q))};
05. set S(Q) = {}; Visited = {p_q};
06. while H is not empty
07.   remove the first entry p from H;
08.   while the first entry p' of RH has
09.       adist(p',Q) ≤ amindist(V(p),Q)
10.     remove p' from RH;
11.     if p' is not dominated by S(Q): add p' into S(Q);
12.   for each Voronoi neighbor p' of p
13.     if p' ∉ Visited
14.       add p' into Visited;
15.       if V(p') is not dominated by points of S(Q) and RH
16.         add (p', amindist(V(p'),Q)) into H;
17.       if p' is a definite skyline point, or p' is not dominated by points of S(Q) and RH
18.         add (p', adist(p',Q)) into RH;
19. while RH is not empty
20.   remove the first entry p' from RH;
21.   if p' is not dominated by S(Q): add p' into S(Q);

Figure 4.17: SSQ algorithm using VoR-tree

VR-S² starts this traversal from a definite skyline point, which is immediately added to the result heap RH. This is the point p_q whose Voronoi cell contains the centroid of function f (here, f = sum). At each iteration, VR-S² deheaps the first entry p of H. Similar to VR-kANN, it examines any point p' in RH whose adist() is less than p's key (amindist(V(p),Q)). If p' is not dominated by any point in S(Q), it adds p' to S(Q) (see Lemma 8). Similar to B²S² and VS², for dominance checks VR-S² employs only the vertices of the convex hull of Q (CH_v(Q)) instead of the entire Q (see Theorem 12). Subsequently, accessing p's Voronoi records, it examines the unvisited Voronoi neighbors of p.
For each neighbor p', if V(p') is dominated by any point in S(Q) or RH (discussed below), VR-S² discards p'. The reason is that V(p') is entirely outside the SR of the current S(Q) ∪ RH. Otherwise, it adds p' to H. At the end, if p' is a definite skyline point or is not dominated by any point in S(Q) or RH, VR-S² adds it to RH. When the heap H becomes empty, any remaining point in RH is examined against the points of S(Q) and, if not dominated, is added to S(Q). In Figure 4.13, VR-S² visits p_1–p_27 and incrementally returns the ordered set S(Q) = {p_1, p_2, p_3, p_6, p_8, p_9, p_10}.

Spatial domination of a Voronoi cell: To provide a safe pruning approach, we define a conservative heuristic for the domination of V(p). We declare V(p) as spatially dominated if we have mindist(V(p),q_i) ≥ D(s,q_i) for all q_i ∈ Q, for a point s in the current candidate S(Q). We show that then all points of V(p) are dominated. Assume that the above condition holds. For any point x in V(p), we have D(x,q_i) ≥ mindist(V(p),q_i). By transitivity we get D(x,q_i) ≥ D(s,q_i). That is, each x in V(p) is spatially dominated by s ∈ S(Q). For example, V(p_13) is dominated by p_1, as we have the following:

mindist(V(p_13),q_1) = 12 > D(p_1,q_1) = 2
mindist(V(p_13),q_2) = 16 > D(p_1,q_2) = 15
mindist(V(p_13),q_3) = 33 > D(p_1,q_3) = 23

Correctness:

Lemma 8. Given a query set Q, VR-S² correctly and incrementally finds the skyline points in the ascending order of their adist() values.

Proof. To prove the order, notice that VR-S²'s traversal is the same as VR-kANN's traversal. Thus, according to Lemma 5 the result is in the ascending order of adist(). To prove correctness, we first show that if p is a skyline point (p ∈ S(Q)) then p is in the result returned by VR-S². The algorithm examines all the points in the SR of the result it returns, which is a superset of the actual S(Q). As any un-dominated point is in this SR, VR-S² adds p to its result. Then, we show that if VR-S² returns p, then p is a real skyline point.
The proof is by contradiction. Assume that p is spatially dominated by a skyline point p'. Earlier, we proved that VR-S² returns p' at some point, as it is a skyline point. We also proved that when VR-S² adds p to its result set, it has already reported p', as we have adist(p',Q) < adist(p,Q). Therefore, while examining p, VR-S² has checked it against p' for dominance and has discarded p. This contradicts our assumption that p is in the result of VR-S².

Complexity: Similar to VR-kANN, VR-S² visits only the points neighboring a point in the result set S(Q). Hence, it accesses only O(|S(Q)|) Voronoi records. Therefore, the I/O complexity of VR-S² is O(|S(Q)| + Φ(|P|)) where Φ(|P|) is the I/O complexity of finding the point from which VR-S² starts traversing VD(P).

Chapter 5
Performance Evaluation

We conducted several experiments to evaluate the performance of query processing using the VoR-tree. For each of the four studied queries, we compared our algorithm with the competitor R-tree-based approach with respect to the average number of disk I/Os (page accesses incurred by the underlying R-tree/VoR-tree index structures). For R-tree-based algorithms, this is the number of accessed R-tree nodes. For VoR-tree-based algorithms, the number of disk pages accessed to retrieve Voronoi records is also counted. Here, we do not report CPU costs, as all algorithms are mostly I/O-bound. We evaluated all approaches by investigating the effect of the following parameters on their performance: 1) the number of NNs k for kNN, kANN, and RkNN queries, 2) the number of query points (|Q|) and the size of the area covered by the MBR of Q for kANN and SSQ, and 3) the cardinality of the dataset for all queries.

We used two real-world datasets indexed by both an R*-tree and a VoR-tree (page size = 1K bytes, node capacity = 30). The USGS dataset, obtained from the U.S. Geological Survey (USGS), consists of 950,000 locations of different businesses in the entire U.S.¹
The NE dataset contains 123,593 locations in New York, Philadelphia and Boston.² The experiments were performed by issuing 1000 instances of each query type on a DELL Precision 470 with a Xeon 3.2 GHz processor and 3GB of RAM. For the convex hull computation in VR-S², we used the Graham Scan algorithm.

¹http://geonames.usgs.gov/
²http://www.rtreeportal.org/

Figure 5.1: I/O vs. k for ab) kNN, and cd) RkNN

In the first set of experiments, we measured the average number of disk pages accessed (I/O cost) by the VR-kNN and BFS algorithms for varying values of k. Figure 5.1a illustrates the I/O cost of both algorithms using USGS. As the figure shows, utilizing Voronoi cells enables VR-kNN to prune nodes that are accessed by BFS. Hence, VR-kNN accesses fewer pages compared to BFS, especially for larger values of k. With k = 128, VR-kNN discards almost 17% of the nodes which BFS finds intersecting with the SR. This improvement over BFS increases as k increases. The reason is that the radius of the SR used by BFS's pruning is first initialized to D(q,p), where p is the k-th visited point. This distance increases with k and causes many nodes to intersect with the SR and hence not be pruned by BFS. VR-kNN, however, uses Property V-4 to define a tighter SR. We also observed that this difference in I/O costs increases if we use smaller node capacities for the underlying R-tree/VoR-tree. Figure 5.1b shows a similar observation for the result of NE.

The second set of experiments evaluates the I/O costs of VR-RkNN and TPL for RkNN queries.
Figures 5.1c and 5.1d depict the I/O costs of both algorithms for different values of k on USGS and NE, respectively (the scale of the y-axis is logarithmic). As shown, VR-RkNN significantly outperforms TPL, by at least three orders of magnitude, especially for k > 1 (to find the R4NN with USGS, TPL takes 8 seconds on average while VR-RkNN takes only 4 milliseconds). TPL's filter step fails to prune many nodes, as its trim function is highly conservative: it uses a conservative approximation of the intersection between a node and SR. Moreover, to avoid exhaustive examination, it prunes using only n of the (n choose k) combinations of the n candidate points. Also, TPL keeps many pruned (non-candidate) nodes/points for further use in its refinement step. VR-RkNN's I/O cost is determined by the number of Voronoi/Delaunay edges traversed from q and the distance D(q, p_k) between q and p_k, the k-th closest point to q, in each of the 6 directions. Unlike TPL, VR-RkNN does not need to keep any non-candidate node/point. Instead, it performs single traversals around its candidate points to refine its results. VR-RkNN's I/O cost increases very slowly with k. The reason is that D(q, p_k) (and hence the size of the SR utilized by VR-RkNN) grows very slowly with k. TPL's performance varies with data cardinality; hence, our result differs from the corresponding result in [TPL04], as we use different datasets with different R-tree parameters.

Our next set of experiments studies the I/O costs of VR-kANN and MBM for kANN queries. We used f = sum and |Q| = 8 query points, all inside an MBR covering 4% of the entire dataset, and varied k. Figures 5.2a and 5.2b show the average number of disk pages accessed by both algorithms on USGS and NE, respectively. Similar to the previous results, VR-kANN is the superior approach. Its I/O cost is almost half that of MBM when k ≤ 16 with USGS (k ≤ 128 with the NE dataset).
This verifies that VR-kANN's traversal of SR from the centroid point effectively covers the irregular, roughly circular shape of the SR of sum (see Figure 4.12a). That is, the traversal does not continue beyond a limited neighborhood enclosing SR. However, MBM's conservative heuristic explores the nodes intersecting a wide margin around SR (a superset of SR). Increasing k decreases the difference between the performance of VR-kANN and that of MBM. The intuition here is that with large k, SR converges to a circle around the centroid point q (see the outer contours in Figure 4.12a). That is, SR becomes equivalent to the SR of a kNN query with query point q. Hence, the I/O costs of VR-kANN and MBM converge to those of their corresponding kNN algorithms with the same value of k.

[Figure 5.2: I/O vs. (a, b) k and (c, d) MBR(Q) for kANN, on the USGS and NE datasets.]

The next set of experiments investigates the impact of the closeness of the query points on the performance of each kANN algorithm. We varied the area covered by MBR(Q) from 0.25% to 16% of the entire dataset. With f = sum and |Q| = k = 8, we measured the I/O costs of VR-kANN and MBM. As Figures 5.2c and 5.2d show, when the area covered by the query points increases, VR-kANN accesses far fewer disk pages than MBM. The reason is the faster increase in MBM's I/O cost: when the query points are distributed over a larger area, the SR is also proportionally large.
Hence, the correspondingly larger margin around SR intersects many more nodes, leaving them unpruned by MBM. We also observed that changing the number of query points within the same MBR does not change the I/O cost of MBM. This observation matches the result reported in [PTMH05]. Similarly, VR-kANN's I/O cost is the same for different query sizes. The reason is that VR-kANN's I/O is affected only by the size of SR: increasing the number of query points within the same MBR only changes the shape of SR, not its size.

[Figure 5.3: I/O vs. (a, b) |Q| and (c, d) MBR(Q) for SSQ, on the USGS and NE datasets.]

To study the performance of the SSQ algorithms, our next set of experiments compares the I/O cost of our VR-S² algorithm with those of VS² and B²S² for different numbers of query points in an MBR(Q) covering 0.1% of each dataset. We used a small MBR(Q) for this experiment to obtain a reasonable number of skyline points; here, S(Q) contains 120 points on average for NE. With both datasets (Figures 5.3a and 5.3b), VR-S² always outperforms the competitor algorithms. As the second-best algorithm, VS² accesses almost the same number of disk pages as VR-S² does. Note, however, that VR-S² not only provides proven bounds on its I/O cost but also supports an arbitrary user-provided function to order its result.

Finally, our last set of experiments studies how the closeness of the query points affects each SSQ algorithm. Figures 5.3c and 5.3d show the I/O costs of all three algorithms for |Q| = 8 query points distributed over areas of different sizes. For both datasets, VR-S² is the superior approach for smaller areas (e.g., MBR(Q) < 1% for USGS).
While VR-S² is always better than the R-tree-based B²S², it accesses more pages than VS² for larger areas. Both algorithms utilize Voronoi diagrams to explore the SR of the query and hence behave similarly. However, with highly scattered query points, VR-S² accesses a limited number of extra pages beyond the SR in order to provide an ordered, incremental result; these pages are not accessed by the unordered traversal of VS². Notice that when MBR(Q) is large, the number of skyline points is also large (e.g., |S(Q)| ≈ 800 when MBR(Q) = 1% with NE). This is an extreme case, as in typical querying scenarios the skyline set is small.

Chapter 6

Conclusions

We studied processing spatial queries using an auxiliary data structure, the Voronoi diagram. We first reviewed our previous studies on the application of Voronoi diagrams in several application contexts. In each of these studies, we utilized the Voronoi diagram of the data objects to solve the corresponding spatial problem and to efficiently process the underlying spatial query. Our Voronoi-based algorithms consistently outperform their competitor approaches.

We observed that the efficiency of our Voronoi-based solutions is rooted in two fascinating properties of Voronoi diagrams: 1) the neighborhood information encoded in Voronoi diagrams provides an effective connectivity between data points, which facilitates a detailed exploration of the data space by traversing Voronoi neighborhood links (see Chapter 4), and 2) Voronoi diagrams and their several variations nicely capture a variety of distance-based relations in a spatial data space and hence encode the solution space of the corresponding spatial queries (see Appendices A and B). In this dissertation, inspired by these properties, we introduced the VoR-tree, an index structure that incorporates the Voronoi diagram and Delaunay graph of a set of data points into an R-tree that indexes their geometries.
The VoR-tree benefits from both the neighborhood exploration capability of Voronoi diagrams and the hierarchical structure of R-trees. For four different spatial nearest neighbor queries, we proposed I/O-optimal algorithms utilizing VoR-trees. All our algorithms use the hierarchy of the VoR-tree to access the portion of the data space that contains the query result. Subsequently, they use the Voronoi information associated with the points at the leaves of the VoR-tree to traverse the space towards the actual result. That is, they utilize the associations between neighboring Voronoi cells to explore the space in the direction that potentially yields a better answer to the query. Founded on geometric properties of Voronoi diagrams, our algorithms also redefine the search region of spatial nearest neighbor queries to expedite this exploration. Our theoretical analysis and extensive experiments with real-world datasets show that VoR-trees enable I/O-optimal processing of kNN, reverse kNN, k aggregate NN, and spatial skyline queries on point data. Compared to the competitor R-tree-based algorithms, our VoR-tree-based algorithms exhibit performance improvements of up to 18% for kNN, 99.9% for RkNN, 64% for kANN, and 17% for spatial skyline queries.

The efficiency of the VoR-tree in answering more general spatial queries remains to be studied. We believe that this index structure can potentially boost the efficiency of R-tree-based algorithms for a query class more general than spatial nearest neighbor queries. A promising direction for extending this work is to revisit the design of the VoR-tree and the relevant query processing algorithms for non-point datasets. Spatial data objects such as polygons and lines might require special handling when inserted into VoR-trees or examined as candidate query answers.

Another interesting direction is to extend the VoR-tree, as well as the Voronoi-based algorithms discussed in this dissertation, to road network databases.
Here, the locations of the spatial objects are restricted to the roads of the network. Moreover, the distance between two objects is a function of the network paths connecting them (e.g., the shortest path). This more realistic distance metric depends not only on the locations of the two points in R² but also on the connectivity of the roads; in many cases, it cannot be approximated by the Euclidean distance [SKS]. There are two challenges specific to network spaces: 1) finding the Voronoi cell containing a point is not trivial in a network space, because the cells are not simple polygons; instead, they are parts of graph edges corresponding to network segments, and 2) the representation of network Voronoi cells is not straightforward, as the cells in the network space may consist of disconnected portions of graph edges. Addressing these challenges, one can extend Euclidean VoR-trees to network VoR-trees, using which all nearest neighbor queries can be processed with respect to the network distance.

References

[BGS01] Philippe Bonnet, J. E. Gehrke, and Praveen Seshadri. Towards Sensor Database Systems. In Proceedings of the Second International Conference on Mobile Data Management, pages 3–14, 2001.

[BJKS06] Rimantas Benetis, Christian S. Jensen, Gytis Karčiauskas, and Simonas Saltenis. Nearest and Reverse Nearest Neighbor Queries for Moving Objects. The VLDB Journal, volume 15, September 2006.

[BKS01] Stephan Börzsönyi, Donald Kossmann, and Konrad Stocker. The Skyline Operator. In Proceedings of ICDE'01, pages 421–430, 2001.

[BKS04] Farnoush Banaei-Kashani and Cyrus Shahabi. SWAM: A family of access methods for similarity-search in peer-to-peer data networks. In Proceedings of the Thirteenth Conference on Information and Knowledge Management (CIKM'04), pages 304–313, November 2004.

[BKSS90] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger.
The R*-tree: an efficient and robust access method for points and rectangles. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, pages 322–331. ACM Press, 1990.

[CF] K. L. Cheung and A. W. Fu. Enhanced Nearest Neighbor Search on the R-tree. SIGMOD Record, 27(3):16–21, 1998.

[CGGL03] Jan Chomicki, Parke Godfrey, Jarek Gryz, and Dongming Liang. Skyline with Presorting. In Proceedings of ICDE'03, pages 717–816. IEEE Computer Society, 2003.

[Cla02] Keith C. Clarke. Getting Started with GIS. Prentice Hall, 4th edition, 2002.

[dBSW97] Jochen Van den Bercken, Bernhard Seeger, and Peter Widmayer. A generic approach to bulk loading multidimensional index structures. In Matthias Jarke, Michael J. Carey, Klaus R. Dittrich, Frederick H. Lochovsky, Pericles Loucopoulos, and Manfred A. Jeusfeld, editors, Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB'97), pages 406–415. Morgan Kaufmann, August 1997.

[dBvKOS00] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry: Algorithms and Applications. Springer Verlag, 2nd edition, 2000.

[FSAA01] Hakan Ferhatosmanoglu, Ioana Stanoi, Divyakant Agrawal, and Amr El Abbadi. Constrained nearest neighbor queries. In SSTD, pages 257–278, 2001.

[Gut84] Antonin Guttman. R-trees: a Dynamic Index Structure for Spatial Searching. In SIGMOD '84: Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, pages 47–57, New York, NY, USA, 1984. ACM Press.

[Hag03] Michiel Hagedoorn. Nearest neighbors can be found efficiently if the dimension is small relative to the input size. In Proceedings of the 9th International Conference on Database Theory (ICDT 2003), volume 2572 of Lecture Notes in Computer Science, pages 440–454. Springer, January 2003.

[HJ04] Xuegang Huang and Christian S. Jensen. In-Route Skyline Querying for Location-based Services.
In Proceedings of the 4th International Workshop on Web and Wireless Geographical Information Systems (W2GIS'04), volume 3428, pages 120–135. Springer, 2004.

[HJLO06] Zhiyong Huang, Christian S. Jensen, Hua Lu, and Beng Chin Ooi. Skyline Queries Against Mobile Lightweight Devices in MANETs. In Proceedings of ICDE'06. IEEE Computer Society, 2006.

[HS99] Gísli R. Hjaltason and Hanan Samet. Distance Browsing in Spatial Databases. ACM Transactions on Database Systems (TODS), 24(2):265–318, 1999.

[JKPT03] Christian S. Jensen, Jan Kolář, Torben Bach Pedersen, and Igor Timko. Nearest neighbor queries in road networks. In Proceedings of the 11th ACM International Symposium on Advances in Geographic Information Systems (GIS'03), pages 1–8. ACM Press, 2003.

[JP02] Sungwon Jung and Sakti Pramanik. An Efficient Path Computation Model for Hierarchically Structured Topological Road Maps. IEEE Transactions on Knowledge and Data Engineering, 2002.

[KM] Flip Korn and S. Muthukrishnan. Influence Sets Based on Reverse Nearest Neighbor Queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data, May 2000, Dallas, Texas, USA.

[KM00] Flip Korn and S. Muthukrishnan. Influence sets based on reverse nearest neighbor queries. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 201–212. ACM Press, 2000.

[KOT04] Nick Koudas, Beng Chin Ooi, Kian-Lee Tan, and Rui Zhang. Approximate NN queries on streams with guaranteed error/performance bounds. In VLDB, pages 804–815, 2004.

[KRR02] Donald Kossmann, Frank Ramsak, and Steffen Rost. Shooting Stars in the Sky: An Online Algorithm for Skyline Queries. In Proceedings of VLDB'02, pages 275–286, 2002.

[KS] Mohammad Kolahdouzan and Cyrus Shahabi. Voronoi-Based K Nearest Neighbor Search for Spatial Network Databases. In VLDB 2004, Toronto, Canada.

[KSF+96] Flip Korn, Nikolaos Sidiropoulos, Christos Faloutsos, Eliot Siegel, and Zenon Protopapas.
Fast Nearest Neighbor Search in Medical Image Databases. In VLDB'96, Proceedings of the 22nd International Conference on Very Large Data Bases, September 3–6, 1996, Mumbai (Bombay), India, pages 215–226. Morgan Kaufmann, 1996.

[LYWL05] Xuemin Lin, Yidong Yuan, Wei Wang, and Hongjun Lu. Stabbing the Sky: Efficient Skyline Computation over Sliding Windows. In Proceedings of ICDE'05, pages 502–513. IEEE Computer Society, 2005.

[Man01] S. Maneewongvatana. Multi-dimensional Nearest Neighbor Searching with Low-dimensional Data. PhD thesis, Computer Science Department, University of Maryland, College Park, MD, 2001.

[OBSC00] Atsuyuki Okabe, Barry Boots, Kokichi Sugihara, and Sung Nok Chiu. Spatial Tessellations: Concepts and Applications of Voronoi Diagrams. John Wiley and Sons Ltd., 2nd edition, 2000.

[PTFS05] Dimitris Papadias, Yufei Tao, Greg Fu, and Bernhard Seeger. Progressive Skyline Computation in Database Systems. ACM Transactions on Database Systems, 30(1):41–82, 2005.

[PTMH05] Dimitris Papadias, Yufei Tao, Kyriakos Mouratidis, and Chun Kit Hui. Aggregate Nearest Neighbor Queries in Spatial Databases. ACM Transactions on Database Systems, 30(2):529–576, 2005.

[PZMT] Dimitris Papadias, Jun Zhang, Nikos Mamoulis, and Yufei Tao. Query Processing in Spatial Network Databases. In VLDB 2003, Berlin, Germany.

[RKV95] Nick Roussopoulos, Stephen Kelley, and Frédéric Vincent. Nearest Neighbor Queries. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, California, May 22–25, 1995, pages 71–79. ACM Press, 1995.

[SAA00] Ioana Stanoi, Divyakant Agrawal, and Amr El Abbadi. Reverse nearest neighbor queries for dynamic databases. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 44–53, 2000.

[SK98] Thomas Seidl and Hans-Peter Kriegel. Optimal Multi-Step K-Nearest Neighbor Search.
In SIGMOD 1998, Proceedings of the ACM SIGMOD International Conference on Management of Data, June 2–4, 1998, Seattle, Washington, USA, pages 154–165. ACM Press, 1998.

[SKS] Cyrus Shahabi, Mohammad R. Kolahdouzan, and Mehdi Sharifzadeh. A Road Network Embedding Technique for k-Nearest Neighbor Search in Moving Object Databases. In ACMGIS 2002, McLean, VA, USA.

[SKS07] Mehdi Sharifzadeh, Mohammad Kolahdouzan, and Cyrus Shahabi. The Optimal Sequenced Route Query. The VLDB Journal, 2007. ISSN: 1066-8888 (Print), 0949-877X (Online), DOI: 10.1007/s00778-006-0038-6, Issue: Online First.

[SR] Zhexuan Song and Nick Roussopoulos. K-Nearest Neighbor Search for Moving Query Point. In The Seventh International Symposium on Spatial and Temporal Databases (SSTD 2001), Redondo Beach, CA, USA.

[SRAA01] Ioana Stanoi, Mirek Riedewald, Divyakant Agrawal, and Amr El Abbadi. Discovery of Influence Sets in Frequently Updated Databases. In Proceedings of the 27th International Conference on Very Large Data Bases, pages 99–108. Morgan Kaufmann Publishers Inc., 2001.

[SS] Mehdi Sharifzadeh and Cyrus Shahabi. Processing Optimal Sequenced Route Queries using Voronoi Diagrams. Submitted for review.

[SS04a] Mehdi Sharifzadeh and Cyrus Shahabi. Approximate Voronoi Cell Computation on Geometric Data Streams. Technical Report 04-835, Computer Science Department, University of Southern California, 2004.

[SS04b] Mehdi Sharifzadeh and Cyrus Shahabi. Supporting Spatial Aggregation in Sensor Network Databases. In Proceedings of the 12th ACM International Symposium on Advances in Geographic Information Systems, pages 166–175, 2004.

[SS05] Mehdi Sharifzadeh and Cyrus Shahabi. Utilizing Voronoi Cells of Location Data Streams for Accurate Computation of Aggregate Functions in Sensor Networks. GeoInformatica, 10(1):9–36, March 2005.

[SS06a] Mehdi Sharifzadeh and Cyrus Shahabi. Additively Weighted Voronoi Diagrams for Optimal Sequenced Route Queries.
In Proceedings of the 3rd International Workshop on Spatio-Temporal Database Management (STDBM'06), September 2006. CEUR Workshop Proceedings, online: CEUR-WS.org/Vol-174/paper5.pdf.

[SS06b] Mehdi Sharifzadeh and Cyrus Shahabi. The Spatial Skyline Queries. In Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB'06), September 2006.

[SSK03] Mehdi Sharifzadeh, Cyrus Shahabi, and Craig A. Knoblock. Learning Approximate Thematic Maps from Labeled Geospatial Data. In International Workshop on Next Generation Geospatial Information (NG2I), 2003.

[SSK05] Mehdi Sharifzadeh, Cyrus Shahabi, and Craig A. Knoblock. Learning Approximate Thematic Maps from Labeled Geospatial Data. In Peggy Agouris and Arie Croitoru, editors, Next Generation Geospatial Information: From Digital Image Analysis to SpatioTemporal Databases, chapter 3, pages 129–141. A.A. Balkema Publishers, 2005.

[SWCS] P. Sistla, O. Wolfson, S. Chamberlain, and S. Dao. Modeling and Querying Moving Objects. In Proceedings of the Thirteenth International Conference on Data Engineering (ICDE 1997), April 7–11, 1997, Birmingham, U.K.

[TEO01] Kian-Lee Tan, Pin-Kwang Eng, and Beng Chin Ooi. Efficient Progressive Skyline Computation. In Proceedings of VLDB'01, pages 301–310, 2001.

[TN00] Yannis Theodoridis and Mario A. Nascimento. Generating Spatiotemporal Datasets on the WWW. SIGMOD Record, 29(3):39–43, 2000.

[TPL04] Yufei Tao, Dimitris Papadias, and Xiang Lian. Reverse kNN Search in Arbitrary Dimensionality. In Proceedings of the Thirtieth International Conference on Very Large Data Bases, pages 744–755. Morgan Kaufmann, 2004.

[TPS] Yufei Tao, Dimitris Papadias, and Qiongmao Shen. Continuous Nearest Neighbor Search. In VLDB 2002, Proceedings of the 28th International Conference on Very Large Data Bases, August 20–23, 2002, Hong Kong, China.

[XZLL04] Jianliang Xu, Baihua Zheng, Wang-Chien Lee, and Dik Lun Lee.
The D-Tree: An Index Structure for Planar Point Queries in Location-Based Wireless Services. IEEE Transactions on Knowledge and Data Engineering, 16(12):1526–1542, 2004.

[YL] Congjun Yang and King-Ip Lin. An Index Structure for Efficient Reverse Nearest Neighbor Queries. In Proceedings of the 17th International Conference on Data Engineering (ICDE 2001), April 2–6, 2001, Heidelberg, Germany.

[ZL01] Baihua Zheng and Dik Lun Lee. Semantic Caching in Location-Dependent Query Processing. In SSTD, pages 97–116, 2001.

[ZZP+03] Jun Zhang, Manli Zhu, Dimitris Papadias, Yufei Tao, and Dik Lun Lee. Location-based Spatial Queries. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 443–454. ACM Press, 2003.

Appendix A

Case Studies

To show the efficiency of Voronoi diagrams in processing various spatial queries, we studied four different problems in different contexts as our case studies. To solve each problem, we developed a Voronoi-based algorithm employing customized variations of Voronoi diagrams. In this appendix, we briefly discuss our solutions for the Optimal Sequenced Route query [SKS07, SS06a, SS], learning thematic maps from labeled spatial data [SSK03, SSK05], and spatial aggregation in sensor network databases [SS04b, SS05, SS04a]. For the details of these three case studies, we refer the reader to the corresponding publications. We discuss our fourth case study, spatial skyline queries, as well as its corresponding solutions, extensively in Appendix B.

A.1 Optimal Sequenced Route Query

Our first case study utilizes Voronoi diagrams with respect to a weighted Manhattan distance to address Optimal Sequenced Route (OSR) queries. We first introduced the OSR query in [SKS07] as follows.
Suppose that we are planning a Saturday trip in town as follows: first we intend to visit a shopping center in the afternoon to check the season's new arrivals, then we plan to dine in an Italian restaurant in the early evening, and finally, we would like to watch a specific movie at late night. Naturally, we intend to drive the minimum overall distance to these destinations. In other words, we need to find the locations of the shopping center s_i, the Italian restaurant r_j, and the theater t_k showing the specific movie such that driving to them in the sequence given by the plan shortens our trip (in terms of distance or time). Note that in this example, a time constraint enforces the order in which these destinations are to be visited; we usually do not have dinner in the afternoon, or go shopping at late night. We call this type of query, where the order of the sequence in which some points must be visited is enforced and cannot be changed, the Optimal Sequenced Route, or OSR, query. This query is essential in other application domains such as crisis management, air traffic flow management, supply chain management, and video surveillance [SKS07].

In [SS06a], we showed that unique regions of R² can be identified within which the result of a given OSR query with a certain sequence is invariant. Subsequently, we identified an additively weighted function as the distance function used by Additively Weighted (AW-) Voronoi diagrams, a variation of ordinary Voronoi diagrams, in order to encode these regions. We theoretically proved the relation between OSR queries and AW-Voronoi diagrams and developed a novel query processing technique that outperforms our previous algorithms. In [SS], we extended our approach to road network databases by introducing the notion of network AW-Voronoi diagrams. To the best of our knowledge, no prior study had defined AW-Voronoi diagrams in the metric space of road networks.
Hence, in [SS] we proposed an efficient algorithm to build network AW-Voronoi diagrams.

A.2 Learning Thematic Maps from Labeled Spatial Data

A thematic map is a map primarily designed to show a theme, a single spatial distribution or a pattern, using a specific map type [Cla02]. These maps show the distribution of a feature over a limited geographic area. Each map defines a partitioning of the area into a set of closed and disjoint regions, each of which includes all the points with a common characteristic (e.g., the same label).

[Figure A.1: California county map as a typical thematic map.]

Figure A.1 illustrates a California county map that can be viewed as a thematic map with county name as the common feature of each region. Suppose that a set of geocoordinates of houses in California and their corresponding county names is given. The problem is to find an approximate thematic map of California counties. In [SSK03] and [SSK05], we proposed to utilize spatial classification methods to learn approximate thematic maps from labeled data. A spatial classifier of geocoordinates, trained on the corresponding county names, is able to predict the county of a new house not present in the original training data. Illustrating the decision boundaries of the county classes resulting from this spatial classification constitutes an approximate thematic map of California counties.

[Figure A.2: Merging Voronoi cells corresponding to the points with a common label (e.g., county).]

Spatial classification exploits the fact that closer points in the data space are more related to each other and hence more likely to belong to the same class. This adequately qualifies Voronoi diagrams as distance-aware spatial classifiers. Inspired by this fact, in [SSK05] we proposed to use Voronoi diagrams to build the thematic map of counties as follows. We first generate the ordinary Voronoi diagram of the input labeled points (i.e., house locations).
Hence, the county of any new point is equal to that of the generator point of its containing Voronoi cell. The boundaries between different counties form a subset of the Voronoi diagram of the labeled data. Merging the Voronoi cells corresponding to points with identical counties forms the map region for that county. Figure A.2 shows the Voronoi diagram of a set of points labeled with four different feature values and the merge step for one of the feature values (i.e., A). We showed in [SSK05] that the accuracy of our Voronoi-based method is very close to that of the best competitor method. This performance, combined with the simplicity of our method, makes it an excellent practical method for building thematic maps.

A.3 Spatial Aggregation in Sensor Network Databases

Our third case study investigates the application of Voronoi diagrams in a different form of spatial databases. Consider a sensor network deployed in a physical environment to monitor a real-world phenomenon. The measurement values generated by each sensor node are data samples representing this phenomenon. The set of all measurements, associated with the geocoordinates of the generating nodes, can be conceptualized as a spatial database, termed a Sensor Network Database (SNDB) [BGS01].

[Figure A.3: Two non-uniformly distributed 4-node sensor networks; in (a), s_1 reads 25 and s_2, s_3, s_4 read 5, while in (b), s_4 reads 25 and the others read 5.]

Given this new database, the class of traditional aggregation queries is of main interest within the sensor network community. The common basis of all aggregation queries is to compute a summary value based on a set of database items. To illustrate, consider the 4-node sensor network measuring humidity values shown in Figure A.3a. A typical aggregate query on these values is to find the average humidity in the region covered by all sensor nodes. In general, no assumption should be made about the distribution of the sensor nodes in the environment.
Hence, applying a traditional average to a non-uniformly distributed observation set may lead to erroneous results and false reasoning about the phenomenon. That is, as the phenomenon is usually a continuous process, only a uniform sample set is a good representative of the whole process. Furthermore, traditional aggregation operators are not robust to outliers and noisy values. The average humidity reported by both sensor networks in Figures A.3a and A.3b is 10. With a humidity that increases continuously from left to right, the real average humidity in Figure A.3a should be 15, while in Figure A.3b the real average is expected to be close to 5, since sensor node s_4 happens to be close to a local maximum in the space of humidity values.

To solve this problem, we introduced spatial aggregation in [SS04b], which takes into consideration the location of the sensor node generating each measurement. This new class of spatial queries utilizes distance-based aggregation functions that respect the spatial distribution of the measurements. As a representative of this query class, we formalized the spatial average operator as a weighted average over the sensor measurements. We proposed a Voronoi-based operator that utilizes the Voronoi diagram of the node locations to partition the deployment field into convex polygons (i.e., Voronoi cells). The weight our method assigns to the measurements of each node is the area of its corresponding Voronoi cell. With our spatial aggregation operator, the sensor measurements of sparse areas (e.g., s_1 in Figure A.3) contribute more to the final result than those of dense areas. For in-network processing of our Voronoi-based operator, we proposed a distributed Voronoi cell computation algorithm for sensor networks. Each node employs this algorithm to compute its own weight for the aggregation operator. The algorithm generates the Voronoi cell of a fixed point using a stream of point data.
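The area-weighted spatial average described above can be sketched in a few lines. This is an illustrative sketch, not the dissertation's in-network algorithm: the cell polygons and humidity readings below are invented assumptions, and in the actual operator each node derives its weight from its locally computed Voronoi cell.

```python
# Sketch of the spatial average operator: each sensor's measurement is
# weighted by the area of its Voronoi cell. Cells and readings here are
# hypothetical, chosen so one node covers a sparse region.

def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    n = len(vertices)
    s = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def spatial_average(readings, cells):
    """Area-weighted average: nodes covering sparse regions count more."""
    weights = [polygon_area(c) for c in cells]
    return sum(w * r for w, r in zip(weights, readings)) / sum(weights)

# One large cell (sparse region) and two small cells (dense region).
cells = [[(0, 0), (6, 0), (6, 2), (0, 2)],    # area 12
         [(6, 0), (8, 0), (8, 2), (6, 2)],    # area 4
         [(8, 0), (10, 0), (10, 2), (8, 2)]]  # area 4
readings = [25.0, 5.0, 5.0]
print(spatial_average(readings, cells))  # (12*25 + 4*5 + 4*5) / 20 = 17.0
```

Note how the node covering the large, sparse cell pulls the result toward its reading, which is exactly the behavior motivated by Figure A.3; an unweighted mean of the same readings would be about 11.7.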
At each sensor node, the algorithm iteratively uses the stream of locations it receives from the other nodes to construct its own local Voronoi cell.

In a real-world scenario, sensor nodes frequently fail and stop generating measurement values. As a result, the Voronoi cell of each node changes with the dynamism of the entire sensor network. Hence, in [SS05] we proposed a sampling algorithm that enables each node to maintain a good approximation of its cell while it receives updates from other nodes. In [SS04a], we theoretically computed the expected approximation error of our algorithm and its expected sample size, and we showed how one can tune the parameters of the algorithm to maintain a Voronoi cell of any desired accuracy.

Appendix B

Spatial Skyline Query

B.1 Introduction

Suppose that the members of a multidisciplinary task force team located at different (fixed) offices want to put together a list of restaurants for their weekly lunch meetings. These meeting locations must be interesting in terms of traveling distances for all the team members; for each restaurant r in the list, no other restaurant is closer to all members than r. That is, there exists no better choice than an interesting restaurant in terms of all of their comparable attributes (here, the distances to the team members). Generating this list becomes even more challenging when the team members are mobile and change location over time. Figure B.1 illustrates the locations of three members m_i and four restaurants r_i. Restaurant r_1 (r_2) is an interesting candidate location for the meeting, as no other restaurant is closer than r_1 to m_1 (r_2 to m_2). However, r_3 is not interesting, as r_1 (and r_2) is closer to all members.

[Figure B.1: A set of restaurants P = {r_1, ..., r_4} and team members Q = {m_1, ..., m_3}.]
Suppose the information about all restaurants is stored in a database as objects with static attributes (e.g., the rating of the restaurant). The database literature refers to the interesting objects with respect to their static attributes as the skyline objects of the database: those that are not dominated by any other object. These static skyline objects depend only on the database itself (and the definition of an interesting object). Even though the restaurants interesting to our team members can also be considered the skyline of the database of restaurants, there are two major distinctions: 1) The restaurants' distance attributes, based on which domination is defined, are dynamically calculated based on the user's query (i.e., the locations of the team members). Consequently, the result depends on both the data and the given query. 2) As these attributes are spatial, there is a unique corresponding geometric interpretation of the spatial skyline in the space of the objects. The query retrieving the set of interesting restaurants in our motivating example belongs to a broader novel type of spatial queries. In [SS06b], for the first time, we introduced the concept of these Spatial Skyline Queries (SSQ). Given a set of data points P and a set of query points Q in the d-dimensional space, an SSQ retrieves those points of P which are not spatially dominated by any other point in P. The point p spatially dominates p′ iff we have D(p, q_i) ≤ D(p′, q_i) for all q_i ∈ Q and D(p, q_j) < D(p′, q_j) for some q_j ∈ Q. An interesting variation is where domination is determined with respect to both spatial and non-spatial attributes of P (e.g., ratings of restaurants). Besides online map services and group navigation/planning, SSQ is critical for many applications. In the domain of trip planning, the spatial skyline of hotels with respect to the fixed locations of the conference venue, beaches, and museums includes all the interesting hotels for lodging during a business/pleasure trip.
No other hotel is closer than these hotels to all three must-see locations. In the crisis management domain, the residential buildings that must be evacuated first in the event of several explosions/fires are those which are in the spatial skyline with respect to the fire locations. The reason is that these places are either potentially trapped in the convex hull of the fires or located at the edges of the expanding fire. In defense and intelligence applications, consider the locations of soldiers penetrating into an enemy's camp as query locations and the enemy's guard stations as data points. The stations in the spatial skyline are those from which an attack might be initiated against the platoon of soldiers. Since the introduction of the skyline operator by Börzsönyi et al. [BKS01], several efficient algorithms have been proposed for the general skyline query. These algorithms utilize techniques such as divide-and-conquer [BKS01], nearest neighbor search [KRR02], sorting [CGGL03], and index structures [BKS01, TEO01, PTFS05] to answer general skyline queries. Several studies have also focused on skyline query processing in a variety of problem settings such as data streams [LYWL05] and data residing on mobile devices [HJLO06]. However, to the best of our knowledge, no study has addressed spatial skyline queries. The most relevant work is the BBS algorithm proposed by Papadias et al. [PTFS05], which can be utilized to address SSQ by considering it as a special case of the dynamic skyline query. However, since BBS addresses a more general problem, it overlooks the geometric properties of SSQ and hence its performance is not optimal in the spatial domain. We compare our techniques with BBS in Section B.7. Optimal processing of SSQs is more challenging than that of general skyline queries, as each dominance check here requires the computation of the dynamic distance attributes.
Notice that the algorithms for Group or Aggregate Nearest Neighbor queries [PTMH05] are related but not applicable to SSQ, as they only find the optimal (best) object based on a fixed preference function. Hence, they require no intermediate dominance checks, which are vital in processing SSQs. In this appendix, we exploit the geometric properties of the SSQ problem space, treating it as a nearest neighbor query. In the vector space of R², we study the distance-based function of spatial domination utilized by SSQs and show how an ordinary Voronoi diagram of the data points in P captures this function. We analytically prove that the points whose Voronoi cells are inside or intersect with the convex hull of the query points constitute a subset of the spatial skyline of P. Inspired by this result, we propose two algorithms, B²S² and VS², for static query points and one algorithm, VCS², for a streaming Q whose points change location over time (e.g., are mobile). The R-tree-based B²S² is a customization of BBS [PTFS05] for SSQ that benefits from our theoretical foundation by exploiting the geometric properties of the problem space. B²S² is more efficient than BBS as it not only avoids expensive dominance checks for definite skyline points but also prunes unnecessary query points to reduce the cost of each examination. VS², however, employs the Voronoi diagram and Delaunay graph of the data points as a roadmap to find the first skyline point whose local neighborhood contains all other points of the skyline. The algorithm traverses the nodes of the graph (i.e., data points of P) in the order specified by a monotone function over their distances to the query points. Similar to B²S², VS² also efficiently exploits the geometric properties of the problem space while traversing the Delaunay graph.
VS² reduces the complexity of the naive SSQ search from O(|P|²|Q|) to O(|S|²|C| + √|P|), where |S| and |C| are the solution size and the number of vertices of the convex hull of Q, respectively. The √|P| factor can be reduced further to O(log|P|) if an index structure is used. We comprehensively study the scenario where the query points are the locations of multiple moving objects with no predefined trajectories. This occurs in our motivating applications when the team members are mobile agents or the soldiers are moving. Here, we frequently receive the latest query locations as spatial data streams. For each pattern of movement, we extract the locus of all points whose dominance changes and hence triggers an update to the old spatial skyline. Consequently, to address continuous SSQ we propose VCS², which exploits the pattern of change in Q to avoid unnecessary re-computation of the skyline and hence efficiently performs updates. Furthermore, we study a more general case of SSQ where one intends to find the skyline with respect to both the static non-spatial attributes of the data points and their distances to the query points. For instance, the best restaurant in Los Angeles might be dominated in terms of distance to our team members but still be in the skyline because of its rating. We show that B²S², VS², and VCS² can all support this variation of SSQ as well. Finally, through extensive experiments with both real-world and synthetic datasets, we show that B²S², VS², and VCS² can efficiently answer an SSQ query. Both the R-tree-based B²S² and the Voronoi-based VS² outperform the best competitor approach, BBS, in terms of processing time by a wide margin (up to 6 times better). Our experimental results with synthetically moving objects verify that, on average, for 3-10 query points less than 25% of movements require the entire skyline to be recomputed.
For the other 75% of movements, VCS² outperforms VS² by a factor of 3, which makes it the superior algorithm for continuous SSQ.

B.2 Formal Problem Definition

Assume that we have a database of N objects. Each database object p with d real-valued attributes can be conceptualized as a d-dimensional point (p_1, ..., p_d) ∈ R^d, where p_i is the i-th attribute of p. We use P to refer to the set of all these points. Figure B.2a illustrates a database of six objects P = {a, b, c, d, e, f}, each representing the description of a hotel with two attributes: distance to beach and price. Figure B.2b shows the corresponding points in the 2-dimensional space where the x and y axes correspond to the range of the attributes distance and price, respectively.

object  distance (mile)  price ($)
a       0.5              200
b       2                150
c       2.5              25
d       4                125
e       1.5              100
f       3                75

Figure B.2: A 2-dimensional database of six objects

Throughout this appendix, we use point and object interchangeably to refer to each database object. In the following sections, we first define the general skyline query using the above database conceptualization. Then, we introduce our spatial skyline query based on the definition of the general skyline query.

B.2.1 General Skyline Query

Given two points p = (p_1, ..., p_d) and p′ = (p′_1, ..., p′_d) in R^d, p dominates p′ iff we have p_i ≤ p′_i for 1 ≤ i ≤ d and p_j < p′_j for at least one 1 ≤ j ≤ d. To illustrate, in Figure B.2b the point f = (3, 75) dominates the point d = (4, 125). Now, given a set of points P, the skyline of P is the set of those points of P which are not dominated by any other point in P. The skyline of the points shown in Figure B.2b is the set S = {a, c, e}. The Skyline Query is to find the skyline set of the given database P, considering the attributes of the objects in P as dimensions of the space. Notice that a skyline point need not dominate any point of P.
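The dominance test and the brute-force skyline of the Figure B.2 database can be sketched in Python (an illustration, not part of the dissertation's implementation):

```python
# Brute-force general skyline over the Figure B.2 hotel database. Attribute
# vectors are (distance_to_beach_miles, price_dollars); both are
# "smaller is better", matching the dominance definition in Section B.2.1.

def dominates(p, q):
    """p dominates q iff p <= q in every dimension and p < q in at least one."""
    return (all(pi <= qi for pi, qi in zip(p, q))
            and any(pi < qi for pi, qi in zip(p, q)))

def skyline(points):
    """Return the subset of `points` not dominated by any other point."""
    return {
        name: attrs
        for name, attrs in points.items()
        if not any(dominates(other, attrs)
                   for o, other in points.items() if o != name)
    }

hotels = {
    "a": (0.5, 200), "b": (2, 150), "c": (2.5, 25),
    "d": (4, 125), "e": (1.5, 100), "f": (3, 75),
}

print(sorted(skyline(hotels)))  # ['a', 'c', 'e']
```

Running it reproduces the skyline S = {a, c, e} of Figure B.2b.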
For instance, in Figure B.2b, while the points c and e each dominate two other points, the point a dominates no point.

B.2.2 Spatial Skyline Query

Let the set P contain points in the d-dimensional space R^d, and let D(·,·) be a distance metric defined in R^d, where D(·,·) obeys the triangle inequality. Given a set of d-dimensional query points Q = {q_1, ..., q_n} and two points p and p′ in R^d, p spatially dominates p′ with respect to Q iff we have D(p, q_i) ≤ D(p′, q_i) for all q_i ∈ Q and D(p, q_j) < D(p′, q_j) for some q_j ∈ Q. Figure B.3 shows a set of nine 2-d points and two query points q_1 and q_2. With the Euclidean distance metric, the point p spatially dominates the point p′ as both q_1 and q_2 are closer to p than to p′. Note that if we draw the perpendicular bisector line of the line segment pp′, then q_1, q_2, and p will be located on the same side of the bisector line (where p′ is not; see Figure B.3). We use this geometric interpretation of the spatial dominance relation between two points in Section B.3.1 to justify the foundation of our proposed algorithms. For each point p, consider the circles C(q_i, p) centered at the query point q_i with radius D(q_i, p). Obviously, q_i is closer to any point inside C(q_i, p) than to p. Therefore, by the above definition, any point such as p″ which is inside the intersection of all C(q_i, p) for all q_i ∈ Q spatially dominates p. We call this intersection area, which potentially includes all the points which spatially dominate p, the dominator region of p. Similarly, the locus of all points such as p′ which are spatially dominated by p is the intersection of the outsides of all circles C(q_i, p) (the grey region in Figure B.3)¹. For a point p, we refer to this region as the dominance region of p.
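A minimal sketch of this spatial dominance test under the Euclidean metric (the coordinates below are illustrative, not those of Figure B.3):

```python
# Spatial dominance (Section B.2.2): p spatially dominates p2 w.r.t. Q iff
# p is at least as close as p2 to every query point, and strictly closer
# to at least one of them.
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def spatially_dominates(p, p2, Q):
    ds = [(dist(p, q), dist(p2, q)) for q in Q]
    return (all(dp <= dp2 for dp, dp2 in ds)
            and any(dp < dp2 for dp, dp2 in ds))

Q = [(0.0, 0.0), (4.0, 0.0)]     # two query points
p, p2 = (2.0, 1.0), (2.0, 3.0)   # p is closer to both q_1 and q_2
print(spatially_dominates(p, p2, Q))  # True
print(spatially_dominates(p2, p, Q))  # False
```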
¹ Assume that the large rectangle shows the universe space of the data points.

Figure B.3: The spatial skyline of a set of nine points

Given the two sets P of data points and Q of query points, the spatial skyline of P with respect to Q is the set of those points in P which are not spatially dominated by any other point of P; that is, the points which are not inside the dominance region of any other point. The point p ∈ P is in the spatial skyline of P with respect to Q iff for any point p′ ∈ P there is a query point q_i ∈ Q for which we have D(p, q_i) ≤ D(p′, q_i). That is, p is in the spatial skyline iff we have:

∀p′ ∈ P, p′ ≠ p : ∃q_i ∈ Q s.t. D(p, q_i) ≤ D(p′, q_i)    (B.1)

We use spatial skyline point and skyline point interchangeably to denote any point in the spatial skyline. Considering the above definitions, the Spatial Skyline Query (SSQ) is to find the spatial skyline points of the given set P with respect to the query set Q. The naive brute-force search algorithm for finding the spatial skyline of P given a query set Q requires examining all points in P against each other. For each point p, |Q| distances D(p, q_i) are computed and compared against the corresponding distances of other points. If no point spatially dominates p, then p is added to the solution. The time complexity of this naive algorithm is O(|P|²|Q|) as it exhaustively examines all data points against each other. However, an optimal SSQ algorithm must examine each point p against only those points which are inside the dominator region of p. In Section B.3, we establish the theoretical foundation of our efficient SSQ algorithms, which allows us to avoid a significant number of the distance computations of the naive approach.
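The naive O(|P|²|Q|) search described above can be sketched as follows (illustrative data; Euclidean metric assumed):

```python
# Naive spatial skyline: every point is checked against every other point,
# with |Q| distance computations per pair. Coordinates are illustrative.
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def spatially_dominates(p, p2, Q):
    ds = [(dist(p, q), dist(p2, q)) for q in Q]
    return (all(d1 <= d2 for d1, d2 in ds)
            and any(d1 < d2 for d1, d2 in ds))

def naive_spatial_skyline(P, Q):
    return [p for p in P
            if not any(spatially_dominates(p2, p, Q) for p2 in P if p2 != p)]

Q = [(0.0, 0.0), (4.0, 0.0)]
P = [(2.0, 0.5), (2.0, 3.0), (0.5, 0.0), (5.0, 5.0)]
print(naive_spatial_skyline(P, Q))  # [(2.0, 0.5), (0.5, 0.0)]
```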
Symbol     Meaning
P          set of database points
N          |P|, cardinality of P
D(·,·)     distance function in P
Q          set of query points {q_1, ..., q_n}
n          |Q|, cardinality of Q
C(q_i, p)  circle centered at q_i with radius D(q_i, p)
V(p)       Voronoi cell of p
CH(Q)      convex hull of Q
CH_v(Q)    set of vertices of CH(Q)
S(Q)       spatial skyline points of P with respect to Q

Table B.1: Summary of notations used in Appendix B

B.3 Foundation

We first identify the geometric structure of the solution space corresponding to the SSQ problem. In particular, we intend to understand how the spatial skyline points of a given query are related to the data and query points. These relationships constitute the foundation of our proposed solutions to SSQ. Notice that while throughout the appendix we use the Euclidean distance function in 2-d space, our results hold in general for any dimension d.

B.3.1 Theories

We assume that the set of data points P and query points Q are given. In this section, we exploit the geometric properties of the spatial skyline points of P with respect to Q. To reduce our search space in finding spatial skyline points, we prove one lemma (9) and two theorems (10 & 13) that help us immediately identify definite skyline points, and one theorem (12) to eliminate some of the query points not contributing to the search. The first property holds for both general and spatial skylines.

Lemma 9. For each q_i ∈ Q, the closest point to q_i in P is a skyline point.

Figure B.4: a) Theorem 10, b) Theorem 12, and c) Theorem 13

Proof. If p is the closest point to q_i in P, we have D(p, q_i) < D(p′, q_i) for all p′ ∈ P (p′ ≠ p). By definition, no point in P spatially dominates p. Hence, p is a skyline point.

A restated form of Lemma 9 is valid for the general skyline, where all the points with the minimum value in one dimension are in the skyline.
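Lemma 9 can be checked numerically against a brute-force skyline (the coordinates and the helper `naive_spatial_skyline` are illustrative, not the dissertation's code):

```python
# Numerical check of Lemma 9: for every query point, its nearest data point
# belongs to the spatial skyline. The coordinates are made up.
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def spatially_dominates(p, p2, Q):
    ds = [(dist(p, q), dist(p2, q)) for q in Q]
    return (all(d1 <= d2 for d1, d2 in ds)
            and any(d1 < d2 for d1, d2 in ds))

def naive_spatial_skyline(P, Q):
    return {p for p in P
            if not any(spatially_dominates(o, p, Q) for o in P if o != p)}

Q = [(0.0, 0.0), (6.0, 0.0), (3.0, 5.0)]
P = [(1.0, 1.0), (5.0, 1.0), (3.0, 4.0), (3.0, 2.0), (9.0, 9.0)]
S = naive_spatial_skyline(P, Q)
for q in Q:
    assert min(P, key=lambda p: dist(p, q)) in S  # Lemma 9 in action
print(sorted(S))
```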
Lemma 9 shows that the inclusion of some data points in the spatial skyline of P does not depend on the location of any other point of P. For example, in Figure B.3 the point p‴ is a skyline point as it is the closest point to the query point q_1. It is a skyline point only because of q_1's location, regardless of where the other data points of P are located. In Section B.4.2, we utilize Lemma 9 to start our search for skyline points from a definite skyline point such as p‴. The following theorem also shows that specific points in P are skyline points independent of the locations of other points of P.

Theorem 10. Any point p ∈ P that is inside the convex hull of Q is a skyline point.

Proof. The proof is by contradiction. Assume that p, which is inside the convex hull CH(Q), is not a skyline point (see Figure B.4a). Then there is a point p′ ∈ P which spatially dominates p. Therefore, if we draw the perpendicular bisector line of the line segment pp′, all the query points q_i will be located on the same side of the line where p′ is located. Hence, CH(Q) is also on the same side as p′. That is, the perpendicular bisector line of pp′ separates CH(Q) from p. This contradicts our assumption that p is inside CH(Q) and proves that p is a skyline point.

Theorem 10 enables our SSQ algorithms to efficiently retrieve a large subset of skyline points only by examining them against the query points. This theorem is the basis of our proposed SSQ algorithms in Section B.4. The next theorem improves the application of Theorem 10 by proving that even some of the query points have no impact on the final spatial skyline; hence, our SSQ algorithms can safely ignore them. We first prove the following lemma:

Lemma 11. Given two query sets Q′ ⊂ Q, if a point p ∈ P is a skyline point with respect to Q′, then p is also a skyline point with respect to Q.

Proof.
As p is a skyline point with respect to Q′, for any point p′ ∈ P there is a query point q_i ∈ Q′ for which we have D(p, q_i) ≤ D(p′, q_i) (see Equation B.1). As Q′ is a subset of Q, we have q_i ∈ Q. Therefore, according to Equation B.1, the point p is a skyline point with respect to Q.

Lemma 11 holds for both general and spatial skylines. In the general case, if a point p is in the skyline considering only a subset of its coordinates (i.e., attributes), it remains in the final skyline when all coordinates of p are considered.

Theorem 12. The set of skyline points of P does not depend on any non-convex query point q ∈ Q.

Proof. Assume that the query point q is not a convex point (i.e., q ∉ CH_v(Q)) (see Figure B.4b). Consider the set Q′ = Q − {q}. We prove that the spatial skylines of Q and Q′ are equal (i.e., S(Q′) = S(Q)). First, assume that p is a skyline point with respect to Q′ (i.e., p ∈ S(Q′)). According to Lemma 11, as we have Q′ ⊂ Q, p is a skyline point with respect to Q. Now, assume that we have p ∈ S(Q). We show that p ∈ S(Q′). The proof is by contradiction. Assume that p is not in S(Q′). Then there is a point such as p′ ∈ P that spatially dominates p with respect to Q′. By definition, the query points in Q′ and the point p are on different sides of the perpendicular bisector line of the line segment pp′; Q′ is on the same side as p′. Therefore, the convex hull of Q′, CH(Q′), and p are separated by the bisector line. Moreover, as q is not a convex point of Q, we have CH(Q) = CH(Q′). Hence, CH(Q) is also separated from p by the bisector line. That is, p′ spatially dominates p with respect to both Q′ and Q, and so we get p ∉ S(Q), which contradicts our assumption. Therefore, we have shown that p ∈ S(Q′). Combining all the above, we have proved that S(Q′) = S(Q), and hence the inclusion/exclusion of the single non-convex query point q in/from Q does not change the set of skyline points of P.
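In practice, Theorem 12 means Q can be replaced by CH_v(Q) before any dominance checks. A standard monotone-chain convex hull, shown here as an illustration (not the dissertation's implementation), performs the pruning:

```python
# Theorem 12 in practice: query points interior to CH(Q) are pruned before
# any dominance checks, using a monotone-chain convex hull.

def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull_vertices(points):
    """Vertices of the convex hull in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

# Four query points; (2, 1) lies inside the triangle formed by the others.
Q = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0), (2.0, 1.0)]
CHv = convex_hull_vertices(Q)
print(len(CHv))  # 3: the interior query point is pruned
```

Only the three hull vertices take part in subsequent distance computations, which is exactly the |CH_v(Q)| < |Q| saving exploited by the algorithms in Section B.4.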
To illustrate, consider the point q inside the convex hull of the query points shown in Figure B.4b. Theorem 12 implies that both the dominance and dominator regions of any point p are independent of q. The intuition here is that the circle C(q, p) is completely inside the union of the circles C(q_i, p) for q_i ∈ CH_v(Q). With the result of Theorem 12, we reduce the time complexity of our SSQ algorithms by disregarding the distance computation operations against the non-convex query points such as q (see Section B.4). Finally, the last theorem specifies those skyline points which are identified by examining only the data points in a limited local proximity around them.

Theorem 13. Any point p ∈ P whose Voronoi cell V(p) intersects with the boundary of the convex hull of Q is a skyline point.

Proof. The proof is by contradiction. Assume that the Voronoi cell of p intersects with CH(Q) but p is not a skyline point (i.e., p ∉ S(Q)) (see Figure B.4c). Hence, a point such as p′ ∈ P spatially dominates p with respect to Q. Therefore, the perpendicular bisector line of the line segment pp′ separates both p′ and Q from p. That is, p and the convex hull of Q are on different sides of the bisector line. As V(p) intersects with CH(Q), the intersection of V(p) and CH(Q) is also separated from p by the bisector line. This means that the points in this intersection region are closer to p′ than to p. That is, there are some points such as x inside V(p) for which we have D(x, p′) < D(x, p). This inequality contradicts the definition of V(p) given in Equation 2.1, which states that any point in V(p) is closer to p than to any other point in the set P. Thus, p is a skyline point with respect to Q.

B.4 Solutions

In this section, we propose two algorithms to solve the SSQ problem.
Both algorithms are empowered by the foundation established in Section B.3; they utilize Lemma 9 and Theorems 10 and 13 to eliminate unnecessary dominance checks for the definite skyline points, and Theorem 12 to prune some of the query points.

B.4.1 B²S²: Branch-and-Bound Spatial Skyline Algorithm

Our B²S² algorithm is an improved customization of the original BBS algorithm [PTFS05] for the spatial skyline. Similar to [PTFS05], we assume that the data points are indexed by a data-partitioning method such as an R-tree. For each data point p, let mindist(p, A) be the sum of the distances between p and the points in the set A (i.e., Σ_{q∈A} D(p, q)). Likewise, we define mindist(e, A) as the sum of the minimum distances between the rectangle e and the points of A (i.e., Σ_{q∈A} mindist(e, q)). Figure B.5 shows the pseudo-code of B²S². We describe the algorithm using the set of data points P = {p_1, ..., p_13} and query points Q = {q_1, ..., q_4} shown in Figure B.6. Each intermediate entry e_i in the corresponding R-tree represents the minimum bounding box of the node N_i. B²S² starts by computing the convex hull of Q and determines the set of its vertices CH_v(Q) (e.g., CH_v(Q) = {q_1, q_2, q_3}). Subsequently, B²S² begins to traverse the R-tree from its root R down to the leaves.

Algorithm B²S²(set Q)
01. compute the convex hull CH(Q);
02. set S(Q) = {};
03. box B = MBR(R);
04. minheap H = {(R, 0)};
05. while H is not empty
06.   remove first entry e from H;
07.   if e does not intersect with B, discard e;
08.   if e is inside CH(Q) or
09.      e is not dominated by any point in S(Q)
10.     if e is a data point p
11.       add p to S(Q);
12.       B = B ∩ MBR(SR(p, Q));
13.     else // e is an intermediate node
14.       for each child node e′ of e
15.         if e′ does not intersect with B, discard e′;
16.         if e′ is inside CH(Q) or
17.            e′ is not dominated by any point in S(Q)
18.           add (e′, mindist(e′, CH_v(Q))) to H;
19. return S(Q);

Figure B.5: Pseudo-code of the B²S² algorithm
It maintains a minheap H sorted on the mindist values of the visited nodes. Table B.2 shows the contents of H at each step. First, B²S² inserts (e_6, mindist(e_6, CH_v(Q))) and (e_7, mindist(e_7, CH_v(Q))), corresponding to the entries of the root R, into H. Then e_6, with the minimum mindist, is removed from H and its children e_1, e_2, and e_3, together with their mindist values, are inserted into H. Similarly, e_1 is removed and the children of e_1 are added to H. In the next iteration, the first entry p_2 is inside CH(Q) and hence is added to S(Q) as the first skyline point found. Once the first skyline point is found (i.e., S(Q) ≠ ∅), any entry e must be checked for dominance before insertion into and after removal from H. If e is dominated by a skyline point p, then B²S² discards e. This dominance check is done against all points p in S(Q); e is dominated by p if e is completely inside the dominance region of p (see Section B.2.2 for the definition), that is, if e does not intersect with any circle C(q, p) for q ∈ CH_v(Q). To decrease the use of this costly test, B²S² first applies two easier tests: 1) If the entry e does not intersect with the MBR of the union of all circles C(q, p) (termed SR(p, Q)), then p dominates e. Similarly, if e does not intersect with the intersection of all such MBRs, each corresponding to a current skyline point in S(Q), then a point in S(Q) dominates e. 2) If e is completely inside the convex hull CH(Q), e cannot be dominated (see Theorem 10). If e does not pass either of the above tests, B²S² needs to check e against the entire S(Q). B²S² maintains the rectangle B corresponding to the intersection area described above and updates it when a new skyline point is found (see the dotted box in Figure B.6). Returning to our example, B²S² removes p_3 from H. As p_3 is inside B and is not dominated by p_2, it is added to the skyline points and the rectangle B is updated accordingly (lines 11-12 in Figure B.5).
The next step examines e_2, which is not dominated by the current skyline points p_2 and p_3. Among e_2's children, only p_5 is inserted into H; p_4 is outside B (i.e., dominated by p_3) and hence is discarded. Then B²S² removes e_7 and extracts its children e_4 and e_5, as e_7 is not dominated. Entry e_4 does not intersect with B and e_5 is dominated by p_2, so B²S² discards both entries. At this step, p_5 is removed and added to the skyline points. The remaining steps discard both dominated entries p_1 and e_3. Finally, the points p_2, p_3, and p_5 constitute the final spatial skyline.

B.4.1.1 B²S² Correctness

The correctness of B²S² follows from that of BBS, as both algorithms use the same approach to explore the solution space. B²S² benefits from all the properties of BBS. For instance, B²S² can also utilize any arbitrary monotone function instead of mindist() to sort the entries of its heap. Consequently, B²S² is also able to employ any monotone preference function to support ranked skyline queries [PTFS05].

Figure B.6: Points indexed by an R-tree
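The two mindist variants defined in Section B.4.1, the sum of exact distances for a data point and the sum of minimum distances for a rectangular R-tree entry, can be sketched as follows (axis-aligned rectangles assumed; the query set is illustrative):

```python
# mindist(p, A): sum of exact distances from point p to the points of A.
# mindist(e, A): sum of minimum distances from rectangle e to the points of A,
# where e = (xmin, ymin, xmax, ymax) is an axis-aligned MBR.
import math

def mindist_point(p, A):
    return sum(math.hypot(p[0] - q[0], p[1] - q[1]) for q in A)

def mindist_rect(e, A):
    xmin, ymin, xmax, ymax = e
    total = 0.0
    for qx, qy in A:
        dx = max(xmin - qx, 0.0, qx - xmax)  # 0 if q is within the x-range
        dy = max(ymin - qy, 0.0, qy - ymax)
        total += math.hypot(dx, dy)
    return total

A = [(0.0, 0.0), (4.0, 0.0)]
print(mindist_point((2.0, 0.0), A))          # 2 + 2 = 4.0
print(mindist_rect((1.0, 1.0, 3.0, 2.0), A))
```

Since mindist of a rectangle lower-bounds the mindist of every point it encloses, these functions give the monotone heap keys B²S² relies on.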
3) B 2 S 2 does not require any dominance check for points insideCH(Q) (Theorem 10). Moreover, B 2 S 2 utilizes Theorem 13 to precede its expensive dominance checks with an intersection check between the V oronoi cellV(p) andCH(Q) when the later is cheaper (e.g., whenjS(Q)j andjQj are large comparing toCH v (Q) and the number of vertices ofV(p)). 90 step heap contents (entrye:mindist(:;:)) S(Q) 1 (e 6 :8),(e 7 :52) ? 2 (e 1 :18),(e 2 :49),(e 7 :52),(e 3 :115) ? 3 (p 2 :38),(p 3 :42),(e 2 :49),(e 7 :52), ? (p 1 :70),(e 3 :115) 4 (e 7 :52),(p 5 :53),(p 1 :70),(e 3 :115) fp 2 ;p 3 g 5 (p 5 :53),(p 1 :70),(e 3 :115) fp 2 ;p 3 g 6 (p 1 :70),(e 3 :115) fp 2 ;p 3 ;p 5 g Table B.2: B 2 S 2 for the example of Figure B.6 B.4.2 VS 2 : Voronoi-based Spatial Skyline Algorithm According to Theorems 10 and 13, the points whose V oronoi cells are inside or intersect with the convex hull of the query points are skyline points. Therefore, our VS 2 algorithm utilizes the V oronoi diagram (i.e., the corresponding Delaunay graph) of the data points to answer an SSQ problem. We assume that the R-tree on the data points does not exist. Instead, the V oronoi neighbors of each data point is known. To be specific, the adjacency list of the Delaunay graph of the points inP is stored in a flat file. To preserve locality, points are organized in pages according to their Hilbert values. VS 2 starts traversing the Delaunay graph from a data point (e.g.,NN(q 1 ), the clos- est point toq 1 ). The traversal order is determined by the monotone functionmindist(p; CH v (Q)). VS 2 maintains two different lists to track the traversal; Visited which con- tains all visited points andExtracted which contains those visited points whose V oronoi neighbors have also been visited (extracted). Similar to B 2 S 2 , VS 2 also maintains the rectangleB which includes all candidate skyline points. Figure B.7 shows the pseudo-code of VS 2 . 
It first computes the convex hull of the query points and initializes all the data structures. Then the closest point to one of the query points (e.g., p = NN(q_1)) and its corresponding entry (p, mindist(p, CH_v(Q))) are added to Visited and the minheap H, respectively. Then VS² iteratively examines (p, key), the top entry of H. If all of p's Voronoi neighbors have already been visited (i.e., p ∈ Extracted), then (p, key) is removed from H. Subsequently, if p is not dominated by the current S(Q), then p is added to S(Q) as a skyline point. If p ∉ Extracted, p is added to Extracted. Moreover, if at least one of the skyline points identified so far is a Voronoi neighbor of p, then VS² adds each unvisited Voronoi neighbor p′ of p to Visited and H iff a) p′ is inside the current rectangle B or b) p′'s Voronoi cell V(p′) intersects with B. Subsequently, B is updated accordingly and the entry (p′, mindist(p′, CH_v(Q))) is added to the heap H. Finally, when H becomes empty, VS² returns S(Q) as the result.

Algorithm VS²(set Q)
01. compute the convex hull CH(Q);
02. set S(Q) = {};
03. minheap H = {(NN(q_1), mindist(NN(q_1), CH_v(Q)))};
04. set Visited = {NN(q_1)}; set Extracted = {};
05. box B = MBR(SR(NN(q_1), Q));
06. while H is not empty
07.   (p, key) = first entry of H;
08.   if p ∈ Extracted
09.     remove (p, key) from H;
10.     if p is inside CH(Q) or
11.        p is not dominated by S(Q)
12.       add p to S(Q);
13.       B = B ∩ MBR(SR(p, Q));
14.   else
15.     add p to Extracted;
16.     if S(Q) = ∅ or a Voronoi neighbor of p is in S(Q)
17.       for each Voronoi neighbor p′ of p
18.         if p′ ∈ Visited, discard p′;
19.         if p′ is inside B or V(p′) intersects with B
20.           add p′ to Visited;
21.           add (p′, mindist(p′, CH_v(Q))) to H;
22. return S(Q);

Figure B.7: Pseudo-code of the VS² algorithm

We describe VS² using the example shown in Figure B.8. The three query points q_i and the data points are shown as white and black dots, respectively. Table B.3 shows the contents of the heap H. First, VS² adds (p_1, mindist(p_1, CH_v(Q))) to H and marks p_1 as visited.
B is also initialized to the dotted box in Figure B.8. The first iteration visits p_3, p_4, p_5, p_6, and p_8 as p_1's Voronoi neighbors and adds their corresponding entries to H. It also adds p_1 to the Extracted list. The second iteration removes p_1 from H, as p_1 and its neighbors have already been visited. It also adds p_1 to S(Q), as p_1 is inside CH(Q). The third iteration adds p_9, p_10, and p_11, the only unvisited neighbors of p_3 which are inside B, to H. The next two iterations immediately remove p_3 and then p_6 from H and add them to S(Q), as their neighbors have already been visited. The subsequent iterations add p_5, p_4, and p_2 to the skyline and eliminate the remaining entries of H as they are all dominated.

Figure B.8: Example points for VS²

B.4.2.1 VS² Correctness

Lemma 14. Given a query set Q, VS² first adds the data points inside CH(Q) to the skyline S(Q). All other points are examined and added to S(Q) in ascending order of their mindist() values.

Proof. We first show that p_0, the first data point added to S(Q), is inside CH(Q) if there is at least one point inside CH(Q). The traversal of the Delaunay graph, once started from the point NN(q_1), moves towards the point with the smaller mindist. The reason is that the mindist of the top entry of H is always decreasing before p_0 is added to S(Q). It is clear that the mindist of any point inside CH(Q) is smaller than those of the points outside CH(Q). Hence, after a few iterations, the top entry of H becomes a point inside CH(Q). Then, during the next iterations, the point whose mindist is less than those of all of its neighbors is added as the first point to S(Q). As mindist is monotone with respect to the distance to each query point q_i, this happens only inside CH(Q). Therefore, p_0 is inside CH(Q). If there is no point inside CH(Q), then V(p_0) intersects with CH(Q).
Once p0 is added, all other points inside CH(Q) are added to S(Q), as their mindist values are smaller than those of the outside points. Finally, as the pseudo-code of VS2 shows, subsequent iterations always examine and add to S(Q) the top entry of H, which has the minimum mindist. Hence, the non-dominated data points outside CH(Q) are added to S(Q) in ascending order of their mindist.

Lemma 15. Given a query set Q, VS2 identifies all spatial skyline points with respect to Q.

Proof. We provide only a sketch of the proof. VS2 examines all the data points except those for which either the Voronoi cell is outside the rectangle B, or none of the already-visited Voronoi neighbors has a skyline neighbor. The first group of points are inside the dominator region of one of the visited points and hence must be discarded. The second group of points are obviously outside CH(Q), and the Voronoi cell of none of their Voronoi neighbors intersects with CH(Q). These neighbors are all dominated. The Voronoi neighbors of these neighbors are also dominated and hence do not intersect with CH(Q). The two layers of dominated Voronoi neighbors around such a point p guarantee that p is also dominated by a point which dominates one of these neighbors.

Similar to B2S2, VS2 utilizes our foundation described in Section B.3 to efficiently reduce the number of dominance checks. The number of data points visited by VS2 is much less than the number inside rectangle B. The reason is that VS2 usually stops its traversal before reaching the boundaries of B.

step  heap contents (point p : mindist)                                    S(Q)
1     (p1:24)                                                             ∅
2     (p1:24), (p3:28), (p6:32), (p5:34), (p4:38), (p8:44)                ∅
4     (p3:28), (p6:32), (p5:34), (p4:38), (p8:44), (p9:49),               {p1}
      (p10:49), (p11:63)
6     (p6:32), (p5:34), (p4:38), (p8:44), (p7:46), (p9:49),               {p1, p3}
      (p10:49), (p11:63)
8     (p5:34), (p4:38), (p8:44), (p7:46), (p9:49), (p10:49), (p11:63)     {p1, p3, p6}
...   ...                                                                 {p1, p3, p6, p5, p4, p2}

Table B.3: VS2 for the example of Figure B.8

The time complexity of VS2 is O(|S(Q)|² |CH_v(Q)| + Φ(|P|)), where Φ(|P|) is the complexity of finding the point from which VS2 starts visiting inside CH(Q) (e.g., NN(q1)). If VS2 utilizes an index structure, then Φ(|P|) is O(log |P|). Otherwise, as the Delaunay graph of P is connected, VS2 starts from a random point and keeps visiting the neighboring point closest to q1 until it reaches NN(q1). For a uniformly distributed P in a square-shaped area, this takes Φ(|P|) = O(√(|P|/2)) steps.

B.5 Continuous Spatial Skyline Query

The algorithms proposed in the previous sections are appropriate for applications where the query points in Q represent the locations of stationary objects. Hence, the spatial skyline of the data points P with respect to Q, once found, does not change. Now consider the scenario in which each query point qi represents the location of a moving object in R². All moving objects frequently report their latest locations. Consequently, we gradually receive the new location of each object as a spatial data stream. The arrival of each new location causes an update to a single point of Q. We intend to maintain the spatial skyline of P with respect to the set of latest locations of all objects (i.e., the current query set Q).

Upon an update to Q, one can always rerun an SSQ algorithm to return the new set of skyline points. However, this approach is expensive, especially with R-tree-based algorithms such as B2S2 and BBS, where the entire R-tree must be traversed per update. These algorithms cannot tolerate updates for a fast-rate data stream. A smarter solution is to update the set of previously found skyline points when a query point q moves. With B2S2 and BBS, this still requires a complete traversal of the R-tree and an examination of all candidate data points.
However, a more efficient approach must examine only those data points which may change the spatial skyline given q's new location. In the following, we first identify the subset of query points on which the dominance of a data point p depends. Then, we prove Lemma 17, which states that q affects the dominance of p iff p is in the line-of-sight of q; that is, the line segment pq does not intersect with the interior of CH(Q). We use the term visible region of q to refer to the locus of all such points p. Finally, in Section B.5.1 we identify the points which may change the skyline for different moves of q and propose our algorithm for continuous SSQ. The algorithm utilizes Lemma 17 to choose the data points that must be examined when the location of q changes.

First, we find the query points which may change the dominance of a data point outside CH(Q). Consider the data point p and the query points Q = {q1, ..., q4} in Figure B.9. The two tangent lines from p to CH(Q) divide the vertices in CH_v(Q) into two disjoint sets CH_v^+(Q) = {q1, q2, q3} and CH_v^-(Q) = {q4}. The sets CH_v^+(Q) and CH_v^-(Q) include the vertices on the chains of CH(Q) closer to and farther from p, respectively. The following lemma proves that only the query points on the chain closer to p can change the dominance of p.

[Figure B.9: Visible region of q1 ∈ CH(Q)]

Lemma 16. Given a data point p and a query set Q, the dominance of p only depends on the query points in CH_v^+(Q).

Proof. Suppose a point p′ dominates p with respect to CH_v^+(Q). It is clear that the bisector line of pp′, which separates p and CH_v^+(Q), intersects with the tangent lines from p to CH(Q). Hence, it also separates the entire CH(Q) from p. Therefore, p is dominated with respect to Q too. The proof of the reverse case is similar. Therefore, p is (not) dominated with respect to Q iff p is (not) dominated with respect to CH_v^+(Q).
Hence, the dominance of p depends only on CH_v^+(Q) (not CH_v^-(Q)).

Lemma 16 states that the dominator region of p does not depend on q ∉ CH_v^+(Q). That is, this region is not changed by the circle C(q, p) if q ∈ CH_v^-(Q). Figure B.9 illustrates this case, where the dotted circle C(q4, p) does not contribute to the dominator region of p.

Utilizing Lemma 16, we can easily find the locus of all points whose dominance depends on the location of a given query point. Consider q1 in the example shown in Figure B.9. The lines L1 and L2 pass through the line segments q1q2 and q1q3, respectively (q2 and q3 are the immediate neighbors of q1 on CH(Q)). Each line divides the space into two half-planes. The figure illustrates the half-planes using arrows. The union of the half-planes which do not contain CH(Q) (the visible region of q1) is the locus of the data points whose dominance depends on q1. The reason is that for any point outside this region, CH_v^+(Q) does not contain q1 and hence, according to Lemma 16, its dominance does not depend on q1. The general statement of the above result follows.

Lemma 17. The locus of data points whose dominance depends on q ∈ CH_v(Q) is the visible region of q.

[Figure B.10: Change patterns of the convex hull of Q when the location of q changes to q′ (patterns a-f)]

B.5.1 Voronoi-based Continuous SSQ (VCS2)

Assume that the location of the query point q changes. We use q′ and Q′ to refer to the new location of q and the new set of query locations, respectively (Q′ = Q ∪ {q′} − {q}). In this section, we propose our Voronoi-based Continuous Spatial Skyline algorithm (VCS2). We show how VCS2 uses the visible regions of q and q′ to partition the space into several regions, for each of which a specific update must be applied to the old skyline. VCS2 is more efficient than the naive approach as it avoids excessive unnecessary dominance checks.
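The half-plane construction behind Lemma 17 reduces to two orientation tests: p lies in the visible region of hull vertex q iff p is strictly on the outer side of the supporting line of at least one of the two hull edges incident to q. A minimal sketch under the assumption that the hull is given in counter-clockwise order (the helper names are ours):

```python
def cross(o, a, b):
    """Twice the signed area of triangle (o, a, b); > 0 iff b is left of ray o->a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def in_visible_region(p, hull, i):
    """True iff p can 'see' hull[i]: p lies strictly outside the supporting
    line of at least one of the two edges incident to hull[i]. For a CCW
    hull the interior is to the left of every directed edge, so 'outside'
    means a negative cross product."""
    prev_v, v, next_v = hull[i - 1], hull[i], hull[(i + 1) % len(hull)]
    return cross(prev_v, v, p) < 0 or cross(v, next_v, p) < 0

# Square hull in CCW order; the vertex of interest is hull[0] = (0, 0).
hull = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
print(in_visible_region((-1.0, -1.0), hull, 0))  # → True  (sees the corner)
print(in_visible_region((3.0, 3.0), hull, 0))    # → False (hull blocks the view)
```

This is the test VCS2 needs per data point to decide whether a moved vertex q can affect that point's dominance.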
The algorithm traverses the Delaunay graph of the data points similarly to VS2. It first computes CH(Q′), the convex hull of the latest query set. Then, it compares CH(Q′) with the old convex hull CH(Q). Depending on how CH(Q′) differs from CH(Q), VCS2 decides to either traverse only specific portions of the graph and update the old skyline S(Q), or rerun VS2 and generate a new one. In the former case, it tries to examine only those points on the graph whose dominance changes because of q's movement.

[Figure B.11: Example points for VCS2]

We now describe the situations where VCS2 updates the skyline based on the change in CH(Q′). Later, we show how the update is applied. We first enumerate the different ways that the convex hull of the query points might change when a query point q moves. Figures B.10a-B.10f show only six of the possible change patterns. Each figure illustrates a case where q's location changes to q′ (i.e., q moves to q′). The grey and the thick-edged polygons show CH(Q) and CH(Q′), respectively. In general, identifying the appropriate pattern by comparing CH(Q) and CH(Q′) is not a straightforward task. Therefore, VCS2 only tries to recognize the specific simple patterns I-V and subsequently updates the skyline accordingly, as we explain later. For all other change patterns, such as pattern VI, it reruns VS2. For each of patterns I-V, specific portions of the Delaunay graph must be traversed by VCS2:

Pattern I) CH(Q) and CH(Q′) are equal, as both q and q′ are inside the convex hull. According to Theorem 12, the skyline does not change and no graph traversal is required.

Patterns II-V) The visible regions of q ∈ CH_v(Q) and q′ ∈ CH_v(Q′) together partition the space into six regions (seven regions for pattern V). The intersection region of CH(Q) and CH(Q′) contains data points which are in the skyline with respect to both Q and Q′.
Therefore, VCS2 does not traverse this portion of the Delaunay graph. The region labeled "++" includes the points inside CH(Q′) and outside CH(Q). According to Theorem 10, any point in this region is a skyline point with respect to Q′ and hence must be added to the skyline. The points in the regions labeled "+" might be skyline points and must be examined; the intuition here is that their dominator regions have become smaller because of q's movement. The regions labeled "−" contain the points whose dominator regions have expanded and which hence might be deleted from the old skyline. The points in the regions labeled "×" must be examined for inclusion in or exclusion from the skyline, as their dominator regions have changed. Finally, neither q nor q′ affects the dominance of the points in the unlabeled white region; this region is outside the visible regions of both q and q′ (Lemma 17). For each of patterns I-V, only the data points in the union of the labeled regions might change the skyline. We use the term candidate region to refer to this union.

Once VCS2 identifies any of patterns I-V, it tries to update the skyline. The algorithm first assigns the old S(Q) to S(Q′). For pattern I, VCS2 returns the same old skyline, as it is still valid. For patterns II-V, VCS2 starts traversing the Delaunay graph from the closest data point to q′. Similarly to VS2, it initializes the rectangle B and traverses the points inside the intersection of the candidate region and B, ordered by their mindist values. At each iteration, if the point with the minimum mindist is dominated, it is deleted from S(Q′); otherwise, it is added to S(Q′). At the end, VCS2 evaluates the skyline S(Q′) and removes all the points which are dominated by another point in S(Q′). The reason behind this final check is to examine the old skyline points which are not in the traversal range of VCS2.
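Pattern I, the only case requiring no traversal at all, comes down to a plain point-in-convex-polygon test: if both the old location q and the new location q′ lie inside the hull of the remaining query points, then CH(Q) = CH(Q′) and, by Theorem 12, the old skyline stays valid. A small sketch under the same CCW-hull assumption as before (helper names are ours):

```python
def cross(o, a, b):
    # Twice the signed area of triangle (o, a, b).
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def inside_convex(p, hull):
    """True iff p lies inside (or on the boundary of) a CCW convex hull."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))

def is_pattern_one(q_old, q_new, other_hull):
    """Pattern I of Figure B.10: the moved query point stays interior,
    so the hull -- and hence the skyline -- is unchanged."""
    return inside_convex(q_old, other_hull) and inside_convex(q_new, other_hull)

hull = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]  # hull of Q \ {q}
print(is_pattern_one((1.0, 1.0), (2.0, 3.0), hull))  # → True: no update needed
print(is_pattern_one((1.0, 1.0), (5.0, 5.0), hull))  # → False: hull changes
```

Detecting this cheap case first lets a continuous-query engine skip all dominance checks for interior-to-interior moves, which the experiments below show are the common case as |Q| grows.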
We show the effectiveness of VCS2 using the example of Figure B.11, where q′1 is the new location of q1 from Figure B.8. First, VCS2 computes the convex hull of Q′ = {q′1, q2, q3} and compares it with CH(Q). The change matches pattern V in Figure B.10. Therefore, the update to S(Q) involves both insertions and deletions. Then, VCS2 initializes S(Q′) to the old S(Q) resulting from applying VS2 in Section B.4.2, and initializes the rectangle B accordingly. It also adds (p8, mindist(p8, CH_v(Q′))) into H, where p8 is the closest point to q′1, the new location of q1. Table B.4 shows the contents of the minheap H. The second iteration extracts p8 and adds all of its Voronoi neighbors except p1 into H, as p1 is not in the candidate region of pattern V. Similarly, p3 is extracted and all its neighbors except p4 are added into H. Next, as p3 is not dominated, it remains in S(Q′). The next two iterations add p8 and p10 to S(Q′). The sixth iteration visits p6, the only unvisited Voronoi neighbor of p7 in B; p6 is subsequently removed from S(Q′) as it is dominated by p1 ∈ S(Q′). The final four iterations of VCS2 eliminate the remaining points in H as they are all dominated. No point in the final skyline set is dominated, and hence VCS2 returns S(Q′) as the result.

The above example shows how VCS2 avoids dominance checks for points such as p4, which is outside the candidate region of the identified change pattern, and p1, which is inside the convex hull of the query points. This demonstrates VCS2's superiority over the naive approach.

B.5.1.1 VCS2 Correctness

The correctness of VCS2 follows from that of VS2. The intuition is that VCS2 also examines all the candidate skyline points. It also examines those old skyline points that are now dominated with respect to the new query set Q′ and removes them from the skyline.
step  heap contents                                               S(Q′)
1     (p8:39)                                                     {p1, p3, p6, p5, p4, p2}
2     (p3:32), (p8:39), (p10:42), (p7:50), (p14:64), (p15:66)     {p1, p3, p6, p5, p4, p2}
3     (p3:32), (p8:39), (p10:42), (p7:50), (p9:51), (p12:52),     {p1, p3, p6, p5, p4, p2}
      (p11:60), (p14:64), (p15:66)
4     (p8:39), (p10:42), (p7:50), (p9:51), (p12:52), (p11:60),    {p1, p3, p6, p5, p4, p2}
      (p14:64), (p15:66)
5     (p10:42), (p7:50), (p9:51), (p12:52), (p11:60),             {p1, p3, p6, p5, p4, p2, p8}
      (p14:64), (p15:66)
6     (p7:50), (p9:51), (p12:52), (p11:60), (p14:64), (p15:66)    {p1, p3, p6, p5, p4, p2, p8, p10}
7     (p6:41), (p7:50), (p9:51), (p12:52), (p11:60),              {p1, p3, p6, p5, p4, p2, p8, p10}
      (p14:64), (p15:66)
8     (p7:50), (p9:51), (p12:52), (p11:60), (p14:64), (p15:66)    {p1, p3, p6, p5, p4, p2, p8, p10}
9     (p9:51), (p12:52), (p11:60), (p14:64), (p15:66)             {p1, p3, p5, p4, p2, p8, p10}
10    (p12:52), (p11:60), (p14:64), (p15:66)                      {p1, p3, p5, p4, p2, p8, p10}
11    (p12:52), (p11:60), (p14:64), (p15:66), (p13:66)            {p1, p3, p5, p4, p2, p8, p10}

Table B.4: VCS2 for the example of Figure B.11

B.6 Non-spatial Attributes

One might be interested in finding the skyline with respect to both the static non-spatial attributes of the data points and their distances to the points of Q. For instance, the best restaurant in LA might be dominated in terms of distance to our team members but still be in the skyline considering its rating. We show how B2S2, VS2, and VCS2 can support this variation of SSQ.

Let A be a subset of the non-spatial attributes of the data points of P. Assume that S(A) is the set of skyline points considering only the non-spatial attributes in A. Also, let S(A,Q) be the skyline when both the attributes in A and the distances to the query points in Q are considered. Consequently, we have S(A) ⊆ S(A,Q) and S(Q) ⊆ S(A,Q). Our algorithms can easily be changed to find S(A,Q) as follows.
1) We first use a general skyline algorithm to find S(A). Notice that this is a batch, one-time computation independent of the query (i.e., Q). 2) We modify each of our algorithms such that the dominance check for each point inside the search region and outside CH(Q) considers both its non-spatial attributes and its distances to the points of CH_v(Q). 3) The search region (i.e., rectangle B) is extended to examine all possible candidate points. The following lemma specifies the limits of the new search region.

Lemma 18. Any point p farther than all points in S(A) from all qi ∈ Q cannot be a skyline point with respect to the attributes in A and the query points in Q. That is,

∀p′ ∈ S(A), ∀qi ∈ Q: D(p, qi) ≥ D(p′, qi) ⇒ p ∉ S(A,Q)

Utilizing Lemma 18, we change the rectangle B to the MBR of the points violating the above condition. This is straightforward, as the region is the union of the circles C(qi, p) where p ∈ S(A). Therefore, our B2S2, VS2, and VCS2 algorithms answer SSQs when mixed with non-spatial attributes.

B.7 Performance Evaluation

We conducted several experiments to evaluate the performance of our proposed approaches. In real-world applications with a small number of original attributes, the traditional BBS approach outperforms algorithms such as BNL [BKS01]. Hence, we only compared our algorithms with BBS. First, we compared both B2S2 and VS2 with the BBS approach with respect to: 1) overall query response time, and 2) number of dominance checks. Moreover, we compared the disk I/O accesses incurred by the underlying R-tree index structures used by B2S2 with those of BBS. We evaluated all approaches by investigating the effect of the following parameters on their performance: 1) the number of query points (i.e., |Q|), 2) the area covered by the MBR of Q, 3) the cardinality of the dataset, and 4) the density of the data points. We also evaluated the performance of VCS2 using a synthetic dataset of moving objects.
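The extended search region of step 3 above can be computed directly from Lemma 18: a point remains a candidate only if it is closer than some p′ ∈ S(A) to some qi, i.e., only if it lies in the union of the circles C(qi, p′), so the new B is the MBR of that union. A sketch under the Euclidean-distance assumption (the function and variable names are ours):

```python
import math

def extended_search_mbr(queries, skyline_a):
    """MBR of the union of circles centered at each query point q_i with
    radius max over p' in S(A) of D(q_i, p'). Any point outside this box
    satisfies the condition of Lemma 18 and can be pruned."""
    xs, ys = [], []
    for q in queries:
        # Largest circle at q that any p' in S(A) contributes.
        r = max(math.dist(q, p) for p in skyline_a)
        xs += [q[0] - r, q[0] + r]
        ys += [q[1] - r, q[1] + r]
    return (min(xs), min(ys)), (max(xs), max(ys))

Q = [(0.0, 0.0), (4.0, 0.0)]
S_A = [(1.0, 1.0)]  # hypothetical non-spatial skyline points
lo, hi = extended_search_mbr(Q, S_A)
```

The box only ever grows with the radii of the circles, so it safely over-approximates the candidate region; the dominance checks inside it still do the exact filtering.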
For our experiments, we used a real-world dataset obtained from the U.S. Geological Survey (USGS, http://geonames.usgs.gov/). The USGS dataset consists of 950,000 locations of eight different business types in the entire country. Table B.5 shows the size of each business type as a fraction of the entire dataset.

Points          Size     Points              Size
1) Hospital     0.56%    5) Church           13%
2) Building     1.6%     6) School           15%
3) Summit       7%       7) Populated place  18%
4) Cemetery     12%      8) Institution      34%

Table B.5: USGS dataset

This dataset is indexed by an R*-tree [BKSS90] with a page size of 1K bytes and a maximum of 50 entries in each node (the node capacity). The index structure is built on the original attributes of the data points (i.e., latitude and longitude) and is utilized by both the BBS and B2S2 approaches. VS2 and VCS2 use a pre-built Delaunay graph of the entire dataset. We also used the Graham Scan algorithm [dBvKOS00] for convex hull computation in both VS2 and VCS2. The experiments were performed on a DELL Precision 470 with a Xeon 3.2 GHz processor and 3GB of RAM. We ran 1000 SSQ queries using randomly selected query points and report the average of the results.

In the first set of experiments, we varied the number of query points and studied the performance of our proposed algorithms. We set the maximum MBR(Q) to 0.3% of the entire dataset. Figure B.12a depicts the query response time of B2S2, B2S2*, VS2, and BBS, where B2S2* is a variation of B2S2 in which we use mindist(e, Q) instead of mindist(e, CH_v(Q)) to sort the heap entries. Therefore, both B2S2* and BBS find the skyline points in the same order. Notice that B2S2 can employ any monotone function. As Figure B.12a shows, the superior algorithm VS2 significantly outperforms BBS by a wide margin (3-6 times better in most cases). Given four query points, VS2 finds the skyline in only 0.12 seconds while BBS requires 0.78 seconds for the same query.
While the figure illustrates the results for SSQs with up to 10 query points, the trend shows that VS2 performs increasingly better than BBS as the number of query points increases.

[Figure B.12: Query cost vs. number of query points and the area covered by Q — a) Time, b) Dominance Checks, c) I/O for varying |Q|; d) Time, e) Dominance Checks, f) I/O for varying MBR(Q)]

The figure also indicates that while both B2S2 and B2S2* outperform BBS, B2S2 is faster. The reason is that while calculating mindist, it computes only the distances to the convex query points of Q instead of the entire Q (on average 6 points where |Q| = 10). Figure B.12b shows the average number of dominance checks performed by each algorithm. As expected, VS2 consistently performs only 65%-75% of the dominance checks performed by BBS. Considering the fact that each of VS2's checks is also faster than BBS's, this explains the superiority of VS2 in terms of performance. The figure shows that both B2S2 and B2S2* require fewer checks than BBS, which demonstrates the efficiency of utilizing the rectangle B to avoid unnecessary checks (see Section B.4.1). Figure B.12c illustrates the number of R*-tree nodes accessed by B2S2, B2S2*, and BBS. As the figure shows, BBS and B2S2* access exactly the same nodes, as they both use a common mindist function during the search.
B2S2's use of a different function results in slightly more node accesses, which has no impact on its overall performance, as shown in Figure B.12b. The reason is that 1) the total number of dominance checks of B2S2 and B2S2* is less than that of BBS (see Figure B.12b), and 2) each check in B2S2 and B2S2* requires distance computations only for the convex query points. Even the extra convex hull computation in our algorithms does not degrade their performance.

The next set of experiments investigates the impact of the closeness of the query points on the performance of each algorithm. We varied the area covered by the MBR of Q from 0.01% to 0.75% of the entire USGS dataset (i.e., approximately 90 to 7K data points in the MBR). Figures B.12d-f depict the average query response time, number of dominance checks, and R*-tree node accesses for all the algorithms, respectively (|Q| = 6). As the query points get farther apart, more points need to be examined by an SSQ algorithm, which degrades its performance. The results show a trend similar to those of the first experiment; VS2 is the superior algorithm. It performs only 33% of BBS's dominance checks, which makes it one order of magnitude faster than BBS even for highly scattered query points (i.e., MBR(Q) = 0.75%). The reason is that as the MBR of Q grows, BBS must examine more R-tree nodes. However, once VS2 enters CH(Q), it performs computation only in a local neighborhood. Likewise, B2S2 demonstrates superiority over BBS.

The third set of experiments focuses on the performance of the algorithms when the density of the data points changes. We used five different point types from Table B.5, with hospitals and institutions as the most sparse and most dense points, respectively. Figures B.13a-c illustrate the results for |Q| = 6 and maximum MBR(Q) = 0.5%. As expected, it always takes more time and I/O to find the skyline for denser data points.
The reason is that for denser data, more skyline points correspond to a query set Q with a fixed-size MBR, and hence all proposed algorithms must examine more points. This experiment also shows that the cardinality of the dataset has the same impact as the density on the performance of all the algorithms.

[Figure B.13: Query cost and continuous SSQ — a) Time, b) Dominance Checks, c) I/O for varying density; d) actions triggered for continuous SSQ]

Our next set of experiments studies the performance of VCS2. We used GSTD [TN00] to generate trajectories of 3-10 moving query points, where MBR(Q) = 0.3%. The movements of the query points obey a uniform distribution and follow a moderate speed. For different query sizes |Q|, we counted the number of moves (out of 1000 moves per query point) for which VCS2 1) identifies movement pattern I and hence does nothing (shown as NOP), 2) detects movement patterns II-V and updates the old skyline (shown as VCS2), or 3) reruns VS2 (shown as VS2). Figure B.13d shows the percentage of movements with the corresponding triggered actions. Surprisingly, the figure shows that, averaged over all query sizes, less than 25% of movements require the entire skyline to be recomputed (i.e., rerunning VS2).

Data (# of points)   Query (# of points)        CPU (sec.)                 # of Dominance Checks
                                                BBS      B2S2    VS2       BBS     B2S2     VS2
Schools (139,523)    Colorado Summits (3,171)   31.4     6.7     6.5       3,745   3,460    1,353
Hospitals (5,314)    Summits (69,498)           22,171   1,379   1,314     10,629  5,726    5,167
Summits (69,498)     Buildings (15,126)         >30,000  2,212   1,427     -       143,222  65,749

Table B.6: Experimental results for extreme scenarios
For three query points, this never happens, as any movement follows one of patterns II, III, or V. The figure also verifies that as the number of query points increases, the chance that a movement changes the skyline decreases. We separately measured the average query response time of VCS2 as only 35% of that of VS2 over all query sizes. Therefore, this experiment justifies the superiority of VCS2 over VS2 for continuous SSQ.

Finally, Table B.6 shows the results of our last set of experiments with some extreme cases: the spatial skyline of 1) all U.S. schools with respect to 3,000 summits in Colorado whose MBR covers about 2.5% of the entire U.S. (large Q), 2) all U.S. hospitals with respect to all U.S. summits (|Q| ≫ |P|), and 3) all U.S. summits with respect to all buildings of the USGS dataset (large Q). Notice that in the last two cases, both the data and query sets cover the same area (i.e., the entire U.S.). Our results show the superiority of both VS2 and B2S2 over BBS. When the size of the query set is significantly larger than that of the data, the overhead of the convex hull computation becomes the dominant factor in the performance of VS2 and B2S2. However, these algorithms still outperform BBS, as even with large query sets the number of convex points is significantly small (e.g., only 19 points for the 69,498 points in the second case). This highly reduces the cost of the dominance checks performed by our algorithms. In contrast, each dominance check in BBS involves a large number of distance computations (2 × 69,498 computations in the second case).

B.8 Summary and Future Directions

We introduced the novel concept of spatial skyline queries. We studied the geometric properties of the solution to these queries. We analytically proved that a set of definite skyline points exists: these are the points whose Voronoi cells are inside or intersect with the convex hull of the query points (see Theorems 10 and 13 in Section B.3).
We also proved that the locations of the query points inside the convex hull of all query points have no effect on the final spatial skyline (Theorem 12). Based on these theoretical findings, we proposed two efficient algorithms for SSQ with static query points. Through extensive experiments, we showed the superiority of our algorithms over the traditional BBS approach. Our VS2 algorithm performs up to 6 times faster than BBS. We also studied a variation of the SSQ problem for moving query points. Our VCS2 algorithm effectively updates the spatial skyline upon any change to the locations of the query points. This approach exhibits significantly better performance than recomputing the skyline, even when the recomputation uses our efficient VS2 algorithm. Finally, we showed how all three proposed algorithms can address the SSQ problem when mixed with non-spatial attributes.

A promising direction for future work is to study the challenges of SSQ in metric spaces such as road networks. Intuitively, geometric properties similar to those we proved in Section B.3 for a vector space can be proved for metric spaces. One can utilize these properties to propose efficient SSQ algorithms for road network databases.
Abstract
Spatial query processing in spatial databases, Geographic Information Systems (GIS), and on-line maps attempts to extract specific geometric relations among spatial objects. As a prominent category of spatial queries, the class of nearest neighbor queries retrieves spatial objects that minimize specific functions of their distance to a given object (e.g., the closest data point to a query point). The most efficient algorithms that address nearest neighbor queries utilize the R-tree index structure to avoid I/O operations for groups of data objects that do not contain the final query result. However, they still incur unnecessary I/O operations, as R-trees are not efficient in the elaborate exploration of the portion of the data space that includes the result.