USC Computer Science Technical Reports, no. 840 (2005)
The Optimal Sequenced Route Query

Mehdi Sharifzadeh, Mohammad Kolahdouzan, Cyrus Shahabi
Computer Science Department
University of Southern California
Los Angeles, CA 90089-0781
[sharifza, kolahdoz, shahabi]@usc.edu

ABSTRACT

Several variations of nearest neighbor (NN) query have been investigated by the database community. However, real-world applications often result in the formulation of new variations of the NN problem demanding new solutions. In this paper, we study an unexploited and novel form of NN queries named Optimal Sequenced Route (OSR) query in both vector and metric spaces. OSR strives to find a route of minimum length starting from a given source location and passing through a number of typed locations in a specific sequence imposed on the types of the locations. We first transform the OSR problem into a shortest path problem on a large planar graph. We show that a classic shortest path algorithm such as Dijkstra's is impractical for most real-world scenarios. Therefore, we propose LORD, a light threshold-based iterative algorithm, that utilizes various thresholds to filter out the locations that cannot be in the optimal route. Then we propose R-LORD, an extension of LORD which uses R-tree to examine the threshold values more efficiently. Finally, LORD and R-LORD are not applicable in metric spaces, hence we propose another approach that progressively issues NN queries on different point types to construct the optimal route for the OSR query. Our extensive experiments using both real-world and synthetic datasets verify that our algorithms significantly outperform the Dijkstra-based approach in terms of processing time (up to two orders of magnitude) and required workspace (up to 90% reduction on average).

1. INTRODUCTION

A nearest neighbor query is defined as finding the object(s) with the shortest distance(s) to a query point.
Although this type of query is useful, a user more often intends to plan a trip to several (and possibly different types of) locations in some sequence, and is interested in finding the optimal route that minimizes her total travel distance or time. Besides commercial applications such as in-vehicle navigation devices or online map services, where this type of query is in great demand, this query is of critical importance in crisis management and defense/intelligence systems, where being able to respond to a series of incidents in the shortest time is vital. In this paper, we introduce and address this type of query in spatial databases.

1.1 Motivation

Suppose that we are planning a car trip in town as follows: first we intend to leave home for a gas station to fuel the car, then we plan to stop by a library branch to check in a book, and finally, we need to go to a post office to mail a package. Naturally, we prefer to drive the minimum overall distance to these destinations. In other words, we need to find the locations of the gas station $g_i$, the library branch $l_j$, and the post office $p_k$ for which visiting them in the sequence of the plan yields the shortest trip (in terms of distance or time). We call this the Optimal Sequenced Route or OSR.

Using Figure 1, we can show that this query may not be optimally answered by simply performing a series of independent nearest neighbor queries at different locations. The figure shows a network of equally sized connected squares, three different types of point sets shown by white, black and gray circles, which represent gas stations, libraries, and post offices, respectively, and a starting point $p$ (shown by $\triangle$). A greedy approach to solve our query is to first locate the closest gas station to $p$, $g_2$, then find the closest library to $g_2$, $l_2$, and finally find the closest post office to $l_2$, $p_2$.
Assuming the length of each edge of the squares is 1 unit, the length of the route specified by this greedy approach, $(p, g_2, l_2, p_2)$, shown by dotted lines in the figure, is 15 units. However, the route $(p, g_1, l_1, p_1)$ (shown with solid lines in the figure) with the length of 12 units is the optimum answer to our query. Note that $g_1$ is not the closest gas station to $p$ and $l_1$ is actually the farthest library from $g_1$. This shows that the optimum result for our specific query can be substantially different from what a greedy approach would suggest.

Figure 1: A network with three different types of point sets

1.2 Uniqueness

To the best of our knowledge, although different variations of nearest neighbor queries have been extensively studied by the database research community, no one has explored the problem of the Optimal Sequenced Route (OSR) query. This problem is closely related to the Traveling Salesman Problem (TSP). TSP asks for the minimum cost round-trip route from a starting point to a given set of points. As a classic problem in graph theory, TSP is the search for the Hamiltonian cycle with the least weight in a weighted graph. With TSP, all the points in the set participate in the route and the sequence in which the points must be visited is requested. In contrast, OSR enforces a specific sequence to find the appropriate points from a number of point sets. The TSP-related problem most similar to OSR is the Sequential Ordering Problem (SOP), in which a Hamiltonian path with a specific node precedence constraint is needed. Similar to all TSP variations, the solution path to SOP must still pass through all the given points. Conversely, the main challenge with OSR is to efficiently select a sequence of points, each of which can be any member of a given point set. The commercial online Yellow Pages such as those of Yahoo!
and MapQuest can only search for the k-nearest neighbors in one specific category (or point set) to a given query location and cannot find the optimal sequenced route from the query to a group of point sets.

1.3 Contributions

In this paper, we introduce and formally define the problem of OSR query in spatial databases. We also propose alternative solutions to the OSR queries for both vector and metric spaces.

For vector spaces, we first propose a solution that is based on generating a weighted directed graph from the input road network, and then utilizing Dijkstra's algorithm to find the distances from a starting point to all possible end points on the generated graph. This solution becomes impractical when the generated graph is large, which is the case for most real-world problems. Hence, we propose a second solution, LORD, that utilizes some threshold values to filter out the points that cannot possibly be on the optimal route, and then generates the optimal route in reverse sequence (i.e., from the ending to the starting point). We then propose R-LORD, an optimization of LORD that transforms the thresholds in LORD into range queries and performs these range queries using an R-tree index structure. Finally, we propose PNE to address OSR queries in metric spaces. PNE is based on progressively finding the nearest neighbors in different point sets in order to construct the optimal route from the starting to the ending point.

We also discuss two variations of OSR queries: 1) when the query must end in a specific point, and 2) when more than one optimum route is requested; and show how our proposed solutions can address these variations of OSR as well. Finally, through extensive experiments with both real-world and synthetic datasets, we show that R-LORD can efficiently answer an OSR query, scale to large datasets, and perform independently of the distribution and density of the data.

The remainder of this paper is organized as follows.
We first formally define the problem of OSR queries and the terms we use throughout the paper in Section 2. In Section 3, we discuss our alternative solutions for OSR queries in vector and metric spaces. We address two variations of the OSR query in Section 4. The performance evaluation of our proposed algorithms is presented in Section 5. The work related to OSR and similar nearest neighbor queries is presented in Section 6. Finally, we conclude the paper and discuss our future work in Section 7.

2. FORMAL PROBLEM DEFINITION

In this section, we describe the terms and notations that we use throughout the paper, formally define the OSR query, and discuss the unique properties of OSR that we utilize in our solutions.

2.1 Problem Definition

Let $U_1, U_2, \ldots, U_n$ be $n$ sets, each containing points in a $d$-dimensional space $\mathbb{R}^d$, and $D(\cdot,\cdot)$ be a distance metric defined in $\mathbb{R}^d$ where $D(\cdot,\cdot)$ obeys the triangle inequality. To illustrate, in the example of Figure 1, $U_1$, $U_2$, and $U_3$ are the sets of black, white, and grey points, representing libraries, gas stations and post offices, respectively. We first define the following five terms.

Definition 1: Given $n$, the number of point sets $U_i$, we say $M = (M_1, M_2, \ldots, M_m)$ is a sequence if and only if $1 \le M_i \le n$ for $1 \le i \le m$. That is, given the point sets $U_i$, a user's OSR query is valid only if she asks for existing location types. For the example of Figure 1 where $n = 3$, $(2, 1, 2)$ is a sequence (specifying a gas station, a library, and a gas station) while $(3, 4, 1)$ is not because 4 is not an existing point set.

Definition 2: We say $R = (P_1, P_2, \ldots, P_r)$ is a route if and only if $P_i \in \mathbb{R}^d$ for each $1 \le i \le r$. We use $p \oplus R = (p, P_1, \ldots, P_r)$ to denote a new route that starts from starting point $p$ and goes sequentially through $P_1$ to $P_r$. The route $p \oplus R$ is the result of adding $p$ to the head of route $R$.

Definition 3: We define the length of a route $R = (P_1, P_2, \ldots, P_r)$ as

$$L(R) = \sum_{i=1}^{r-1} D(P_i, P_{i+1}) \tag{1}$$

Note that $L(R) = 0$ for $r = 1$.
For example, the length of the route $(g_2, l_2, g_3)$ in Figure 1 is 4 units where $D$ is the Manhattan distance.

Definition 4: Let $M = (M_1, M_2, \ldots, M_m)$ be a sequence. We refer to the route $R = (P_1, P_2, \ldots, P_m)$ as a sequenced route that follows sequence $M$ if and only if $P_i \in U_{M_i}$ where $1 \le i \le m$. In Figure 1, $(g_2, l_2, g_3)$ is a sequenced route that follows $(2, 1, 2)$, which means that the route passes only through a white, then a black and finally a white point.

Definition 5: Given a starting point $p$, a sequence $M = (M_1, \ldots, M_m)$, and point sets $\{U_1, \ldots, U_n\}$, we refer to $R_g(p, M) = (P_1, \ldots, P_m)$ as the greedy sequenced route that follows $M$ from point $p$ if and only if it satisfies the following:

1. $P_1$ is the closest point to $p$ in $U_{M_1}$, and
2. For $1 \le i < m$, $P_{i+1}$ is the closest point to $P_i$ in $U_{M_{i+1}}$.

It is clear that $R_g(p, M)$ is unique for a given point $p$, a sequence $M$, and the sets $U_i$. Moreover, by definition, the optimal sequenced route $R$ is never longer than the greedy sequenced route for the given sequence $M$, i.e., $L(p, R) \le L(p, R_g(p, M))$. We now formally define the OSR query.

Definition 6: Assume that we are given a sequence $M = (M_1, M_2, \ldots, M_m)$. For a given starting point $p$ in $\mathbb{R}^d$ and the sequence $M$, the Optimal Sequenced Route (OSR) Query, $Q(p, M)$, is defined as finding a sequenced route $R$ that follows $M$ where the value of the following function $L$ is minimum over all the sequenced routes that follow $M$:

$$L(p, R) = D(p, P_1) + L(R) \tag{2}$$

Note that $L(p, R)$ is in fact the length of route $R_p = p \oplus R$.

Table 1: Summary of notations

Symbol              Meaning
$U_i$               a point set in $\mathbb{R}^d$
$|U_i|$             cardinality of the set $U_i$
$n$                 number of point sets $U_i$
$D(\cdot,\cdot)$    distance function in $\mathbb{R}^d$
$M$                 a sequence, $= (M_1, \ldots, M_m)$
$|M|$               $m$, size of sequence $M$ = number of items in $M$
$M_i$               $i$-th member of $M$
$R$                 route $(P_1, P_2, \ldots, P_r)$, where $P_i$ is a point
$|R|$               $r$, number of points in $R$
$P_i$               $i$-th point in $R$
$L(R)$              length of $R$
$p \oplus R$        route $R_p = (p, P_1, \ldots, P_r)$ where $R = (P_1, \ldots, P_r)$
$L(p, R)$           length of the route $p \oplus R$
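Definitions 3 to 6 can be made concrete with a small brute-force sketch. It is only an illustration under stated assumptions: the distance $D$ is taken to be Euclidean, the point sets are held in a dictionary `U` that maps a type index to a list of coordinate tuples, and the function names are hypothetical. Enumerating every sequenced route is exponential in $m$, so this serves as a reference for small inputs rather than as a practical method.

```python
from itertools import product
from math import dist  # Euclidean distance (assumed choice of D), Python 3.8+


def route_length(route):
    """L(R) of Equation 1: sum of distances between consecutive points."""
    return sum(dist(a, b) for a, b in zip(route, route[1:]))


def total_length(p, route):
    """L(p, R) of Equation 2: D(p, P_1) + L(R)."""
    return dist(p, route[0]) + route_length(route)


def greedy_route(p, M, U):
    """R_g(p, M) of Definition 5: always move to the nearest point of the next type."""
    route, current = [], p
    for point_type in M:
        current = min(U[point_type], key=lambda q: dist(current, q))
        route.append(current)
    return tuple(route)


def brute_force_osr(p, M, U):
    """Q(p, M) of Definition 6, found by enumerating every sequenced route that follows M."""
    candidates = product(*(U[point_type] for point_type in M))
    return min(candidates, key=lambda R: total_length(p, R))


# Toy data (hypothetical): two points of type 1, one point of type 2.
p = (0, 0)
U = {1: [(-1, 0), (2, 0)], 2: [(3, 0)]}
M = (1, 2)
print(greedy_route(p, M, U), total_length(p, greedy_route(p, M, U)))
print(brute_force_osr(p, M, U), total_length(p, brute_force_osr(p, M, U)))
```

In this toy instance the greedy route first visits $(-1, 0)$ and has length 5, while the optimal sequenced route starts at $(2, 0)$ and has length 3, mirroring the gap between the greedy and optimal routes in Figure 1.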
Throughout the paper, we use $Q(p, M) = (P_1, P_2, \ldots, P_m)$ to denote the optimal SR, the answer to the OSR query $Q$. For the example in Section 1.1 where $(U_1, U_2, U_3) = (\text{black}, \text{white}, \text{gray})$, $M = (2, 1, 3)$, and $D$ is the shortest path distance, the answer to the OSR query is $Q(p, M) = (g_1, l_1, p_1)$. We use candidate SR to refer to all other sequenced routes that follow sequence $M$. Table 1 summarizes the notations we use throughout the paper.

2.2 Properties

Before describing our algorithms for OSR queries, we present the following three properties which are exploited by our algorithms.

Property 1: For a route $R = (P_1, \ldots, P_i, P_{i+1}, \ldots, P_r)$ and a given point $p$, we have

$$L(p, R) \ge D(p, P_i) + L((P_i, \ldots, P_r)) \tag{3}$$

Proof: The triangle inequality implies that $D(p, P_1) + \sum_{j=1}^{i-1} D(P_j, P_{j+1}) \ge D(p, P_i)$. Adding $\sum_{j=i}^{r-1} D(P_j, P_{j+1}) = L((P_i, \ldots, P_r))$ to both sides of the inequality and considering the definition of the function $L()$ in Equation 2 yields Equation 3.

As we will show in Section 3.2.1, we utilize Property 1 to narrow down the candidate sequenced routes for $Q(p, M)$ by filtering out the points whose distance to $p$ is greater than a threshold, and hence cannot possibly be on the optimal route. Note that this property is applicable to all routes in the space.

The answer to the OSR query $Q(p, M)$ demonstrates the following two unique properties. We utilize these properties to improve the exhaustive search among all potential routes of a given sequence.

Property 2: If $Q(p, M) = R = (P_1, \ldots, P_{m-1}, P_m)$, then $P_m$ is the closest point to $P_{m-1}$ in $U_{M_m}$.

Proof: The proof of this property is by contradiction. Assume that the closest point to $P_{m-1}$ in $U_{M_m}$ is $p_x \ne P_m$. Therefore, we have $D(P_{m-1}, p_x) < D(P_{m-1}, P_m)$ and hence $L(p, (P_1, \ldots, P_{m-1}, p_x)) < L(p, (P_1, \ldots, P_{m-1}, P_m))$. This contradicts our initial assumption that $R$ is the answer to $Q(p, M)$.
Property 2 states that, given that $P_1, \ldots, P_{m-1}$ are consecutively on the optimal route, it is only required to find the first nearest neighbor of $P_{m-1}$ to complete the route; subsequent nearest neighbors cannot possibly be on the optimal route and hence will not be examined. Note that this property does not prove that the greedy route is always optimal. Instead, it implies that only the last point of the optimal sequenced route $R$ (i.e., $P_m$) is the nearest point to its previous point in the route (i.e., $P_{m-1}$).

Property 3: If $Q(p, M) = (P_1, \ldots, P_i, P_{i+1}, \ldots, P_m)$ for the sequence $M = (M_1, \ldots, M_i, M_{i+1}, \ldots, M_m)$, then for any point $P_i$ and $M' = (M_{i+1}, \ldots, M_m)$, we have $Q(P_i, M') = (P_{i+1}, \ldots, P_m)$.

Proof: The proof of this property is by contradiction. Assume that $Q(P_i, M') = R' = (P'_1, \ldots, P'_{m-i})$ with $R' \ne (P_{i+1}, \ldots, P_m)$. Obviously $(P_{i+1}, \ldots, P_m)$ follows sequence $M'$, therefore we have $L(P_i, R') < L(P_i, (P_{i+1}, \ldots, P_m))$. We add $L(p, (P_1, \ldots, P_i))$ to both sides of this inequality to get

$$L(p, (P_1, \ldots, P_i, P'_1, \ldots, P'_{m-i})) < L(p, (P_1, \ldots, P_m))$$

The above inequality shows that the answer to $Q(p, M)$ must be $(P_1, \ldots, P_i, P'_1, \ldots, P'_{m-i})$, which clearly follows sequence $M$. This contradicts our assumption that $Q(p, M) = (P_1, \ldots, P_m)$.

3. OSR SOLUTIONS

In this section, we propose alternative solutions for OSR queries in vector and metric spaces. We start by discussing a naive solution based on Dijkstra's algorithm. We then propose LORD, an approach that employs some threshold values to efficiently prune non-candidate routes. Next we discuss R-LORD, an optimization of LORD that utilizes an R-tree index structure. Finally, we discuss a solution that progressively performs nearest neighbor queries on different point sets to find the optimal route for metric spaces.

3.1 The Dijkstra-based Solution

Suppose we have an OSR query for a network with a starting point $p$, a sequence $M$, and point sets $\{U_{M_1}, \ldots, U_{M_m}\}$.
We construct a weighted directed graph $G$ for the given network where the set $V = \bigcup_{i=1}^{m} U_{M_i} \cup \{p\}$ contains the vertices of $G$ and its edges are generated as follows. The vertex corresponding to $p$ is connected to all the vertices in point set $U_{M_1}$. Subsequently, each vertex corresponding to a point $x$ in $U_{M_i}$ is connected to all the vertices corresponding to the points in $U_{M_{i+1}}$, where $1 \le i \le m-1$. Figure 2 illustrates an example of such a graph. As shown in the figure, the graph $G$ is a $k$-partite graph where $k = m+1$. The weight assigned to each edge of $G$ is the distance between the two points corresponding to its two end-vertices.

Figure 2: The weighted directed graph $G$ for a sequence $M$

This graph in fact shows all possible candidate sequenced routes (candidate SRs) for the given $M$ and the set of $U$s. To be precise, it shows all the routes $R_p = p \oplus R$ where $R$ is a candidate SR. By definition, the optimal route for the given OSR query is the candidate SR $R$ for which $R_p$ has the minimum length. Considering graph $G$, we notice that the OSR problem can be simply considered as finding the shortest paths (i.e., with minimum weight) from $p$ to each of the vertices that correspond to the points in $U_{M_m}$ (i.e., the last level of points in Figure 2), and then returning the path with the shortest length as the optimal route. This can be achieved by performing Dijkstra's algorithm on $G$.

There are two drawbacks with this solution. First, the graph $G$ has $|E| = |U_{M_1}| + \sum_{i=1}^{m-1} |U_{M_i}| \cdot |U_{M_{i+1}}|$ directed edges, which is a large number considering the usually large cardinality of the sets $U_i$. For instance, for a real-world dataset with 40,000 points and $|M| = 3$, $G$ has 124 million edges (see Section 5). The time complexity of Dijkstra's classic algorithm to find the shortest path between 2 nodes in graph $G$ is $O(|E| \log |V|)$. Hence, the complexity of this naive algorithm is $O(|U_{M_m}| |E| \log |V|)$. Second, this huge graph must be built and kept in main memory.
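The layered graph $G$ and the search over it can be sketched as follows. This is a minimal illustration, not the implementation evaluated in Section 5: it assumes Euclidean edge weights, holds the point sets in a hypothetical dictionary `U` keyed by type index, and generates the edges of $G$ on the fly instead of materializing them. It also stops as soon as the first last-level vertex is settled, which gives the same answer as computing all shortest paths to the last level and taking the minimum, because edge weights are non-negative.

```python
import heapq
from math import dist  # Euclidean edge weights (an assumption of this sketch)


def dijkstra_osr(p, M, U):
    """Shortest path from p through the levels U[M[0]], ..., U[M[-1]] of graph G.

    A vertex is a pair (level, point): level 0 holds only p and level i holds the
    points of U[M[i-1]], so the same point may appear on several levels.
    """
    m = len(M)
    start = (0, p)
    best = {start: 0.0}   # shortest known distance from p to each vertex
    parent = {}           # predecessor of each vertex on its best path
    heap = [(0.0, start)]
    while heap:
        d, vertex = heapq.heappop(heap)
        if d > best[vertex]:
            continue      # stale heap entry
        level, point = vertex
        if level == m:
            # First settled vertex of the last level: rebuild its path back to p.
            route = []
            while vertex != start:
                route.append(vertex[1])
                vertex = parent[vertex]
            return tuple(reversed(route)), d
        # Edges of G leave this vertex toward every point of the next level.
        for q in U[M[level]]:
            nxt = (level + 1, q)
            nd = d + dist(point, q)
            if nd < best.get(nxt, float("inf")):
                best[nxt] = nd
                parent[nxt] = (level, point)
                heapq.heappush(heap, (nd, nxt))
    return None, float("inf")


# Toy data (hypothetical): the route through (2, 0) beats the greedy choice (-1, 0).
print(dijkstra_osr((0, 0), (1, 2), {1: [(-1, 0), (2, 0)], 2: [(3, 0)]}))
```

Even with implicit edges, the search may relax on the order of $|E|$ edges, which is the first drawback noted above.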
Although there exist versions of Dijkstra's algorithm that are adjusted to use external memory [9], they incur so much overhead that they are hard to employ for OSR queries (see Section 6 for the complete discussion). This makes the classic Dijkstra's algorithm impractical for answering OSR queries in real time.

In order to improve the performance of this naive Dijkstra-based solution, we can issue a range query around the starting point $p$ and only select the points that are closer to $p$ than $L(p, R_g(p, M))$. This is because the length of any route $R$ which includes a point outside this range is greater than that of the greedy route $R_g(p, M)$. Therefore, we build the graph $G$ using only the points within the range instead of all the points. In Section 5, we show that even this enhanced version of Dijkstra's algorithm is not as efficient as our approaches.

3.2 OSR in Vector Space

In this section, we assume that the distance function $D(\cdot,\cdot)$ is the Euclidean distance between the points in $\mathbb{R}^d$. We provide two solutions for the OSR problem in this vector space.

3.2.1 LORD: Light Optimal Route Discoverer

This section describes our Light Optimal Route Discoverer (LORD) for addressing OSR queries. LORD has the same flavor as Dijkstra's algorithm, but as a threshold-based algorithm it functions in the context of the OSR problem, considering its unique properties described in Section 2.2. We name it a light algorithm in terms of memory, as we show that LORD's workspace is smaller than the workspace required to apply the Dijkstra-based approach to the OSR problem.

Given an OSR query $Q(p, M)$, LORD iteratively builds and maintains a set of partial sequenced routes (partial SRs) in the reverse sequence, i.e., from the end points (points in $U_{M_m}$) toward $p$. During each iteration $i$ of LORD, points from the point set $U_{M_{(m-i+1)}}$ are added to the head of each of these partial SRs to make them closer to a candidate SR and finally, to the solution (i.e., optimal SR).
To make the solution space smaller, LORD only considers those points in $U_{M_{(m-i+1)}}$ whose addition to the partial SRs does not generate routes longer than a variable threshold value $T_v$. LORD further examines the partial SRs by calculating their lengths after adding $p$, and discards the routes whose corresponding length is more than a constant threshold value $T_c$, where $T_c$ is the length of the greedy route.

We now describe LORD in more detail using the example shown in Figure 3. Figure 3a depicts a starting point $p$ and three different sets of points $U_1$, $U_2$, and $U_3$, shown as black ($b_i$), white ($w_i$) and grey ($g_i$) points, respectively. Without loss of generality, we assume that the distance between each two points in the space is their Euclidean distance. Given the starting point $p$ (shown as $\triangle$ in the figure), we want to find the route $R$ with the minimum $L(p, R)$ from a white, to a black and then a grey point. Therefore, the required OSR query is formulated as $Q(p, (2, 1, 3))$.

The first step in LORD is to issue $m = 3$ consecutive nearest neighbor queries to find the greedy route that follows $(2, 1, 3)$ from $p$. To be specific, the algorithm first finds the closest $w_i$ to $p$ (i.e., $w_2$), then the closest $b_i$ to $w_2$ (i.e., $b_2$), and finally the closest $g_i$ to $b_2$ (i.e., $g_2$). Figure 3b renders the greedy route $R_g(p, (2, 1, 3))$ as $(w_2, b_2, g_2)$. LORD initializes both threshold values $T_v$ and $T_c$ to the length of $p \oplus R_g(p, M)$ (i.e., $L(p, (w_2, b_2, g_2))$). Note that the value of $T_c$ remains the same while the value of $T_v$ decreases after each iteration. Subsequently, it discards all the points whose distances to $p$ are more than $T_v$, i.e., the points that are outside the circle shown in Figure 3c (i.e., $w_1$, $w_4$, and $g_1$). This is because any route $R$ that contains a point outside this circle has $L(p, R) > L(p, R_g(p, M))$ and hence, by definition, cannot be the optimal route.
At this point, LORD generates a set, $S$, for partial candidate routes and inserts into $S$ the gray nodes (i.e., points in $U_{M_m}$) which are inside the circle in Figure 3c, i.e., $S = \{(g_2), (g_3), (g_4), (g_5), (g_6)\}$. Note that at this stage, the length of the partial routes in $S$ is zero.

In the first iteration of LORD, each point $x \in U_{M_{m-1}}$ (i.e., the $b_i$'s) is added to the head of each partial SR $PSR = (P_1) \in S$ if: a) $x$ is inside the circle of radius $T_v$, and b) $D(p, x) + D(x, P_1) + L(PSR) \le T_c$. The rationale behind the second condition is Property 1; if the inequality does not hold, then $L(p, (x, P_1, \ldots, P_i))$ will be greater than $T_c$ and hence, $(x, P_1, \ldots, P_i)$ cannot be part of the optimal route. For instance, in Figure 3d, point $b_4$ is added to $(g_3)$ and $(g_4)$ resulting in new partial SRs $\{(b_4, g_3), (b_4, g_4)\}$, but cannot be added to $(g_2)$, $(g_5)$ and $(g_6)$. Moreover, between partial SRs that have the same first point (e.g., $(b_4, g_3)$ and $(b_4, g_4)$), only the one with the shortest length will be kept in $S$ (i.e., Property 2). In addition, any $PSR \in S$ to which no $x$ can be added will be discarded. For example, in Figure 3d, $(g_6)$ will be discarded because if any $b_i$ is added to it, at least one of the above two conditions will not be met. Hence, at the end of the first iteration, the set of the partial SRs will become $\{(b_6, g_5), (b_4, g_3), (b_3, g_3), (b_2, g_2), (b_1, g_2)\}$ (Figure 3e).

At the end of each iteration, the value of the variable threshold $T_v$ is decreased as follows.
Suppose that $Q(p, M) = (q_1, \ldots, q_i, \ldots, q_m)$ and we are examining iteration $(m-i+1)$ (i.e., the partial SRs in $S$ are in the form of $(P_{i+1}, \ldots, P_m)$). The definition of the greedy route implies that $L(p, (q_1, \ldots, q_m)) \le L(p, R_g(p, M)) = T_c$ and by considering Property 1, we have:

$$D(p, q_i) + L((q_{i+1}, \ldots, q_m)) < D(p, q_i) + L((q_i, \ldots, q_m)) \le T_c$$

which can be rewritten as:

$$D(p, q_i) \le T_c - L((q_{i+1}, \ldots, q_m)) \tag{4}$$

Note that inequality 4 must hold for all points $q_i$ that are to be examined at iteration $(m-i+1)$. Hence, by replacing $L((q_{i+1}, \ldots, q_m))$ with its minimum value, we obtain the maximum value for $D(p, q_i)$ for any $q_i$. Therefore, for any point $q_i$ that is examined in iteration $(m-i+1)$, we must have

$$D(p, q_i) \le T_v = T_c - \min_{PSR \in S} L(PSR)$$

Note that at each iteration, the lengths of the partial SRs in $S$, and hence the value of $\min_{PSR \in S} L(PSR)$, increase. This yields smaller values for $T_v$ after each iteration. This is also shown in Figure 3; the radius of the circle in Figure 3f is smaller than the radius of the circle in Figure 3c.

The subsequent $(m-2)$ iterations of LORD are performed similarly and the partial routes in $S$ will become complete routes (i.e., candidate SRs that follow $M$) after the last iteration is completed (Figure 3g). Finally, LORD examines the distance from $p$ to the first point in each complete route in $S$ (i.e., $\{(w_2, b_2, g_2), (w_3, b_4, g_3)\}$) and selects the one that generates the minimum total distance, i.e., the route with the minimum value for the $L()$ function, as the result of $Q(p, (2, 1, 3))$ (Figure 3h).

Figure 3: Different iterations of LORD. Panels: a) $Q(p, (2, 1, 3))$; b) $R_g(p, (2, 1, 3)) = (w_2, b_2, g_2)$; c) $T_c = D(p, w_2) + L(w_2, b_2, g_2)$; d), e) first iteration; f) $T_v = T_c - \min_{R \in S}(L(R))$; g) complete routes; h) $R_{min} = (w_3, b_4, g_3)$
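The iterations walked through above can be sketched in Python. This is an illustrative reading of LORD rather than the authors' implementation: it assumes Euclidean distance, stores the point sets in a hypothetical dictionary `U` keyed by type index, keeps each partial SR together with its length, and adds a small tolerance `EPS` (our addition) so that floating-point rounding cannot discard the greedy route itself.

```python
from math import dist  # Euclidean distance, as assumed in Section 3.2

EPS = 1e-9  # tolerance for floating-point comparisons (an addition of this sketch)


def greedy_length(p, M, U):
    """L(p, R_g(p, M)): length of the greedy sequenced route from p."""
    total, current = 0.0, p
    for point_type in M:
        nxt = min(U[point_type], key=lambda q: dist(current, q))
        total += dist(current, nxt)
        current = nxt
    return total


def lord(p, M, U):
    """Return the optimal sequenced route Q(p, M), built from the last point backward."""
    m = len(M)
    t_c = t_v = greedy_length(p, M, U)      # constant and variable thresholds
    # Partial SRs are (length, route) pairs; start with points of the last type.
    S = [(0.0, (q,)) for q in U[M[-1]] if dist(p, q) <= t_v + EPS]
    for i in range(m - 2, -1, -1):          # 0-based positions m-2, ..., 0 of M
        new_S = []
        for q in U[M[i]]:
            d_pq = dist(p, q)
            if d_pq > t_v + EPS:
                continue                    # q lies outside the circle of radius T_v
            extended = [(dist(q, R[0]) + length, (q,) + R)
                        for length, R in S
                        if d_pq + dist(q, R[0]) + length <= t_c + EPS]
            if extended:
                new_S.append(min(extended)) # keep only the shortest route headed by q
        S = new_S
        t_v = t_c - min(length for length, _ in S)
    # Select the complete route with the minimum L(p, R).
    return min(S, key=lambda item: dist(p, item[1][0]) + item[0])[1]


# Toy data (hypothetical) with three point types and the sequence (2, 1, 3).
U = {1: [(0, 1), (4, 0)], 2: [(1, 0), (0, 5)], 3: [(5, 0), (0, -3)]}
print(lord((0, 0), (2, 1, 3), U))
```

On this toy input the greedy route has length of about 6.41, while the route returned by the sketch, $((1,0), (4,0), (5,0))$, has length 5.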
Figure 4 shows the pseudocode of LORD. Lines 3−5 perform the first range query using the threshold $T_v$ and initialize the set of partial SRs $S$. The $(m-1)$ iterations are performed in lines 6−16, where lines 9 and 12 check if a point can be added to the partial SRs in $S$ and line 16 updates the value of $T_v$. Finally, line 17 selects the route in $S$ that generates the minimum $L(p, R)$ as the result of $Q(p, M)$.

Algorithm LORD(point p, sequence M)
 1. S = {};
 2. T_v = T_c = L(p, R_g(p, M));
 3. for q in U_{M_m}
 4.     if (D(p, q) ≤ T_v)
 5.         S = S ∪ {(q)};
 6. for i = m−1 downto 1
 7.     S' = {};
 8.     for q in U_{M_i}
 9.         if (D(p, q) ≤ T_v)
10.             S'' = {};
11.             for R = (P_1, ..., P_{m−i}) in S
12.                 if (D(p, q) + D(q, P_1) + L(R) ≤ T_c)
13.                     S'' = S'' ∪ {(q, P_1, ..., P_{m−i})};
14.             if S'' ≠ {} then S' = S' ∪ {argmin_{R'' ∈ S''} (L(R''))};
15.     S = S';
16.     T_v = T_c − min_{R ∈ S} (L(R));
17. R_min = argmin_{R ∈ S} (L(p, R));
18. return R_min;

Figure 4: Pseudocode of the LORD Algorithm

Figure 5: The locus of the points $x$ for LORD

3.2.2 R-LORD: R-tree based LORD

We described LORD in Section 3.2.1 without any assumption on the structure of the points in each $U_i$. We now discuss the situation in which the points in the $U_i$'s are stored in an R-tree index structure. We utilize the features of the index structure to develop an R-tree-friendly version of LORD. The core idea behind this solution is to use the points' neighborhood information implicitly stored in the R-tree MBRs to more efficiently prune the candidate points at each iteration of LORD. Towards this goal, we transform LORD's point selection criterion into range queries applicable to an R-tree. Then, we show that the point selection can be performed using a single range query. Finally, we describe our algorithm, which uses this range to find the solution for an OSR query utilizing an R-tree.

3.2.2.1 Point Selection Criterion in LORD

As we discussed in Section 3.2.1, at each iteration $i$, LORD prunes the points in $U_{M_i}$ in two steps.
First, it ignores any point of the set $U_{M_i}$ that is farther than the value of the variable threshold $T_v$ from the starting point $p$. This is a simple range query Q1 whose range, Range(Q1), is a circle with a known radius $T_v$ centered at $p$. Second, any point $x$ returned by Q1 is checked against all partial SRs $PSR \in S$. If, for a given $PSR = (P_1, \ldots, P_{|PSR|}) \in S$, the value of $D(p, x) + D(x, P_1) + L(PSR)$ is greater than the constant threshold $T_c$ (i.e., the length of the greedy route), then point $x$ is not added to the beginning of that PSR. Otherwise, a new partial SR, $(x, P_1, \ldots, P_{|PSR|})$, is generated. This clearly shows that the second query Q2 uses a more complicated range to prune the results of Q1.

To identify Range(Q2), we first find the locus of the points $x$ which can possibly be added to a $PSR = (P_1, \ldots, P_{|PSR|}) \in S$. For such a point $x$, we must have $D(x, p) + D(x, P_1) \le T_c - L(PSR)$ (Line 12 in Figure 4). As $L(PSR)$ and $T_c$ are constant values for a given PSR and query $Q(p, M)$, the sum of $x$'s distances from two fixed points $p$ and $P_1$ cannot be larger than a constant. Hence, $x$ must be on or inside an ellipse defined by the foci $p$ and $P_1$ and the constant $T_c - L(PSR)$. Figure 5 shows the locus of the points $x$ for a given route PSR as inside and on an ellipse $E(p, PSR)$.

Figure 6: Range query Q2 and its MBR for partial routes in LORD

Query Q2 is defined in terms of the set of partial SRs stored in $S$ in the current iteration. For each PSR, we showed that LORD appends points inside ellipse $E(p, PSR)$ to the head of the PSR in order to build a new partial candidate route. All such ellipses, each corresponding to a partial SR in $S$, intersect as they all share the common focus point $p$. The union of these ellipses contains all the points $x$ (of the appropriate set), where for each, there is exactly one route starting with $x$ built at the end of the current iteration. In other words, this union should be the range used in query Q2.
Figure 6 illustrates an example for the current set $S$ during an iteration of LORD. The set includes three partial SRs of the same length, each starting with a black point. The sequence $M$ of the query $Q(p, M)$ dictates the type of the point which must be added to the head of each partial SR. Any point outside the union of these three ellipses is ignored by LORD.

Up to this point, we have identified the range of the two main queries Q1 and Q2 used in LORD. In the following, we show that any ellipse for the range Q2 is entirely inside the circle for range Q1 and hence, the range of Q2 is completely inside that of Q1.

Lemma 1. During each iteration of LORD for $Q(p, M)$, given a partial SR $PSR \in S$, any point $x$ inside or on the ellipse $E(p, PSR)$ has a distance less than the current value of the variable threshold $T_v$ from point $p$ (i.e., $D(x, p) < T_v$).

Proof. As point $x$ is inside or on ellipse $E(p, PSR)$ corresponding to the route PSR, we have

$$D(x, p) + D(x, P_1) \le T_c - L(PSR) \le T_c - \min_{PSR \in S} L(PSR) \tag{5}$$

The right side of the above inequality is the current value of $T_v$. It directly yields that $D(x, p) \le T_v - D(x, P_1)$ and subsequently, we have $D(x, p) < T_v$.

Lemma 1 shows that any ellipse $E(p, PSR)$ is completely inside the circular range of Q1. Now, as Range(Q2) is the union of all ellipses $E(p, PSR)$ corresponding to all the partial SRs in $S$, it can be concluded that it is entirely inside Range(Q1). Note that at each iteration, LORD builds a new route using only the points in the intersection of Range(Q1) and Range(Q2). Given Lemma 1, this intersection is the same as Range(Q2). Hence, the algorithm must only consider the points which are within the range of Q2 from $p$, to be added to the partial SRs in $S$.

3.2.2.2 R-tree Friendly LORD

Recall that our goal is to transform the threshold values utilized by LORD into range queries that can be performed on R-tree index structures. In Section 3.2.2.1, we showed that the two range queries Q1 and Q2 employed by LORD can be reduced to only one, as Q2 is entirely inside Q1.
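The membership test behind Range(Q2) is short enough to sketch directly. The snippet below is illustrative only and assumes Euclidean distance; the function names and the representation of $S$ as a list of `(psr, length)` pairs are hypothetical.

```python
from math import dist  # Euclidean distance (assumed)


def in_ellipse(x, p, psr, psr_length, t_c):
    """True if x is inside or on E(p, PSR): D(x, p) + D(x, P_1) <= T_c - L(PSR)."""
    return dist(x, p) + dist(x, psr[0]) <= t_c - psr_length


def range_q2(points, p, S, t_c):
    """Points that fall in the union of the ellipses of all partial SRs in S."""
    return [x for x in points
            if any(in_ellipse(x, p, psr, length, t_c) for psr, length in S)]


# Toy data (hypothetical): one partial SR of length 1 and T_c = 7, so the ellipse
# has foci (0, 0) and (4, 0) and constant T_c - L(PSR) = 6.
p = (0, 0)
S = [(((4, 0), (4, 1)), 1.0)]
print(range_q2([(2, 1), (6, 0), (5, 0)], p, S, 7.0))
```

With a single partial SR, $T_v = T_c - L(PSR) = 6$ here, and both points that pass the ellipse test, $(2, 1)$ and $(5, 0)$, are closer to $p$ than $T_v$, as Lemma 1 states.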
As Figure 6 illustrates, however, the range specified by Q2 (union of the ellipses) is a complex parameterized curved shape which cannot be efficiently handled by an R-tree range query algorithm. To make this range simpler, we employ its minimum bounding rectangle (MBR(Q2)) as shown in Figure 6. However, MBR(Q2) is no longer inside the range of Q1. Therefore, our R-tree version of LORD must use the intersection of MBR(Q2) and Range(Q1) to examine the points in the $U_{M_i}$'s.

To retrieve the points in a specific range, we need to traverse the R-tree from its root down to the leaves and report those points that are within the given range. To make the search efficient, existing search algorithms on R-trees prune subtrees of the main tree utilizing some metrics. The most common metric, $mindist(N, q)$, gives a lower bound on the smallest distance between the point $q$ and any point in the subtree of node $N$. We utilize $mindist$ for Q1 as its range is relative to a fixed point $p$. Any R-tree node $N$ with $mindist(N, p)$ greater than threshold $T_v$ cannot contain a point $q$ with the distance $D(p, q)$ less than or equal to $T_v$. Such a node can be easily pruned when traversing the R-tree during our first range query (i.e., Q1). Moreover, query Q1 is used to initialize the PSRs of LORD (Lines 3−5 in Figure 4). Figure 7 shows how we use the $mindist$ metric in Q1 to initialize the set of routes $S$. It also demonstrates the way a circular range query can be answered on an R-tree.

Function RQ1(point p, number dist, number index)
    L = empty list; R = {};
    insert R-tree root into list L;
    while L is not empty
        N = first node in the list L;
        remove N from L;
        if N is a data point q then
            if (q ∈ U_index and D(p, q) ≤ dist) then R = R ∪ {q};
        else // N is an intermediate node
            for each child node N' of N
                if (mindist(N', p) ≤ dist) then add N' to L;
    return R;

Figure 7: Range query Q1 using R-tree

The second rectangular range query (i.e., MBR(Q2)) can be performed as follows.
We first check whether a node $N$ of the R-tree intersects with the rectangle. If their intersection is empty, the node $N$ must be pruned; otherwise, the child nodes of $N$ must be checked for their intersection with MBR(Q2).

Now that we have identified both of the range queries used to select the points in LORD and studied how they can be evaluated using an R-tree, we propose R-LORD, the R-tree version of LORD. The only difference between R-LORD and LORD is that R-LORD incorporates the R-tree implementation of the two range queries of LORD in its iterations. First, it initializes the set $S$ with the partial SRs of length zero, each including a single point of the set of points returned from the function $RQ1(p, T_c, M_m)$ (Figure 7). Then, in each iteration, R-LORD traverses the entire R-tree starting from the root to prune the nodes that are outside MBR(Q2) and Range(Q1) and then selects the points that must be added to the PSRs. At the end of each iteration, R-LORD updates MBR(Q2) by examining the recently built PSRs in $S$.

3.3 OSR in Metric Space

The previously proposed solutions for OSR queries (discussed in Sections 3.2.1 and 3.2.2), although efficient in vector spaces, are impractical or inefficient for an arbitrary sequence $M$ in a metric space. Even though LORD can be applied to both vector and metric spaces, its extensive usage of the $D(\cdot, \cdot)$ function renders it inefficient for metric spaces where the distance metric is usually a computationally complex function. Moreover, R-LORD can only be applied to vector spaces since it is based on utilizing the R-tree index structure.

In this section, we describe our proposed algorithm, Progressive Neighbor Exploration (PNE), to address OSR queries in metric spaces for arbitrary values of $M$. Unlike LORD, the idea behind PNE is to incrementally create the set of candidate routes for $Q(p, M)$ in the same sequence as $M$, i.e., from $p$ toward $U_{M_m}$.
This is achieved through an iterative process in which we start by examining the nearest neighbor to $p$ in $U_{M_1}$, generating a partial SR from $p$ to this neighbor, and storing the candidate route in a heap based on its length. At each subsequent iteration of PNE, a partial SR (e.g., $PSR = (r_1, r_2, \ldots, r_{|PSR|})$) is fetched from the top of the heap and examined as follows.

1. If $|PSR| = m$, meaning that the number of nodes in the partial SR is equal to the number of items in $M$ and hence PSR is a candidate SR that follows $M$, the PSR is selected as the optimal route for $Q(p, M)$ since it also has the shortest length.

2. If $|PSR| \neq m$:

   (a) First, the last point in PSR, $r_{|PSR|}$ (which belongs to $U_{M_{|PSR|}}$), is extracted and its next nearest neighbor in $U_{M_{|PSR|+1}}$, $r_{|PSR|+1}$, is found. This guarantees that a) the sequence of the points in PSR always follows the sequence specified in $M$, and b) the points that are closer to $r_{|PSR|}$ and hence may potentially generate smaller routes are examined first. The fetched PSR is then updated to include $r_{|PSR|+1}$ and is put back into the heap.

   (b) We then find the next nearest neighbor in $U_{M_{|PSR|}}$ to $r_{|PSR|-1}$, $r'_{|PSR|}$, generate a new partial SR $PSR' = (r_1, r_2, \ldots, r_{|PSR|-1}, r'_{|PSR|})$, and place the new route into the heap. This is because once the point $r_{|PSR|}$, which we can assume is the $k$-th nearest point in $U_{M_{|PSR|}}$ to $r_{|PSR|-1}$, is chosen in step (a) above, the $(k+1)$-st nearest point in $U_{M_{|PSR|}}$ to $r_{|PSR|-1}$ (e.g., $r'_{|PSR|}$) is the only next point that may generate a shorter route and hence, must be examined. If $|PSR| = 1$, we find the next nearest point in $U_{M_1}$ to $p$.

We describe PNE in more detail using the example of Section 1.1. Recall that our OSR query was to drive toward a gas station, a library, and then a post office (i.e., $M = (g, l, p)$ and $|M| = m = 3$). Table 2 depicts the values stored in the heap in each step of the algorithm.
In step 1, the first nearest $g_i$ to $p$, g2, is found and the first partial SR along with its distance, (g2 : 2), is generated and placed into the heap. In step 2, first (g2 : 2) is fetched from the heap. Since for this route $|PSR| \neq 3$, steps 2(a) and 2(b) are performed. More specifically, first the next nearest $l_i$ to g2, l2, is found; the partial SR is updated by adding l2 to it; and it is placed back into the heap. Second, the next nearest $g_i$ to $p$, g1, is found and is placed into the heap. Similarly, this process is repeated until the route on top of the heap follows the sequence $M$ (i.e., (g1, l1, p1) in step 13).

    step  heap contents (candidate route R : L(p, R))
     1    (g2 : 2)
     2    (g1 : 3), (g2, l2 : 4)
     3    (g2, l2 : 4), (g3 : 4), (g1, l2 : 6)
     4    (g3 : 4), (g2, l3 : 5), (g1, l2 : 6), (g2, l2, p2 : 15)
     5    (g2, l3 : 5), (g4 : 5), (g1, l2 : 6), (g3, l2 : 6), (g2, l2, p2 : 15)
     6    (g4 : 5), (g1, l2 : 6), (g3, l2 : 6), (g2, l1 : 12), (g2, l3, p3 : 14), (g2, l2, p2 : 15)
     7    (g1, l2 : 6), (g3, l2 : 6), (g4, l3 : 11), (g2, l1 : 12), (g2, l3, p3 : 14)
     8    (g3, l2 : 6), (g1, l3 : 9), (g4, l3 : 11), (g2, l1 : 12), (g2, l3, p3 : 14), (g1, l2, p2 : 17)
     9    (g1, l3 : 9), (g3, l3 : 9), (g4, l3 : 11), (g2, l1 : 12), (g2, l3, p3 : 14), (g3, l2, p2 : 17)
    10    (g3, l3 : 9), (g1, l1 : 10), (g4, l3 : 11), (g2, l1 : 12), (g2, l3, p3 : 14), (g1, l3, p3 : 18)
    11    (g1, l1 : 10), (g4, l3 : 11), (g2, l1 : 12), (g3, l1 : 12), (g2, l3, p3 : 14), (g3, l3, p3 : 18)
    12    (g4, l3 : 11), (g2, l1 : 12), (g3, l1 : 12), (g1, l1, p1 : 12), (g2, l3, p3 : 14)
    13    (g2, l1 : 12), (g3, l1 : 12), (g1, l1, p1 : 12), (g4, l3, p3 : 20)

    Table 2: PNE for the example of Figure 1

Note that we only keep one candidate SR (i.e., a route with $m$ points) in the heap. That is, if during step 2(a) a route with $m$ points is generated, it is only added to the heap if there is no other candidate SR with a shorter length in the heap. Moreover, after a candidate SR is added to the heap, any other SR with longer length will be discarded.
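The iteration traced in Table 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: points are coordinate tuples, distance is Euclidean by default, and the incremental nearest neighbor method that PNE relies on is replaced by a sorted list per (source point, type) pair, which is only reasonable for small inputs. The names `pne`, `kth_neighbor` and `brute_force_osr` are hypothetical, and the single-candidate pruning rule described above is omitted for brevity.

```python
import heapq
import itertools
import math


def pne(p, sets, dist=math.dist):
    """Return (length, route) of an optimal sequenced route from p, or None.

    sets[i] holds the points of the (i+1)-th type in the sequence M.
    """
    m = len(sets)
    if m == 0:
        return 0.0, ()
    ranked = {}  # cache: (source point, level) -> points sorted by distance

    def kth_neighbor(src, level, k):
        # Stand-in for an incremental NN method: the k-th (0-based) nearest
        # neighbor of src among sets[level], or None when exhausted.
        key = (src, level)
        if key not in ranked:
            ranked[key] = sorted(sets[level], key=lambda x: dist(src, x))
        return ranked[key][k] if k < len(ranked[key]) else None

    first = kth_neighbor(p, 0, 0)
    if first is None:
        return None
    # Heap entries: (length from p, route, rank of the last point among the
    # neighbors of its predecessor).
    heap = [(dist(p, first), (first,), 0)]
    while heap:
        length, route, rank = heapq.heappop(heap)
        if len(route) == m:  # step 1: a complete route is on top of the heap
            return length, route
        last = route[-1]
        prev = route[-2] if len(route) > 1 else p
        # Step 2(a): extend the route with the nearest point of the next type.
        nxt = kth_neighbor(last, len(route), 0)
        if nxt is not None:
            heapq.heappush(heap, (length + dist(last, nxt), route + (nxt,), 0))
        # Step 2(b): replace the last point with the next nearest neighbor of
        # its predecessor (or of p when the route has a single point).
        sibling = kth_neighbor(prev, len(route) - 1, rank + 1)
        if sibling is not None:
            sib_len = length - dist(prev, last) + dist(prev, sibling)
            heapq.heappush(heap, (sib_len, route[:-1] + (sibling,), rank + 1))
    return None


def brute_force_osr(p, sets):
    """Exhaustive reference: try one point per set, in sequence order."""
    def total(route):
        pts = (p,) + route
        return sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))
    return min((total(r), r) for r in itertools.product(*sets))
```

On small inputs the result of `pne(p, sets)` can be compared against `brute_force_osr(p, sets)`; the two lengths should agree up to floating-point rounding, since a complete route reaches the top of the heap only when no shorter partial route remains.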
For example, in step 6, adding the route (g2, l3, p3) with the length of 14 to the heap will result in discarding the route (g2, l2, p2) with the length of 15 from the heap (crossed out in Table 2).

The only requirement for PNE is a nearest neighbor approach that can progressively generate the neighbors. Hence, by employing an approach similar to INE [16] or VN³ [12], which are explicitly designed for metric spaces, PNE can address OSR queries in metric spaces. In theory, PNE can work for vector spaces in a similar way; however, it is inefficient for these spaces where distance computation is not expensive. The reason is that PNE explores the candidate routes from the starting point, which might result in an exhaustive search. Instead, R-LORD optimizes this search by building the routes in the reverse sequence utilizing the R-tree index structure.

4. VARIATIONS OF OSR QUERIES

In this section, we address two variations of OSR queries. The first variation is when a destination point also exists, and the second variation is when $k$ optimal routes are requested.

4.1 OSR-I

Assume that the user asks for an optimal sequenced route that follows the given sequence, starts from a given source and ends in a given destination. A special case of this query is where the source and destination points are the same, i.e., the user intends to return to her starting location. We start by formally defining this type of query as:

Definition 8: Given a source point $p$, a destination point $q$ and a sequence $M$, the OSR-I query is defined as finding $R = (P_1, \ldots, P_m)$, a sequenced route that follows $M$, where the following function $G$ is minimum over all sequenced routes that follow $M$:

$$G(p, R, q) = D(p, P_1) + L(R) + D(P_m, q) \qquad (6)$$

The right-hand side of the above equation is equal to $L(p, R) + D(P_m, q)$. We show that this new form of OSR can easily be reduced to the general form of OSR. We define a new set $U_{n+1} = \{q\}$. Including this new set in the set of $U_i$'s makes $M' = (M_1, \ldots, M_m, n+1)$ a valid sequence in the new setting of the problem.
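The construction just described (appending the singleton set $\{q\}$ to the sequence) can be sketched in Python. The sketch below is an illustration only: it assumes Euclidean distance and uses an exhaustive search as a stand-in for any OSR solver, and the names `route_length`, `brute_force_osr` and `osr_with_destination` are hypothetical.

```python
import itertools
import math


def route_length(points):
    """Sum of consecutive distances along a sequence of points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def brute_force_osr(p, sets):
    """Stand-in OSR solver: enumerate one point per set, in sequence order,
    and return (L(p, R), R) for the shortest route starting at p."""
    return min(
        ((route_length((p,) + route), route)
         for route in itertools.product(*sets)),
        key=lambda pair: pair[0],
    )


def osr_with_destination(p, sets, q):
    """OSR-I via the reduction: add the singleton set {q} as the last type,
    solve the plain OSR query, then drop q from the returned route."""
    length, route = brute_force_osr(p, list(sets) + [[q]])
    return length, route[:-1]


# Hypothetical data: two point types, source p and destination q.
p, q = (0.0, 0.0), (5.0, 0.0)
sets = [[(1.0, 1.0), (1.0, -4.0)], [(3.0, 2.0), (4.0, -1.0)]]
print(osr_with_destination(p, sets, q))
```

The returned length is $G(p, R, q)$ of Equation 6, because the last leg of the extended route is exactly $D(P_m, q)$.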
Now if we assume that $Q(p, M') = R' = (P'_1, \ldots, P'_{m+1})$, we know that $P'_{m+1}$ will be $q$ as $q$ is the only member of $U_{n+1}$. Moreover, $L(p, R')$ is minimum over all candidate routes that follow $M'$. Recall that the length of the route $R'_p = p \oplus R'$ (i.e., $L(p, R')$) is equal to $D(p, P'_1) + L(R')$. We define the route $R$ as $(P'_1, \ldots, P'_m)$ by excluding $q$ from $R'$. It is clear that $L(p, R')$ is the same as $D(p, P_1) + L(R) + D(P_m, q)$. By comparing the latter expression with $G(p, R, q)$ of Equation 6, we conclude that $R$ is the answer to the OSR-I query given the source $p$, destination $q$, and sequence $M$.

Since we showed that OSR-I can be reduced to a general OSR problem, we are able to use our LORD (or R-LORD) algorithm to answer this query. Specifically, the answer to OSR-I given the source $p$, destination $q$, and sequence $M$ is the same as the answer to $LORD(p, M')$ excluding the point $q$, where $U_{n+1} = \{q\}$ and $M' = (M_1, \ldots, M_m, n+1)$. Although R-LORD can similarly solve OSR-I, we can further optimize it for OSR-I. This is achieved by skipping the range query Q1 (i.e., $RQ1(p, T_c, n+1)$). This is because we know that the only point in this range is $q$. Therefore, the set $S$ can be directly initialized to $\{(q)\}$.

4.2 k-OSR

The second variation of OSR is when the user asks for the $k$ routes with the minimum total distances from the user's location. We define this as a k-OSR query. We can easily address this type of query using our PNE approach discussed in Section 3.3. Recall that in PNE, we maintain a heap of the partially completed sequenced routes and only keep one candidate sequenced route (or in other words, a route that follows $M$), that is, the one that has the minimum total length. By modifying this policy to maintain $k$ candidate SRs in the heap and continuing the iterations until $k$ candidate SRs are fetched from the heap, PNE can also address k-OSR queries.

5. PERFORMANCE EVALUATION

We conducted several experiments to evaluate the performance of our R-LORD approach with respect to: 1) disk I/O accesses incurred by its underlying R-tree index structure, 2) effectiveness of its range queries, and 3) its overall query response time. Moreover, we compared the query response time of R-LORD with that of the Dijkstra-based solution. In our experiments, we evaluated R-LORD by investigating the effect of the following parameters on its performance: 1) size of sequence $M$ in $Q(p, M)$ (i.e., number of points in the optimal route), 2) cardinality of the datasets (i.e., $\sum_{i=1}^{n} |U_i|$), and 3) density and distribution of the datasets. We also investigated the performance of the PNE approach with respect to the density of the datasets.

We used one real and two synthetic datasets for our experiments. The real data is obtained from the U.S. Geological Survey (USGS) and consists of the locations of different businesses (e.g., schools) in the entire country. The synthetic datasets consist of randomly generated sets of points with uniform and Zipf distributions. Table 3 shows the characteristics of the datasets. The real dataset has a total of 950,000 points. However, in our experiments, we randomly selected sets of 40K, 70K, 250K and 500K points from this dataset. The cardinality of each synthetic dataset is 480,000. Each dataset is indexed by an R*-tree [3] index with a page size of 1K bytes and a maximum of 50 entries in each node (capacity of the node). The experiments were performed on a DELL Precision 470 with a Xeon 3.2 GHz processor and 3GB of RAM. We ran 1000 OSR queries initiated from randomly selected starting points and report the average of the results.

In the first set of experiments, we compared the performance of R-LORD with that of the Dijkstra-based solution. Note that the weighted directed graph G (see Section 3.1) for even a small dataset is a substantially large graph.
For example, for a real dataset with 40,000 points and $|M| = 3$, G has 22,400 nodes and 124 million edges. This results in substantially large query response times for the naive Dijkstra-based solution (e.g., 40 seconds for the 40K example). Therefore, we do not report the query and workspace costs of this expensive approach. Instead, we compare R-LORD's costs with those of an enhanced Dijkstra-based approach in which the length of the greedy route is used to reduce the number of candidate points (see Section 3.1).

Figure 8 shows the query response time for R-LORD and the enhanced Dijkstra-based approach (EDJ) when the number of points in the optimal route (i.e., $|M|$) varies from 3 to 12. While the figure depicts the results from an experiment on the 250K USGS dataset, the same trend holds for all of our datasets with different cardinalities and distributions. As shown in the figure, both approaches answer an OSR query very quickly for small values of $|M|$ (less than 100 msec for $|M| = 3$). The figure also shows that as the value of $|M|$ increases, the response time of EDJ increases at a rate that is substantially higher than that of R-LORD, confirming the impracticality of the Dijkstra-based solution for OSR on large graphs.

In the second set of experiments, we varied the size of $M$ and measured the performance of R-LORD. Figures 9(a,b,c) depict the performance of R-LORD on a randomly selected real dataset with 250K points when the size of $M$ varies from 3 to 12. For this dataset, 7291 nodes are generated in the R*-tree. Figure 9a illustrates the percentage of R*-tree nodes that were accessed by R-LORD.
As shown in the figure, between 1% (for small values of $|M|$) and 11% (for large values of $|M|$) of the nodes were accessed by R-LORD. The figure also shows that the rate at which the number of accessed nodes increases is slightly more than linear. That is, while the percentage of accessed nodes increases from 1% to 2% (i.e., 2 times) when $|M|$ increases from 3 to 6, it increases from 2% to 11% (i.e., 5.5 times) when $|M|$ increases from 6 to 12. This is because for larger values of $|M|$, more nodes are examined against Q2 and the mindist() function.

    USGS                           Synthetic
    Points             Size        Points          Size
    Hospital             5,314     P1 (uniform)     32,000
    Building            15,127     P2 (uniform)     64,000
    Summit              69,498     P3 (uniform)    128,000
    Cemetery           109,557     P4 (uniform)    256,000
    Church             127,949     P5 (Zipf)        32,000
    School             139,523     P6 (Zipf)        64,000
    Populated place    167,203     P7 (Zipf)       128,000
    Institution        319,751     P8 (Zipf)       256,000

    Table 3: Datasets used in our experiments

Figure 9b shows the total query response time of R-LORD for the same dataset. As shown in the figure, even for a large value of 12 for $|M|$, R-LORD can answer the query in less than 0.8 seconds. Moreover, it shows that the rate of increase in the processing time closely follows the rate of increase in accessed nodes, indicating that traversing the R*-tree is the major cost factor in R-LORD. Figure 9c shows the performance of the range queries of R-LORD. The bars in the figure indicate the required workspace of R-LORD (WS) as the maximum number of points that were stored in the partial SRs of $S$ (see Section 3.2.2). As shown in the figure, the number of points filtered in by the range queries is substantially less than the cardinality of the points (e.g., for $|M| = 6$, only 110 points out of 250,000 are selected). This shows that the range queries of R-LORD are extremely effective. The figure also compares the effectiveness of the two range queries of R-LORD. It shows the percentage of reduction in the number of selected points as compared to the Dijkstra-based approach.
In the latter approach, the only filter is one simple range query with a range based on the length of the greedy sequenced route ($L(p, R_g(p, M))$). This is shown as vertical lines in the figure, where each line indicates the maximum, minimum, and average value of this reduction for a given $M$. The figure confirms that our range queries provide a filter with better selectivity as compared to the simple range query. For example, for $|M| = 6$, the decrease in the size of the candidate points is between 48% and 97.7% with an average of 77.4%. Figures 9(d,e,f) show the result of the same set of experiments for our first set of synthetic data (i.e., with uniform distribution). This dataset has 250,000 points and generates 7,291 nodes in the R*-tree. The figures show identical behavior for the synthetic data as compared to the real dataset. They also show that the range queries can filter out up to 99% of the points as compared to the simple range query with the range equal to $L(p, R_g(p, M))$.

[Figure 8: Query response time vs. sequence size $|M|$ (i.e., number of points in the optimal route $Q(p, M)$). Series: R-LORD, EDJ. x-axis: number of points in route ($|M|$); y-axis: response time (sec).]

[Figure 9: Query cost vs. sequence size $|M|$. Panels: a) I/O (250K USGS), b) Time (250K USGS), c) WS (250K USGS), d) I/O (250K Uniform), e) Time (250K Uniform), f) WS (250K Uniform). x-axis: number of points in route ($|M|$); y-axes: percentage of accessed nodes, response time (sec), number of stored points and space reduction.]

Figures 10(a,b,c) show the results of our third set of experiments, where we investigated the impact of the cardinality of the points on the efficiency of R-LORD. We varied the cardinality of our real dataset from 40K to 500K and ran OSR queries of sequence size $|M| = 6$. Figure 10a shows the percentage of accessed nodes of the R*-tree for different cardinalities of the dataset. As shown in the figure, the percentage of accessed nodes in the R*-tree decreases as the cardinality of the data increases, indicating that R-LORD can efficiently scale to large datasets. Moreover, Figure 10b shows that the processing time of R-LORD slightly increases as the cardinality of the data increases. For example, while the query response time is 0.09 seconds for 40,000 points, it only increases to 0.32 seconds (i.e., a factor of 3.5) when the number of points increases to 500,000 (i.e., a factor of 12.5). This also verifies the scalability of R-LORD. Figure 10c shows the performance of the range queries for different cardinalities of the dataset. The figure shows that for a dataset with 70,000 points, only 100 (0.142%) of them are selected as the result of the range queries. The figure also indicates that this percentage decreases for larger cardinalities of data. For example, in the dataset with 500,000 points, only 110 (0.022%) are selected.
Figures 10(d,e,f) show the results of the same set of experiments on the first set of synthetic data. Once again, the figures indicate similar behavior of R-LORD for synthetic data as compared to the real datasets.

[Figure 10: Query cost vs. cardinality ($|M| = 6$). Panels: a) I/O (USGS), b) Time (USGS), c) WS (USGS), d) I/O (Uniform), e) Time (Uniform), f) WS (Uniform). x-axis: dataset cardinality; y-axes: percentage of accessed nodes, response time (sec), number of stored points and space reduction.]

Our next set of experiments was aimed at evaluating the performance of R-LORD when the densities of the datasets $U_{M_i}$ specified by the query sequence $M$ are different. We used R-LORD to answer five different categories of queries $Q(p, M)$, each with a different pattern of change in the density of the datasets. The categories are:

1. LL: The density of points is significantly decreasing from $U_{M_1}$ to $U_{M_m}$. For example, a query for an optimal route to an institution (i.e., 319,751 points), then to a church (i.e., 127,949 points) and finally to a hospital (i.e., 5,314 points) in the USGS dataset falls in this category.

2. LU: There is an $i$ with $1 < i < |M|$ where the density is decreasing from $U_{M_1}$ to $U_{M_i}$ and increasing from $U_{M_i}$ to $U_{M_m}$ (e.g., (church, hospital, school)).

3. MM: The density of all $U_{M_i}$'s is almost the same (e.g., (school, church, school)).

4. UL: There is an $i$ with $1 < i < |M|$ where the density is increasing from $U_{M_1}$ to $U_{M_i}$ and decreasing from $U_{M_i}$ to $U_{M_m}$ (e.g., (church, school, hospital)).

5. UU: The density is significantly increasing from $U_{M_1}$ to $U_{M_m}$ (e.g., (hospital, church, institution)).

[Figure 11: Query cost vs. density ($|M| = 6$). Panels: a) I/O (250K USGS), b) Time (250K USGS), c) WS (250K USGS), d) I/O (250K Uniform), e) Time (250K Uniform), f) WS (250K Uniform). x-axis: density order (LL, LU, MM, UL, UU); y-axes: percentage of accessed nodes, response time (sec), space reduction.]

Figures 11(a,b,c) illustrate the results of our experiments where $M$ follows the above density distribution categories. In these experiments, $|M| = 6$ and the data is 250,000 points selected from the USGS dataset. Figure 11a shows that although the percentage of the accessed nodes varies for different density categories, it is still in the range of 1% to 2%. Moreover, the query response times shown in Figure 11b indicate that regardless of the density of the points, R-LORD answers OSR queries with almost identical response times. Figure 11c depicts that although the range queries perform similarly for different density categories, the selectivity of the range queries for LU and UU is slightly less than that of LL, MM and UL.
The reason for this is that the last set of points in $M$ (i.e., $U_{M_m}$), which are selected first by Q1 (recall that R-LORD constructs the partial routes from $U_{M_m}$ toward $U_{M_1}$), are denser and hence, Q1 selects more points to be included in the PSRs in $S$. Figures 11(d,e,f) show the results of the same experiment for the synthetic datasets. These figures also show a similar trend in the behavior of R-LORD for synthetic as compared to the real datasets.

[Figure 12: Query cost vs. distribution for 32K synthetic data ($|M| = 6$). Panels: a) I/O, b) Time, c) WS. Series: Uniform, Zipf. x-axis: number of points in route ($|M|$); y-axes: percentage of accessed nodes, processing time (sec), space reduction.]

Figure 12 shows the results of our last set of experiments, where we studied the effect of the distributions of the datasets on R-LORD. In this set of experiments, we used the synthetic datasets with uniform and Zipf distributions (Table 3). As shown in the figure, R-LORD shows similar I/O cost (Figure 12a) and query response time (Figure 12b), and its range queries perform similarly for the given datasets (Figure 12c). This indicates that the performance of R-LORD is also independent of the distribution of the datasets.

Due to lack of space, we only itemize the major observations of our experiments performed on the similar datasets for PNE. The complete discussion of the results for PNE will be presented in an extended version of this paper.

• Contrary to R-LORD, PNE's performance is sensitive to the distribution of the densities in the dataset. That is, it performs efficiently for the UU and MM categories, while its performance suffers for LL, LU and UL. The intuition here is that when the last group of points in the given sequence (e.g., $U_{M_m}$, $U_{M_{m-1}}$, ...)
are sparse and hence, their distances to each other are much more than the distances of the first group of the points in the sequence (e.g., $U_{M_1}$, $U_{M_2}$, ...) to each other, PNE will perform an exhaustive search on $\{U_{M_1}, U_{M_2}, \ldots\}$ before examining $\{U_{M_m}, U_{M_{m-1}}, \ldots\}$. This leads to the execution of numerous NN queries.

• The query response time of PNE is largely incurred by the underlying nearest neighbor technique, and the overhead of PNE to maintain the heap is negligible as compared to the time required by the NN approach.

6. RELATED WORK

In this section, we first review the related work in the area of graph theory. We then provide an overview of the related studies on variations of the nearest neighbor queries in spatial databases.

The only similarity between the Traveling Salesman Problem (TSP) and OSR is that both search for a route of minimum cost in a graph. The general form of TSP was first studied in the 1930s by Karl Menger at Harvard [4]. The most similar instance of TSP to OSR is the Sequential Ordering Problem (SOP), which sets some precedence constraints on the route. Each constraint requires that a node of the graph be visited before some other node. The polyhedral structure of the TSP with precedence constraints was investigated by Balas et al. in [2]. Ascheuer et al. [1] use the results of [2] to provide a branch and cut solution for the problem and solve it efficiently for real instances of 200 nodes. Hernádvölgyi [7] derives lower bounds for the solution from the state space and provides an optimal search method. Although the number of nodes is small in SOP and general TSP, the unknown traveling sequence makes them NP-hard. However, OSR dictates a given strict sequence order of point types where each point must be selected from a large set per type.

The OSR problem is also related to the problem of finding the Shortest Path (SP) in directed weighted graphs. Two classic algorithms for solving SP in main memory are Dijkstra's and Bellman-Ford algorithms.
However, for addressing SP on the huge graph G of Section 3.1, an external memory algorithm is required. Hutchinson et al. [9] propose a tree data structure for answering SP queries on a planar graph stored in external memory. Chan et al. [5] describe a disk-based algorithm to find SP on large network systems. They partition the original large graph and search for the shortest path by locally searching in its smaller pieces. While these approaches eliminate the overheads of loading the huge graph in main memory, they are not applicable to OSR queries. The reason is that the OSR graph's topology is dependent on the user's query $Q(p, M)$. Since point $p$ and sequence $M$ are not known in advance, this graph must be built on demand as described in Section 3.1. Therefore, if we intend to use an external memory SP approach, we need to store the graph on disk blocks before processing it. This makes the approach expensive and therefore impractical.

Numerous algorithms for k-nearest neighbor queries in spatial databases have been proposed. A majority of these algorithms are based on utilizing spatial index structures such as the R-tree and usually perform in two steps, filter and refinement. Roussopoulos et al. in [17] present a branch-and-bound R-tree traversal algorithm that uses two metrics, mindist and minmaxdist. Korn et al. in [13] present a multi-step k-nearest neighbor search and Seidl et al. in [18] propose an optimal version of this method. Hjaltason et al. [8] propose an incremental nearest neighbor algorithm that is based on utilizing an index structure and a priority queue. Jung et al. in [11] propose an algorithm to find the shortest distance between any two points by partitioning a large graph into layers of smaller subgraphs and pushing up the pre-computed shortest paths between the borders of the subgraphs in a hierarchical manner. Jensen et al. in [10] discuss data models and graph representations for NN queries in road networks and provide alternative solutions for it. Papadias et al.
in [16] propose a solution for nearest neighbor queries in network databases by generating and expanding a search region around a query point. Kolahdouzan et al. in [12] propose a solution that is based on utilizing network Voronoi diagrams.

Other variations of k nearest neighbor queries have also been studied and their solutions are usually motivated by the solutions of their regular k nearest neighbor queries. Sistla et al. in [19] first identified the importance of the continuous nearest neighbors (CNN) and described modeling methods and query languages for the expression of these queries. Song et al. in [20] propose the first algorithms for CNN queries based on performing several point-NN queries at predefined sample points. Tao et al. in [21] propose a solution for CNN queries based on performing one single query for the entire path. Ferhatosmanoglu et al. in [6] introduce the problem of constrained NN queries, where the nearest neighbors in a specific range or direction are requested. Koudas et al. in [14] discuss approximate NN queries with guaranteed error for streams where access to the entire data is not feasible. Finally, the class of group nearest neighbor queries has been recently introduced by Papadias et al. in [15].

To the best of our knowledge, no other work by the database community has studied the problem of the optimal sequenced route query.

7. CONCLUSIONS AND FUTURE WORK

We studied the novel problem of the optimal sequenced route query in both vector and metric spaces. To tackle the problem, we first proposed a Dijkstra-based approach and showed that it is not efficient for large point sets and routes with a large number of points. We described our novel threshold-based algorithm, LORD, which is applicable to vector spaces. Furthermore, we proposed R-LORD, which utilizes an R-tree index structure to address OSR queries in vector spaces.
Our extensive experiments showed the following:

• R-LORD is light in terms of required workspace because, as compared to the Dijkstra-based approach, it always reduces the required workspace, by 55%-90% on average. The maximum of this space reduction reaches 99.6% for some instances of our experiments.

• R-LORD is efficient in terms of query response time as it answers an OSR query in a time which increases at an almost linear rate as the sequence size $|M|$ increases. R-LORD's response time for a large sequence size $|M|$ of 12 is less than a second, as compared to the 40-second response time of the classic Dijkstra's algorithm for $|M| = 3$ on a small dataset.

• R-LORD is efficient in terms of I/O as it accesses at most 10.5% of the R-tree nodes while iterating to complete its set of partial routes to answer an OSR query of sequence size $|M| \le 12$.

To overcome LORD's extensive usage of the distance function, we proposed PNE, a progressive OSR algorithm for metric spaces that generates the optimal route from the starting to the ending point. We showed that the overhead of PNE is negligible as compared to the nearest neighbor approach that it employs.

We plan to extend our definition of the OSR query to include more general precedence constraints on the points of the optimal route. Moreover, we have observed that a caching scheme can be combined with a pre-computation approach to scale our algorithms with respect to the number of queries. We intend to investigate the impact of using these approaches for the users' frequently used sequence constraints.

8. REFERENCES

[1] N. Ascheuer, M. Jünger, and G. Reinelt. A branch & cut algorithm for the asymmetric traveling salesman problem with precedence constraints. Comput. Optim. Appl., 17(1):61–84, 2000.
[2] E. Balas, M. Fischetti, and W. R. Pulleyblank. The precedence-constrained asymmetric traveling salesman polytope. Math. Program., 68(3):241–265, 1995.
[3] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. In Proceedings of the 1990 ACM SIGMOD international conference on Management of data, pages 322–331. ACM Press, 1990.
[4] N. Biggs, E. K. Lloyd, and R. J. Wilson. Graph Theory, 1736-1936. Clarendon Press, 1986.
[5] E. P. F. Chan and N. Zhang. Finding shortest paths in large network systems. In Proceedings of the 9th ACM international symposium on Advances in geographic information systems, pages 160–166. ACM Press, 2001.
[6] H. Ferhatosmanoglu, I. Stanoi, D. Agrawal, and A. E. Abbadi. Constrained nearest neighbor queries. In SSTD, pages 257–278, 2001.
[7] I. T. Hernádvölgyi. Solving the sequential ordering problem with automatically generated lower bounds. In Operations Research Proceedings 2003, pages 355–362. Springer Verlag, September 3-5, 2003.
[8] G. R. Hjaltason and H. Samet. Distance Browsing in Spatial Databases. TODS, ACM Transactions on Database Systems, 24(2):265–318, 1999.
[9] D. Hutchinson, A. Maheshwari, and N. Zeh. An external memory data structure for shortest path queries. Discrete Appl. Math., 126(1):55–82, 2003.
[10] C. S. Jensen, J. Kolářvr, T. B. Pedersen, and I. Timko. Nearest neighbor queries in road networks. In Proceedings of the 11th ACM international symposium on Advances in geographic information systems, pages 1–8. ACM Press, 2003.
[11] S. Jung and S. Pramanik. An Efficient Path Computation Model for Hierarchically Structured Topological Road Maps. In IEEE Transactions on Knowledge and Data Engineering, 2002.
[12] M. Kolahdouzan and C. Shahabi. Voronoi-Based K Nearest Neighbor Search for Spatial Network Databases. In VLDB 2004, Toronto, Canada.
[13] F. Korn, N. Sidiropoulos, C. Faloutsos, E. Siegel, and Z. Protopapas. Fast Nearest Neighbor Search in Medical Image Databases. In VLDB'96, Proceedings of 22nd International Conference on Very Large Data Bases, September 3-6, 1996, Mumbai (Bombay), India, pages 215–226. Morgan Kaufmann, 1996.
[14] N. Koudas, B. C. Ooi, K.-L. Tan, and R. Zhang. Approximate NN queries on streams with guaranteed error/performance bounds. In VLDB, pages 804–815, 2004.
[15] D. Papadias, Q. Shen, Y. Tao, and K. Mouratidis. Group Nearest Neighbor Queries. In ICDE 2004, 20th International Conference on Data Engineering, March 30 - April 2, 2004, Boston, USA.
[16] D. Papadias, J. Zhang, N. Mamoulis, and Y. Tao. Query Processing in Spatial Network Databases. In VLDB 2003, Berlin, Germany.
[17] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest Neighbor Queries. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, California, May 22-25, 1995, pages 71–79. ACM Press, 1995.
[18] T. Seidl and H.-P. Kriegel. Optimal Multi-Step K-Nearest Neighbor Search. In SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA, pages 154–165. ACM Press, 1998.
[19] P. Sistla, O. Wolfson, S. Chamberlain, and S. Dao. Modeling and Querying Moving Objects. In IEEE ICDE 1997, Proceedings of the Thirteenth International Conference on Data Engineering, April 7-11, 1997, Birmingham, U.K.
[20] Z. Song and N. Roussopoulos. K-Nearest Neighbor Search for Moving Query Point. In The Seventh International Symposium on Spatial and Temporal Databases, SSTD'2001, Redondo Beach, CA, USA.
[21] Y. Tao, D. Papadias, and Q. Shen. Continuous Nearest Neighbor Search. In VLDB 2002, Proceedings of 28th International Conference on Very Large Data Bases, August 20-23, 2002, Hong Kong, China.
Description
Mehdi Sharifzadeh, Mohammad Kolahdouzan, Cyrus Shahabi. "The optimal sequenced route query." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California. Department of Computer Science) no. 840 (2005).
Asset Metadata
Creator
Kolahdouzan, Mohammad (author), Shahabi, Cyrus (author), Sharifzadeh, Mehdi (author)
Core Title
USC Computer Science Technical Reports, no. 840 (2005)
Alternative Title
The optimal sequenced route query (title)
Publisher
Department of Computer Science, USC Viterbi School of Engineering, University of Southern California, 3650 McClintock Avenue, Los Angeles, California, 90089, USA (publisher)
Tag
OAI-PMH Harvest
Format
13 pages (extent), technical reports (aat)
Language
English
Unique identifier
UC16270224
Identifier
05-840 The Optimal Sequenced Route Query (filename)
Legacy Identifier
usc-cstr-05-840
Rights
Department of Computer Science (University of Southern California) and the author(s).
Internet Media Type
application/pdf
Copyright
In copyright - Non-commercial use permitted (https://rightsstatements.org/vocab/InC-NC/1.0/)
Source
20180426-rozan-cstechreports-shoaf (batch), Computer Science Technical Report Archive (collection), University of Southern California. Department of Computer Science. Technical Reports (series)
Access Conditions
The author(s) retain rights to their work according to U.S. copyright law. Electronic access is being provided by the USC Libraries, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
USC Viterbi School of Engineering Department of Computer Science
Repository Location
Department of Computer Science. USC Viterbi School of Engineering. Los Angeles, CA, 90089
Repository Email
csdept@usc.edu
Inherited Values
Title
Computer Science Technical Report Archive
Description
Archive of computer science technical reports published by the USC Department of Computer Science from 1991 - 2017.
Coverage Temporal
1991/2017